DragonFly BSD
DragonFly users List (threaded) for 2005-02

Re: Dragonfly and Hyperthreading....


From: Jonathan Dama <jd@xxxxxxxxxxx>
Date: Mon, 21 Feb 2005 21:34:24 -0800

As a practicing VLSI engineer, allow me to emend some of your commentary.

> The *really* interesting point of the Opteron vs. Xeon, or Athlon/64
> vs. P4 benchmarks are that the AMD CPUs are within 10% of the Intel
> CPUs, yet they are clocked up to 1000 MHz *slower*.  That right there
> is the biggest indication that Intel has screwed up somewhere in the
> core of the CPU.
It isn't an indication of that at all.  What Intel did do is waste
engineering time.

Clock speed gains typically come in one of two ways: process technology
advancements and pipelining.  The first affects how fast the transistors
can switch.  Contrary to popular opinion, scaling is not a magic bullet.
True enough, the transistor switching time scales proportionally with the
feature size, but the wire delays do not.  This was okay until the wiring
delays began to dominate.  The result: scaling no longer makes the clock
faster.  This is why chips switched to using copper a few years
ago--they were trying to buy a little time.  But now even copper wire delays
dominate.

There is another effect as well.  The real reason that the transistors get
faster as the feature size scales down is that the threshold voltage is
scaling down as well.  As the threshold voltage goes down, the leakage (off)
current goes up.  This is a fundamental physical effect relating to the
Boltzmann factor (google for more information).  There is a second source of
leakage relating to the gate oxide--that problem is going away through the use
of different gate dielectrics (google high-k dielectric).
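
To put rough numbers on the Boltzmann-factor point, here is a minimal Python
sketch.  It assumes the textbook subthreshold model in which the off current is
proportional to exp(-Vth / (n * kT/q)).  The threshold voltages and the ideality
factor n below are illustrative placeholders, not figures for any real process.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """Return kT/q in volts at the given temperature."""
    return K_B * temp_k / Q_E


def subthreshold_swing_mv(temp_k=300.0, n=1.0):
    """Millivolts of threshold change per 10x change in off current.

    n = 1.0 is the ideal limit; real devices are somewhat worse (assumption).
    """
    return n * thermal_voltage(temp_k) * math.log(10) * 1000.0


def leakage_ratio(vth_old, vth_new, temp_k=300.0, n=1.0):
    """Factor by which off current grows when Vth drops from vth_old to vth_new.

    Assumes I_off is proportional to exp(-Vth / (n * kT/q)).
    """
    return math.exp((vth_old - vth_new) / (n * thermal_voltage(temp_k)))


if __name__ == "__main__":
    # Hypothetical thresholds: 0.40 V scaled down to 0.30 V.
    print("swing (mV/decade):", subthreshold_swing_mv())
    print("leakage growth   :", leakage_ratio(0.40, 0.30))
```

The exponential is the whole story: in this model every reduction of the
threshold by one "swing" multiplies the off current by ten, no matter how
clever the rest of the design is.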

Clock speeds also increase by reducing the amount of logic in every pipeline
stage.  The trouble is that pipelining buys throughput at the cost of added
latency and implementation complexity.  Branch operations, though, tend to
be latency sensitive.  As a consequence, eventually the trade-off isn't beneficial
(given some frequency of branch instructions).  Intel was caught in a bit of a
marketing snafu--marketing reasons gave them cause to push pipelining very far.
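
The trade-off can be sketched with a toy model in Python.  Everything here is
an assumption made for illustration: the total logic delay, the per-stage latch
overhead, the branch frequency, the misprediction rate, and the simplification
that a mispredicted branch costs one cycle per pipeline stage.  None of these
numbers describe the P4 or any other real core.

```python
def time_per_instruction(depth, logic_ns=8.0, latch_ns=0.3,
                         branch_frac=0.2, mispredict_rate=0.1):
    """Average nanoseconds per instruction for a pipeline of `depth` stages.

    Cycle time shrinks as the logic is split across more stages, but each
    stage still pays a fixed latch overhead.  Each mispredicted branch is
    assumed to flush the pipeline and cost `depth` extra cycles.
    """
    cycle_ns = logic_ns / depth + latch_ns
    cycles_per_instr = 1.0 + branch_frac * mispredict_rate * depth
    return cycle_ns * cycles_per_instr


def best_depth(max_depth=200, **params):
    """Depth in 1..max_depth that minimizes time per instruction."""
    return min(range(1, max_depth + 1),
               key=lambda d: time_per_instruction(d, **params))


if __name__ == "__main__":
    for depth in (5, 10, 20, 40, 80, 160):
        clock_ghz = 1.0 / (8.0 / depth + 0.3)
        print(depth, round(clock_ghz, 2), round(time_per_instruction(depth), 3))
    print("best depth under these assumptions:", best_depth())
```

In this model the clock keeps rising with depth while the time per instruction
bottoms out and then gets worse, which is the sense in which pushing pipelining
further stops paying.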

As I said, pipelining increases throughput.  But throughput of what?  Remember
that pipelining works by doing less in each stage, so each cycle accomplishes
less.  Various things can be done to ensure a net gain (the details of which
are not relevant here).

So against a fundamental benchmark (like computing conjugate gradient) many different
combinations of cycle-time and complexity/cycle will yield the same performance.

Thus, calling any one combination wrong is presumptuous.  Nothing is fundamentally
wrong with the P4 core.  It just picks a certain combination.  AMD's Opteron picks
another combination.
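
A minimal sketch of that equivalence in Python, assuming the usual first-order
model that performance is clock rate times average instructions per cycle (IPC).
The 4 GHz and 2.8 GHz clocks are the ones quoted in this thread; the IPC values
are hypothetical and chosen only to show two designs landing on the same point.

```python
def instructions_per_ns(clock_ghz, ipc):
    """First-order throughput: clock rate (GHz) times instructions per cycle."""
    return clock_ghz * ipc


def ipc_needed(target_clock_ghz, other_clock_ghz, other_ipc):
    """IPC a design at target_clock_ghz needs to match another design."""
    return instructions_per_ns(other_clock_ghz, other_ipc) / target_clock_ghz


if __name__ == "__main__":
    fast_clock, slow_clock = 4.0, 2.8
    fast_ipc = 1.0  # hypothetical IPC for the higher-clocked design
    slow_ipc = ipc_needed(slow_clock, fast_clock, fast_ipc)
    print("IPC needed at 2.8 GHz:", round(slow_ipc, 3))
    print(instructions_per_ns(fast_clock, fast_ipc),
          instructions_per_ns(slow_clock, slow_ipc))
```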

I can tell you, though, that Intel's decision likely cost substantially in engineering
time.  By analogy, DragonFly and FreeBSD may end up with similar performance, but the
FreeBSD approach may have required much more development time.  This is not evidence
that the FreeBSD implementation is screwed up, merely that some engineering techniques
are more productive than others.

> Sure, the P4 can be cranked up to 4 GHz, but what's the point if the
> Athlon64 at 2.8 GHz gives you just as much performance, for less cost,
> less heat waste, and less energy??
Exactly.  AMD was right, clock speed is just marketing foo.

> Intel's implementation of SMT (known as HyperThreading) is nothing
> more than a hack to try and keep the P4's overly-long pipeline full of
> instructions.  It's a hack because it requires an HT-aware scheduler
> to take full advantage of it (trying to run an SMP kernel on an HT
> system will usually slow things down as the SMP scheduler tries to
Uh.  You're a bit off the mark here.  Many processor advancements are intended to
yield benefits without changes to the programs, but this is not strictly the case:
	- New instructions require new compiler logic
	- Superscalar designs like the 686 impose the 4-1-1 rule
	- The best programs are tuned so their working set fits well in the processor's caches
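
As an illustration of why the 4-1-1 rule pushes work onto the compiler, here is
a simplified Python model.  It assumes three decoders working in program order:
the first accepts an instruction of up to four micro-ops, the other two accept
only single micro-op instructions.  Instructions needing more than four
micro-ops are left out, and the instruction mixes are made up.

```python
def decode_cycles(uop_counts):
    """Count decode cycles for a sequence under a simplified 4-1-1 model.

    uop_counts lists the micro-op count of each instruction in program order.
    """
    for count in uop_counts:
        if not 1 <= count <= 4:
            raise ValueError("this model only handles 1 to 4 micro-ops")

    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        i += 1  # decoder 0 takes the next instruction, whatever its size
        for _ in range(2):  # decoders 1 and 2 take single micro-op instructions only
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break  # a larger instruction must wait for decoder 0 next cycle
    return cycles


if __name__ == "__main__":
    well_ordered = [2, 1, 1, 2, 1, 1]    # multi-uop instruction leads each group
    badly_ordered = [1, 2, 1, 1, 2, 1]   # same mix, shifted by one slot
    print(decode_cycles(well_ordered), decode_cycles(badly_ordered))
```

The same six instructions decode in fewer cycles when the multi-micro-op ones
line up with the first decoder, and that ordering is something only the
compiler (or the programmer) can arrange.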

> Other CPUs that use SMT (like IBM's POWER4 or POWER5) built into the
> design from the get-go (rather than kludged on afterward like with the
> P4) work much better.  Don't have much info on this beyond what's on
> Ars Technica, though.
I can't comment on whether the Intel HTT architecture is actually good.

-Jon


