DragonFly commits List (threaded) for 2005-06
Re: cvs commit: src/sys/kern imgact_elf.c init_main.c kern_checkpoint.c kern_descrip.c kern_event.c sys_generic.c sys_pipe.c uipc_syscalls.c uipc_usrreq.c vfs_aio.c vfs_syscalls.c src/sys/sys filedesc.h src/sys/dev/misc/streams streams.c ...
> If we care more about performance we could make it 16 bytes
> so the array lookup is scaled by a factor of two again, which
> I might just do actually because that removes the overhead of
> a multiply.
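For concreteness, here is a minimal C sketch of that suggestion; the
struct name and fields below are hypothetical (sized for ILP32/i386),
not the actual DragonFly per-descriptor layout:

    /*
     * Hypothetical per-descriptor record.  With three 4-byte fields
     * the element is 12 bytes and &base[i] costs base + i*12; padding
     * it to 16 bytes turns the scaling into base + (i << 4).
     */
    struct fdnode {
            void    *fd_fp;         /* 4 bytes                        */
            void    *fd_data;       /* 4 bytes                        */
            int      fd_flags;      /* 4 bytes -> 12-byte element     */
            int      fd_pad;        /* 4 bytes -> pad out to 16 bytes */
    };

    static inline struct fdnode *
    fdnode_at(struct fdnode *base, int i)
    {
            return (&base[i]);      /* base + i * sizeof(struct fdnode) */
    }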
Call me a naive assembly programmer but...
I doubt it really matters at all. Some numbers:
PII, PIII:
    add r,r            1 cycle
    shl r,i            "free"
    lea r,[r+r*i]      1 cycle
    mul                4 cycles

P4:
    add r,r            0.5 (*)
    shl r,i            4 (!!!)
    lea r,[r+r*i]      4
    mul                16
    mov r,r            0.5

    (*) an instruction depending on the result may overlap,
        so the cost is amortized
Now x*12 = x*8 + x*4 = 4*(2*x + x),
so we can expect on PII, PIII:

(multiples of 12)
    shl eax, 2              ; eax = 4*x
    lea eax, [eax+eax*2]    ; eax = 4*x + 8*x = 12*x
(1 cycle)

(multiples of 16)
    shl eax, 4              ; a single shift -- "free"
on the P4:

(multiples of 12)
    add eax, eax            ; eax = 2*x
    add eax, eax            ; eax = 4*x
    mov ebx, eax            ; ebx = 4*x
    add ebx, eax            ; ebx = 8*x
    add ebx, eax            ; ebx = 12*x
(2.5 cycles)

(multiples of 16)
    add eax, eax            ; eax = 2*x
    add eax, eax            ; eax = 4*x
    add eax, eax            ; eax = 8*x
    add eax, eax            ; eax = 16*x
(2 cycles)
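As a quick sanity check on the arithmetic above, this small C program
(function names are mine, purely for illustration) applies the same
decompositions and compares them against a plain multiply:

    #include <assert.h>
    #include <stdio.h>

    /* x*12 = 4*(2*x + x), i.e. the shl+lea / add sequences above. */
    static unsigned int
    times12(unsigned int x)
    {
            unsigned int t = x << 2;        /* 4*x       */
            return (t + (t << 1));          /* 4*x + 8*x */
    }

    /* x*16 = x << 4, i.e. the single shift / four adds above. */
    static unsigned int
    times16(unsigned int x)
    {
            return (x << 4);
    }

    int
    main(void)
    {
            unsigned int i;

            for (i = 0; i < 100000; i++) {
                    assert(times12(i) == i * 12);
                    assert(times16(i) == i * 16);
            }
            printf("decompositions check out\n");
            return (0);
    }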
Note: in none of these cases do we ever actually MUL, and neither
does any decent compiler.
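That claim is easy to check against your own toolchain; for example
(the exact instructions emitted will vary with compiler, optimization
level, and target, but you should not see a mul or imul for either
function):

    /*
     * mul12.c -- compile with something like "cc -O2 -S mul12.c" and
     * inspect mul12.s.  An optimizing compiler is expected to
     * strength-reduce both constants to shifts/lea/adds rather than
     * emitting mul/imul; the exact sequence is compiler-dependent.
     */
    unsigned int mul12(unsigned int x) { return (x * 12); }
    unsigned int mul16(unsigned int x) { return (x * 16); }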
*Everyone*, not just assembly programmers, should read
http://www.agner.org/assem/pentopt.pdf
which I find is the single most concise and effective guide to
understanding the costs of "standard operations" -- sorry, no
commentary on SMP situations there.
Also, I'm sure there are naysayers about optimizing for x86, but
architecture-independent design for performance is nonsense. Try
running some of those NetBSD benchmarks on non-Intel hardware and
watch the performance just drop to the floor.
-Jon