DragonFly kernel List (threaded) for 2009-03
DragonFly BSD

Re: Add Toeplitz hash function to map packet's CPU


From: Sepherosa Ziehau <sepherosa@xxxxxxxxx>
Date: Thu, 12 Mar 2009 10:17:04 +0800

On Thu, Mar 12, 2009 at 12:43 AM, Matthew Dillon
<dillon@apollo.backplane.com> wrote:
>
> :Hi all,
> :
> :After getting multiple receive queues working on two NICs, I now propose to
> :add a Toeplitz hash function to map packets to CPUs.  It is mainly used to
> :support "receive side scaling"
> :(http://www.microsoft.com/whdc/device/network/ndis_rss.mspx) in
> :..
> :forwarding path is even CPU localized).  If the packet is non-fragmented
> :TCP, the hash is calculated based on laddr,faddr,lport,fport; otherwise the
> :hash is calculated using laddr,faddr.
> :
> :There are two things we need to overcome:
> :1) The result of the hash function in the M$ paper is non-commutative,
> :i.e. faddr,laddr,fport,lport and laddr,faddr,lport,fport give
> :different results.
> :Thanks to corecode's suggestion, as long as 0xabcd is replicated to
> :form the key, the result of the hash function _is_ commutative.
> :
> :2) It is computationally heavy
> :Thanks to corecode again, we could cache a pre-calculated result
> :table, so we actually only need to index an array and XOR the results.
> :..
> :A simple implementation is at:
> :http://leaf.dragonflybsd.org/~sephe/toe.c
> :
> :I used it to verify that hardware gives the correct result :)
> :
> :The whole thing is not implemented yet, but if you don't think it's a bad
> :idea, I will move on to implement it.  Note, it is not intended to
> :replace the current packet hash function.
> :
> :Best Regards,
> :sephe
>
>    Yes, I agree, we definitely need to make this an end-to-end capable
>    feature, similar to propagating hardware checksum state.  My suggestion
>    is that we add appropriate flags and a hash field in the mbuf header.

Yeah, that's my plan too; as it happens, we have 2 unused bytes at the
end of m_pkthdr :)

>
>    I don't understand the 0xabcd replication comment w/ regards to making

Receive side scaling requires a key of at most 40 bytes.

According to the M$ paper, the key content is completely random.
Following is the key in M$'s RSS verification suite:
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa
The hashing result is not commutative.

However, if we pick two random bytes and form the key by replicating
them, e.g. if the two random bytes form 0x5a6d (that's what I meant by
0xabcd :), then the key will be:
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a
The hashing result is commutative.

I should note that the reason we only use _2_ random bytes to form
the key is that the hash on a TCP segment is calculated over a binary
string formed by concatenating laddr,faddr,lport,fport, so we are
actually limited by the size of lport/fport, which is 2 bytes.  This
also reduces the size of the pre-calculated hash result table.

>    both directions of the connection produce the same hash value, but I
>    agree that would be desirable.  Does it avoid having to recalculate
>    a hardware-supplied hash code?

The hash only needs to be calculated in software when a userland
application sends a packet; a simple optimization is to save the hash
value in the inpcb for a TCP socket, so we can avoid recalculating the
hash for every segment (the hash results are the same anyway).
On the receive side, the hash value is calculated by the hardware,
which delivers it to us in the RX descriptor, so we don't need to
calculate it ourselves.

=============

When we dispatch a packet upon RX, there are two things that we
take into consideration:
1) packet type
2) hash value of the packet

As long as the hardware provides them, we don't even need to look at
the data part of the mbuf (an mbuf cluster is always used on the RX
path, so the data is always in a different memory position).

1) exposes a design flaw in "receive side scaling": M$ says nothing
about it.  However, hardware vendors seem to be starting to realize
it.  jme (relatively newer chips) provides the packet type in the RX
descriptor, while the older Intel hardware supported by em/emx does
not provide that information without header-split RX descriptors.
However, it can be inferred from the hash type and the RXCSUM results.
Intel's newer hardware (82576) seems to provide the packet type in the
RX descriptor.  Hmm, this does look like the trend :)

Best Regards,
sephe

-- 
Live Free or Die


