DragonFly hashes input packets and sockets using Toeplitz. The Toeplitz key is created from 2 bytes, so the hash result is commutative for the TCP 4-tuple and the UDP 2-tuple, i.e. swapping source and destination yields the same hash. This commutativity of the hash result is important.

In DragonFly there is one NETISR kernel thread for each CPU, and each NETISR is bound to its CPU. The TCP syncache, the connected socket inpcb, the inpcb connection hash table and the inpcb wildcard hash table are per-CPU. However, the TCP listen socket's completion queue and incompletion queue are shared across NETISRs, i.e. across CPUs.

The inpcb wildcard hash and the TCP listen socket:

  [diagram: the per-CPU inpwild_hash entries on CPU0 and CPU1 both point to the single listen socket's shared so_comp and so_incomp queues, which are filled by SYNCACHE0 (driven by NETISR0) and SYNCACHE1 (driven by NETISR1).]

Access to so_comp and so_incomp is protected by a pooled token. In the nginx case, this is what happens with the traditional listen socket inheritance.

The inpcb connection hash and the TCP connected socket:

  [diagram: each CPU has its own inpconn_hash; a connected TCP inp is linked only into the inpconn_hash of its CPU and is accessed only by that CPU's NETISR (NETISR0 on CPU0, NETISR1 on CPU1).]

Operations on a TCP connected socket need no protection and are CPU-localized. The connected socket's TCP 4-tuple Toeplitz hash is masked with ncpus2_mask to pick the CPU on which the connected socket should be processed. In the nginx case, this is the picture once the sockets are accepted.

SO_REUSEPORT introduces a per-CPU inpcb group hash. Here the number of TCP listen sockets is the same as the number of CPUs:

  [diagram: UTHREAD0 accept(2)s on TCP listen sock0 (CPU0) and UTHREAD1 accept(2)s on TCP listen sock1 (CPU1); each CPU's inpgroup_hash entry has two slots, slot 0 pointing to listen sock0's comp/incomp queues and slot 1 pointing to listen sock1's comp/incomp queues; a SYN whose hash selects index 0 is handled by SYNCACHE0/NETISR0 and lands on listen sock0's queues, while index 1 goes through SYNCACHE1/NETISR1 onto listen sock1's queues.]

The incoming TCP SYN's Toeplitz hash is taken modulo the number of SO_REUSEPORT TCP listen sockets on the same port/address to generate the index into the inpgroup_hash entry, so access to so_comp and so_incomp, i.e. accept(2), is CPU-localized.

The data output path:

  [diagram: UTHREAD0 (TCP listen sock0) on CPU0 hands the mbuf to be sent on inp0 to NETISR0, which puts the mbuf onto inp0's so_snd.]

As long as UTHREAD0 contains only sockets from SO_REUSEPORT TCP listen sock0 and it stays on CPU0 (I have not provided a CPU binding hint yet; however, the DragonFly scheduler helps a lot here), send(2) is CPU-localized.

The data input path:

  [diagram: NETISR0 on CPU0 delivers the received mbuf to inp0's so_rcv through the per-CPU inpconn_hash; UTHREAD0 (TCP listen sock0) extracts the received mbuf from so_rcv.]

As long as UTHREAD0 contains only sockets from SO_REUSEPORT TCP listen sock0 and it stays on CPU0, recv(2) is CPU-localized.

The socket close path:

  [diagram: UTHREAD0 (TCP listen sock0) on CPU0 sends the "close inp0" message to NETISR0, which performs soclose(inp0).]

As long as UTHREAD0 contains only sockets from SO_REUSEPORT TCP listen sock0 and it stays on CPU0, close(2) is CPU-localized.
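As a concrete illustration of the nginx-style usage above, here is a minimal userland sketch (not taken from nginx or the DragonFly sources; the helper name and parameters are illustrative) of how a multi-process server could create one SO_REUSEPORT TCP listen socket per worker, so that each worker accept(2)s only from its own listen socket:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <stdint.h>
#include <string.h>

/*
 * Create a TCP listen socket with SO_REUSEPORT set, bound to the given
 * port on INADDR_ANY.  Each worker process creates its own listen socket
 * this way; the kernel's per-cpu inpcb group hash then spreads incoming
 * connections among the workers' listen sockets.
 */
static int
create_reuseport_listener(uint16_t port)
{
	struct sockaddr_in sin;
	int s, on = 1;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		err(1, "socket");
	if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0)
		err(1, "setsockopt(SO_REUSEPORT)");

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);		/* BSD sockaddr length field */
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");
	if (listen(s, 256) < 0)
		err(1, "listen");
	return (s);
}

Each worker would call this once after fork(2) and then accept(2), recv(2), send(2) and close(2) on the resulting sockets; as described above, as long as a worker stays on one CPU and only handles connections from its own listen socket, those operations stay CPU-localized.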
THE REST OF THE RELATED PARTS:

If the network hardware does not support RSS, using MSI or a line interrupt:

  [diagram: NIC_ITHREAD drains the RX ring, computes hash = (software Toeplitz hash & ncpus2_mask) for each packet, and dispatches pkt(hash==0) to NETISR0 on CPU0 and pkt(hash==1) to NETISR1 on CPU1, where the ETHERNET/IP/TCP processing runs.]

If the network hardware does not support RSS, using polling(4):

  [diagram: NETISR0 itself polls the RX ring, computes hash = (software Toeplitz hash & ncpus2_mask), keeps pkt(hash==0) for local processing and dispatches pkt(hash==1) to NETISR1 on CPU1.]

If the network hardware supports RSS, unlike other OSes, we program the same key used by the software into the hardware and configure the redirect table in the following fashion:

    (hash & ring_cnt_mask) == rdr_table[hash & rdr_table_mask]

In DragonFly, an MSI-X ithread is bound to the specific CPU chosen by the driver, which knows which CPU should process the input packets of a given RX ring.

If the network hardware supports RSS, using MSI-X:

  [diagram: MSIX_ITHREAD0 drains RX ring0 and MSIX_ITHREAD1 drains RX ring1; with hash = (hardware Toeplitz hash & ncpus2_mask), pkt(hash==0) goes to NETISR0 on CPU0 and pkt(hash==1) goes to NETISR1 on CPU1.]

If the network hardware supports RSS, using polling(4):

  [diagram: NETISR0 polls RX ring0 and NETISR1 polls RX ring1 directly; with hash = (hardware Toeplitz hash & ncpus2_mask), pkt(hash==0) stays on CPU0 and pkt(hash==1) stays on CPU1.]

The transmission path, when the number of hardware TX rings is less than the number of CPUs, or the ALTQ packet scheduler is enabled (assume CPU0 processes the hardware TXEOF):

  [diagram: NETISR0 and NETISR1 both enqueue their mbufs onto if_subq0, which feeds the single TX ring; draining if_subq0 into the TX ring may be contended or not, depending on whether both CPUs try to drain it at the same time.]

The transmission path, when the number of hardware TX rings is equal to or greater than the number of CPUs:

  [diagram: NETISR0 enqueues mbufs onto if_subq0, which feeds TX ring0; NETISR1 enqueues mbufs onto if_subq1, which feeds TX ring1.]

MY ORIGINAL PLAN, BEFORE SO_REUSEPORT:

  [diagram: a single TCP listen sock with two per-CPU completion queue headers, so_comphdr0 and so_comphdr1; SYNCACHE0 (NETISR0) fills so_comphdr0 and SYNCACHE1 (NETISR1) fills so_comphdr1; UTHREAD0 on CPU0 and UTHREAD1 on CPU1 can dequeue from either header.]

UTHREAD0 would first try dequeuing from so_comphdr0 and, if it is empty, fall back to so_comphdr1; the same applies to UTHREAD1. However, this could cause trouble when waking up the waiters on the TCP listen socket, since they actually wait on the same socket. Additionally, it would introduce too many changes to the kernel, which might affect other kinds of sockets (AF_LOCAL, SOCK_STREAM/SOCK_SEQPACKET). Comparatively speaking, the SO_REUSEPORT implementation is much simpler, much less invasive and straightforward, and the end result is good.

2013.09.05 sephe
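For illustration, a sketch of one way the redirect table relation quoted above can be satisfied (the function and variable names here are mine, not a driver's; it assumes the redirect table size and the RX ring count are both powers of 2 and that the table is at least as large as the ring count):

#include <stdint.h>

static void
fill_rss_redirect_table(uint32_t *rdr_table, int rdr_table_size, int ring_cnt)
{
	uint32_t ring_cnt_mask = ring_cnt - 1;
	int i;

	/*
	 * With rdr_table[i] = i & ring_cnt_mask, and ring_cnt_mask a subset
	 * of rdr_table_mask (= rdr_table_size - 1), indexing gives
	 *   rdr_table[hash & rdr_table_mask]
	 *     == (hash & rdr_table_mask) & ring_cnt_mask
	 *     == hash & ring_cnt_mask
	 * which is exactly the configured relation.
	 */
	for (i = 0; i < rdr_table_size; ++i)
		rdr_table[i] = i & ring_cnt_mask;
}

Since the hardware is programmed with the same Toeplitz key as the software, the ring picked this way agrees with the CPU picked by (hash & ncpus2_mask) when the number of RX rings equals the number of CPUs, as in the diagrams above.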