DragonFly hashes input packets and sockets using Toeplitz. The Toeplitz key is created from 2 bytes, so the hash result is commutative for the TCP 4-tuple and the UDP 2-tuple, i.e. swapping source and destination yields the same hash. This commutativity of the hash result is important.

In DragonFly there is one NETISR kernel thread for each CPU, and each NETISR is bound to its CPU. The TCP syncache, the connected socket inpcb, the inpcb connection hash table and the inpcb wildcard hash table are per-CPU. However, the TCP listen socket's completion queue and incompletion queue are shared across NETISRs, i.e. across CPUs.

The inpcb wildcard hash and the TCP listen socket:

  [diagram: the per-CPU inpwild_hash entries on CPU0 and CPU1 both point to the single listen socket's shared so_comp and so_incomp queues, which are filled by SYNCACHE0 (driven by NETISR0) and SYNCACHE1 (driven by NETISR1).]

Access to so_comp and so_incomp is protected by a pooled token. In the nginx case, this is what happens with the traditional listen socket inheritance.

The inpcb connection hash and the TCP connected socket:

  [diagram: each CPU has its own inpconn_hash; a connected TCP inp is linked only into the inpconn_hash of its CPU and is accessed only by that CPU's NETISR (NETISR0 on CPU0, NETISR1 on CPU1).]

Operations on a TCP connected socket need no protection and are CPU-localized. The connected socket's TCP 4-tuple Toeplitz hash is masked with ncpus2_mask to pick the CPU on which the connected socket should be processed. In the nginx case, this is the picture once the sockets are accepted.

SO_REUSEPORT introduces a per-CPU inpcb group hash. Here the number of TCP listen sockets is the same as the number of CPUs:

  [diagram: UTHREAD0 accept(2)s on TCP listen sock0 (CPU0) and UTHREAD1 accept(2)s on TCP listen sock1 (CPU1); each CPU's inpgroup_hash entry has two slots, slot 0 pointing to listen sock0's comp/incomp queues and slot 1 pointing to listen sock1's comp/incomp queues; a SYN whose hash selects index 0 is handled by SYNCACHE0/NETISR0 and lands on listen sock0's queues, while index 1 goes through SYNCACHE1/NETISR1 onto listen sock1's queues.]

The incoming TCP SYN's Toeplitz hash is taken modulo the number of SO_REUSEPORT TCP listen sockets on the same port/address to generate the index into the inpgroup_hash entry, so access to so_comp and so_incomp, i.e. accept(2), is CPU-localized.

The data output path:

  [diagram: UTHREAD0 (TCP listen sock0) on CPU0 hands the mbuf to be sent on inp0 to NETISR0, which puts the mbuf onto inp0's so_snd.]

As long as UTHREAD0 contains only sockets from SO_REUSEPORT TCP listen sock0 and it stays on CPU0 (I have not provided a CPU binding hint yet; however, the DragonFly scheduler helps a lot here), send(2) is CPU-localized.

The data input path:

  [diagram: NETISR0 on CPU0 delivers the received mbuf to inp0's so_rcv through the per-CPU inpconn_hash; UTHREAD0 (TCP listen sock0) extracts the received mbuf from so_rcv.]

As long as UTHREAD0 contains only sockets from SO_REUSEPORT TCP listen sock0 and it stays on CPU0, recv(2) is CPU-localized.

The socket close path:

  [diagram: UTHREAD0 (TCP listen sock0) on CPU0 sends the "close inp0" message to NETISR0, which performs soclose(inp0).]

As long as UTHREAD0 contains only sockets from SO_REUSEPORT TCP listen sock0 and it stays on CPU0, close(2) is CPU-localized.
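As a concrete illustration of the nginx-style usage above, here is a minimal userland sketch (not taken from nginx or the DragonFly sources; the helper name and parameters are illustrative) of how a multi-process server could create one SO_REUSEPORT TCP listen socket per worker, so that each worker accept(2)s only from its own listen socket:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <stdint.h>
#include <string.h>

/*
 * Create a TCP listen socket with SO_REUSEPORT set, bound to the given
 * port on INADDR_ANY.  Each worker process creates its own listen socket
 * this way; the kernel's per-cpu inpcb group hash then spreads incoming
 * connections among the workers' listen sockets.
 */
static int
create_reuseport_listener(uint16_t port)
{
	struct sockaddr_in sin;
	int s, on = 1;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		err(1, "socket");
	if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0)
		err(1, "setsockopt(SO_REUSEPORT)");

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);		/* BSD sockaddr length field */
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");
	if (listen(s, 256) < 0)
		err(1, "listen");
	return (s);
}

Each worker would call this once after fork(2) and then accept(2), recv(2), send(2) and close(2) on the resulting sockets; as described above, as long as a worker stays on one CPU and only handles connections from its own listen socket, those operations stay CPU-localized.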
THE REST OF THE RELATED PARTS:

If the network hardware does not support RSS, using MSI or a line interrupt:

  [diagram: NIC_ITHREAD drains the RX ring, computes hash = (software Toeplitz hash & ncpus2_mask) for each packet, and dispatches pkt(hash==0) to NETISR0 on CPU0 and pkt(hash==1) to NETISR1 on CPU1, where the ETHERNET/IP/TCP processing runs.]

If the network hardware does not support RSS, using polling(4):

  [diagram: NETISR0 itself polls the RX ring, computes hash = (software Toeplitz hash & ncpus2_mask), keeps pkt(hash==0) for local processing and dispatches pkt(hash==1) to NETISR1 on CPU1.]

If the network hardware supports RSS, unlike other OSes, we program the same key used by the software into the hardware and configure the redirect table in the following fashion:

    (hash & ring_cnt_mask) == rdr_table[hash & rdr_table_mask]

In DragonFly, an MSI-X ithread is bound to the specific CPU chosen by the driver, which knows which CPU should process the input packets of a given RX ring.

If the network hardware supports RSS, using MSI-X:

  [diagram: MSIX_ITHREAD0 drains RX ring0 and MSIX_ITHREAD1 drains RX ring1; with hash = (hardware Toeplitz hash & ncpus2_mask), pkt(hash==0) goes to NETISR0 on CPU0 and pkt(hash==1) goes to NETISR1 on CPU1.]

If the network hardware supports RSS, using polling(4):

  [diagram: NETISR0 polls RX ring0 and NETISR1 polls RX ring1 directly; with hash = (hardware Toeplitz hash & ncpus2_mask), pkt(hash==0) stays on CPU0 and pkt(hash==1) stays on CPU1.]

The transmission path, when the number of hardware TX rings is less than the number of CPUs, or the ALTQ packet scheduler is enabled (assume CPU0 processes the hardware TXEOF):

  [diagram: NETISR0 and NETISR1 both enqueue their mbufs onto if_subq0, which feeds the single TX ring; draining if_subq0 into the TX ring may be contended or not, depending on whether both CPUs try to drain it at the same time.]

The transmission path, when the number of hardware TX rings is equal to or greater than the number of CPUs:

  [diagram: NETISR0 enqueues mbufs onto if_subq0, which feeds TX ring0; NETISR1 enqueues mbufs onto if_subq1, which feeds TX ring1.]

MY ORIGINAL PLAN, BEFORE SO_REUSEPORT:

  [diagram: a single TCP listen sock with two per-CPU completion queue headers, so_comphdr0 and so_comphdr1; SYNCACHE0 (NETISR0) fills so_comphdr0 and SYNCACHE1 (NETISR1) fills so_comphdr1; UTHREAD0 on CPU0 and UTHREAD1 on CPU1 can dequeue from either header.]

UTHREAD0 would first try dequeuing from so_comphdr0 and, if it is empty, fall back to so_comphdr1; the same applies to UTHREAD1. However, this could cause trouble when waking up the waiters on the TCP listen socket, since they actually wait on the same socket. Additionally, it would introduce too many changes to the kernel, which might affect other kinds of sockets (AF_LOCAL, SOCK_STREAM/SOCK_SEQPACKET). Comparatively speaking, the SO_REUSEPORT implementation is much simpler, much less invasive and straightforward, and the end result is good.

2013.09.05 sephe
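For illustration, a sketch of one way the redirect table relation quoted above can be satisfied (the function and variable names here are mine, not a driver's; it assumes the redirect table size and the RX ring count are both powers of 2 and that the table is at least as large as the ring count):

#include <stdint.h>

static void
fill_rss_redirect_table(uint32_t *rdr_table, int rdr_table_size, int ring_cnt)
{
	uint32_t ring_cnt_mask = ring_cnt - 1;
	int i;

	/*
	 * With rdr_table[i] = i & ring_cnt_mask, and ring_cnt_mask a subset
	 * of rdr_table_mask (= rdr_table_size - 1), indexing gives
	 *   rdr_table[hash & rdr_table_mask]
	 *     == (hash & rdr_table_mask) & ring_cnt_mask
	 *     == hash & ring_cnt_mask
	 * which is exactly the configured relation.
	 */
	for (i = 0; i < rdr_table_size; ++i)
		rdr_table[i] = i & ring_cnt_mask;
}

Since the hardware is programmed with the same Toeplitz key as the software, the ring picked this way agrees with the CPU picked by (hash & ncpus2_mask) when the number of RX rings equals the number of CPUs, as in the diagrams above.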