socket: Extend SO_REUSEPORT to distribute workload to available sockets

The idea comes from the SO_REUSEPORT support recently added to Linux
by Google:
https://lwn.net/Articles/542629/
(thanks to aggelos@ for pointing it out to me)

DragonFly already supports SO_REUSEPORT.  However, the original
support only allows the first wildcard address bound socket or the
last non-wildcard address bound socket to receive input, e.g.
accept(2) on a TCP socket or datagram reception on a UDP socket; the
rest of the sockets bound to the same port will _not_ get any input.

This commit extends SO_REUSEPORT so that all sockets bound to the same
address and same port can receive input, selected by the input
packet's hash.  The workload, e.g. accept(2) or datagram reception,
can thus be evenly distributed among the sockets (imagine each socket
being handled by one process/thread).  Compared with the traditional
and commonly used single-socket model, this extension can also reduce
user space contention on a TCP listen socket's so_comp or a UDP
socket's so_rcv.

Implementation details:
- Introduce inp_localgroup, which groups inpcbs bound to the same
  address and same port.
- Add an inp_localgroup hash table to inpcbinfo.  This hash table is
  allocated only for protocols supporting the SO_REUSEPORT extension.
  Currently only TCP and UDP support it.
- When an inpcb is inserted into the inpcbinfo wildcard hash table, it
  is also inserted into the corresponding inp_localgroup.
- Before looking up an inpcb in the inpcbinfo wildcard hash table, we
  check inpcbinfo's inp_localgroup hash table first.  If there is a
  matching inp_localgroup, the packet hash is used to pick one of the
  inpcbs from the inp_localgroup, and this inpcb is used for further
  processing of the packet.
  The packet hash's low bits (ncpus2_shift of them), which are used to
  dispatch the packet to the proper netisr, are ignored, since they
  may introduce unfairness between inpcbs in the same inp_localgroup.
  Hash-threshold is used instead of modulo-N to pick the inpcb from
  the inpcbs in the same inp_localgroup (see
  http://tools.ietf.org/html/rfc2992 for hash-threshold and modulo-N).

 inp_localgroup hash table
 |     :    |
 +----------+      +--------------+      +--------------+
 |    79    |      |inp_localgroup|      |inp_localgroup|
 +----------+      +--------------+      +--------------+
 |    80    |----->|     *:80     |----->|192.168.2.1:80|
 +----------+      +--------------+      +--------------+
 |    81    |      |    inpcb1    |      |    inpcb4    |
 +----------+      +--------------+      +--------------+
 |     :    |      |    inpcb2    |<--+
                   +--------------+   |
                   |    inpcb3    |   |
                   +--------------+   |
                                      |
            input SYN dst 10.0.0.1:80 |
                                      |  15           3 2 0
                                      |  +-------------+---+
                                      |  |      hash       |
                                      |  +-------------+---+
                                      +--|<-- used --->|
                                           (ncpus == 8)

Limitations:
- Each inp_localgroup can hold at most 256 inpcbs, which should
  probably be enough.
- Jailed sockets are not entered into an inp_localgroup, since the
  original inpcb preference of in_pcblookup_hash() must be kept.
- Wildcard IPv4-mapped INET6 sockets are not entered into an
  inp_localgroup, since the original inpcb preference of
  in_pcblookup_hash() must be kept.
- If one of the sockets in the inp_localgroup is closed, e.g. because
  the process handling the socket crashed:
  For TCP, a certain number of TCP syncache entries may be dropped
  prematurely by the syncache timeout, and the sockets on the closed
  socket's so_comp are all closed.
  For UDP, all of the datagrams on the closed socket's so_rcv are
  dropped.
  However, this already happened before this commit.

Sysctl nodes net.inet.tcp.reuseport_ext and net.inet.udp.reuseport_ext
are added to enable/disable this SO_REUSEPORT extension on TCP and
UDP.  They are enabled by default.
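
For illustration only, the inpcb selection described above could be
sketched roughly as follows.  This is not the code in this commit:
pick_hash_threshold() and pick_modulo_n() are hypothetical names, the
16-bit hash width is assumed from the diagram, and cpu_shift stands in
for ncpus2_shift.  The key space left after dropping the netisr
dispatch bits is split into `count` equal regions, and the region that
the key falls into is the index of the chosen inpcb.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only; not the kernel code.
 *
 * Hash-threshold selection: drop the low cpu_shift bits of the packet
 * hash (the bits already consumed by netisr dispatch), divide the
 * remaining key space into `count` equal regions and return the index
 * of the region containing the key.  The result is always < count.
 */
static unsigned int
pick_hash_threshold(uint16_t pkt_hash, unsigned int cpu_shift,
    unsigned int count)
{
	uint32_t key, keymax, region;

	assert(count > 0 && cpu_shift < 16);

	key = (uint32_t)pkt_hash >> cpu_shift;	/* bits actually used */
	keymax = (uint32_t)0xffff >> cpu_shift;	/* largest possible key */
	region = keymax / count + 1;		/* region * count > keymax */
	return (key / region);
}

/*
 * Modulo-N selection over the same key, shown only for comparison.
 */
static unsigned int
pick_modulo_n(uint16_t pkt_hash, unsigned int cpu_shift,
    unsigned int count)
{
	assert(count > 0 && cpu_shift < 16);

	return (((uint32_t)pkt_hash >> cpu_shift) % count);
}
```

In this sketch, with ncpus == 8 (cpu_shift == 3) and three inpcbs in
the group, keys 0..2730 select index 0, keys 2731..5461 select index 1
and keys 5462..8191 select index 2.  Both functions spread keys over
the group; they differ in how many keys map to a different index when
the number of inpcbs in the group changes, which is the property
RFC 2992 analyzes.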