What is softnet?
----------------

Since 2.3.15 linux networking is multithreaded. However, it still has
a serious drawback: a big part of processing occurs in globally
serialized BH context. This patch presents one solution to this problem.

It is worth emphasizing: networking _is_ multithreaded, but this power
is not used because the linux core is stuck in the mud and we have no
choice but to make some surgery in the core, including assembly parts.

* softnet-980823

  1. This patch applies to linux-2.3.15
  2. This patch was made by Dave for Ultra and modified by me a bit,
     i.e. all kudos are to Dave, all blames are to me. 8)

  Modifications are:
  --- Intel bits.
  --- some improvements to softnet_process.
  --- deleted a good load of
      sed -e 's/mark_bh(NET_BH)/netif_wake_queue(dev)/'
      to keep the patch short. Well, I use tulip, tulip is patched.
      Fix other adapters with your own hands, guys!
      It is easier than applying a huge patch.

  Warnings:
  --- It works only on Intel and Ultra now.
  --- It will crash on an SMP machine if you try to use any protocol
      family but the SMP-safe ones (safe ones are IP, IPv6, packet,
      unix, netlink, probably decnet). Forget about IPX, appletalk etc.

* softnet-980827

  - Alpha bits.
  - One race condition in TCP slow timers.
  - Several non-softnet specific bug fixes (README.netbugs)

* softnet-980828

  - Intel/Alpha were still broken and worked only due to plain luck 8)8)
  - Per-CPU softnet statistics to make it more cache friendly
    and to enjoy cpu sharing 8)8)
  - Several non-softnet specific bug fixes (README.netbugs)

* softnet-990902

  - Temporary return to globally serialized device xmit runqueue
    to try to catch some hypothetical bugs. RX and TX are decoupled.
  - TCP keepalive patch by Pavel Krauz
  - Bug in ip_queue_xmit().
  - Some minor textual edit to encapsulate sk->dst_cache
    into sk_dst_*().

* softnet-990903

  - softnet -> softirq. Two softirqs are allocated:
    NET_RX_SOFTIRQ and NET_TX_SOFTIRQ

  Warnings:
  - Ultra is broken. Alpha is not tested yet. Intel is OK.
  - a more scalable netdevice runqueue is started, the remnants
    of new code are commented out with NETIF_NEW_RUNQ.

* softnet-990904

  - Output queue rewrite is finished. Amen!
  - Added debugging counters for skbs crossing CPUs.
  - Added debugging cpuids to packet socket, tcpdump prints the cpus
    where the packet was queued and really transmitted.

  Warnings:
  - all the devices are broken and only those which I need are repaired.
  - Ultra is still broken, alpha is still not tested.

* softnet-990906

  - Alpha is tested. It does not work 8) Though it is not because of net.
    Something is wrong with PCI or Matrox Millennium. It is ok until X
    is started, but X cannot initialize the matrox card any more.
  - 3c59x, slip, ppp.

* softnet-990915

  - synced to changes in ppp made at vger to 990915.
  - tasklets. KEYBOARD_BH is converted to tasklet as demo example.
    Actually, all the BHs may be lightened in this manner, but
    it is not evident what SMP bugs will be triggered.

* softnet-990916

  - SMP bug in IP/IPv6 frag ID generator. 2.2 is buggy too!
  - Weird SMP bug in tcp retransmit timer. 2.2 is buggy too!
  - diffserv.
  - fix number of retransmissions by Pavel Krauz
  - Attempt to apply per-cpu slab patch by Ingo Molnar failed.
    It wants 8 pages per slab! I do not eat such crap.
    Less efficient half-solution for skb heads.

* softnet-990917

  - eepro100, sunlance.
  - Ultra _wouldbe_ repaired.
  - Sparc32 bits. My sparc compiles so slowly that I am unlikely
    to test it today.

  Well, it is my first (and the last, I hope) exercise in sparc/ultra
  assembly, so do not judge me too strictly.

  [ I apologize, I forgot to add --new-file and diffserv disappeared
    from this snapshot ]

* softnet-990918 --- [ _STATUS_ _CHANGE_ ]

  - Old bottom halves are killed and converted to tasklets.
  - softirqs are made architecture independent, global_bh_count
    is prepared to death. 8)
  - SMP synchronization flag is added to timer_list.
  - Several relatively easy cases, where timer->sync allows to eliminate
    start_bh_atomic() painlessly, are cured.

* softnet-990919 --- [ _SECOND_ _STATUS_ _CHANGE_ ]

  - {start|end}_bh_atomic and synchronize_bh() are deleted,
    they are not used any more. global_bh_count is dead.
    global_bh_lock is still used by wait_on_irq() (grr...) and
    for serialization of old style BHs (grr ^ 2).

    Death was not painless. Agony was so long and terrible
    that I decided to shoot it in pity.

    [ sunrpc does not compile now. It needs sanitization
      to lock sunrpc queues by real locks. ]

  - IPv4 routing cache is threaded in TCP style (lock per bucket).

* softnet-990921

  - "completion" softirq. skbs are never freed on irq and the networking
    core may be cleaned of irq disabling.
  - sk->state_change() in tcp_rcv_synsent_state_process to cure nfs.

  [ and I lost diffserv again 8) ]

* softnet-990922

  - /proc/net/tcp deadlock [ by Andrea Arcangeli ]
  - neighbour table periodic timer is threaded.
  - the first attempt to sanitize sunrpc. [ don't use it 8) ]

* softnet-990926

  - sunrpc should work. No promises, but it is expected
    not worse than before 2.1.15.
  - inet peer tables by Andrey Savochkin.

* softnet-991003

  - merged to final version of inet peer tables.
  - tcp/ipv6 initial seq selection: the bug was similar to
    one in ipv4. Digesting only the last 32 bits opens the possibility
    to discover seqno using a different /96 prefix.
  - [ALERT!] TCP backlog is processed under user lock rather than
    under spinlock. It should be correct but can open some bugs.
  - IPv6 addrconf on non-ethernet media.

* softnet-991024

  - Nothing especially new, it is a checkpoint snapshot.
    The next one will contain timewait redux code.
  - However, a whole bunch of bug fixes and improvements appeared
    since the previous snapshot:
    * BUG: blocking PF_UNIX accept() was broken (Lars Heete)
    * BUG: missing rtnl_lock in ipconfig.c
    * BUG: .... [forgotten :-( I'll fill it later]
    * zero linger aborts TCP connection immediately.
    * send reset after fin-wait-2 timeout.
    * reset pending connections, when disconnecting listening socket.

* softnet-991107

  - TIME-WAIT recycling.
    The changes are relatively large, but they are thought out
    well enough not to add new bugs.

* softnet-991120

  - listening socket is restructured: it is cleaner and faster now.
    1. Established connections are removed from synq and enqueued
       to a separate accept_queue. Hence, accept() has guaranteed
       low latency now.
       [ It means that we can optimize it even more, not grabbing
         user sleep lock and replacing it with spinlock. ]
    2. Packets falling on a listening socket are checked only against
       synq and an additional lookup is made in the main socket hash table.
       [ One more optimization is pending: relookup is required only
         if the packet was retrieved from backlog. ]
  - By-product: memory leakage in syncookies.c is found,
    IP options are not freed. Check 2.2!

* softnet-991123

  - One change: file flags are inherited via accept() like other OSes.
    It is due to this change that ftp.inr.ac.ru was not available
    until 991125 8)
  - Synchronized to vger as of 991123. Now it should fit
    Linus' kernels with a few rejects in sparc and ppp,
    which are still not synchronized to Linus.

  NOTE. Comments above about possible deserialization of the listener
  appeared to be wrong. Alas. Yes, it is evident that
  multithreading of a listening socket is impossible:
  incoming connection requests have to synchronize to a single thread
  to be correct... It means that the only way to MTize a listener
  is to open several listening sockets and to bind them to several
  IP addresses or interfaces.

* softnet-991124

  - One small change in net/ipv4/devinet.c inet_select_addr().
    2.2 is buggy too.
  - Merge to vger as of 991124 (Linus' 2.3.22).

  NB: some evil penguinoid again broke sunrpc... I backed out my
  cures. When I see combinations sort of:

      start_bh_atomic();
      spin_lock(&xprt_lock);

  I have to take pain-killers and try to relax. OK, I defer
  review and cleaning of this crap until the final merge of vger
  and Linus' tree.

* softnet-991125

  - One less kernel lock in net/socket.c. Thanks, Andrea.

  No news from vger. Continue to sit at 2.3.22.
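
  The NOTE under softnet-991123 says the only way to MTize a listener
  is to open several listening sockets bound to several IP addresses
  or interfaces. A minimal userspace sketch of that idea in C follows.
  It is an illustration only: the helper names open_listener() and
  open_listeners() are hypothetical and not part of any patch here, and
  the backlog of 128 is an arbitrary choice.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open one TCP listening socket bound to a single
 * local IPv4 address. Returns the descriptor, or -1 on any failure. */
int open_listener(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int one = 1;
    int fd;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;              /* not a valid dotted-quad address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {  /* backlog of 128 is an arbitrary choice */
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: open one listener per address in ips[].
 * Returns the number of listeners opened (n), or -1 on failure,
 * in which case the listeners already opened are closed again. */
int open_listeners(const char *const ips[], size_t n,
                   unsigned short port, int fds[])
{
    size_t i;

    for (i = 0; i < n; i++) {
        fds[i] = open_listener(ips[i], port);
        if (fds[i] < 0) {
            while (i > 0)
                close(fds[--i]);
            return -1;
        }
    }
    return (int)n;
}
```

  Each descriptor returned in fds[] would then be handed to its own
  thread, which blocks in accept() on that socket only, so no two
  threads ever share one listener.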

* softnet-991130

  - Andrea's patch to remove the last kernel lock in net/socket.c
    and to get rid of useless i_count hack. (Andrea Arcangeli)
  - SNMP statistics and sockets_in_use are made SMP consistent.
    NB: do not forget about prot->inuse, it is still broken.
  - Do nagling according to draft-minshall-nagle-01.txt (Cacophonix Gaul)
  - Some (most likely wrong, but not breaking anything) experiments
    with delayed acks heuristics. Big goal is to try to make
    Beowulf folks happy and not to make all the rest of the world unhappy.

* softnet-991211-2.3.31

  - [status change] Ingo Molnar ported the patch to 2.3.30.
    OK, I see, Dave has some troubles with synchronization,
    so I move to 2.3.31 too.
  - Some precautions made in tcp_recvmsg and caused by absence
    of socket semaphore in <=2.2 are removed.
  - UNIX98 says that msg_name, msg_namelen are _ignored_ for TCP!
    Superb, it is real optimization 8)
  - Nagle algorithm changed: "small" frames are not sizemaster
    and two helper functions)
  - sk->nonagle is moved to its proper place, namely, tcp_opt.
  - inet_stream_connect is rectified. Damn, now it just cannot
    be wrong!
    [ Explanation: I have a report that 2.2.13 still sticks in connect().
      I ignored such reports because I explained them by that bug,
      which was fixed only in 2.2.11. Now I have a report for 2.2.13...
      I have to review inet_stream_connect(). It was really incorrect,
      but I still did not find the reason of such unpleasant 2.2.13
      behaviour. ]
  - Send ACKs not each _two_ MSS, but _more_ than _one_ MSS.
    It is equivalent for a well-formed full-MSS stream, but behaves
    better with nagled connections (packet tails generate immediate ACK).
    Certainly, the total number of ACKs increases.
  - move get_port() back to inet_stream_connect(). I moved it
    earlier to make TW recycling, now it is not required.

  [ 991213: damn, it crashed today! Seems, somewhere in recovery
    after router failure. I'll investigate it tomorrow.
  ]

* softnet-991214-wrt-2.3.32 and softnet-991214-wrt-2.3.31-991212

  - patch for ingress qdisc from Jamal Hadi Salim.
  - serialize notifiers, they were racy.
  - some comment edit
  - bug exposed yesterday is fixed (it was really stupid; return
    value of tcp_v4_route_req() was not checked for NULL)
  - start to check for local output errors in TCP. The very beginning.
    The idea is to move send_head only if the packet was _really_ sent,
    and use probe0 timer for local retransmits. Seems, the idea is very
    promising and the result is expected to be nice.

* softnet-991214-wrt-2.3.33

  - Fixed for 2.3.33, the changes conflict with patch-2.3.33.
  - Some minor changes also: return values, less magic numbers.
  - Notifier fixes are backed out! Networking does not need them,
    it can use rtnl_lock(). And let callers using notifiers
    from interrupts take care of themselves.
  - preparing to war with local retransmits: /proc/net/tcp
    shows probe0 timer status. Seems, we are well weaponed and
    can overthrow the enemy 8)

* softnet-991218-wrt-2.3.33-991214 (incremental)

  [ BEWARE! This one can contain surprises in TCP ]

  - next portion of diffserv fixes from Jamal.
  - local flow control for TCP as outlined above.
  - non-SMP safe protocols are protected by spinlock.
  - TCP_LINGER2 to reduce FIN-WAIT-2 on http servers.
  - Nagle affects only _tail_ of queue!!!
  - FIN is sent even with TCP_CORK. It saves one syscall per session.
  - some fixes from Andrea to compile on Alpha.

* softnet-991220-wrt-2.3.33-991214 (incremental)
  softnet-991220-wrt-2.3.34

  - lost ingress patches are restored.
  - some documentation bits in ip-sysctl.txt
  - tcp_orphan_retries, tcp_abort_on_overflow
  - drop SYN requests when acceptq is full.
  - tcp_listen_start() moved to tcp.c, prepared to doing
    something real with syn queues. Current code is pathetically
    not scalable.
  - inet_wait_for_connect() was still buggy 8)

* softnet-991224-wrt-2.3.34-991220 (incremental)

  - Listening socket is reworked. Syn queue is hashed.
  - Slow timers are killed.

  Well, the last thing to do is moving orphaned FIN-WAIT-2 sockets
  to time-wait like buckets. It is easy.

  There are two things which _must_ be made, but apparently
  not in this millennium. 8) Namely:
  - receiver window control to avoid congestion by small packets.
  - receiver window control to help the sending peer not to congest
    our slow link.

  And the thing scheduled to this millennium: driver updates...
  Boring... but it has to be made. I'll put them to a separate patch.

* softnet-991225-wrt-2.3.34-991220 (incremental)
  + softnet-drivers-991225-wrt-2.3.34-991225
    (applied after previous one!) [NEW]

  - ip-sysctl.txt is edited to reflect new status.
  - new Jamal's skeleton.c
  - Beat me, blame me, I used FIONBIO instead of FIONREAD.
    Now netscape and emacs under X work again 8)
  - New option TCP_DEFER_ACCEPT will decrease scheduling load
    on HTTP servers.

  Well, bulk work on softnetting drivers. I apologize, it is pure edit,
  do not blame me if it contains misprints, forgotten prototypes etc.
  Updated drivers are mainly equivalent to old ones with all the bugs,
  so that they should work equally both with softnet and without it.

* softnet-991226-wrt-2.3.34-991220 (incremental)
  + old softnet-drivers-991225-wrt-2.3.34-991225 will fit.

  - FIN-WAIT-2 buckets. Another optimization, dead FIN-WAIT-2 sockets
    are handled like time-wait buckets i.e. eat only ~120 bytes
    and do not eat a destination cache entry. Great win!

  BEWARE!
  1. It is not well tested.
  2. TW recycling is disabled _temporarily_.

* softnet-991227-wrt-2.3.34-991220 (incremental)
  + old softnet-drivers-991225-wrt-2.3.34-991225 will fit.

  - Timewait recycling is restored back

* softnet-991228-wrt-2.3.34-991227 (incremental)
  + old softnet-drivers-991225-wrt-2.3.34-991225 will fit.

  - TCP checksumming/copying/processing in user process context.
  - SO_RCVLOWAT

  Not commented. 8) [ I am not sure really, that it is better.]

  Advices:
  - Use SO_RCVLOWAT or MSG_WAITALL.
  - I can donate ftp client, modified to deliver to mmapped file.
    It is not easy to compile it though, I think.

  [ I am sorry, I planned to prepare full diff wrt vger today,
    but it refused to compile. It is purely a problem with my
    old assembler. I'll fix it tomorrow. ]

* softnet-991229-wrt-2.3.34-991227 (incremental)
  softnet-991229-wrt-vger-991228 (it is big one wrt VGER tree)
  + softnet-drivers-991229-wrt-2.3.34-991229
    (all that I edited to this moment).

  - Nothing changed really, some minor improvements and merge to vger.
  - softnet-drivers are updated. Status is the same, all of them
    (except for those contained in main patch) were _never_ tested and
    _never_ compiled.

  [ BAD NEWS ] Andrea was bitten by a weird bug in routing
  in 2.3.35, which is _not_ fixed by softnet. Well, apparently,
  it will open a new season of hunting in the new millennium 8)

  NOTE: the patch wrt vger backs out apparently buggy skb_expand.
  Paul, please, be more accurate next time when submitting patches.

* softnet-000103-wrt-vger-000103 (wrt vger as of Jan 3)

  - in a hurry (and being after holiday) I forgot to record
    what changed there 8) Trying to recover.
  - OK. Nothing changed really, it is merge to vger as of 000103,
    restoring new skb_expand etc. The only change is in tcp.c:
  - Use the fact that tcp receive queue cannot contain skbs
    with users!=1.

* softnet-000104-wrt-vger-000103 (wrt vger as of Jan 3)
  softnet-000104-wrt-000103 (incremental wrt yesterday's one)

  - Dave's Ultra code.
  - Dave's fixes to netsyms, slip etc.
  - Fix for bug found by Ingo. After I changed lookup in open request
    chain to lookup in hash table, I forgot to return check
    for user lock. It is strange. I do not forget such things 8)
    Actually, I have to think more, seems, this silly situation
    is specific for khttpd.
  - Relax per-host PAWS check, giving it some replay window.
  - Reverted change of 991123, adding inheriting O_NONBLOCK via accept().
    Alas, it broke RedHat inetd.

    [ Important note to everyone's notebook.
      Linux historically had lots of bad bugs in blocking socket
      semantics.
      Particularly, inetd should not use O_NONBLOCK at all
      when accept()ing, because accept() must not block after
      select() success. Old Linuxes did block sometimes!
      And 2.3 should behave as it is supposed to. It is the reason
      why linux ports are full of useless fiddling with O_NONBLOCK.
      Actually, looking into ip-routing, I see that I added
      to all the ports from BSD useless (from a normal human's
      viewpoint) FIONBIOs and fcntl(O_NONBLOCK), preventing
      illegal blocks after select() success. All this is _history_.
      Each case of blocking after select() is a fatal bug,
      which must not be worked around in applications,
      but searched and shot in the kernel.
    ]
  - Errors in comments.

* softnet-000105-wrt-000104

  FROZEN AT THIS POINT. Waiting for good weather 8)

  - bitter comments around tcp_sendmsg(). 8)
  - Bug in young_qlen syn counter.
  - Temporarily added special /proc/net/tcp record formatting
    for listening sockets (qlen instead of sequence numbers)
  - do not sleep in backlog accelerator in tcp_recvmsg.

|===========================================================================
| softnet-000105-wrt-vger-000108
|
| - Synchronization to partial merge to 2.3.38.
|   Are date stamps confuxing? Well, it means that softnet-990105
|   is merged and diffed against vger of Jan 8.
|
|===========================================================================

* softnet-000108-wrt-vger-000108

  - Synchronized to vger of Jan 8 (read, 2.3.38)

  ~100K are gone, uhh... Namely, 2.3.38 includes the following
  relatively big chunks:

  - DiffServ (Jamal and Werner)
  - Andrey's inetpeer and fragment ID generator.
  - My new pmtu discovery for IP datagrams, I hope now
    it is really full-fledged. Now we send all the IP data
    with DF set by default.
  - Steve's decnet updates.
  - Part of my IP route.c updates, related to per-bucket lock and size
    depending on amount of memory.
  - Some bug fixes (not all, unfortunately)

  And several small changes:

  - A few new bugs are found and fixed:
    * skb leak was introduced in tcp_child_process() by previous
      patchlet.
    * PAWS in receiving ACKs for active connect() is removed.
      "Reversed" PAWS is enough in this case, "direct" PAWS
      will evidently break when contacting these stupid
      clustered servers (Oh! How I hate these shitty dumb clusters!
      I bet soon we will see Internet flooded with permanent ACK
      storms, if their engineers will not decide to read some
      textbooks eventually.)

  - A few changes:
    * Exclusive wakeups in datagram, packet, unix, netlink and
      tcp listening hash table lock, using wake_up_*_all.
      Silly goto wake_all in inet_wait_for_connect() is removed.
    * MSS is bounded by max(SND.WND)/2. It is always good
      (at least two segments in pipe) and makes sender SWS
      avoidance essentially trivial. Anti-MacOS dead lock breaker
      is removed. Let those people report this bug again.
      I am sure that the situation can be rectified without
      ugly (and wrong) tricks.
    * Small optimization of VJ path, pred_flags are reset
      in state change, so that we need not check for sk->state
      installing reader task. Also, tcp_prequeue() is extracted
      to separate primitive in tcp.h and small comment is added.

|===========================================================================
| softnet-000105-wrt-vger-000111
|
| - Synchronization to vger of Jan 11.
|   Are date stamps confuxing? Well, it means that softnet-990105
|   is merged and diffed against vger of Jan 11
|
|===========================================================================

* softnet-000111-wrt-vger-000111

  - Synchronized to vger of Jan 11 (read, 2.3.39)

  >60K are gone. Namely, 2.3.39 includes the following:

  - threaded SNMP statistics

  Bugs:

  - Check dev->hard_header return value in arp_send and IPv6 MLD.
  - POLLHUP. More analysis is still required. Old way was wrong,
    I was bitten by this.
  - Exclusive waiting was a bit wrong, fixed.
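
  The MSS bound from softnet-000108 (MSS is bounded by max(SND.WND)/2)
  works out as in the small C sketch below. It only illustrates the
  arithmetic and is not the kernel code: bound_mss() and its parameter
  names are hypothetical, and leaving the MSS untouched while no
  window has been seen yet is an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: clamp the sending MSS to half of the largest
 * window the peer has ever advertised (max(SND.WND)/2), so that at
 * least two segments always fit in the pipe. */
uint32_t bound_mss(uint32_t mss, uint32_t max_snd_wnd)
{
    uint32_t half = max_snd_wnd / 2;

    /* Assumption of this sketch: with no usable window seen yet,
     * keep the MSS as it is rather than clamp it to zero. */
    if (half == 0)
        return mss;

    return mss < half ? mss : half;
}
```

  With mss = 1460 and a peer whose largest advertised window is
  2000 bytes, the sender would use 1000-byte segments, so two of them
  always fit in the window; with a 65535-byte window the MSS stays 1460.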

  And several small changes:

  - A few changes:
    * Memory accounting in prequeue.
    * Fiddling with skb_set_owner_r / skb_orphan in fast path
      is killed. All the related variables and destructors
      are touched only if packet is really queued to tcp socket.
      Now window is strict constant, while prequeue is active
      and some cpu is saved. Superb.
    * tcp_raise_window() is proven to be useless. It is a paradox,
      but its only function was _CLOSING_ window 8)8)
      in header prediction path with probability of 50%.
      Now this thing is made explicitly and in all 100% of cases 8)
    * SO_RCVTIMEO and SO_SNDTIMEO
    * TCP_WINDOW_CLAMP. For now it is the only weapon
      against "prune" storms for TCP_NODELAY small message streams.
      If TCP_WINDOW_CLAMP is set to a low enough value, the problem
      disappears.
    * TCP_DEFER_ACCEPT accepts time to timeout in seconds,
      rather than boolean.

* softnet-000118-wrt-vger-000118
  vger-000118-wrt-2.3.40-pre4

  * Most of the patch is in vger now. Only "softnet" part is not there.
    Full diff between vger as of Jan 18 and 2.3.34-pre4 is in
    vger-000118-wrt-2.3.40-pre4.
    So, this patch does not contain anything new, because all
    "new" is already in vger. However, I summarize the differences
    invisible in this patch.
  * kfree_skb() checks refcnt before attempt to decrease it.
    Minus one atomic modification per skb.
  * Two new debugging entries to /proc/net/netstat accounting
    header prediction hits.
  * Some cleanup in sock.h, #define *_MIN_*BUF etc.
  * TCP header prediction flags are set more carefully.
    Header prediction really works now in sender path and with
    window scaling.
  * poll() on sockets. Do not set POLLHUP, when socket is still
    writable. sk->prot->poll() method is deleted as redundant
    indirection.
  * TCP_MORE/LESS_COARCE_ACKS to change acking frequency in some
    situations.

* softnet-000121-wrt-vger-000121
  vger-000121-wrt-2.3.40

  * Synchronization to final 2.3.40.
    Full diff between vger as of Jan 21 and 2.3.40 is in
    vger-000121-wrt-2.3.40.
  * Also, three bugs are found (ipv4/proc.c, ipv4/tcp.c,
    ipv4/tcp_input.c). Submitted to Dave.

* 000220

  * LAUNCHED! (2.3.43-pre8)
  * Pre-fixed drivers are in softnet-drivers-2.3.43-pre8.dif.
  * Next phase is "net". Goto README.net

                                        Alexey Kuznetsov
                                        kuznet@ms2.inr.ac.ru