ARP-like resolution of IPSEC rules.
Currently, if a policy needs to be resolved by a key manager during connect(), our behaviour is not ideal.
KAME handles this by just dropping the first packet. TCP retransmits over and over until the IPSEC route is resolved. This behavior isn't very nice either.
The currently proposed solution is to implement something like ARP: ARP queues packets until neighbour discovery is complete, then transmits them.
Patrick McHardy and Herbert Xu came up with some draft ideas wrt. implementation.
See pages such as this one for some ideas and analysis.
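To make the queue-and-flush idea concrete, here is a minimal sketch, assuming a hypothetical per-policy pending queue and a hypothetical hook the key manager would call once the SA is installed; none of these names exist in the tree:

/* Hypothetical sketch of ARP-like queueing for packets that hit an
 * unresolved IPSEC policy.  All names are illustrative only.
 */
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct xfrm_pending_queue {
	spinlock_t		lock;
	struct sk_buff_head	skbs;	/* packets waiting for the SA */
};

/* Called instead of dropping, while policy resolution is in progress. */
static void xfrm_pending_enqueue(struct xfrm_pending_queue *q,
				 struct sk_buff *skb)
{
	spin_lock_bh(&q->lock);
	skb_queue_tail(&q->skbs, skb);
	spin_unlock_bh(&q->lock);
}

/* Hypothetical callback: the key manager has installed the SA, so
 * flush the queued packets through the now-resolvable output path.
 */
static void xfrm_pending_flush(struct xfrm_pending_queue *q,
			       int (*xmit)(struct sk_buff *skb))
{
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&q->skbs)) != NULL)
		xmit(skb);
}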
Start moving towards optional IP routing cache.
Multiple hw queues can be used to spread receive processing across CPUs; this eliminates a single CPU's utilization as the bottleneck for 10GbE performance.
Using a NIC that supports multiple hw queues and MSI-X, a network driver can do a decent job of distributing the kernel part of receive processing across CPUs, as long as it is not important which session lands on which CPU. This part doesn't require any changes outside of the driver.
This scheme can be improved further if the host tells the driver which CPU it wishes to run a particular session on. With this information, the driver can steer a session to the same CPU that the scheduler runs the socket reads on, achieving the best cache locality for both kernel and user level rx processing.
So far, the best idea for doing this seems to be the one Andi came up with: adding a new callback to the netdevice structure that is invoked every time the scheduler migrates a socket's reads to a different CPU.
This would allow the driver to migrate the kernel part of rx processing to the same CPU the reads are running on. In addition to the CPU number, it would be beneficial to get the socket's priority as well, because NIC capacity for explicit "session to cpu" steering may not be unlimited. A rough sketch of such a callback follows.
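This is only a sketch of what the proposed hook might look like; the callback, the mydrv_* helpers and the flow-entry handling are all assumptions, not existing interfaces:

/* Hypothetical addition to struct net_device: invoked by the scheduler
 * when the reader of a socket migrates to another CPU.
 */
void (*rx_flow_steer)(struct net_device *dev,
		      __be32 saddr, __be32 daddr,
		      __be16 sport, __be16 dport,
		      int cpu, int prio);

/* Hypothetical driver implementation: program the NIC so the hw queue
 * carrying this flow raises its MSI-X interrupt on 'cpu'.  'prio' lets
 * the driver decide which flows to steer when explicit flow-to-queue
 * entries in the hardware run out.
 */
static void mydrv_rx_flow_steer(struct net_device *dev,
				__be32 saddr, __be32 daddr,
				__be16 sport, __be16 dport,
				int cpu, int prio)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	int queue = mydrv_find_or_alloc_flow_entry(priv, saddr, daddr,
						   sport, dport, prio);

	if (queue >= 0)
		mydrv_set_queue_irq_affinity(priv, queue, cpu);
}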
This can arguably be left to a driver-only implementation for now, since the support needed from the stack (the ability to accept a fragmented skb bigger than the MTU) is already there. The only other thing to consider may be forcing an ACK per LRO frame; not sure if this is worthwhile...
In addition to distributing rx processing across multiple CPUs (#1 above), hw queues can be used for other things, like QoS for incoming traffic. In this case, separate queues for higher-priority traffic provide lower latency, better bandwidth, better DoS protection and more finely tuned (per queue, not per NIC) interrupt moderation.
This part needs more discussion. Possibly NAPI could be changed to take advantage of the feature, and some common user-level configuration options (via do_ioctl) may be useful too; a per-queue sketch follows.
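As an illustration of the per-queue direction, a driver with several hw queues could register one NAPI context per queue and moderate each one independently. Everything below is a hypothetical, driver-local sketch; the mydrv_* names and the rxq layout are invented:

/* Hypothetical per-queue rx state: one NAPI context and one interrupt
 * moderation setting per hw queue, so a high-priority queue can run
 * with a shorter moderation timer than the bulk queues.
 */
static int mydrv_poll(struct napi_struct *napi, int budget);	/* elsewhere */

struct mydrv_rx_queue {
	struct napi_struct	napi;
	u16			intr_usecs;	/* per-queue moderation */
};

static int mydrv_setup_rx_queues(struct net_device *dev, int nqueues)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	int i;

	for (i = 0; i < nqueues; i++) {
		netif_napi_add(dev, &priv->rxq[i].napi, mydrv_poll, 64);
		/* Assume queue 0 carries high-priority traffic:
		 * interrupt sooner for lower latency.
		 */
		priv->rxq[i].intr_usecs = (i == 0) ? 10 : 100;
	}
	return 0;
}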
MPLS support is really needed for us to be taken seriously as a full VPN solution in some environments.
DaveM wrote a skeletal implementation long ago and passed it on to Jamal, who enhanced the netlink layer significantly in order to better support configuration of things like MPLS.
Unfortunately, we all ended up in a spat with the maintainer of another MPLS Linux implementation, nobody yielded and everything ended up stuck in the mud.
It is argued that a saner way needs to exist in order to implement correct module unload for non-trivial modules such as IPV6.
Alexey has proposed a multi-stage unload sequence. In the first stage, the module removes all of its public interfaces. In the second stage, we wait for references to existing objects to go away.
Rusty is in general agreement, although he wants us to exercise caution before we go down any avenue at all. He also wants us to be aware of the good points of the current counter-based system in 2.5.x.
Alexey and I (davem) fear that, when using the counter system in a complex module, the whole thing would be polluted with module_{get,put}() calls everywhere. We also argue that, because a module has to do its own object management and reference counting anyway, the module refcounting facility is superfluous.
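Roughly, Alexey's two-stage scheme would make a module's exit path look something like the sketch below; the foo_* helpers and the wakeup mechanism are made up for illustration:

/* Hypothetical two-stage unload, per Alexey's proposal. */
static atomic_t foo_obj_count = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(foo_unload_wq);

static void __exit foo_exit(void)
{
	/* Stage 1: tear down every public interface so that no new
	 * references to the module's objects can be created.
	 */
	foo_unregister_protocols();
	foo_unregister_sysctls();

	/* Stage 2: wait for the references that already exist to go
	 * away (dropped via a hypothetical foo_put() that wakes us).
	 */
	wait_event(foo_unload_wq, atomic_read(&foo_obj_count) == 0);

	foo_free_remaining_state();
}
module_exit(foo_exit);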
There are two sets of applications that want more flexible sk_buff handling for device drivers. One is SKB recycling, as experimented with by Robert Olsson. The other is network devices which use pools of large and small buffers (typically the large buffers are page sized and the small ones are 256 bytes).
The way these smart devices work is that they watch TCP flows and accumulate data contiguously into pages, while the header portions go into the small buffers. With these devices it is pretty easy to implement receive zero-copy.
With these clever devices the big question is what exactly constitutes the header portion. Implementations I (DaveM) am aware of allow one to teach the device the basics of various protocols; for example, you can tell it what a SunRPC header looks like after the TCP part. This is important so that the data accumulated into page-sized chunks can be flipped directly into the file system cache; if the data is not actually page aligned, we can't zero-copy it.
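A hedged sketch of how a large/small buffer driver might build an skb from such hardware, assuming the NIC has already split the flow into a small header buffer and a page of accumulated payload; the mydrv_* names and descriptor layout are invented:

/* Hypothetical rx completion for a header-splitting NIC: headers land
 * in a small (~256 byte) buffer, accumulated TCP payload lands in a
 * full page that can later be flipped into the page cache.
 */
static struct sk_buff *mydrv_build_rx_skb(struct mydrv_rx_desc *desc)
{
	struct sk_buff *skb;

	skb = dev_alloc_skb(desc->hdr_len + NET_IP_ALIGN);
	if (!skb)
		return NULL;

	skb_reserve(skb, NET_IP_ALIGN);
	/* Copy only the protocol headers into the linear area. */
	memcpy(skb_put(skb, desc->hdr_len), desc->hdr_buf, desc->hdr_len);

	/* Attach the page of payload as a fragment, untouched by the
	 * CPU, so it stays page aligned for a possible zero-copy flip.
	 */
	skb_fill_page_desc(skb, 0, desc->data_page, 0, desc->data_len);
	skb->len      += desc->data_len;
	skb->data_len += desc->data_len;
	skb->truesize += PAGE_SIZE;

	return skb;
}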
In the IPSEC trees, route-based NAT is broken; the code needs to be converted to use the xfrm engine. Actually, it has now been entirely deleted from the tree. Thomas Graf supposedly has some code coming which will reintroduce this feature.
Its GC replicates the very first variant of its IPv4 counterpart, which was shown to be broken ages ago.
Several spots check things like skb->users and skb_shared() on transmit, which can never be true these days. Known offenders are the tunnel devices, ipmr.c and the loopback device. Alexey says the ipmr.c case is so bad that it should be rewritten rather than trying to repair the existing xmit code.
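The offending pattern looks roughly like the following paraphrased example (not a literal quote of any of those files): a transmit path defensively copying an skb that, on transmit, can never actually be shared, so the copy branch is dead code:

/* Paraphrased example of the dead check: on xmit the skb is owned by
 * the caller, so skb_shared() cannot be true and the copy never runs.
 */
static int example_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_shared(skb)) {			/* never true on xmit */
		struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC);

		if (!nskb) {
			kfree_skb(skb);
			return 0;
		}
		kfree_skb(skb);
		skb = nskb;
	}
	/* ... actual encapsulation and transmit ... */
	return 0;
}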
The caller of this function passes in a netlink SKB into which the IPMR route should be filled. However, if a cached IPMR route is not found, the function tries to use this same SKB to send out a packet to force IPMR route resolution!
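One way to avoid clobbering the caller's netlink skb, sketched under the assumption that the resolution path only needs its own private copy of the packet; resolve_mfc_route() is a made-up name for whatever helper actually kicks off resolution, and the real fix may look different:

/* Hypothetical fragment: on a cache miss, give the resolver its own
 * copy of the packet instead of the caller's netlink skb.
 */
if (cache == NULL) {
	struct sk_buff *skb2 = skb_copy(skb, GFP_ATOMIC);

	if (skb2 == NULL)
		return -ENOMEM;

	/* The resolver queues/consumes skb2; the caller's netlink skb
	 * stays intact and can still have the route filled into it.
	 */
	return resolve_mfc_route(vif, skb2);
}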