ARP-like resolution of IPSEC rules.
Currently, if a policy needs to be resolved by a key manager during connect(), our behaviour is not ideal.
KAME handles this by just dropping the first packet. TCP retransmits over and over until the IPSEC route is resolved. This behavior isn't very nice either.
The currently proposed solution is to implement something like ARP: ARP queues packets until neighbour discovery is complete, then transmits them.
Patrick McHardy and Herbert Xu came up with some draft ideas wrt. implementation.
See pages such as this one for some ideas and analysis.
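To make the queue-and-flush idea concrete, here is a minimal sketch, assuming a hypothetical per-policy pending queue and a hypothetical hook the key manager would call once the SA is installed; none of these names exist in the tree:

/* Hypothetical sketch of ARP-like queueing for packets that hit an
 * unresolved IPSEC policy.  All names are illustrative only.
 */
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct xfrm_pending_queue {
	spinlock_t		lock;
	struct sk_buff_head	skbs;	/* packets waiting for the SA */
};

/* Called instead of dropping, while policy resolution is in progress. */
static void xfrm_pending_enqueue(struct xfrm_pending_queue *q,
				 struct sk_buff *skb)
{
	spin_lock_bh(&q->lock);
	skb_queue_tail(&q->skbs, skb);
	spin_unlock_bh(&q->lock);
}

/* Hypothetical callback: the key manager has installed the SA, so
 * flush the queued packets through the now-resolvable output path.
 */
static void xfrm_pending_flush(struct xfrm_pending_queue *q,
			       int (*xmit)(struct sk_buff *skb))
{
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&q->skbs)) != NULL)
		xmit(skb);
}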
Start moving towards optional IP routing cache.
Multiple hw queues can be used to spread receive processing across CPUs; this eliminates a single CPU's utilization as the bottleneck for 10GbE performance.
Using a NIC that supports multiple hw queues and MSI-X, a network driver can do a decent job of distributing the kernel part of receive processing across CPUs, as long as it is not important which session lands on which CPU. This part doesn't require any changes outside of the driver.
This scheme can be improved further if the host tells the driver which CPU it wishes to run a particular session on. With this information, the driver can steer a session to the same CPU that the scheduler runs the socket reads on, achieving the best cache locality for both kernel and user level rx processing.
So far, the best idea for doing this seems to be the one Andi came up with: adding a new callback to the netdevice structure that is invoked every time the scheduler migrates a socket's reads to a different CPU.
This would allow the driver to migrate the kernel part of rx processing to the same CPU the reads are running on. In addition to the CPU number, it would be beneficial to get the socket's priority as well, because NIC capacity for explicit "session to cpu" steering may not be unlimited. A rough sketch of such a callback follows.
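This is only a sketch of what the proposed hook might look like; the callback, the mydrv_* helpers and the flow-entry handling are all assumptions, not existing interfaces:

/* Hypothetical addition to struct net_device: invoked by the scheduler
 * when the reader of a socket migrates to another CPU.
 */
void (*rx_flow_steer)(struct net_device *dev,
		      __be32 saddr, __be32 daddr,
		      __be16 sport, __be16 dport,
		      int cpu, int prio);

/* Hypothetical driver implementation: program the NIC so the hw queue
 * carrying this flow raises its MSI-X interrupt on 'cpu'.  'prio' lets
 * the driver decide which flows to steer when explicit flow-to-queue
 * entries in the hardware run out.
 */
static void mydrv_rx_flow_steer(struct net_device *dev,
				__be32 saddr, __be32 daddr,
				__be16 sport, __be16 dport,
				int cpu, int prio)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	int queue = mydrv_find_or_alloc_flow_entry(priv, saddr, daddr,
						   sport, dport, prio);

	if (queue >= 0)
		mydrv_set_queue_irq_affinity(priv, queue, cpu);
}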
This can arguably be left to a driver-only implementation for now, since the support needed from the stack (the ability to accept a fragmented skb bigger than the MTU) is already there. The only other thing to consider may be forcing an ACK per LRO frame; not sure if this is worthwhile...
In addition to distributing rx processing across multiple CPUs (#1 above), hw queues can be used for other things, like QoS for incoming traffic. In this case, separate queues for higher-priority traffic provide lower latency, better bandwidth, better DoS protection and more finely tuned (per queue, not per NIC) interrupt moderation.
This part needs more discussion. Possibly NAPI could be changed to take advantage of the feature, and some common user-level configuration options (via do_ioctl) may be useful too; a per-queue sketch follows.
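As an illustration of the per-queue direction, a driver with several hw queues could register one NAPI context per queue and moderate each one independently. Everything below is a hypothetical, driver-local sketch; the mydrv_* names and the rxq layout are invented:

/* Hypothetical per-queue rx state: one NAPI context and one interrupt
 * moderation setting per hw queue, so a high-priority queue can run
 * with a shorter moderation timer than the bulk queues.
 */
static int mydrv_poll(struct napi_struct *napi, int budget);	/* elsewhere */

struct mydrv_rx_queue {
	struct napi_struct	napi;
	u16			intr_usecs;	/* per-queue moderation */
};

static int mydrv_setup_rx_queues(struct net_device *dev, int nqueues)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	int i;

	for (i = 0; i < nqueues; i++) {
		netif_napi_add(dev, &priv->rxq[i].napi, mydrv_poll, 64);
		/* Assume queue 0 carries high-priority traffic:
		 * interrupt sooner for lower latency.
		 */
		priv->rxq[i].intr_usecs = (i == 0) ? 10 : 100;
	}
	return 0;
}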
MPLS support is really needed for us to be taken seriously as a full VPN solution in some environments.
DaveM wrote a skeletal implementation long ago and passed it on to Jamal, who enhanced the netlink layer significantly in order to better support configuration of things like MPLS.
Unfortunately, we all ended up in a spat with the maintainer of another MPLS Linux implementation, nobody yielded and everything ended up stuck in the mud.
It is argued that a saner way needs to exist in order to implement correct module unload for non-trivial modules such as IPV6.
Alexey has proposed a multi-stage unload sequence. In the first stage, the module removes all of its public interfaces. In the second stage, we wait for references to existing objects to go away.
Rusty is in general agreement, although he wants us to exercise caution before we go down any avenue at all. He also wants us to be aware of the good points of the current counter-based system in 2.5.x.
Alexey and I (davem) fear that, when using the counter system in a complex module, the whole thing would be polluted with module_{get,put}() calls everywhere. We also argue that, because a module has to do its own object management and reference counting anyway, the module refcounting facility is superfluous.
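Roughly, Alexey's two-stage scheme would make a module's exit path look something like the sketch below; the foo_* helpers and the wakeup mechanism are made up for illustration:

/* Hypothetical two-stage unload, per Alexey's proposal. */
static atomic_t foo_obj_count = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(foo_unload_wq);

static void __exit foo_exit(void)
{
	/* Stage 1: tear down every public interface so that no new
	 * references to the module's objects can be created.
	 */
	foo_unregister_protocols();
	foo_unregister_sysctls();

	/* Stage 2: wait for the references that already exist to go
	 * away (dropped via a hypothetical foo_put() that wakes us).
	 */
	wait_event(foo_unload_wq, atomic_read(&foo_obj_count) == 0);

	foo_free_remaining_state();
}
module_exit(foo_exit);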
There are two sets of applications that want more flexible sk_buff handling for device drivers. One is SKB recycling, as experimented with by Robert Olsson. The other is network devices which use pools of large and small buffers (typically the large buffers are page sized and the small ones are 256 bytes).
The way these smart devices work is that they watch TCP flows and accumulate data contiguously into pages, while the header portions go into the small buffers. With these devices it is pretty easy to implement receive zero-copy.
With these clever devices the big question is what exactly constitutes the header portion. Implementations I (DaveM) am aware of allow one to teach the device the basics of various protocols; for example, you can tell it what a SunRPC header looks like after the TCP part. This is important so that the data accumulated into page-sized chunks can be flipped directly into the file system cache; if the data is not actually page aligned, we can't zero-copy it.
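A hedged sketch of how a large/small buffer driver might build an skb from such hardware, assuming the NIC has already split the flow into a small header buffer and a page of accumulated payload; the mydrv_* names and descriptor layout are invented:

/* Hypothetical rx completion for a header-splitting NIC: headers land
 * in a small (~256 byte) buffer, accumulated TCP payload lands in a
 * full page that can later be flipped into the page cache.
 */
static struct sk_buff *mydrv_build_rx_skb(struct mydrv_rx_desc *desc)
{
	struct sk_buff *skb;

	skb = dev_alloc_skb(desc->hdr_len + NET_IP_ALIGN);
	if (!skb)
		return NULL;

	skb_reserve(skb, NET_IP_ALIGN);
	/* Copy only the protocol headers into the linear area. */
	memcpy(skb_put(skb, desc->hdr_len), desc->hdr_buf, desc->hdr_len);

	/* Attach the page of payload as a fragment, untouched by the
	 * CPU, so it stays page aligned for a possible zero-copy flip.
	 */
	skb_fill_page_desc(skb, 0, desc->data_page, 0, desc->data_len);
	skb->len      += desc->data_len;
	skb->data_len += desc->data_len;
	skb->truesize += PAGE_SIZE;

	return skb;
}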
In the IPSEC trees, route-based NAT is broken; the code needs to be converted to use the xfrm engine. Actually, it has now been entirely deleted from the tree. Thomas Graf supposedly has some code coming which will reintroduce this feature.
Its GC replicates the very first variant of its IPv4 counterpart, which was shown to be broken ages ago.
Several spots check things like skb->users and skb_shared() on transmit, which can never be true these days. Known offenders are the tunnel devices, ipmr.c and the loopback device. Alexey says the ipmr.c case is so bad that it should be rewritten rather than trying to repair the existing xmit code.
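The offending pattern looks roughly like the following paraphrased example (not a literal quote of any of those files): a transmit path defensively copying an skb that, on transmit, can never actually be shared, so the copy branch is dead code:

/* Paraphrased example of the dead check: on xmit the skb is owned by
 * the caller, so skb_shared() cannot be true and the copy never runs.
 */
static int example_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_shared(skb)) {			/* never true on xmit */
		struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC);

		if (!nskb) {
			kfree_skb(skb);
			return 0;
		}
		kfree_skb(skb);
		skb = nskb;
	}
	/* ... actual encapsulation and transmit ... */
	return 0;
}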
The caller of this function passes in a netlink SKB into which the IPMR route should be filled. However, if a cached IPMR route is not found, the function tries to use this same SKB to send out a packet to force IPMR route resolution!
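One way to avoid clobbering the caller's netlink skb, sketched under the assumption that the resolution path only needs its own private copy of the packet; resolve_mfc_route() is a made-up name for whatever helper actually kicks off resolution, and the real fix may look different:

/* Hypothetical fragment: on a cache miss, give the resolver its own
 * copy of the packet instead of the caller's netlink skb.
 */
if (cache == NULL) {
	struct sk_buff *skb2 = skb_copy(skb, GFP_ATOMIC);

	if (skb2 == NULL)
		return -ENOMEM;

	/* The resolver queues/consumes skb2; the caller's netlink skb
	 * stays intact and can still have the route filled into it.
	 */
	return resolve_mfc_route(vif, skb2);
}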