The TCP protocol must cope with unreliable data delivery. To that end it cannot blindly send data frames out and then forget about them. Instead, it must hold onto them until the receiver acknowledges receipt of the data. If a timeout occurs, or some other indication arrives that a data packet was lost, the held copy is resent as needed.

As a result, the TCP output engine is mildly non-trivial. Add in variables such as support for hardware checksum offload on the network card and TCP segmentation offload, and things get even more interesting.

Here is a diagram of the top-level TCP output queue under Linux.


TCP output queue layout

Here we can see that the TCP socket maintains a doubly linked list of all pending output data packets, headed by sk_write_queue. Another TCP socket member, sk_send_head, tracks where in the queue the next packet is to be sent.

Note that sk_send_head is for keeping track of which packets we have not sent out yet. It is not used for packet retransmission. If sk_send_head is equal to NULL it means that all packets on the sk_write_queue have been sent once already.

As ACK packets arrive from the receiver and more send window space becomes available, we walk the packet queue starting at sk_send_head and send as many frames as we can.
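The queue and send-head behaviour described above can be sketched in ordinary userspace C. This is only an illustration, not kernel code: the `sketch_` types and functions are hypothetical stand-ins, and the send window is reduced to a simple byte budget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's SKB and TCP socket. */
struct sketch_skb {
    struct sketch_skb *next, *prev;
    uint32_t len;                     /* payload bytes in this packet */
};

struct sketch_sock {
    struct sketch_skb write_queue;    /* list head; points at itself when empty */
    struct sketch_skb *send_head;     /* first never-sent packet, or NULL */
};

void sketch_sock_init(struct sketch_sock *sk)
{
    sk->write_queue.next = sk->write_queue.prev = &sk->write_queue;
    sk->write_queue.len = 0;
    sk->send_head = NULL;
}

/* Append a packet at the tail. If every earlier packet has already been
 * sent once (send_head == NULL), the new packet becomes the send head. */
void sketch_queue_tail(struct sketch_sock *sk, struct sketch_skb *skb)
{
    struct sketch_skb *head = &sk->write_queue;

    assert(skb != NULL);
    skb->next = head;
    skb->prev = head->prev;
    head->prev->next = skb;
    head->prev = skb;
    if (sk->send_head == NULL)
        sk->send_head = skb;
}

/* Walk from send_head and "send" packets while they fit into the
 * remaining window. Sent packets stay on the queue; only send_head
 * moves. Returns the number of packets sent. */
int sketch_write_xmit(struct sketch_sock *sk, uint32_t window_bytes)
{
    int sent = 0;

    while (sk->send_head != NULL && sk->send_head->len <= window_bytes) {
        struct sketch_skb *skb = sk->send_head;

        window_bytes -= skb->len;
        sk->send_head = (skb->next == &sk->write_queue) ? NULL : skb->next;
        sent++;
    }
    return sent;
}
```

With three 1000-byte packets queued and 2500 bytes of window, two packets go out and the send head is left on the third; all three remain linked on the queue.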



Outgoing packets have to carry some TCP-specific information per packet, and we use the skb->cb[] control block area for this. The same control block is also used for received TCP packets, so it serves multiple purposes and has some extra constraints which would not exist if it were used only for output processing.

Learn about the TCP SKB control block here.


TCP output engine source layout

In the TCP output engine, all paths lead to tcp_transmit_skb() regardless of whether we are sending a TCP data packet for the first time, or a retransmit, or even a SYN packet in response to a connect() system call.

At the top level, tcp_sendmsg() and tcp_sendpage() gather up data (either from userspace or the page cache) into SKB packets and tack them onto the sk_write_queue of the TCP socket. At appropriate times they invoke either tcp_write_xmit() or tcp_push_one() to try to output those data frames.

As ACKs arrive, TCP input processing calls tcp_data_snd_check() to see if we have some data packets in the write queue to send. If so, it invokes tcp_write_xmit() to do the work.

Finally, when we retransmit data frames (either via retransmission timeout, or in response to SACK blocks received), we invoke tcp_retransmit_skb(), which passes the frame to be retransmitted to tcp_transmit_skb().

Thus, the actual output of TCP data packets will occur either in response to new data from the user or in response to events (such as receiving an ACK packet, or a TCP timer expiring).



As stated previously, TCP must always keep a copy of a data packet until it is sure the other end has received the packet. The SKB clone facility is used to implement this in the cheapest possible manner. First, packets are built containing purely the data area of the TCP frame. Then the SKB holding the data is cloned. Finally, the output engine uses this SKB clone to build the TCP and IPv4/IPv6 headers.


TCP output engine SKB clone system

tcp_write_xmit() walks the TCP socket's sk_write_queue starting at sk_send_head. If the packet can be sent now, we clone it.

This clone is passed down to tcp_transmit_skb() for TCP header building. Note especially how the DATA area of the SKB on the sk_write_queue and the clone are completely shared.

When building packets, tcp_sendmsg() and tcp_sendpage() make sure to reserve enough room at the head of the SKB data area so that the TCP headers, the protocol layer (i.e. IPv4/IPv6) headers and the hardware level headers can all fit in front of the data.
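The clone-and-headroom scheme can be illustrated with a small userspace model. It is a sketch under stated assumptions rather than the kernel implementation: the `sketch_` names, the reference-counted `sketch_buf`, and the 128-byte `SKETCH_MAX_HEADER` reservation are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_HEADER 128   /* assumed room for TCP + IP + link headers */

struct sketch_buf {             /* data area, shared by an SKB and its clones */
    int refcnt;
    size_t size;
    unsigned char bytes[];
};

struct sketch_skb {             /* per-SKB metadata, private to each clone */
    struct sketch_buf *buf;
    unsigned char *data;        /* start of the valid bytes inside buf */
    size_t len;
};

/* Allocate an SKB whose data pointer starts after the reserved headroom. */
struct sketch_skb *sketch_alloc(size_t payload_room)
{
    struct sketch_skb *skb = malloc(sizeof(*skb));

    if (skb == NULL)
        return NULL;
    skb->buf = malloc(sizeof(*skb->buf) + SKETCH_MAX_HEADER + payload_room);
    if (skb->buf == NULL) {
        free(skb);
        return NULL;
    }
    skb->buf->refcnt = 1;
    skb->buf->size = SKETCH_MAX_HEADER + payload_room;
    skb->data = skb->buf->bytes + SKETCH_MAX_HEADER;
    skb->len = 0;
    return skb;
}

/* Append payload bytes at the tail; returns -1 if they do not fit. */
int sketch_put(struct sketch_skb *skb, const void *src, size_t n)
{
    size_t used = (size_t)(skb->data - skb->buf->bytes) + skb->len;

    if (n > skb->buf->size - used)
        return -1;
    memcpy(skb->data + skb->len, src, n);
    skb->len += n;
    return 0;
}

/* Clone: copy only the metadata and take a reference on the shared data. */
struct sketch_skb *sketch_clone(const struct sketch_skb *orig)
{
    struct sketch_skb *clone = malloc(sizeof(*clone));

    if (clone == NULL)
        return NULL;
    *clone = *orig;
    clone->buf->refcnt++;
    return clone;
}

/* Prepend header space by moving this SKB's data pointer into the headroom. */
unsigned char *sketch_push(struct sketch_skb *skb, size_t hdr_len)
{
    assert((size_t)(skb->data - skb->buf->bytes) >= hdr_len);
    skb->data -= hdr_len;
    skb->len += hdr_len;
    return skb->data;
}

/* Drop one reference; the data area is freed with its last user. */
void sketch_free(struct sketch_skb *skb)
{
    if (--skb->buf->refcnt == 0)
        free(skb->buf);
    free(skb);
}
```

Pushing a header onto the clone moves only the clone's data pointer, so the queued original still describes just the payload while both point into the same buffer.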



As packets are output for the first time, sk_send_head is advanced one SKB at a time. If we run out of packets on the sk_write_queue for initial transmission, we reset sk_send_head to NULL. As the user supplies us with more data packets, sk_send_head will be set to point to the first of those new SKBs.

If an SKB has not been sent yet and is not full MSS sized, it is legal for tcp_sendmsg() and tcp_sendpage() to tack more data onto the tail of that SKB until it contains MSS bytes of data.
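The tail-append rule can be written as a small helper. This is a hedged sketch: the `sketch_skb` type and its explicit `sent` flag are hypothetical simplifications (in the layout described here, whether an SKB has been sent follows from its position relative to sk_send_head).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal view of the SKB at the tail of the write queue. */
struct sketch_skb {
    size_t len;   /* payload bytes already in the SKB */
    bool sent;    /* true once the SKB has been transmitted */
};

/* Return how many of n new bytes may be appended to the tail SKB.
 * Nothing may be appended once the SKB has been sent or is already
 * MSS sized; the caller must then start a new SKB. */
size_t sketch_tail_room(const struct sketch_skb *tail, size_t mss, size_t n)
{
    size_t room;

    if (tail == NULL || tail->sent || tail->len >= mss)
        return 0;
    room = mss - tail->len;
    return n < room ? n : room;
}
```

For an unsent 1000-byte tail SKB and an MSS of 1460, at most 460 further bytes are appended; the remainder of the user's data goes into a fresh SKB.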


TCP initial send head state

For example, let us assume we begin in the state at the right. The sk_write_queue has three full sized SKBs and the sk_send_head points at the first one.

tcp_write_xmit() is invoked, and examines the first SKB as pointed to by sk_send_head. It checks to see if the TCP packet fits into the current send window and the current congestion window. It also makes sure the Nagle rule passes.
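The three checks can be sketched as a single predicate. This is an illustration only: the `sketch_tcp` fields are assumptions chosen for the example, and the Nagle test is the textbook form (hold back a less-than-MSS segment while earlier data is unacknowledged), not necessarily the exact test the kernel applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of per-connection state used by the checks. */
struct sketch_tcp {
    uint32_t snd_una;            /* oldest unacknowledged sequence number */
    uint32_t snd_wnd;            /* receiver's advertised window, in bytes */
    uint32_t snd_cwnd;           /* congestion window, in packets */
    uint32_t packets_in_flight;  /* packets sent but not yet acknowledged */
    uint32_t mss;                /* maximum segment size, in bytes */
    bool nodelay;                /* true when Nagle is disabled */
};

/* Wrap-safe test for "a is later than b" in 32-bit sequence space. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Decide whether a packet covering [seq, seq + len) may be sent now. */
bool sketch_may_send(const struct sketch_tcp *tp, uint32_t seq, uint32_t len)
{
    uint32_t end_seq = seq + len;

    /* Send window: the packet must end inside the advertised window. */
    if (seq_after(end_seq, tp->snd_una + tp->snd_wnd))
        return false;

    /* Congestion window: there must be room for one more packet in flight. */
    if (tp->packets_in_flight >= tp->snd_cwnd)
        return false;

    /* Nagle: hold a small segment while earlier data is still unacked. */
    if (!tp->nodelay && len < tp->mss && tp->packets_in_flight > 0)
        return false;

    return true;
}
```

The sequence comparison is done with signed 32-bit arithmetic so that the window check keeps working when sequence numbers wrap around.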

The SKB is cloned and the clone is passed to tcp_transmit_skb(). Once the packet has been handed off, update_send_head() is invoked to advance sk_send_head.



TCP send head state, part 2
TCP send head state, part 3

Now we are in the state at the left.

If we end up successfully sending all three packets in the sk_write_queue before more packets are added by the user then the sk_send_head has no valid packets to point to any longer.

At that point we end up in the state at the right, where sk_send_head is NULL.

Notice that even after sending all the packets on the queue, they stay there. We cannot unlink and release them from the write queue until the other end ACKs the data.



TCP send head state, part 4

Eventually, the receiver sends back an ACK packet covering the sequence space of one or more of the SKBs in the sk_write_queue.

This is the event which allows us to unlink and free up the SKB from the queue.

As illustrated to the right, when the ACK arrives, the sk_write_queue is walked by TCP input processing. For each SKB covered by the ACK's sequence number, we unlink the SKB from the queue and free it up using kfree_skb().
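The ACK-driven cleanup can be sketched as follows. This is a simplified userspace model, not the kernel routine: the `sketch_` names are hypothetical, the queue is singly linked for brevity, and free() stands in for kfree_skb().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_skb {
    struct sketch_skb *next;
    uint32_t seq;       /* first sequence number carried by the packet */
    uint32_t end_seq;   /* one past the last sequence number carried */
};

struct sketch_queue {
    struct sketch_skb *head, *tail;
};

/* Wrap-safe test for "a is later than b" in 32-bit sequence space. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Append a packet covering [seq, end_seq); returns -1 on allocation failure. */
int sketch_enqueue(struct sketch_queue *q, uint32_t seq, uint32_t end_seq)
{
    struct sketch_skb *skb = malloc(sizeof(*skb));

    if (skb == NULL)
        return -1;
    skb->next = NULL;
    skb->seq = seq;
    skb->end_seq = end_seq;
    if (q->tail != NULL)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    return 0;
}

/* Unlink and free every packet at the front of the queue that the
 * cumulative ACK fully covers. A partially covered packet stays queued.
 * Returns the number of packets freed. */
int sketch_clean_queue(struct sketch_queue *q, uint32_t ack)
{
    int freed = 0;

    while (q->head != NULL && !seq_after(q->head->end_seq, ack)) {
        struct sketch_skb *skb = q->head;

        q->head = skb->next;
        if (q->head == NULL)
            q->tail = NULL;
        free(skb);
        freed++;
    }
    return freed;
}
```

With packets covering [0, 1000), [1000, 2000) and [2000, 3000) queued, an ACK for 2000 frees the first two, an ACK for 2500 frees nothing more, and an ACK for 3000 empties the queue.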


