Anatomy of the Linux networking stack

From sockets to device drivers


Level: Introductory

M. Tim Jones (mtj@mtjones.com), Consultant Engineer, Emulex Corp.

27 Jun 2007

One of the greatest features of the Linux® operating system is its networking stack. It was initially a derivative of the BSD stack and is well organized with a clean set of interfaces. Its interfaces range from the protocol-agnostic, such as the common sockets layer interface or the device layer, to the specific interfaces of the individual networking protocols. This article explores the structure of the Linux networking stack from the perspective of its layers and also examines some of its major structures.

Protocols introduction

While formal introductions to networking commonly refer to the Open Systems Interconnection (OSI) model, this introduction to the basic networking stack in Linux uses the four-layer model known as the Internet model (see Figure 1).


Figure 1. The Internet model of a network stack


At the bottom of the stack is the link layer. The link layer refers to the device drivers providing access to the physical layer, which could be any of numerous media, such as serial links or Ethernet devices. Above the link layer is the network layer, which is responsible for directing packets to their destinations. The next layer, called the transport layer, is responsible for peer-to-peer communication: while the network layer manages communication between hosts, the transport layer manages communication between endpoints within those hosts. Finally, there's the application layer, which is commonly the semantic layer that understands the data being moved. For example, the Hypertext Transfer Protocol (HTTP) moves requests and responses for Web content between a server and a client.

Practically speaking, the layers of the networking stack go by much more recognizable names. At the link layer, you find Ethernet, the most common high-speed medium. Older link-layer protocols include the serial protocols such as the Serial Line Internet Protocol (SLIP), Compressed SLIP (CSLIP), and the Point-to-Point Protocol (PPP). The most common network layer protocol is the Internet Protocol (IP), but other protocols exist at the network layer that satisfy other needs, such as the Internet Control Message Protocol (ICMP) and the Address Resolution Protocol (ARP). At the transport layer are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). Finally, the application layer includes many familiar protocols, including the standard Web protocol, HTTP, and the e-mail protocol, Simple Mail Transfer Protocol (SMTP).


Core network architecture

Now on to the architecture of the Linux network stack and how it implements the Internet model. Figure 2 provides a high-level view of the Linux network stack. At the top is the user space layer, or application layer, which defines the users of the network stack. At the bottom are the physical devices that provide connectivity to the networks (serial or high-speed networks such as Ethernet). In the middle, or kernel space, is the networking subsystem that is the focus of this article. Through the interior of the networking stack flow socket buffers (sk_buffs) that move packet data between sources and sinks. You'll see the sk_buff structure shortly.


Figure 2. Linux high-level network stack architecture


First, here's a quick overview of the core elements of the Linux networking subsystem, followed by more detail in later sections. At the top (see Figure 2) is the system call interface. This simply provides a way for user-space applications to gain access to the kernel's networking subsystem. Next is a protocol-agnostic layer that provides a common way to work with the underlying transport-level protocols. Next are the actual protocols, which in Linux include the built-in protocols of TCP, UDP, and, of course, IP. Next is another agnostic layer that permits a common interface to and from the individual device drivers that are available, followed at the end by the individual device drivers themselves.


System call interface

The system call interface can be described from two perspectives. When a networking call is made by the user, it is multiplexed through the system call interface into the kernel. This ends up as a call to sys_socketcall in ./net/socket.c, which then further demultiplexes the call to its intended target. The other perspective of the system call interface is the use of normal file operations for networking I/O. For example, typical read and write operations may be performed on a networking socket (which is represented by a file descriptor, just like a normal file). Therefore, while a number of operations are specific to networking (creating a socket with the socket call, connecting it to a destination with the connect call, and so on), a number of standard file operations also apply to networking objects just as they do to regular files. In the end, the syscall interface provides the means to transfer control between the user-space application and the kernel.
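As a small user-space illustration of the second perspective, the following sketch creates a connected pair of sockets and then moves data through them with plain write and read calls. It assumes a POSIX system and uses AF_UNIX with socketpair rather than AF_INET so that it runs without a network. The helper name round_trip is hypothetical and is not part of any library.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write msg into one end of a connected socket pair and read it back from
 * the other end, using the same write() and read() calls used for files.
 * Returns the number of bytes read, or -1 on error.
 * (round_trip is a hypothetical helper name for this sketch.)
 */
ssize_t round_trip(const char *msg, char *out, size_t out_len)
{
    int fds[2];
    ssize_t n;

    assert(msg != NULL && out != NULL);

    /* Creating the sockets is a networking-specific operation ... */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* ... but the resulting descriptors accept ordinary file operations. */
    if (write(fds[0], msg, strlen(msg)) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    n = read(fds[1], out, out_len);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling round_trip("hello", buf, sizeof buf) with a 16-byte buffer should return 5 and leave the five bytes of the message in buf.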


Protocol agnostic interface

The sockets layer is a protocol-agnostic interface that provides a set of common functions to support a variety of different protocols. The sockets layer not only supports the typical TCP and UDP protocols, but also IP, raw Ethernet, and other transport protocols, such as Stream Control Transmission Protocol (SCTP).

Communication through the network stack takes place with a socket. The socket structure in Linux is struct sock, which is defined in linux/include/net/sock.h. This large structure contains all of the required state of a particular socket, including the particular protocol used by the socket and the operations that may be performed on it.

The networking subsystem knows about the available protocols through a special structure that defines its capabilities. Each protocol maintains a structure called proto (found in linux/include/net/sock.h). This structure defines the particular socket operations that can be performed from the sockets layer to the transport layer (for example, how to create a socket, how to establish a connection with a socket, how to close a socket, and so on).
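The idea of a per-protocol operations structure can be sketched in user-space C as a table of function pointers. This is only a model of the pattern: the names toy_proto, toy_sock, and toy_stream_prot are hypothetical, and the real proto and sock structures in sock.h carry many more fields and operations.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock;

/* Simplified stand-in for struct proto: a named table of operations. */
struct toy_proto {
    const char *name;
    int  (*connect)(struct toy_sock *sk);
    void (*close)(struct toy_sock *sk);
};

/* Simplified stand-in for struct sock: state plus the bound protocol. */
struct toy_sock {
    const struct toy_proto *prot;
    int established;
};

static int toy_stream_connect(struct toy_sock *sk)
{
    sk->established = 1;   /* a real stream protocol would run a handshake */
    return 0;
}

static void toy_stream_close(struct toy_sock *sk)
{
    sk->established = 0;
}

/* One protocol describing itself through its operations table. */
const struct toy_proto toy_stream_prot = {
    .name    = "TOY-STREAM",
    .connect = toy_stream_connect,
    .close   = toy_stream_close,
};

/* Sockets-layer entry points: dispatch through the protocol's table. */
int toy_connect(struct toy_sock *sk)
{
    assert(sk != NULL && sk->prot != NULL);
    return sk->prot->connect(sk);
}

void toy_close(struct toy_sock *sk)
{
    assert(sk != NULL && sk->prot != NULL);
    sk->prot->close(sk);
}
```

The sockets-layer functions never name a specific protocol. They only follow the pointer stored in the socket, which is what lets one interface serve many protocols.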


Network protocols

The network protocols section defines the particular networking protocols that are available (such as TCP, UDP, and so on). These are initialized at start of day in a function called inet_init in linux/net/ipv4/af_inet.c (as TCP and UDP are part of the inet family of protocols). The inet_init function registers each of the built-in protocols using the proto_register function. This function is defined in linux/net/core/sock.c, and, in addition to adding the protocol to the active protocol list, it also optionally allocates one or more slab caches if required.

You can see how individual protocols identify themselves through the proto structure in files tcp_ipv4.c, udp.c, and raw.c in linux/net/ipv4/. Each of these protocol structures is mapped by type and protocol into the inetsw_array, which maps the built-in protocols to their operations. The structure of inetsw_array and its relationships is shown in Figure 3. Each of the protocols in this array is initialized at start of day into inetsw through a call to inet_register_protosw from inet_init. The inet_init function also initializes the various inet modules, such as the ARP, ICMP, and IP modules, and the TCP and UDP modules.


Figure 3. Structure of the Internet protocol array



Socket protocol correlation

Recall that when a socket is created, one defines the type and protocol, such as my_sock = socket( AF_INET, SOCK_STREAM, 0 ). Here, AF_INET indicates the Internet address family, and SOCK_STREAM selects a stream socket (as shown in inetsw_array).

Note from Figure 3 that the proto structure defines the transport-specific methods, while the proto_ops structure defines the general socket methods. Additional protocols can be added to the inetsw protocol switch through a call to inet_register_protosw. For example, SCTP adds itself through a call to sctp_init in linux/net/sctp/protocol.c. For more information about SCTP, check out the Resources section.
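The mapping from a socket's type and protocol to a registered protocol can be sketched as a small table with a register function and a lookup function. This is a simplified user-space model, not the kernel's implementation: the toy_ names are hypothetical, the table is a fixed array, and the sketch assumes that a protocol argument of 0 selects the first entry registered for the requested type.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>   /* SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET */
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP, IPPROTO_SCTP */

/* One entry of the toy protocol switch: (type, protocol) -> protocol name. */
struct toy_protosw {
    int type;
    int protocol;
    const char *proto_name;
};

#define TOY_MAX_PROTOSW 8

/* Built-in entries, playing the role of the initial inetsw_array. */
static struct toy_protosw toy_inetsw[TOY_MAX_PROTOSW] = {
    { SOCK_STREAM, IPPROTO_TCP, "TCP" },
    { SOCK_DGRAM,  IPPROTO_UDP, "UDP" },
};
static size_t toy_inetsw_len = 2;

/* Add another protocol to the switch; returns 0, or -1 if the table is full. */
int toy_register_protosw(int type, int protocol, const char *name)
{
    assert(name != NULL);
    if (toy_inetsw_len == TOY_MAX_PROTOSW)
        return -1;
    toy_inetsw[toy_inetsw_len].type = type;
    toy_inetsw[toy_inetsw_len].protocol = protocol;
    toy_inetsw[toy_inetsw_len].proto_name = name;
    toy_inetsw_len++;
    return 0;
}

/* Find the entry for socket(AF_INET, type, protocol), or NULL if none. */
const struct toy_protosw *toy_lookup(int type, int protocol)
{
    for (size_t i = 0; i < toy_inetsw_len; i++) {
        const struct toy_protosw *p = &toy_inetsw[i];

        if (p->type != type)
            continue;
        /* Assumption: protocol 0 means "default for this type". */
        if (protocol == 0 || p->protocol == protocol)
            return p;
    }
    return NULL;
}
```

With this table, toy_lookup(SOCK_STREAM, 0) resolves to the TCP entry, and a later toy_register_protosw(SOCK_SEQPACKET, IPPROTO_SCTP, "SCTP") makes an SCTP entry resolvable in the same way.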

Data movement for sockets takes place using a core structure called the socket buffer (sk_buff). An sk_buff contains packet data and also state data that cover multiple layers of the protocol stack. Each packet sent or received is represented with an sk_buff. The sk_buff structure is defined in linux/include/linux/skbuff.h and shown in Figure 4.


Figure 4. Socket buffer and its relationship to other structures


As shown, multiple sk_buffs may be chained together for a given connection. Each sk_buff identifies the device structure (net_device) to which the packet is being sent or from which the packet was received. As each packet is represented with an sk_buff, the packet headers are conveniently located through a set of pointers (th, iph, and mac for the Media Access Control, or MAC, header). Because sk_buffs are central to socket data management, a number of support functions have been created to manage them. Functions exist for sk_buff creation and destruction, cloning, and queue management.

Socket buffers are designed to be linked together for a given socket and include a multitude of information, including the links to the protocol headers, a timestamp (when the packet was sent or received), and the device associated with the packet.
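The two ideas described here, header pointers into the packet data and chaining of buffers, can be sketched with a much-reduced structure. The toy_skb type and its functions are hypothetical stand-ins, and the sketch assumes an Ethernet frame carrying IPv4. The real sk_buff holds far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HLEN 14   /* Ethernet header length in bytes */

/* Reduced stand-in for sk_buff: a chain link plus header pointers. */
struct toy_skb {
    struct toy_skb *next;   /* next buffer on the same queue */
    const uint8_t *mac;     /* start of the MAC (Ethernet) header */
    const uint8_t *iph;     /* start of the IP header */
    const uint8_t *th;      /* start of the transport header */
    size_t len;             /* total frame length */
};

struct toy_skb_queue {
    struct toy_skb *head;
    struct toy_skb *tail;
    size_t qlen;
};

/* Locate the headers inside frame. Returns 0, or -1 if the frame is short. */
int toy_skb_set_headers(struct toy_skb *skb, const uint8_t *frame, size_t len)
{
    size_t ihl;

    assert(skb != NULL && frame != NULL);
    if (len < TOY_ETH_HLEN + 20)
        return -1;
    /* The low nibble of the first IPv4 byte is the header length in words. */
    ihl = (size_t)(frame[TOY_ETH_HLEN] & 0x0f) * 4;
    if (ihl < 20 || len < TOY_ETH_HLEN + ihl)
        return -1;

    skb->next = NULL;
    skb->mac = frame;
    skb->iph = frame + TOY_ETH_HLEN;
    skb->th = skb->iph + ihl;
    skb->len = len;
    return 0;
}

/* Append a buffer to the tail of a queue. */
void toy_skb_queue_tail(struct toy_skb_queue *q, struct toy_skb *skb)
{
    assert(q != NULL && skb != NULL);
    skb->next = NULL;
    if (q->tail != NULL)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    q->qlen++;
}

/* Remove and return the buffer at the head of a queue, or NULL if empty. */
struct toy_skb *toy_skb_dequeue(struct toy_skb_queue *q)
{
    struct toy_skb *skb;

    assert(q != NULL);
    skb = q->head;
    if (skb == NULL)
        return NULL;
    q->head = skb->next;
    if (q->head == NULL)
        q->tail = NULL;
    skb->next = NULL;
    q->qlen--;
    return skb;
}
```

Because each layer only follows its own pointer, no layer has to re-parse the headers that belong to the layers below it.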


Device agnostic interface

Below the protocols layer is another agnostic interface layer that connects protocols to a variety of hardware device drivers with varying capabilities. This layer provides a common set of functions to be used by lower-level network device drivers to allow them to operate with the higher-level protocol stack.

First, device drivers may register or unregister themselves to the kernel through a call to register_netdevice or unregister_netdevice. The caller first fills out the net_device structure and then passes it in for registration. The kernel calls its init function (if one is defined), performs a number of sanity checks, creates a sysfs entry, and then adds the new device to the device list (a linked list of devices active in the kernel). You can find the net_device structure in linux/include/linux/netdevice.h. The various functions are implemented in linux/net/core/dev.c.
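The registration sequence described above (call the init function, run sanity checks, add the device to a linked list) can be modeled in a few lines of user-space C. Everything prefixed with toy_ is hypothetical. The only sanity check modeled is a non-empty, unique name, and the sysfs step is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_IFNAMSIZ 16

/* Reduced stand-in for net_device. */
struct toy_net_device {
    char name[TOY_IFNAMSIZ];                  /* NUL-terminated name */
    int (*init)(struct toy_net_device *dev);  /* optional init hook */
    struct toy_net_device *next;              /* link in the device list */
};

/* Head of the list of registered devices. */
static struct toy_net_device *toy_dev_base;

/* Look up a registered device by name, or return NULL. */
struct toy_net_device *toy_dev_get_by_name(const char *name)
{
    struct toy_net_device *d;

    assert(name != NULL);
    for (d = toy_dev_base; d != NULL; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}

/* Register a filled-in device. Returns 0 on success, -1 on failure. */
int toy_register_netdevice(struct toy_net_device *dev)
{
    struct toy_net_device **pp;

    assert(dev != NULL);

    /* 1. Call the device's init function, if one is defined. */
    if (dev->init != NULL && dev->init(dev) != 0)
        return -1;

    /* 2. Sanity checks: the name must be non-empty and unique. */
    if (dev->name[0] == '\0' || toy_dev_get_by_name(dev->name) != NULL)
        return -1;

    /* 3. Append the device to the end of the list. */
    dev->next = NULL;
    for (pp = &toy_dev_base; *pp != NULL; pp = &(*pp)->next)
        ;
    *pp = dev;
    return 0;
}

/* Remove a device from the list. Returns 0, or -1 if it was not registered. */
int toy_unregister_netdevice(struct toy_net_device *dev)
{
    struct toy_net_device **pp;

    for (pp = &toy_dev_base; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == dev) {
            *pp = dev->next;
            dev->next = NULL;
            return 0;
        }
    }
    return -1;
}
```

Registering a device named "eth0" makes it visible to toy_dev_get_by_name. A second device with the same name is rejected until the first is unregistered.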

To send an sk_buff from the protocol layer to a device, the dev_queue_xmit function is used. This function enqueues an sk_buff for eventual transmission by the underlying device driver (the target network device is the net_device referenced through sk_buff->dev). The net_device structure contains a method, called hard_start_xmit, that holds the driver function for initiating transmission of an sk_buff.
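A rough model of this hand-off: the protocol side enqueues a packet on the device named in the packet, and the queue is drained by calling the driver's transmit method until the driver reports that it is busy. The toy_ names and the return convention (0 for accepted, nonzero for busy) are assumptions made for this sketch. The kernel's queueing disciplines and locking are not modeled.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev;

/* Reduced stand-in for an sk_buff on the transmit path. */
struct toy_pkt {
    struct toy_dev *dev;    /* device this packet is sent through */
    struct toy_pkt *next;   /* link in the device's transmit queue */
};

/* Reduced stand-in for net_device. */
struct toy_dev {
    /* Driver method: returns 0 if the packet was accepted, nonzero if busy. */
    int (*hard_start_xmit)(struct toy_pkt *pkt, struct toy_dev *dev);
    struct toy_pkt *q_head;
    struct toy_pkt *q_tail;
    int room;               /* demo driver state: free hardware slots */
    int sent;               /* demo driver state: packets accepted */
};

/* Enqueue pkt on its device, then hand queued packets to the driver. */
int toy_dev_queue_xmit(struct toy_pkt *pkt)
{
    struct toy_dev *dev;

    assert(pkt != NULL && pkt->dev != NULL);
    dev = pkt->dev;
    assert(dev->hard_start_xmit != NULL);

    /* Enqueue at the tail so packets leave in order. */
    pkt->next = NULL;
    if (dev->q_tail != NULL)
        dev->q_tail->next = pkt;
    else
        dev->q_head = pkt;
    dev->q_tail = pkt;

    /* Drain the queue until it is empty or the driver is busy. */
    while (dev->q_head != NULL) {
        struct toy_pkt *p = dev->q_head;
        struct toy_pkt *next = p->next;

        if (dev->hard_start_xmit(p, dev) != 0)
            break;              /* leave p queued for a later attempt */
        dev->q_head = next;
        if (next == NULL)
            dev->q_tail = NULL;
    }
    return 0;
}

/* Demo driver method: accepts a packet while hardware slots remain. */
int toy_driver_xmit(struct toy_pkt *pkt, struct toy_dev *dev)
{
    (void)pkt;
    if (dev->room == 0)
        return 1;               /* busy */
    dev->room--;
    dev->sent++;
    return 0;
}
```

The protocol layer only calls toy_dev_queue_xmit. Which driver function runs is decided entirely by the method pointer stored in the device.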

Receiving a packet is performed conventionally with netif_rx. When a lower-level device driver receives a packet (contained within an allocated sk_buff), the sk_buff is passed up to the network layer through a call to netif_rx. This function then queues the sk_buff to an upper-layer protocol's queue for further processing through netif_rx_schedule. You can find the dev_queue_xmit and netif_rx functions in linux/net/core/dev.c.

Recently, a new application program interface (the "New API," or NAPI) was introduced into the kernel to allow drivers to interface with the device agnostic layer (dev). Some drivers use NAPI, but the large majority still use the older frame reception interface (by a rough factor of six to one). NAPI can yield better performance under high loads because it avoids taking an interrupt for each incoming frame.


Device drivers

At the bottom of the network stack are the device drivers that manage the physical network devices. Examples of devices at this layer include the SLIP driver over a serial interface or an Ethernet driver over an Ethernet device.

At initialization time, a device driver allocates a net_device structure and then initializes it with its necessary routines. One of these routines, called dev->hard_start_xmit, takes an sk_buff and defines how the upper layer should enqueue it for transmission. The operation of this function depends on the underlying hardware, but commonly the packet described by the sk_buff is moved to a hardware ring or queue. Frame receipt, as described in the device agnostic layer, uses the netif_rx interface, or netif_receive_skb for a NAPI-compliant network driver. A NAPI driver puts constraints on the capabilities of the underlying hardware. See the Resources section for more details.
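The "hardware ring" mentioned here is typically a fixed-size circular array of slots shared between the driver and the device. The following sketch models only the index arithmetic. The names and the ring size are hypothetical, and a real driver would also deal with DMA mappings, descriptors, and locking.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 4u   /* must be a power of two for the index math */

/* Simulated transmit ring shared by driver and hardware. */
struct toy_tx_ring {
    const void *slot[TOY_RING_SIZE];
    unsigned head;   /* count of buffers the driver has placed */
    unsigned tail;   /* count of buffers the hardware has completed */
};

/* Driver side: place a buffer on the ring. Returns 0, or -1 if full. */
int toy_ring_xmit(struct toy_tx_ring *ring, const void *buf)
{
    assert(ring != NULL && buf != NULL);
    if (ring->head - ring->tail == TOY_RING_SIZE)
        return -1;                                /* ring full: caller retries */
    ring->slot[ring->head % TOY_RING_SIZE] = buf;
    ring->head++;
    return 0;
}

/* Completion side: reclaim the oldest buffer, or NULL if the ring is empty. */
const void *toy_ring_complete(struct toy_tx_ring *ring)
{
    const void *buf;

    assert(ring != NULL);
    if (ring->head == ring->tail)
        return NULL;
    buf = ring->slot[ring->tail % TOY_RING_SIZE];
    ring->tail++;
    return buf;
}
```

When toy_ring_xmit reports a full ring, the upper layer has to hold the packet until a completion frees a slot. This is the situation in which a transmit routine would report that it is busy.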

After a device driver configures its interfaces in the dev structure, a call to register_netdevice makes it available for use. You can find the drivers specific to network devices in linux/drivers/net.


Going further

The Linux source code is a great way to learn about the design of device drivers for a multitude of device types, including network device drivers. What you'll find is a variation in design and usage of the available kernel APIs, but each is useful for instruction or as a starting point for a new device driver. The remaining code in the network stack is common and usable unless you require a new protocol. Even then, the implementations of TCP (for a stream protocol) or UDP (for a message-based protocol) serve as useful models for starting out with new development.



Resources

Learn

Get products and technologies

Discuss


About the author

M. Tim Jones is an embedded software architect and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.