Module 4: Operating system kernel tracing

Network stack

One of the largest kernel subsystems is the network stack. It is called a stack because it consists of multiple protocols, each of which works on top of a more primitive one. That hierarchy is defined by models such as the OSI model or the TCP/IP stack. When user data is passed through the network, it is encapsulated into packets of those protocols: when data is passed to a protocol driver, the driver adds service data to the packet header and trailer so that the operating system on the receiving host can recognize the packets and rebuild the original message even if some data was lost or the order of packets changed during transmission.

Each layer of the network stack has its own responsibilities, so higher-layer protocols do not have to deal with them. For example, IP can send datagrams through multiple routers and networks and can reassemble fragmented packets, but it doesn't guarantee reliability when some data is lost: that is implemented in TCP. Both of them only transmit raw data; encoding or compression is implemented at a higher layer, for example in HTTP.
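The encapsulation described above can be illustrated with a toy model. The sketch below is written in Python rather than kernel code, and the two header layouts are made-up simplifications (they are not the real TCP or IP wire formats); all function names are hypothetical. It only shows the idea: each layer prepends its own service data, and the receiver uses it to restore the original message even when segments arrive out of order.

```python
import struct

# Hypothetical, simplified headers (NOT the real TCP/IP wire formats):
#   transport header: source port, destination port, sequence number
#   network header:   source address, destination address, payload length
TRANSPORT_HDR = struct.Struct("!HHI")
NETWORK_HDR = struct.Struct("!IIH")


def transport_wrap(payload, sport, dport, seq):
    """Transport layer prepends its header to the user data."""
    return TRANSPORT_HDR.pack(sport, dport, seq) + payload


def network_wrap(segment, saddr, daddr):
    """Network layer treats the whole segment as opaque data."""
    return NETWORK_HDR.pack(saddr, daddr, len(segment)) + segment


def network_unwrap(packet):
    """Strip the network header and return addresses plus the segment."""
    saddr, daddr, length = NETWORK_HDR.unpack_from(packet)
    start = NETWORK_HDR.size
    return saddr, daddr, packet[start:start + length]


def transport_unwrap(segment):
    """Strip the transport header and return ports, sequence and data."""
    sport, dport, seq = TRANSPORT_HDR.unpack_from(segment)
    return sport, dport, seq, segment[TRANSPORT_HDR.size:]


def reassemble(segments):
    """Receiver restores the original order using sequence numbers."""
    parts = []
    for segment in segments:
        _, _, seq, data = transport_unwrap(segment)
        parts.append((seq, data))
    return b"".join(data for _, data in sorted(parts))
```

Note that the network layer never looks inside the segment it carries, which is what lets each layer stay unaware of the responsibilities of the layers above it.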

The network subsystem (which transmits data between hosts) differs from block input-output (which stores data) in one major way: it is very sensitive to latency, so writing or reading data cannot be deferred. Because of that, sending and receiving are usually performed in the same thread context.

The network stack in Unix systems can be split into three generic layers:

The socket layer, which implements BSD sockets through a series of system calls.
Intermediate protocol drivers such as ip, udp and tcp, and packet filters.
The Media Access Control (MAC) layer at the bottom, which provides access to network interface cards (NICs), and the NIC API itself. It is called GLD in Solaris.

image:net

Network input-output can require transferring huge amounts of data, so it would be inefficient to explicitly issue a write command for each packet. Instead of handling each packet individually, the NIC and its driver maintain a shared ring buffer: the driver puts data into it, while the card uses DMA (direct memory access) to read the data and send it over the network. A ring buffer is defined by two pointers, head and tail:

image:ringbuf

When the driver wants to queue a packet for transmission, it puts the packet into the memory area of the designated ring buffer and updates the tail pointer accordingly. When the NIC has transferred the data over the network, it updates the head pointer.
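The head and tail bookkeeping can be sketched as follows. This is a toy Python model under stated assumptions, not how any particular NIC driver is written: real rings hold hardware-specific descriptors, and the convention of keeping one slot unused to tell a full ring from an empty one is just one common choice. The class and method names are hypothetical.

```python
class TxRing:
    """Toy transmit ring: the driver advances tail, the NIC advances head."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # next slot the NIC will transmit
        self.tail = 0   # next slot the driver will fill

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot is kept unused so that full and empty states differ.
        return (self.tail + 1) % len(self.slots) == self.head

    def driver_enqueue(self, packet):
        """Driver side: place a packet and move the tail pointer."""
        if self.is_full():
            return False          # caller has to retry later
        self.slots[self.tail] = packet
        self.tail = (self.tail + 1) % len(self.slots)
        return True

    def nic_transmit(self):
        """NIC side: take the packet at head and move the head pointer."""
        if self.is_empty():
            return None
        packet = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        return packet
```

Because each side only writes its own pointer, the driver can keep queueing packets while the card drains the ring, without a separate command per packet.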

Data structures are usually shared between stack layers. In Linux, packets are represented by the generic sk_buff structure:

image:linux/net

That structure keeps two pointers, head and data, and includes offsets for protocol headers. Data length is kept in the len field and the packet timestamp in the tstamp field. sk_buff structures form a doubly-linked list through the next and prev pointers. Each sk_buff refers to a network device descriptor, represented by the net_device structure, and to a socket, represented by a pair of structures: socket, which holds generic socket data including the file pointer that points to the VFS node (sockets in Linux and Solaris are managed by special filesystems), and sock, which keeps more network-related data. The latter includes the local address, kept in skc_rcv_saddr and skc_num, and the peer address, kept in skc_daddr and skc_dport.
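To make the roles of head, data, len and the header offsets more concrete, here is a toy Python model. It is only a sketch: the real sk_buff has many more fields, and the receive() helper and the fixed header sizes (Ethernet, IPv4 and TCP without options) are simplifying assumptions made for illustration. The pull() method mimics what skb_pull() does in the kernel: it advances data and shrinks len as each layer strips its header.

```python
ETH_HLEN = 14   # Ethernet header
IP_HLEN = 20    # IPv4 header without options (assumption)
TCP_HLEN = 20   # TCP header without options (assumption)


class SkBuff:
    """Toy model of a packet buffer; field names follow sk_buff."""

    def __init__(self, frame):
        self.buf = bytes(frame)
        self.head = 0                 # start of the allocated buffer
        self.data = 0                 # start of data for the current layer
        self.len = len(frame)         # bytes left from data to the end
        self.network_header = None    # offsets are counted from head
        self.transport_header = None

    def pull(self, nbytes):
        """Strip nbytes of header: advance data and shrink len."""
        if nbytes > self.len:
            raise ValueError("not enough data in buffer")
        self.data += nbytes
        self.len -= nbytes

    def payload(self):
        return self.buf[self.data:self.data + self.len]


def receive(frame):
    """Hypothetical receive path: each layer records its offset and pulls."""
    skb = SkBuff(frame)
    skb.pull(ETH_HLEN)
    skb.network_header = skb.data - skb.head
    skb.pull(IP_HLEN)
    skb.transport_header = skb.data - skb.head
    skb.pull(TCP_HLEN)
    return skb
```

The packet bytes are never copied while moving up the stack; only the data pointer and the recorded offsets change, which is why the same structure can be shared between layers.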

Note that CPU byte order may differ from network byte order, so you should use conversion functions when working with addresses: ntohs, ntohl or ntohll to convert to host byte order, and htons, htonl and htonll for the reverse conversion. They are provided by both SystemTap and DTrace and behave the same way as their C counterparts.
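The effect of skipping the conversion can be shown with a small Python sketch. It uses explicit byte orders from the struct module so that the result does not depend on the machine it runs on; the helper names are hypothetical, and swap16() only models what ntohs and htons do on a little-endian host (on a big-endian host they leave the value unchanged).

```python
import socket
import struct

# Port 8080 as it is stored in a packet: network (big-endian) byte order.
wire = struct.pack("!H", 8080)


def as_little_endian_cpu(raw):
    """Value a little-endian CPU sees when it loads the field unconverted."""
    return struct.unpack("<H", raw)[0]


def swap16(value):
    """Model of ntohs()/htons() on a little-endian host: swap two bytes."""
    return ((value & 0xFF) << 8) | (value >> 8)


def format_ipv4(raw):
    """Dotted-quad text from 4 address bytes in network byte order."""
    return socket.inet_ntoa(raw)
```

Here as_little_endian_cpu(wire) yields 0x901F instead of 8080, and applying swap16() to it restores the real port number. The swap is its own inverse, which is why ntohs and htons perform the same operation.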

Here is a sample script for tracing message receiving in Linux 3.9:

# stap -e '
    probe kernel.function("tcp_v4_rcv") {
        printf("[%4s] %11s:%-5d -> %11s:%-5d len: %d\n",
            kernel_string($skb->dev->name),
            /* peer address and port are kept in network byte order */
            ip_ntop($skb->sk->__sk_common->skc_daddr),
            ntohs($skb->sk->__sk_common->skc_dport),
            /* local port skc_num is already in host byte order */
            ip_ntop($skb->sk->__sk_common->skc_rcv_saddr),
            $skb->sk->__sk_common->skc_num,
            $skb->len);
    }'

Earlier versions of Linux (2.6.32 in this example) use a different structure called inet_sock:

# stap -e '
    probe kernel.function("tcp_v4_do_rcv") {
        printf("%11s:%-5d -> %11s:%-5d len: %d\n",
            /* peer address and port */
            ip_ntop(@cast($sk, "inet_sock")->daddr),
            ntohs(@cast($sk, "inet_sock")->dport),
            /* local address and port */
            ip_ntop(@cast($sk, "inet_sock")->saddr),
            ntohs(@cast($sk, "inet_sock")->sport),
            $skb->len);
    }'

Solaris inherited the STREAMS subsystem from System V. It is intended to provide an API for passing messages between multiple architectural layers, which fits perfectly with how the network stack is organized. Each message is represented by an mblk_t structure:

image:solaris/streams

The consumer reads data referred to by the b_rptr pointer, while the producer puts data at the b_wptr pointer if there is enough space in the allocated buffer (which is referred to by b_datap). Otherwise it allocates a new message and sets up the forward and backward pointers b_next and b_prev, so that these messages form a doubly-linked list.
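The producer and consumer behaviour described above can be sketched as a toy Python model. Field names follow mblk_t, but everything else is an assumption made for illustration: pointers are modelled as integer offsets into a bytearray, and the produce() and consume() helpers are hypothetical, not STREAMS functions.

```python
class Mblk:
    """Toy model of a STREAMS message block (field names follow mblk_t)."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)  # stands in for the buffer behind b_datap
        self.b_rptr = 0                 # consumer reads from here
        self.b_wptr = 0                 # producer writes here
        self.b_next = None
        self.b_prev = None

    def free_space(self):
        return len(self.buf) - self.b_wptr

    def unread(self):
        # Same expression as b_wptr - b_rptr: the amount of pending data.
        return self.b_wptr - self.b_rptr


def produce(last, data, capacity=8):
    """Append data to the last message or link a new one; return the new last."""
    if len(data) > last.free_space():
        new = Mblk(max(capacity, len(data)))
        new.b_prev, last.b_next = last, new
        last = new
    last.buf[last.b_wptr:last.b_wptr + len(data)] = data
    last.b_wptr += len(data)
    return last


def consume(first):
    """Read everything that is pending in the list starting at first."""
    out = bytearray()
    mp = first
    while mp is not None:
        out += mp.buf[mp.b_rptr:mp.b_wptr]
        mp.b_rptr = mp.b_wptr
        mp = mp.b_next
    return bytes(out)
```

The difference between the two pointers is the amount of data waiting in a message, which is a convenient way to obtain a message length when tracing.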

Note that unlike sk_buff in Linux, these messages do not contain pointers to the management structures. Instead, functions receive a pointer to the management structure as a separate argument, usually the first argument of the function (arg0 in DTrace): mac_impl_t for the MAC layer, ill_t for the IP layer and conn_t for the TCP/UDP protocols:

image:solaris/net

Solaris wraps sockets into sonode structures, which are handled by a virtual file system called sockfs. The so_vnode field in that structure points to the VFS node. As mentioned before, TCP and UDP connections are managed by the conn_t structure. It keeps the local address in the connua_laddr and connu_lport fields and the remote address in the connua_faddr and connu_fport fields. Note that these names are different in Solaris 10.

Here is an example DTrace script for tracing message receiving in Solaris 11:

# dtrace -n '
    tcp_input_data:entry {
        /* arg0 is the connection, arg1 is the message being received */
        this->conn = (conn_t*) arg0;
        this->mp = (mblk_t*) arg1;

        printf("%11s:%-5d -> %11s:%-5d len: %d\n",
                /* remote (foreign) address and port */
                inet_ntoa((ipaddr_t*) &(this->conn->connua_v6addr.
                                            connua_faddr._S6_un._S6_u32[3])),
                ntohs(this->conn->u_port.connu_ports.connu_fport),
                /* local address and port */
                inet_ntoa((ipaddr_t*) &(this->conn->connua_v6addr.
                                            connua_laddr._S6_un._S6_u32[3])),
                ntohs(this->conn->u_port.connu_ports.connu_lport),
                /* amount of data between read and write pointers */
                (this->mp->b_wptr - this->mp->b_rptr));
    }'

Solaris 11 introduced new providers for tracing the network: tcp, udp and ip. The following table lists the probes they provide and their counterparts in Linux with SystemTap:

Action	DTrace	SystemTap
TCP
Connection to remote node	`tcp:::connect-request` `tcp:::connect-established` `tcp:::connect-refused`	`kernel.function("tcp_v4_connect")`
Accepting remote connection	`tcp:::accept-established` `tcp:::accept-refused`	`kernel.function("tcp_v4_hnd_req")`
Disconnecting	`fbt::tcp_disconnect:entry`	`tcp.disconnect`
State change	`tcp:::state-change`	-
Transmission	`tcp:::send`	`tcp.sendmsg`
Receiving	`tcp:::receive`	`tcp.receive` `tcp.recvmsg`
IP
Transmission	`ip:::send`	`kernel.function("ip_output")`
Receiving	`ip:::receive`	`kernel.function("ip_rcv")`
Network device
Transmission	`mac_tx:entry`, or function from NIC driver like `e1000g_send:entry`	`netdev.transmit` `netdev.hard_transmit`
Receiving	`mac_rx_common:entry`, or function from NIC driver like `e1000g_receive:entry`	`netdev.rx`

Sockets can be traced using syscall tracing. SystemTap also provides a special tapset, socket, for that.

Both Linux and Solaris maintain various network statistics, which are exposed through SNMP and accessible with the netstat -s command. Many of the events counted there can be traced using the mib provider in DTrace or the tcpmib, ipmib and linuxmib tapsets in SystemTap, but these probes do not carry connection-specific data.
