original URL: http://linuxinme.blogspot.com/2007/08/rough-notes-on-linux-networking-stack.html
Rough Notes on Linux Networking Stack
Table of Contents
1. Existing Optimizations
2. Packet Copies
3. ICMP Ping/Pong : Function Calls
4. Transmit Interrupts and Flow Control
5. NIC driver callbacks and ifconfig
6. Protocol Structures in the Kernel
7. skb_clone() vs. skb_copy()
8. NICs and Descriptor Rings
9. How much networking work does the ksoftirqd do?
10. Packet Requeues in Qdiscs
11. Links
12. Specific TODOs
References
1. Existing Optimizations
A great deal of thought has gone into the Linux networking implementation, and many optimizations have made their way into the kernel over the years. Some prime examples include:
* NAPI - Receive interrupts are coalesced to reduce the chances of a livelock, so each received packet no longer generates an interrupt. This required modifications to the device driver interface and has been in the stable kernels since 2.4.20.
* Zero-Copy TCP - Avoids the overhead of kernel-to-userspace and userspace-to-kernel packet copying. http://builder.com.com/5100-6372-1044112.html describes this in some detail.
2. Packet Copies
When a packet is received, the device uses DMA to put it in main memory (let's ignore non-DMA and non-NAPI code and drivers). An skb is constructed by the poll() function of the device driver. After this point, the same skb is used throughout the networking stack, i.e., the packet is almost never copied within the kernel (it is copied when delivered to user-space).
This design is borrowed from BSD and UNIX SVR4 - the idea is to allocate memory for the packet only once. The skb has 4 primary pointers into the packet data (character buffer): head, end, data, and tail. head points to the beginning of the packet - where the link layer header starts. end points to the end of the packet. data points to the location the current networking layer can start reading from (i.e., it changes as the packet moves up from the link layer, to IP, to TCP). Finally, tail is where the current protocol layer can begin writing data (see alloc_skb(), which sets head, data, and tail to the beginning of the allocated memory block and end to data + size).
Other implementations refer to head, end, data, tail as base, limit, read, write respectively.
There are some instances where a packet needs to be duplicated. For example, when running tcpdump, the packet needs to be sent to the userspace process as well as to the normal IP handler. Actually, in this case too, a copy can be avoided since the contents of the packet are not being modified. So instead of duplicating the packet contents, skb_clone() is used to increase the reference count of a packet. skb_copy(), on the other hand, actually duplicates the contents of the packet and creates a completely new skb.
See also: http://oss.sgi.com/archives/netdev/2005-02/msg00125.html
A related question: When a packet is received, are the tail and end pointers equal?
Answer: NO. This is because memory for received packets is allocated before the packet arrives, and the address and size of this memory is communicated to the NIC using receive descriptors - so that when a packet is actually received, the NIC can use DMA to transfer it to main memory. The size allocated for a received packet is a function of the MTU of the device, but the size of an Ethernet frame actually received could be anything less than the MTU. Thus, tail of a received packet will point to the end of the received data, while end will point to the end of the memory allocated for the packet.
3. ICMP Ping/Pong : Function Calls
Code path (functions called) when an ICMP ping is received (and the corresponding pong goes out), for Linux 2.6.9: First the packet is received by the NIC, and its interrupt handler will ultimately cause net_rx_action() to be called (NAPI, [1]). This will call the device driver's poll function, which submits packets (skb's) to the networking stack via netif_receive_skb().
The rest is outlined below:
1. ip_rcv() --> ip_rcv_finish()
2. dst_input() --> skb->dst->input = ip_local_deliver()
3. ip_local_deliver() --> ip_local_deliver_finish()
4. ipprot->handler = icmp_rcv()
5. icmp_pointers[ICMP_ECHO].handler == icmp_echo() -- At this point I guess you could say that the "receive" path is complete; the packet has reached the top. Now the outbound (down the stack) journey begins.
6. icmp_reply() -- Might want to look into the checks this function does
7. icmp_push_reply()
8. ip_push_pending_frames()
9. dst_output() --> skb->dst->output = ip_output()
10. ip_output() --> ip_finish_output() --> ip_finish_output2()
11. dst->neighbour->output ==
4. Transmit Interrupts and Flow Control
Transmit interrupts are generated after every packet transmission, and this is key to flow control. However, this does have significant performance implications under heavy transmit-related I/O (imagine a packet forwarder where the number of transmitted packets is equal to the number of received ones). Each device provides a means to slow down transmit (Tx) interrupts. For example, Intel's e1000 driver exposes "TxIntDelay", which allows transmit interrupts to be delayed in units of 1.024 microseconds. The default value is 64; thus, under heavy transmission, interrupts are spaced 65.536 microseconds apart. Imagine the number of transmissions that can take place in this time.
5. NIC driver callbacks and ifconfig
Interfaces are configured using the ifconfig command. Many of these commands will result in a function of the NIC driver being called. For example, ifconfig eth0 up should result in the device driver's open() function being called (open is a member of struct net_device). ifconfig communicates with the kernel through ioctl() on any socket. The requests are a struct ifreq (see /usr/include/net/if.h and http://linux.about.com/library/cmd/blcmdl7_netdevice.htm).
Thus, ifconfig eth0 up will result in the following:
1. A socket (of any kind) is opened using socket()
2. A struct ifreq is prepared with ifr_name set to "eth0"
3. An ioctl() with request SIOCGIFFLAGS is done to get the current flags, and then the IFF_UP and IFF_RUNNING flags are set with another ioctl() (with request SIOCSIFFLAGS).
4. Now we're inside the kernel. sock_ioctl() is called, which in turn calls dev_ioctl() (see net/socket.c and net/core/dev.c)
5. dev_ioctl() --> ... --> dev_open() --> driver's open() implementation.
6. Protocol Structures in the Kernel
There are various structs in the kernel which consist of function pointers for protocol handling. Different structures correspond to different layers of protocols, as well as to whether the functions are for synchronous handling (e.g., when recv(), send() etc. system calls are made) or asynchronous handling (e.g., when a packet arrives at the interface and needs to be handled). Here is what I have gathered about the various structures so far:
* struct packet_type - includes instantiations such as ip_packet_type, ipv6_packet_type etc. These provide low-level, asynchronous packet handling. When a packet arrives at the interface, the driver ultimately submits it to the networking stack by a call to netif_receive_skb(), which iterates over the list of registered packet handlers and submits the skb to them. For example, ip_packet_type.func = ip_rcv, so ip_rcv() is where one can say the IP protocol first receives a packet that has arrived at the interface. Packet-types are registered with the networking stack by a call to dev_add_pack().
* struct net_proto_family - includes instantiations such as inet_family_ops, packet_family_ops etc. Each net_proto_family structure handles one type of address family (PF_INET etc.). This structure is associated with a BSD socket (struct socket) and not the networking layer representation of sockets (struct sock). It essentially provides a create() function which is called in response to the socket() system call. The implementation of create() for each family typically allocates the struct sock and also associates other synchronous operations (see struct proto_ops below) with the socket. To cut a long story short - net_proto_family provides the protocol-specific part of the socket() system call. (NOTE: Not all BSD sockets will have a networking socket associated with them. For example, unix sockets (the PF_UNIX address family): unix_family_ops.create = unix_create does not allocate a struct sock.) The net_proto_family structure is registered with the networking stack by a call to sock_register().
* struct proto_ops - includes instantiations such as inet_stream_ops, inet_dgram_ops, packet_ops etc. These provide implementations of networking layer synchronous calls (connect(), bind(), recvmsg(), ioctl() etc. system calls). The ops member of the BSD socket structure (struct socket) points to the proto_ops associated with the socket. Unlike the above two structures, there is no function that explicitly registers a struct proto_ops with the networking stack. Instead, the create() implementation of struct net_proto_family just sets the ops field of the BSD socket to the appropriate proto_ops structure.
* struct proto - includes instantiations such as tcp_prot, udp_prot, raw_prot. These provide protocol handlers inside a network family. It seems that currently this means only over-IP protocols, as I could find only the above three instantiations. These also provide implementations for synchronous calls. The sk_prot field of the networking socket (struct sock) points to such a structure. The sk_prot field would get set by the create function in struct net_proto_family, and the functions provided will be called by the implementations of functions in the struct proto_ops structure. For example, inet_family_ops.create = inet_create allocates a struct sock and would set sk_prot = udp_prot in response to a socket(PF_INET, SOCK_DGRAM, 0) system call. A recvfrom() system call made on the socket would then invoke inet_dgram_ops.recvmsg = sock_common_recvmsg, which calls sk_prot->recvmsg = udp_recvmsg. Like proto_ops, struct protos aren't explicitly "registered" with the networking stack using a function, but are "registered" by the BSD socket create() implementation in the struct net_proto_family.
* struct net_protocol - includes instantiations such as tcp_protocol, udp_protocol, icmp_protocol etc. These provide asynchronous packet receive routines for IP protocols. Thus, this structure is specific to the inet family of protocols. Handlers are registered using inet_add_protocol(). This structure is used by the IP-layer routines to hand off to a layer 4 protocol. Specifically, the IP handler (ip_rcv()) will invoke ip_local_deliver_finish() for packets that are to be delivered to the local host. ip_local_deliver_finish() uses a hash table (inet_protos) to decide which function to pass the packet to, based on the protocol field in the IP header. The hash table is populated by the call to inet_add_protocol().
7. skb_clone() vs. skb_copy()
Q. When a packet needs to be delivered to two separate handlers (for example, the IP layer and tcpdump), it is "cloned" by incrementing the reference count of the packet instead of being "copied". Now, though the two handlers are not expected to modify the packet contents, they can change the data pointer. So, how do we ensure that processing by one of the handlers doesn't mess up the data pointer for the other?
A. Umm... skb_clone() means that there are separate head, tail, data, end etc. pointers. The difference between skb_copy() and skb_clone() is precisely this - the former copies the packet completely, while the latter uses the same packet data but separate pointers into the packet.
8. NICs and Descriptor Rings
NOTE: Using the Intel e1000, driver source version 5.6.10.1, as an example. Each transmission/reception has a descriptor - a "handle" used to access buffer data, somewhat like a file descriptor is a handle to access file data. The descriptor format is NIC dependent, as the hardware understands and reads/writes the descriptor. The NIC maintains a circular ring of descriptors, i.e., the number of descriptors for TX and RX is fixed (the TxDescriptors and RxDescriptors module parameters for the e1000 kernel module) and the descriptors are used like a circular queue.
Thus, there are three structures:
* Descriptor Ring (struct e1000_desc_ring) - The list of descriptors. So, ring[0], ring[1] etc. are individual descriptors. The ring is typically allocated just once, and thus the DMA mapping of the ring is "consistent". Each descriptor in the ring will thus have a fixed DMA and memory address. In the e1000, the device registers TDBAL, TDBAH, TDLEN stand for "Transmit Descriptors Base Address Low", "High" and "Length" (in bytes, of all descriptors). Similarly, there are RDBAL, RDBAH, RDLEN.
* Descriptors (struct e1000_rx_desc and struct e1000_tx_desc) - Essentially, this stores the DMA address of the buffer which contains the actual packet data, plus some other accounting information such as the status (transmission successful? receive complete? etc.), errors etc.
* Buffers - Now, actual data cannot have a "consistent" DMA mapping, meaning we cannot ensure that all skbuffs for a particular device always have some specific memory addresses (those that have been set up for DMA). Instead, "streaming" DMA mappings need to be used. Each descriptor thus contains the DMA address of a buffer that has been set up for streaming mapping. The hardware uses that DMA address to pick up a packet to be sent or to place a received packet. Once the kernel's stack picks up the buffer, it can allocate new resources (a new buffer) and tell the NIC to use that buffer next time by setting up a new streaming mapping and putting the new DMA handle in the descriptor. The e1000 uses a struct e1000_buffer as a wrapper around the actual buffer. The DMA mapping, however, is set up only for skb->data, i.e., where raw packet data is to be placed.
9. How much networking work does the ksoftirqd do?
Consider what the NET_RX_SOFTIRQ does:
1. Each softirq invocation (do_softirq()) processes up to net.core.netdev_max_backlog x MAX_SOFTIRQ_RESTART packets, if available. The default values lead to 300 x 10 = 3000 pkts.
2. Every interrupt calls do_softirq() when exiting (irq_exit()) - including the timer interrupt, and NMIs too?
3. Default transmit/receive ring sizes on the NIC are less than 3000 (the e1000, for example, defaults to 256 and can have at most 4096 descriptors on its ring).
Thus, the number of times ksoftirqd will be switched in/out depends on how much processing is done by do_softirq() invocations on irq_exit(). If the softirq handling on interrupt is able to clean up the NIC ring faster than new packets come in, then ksoftirqd won't be doing anything. Specifically, ksoftirqd will not be scheduled if the inter-packet gap is greater than the time it takes to pick up and process a single packet from the NIC (and if the number of descriptors on the NIC is less than 3000).
Without going into details, some quick experimental verification: Machine A continuously generates UDP packets for Machine B, which is running a "sink" application, i.e., it just loops on a recvfrom(). When the size of the packet sent from A was 60 bytes (and the inter-packet gap averaged 1.5 μs), the ksoftirqd thread on B observed a total of 375 context switches (374 involuntary and 1 voluntary). When the packet size was 1280 bytes (and the inter-packet gap increased almost 7 times, to 10 μs), the ksoftirqd thread was NEVER scheduled (0 context switches). The single voluntary context switch in the former case probably happened after all packets were processed (i.e., the sender stopped sending and the receiver processed all that it got).
10. Packet Requeues in Qdiscs
The queueing discipline (struct Qdisc) provides a requeue(). Typically, packets are dequeued from the qdisc and submitted to the device driver (the hard_start_xmit function in struct net_device). However, at times it is possible that the device driver is "busy", so the dequeued packet must be "requeued". "Busy" here means that the xmit_lock of the device was held. It seems that this lock is acquired in two places: (1) qdisc_restart() and (2) dev_watchdog(). The former handles packet dequeueing from the qdisc, acquiring the xmit_lock and then submitting the packet to the device driver (hard_start_xmit()), or alternatively requeueing the packet if the xmit_lock was already held by someone else. The latter is invoked asynchronously and periodically - it's part of the watchdog timer mechanism.
My understanding is that two threads cannot be in qdisc_restart() for the same qdisc at the same time; however, the xmit_lock may have been acquired by the watchdog timer function, causing a requeue.
11. Links
This is just a dump of links that might be useful.
* http://www.spec.org and SpecWeb http://www.spec.org/web99/
* linux-net and netdev mailing lists: http://www.ussg.iu.edu/hypermail/linux/net/ and http://oss.sgi.com/projects/netdev/archive/
* Linux Traffic Control HOWTO
12. Specific TODOs
* Study the watchdog timer mechanism and figure out how flow control is implemented on the receive and transmit sides.
References
[3] Beyond Softnet. Jamal Hadi Salim, Robert Olsson, and Alexey Kuznetsov. Nov 2001. USENIX.
[4] Understanding the Linux Kernel. Daniel P. Bovet and Marco Cesati. O'Reilly & Associates. 2nd Edition. 81-7366-589-3.
[5] A Map of the Networking Code in Linux Kernel 2.4.20. Miguel Rio, Mathieu Goutelle, Tom Kelly, Richard Hugh-Jones, Jean-Phillippe Martin-Flatin, and Yee-Ting Li. Mar 2004.