Linux Traffic Control (TC): A Surface-Level Overview

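1.1 What is traffic control

Traffic control is the collective name for the mechanisms and queuing systems by which a router receives and transmits packets. It covers deciding which packets to accept, and at what rate, on an input interface, and deciding which packets to send, at what rate and in what order, on an output interface.

Traditional traffic control involves shaping, scheduling, classifying, policing, dropping, and marking.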

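  • Shaping. A shaper delays packets so that traffic stays at a given rate. Shaping holds packets in the output queue before transmission and then sends them at a set rate, keeping network traffic below that rate. This is what most users want from traffic control.
  • Scheduling. Scheduling arranges packets in a queue between input and output. The most common scheduler is FIFO (first in, first out). More broadly, any traffic control on an output queue can be called scheduling, because packets are ordered for output.
  • Classifying. Classification sorts traffic into groups so that it can be treated differently, for example by splitting it across different output queues. A network device can classify packets in many ways while receiving, routing, and sending them. Classification can include marking the packet; marking may be done once at the network edge by a single control unit, or at every hop.
  • Policing. Policing, as a part of traffic control, limits traffic. It is typically used on edge devices so that a node cannot use more bandwidth than it was allocated. A policer accepts packets up to a given rate and, when traffic exceeds that rate, performs an action on the excess packets. The harshest action is to drop the packet, although the packet could be reclassified instead.
  • Dropping. Dropping uses some mechanism to choose which packets are discarded, for example RED.
  • Marking. Marking sets the DSCP field of a packet so that other routers within an administrable network can recognise and use it (typically for DiffServ, differentiated services).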

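1.2 Why traffic control is needed

An important difference between packet-switched and circuit-switched networks is that a packet-switched network is stateless, whereas a circuit-switched network (such as the telephone network) must hold state. IP networks, being packet-switched, were designed to be stateless; in fact, statelessness is one of the fundamental strengths of IP.

The weakness of statelessness is that it cannot differentiate between types of flows. With traffic control, however, an administrator can queue and differentiate packets based on their attributes. It can even be used to emulate a circuit-switched network, making a stateless network behave like a stateful one.

There are many practical reasons to consider traffic control and many scenarios where it is useful. The list below gives some examples of problems that traffic control can solve or improve; it is not exhaustive.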

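Common traffic control solutions

  • Use TBF, or HTB with child classes, to cap bandwidth at a given value.
  • Use HTB classes and classifying with a filter to limit the bandwidth of a particular user, service, or client.
  • Prioritize ACK packets and use wondershaper to maximize TCP throughput on an asymmetric link.
  • Use HTB with child classes and classifying to reserve bandwidth for a particular application or user.
  • Use the PRIO (priority) mechanism inside an HTB class to improve the performance of latency-sensitive applications.
  • Use HTB borrowing to manage excess bandwidth.
  • Use HTB borrowing to distribute all available bandwidth fairly.
  • Use a policer attached to a filter with a drop action to make sure a particular type of traffic is dropped.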
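
As a concrete instance of the first item, here is a minimal sketch of capping egress bandwidth with TBF. The interface name, the 1mbit rate, and the burst and latency values are illustrative assumptions, and both the TC variable and the cap_egress helper are introduced here for illustration: by default the script only prints the command, and it applies it only if you set TC=tc and run it as root.

```shell
# Dry run by default: print the tc command instead of executing it.
# Set TC=tc (as root) to apply it for real.
TC=${TC:-echo tc}

# cap_egress DEV RATE: attach a TBF qdisc as the root qdisc of DEV.
# The burst and latency values are examples, not recommendations.
cap_egress() {
    $TC qdisc add dev "$1" root tbf rate "$2" burst 10kb latency 50ms
}

cap_egress eth0 1mbit
```

Removing the limit again would be a matter of deleting the root qdisc with tc qdisc del dev eth0 root.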

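1.3 How to do traffic control

1.3.1 General components of traffic control

Depending on the features required, a traffic control system contains roughly the following components:

  • Scheduler
  • Classifier (optional)
  • Policer
  • Filter

The classifier is not mandatory; classless traffic control systems, for example, do without one. The table below maps the traditional concepts to the Linux components that implement them.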

traditional element | Linux component
shaping | The class offers shaping capabilities.
scheduling | A qdisc is a scheduler. Schedulers can be simple such as the FIFO or complex, containing classes and other qdiscs, such as HTB.
classifying | The filter object performs the classification through the agency of a classifier object. Strictly speaking, Linux classifiers cannot exist outside of a filter.
policing | A policer exists in the Linux traffic control implementation only as part of a filter.
dropping | To drop traffic requires a filter with a policer which uses "drop" as an action.
marking | The dsmark qdisc is used for marking.

1.3.2 Linux TC

Linux TC offers powerful features covering every aspect of traffic control. Before using it, it helps to understand its basic logic.

Glossary of Linux TC terms:

  • Queueing Discipline (qdisc)

    An algorithm that manages the queue of a device, either incoming (ingress) or outgoing (egress).

  • root qdisc

    The root qdisc is the qdisc attached to the device.

  • Classless qdisc

    A qdisc with no configurable internal subdivisions.

  • Classful qdisc

    A classful qdisc contains multiple classes. Some of these classes contain a further qdisc, which may again be classful, but need not be. According to the strict definition, pfifo_fast is classful, because it contains three bands which are, in fact, classes. However, from the user's configuration perspective, it is classless, as the classes can't be touched with the tc tool.

  • Classes

    A classful qdisc may have many classes, each of which is internal to the qdisc. A class, in turn, may have several classes added to it, so a class can have either a qdisc or another class as its parent. A leaf class is a class with no child classes. It has one qdisc attached to it, and this qdisc is responsible for sending the data from that class. When you create a class, a fifo qdisc is attached to it. When you add a child class, this qdisc is removed. For a leaf class, this fifo qdisc can be replaced with another, more suitable qdisc. You can even replace it with a classful qdisc so that you can add extra classes.

  • Classifier

    Each classful qdisc needs to determine to which class it needs to send a packet. This is done using the classifier.

  • Filter

    Classification can be performed using filters. A filter contains a number of conditions which if matched, make the filter match.

  • Scheduling

    A qdisc may, with the help of a classifier, decide that some packets need to go out earlier than others. This process is called Scheduling, and is performed for example by the pfifo_fast qdisc mentioned earlier. Scheduling is also called 'reordering', but this is confusing.

  • Shaping

    The process of delaying packets before they go out to make traffic conform to a configured maximum rate. Shaping is performed on egress. Colloquially, dropping packets to slow traffic down is also often called Shaping.

  • Policing

    Delaying or dropping packets in order to make traffic stay below a configured bandwidth. In Linux, policing can only drop a packet and not delay it - there is no 'ingress queue'.

  • Work-Conserving

    A work-conserving qdisc always delivers a packet if one is available. In other words, it never delays a packet if the network adaptor is ready to send one (in the case of an egress qdisc).

  • non-Work-Conserving

    Some queues, like for example the Token Bucket Filter, may need to hold on to a packet for a certain time in order to limit the bandwidth. This means that they sometimes refuse to pass a packet, even though they have one available.
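
Since policing on Linux can only drop, a policer is written as a filter carrying a police action on the ingress qdisc. The sketch below follows the well-known ingress-policing recipe; the interface name, the rate, and the 10k burst are assumptions, and the TC variable and the police_ingress helper are introduced here for illustration so that the commands are only printed unless TC=tc is set.

```shell
# Dry run by default: print the tc commands instead of executing them.
TC=${TC:-echo tc}

# police_ingress DEV RATE: drop incoming IP traffic on DEV that exceeds RATE.
police_ingress() {
    # The ingress qdisc conventionally uses the handle ffff:
    $TC qdisc add dev "$1" handle ffff: ingress
    # Match every source address and hand the excess to the policer's drop action.
    $TC filter add dev "$1" parent ffff: protocol ip prio 50 u32 \
        match ip src 0.0.0.0/0 police rate "$2" burst 10k drop flowid :1
}

police_ingress eth0 1mbit
```

Nothing is queued here: packets within the rate pass straight through and the rest are discarded, which is the difference from shaping described in the glossary.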

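1.3.2 Linux TC in detail

The first thing to note is that Linux tc offers good control only in the egress direction; control over ingress is limited. In short, it controls what is sent, not what is received.

Here are several important concepts in the implementation:

  • Queues. The queue is the basic concept of traffic control. Queues, together with other mechanisms, make shaping, scheduling, and the like possible.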

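  • Token bucket. This is a very important element. To control the dequeue rate, one approach is to count directly the packets or bytes leaving the queue, but doing this accurately requires complex bookkeeping. The other approach, widely used in traffic control, is the token bucket: tokens are generated at a fixed rate, each packet or byte must take a token from the bucket when it is dequeued, and it may leave only once it has one.

    As an analogy, picture a group of people queuing for a ride at an amusement park. Imagine a fixed track on which cars arrive at a fixed rate; each person must wait for a car before riding. The cars and riders correspond to tokens and packets. This mechanism is rate limiting, or traffic shaping: only a certain number of people can ride in a given period.

    To continue the analogy, imagine many cars parked at the station waiting for riders, with no riders present. If a large group now arrives all at once, they can all board immediately. The station corresponds to the bucket: a bucket holds a number of tokens, and all of them can be used at once regardless of when the packets arrive.

    To complete the analogy: cars arrive at the station at a fixed rate and, if nobody rides, fill the station. That is, tokens enter the bucket at a fixed rate; if they go unused the bucket fills up, and if they are used continuously the bucket never fills. The token bucket is the key idea for handling applications that generate bursty traffic, such as HTTP.

    The qdisc that uses a token bucket filter (TBF qdisc) is a classic example of traffic shaping (the TBF section contains a diagram that helps visualise the token bucket). TBF generates tokens at a given rate and sends data only when the bucket holds tokens. Tokens are the basic idea behind shaping.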
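
The token bucket can be made concrete with a toy simulation. This is only a sketch of the idea, not how TBF is implemented in the kernel: time advances in whole ticks, one token pays for one packet, and the function name token_bucket and its arguments are made up for this example.

```shell
# token_bucket RATE BURST ARRIVALS...
#   RATE     tokens added per tick
#   BURST    bucket capacity in tokens
#   ARRIVALS number of packets arriving at each tick
# Prints how many packets are sent at each tick.
token_bucket() (
    rate=$1; burst=$2
    shift 2
    tokens=$burst   # start with a full bucket, like a station full of cars
    backlog=0
    out=""
    for arrivals in "$@"; do
        backlog=$((backlog + arrivals))
        # A packet may leave only if a token is available for it.
        if [ "$backlog" -lt "$tokens" ]; then sent=$backlog; else sent=$tokens; fi
        backlog=$((backlog - sent))
        tokens=$((tokens - sent))
        out="$out $sent"
        # Refill for the next tick, never beyond the bucket size.
        tokens=$((tokens + rate))
        if [ "$tokens" -gt "$burst" ]; then tokens=$burst; fi
    done
    echo "${out# }"
)

token_bucket 2 5 10 0 0 0 3   # prints: 5 2 2 1 3
```

In the sample run a burst of ten packets arrives at once: five leave immediately on the saved-up tokens, and the rest drain at the token rate of two per tick.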

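The main components of Linux tc are the qdisc, the class, and the filter.

  • A qdisc is either classful or classless. The difference is that a classful qdisc can contain multiple classes and so can control traffic at a finer grain.

    • Common classless qdiscs: choke, codel, p/bfifo, fq, fq_codel, gred, hhf, ingress, mqprio, multiq, netem, pfifo_fast, pie, red, rr, sfb, sfq, tbf. The Linux default is pfifo_fast.

    • Common classful qdiscs: ATM, CBQ, DRR, DSMARK, HFSC, HTB, PRIO, QFQ.

  • Classes exist only inside a classful qdisc (for example, HTB and CBQ). A class can be quite complex: it may contain several child classes, or just a single child qdisc. In very elaborate traffic control setups, a class may itself contain a classful qdisc.

    Any class can have an arbitrary number of filters attached to it, which allows selecting a child class, or using a filter to reclassify or drop packets entering the class.

    A leaf class is a terminal class in a qdisc. It contains a qdisc (pfifo by default) and no child classes. Any class that contains a child class is an inner class, not a leaf class.

  • Linux filters let the user classify packets into an output queue using one or more filters. A filter contains a classifier implementation. A common classifier is u32, which lets the user select packets based on their attributes.

Every qdisc and every class needs a unique identifier, known as a handle. Handles are written in major:minor form; note that both numbers are parsed as hexadecimal. Their use is shown in the examples.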
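
Because both halves of a handle are hexadecimal, a class such as 1:10 has minor number sixteen, not ten. The sketch below packs a handle into a single 32-bit number with the major in the upper 16 bits and the minor in the lower 16, which is my understanding of how the kernel stores it; the function name handle_to_u32 is made up for this example.

```shell
# handle_to_u32 MAJOR:MINOR
# Both parts are hexadecimal; an omitted part (as in "1:") counts as zero.
handle_to_u32() (
    major=${1%%:*}
    minor=${1#*:}
    echo $(( (0x${major:-0} << 16) | 0x${minor:-0} ))
)

handle_to_u32 1:10   # prints: 65552 (0x00010010)
handle_to_u32 1:     # prints: 65536 (same as 1:0)
```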

Next we look mainly at classful qdiscs and at how packets flow through them.

  • flow within classful qdisc & class

    When traffic enters a classful qdisc, it needs to be sent to any of the classes within - it needs to be 'classified'. To determine what to do with a packet, the so called 'filters' are consulted. It is important to know that the filters are called from within a qdisc, and not the other way around!

    The filters attached to that qdisc then return with a decision, and the qdisc uses this to enqueue the packet into one of the classes. Each subclass may try other filters to see if further instructions apply. If not, the class enqueues the packet to the qdisc it contains.

    Besides containing other qdiscs, most classful qdiscs also perform shaping. This is useful to perform both packet scheduling (with SFQ, for example) and rate control. You need this in cases where you have a high speed interface (for example, ethernet) to a slower device (a cable modem).

  • How filters are used to classify traffic

    Recapping, a typical hierarchy might look like this:

                     1:   root qdisc
                      |
                     1:1    child class
                   /  |  \
                  /   |   \
                 /    |    \
                 /    |    \
              1:10  1:11  1:12   child classes
               |      |     | 
               |     11:    |    leaf class
               |            | 
               10:         12:   qdisc
              /   \       /   \
           10:1  10:2   12:1  12:2   leaf classes

    But don't let this tree fool you! You should not imagine the kernel to be at the apex of the tree and the network below, that is just not the case. Packets get enqueued and dequeued at the root qdisc, which is the only thing the kernel talks to.

    A packet might get classified in a chain like this: 1: -> 1:1 -> 1:12 -> 12: -> 12:2

    The packet now resides in a queue in a qdisc attached to class 12:2. In this example, a filter was attached to each 'node' in the tree, each choosing a branch to take next. This can make sense. However, this is also possible: 1: -> 12:2

    In this case, a filter attached to the root decided to send the packet directly to 12:2.

  • How packets are dequeued to the hardware

    When the kernel decides that it needs to extract packets to send to the interface, the root qdisc 1: gets a dequeue request, which is passed to 1:1, which is in turn passed to 10:, 11: and 12:, each of which queries its siblings, and tries to dequeue() from them. In this case, the kernel needs to walk the entire tree, because only 12:2 contains a packet.

    In short, nested classes ONLY talk to their parent qdiscs, never to an interface. Only the root qdisc gets dequeued by the kernel!

    The upshot of this is that classes never get dequeued faster than their parents allow. And this is exactly what we want: this way we can have SFQ in an inner class, which doesn't do any shaping, only scheduling, and have a shaping outer qdisc, which does the shaping.

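1.3.3 Configuring and using HTB

HTB is a classful qdisc that implements hierarchical, class-based traffic control, and it is one of the most commonly used traffic control setups on Linux. Here is how to configure it.

Configuring HTB takes four steps:

  • Create the root qdisc
  • Create the classes
  • Create filters and bind them to classes
  • Attach a qdisc to each leaf class (optional)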
tc qdisc add dev eth0 root handle 1: htb default 30   # add the root qdisc; 1: is shorthand for 1:0
tc class add dev eth0 parent 1: classid 1:1 htb rate 6mbit burst 15k   # create a class under the root qdisc 1:
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 5mbit burst 15k
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 3mbit ceil 6mbit burst 15k
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 1kbit ceil 6mbit burst 15k
tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10   # attach a qdisc to the leaf class; the default is pfifo
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
tc qdisc add dev eth0 parent 1:30 handle 30: sfq perturb 10
# Add filters that steer traffic straight to the matching class:
U32="tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32"
$U32 match ip dport 80 0xffff flowid 1:10   # bind the filter to a class
$U32 match ip sport 25 0xffff flowid 1:20

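The parameters used when creating the HTB qdisc and its classes have the following meanings:

default

An optional parameter of the HTB qdisc; the default value is 0. A value of 0 means that unclassified traffic bypasses all classes attached to the root qdisc and is dequeued at full speed.

rate

Sets the minimum desired rate at which traffic is sent. It can be thought of as the committed information rate (CIR), or the guaranteed bandwidth for a given leaf class.

ceil

Sets the maximum desired rate at which traffic is sent. The borrowing mechanism determines how this parameter takes effect in practice. This rate can be called the "burst rate".

burst

The size of the rate bucket (see the token bucket discussion above). HTB will dequeue burst bytes before waiting for more tokens to arrive.

cburst

The size of the ceil bucket (see the token bucket discussion above). HTB will dequeue cburst bytes before waiting for more ceil tokens (ctokens) to arrive.

quantum

The key parameter controlling borrowing in HTB. Normally HTB computes a suitable quantum itself rather than having the user set it. Even small adjustments to this value can have a large effect on borrowing and shaping, because HTB uses it both to divide traffic among child classes (at rates above rate and below ceil) and to decide how much data each child class sends in its turn.

r2q

Normally the quantum is computed by HTB itself; the user can set this parameter to help HTB compute an optimal quantum for a class.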
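
To my understanding, when no quantum is given HTB derives it by dividing the class rate, expressed in bytes per second, by r2q. The sketch below only reproduces that arithmetic; the helper name htb_quantum is made up for this example, and it ignores any clamping the kernel may apply to very small or very large results.

```shell
# htb_quantum RATE_BITS_PER_SEC [R2Q]
# quantum (bytes) = rate in bytes per second / r2q; r2q defaults to 10.
htb_quantum() (
    rate_bits_per_sec=$1
    r2q=${2:-10}
    echo $(( rate_bits_per_sec / 8 / r2q ))
)

htb_quantum 3000000      # 3mbit class, r2q 10 -> prints: 37500
htb_quantum 229376       # 224 * 1024 bit/s    -> prints: 2867
```

The second call matches the quantum 2867 reported for the 224Kbit class in the sample statistics in section 1.3.5, assuming that output counts a Kbit as 1024 bits.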

mtu

prio

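1.3.4 Ingress traffic control

The usual way to control inbound traffic is to redirect the interface's incoming traffic to an ifb device and then apply traffic control on the ifb's egress, which controls ingress indirectly. A simple example: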

modprobe ifb    # load the ifb module

ip link set dev ifb0 up txqueuelen 1000

tc qdisc add dev eth1 ingress    # add the ingress qdisc

tc filter add dev eth1 parent ffff: protocol ip u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0   # redirect traffic to ifb0

tc qdisc add dev ifb0 root netem delay 50ms loss 1%    # apply control on ifb0; netem is used here, but qdiscs, classes, and filters can be configured just as for egress

1.3.5 Viewing statistics

  • Use tc qdisc show dev xx to view qdiscs.
  • Use tc class show dev xx to view classes.
  • Use tc filter show dev xx to view filters. Note that this shows the rules under the default parent, root, that is, the egress rules; to view ingress rules, use tc filter show dev xx ingress.

The tc tool allows you to gather statistics of queuing disciplines in Linux. Unfortunately the statistics are not explained by their authors, so you often can't use them. Here I try to help you understand HTB's stats.
First, the stats for HTB as a whole. The snippet below was taken during the simulation from chapter 3.

# tc -s -d qdisc show dev eth0
 qdisc pfifo 22: limit 5p
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0) 

 qdisc pfifo 21: limit 5p
 Sent 2891500 bytes 5783 pkts (dropped 820, overlimits 0) 

 qdisc pfifo 20: limit 5p
 Sent 1760000 bytes 3520 pkts (dropped 3320, overlimits 0) 

 qdisc htb 1: r2q 10 default 1 direct_packets_stat 0
 Sent 4651500 bytes 9303 pkts (dropped 4140, overlimits 34251) 

The first three disciplines are HTB's children. Let's ignore them, as PFIFO stats are self-explanatory.
overlimits tells you how many times the discipline delayed a packet. direct_packets_stat tells you how many packets were sent through the direct queue. The other stats are self-explanatory. Let's look at the class stats:

# tc -s -d class show dev eth0
class htb 1:1 root prio 0 rate 800Kbit ceil 800Kbit burst 2Kb/8 mpu 0b 
    cburst 2Kb/8 mpu 0b quantum 10240 level 3 
 Sent 5914000 bytes 11828 pkts (dropped 0, overlimits 0) 
 rate 70196bps 141pps 
 lended: 6872 borrowed: 0 giants: 0

class htb 1:2 parent 1:1 prio 0 rate 320Kbit ceil 4000Kbit burst 2Kb/8 mpu 0b 
    cburst 2Kb/8 mpu 0b quantum 4096 level 2 
 Sent 5914000 bytes 11828 pkts (dropped 0, overlimits 0) 
 rate 70196bps 141pps 
 lended: 1017 borrowed: 6872 giants: 0

class htb 1:10 parent 1:2 leaf 20: prio 1 rate 224Kbit ceil 800Kbit burst 2Kb/8 mpu 0b 
    cburst 2Kb/8 mpu 0b quantum 2867 level 0 
 Sent 2269000 bytes 4538 pkts (dropped 4400, overlimits 36358) 
 rate 14635bps 29pps 
 lended: 2939 borrowed: 1599 giants: 0

I deleted the 1:11 and 1:12 classes to make the output shorter. As you can see, there are the parameters we set. There is also level and DRR quantum information.
overlimits shows how many times the class was asked to send a packet but could not because of rate/ceil constraints (currently counted for leaves only).
rate and pps tell you the actual (10 second averaged) rate going through the class. It is the same rate as used by gating.
lended is the number of packets donated by this class (from its rate), and borrowed is the number of packets for which we borrowed from the parent. Lends are always computed class-local, while borrows are transitive (when 1:10 borrows from 1:2, which in turn borrows from 1:1, both the 1:10 and 1:2 borrow counters are incremented).
giants is the number of packets larger than the mtu set in the tc command. HTB will work with these, but rates will not be accurate at all. Add mtu to your tc command (it defaults to 1600 bytes).
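
When watching these counters over time it can help to reduce the output to a few columns. The following is a small sketch that assumes the "Sent ... (dropped N, overlimits M)" layout shown above; the function name qdisc_drops is made up for this example, and the sample input is copied from the listing earlier in this section.

```shell
# Print "<handle> <sent_bytes> <dropped>" for each qdisc found on stdin.
# Intended use: tc -s qdisc show dev eth0 | qdisc_drops
qdisc_drops() {
    awk '
        $1 == "qdisc" { handle = $3 }       # remember which qdisc the next Sent line belongs to
        $1 == "Sent"  { gsub(/[(,]/, "")    # strip "(" and "," so the dropped count is a clean field
                        print handle, $2, $7 }
    '
}

qdisc_drops <<'EOF'
 qdisc pfifo 21: limit 5p
 Sent 2891500 bytes 5783 pkts (dropped 820, overlimits 0)
EOF
# prints: 21: 2891500 820
```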

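1.3.6 Miscellaneous notes

  • Rate figures missing from the statistics? For performance reasons the kernel disables rate estimation by default; enable it with echo 1 > /sys/module/sch_htb/parameters/htb_rate_est.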
