Environment:
Two hosts (each with one dual-port 40 Gbps ConnectX-3 NIC, driver version 4.1-1.0.2.0, running Ubuntu 16.04)
One 32-port Mellanox Spectrum SN2700 switch running Onyx 3.6.8102.
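PFC background:
PFC: https://blog.csdn.net/bandaoyu/article/details/115346857
To quote Juniper's introduction to PFC: "Priority-based flow control (PFC), IEEE standard 802.1Qbb, is a link-level flow control mechanism. The flow control mechanism is similar to the IEEE 802.3x pause mechanism, but it pauses the messages of one priority on the link (each priority is a virtual lane, and the pause applies to one virtual lane) rather than pausing the entire link. PFC lets you selectively pause traffic according to its class."
So compared with IEEE 802.3x, which pauses the whole link, PFC works at a finer granularity: it pauses a single virtual lane. Configuring PFC therefore amounts to mapping application traffic onto one priority. Depending on where the traffic is marked, there are two modes: Trust L2 (the priority comes from the PCP field of the VLAN tag) and Trust L3 (the priority comes from the DSCP field of the IP header). ConnectX-3 supports only RoCE v1, whose frames carry no IP header, so this article covers only Trust L2.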
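On the host side, the mapping chain is:
ToS -> skb_priority -> VLAN QoS (also called User Priority, or UP; its value is the PCP field of the VLAN tag) -> tc.
On the switch side, the mapping chain is:
PCP + DEI -> switch-priority -> ingress priority group (PG). The PG is where the PFC thresholds are configured.
This article uses tc 4 and switch-priority 4 as the example.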
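To make the host-side chain concrete, here is a minimal sketch that walks a ToS value through the three lookups. The table contents are assumptions for illustration: the ToS pairs are the ones visible in the tc_wrap.py output later in this article, the fallback to skb_priority 0 for other ToS values is assumed, and `classify` is a hypothetical helper rather than any driver API.

```python
# Host-side chain from the text: ToS -> skb_priority -> UP (VLAN PCP) -> TC.
# Every table here is an illustrative assumption, not a value read from a device.

# ToS -> skb_priority pairs as printed in the tc_wrap.py output of this article.
# Any other ToS is assumed to fall back to skb_priority 0.
TOS_TO_SKPRIO = {8: 2, 24: 4, 16: 6}

# skb_priority -> UP: this article maps every skb_priority to UP 4.
SKPRIO_TO_UP = {skprio: 4 for skprio in range(16)}

# UP -> TC: identity mapping, so UP 4 lands in TC 4.
UP_TO_TC = {up: up for up in range(8)}


def classify(tos: int) -> dict:
    """Walk one packet's ToS value through the three host-side tables."""
    skprio = TOS_TO_SKPRIO.get(tos, 0)
    up = SKPRIO_TO_UP[skprio]
    tc = UP_TO_TC[up]
    return {"tos": tos, "skb_priority": skprio, "up": up, "tc": tc}


if __name__ == "__main__":
    for tos in (0, 8, 16, 24):
        print(classify(tos))
```

Because every skb_priority maps to UP 4, all traffic ends up in tc 4 regardless of its ToS, which is what makes a single lossless priority sufficient for this setup.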
Configuration:
First, configure the switch:
0. Enter configuration mode:
switch-6bd534 [standalone: master] > enable
switch-6bd534 [standalone: master] # configure terminal
1. Create the VLAN and put the switch ports in hybrid mode:
switch-6bd534 [standalone: master] (config) # vlan 10
switch-6bd534 [standalone: master] (config vlan 10) # exit
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport mode hybrid
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport hybrid allowed-vlan add 10
2. Disable flow control on all ports:
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol send off force
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol receive off force
3. Enable priority 4 and turn on PFC on all ports:
switch-6bd534 [standalone: master] (config) # dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm enable pfc globally: yes
switch-6bd534 [standalone: master] (config) # dcb priority-flow-control priority 4 enable
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 dcb priority-flow-control mode on force
Note: to turn PFC off again:
switch-6bd534 [standalone: master] (config) # no dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm disable pfc globally: yes
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 no dcb priority-flow-control mode force
4. Adjust the port buffer configuration and bind switch-priority to the PG buffer:
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 egress-buffer ePort.tc4 map pool ePool0 reserved 1500 shared alpha inf
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 bind switch-priority 4
5. Map PCP + DEI to switch-priority:
switch-6bd534 [standalone: master] (config) # qos trust L2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 qos map pcp 4 dei 0 to switch-priority 4
That completes the switch side.
Next, configure the hosts:
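The PCP and DEI values matched by the `qos map` rule in step 5 sit in the 16-bit Tag Control Information (TCI) field of the 802.1Q VLAN tag: 3 bits of PCP, 1 bit of DEI, 12 bits of VLAN ID. The sketch below packs and unpacks that field for the values used here (PCP 4, DEI 0, VLAN 10). The lookup table is a hypothetical stand-in for the switch's mapping, not something read from Onyx.

```python
def build_tci(pcp: int, dei: int, vid: int) -> int:
    """Pack the 16-bit 802.1Q Tag Control Information field."""
    if not (0 <= pcp <= 7 and dei in (0, 1) and 0 <= vid <= 4095):
        raise ValueError("pcp is 3 bits, dei is 1 bit, vid is 12 bits")
    return (pcp << 13) | (dei << 12) | vid


def parse_tci(tci: int) -> tuple:
    """Split a TCI back into (pcp, dei, vid)."""
    return (tci >> 13) & 0x7, (tci >> 12) & 0x1, tci & 0x0FFF


# Hypothetical table mirroring the single rule configured in step 5:
# (PCP 4, DEI 0) -> switch-priority 4.
PCP_DEI_TO_SWITCH_PRIO = {(4, 0): 4}


if __name__ == "__main__":
    tci = build_tci(pcp=4, dei=0, vid=10)
    print(hex(tci))  # 0x800a
    pcp, dei, vid = parse_tci(tci)
    print("switch-priority:", PCP_DEI_TO_SWITCH_PRIO[(pcp, dei)])
```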
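1. Set the pfctx and pfcrx parameters:
# vim /etc/modprobe.d/mlx4.conf
Add:
options mlx4_en pfctx=0x10 pfcrx=0x10
Note that pfctx and pfcrx are both 8-bit bitmaps with one bit per priority, so enabling priority 4 alone means setting bit 4, which is 0x10.
Then restart the driver:
# /etc/init.d/openibd restart
Verify:
# RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX
The output should be 0x10.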
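As a quick sanity check on the bitmap value, the sketch below computes it from a list of priorities, assuming bit n of pfctx/pfcrx corresponds to priority n. Both helper names are hypothetical. Priority 4 alone gives 1 << 4, i.e. 0x10.

```python
def pfc_bitmap(priorities) -> int:
    """Return the 8-bit PFC bitmap with one bit set per enabled priority."""
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority out of range: {prio}")
        mask |= 1 << prio
    return mask


def enabled_priorities(mask: int) -> list:
    """Inverse helper: list the priorities whose bits are set in the bitmap."""
    return [prio for prio in range(8) if mask & (1 << prio)]


if __name__ == "__main__":
    print(hex(pfc_bitmap([4])))      # 0x10
    print(enabled_priorities(0x10))  # [4]
    print(enabled_priorities(0x16))  # [1, 2, 4]
```

The last line shows why the exact hex value matters: 0x16 would turn on priorities 1, 2 and 4 rather than priority 4 alone.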
2. Create the VLAN interface and assign it an IP address:
# modprobe 8021q
# vconfig add eth2 10
Added VLAN with VID == 10 to IF -:eth2:-
# ifconfig eth2.10 10.10.10.5/24 up
3. For TCP/IP traffic, map skb_priority to UP. Here every skb_priority is mapped to UP 4:
# for i in {0..7}; do vconfig set_egress_map eth2.10 $i 4 ; done
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
4. For traffic that bypasses the kernel, i.e. RDMA traffic, map skb_priority to UP as well, again mapping every skb_priority to UP 4:
# tc_wrap.py -i eth2 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
skprio2up is available only for RoCE in kernels that don't support set_egress_map
Traffic classes are set to 8
UP 0
UP 1
UP 2
UP 3
UP 4
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
skprio: 0 (vlan 10)
skprio: 1 (vlan 10)
skprio: 2 (vlan 10 tos: 8)
skprio: 3 (vlan 10)
skprio: 4 (vlan 10 tos: 24)
skprio: 5 (vlan 10)
skprio: 6 (vlan 10 tos: 16)
skprio: 7 (vlan 10)
UP 5
UP 6
UP 7
5. Map UP to TC: UP 4 goes to TC 4 and every other UP goes to the TC with the same number. Also enable PFC on priority 4:
# mlnx_qos -i eth2 -p 0,1,2,3,4,5,6,7 -f 0,0,0,0,1,0,0,0
Priority trust mode is not supported on your system
Priority trust mode: none
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 0 1 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
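The two comma-separated lists passed to mlnx_qos in step 5 are easy to get wrong by hand. As a rough sketch, a small helper (hypothetical, not part of MLNX_OFED) can build them from a UP-to-TC list and a set of PFC-enabled priorities. It uses only the `-i`, `-p` and `-f` options shown above and only assembles the argument list; it does not run anything.

```python
def mlnx_qos_command(interface: str, up_to_tc, pfc_priorities) -> list:
    """Build the mlnx_qos argument list used in step 5.

    up_to_tc       -- 8 entries, up_to_tc[up] is the TC for that UP
    pfc_priorities -- priorities on which PFC should be enabled
    """
    up_to_tc = list(up_to_tc)
    if len(up_to_tc) != 8:
        raise ValueError("expected exactly 8 UP-to-TC entries")
    prio_tc = ",".join(str(tc) for tc in up_to_tc)
    pfc = ",".join("1" if up in pfc_priorities else "0" for up in range(8))
    return ["mlnx_qos", "-i", interface, "-p", prio_tc, "-f", pfc]


if __name__ == "__main__":
    # Identity UP-to-TC mapping with PFC on priority 4 only, as in this article.
    print(" ".join(mlnx_qos_command("eth2", range(8), {4})))
```

The resulting list could be passed to `subprocess.run` on a host that has mlnx_qos installed.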
That completes the configuration.
Finally, save the switch configuration so that it survives a reboot:
switch-6bd534 [standalone: master] (config) # write memory
Verification
Test with ib_write_bw (connections are set up through rdma_cm), using one host as the sender and the other as the receiver.
receiver:
$ ib_write_bw -d mlx4_0 -i 2 -x 2 -S 4 --report_gbits -D 10
sender:
$ ib_write_bw 10.10.10.6 -d mlx4_0 -i 2 -x 2 -S 4 --report_gbits -D 10
Then check on the switch whether PG 4 received the traffic:
switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pg 4
PG 4:
44321827 packets
48853700404 bytes
0 queue depth
0 no buffer discard
0 shared buffer discard
Or check the PFC counters (note that PFC will not necessarily be triggered):
switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pfc prio 4
PFC 4:
Rx:
0 pause packets
0 pause duration
Tx:
18 pause packets
4 pause duration
On the host side, check the priority 4 counters:
$ ethtool -S eth2 | grep prio_4
rx_pause_prio_4: 88
rx_pause_duration_prio_4: 0
rx_pause_transition_prio_4: 0
tx_pause_prio_4: 0
tx_pause_duration_prio_4: 11
tx_pause_transition_prio_4: 44
rx_prio_4_packets: 9155756
rx_prio_4_bytes: 752828084
tx_prio_4_packets: 862787989
tx_prio_4_bytes: 950840867498
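When this check is repeated often, the counter text can be parsed instead of read by eye. The sketch below is a minimal example that works on text in the `name: value` form shown above. The sample is copied from this article's output, and both function names are hypothetical.

```python
def parse_counters(text: str) -> dict:
    """Parse `name: value` lines (as printed by ethtool -S) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers such as "NIC statistics:" and any non-numeric lines.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def pfc_summary(counters: dict, prio: int) -> dict:
    """Pull out the pause-frame counters for one priority."""
    rx = counters.get(f"rx_pause_prio_{prio}", 0)
    tx = counters.get(f"tx_pause_prio_{prio}", 0)
    return {"rx_pause": rx, "tx_pause": tx, "saw_pause": rx > 0 or tx > 0}


SAMPLE = """\
rx_pause_prio_4: 88
rx_pause_duration_prio_4: 0
tx_pause_prio_4: 0
tx_prio_4_packets: 862787989
"""

if __name__ == "__main__":
    print(pfc_summary(parse_counters(SAMPLE), prio=4))
```

A nonzero `rx_pause` on the sender means the switch asked it to pause on that priority, which is the sign that PFC is active on the path.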
References:
HowTo Run RoCE over L2 Enabled with PFC
How to Enable PFC on Mellanox Switches (Spectrum)
HowTo Configure PFC on ConnectX-4
Mellanox support
Original post: https://blog.csdn.net/u013431916/article/details/82385641