[RDMA] Configuring PFC on RoCE v1

Environment:


Two hosts (each with one dual-port 40 Gbps ConnectX-3 NIC, driver version 4.1-1.0.2.0, running Ubuntu 16.04)

One 32-port Mellanox Spectrum SN2700 switch running Onyx 3.6.8102.

 

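PFC background:

PFC: https://blog.csdn.net/bandaoyu/article/details/115346857

Quoting Juniper's introduction to PFC: "Priority-based flow control (PFC), IEEE standard 802.1Qbb, is a link-level flow control mechanism. The mechanism is similar to the IEEE 802.3x PAUSE mechanism, but it pauses only the traffic of one priority on the link (each priority is a virtual lane, and an individual lane is paused) rather than the whole link. PFC allows you to selectively pause traffic according to its class."

 

As this shows, PFC is finer-grained than IEEE 802.3x: it pauses a single virtual lane rather than the whole link. Configuring PFC can therefore be understood as mapping application traffic onto a particular priority. Depending on where the traffic is marked, the mode is either Trust L2 (the PCP field of the VLAN tag) or Trust L3 (the DSCP field of the IP header). Because ConnectX-3 supports only RoCE v1, this article covers Trust L2 only.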
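To make the "pause one lane, not the whole link" idea concrete, here is a minimal sketch of how a PFC frame is laid out on the wire. The function name `build_pfc_frame` and its arguments are hypothetical and only for illustration; the constants follow the standard MAC Control frame layout (EtherType 0x8808, PFC opcode 0x0101, a class-enable vector, then eight per-priority timers).

```python
import struct

PFC_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101  # a plain 802.3x PAUSE frame uses opcode 0x0001 instead


def build_pfc_frame(src_mac: bytes, pause_quanta: dict) -> bytes:
    """Build a PFC frame (without FCS).

    pause_quanta maps a priority (0-7) to a pause time in quanta
    (0-65535, one quantum = 512 bit times). Priorities that are not
    listed are left untouched by the receiver.
    """
    enable_vector = 0
    timers = [0] * 8
    for prio, quanta in pause_quanta.items():
        if not 0 <= prio <= 7:
            raise ValueError(f"priority out of range: {prio}")
        enable_vector |= 1 << prio  # bit n marks timer n as valid
        timers[prio] = quanta
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    header = PFC_DST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    # Pad with zeros up to the 60-byte minimum Ethernet frame size (FCS excluded).
    return (header + payload).ljust(60, b"\x00")


if __name__ == "__main__":
    frame = build_pfc_frame(bytes(6), {4: 0xFFFF})
    print(frame.hex())
```

Only the bit for priority 4 is set in the class-enable vector here, so a receiver would pause that lane alone, whereas an 802.3x PAUSE frame carries a single timer that applies to the whole link.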

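On the end-host side, the mapping chain is:

ToS -> skb_priority -> VLAN QoS (also called User Priority, or UP; its value is the PCP field of the VLAN tag) -> TC.

On the switch side, the mapping chain is:

PCP + DEI -> switch-priority -> ingress Port Group (PG). The PG holds the PFC threshold configuration.

This article uses TC 4 and switch-priority 4 as the example.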
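The host-side chain can be sketched as three table lookups. This is only an illustration: `IP_TOS2PRIO` is assumed to mirror the Linux kernel's default ToS-to-priority table, and `EGRESS_MAP`, `UP_TO_TC`, and `host_path` are hypothetical names that stand in for the host settings this article applies (every skb_priority to UP 4, and UP n to TC n).

```python
# Assumed to mirror the Linux kernel's default ip_tos2prio table,
# which is indexed by (tos & 0x1E) >> 1.
IP_TOS2PRIO = [0, 0, 0, 0, 2, 2, 2, 2, 6, 6, 6, 6, 4, 4, 4, 4]

# Host settings used in this article: every skb_priority -> UP 4,
# and UP n -> TC n.
EGRESS_MAP = {skprio: 4 for skprio in range(16)}
UP_TO_TC = {up: up for up in range(8)}


def host_path(tos: int) -> tuple:
    """Return (skb_priority, UP, TC) for a packet sent with the given ToS."""
    skprio = IP_TOS2PRIO[(tos & 0x1E) >> 1]
    up = EGRESS_MAP.get(skprio, 0)  # an unmapped skb_priority falls back to UP 0
    tc = UP_TO_TC[up]
    return skprio, up, tc


if __name__ == "__main__":
    for tos in (0, 8, 16, 24):
        print(f"tos={tos:2d} -> skprio, UP, TC = {host_path(tos)}")
```

With these tables, whatever ToS the application sets, the packet ends up with UP 4 and TC 4, which is the priority on which PFC is enabled.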

 

Configuration:


First, configure the switch:

0. Enter configuration mode:

switch-6bd534 [standalone: master] > enable
switch-6bd534 [standalone: master] # configure terminal


1. Create the VLAN and set the switch ports to hybrid mode:

switch-6bd534 [standalone: master] (config) # vlan 10
switch-6bd534 [standalone: master] (config vlan 10) # exit
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport mode hybrid
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport hybrid allowed-vlan add 10


2. Disable flow control on all ports:

switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol send off force
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol receive off force

3. Enable priority 4 and turn on PFC on all ports:

switch-6bd534 [standalone: master] (config) # dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm enable pfc globally: yes
switch-6bd534 [standalone: master] (config) # dcb priority-flow-control priority 4 enable
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 dcb priority-flow-control mode on force

Note: to disable PFC:

switch-6bd534 [standalone: master] (config) # no dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm disable pfc globally: yes
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 no dcb priority-flow-control mode force

4. Adjust the port buffer configuration and bind switch-priority to the PG buffer:

switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 egress-buffer ePort.tc4 map pool ePool0 reserved 1500 shared alpha inf
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 bind switch-priority 4

5. Map PCP + DEI to switch-priority:

switch-6bd534 [standalone: master] (config) # qos trust L2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 qos map pcp 4 dei 0 to switch-priority 4
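For reference, PCP and DEI both live in the 16-bit TCI field of the 802.1Q VLAN tag, next to the VLAN ID. The sketch below packs and unpacks that field; the helper names and the `PCP_DEI_TO_SWITCH_PRIO` table are hypothetical and only imitate the single mapping configured above.

```python
def vlan_tci(pcp: int, dei: int, vid: int) -> int:
    """Pack PCP (3 bits), DEI (1 bit) and VLAN ID (12 bits) into a 16-bit TCI."""
    if not (0 <= pcp <= 7 and dei in (0, 1) and 0 <= vid <= 0xFFF):
        raise ValueError("field out of range")
    return (pcp << 13) | (dei << 12) | vid


def parse_tci(tci: int) -> tuple:
    """Unpack a 16-bit TCI into (pcp, dei, vid)."""
    return tci >> 13, (tci >> 12) & 1, tci & 0xFFF


# Hypothetical stand-in for the switch's PCP+DEI -> switch-priority table,
# containing only the entry configured in this article.
PCP_DEI_TO_SWITCH_PRIO = {(4, 0): 4}

if __name__ == "__main__":
    tci = vlan_tci(pcp=4, dei=0, vid=10)
    pcp, dei, vid = parse_tci(tci)
    print(hex(tci), "-> switch-priority", PCP_DEI_TO_SWITCH_PRIO[(pcp, dei)])
```

A frame from the host's VLAN 10 interface carrying UP 4 therefore has PCP 4 and DEI 0 in its tag, which is the combination the switch maps to switch-priority 4.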


The switch side is now configured.

Next, configure the end hosts:

1. Set the pfctx and pfcrx parameters:

# vim /etc/modprobe.d/mlx4.conf

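Add:

options mlx4_en pfctx=0x10 pfcrx=0x10


Note that pfctx and pfcrx are each an 8-bit bitmap in which bit n enables priority n, so enabling priority 4 means 0x10 (binary 0001 0000, decimal 16).

Then restart the NIC driver:

# /etc/init.d/openibd restart

Verify:

# RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX

The output should be 0x10.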
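The bitmap value can be derived mechanically rather than by hand. The helpers below are a hypothetical sketch (they are not part of any Mellanox tool): one builds the mask from a list of priorities, the other decodes a mask back into priorities.

```python
def pfc_bitmap(priorities) -> int:
    """Return the 8-bit mask in which bit n is set for each priority n."""
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority out of range: {prio}")
        mask |= 1 << prio
    return mask


def enabled_priorities(mask: int) -> list:
    """Return the priorities whose bits are set in an 8-bit mask."""
    return [prio for prio in range(8) if (mask >> prio) & 1]


if __name__ == "__main__":
    mask = pfc_bitmap([4])
    print(f"options mlx4_en pfctx={mask:#04x} pfcrx={mask:#04x}")
    print("decimal:", mask, "priorities:", enabled_priorities(mask))
```

The module parameter under /sys should read back in decimal, which is why the verification command above converts it with printf; decimal 16 and hexadecimal 0x10 are the same mask.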

2. Create the VLAN interface and assign an IP address:

# modprobe 8021q
# vconfig add eth2 10
Added VLAN with VID == 10 to IF -:eth2:-
# ifconfig eth2.10 10.10.10.5/24 up


3. For TCP/IP traffic, map skb_priority to UP, sending every skb_priority to UP 4:

# for i in {0..7}; do vconfig set_egress_map eth2.10 $i 4 ; done
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10


4. For traffic that bypasses the kernel (that is, RDMA traffic), map skb_priority to UP, sending every skb_priority to UP 4:

# tc_wrap.py -i eth2 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
skprio2up is available only for RoCE in kernels that don't support set_egress_map
Traffic classes are set to 8
UP  0
UP  1
UP  2
UP  3
UP  4
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
        skprio: 0 (vlan 10)
        skprio: 1 (vlan 10)
        skprio: 2 (vlan 10 tos: 8)
        skprio: 3 (vlan 10)
        skprio: 4 (vlan 10 tos: 24)
        skprio: 5 (vlan 10)
        skprio: 6 (vlan 10 tos: 16)
        skprio: 7 (vlan 10)
UP  5
UP  6
UP  7


5. Map UP to TC, with UP 4 mapped to TC 4 and every other UP mapped to the TC of the same number, and enable PFC on priority 4:

# mlnx_qos -i eth2 -p 0,1,2,3,4,5,6,7 -f 0,0,0,0,1,0,0,0
Priority trust mode is not supported on your system
Priority trust mode: none
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   0   1   0   0   0
 
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7


Configuration is now complete.

Finally, save the switch configuration so that it survives a reboot:

switch-6bd534 [standalone: master] (config) # write memory

Verification:

Test with ib_write_bw (connections are set up with rdma_cm), using one host as the sender and the other as the receiver.

receiver:

$ ib_write_bw -d mlx4_0 -i 2 -x 2 -S 4 --report_gbits -D 10

sender:

$ ib_write_bw 10.10.10.6 -d mlx4_0 -i 2 -x 2 -S 4 --report_gbits -D 10


Then check on the switch whether PG 4 has received data:

switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pg 4
 
PG 4:
  44321827              packets
  48853700404           bytes
  0                     queue depth
  0                     no buffer discard
  0                     shared buffer discard


Alternatively, check the PFC counters (note that PFC will not necessarily be triggered):

 

switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pfc prio 4
 
PFC 4:
  Rx:
    0                     pause packets
    0                     pause duration
 
  Tx:
    18                    pause packets
    4                     pause duration


On the end host, check the priority 4 counters:

$ ethtool -S eth2 | grep prio_4
     rx_pause_prio_4: 88
     rx_pause_duration_prio_4: 0
     rx_pause_transition_prio_4: 0
     tx_pause_prio_4: 0
     tx_pause_duration_prio_4: 11
     tx_pause_transition_prio_4: 44
     rx_prio_4_packets: 9155756
     rx_prio_4_bytes: 752828084
     tx_prio_4_packets: 862787989
     tx_prio_4_bytes: 950840867498
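When this check is repeated across several runs, it can help to pull the counters into a dictionary instead of reading them by eye. The sketch below parses text in the `ethtool -S` format shown above; `parse_prio_counters` and `pause_activity` are hypothetical helper names, and `SAMPLE` simply repeats the output from this article.

```python
import re

SAMPLE = """\
     rx_pause_prio_4: 88
     rx_pause_duration_prio_4: 0
     rx_pause_transition_prio_4: 0
     tx_pause_prio_4: 0
     tx_pause_duration_prio_4: 11
     tx_pause_transition_prio_4: 44
     rx_prio_4_packets: 9155756
     rx_prio_4_bytes: 752828084
     tx_prio_4_packets: 862787989
     tx_prio_4_bytes: 950840867498
"""


def parse_prio_counters(text: str, prio: int) -> dict:
    """Collect 'name: value' counters whose name mentions prio_<prio>."""
    pattern = re.compile(rf"prio_{prio}(_|$)")
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and pattern.search(name):
            counters[name] = int(value)
    return counters


def pause_activity(counters: dict) -> dict:
    """Return only the non-zero pause-related counters."""
    return {k: v for k, v in counters.items() if "pause" in k and v}


if __name__ == "__main__":
    counters = parse_prio_counters(SAMPLE, 4)
    print(pause_activity(counters))
```

On a live host the same function could be fed the text captured from `ethtool -S eth2`; an empty result from `pause_activity` would mean no PFC activity was recorded on that priority during the run.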
 

References:

HowTo Run RoCE over L2 Enabled with PFC 

How to Enable PFC on Mellanox Switches (Spectrum)

HowTo Configure PFC on ConnectX-4

Mellanox support

Original article: https://blog.csdn.net/u013431916/article/details/82385641
