Introduction to mlx RDMA NIC Counters and Parameters
Overview
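The mlx5 driver exposes a set of mlx NIC parameters and counters through Linux sysfs, in the /sys/class/infiniband/mlx5_x/ports/1/counters and /sys/class/infiniband/mlx5_x/ports/1/hw_counters directories. Each counter records how many times a particular kind of event has occurred, such as a specific error or the number of packets received. Understanding these counters gives a clearer picture of the NIC's running state, and monitoring them helps locate the root cause of RDMA errors more quickly.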
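As a minimal sketch of how these two directories can be read, the Python snippet below walks `counters` and `hw_counters` for one port and parses each file as an integer. The function names `read_counter_dir` and `read_port_counters`, and the device name `mlx5_0` in the usage part, are illustrative assumptions; only the sysfs layout comes from the text above.

```python
from pathlib import Path

# sysfs root described above; can be overridden (e.g. for testing).
SYSFS_IB_ROOT = Path("/sys/class/infiniband")


def read_counter_dir(directory):
    """Return {file_name: int_value} for every integer-valued file in directory."""
    values = {}
    directory = Path(directory)
    if not directory.is_dir():
        return values
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or do not hold a plain integer.
            continue
    return values


def read_port_counters(device, port=1, root=SYSFS_IB_ROOT):
    """Read both counter groups for one device port (hypothetical helper)."""
    port_dir = Path(root) / device / "ports" / str(port)
    return {
        "counters": read_counter_dir(port_dir / "counters"),
        "hw_counters": read_counter_dir(port_dir / "hw_counters"),
    }


if __name__ == "__main__":
    # "mlx5_0" is an assumed device name; prints nothing if it does not exist.
    for group, values in read_port_counters("mlx5_0").items():
        for name, value in values.items():
            print(f"{group}/{name} = {value}")
```

Passing `root` explicitly keeps the helper usable against a copied or mocked directory tree, which is convenient when the monitoring code is developed on a machine without an mlx NIC.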
hw_counter
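- rnr_nak_retry_err: Counted on the sender side: the number of RNR NAK packets received from the peer. It increases when the receiver's QP or SRQ has no free receive entries left.
- out_of_buffer: Counted on the receiver side: a packet arrived but no receive buffer was available. It increases when the local QP's SRQ has run out of posted receive entries.
- out_of_sequence: The number of packets received out of order.
- local_ack_timeout_err: The number of times an RDMA request sent by the local host timed out waiting for an ACK.
- packet_seq_err: The number of NAK (sequence error) packets received by the local host.
- req_cqe_error: The number of CQEs that completed with an error on the local host, acting as requester.
- duplicate_request: The number of duplicate request packets received by the local host.
- np_ecn_marked_roce_packets: The number of ECN-marked RoCE packets received by the local host.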
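Because these values are cumulative, monitoring usually looks at how much a counter grew between two samples rather than at its absolute value. The sketch below compares two snapshots (plain dicts of counter name to value) and reports only the counters that increased. The helper name `counter_deltas` and the sample numbers are hypothetical and used purely for illustration.

```python
# Counter names taken from the hw_counter list above.
WATCHED_HW_COUNTERS = (
    "rnr_nak_retry_err",
    "out_of_buffer",
    "out_of_sequence",
    "local_ack_timeout_err",
    "packet_seq_err",
    "req_cqe_error",
    "duplicate_request",
    "np_ecn_marked_roce_packets",
)


def counter_deltas(before, after, names=WATCHED_HW_COUNTERS):
    """Return {name: increase} for watched counters that grew between snapshots."""
    deltas = {}
    for name in names:
        if name not in before or name not in after:
            continue  # counter missing in one snapshot, nothing to compare
        diff = after[name] - before[name]
        if diff > 0:
            deltas[name] = diff
    return deltas


if __name__ == "__main__":
    # Made-up sample values, not measurements from a real NIC.
    first = {"out_of_buffer": 5, "out_of_sequence": 2, "packet_seq_err": 0}
    second = {"out_of_buffer": 9, "out_of_sequence": 2, "packet_seq_err": 1}
    for name, diff in counter_deltas(first, second).items():
        print(f"{name} increased by {diff}")
```

A negative difference (for example after a driver reload resets the counters) is ignored here; a real monitor may want to treat that case as a restart instead.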
counter
- port_rcv_data: Total number of data octets, divided by 4 (lanes), received on all VLs. This is a 64-bit counter.
- port_rcv_packets: Total number of packets received (this may include packets containing errors). This is a 64-bit counter.
- port_xmit_data: Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is a 64-bit counter.
- port_xmit_packets: Total number of packets transmitted on all VLs from this port. This may include packets with errors.
- unicast_rcv_packets: Total number of unicast packets received, including unicast packets containing errors.
- unicast_xmit_packets: Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors.
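Since port_rcv_data and port_xmit_data are reported in octets divided by 4, a raw value has to be multiplied by 4 to get bytes. The sketch below shows that conversion and a simple throughput estimate from two samples. The function names and the sample numbers are assumptions for illustration only.

```python
def data_counter_to_bytes(value):
    """Convert a port_rcv_data / port_xmit_data reading to bytes (octets)."""
    return value * 4


def throughput_bytes_per_sec(before, after, interval_seconds):
    """Estimate average throughput from two raw data-counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return data_counter_to_bytes(after - before) / interval_seconds


if __name__ == "__main__":
    # Made-up readings of port_xmit_data taken 2 seconds apart.
    rate = throughput_bytes_per_sec(1000, 3500, 2.0)
    print(f"average transmit rate: {rate} bytes/s")
```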
References
- Understanding mlx5 Linux Counters and Status Parameters
- Understanding mlx5 ethtool Counters
- Nak Errors