Seastar's user-space TCP/IP stack

A while ago I benchmarked seastar's user-space TCP/IP stack. The performance is impressive, so here is a write-up.

seastar

Seastar is a high-performance I/O framework, written in flashy, bleeding-edge C++14.

http://www.seastar-project.org/

Performance on a single 10GbE NIC

1 core  ----> 530k
2 cores ----> 1,060k
3 cores ----> 1,590k
4 cores ----> 2,000k
5 cores ----> 2,620k
6 cores ----> 3,160k
7 cores ----> 3,650k
8 cores ----> 4,020k
9 cores ----> 4,100k
9 cores ----> 4,090k
16 cores ----> 8,890k

The benchmark echoes 1 byte per request over 1000 connections (the numbers above are requests per second); throughput scales almost linearly with the number of cores. A minimal echo server in this spirit is sketched below.
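
For reference, the kind of server being benchmarked can be written in a few dozen lines against seastar's public API. The sketch below uses today's API names (app_template, seastar::listen, connected_socket, accept_result); the bundled apps/echo/tcp_echo that produced the numbers above differs in details such as option handling, and older seastar releases spell some of these names differently.

    #include <seastar/core/app-template.hh>
    #include <seastar/core/seastar.hh>
    #include <seastar/core/future-util.hh>
    #include <seastar/core/do_with.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <seastar/net/api.hh>

    // echo everything we read on one connection straight back to the peer
    seastar::future<> echo_connection(seastar::connected_socket s) {
        auto in  = s.input();
        auto out = s.output();
        return seastar::do_with(std::move(s), std::move(in), std::move(out),
                [] (auto& s, auto& in, auto& out) {
            return seastar::repeat([&in, &out] {
                return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                    if (buf.empty()) {   // peer closed the connection
                        return seastar::make_ready_future<seastar::stop_iteration>(
                                seastar::stop_iteration::yes);
                    }
                    return out.write(std::move(buf)).then([&out] {
                        return out.flush();
                    }).then([] {
                        return seastar::stop_iteration::no;
                    });
                });
            }).finally([&out] { return out.close(); });
        });
    }

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            return seastar::do_with(seastar::listen(seastar::make_ipv4_address({10000})),
                    [] (seastar::server_socket& listener) {
                return seastar::keep_doing([&listener] {
                    return listener.accept().then([] (seastar::accept_result ar) {
                        // run each connection in the background
                        (void)echo_connection(std::move(ar.connection));
                    });
                });
            });
        });
    }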

Setting up the seastar environment

  • Copy scylladb.tar to /opt
  • Add the bundled libraries to the dynamic-linker search path
vim /etc/ld.so.conf.d/scylla.x86_64.conf
 /opt/scylladb/lib64
ldconfig
ldconfig -p
  • Add gcc-5.3 to the PATH
export PATH=/opt/scylladb/bin/:$PATH
  • Install the dependencies
sudo yum install cryptopp.x86_64 cryptopp-devel.x86_64  -y -b test
sudo yum install -y libaio-devel hwloc-devel numactl-devel libpciaccess-devel cryptopp-devel libxml2-devel xfsprogs-devel gnutls-devel lksctp-tools-devel lz4-devel gcc make protobuf-devel protobuf-compiler libunwind-devel systemtap-sdt-devel
  • Configure seastar with DPDK support via configure.py
./configure.py --enable-dpdk --disable-xen --with apps/echo/tcp_echo
  • Build seastar
ninja-build -j90
  • Build the DPDK kernel modules and load them
cd seastar/dpdk
   ./tools/dpdk-setup.sh
   modprobe  uio
   insmod  igb_uio.ko

Handing the NIC over to DPDK

  • Pick the NIC that DPDK will take over and remove it from the bond
echo -eth4 >  /sys/class/net/bond0/bonding/slaves
  • Use the script shipped with seastar to check which driver each NIC is bound to (DPDK ships an equivalent script)
./scripts/dpdk_nic_bind.py  --status
Network devices using DPDK-compatible driver
    ============================================
    <none>
    Network devices using kernel driver
    ===================================
    0000:03:00.0 'I350 Gigabit Network Connection' if=eno1 drv=igb unused=igb_uio
    0000:03:00.1 'I350 Gigabit Network Connection' if=eno2 drv=igb unused=igb_uio
    0000:03:00.2 'I350 Gigabit Network Connection' if=eno3 drv=igb unused=igb_uio
    0000:03:00.3 'I350 Gigabit Network Connection' if=eno4 drv=igb unused=igb_uio
    0000:04:00.0 '82599EB 10-Gigabit SFI/SFP+ Network Connection' if=eth4 drv=ixgbe unused=igb_uio
    0000:04:00.1 '82599EB 10-Gigabit SFI/SFP+ Network Connection' if=eth5 drv=ixgbe unused=igb_uio
    eth4 is the 82599 10GbE card; that is the NIC I am about to hand over to DPDK.
  • Bind the NIC to igb_uio
./scripts/dpdk_nic_bind.py  --bind=igb_uio eth4
  • Verify that the binding succeeded
./scripts/dpdk_nic_bind.py  --status
Network devices using DPDK-compatible driver
    ============================================
    0000:04:00.0 '82599EB 10-Gigabit SFI/SFP+ Network Connection' drv=igb_uio unused=
    Network devices using kernel driver
    ===================================
    0000:03:00.0 'I350 Gigabit Network Connection' if=eno1 drv=igb unused=igb_uio
    0000:03:00.1 'I350 Gigabit Network Connection' if=eno2 drv=igb unused=igb_uio
    0000:03:00.2 'I350 Gigabit Network Connection' if=eno3 drv=igb unused=igb_uio
    0000:03:00.3 'I350 Gigabit Network Connection' if=eno4 drv=igb unused=igb_uio
    0000:04:00.1 '82599EB 10-Gigabit SFI/SFP+ Network Connection' if=eth5 drv=ixgbe unused=igb_uio
    Other network devices
    =====================

Running the seastar user-space stack

  • Configure hugepages
echo 1024  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   mount -t hugetlbfs nodev /dev/hugepages
  • Run the test server
./build/release/apps/echo/tcp_echo_server  --network-stack native --dpdk-pmd   --collectd 0 --port 10000 --dhcp 0 --host-ipv4-addr 10.107.139.20   --netmask-ipv4-addr 255.255.252.0 --gw-ipv4-addr 100.81.243.247   --no-handle-interrupt  --poll-mode --poll-aio 0 --hugepages /dev/hugepages   --memory 30G

--hugepages /dev/hugepages tells seastar to allocate packet memory from hugepages, which saves one memory copy on the packet path.

  • Run the test client
./build/release/apps/echo/tcp_echo_client --network-stack native --dpdk-pmd  --dhcp 1  --poll-mode --poll-aio 0 --hugepages /dev/hugepages   --memory 30G  --smp 16  -s  "10.107.139.20:10000" --conn 10

Grabbing packets from the NIC

  • seastar runs one user-space network stack per core, and that stack consumes layer-2 frames. How do the layer-2 frames arriving on the NIC reach user space?
seastar supports two ways of acquiring packets (a sketch of the selection follows this list):
1) through a tap device plus vhost-net and a vring: this is the development mode; it does not require taking over the NIC and is easy to deploy;
2) DPDK takes over the NIC and the stack polls it for packets.
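
The choice between the two modes is made when the native stack creates its packet device. The sketch below paraphrases that selection; create_dpdk_net_device and create_virtio_net_device are real functions in the seastar tree, but the argument lists are elided here, so read it as pseudocode rather than the verbatim source.

    // paraphrase of the device selection in native-stack.cc (arguments elided)
    std::unique_ptr<net::device> create_native_net_device(boost::program_options::variables_map opts) {
    #ifdef HAVE_DPDK
        if (opts.count("dpdk-pmd")) {
            // mode 2: DPDK owns the NIC; packets are polled in user space
            return create_dpdk_net_device(/* port index, number of queues, ... */);
        }
    #endif
        // mode 1: development mode; frames arrive over a tap device via vhost-net/vring
        return create_virtio_net_device(/* tap device name, options, ... */);
    }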

Initialization of the seastar stack

The stack factory function is registered in native-stack.cc:

network_stack_registrator nns_registrator{
       "native", nns_options(), native_network_stack::create
   };

Based on the command-line option --network-stack, native_network_stack::create is invoked. (A toy version of this self-registering factory pattern is shown below.)
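
network_stack_registrator is the classic self-registering factory: a global object whose constructor adds a (name, factory) pair to a registry at static-initialization time. A self-contained toy version of the pattern (this is not the seastar code, just the idea):

    #include <functional>
    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>

    struct network_stack { virtual ~network_stack() = default; };
    struct native_stack : network_stack {};

    using factory_fn = std::function<std::unique_ptr<network_stack>()>;

    // the registry lives in a function so it is initialized before first use
    std::map<std::string, factory_fn>& registry() {
        static std::map<std::string, factory_fn> r;
        return r;
    }

    struct registrator {
        registrator(const std::string& name, factory_fn f) { registry()[name] = std::move(f); }
    };

    // analogous to: network_stack_registrator nns_registrator{"native", nns_options(), native_network_stack::create};
    registrator native_reg{"native", [] { return std::make_unique<native_stack>(); }};

    int main() {
        std::string chosen = "native";              // value of --network-stack
        auto stack = registry().at(chosen)();       // invokes the registered factory
        std::cout << "created the '" << chosen << "' stack\n";
        return stack ? 0 : 1;
    }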

Creating the dpdk_net_device

auto dev = create_dpdk_net_device();   // dev is the dpdk_device

Initializing the dpdk_device

Initialization of the dpdk device calls:
   rte_eth_dev_configure

init_local_queue on every CPU

auto qp = sdev->init_local_queue(opts, qid);
   This creates a dpdk::dpdk_qp<false>. Its constructor does the following (steps 3) to 5) are sketched in plain DPDK right after this list):
   0) constructs the base class qp, which registers poll_tx
   1) registers rx_gc
   2) registers _tx_buf_factory.gc()
   3) init_rx_mbuf_pool
   4) rte_eth_rx_queue_setup
   5) rte_eth_tx_queue_setup
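
Steps 3) through 5) are plain DPDK calls. In isolation they amount to roughly the generic sketch below (pool and ring sizes are illustrative; this is not seastar's exact code, which also wires the queues into its pollers):

    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    // allocate a per-queue mbuf pool, then attach one rx ring and one tx ring of the
    // port to the calling core; in a real program the pool name must be unique per queue
    static struct rte_mempool* setup_queue_pair(uint16_t port, uint16_t qid) {
        struct rte_mempool* pool = rte_pktmbuf_pool_create(
            "rx_pool", 8192 /* mbufs */, 256 /* per-core cache */, 0 /* private area */,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        rte_eth_rx_queue_setup(port, qid, 512 /* rx descriptors */, rte_socket_id(), NULL, pool);
        rte_eth_tx_queue_setup(port, qid, 512 /* tx descriptors */, rte_socket_id(), NULL);
        return pool;
    }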

dpdk_device::init_port_fini on every CPU

rte_eth_dev_start
   rte_eth_dev_rss_reta_update
   register a timer that polls the link state via rte_eth_link_get_nowait
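
rte_eth_dev_rss_reta_update is what makes the per-core scaling work: every slot of the NIC's RSS redirection table is pointed at one of the rx queues, so the hardware hashes incoming flows across all cores. A generic sketch of such an update (not seastar's exact code):

    #include <rte_ethdev.h>
    #include <vector>

    static void spread_rss_over_queues(uint16_t port, uint16_t num_queues) {
        struct rte_eth_dev_info info;
        rte_eth_dev_info_get(port, &info);
        // one rte_eth_rss_reta_entry64 covers RTE_RETA_GROUP_SIZE (64) table slots
        std::vector<rte_eth_rss_reta_entry64> reta(info.reta_size / RTE_RETA_GROUP_SIZE);
        for (uint16_t i = 0; i < info.reta_size; i++) {
            reta[i / RTE_RETA_GROUP_SIZE].mask = ~0ULL;                         // update this slot
            reta[i / RTE_RETA_GROUP_SIZE].reta[i % RTE_RETA_GROUP_SIZE] = i % num_queues;
        }
        rte_eth_dev_rss_reta_update(port, reta.data(), info.reta_size);
    }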

set_local_queue on every CPU, storing the qp pointer

_queues[engine().cpu_id()] = qp.get();

create_native_stack on every CPU

interface::_rx = device::receive() {
                          auto sub = _queues[engine().cpu_id()]->_rx_stream.listen(std::move(next_packet)); // registers the layer-2 packet handler!
                          _queues[engine().cpu_id()]->rx_start(); // rx_start registers dpdk_qp::poll_rx_once() with poll_once
                          return std::move(sub);
                      }
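
The listen()/produce() pair here is seastar's stream/subscription primitive (core/stream.hh). A self-contained toy showing just that idiom, assuming the current public API; it is not the networking code itself:

    #include <seastar/core/app-template.hh>
    #include <seastar/core/stream.hh>
    #include <iostream>
    #include <memory>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            // the stream lives on the heap so both ends can keep it alive
            auto s = std::make_unique<seastar::stream<int>>();
            // consumer side -- what interface does with _rx_stream.listen(next_packet)
            auto sub = s->listen([] (int v) {
                std::cout << "got " << v << "\n";
                return seastar::make_ready_future<>();
            });
            // producer side -- what l2receive() does with _rx_stream.produce(packet);
            // produce() resolves once the listener's future has resolved
            auto& stream = *s;
            return stream.produce(42).then([&stream] {
                stream.close();
            }).finally([s = std::move(s), sub = std::move(sub)] {});
        });
    }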

The receive path: reactor::run() ----> poll_once()

When the interface is initialized, the corresponding qp's poll_rx_once is registered with the current engine's poll_once:

dpdk_qp::poll_rx_once() {
       uint16_t rx_count = rte_eth_rx_burst(_dev->port_idx(), _qid, buf, packet_read_size);
       process_packets(buf, rx_count);
   }
void process_packets(struct rte_mbuf **bufs, uint16_t count) {
       for (uint16_t i = 0; i < count; i++) {
           struct rte_mbuf *m = bufs[i];
            // handle the VLAN tag
            // handle rx checksum offload
            auto p = from_mbuf(m);             // build a seastar packet from the mbuf (returns an optional)
            _dev->l2receive(std::move(*p));
       }
   }
   void l2receive(packet p) { _queues[engine().cpu_id()]->_rx_stream.produce(std::move(p)); } ---->
    The _rx_stream callback was assigned when the interface was constructed:
   interface::dispatch_packet() {
   }
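
For completeness: dispatch_packet looks at the ethertype, finds the registered L3 handler (IPv4 or ARP), and hands the packet over, first forwarding it to whichever cpu owns the flow if that is not the local one. A heavily abridged paraphrase of its shape (member names such as _proto_map come from net/net.cc; the body is condensed and not runnable as-is):

    future<> interface::dispatch_packet(packet p) {
        auto eh = p.get_header<eth_hdr>();
        if (!eh) {
            return make_ready_future<>();                 // truncated frame: drop it
        }
        auto i = _proto_map.find(ntoh(eh->eth_proto));    // 0x0800 IPv4, 0x0806 ARP, ...
        if (i == _proto_map.end()) {
            return make_ready_future<>();                 // no L3 handler registered: drop it
        }
        // the L3 layer computes which cpu owns this flow; if it is not the local cpu the
        // packet is forwarded to that cpu's queue, otherwise it is fed to the handler here
        ...
    }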