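Background
We are gradually migrating our test environment from the Mesos framework to Alibaba Cloud's Container Service. Along the way we measured the network performance of four different ways for services to reach one another. This article describes the test method, the data, and the conclusions.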
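Test targets
Service containers can reach one another in four different ways:
docker link
The link mechanism provided by Docker, which lets one container access another. The test host is app-link.
hostname
When orchestrating services, each service is assigned a hostname, and other services can reach it by that hostname. The test host is app.
Service discovery
A mechanism that the container service builds on HAProxy to provide HTTP access and load balancing between services. The test host is app.local.
SLB
The traditional load balancing service, with two listener modes: HTTP and TCP. In these tests the HTTP SLB host is 192.168.1.10 and the TCP SLB host is 192.168.1.40.
Through each of these access methods we call the HTTP endpoint /ok on the target machine and measure latency and throughput.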
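To keep the runs comparable, the endpoints listed above can be collected in one place and the same command built for each of them. The following is only a sketch: `TARGETS`, `url_for`, `httping_command` and `ab_command` are hypothetical names that are not part of the original test setup, while the hosts and command-line flags are the ones used in this article.

```python
# Hypothetical helper table: access method -> test host named in this article.
TARGETS = {
    "docker link": "app-link",
    "hostname": "app",
    "service discovery": "app.local",
    "HTTP SLB": "192.168.1.10",
    "TCP SLB": "192.168.1.40",
}


def url_for(host):
    """Build the URL of the /ok endpoint on the given host."""
    return f"http://{host}/ok"


def httping_command(host):
    """Latency test: 100 probes, 0.01 s apart (flags as used in this article)."""
    return ["httping", "-c100", "-i", "0.01", "-g", url_for(host)]


def ab_command(host):
    """Throughput test: 10000 keep-alive requests at concurrency 10000."""
    return ["ab", "-lkc", "10000", "-n", "10000", url_for(host)]


if __name__ == "__main__":
    # Print the commands instead of running them; the hosts only resolve
    # inside the test environment.
    for name, host in TARGETS.items():
        print(f"{name}: {' '.join(httping_command(host))}")
        print(f"{name}: {' '.join(ab_command(host))}")
```

The argument lists could be passed to `subprocess.run` on a machine that can reach the hosts; here they are only printed.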
Test tools
- httping: measures latency
- ab (Apache HTTP server benchmarking tool): measures throughput
Install on Ubuntu:
apt-get install httping
apt-get install apache2-utils
Latency tests:
- link:
httping -c100 -i 0.01 -g 'http://app-link/ok'
...
connected to 172.18.1.3:80 (121 bytes), seq=96 time=0.45 ms
connected to 172.18.1.3:80 (121 bytes), seq=97 time=0.46 ms
connected to 172.18.1.3:80 (121 bytes), seq=98 time=0.42 ms
connected to 172.18.1.3:80 (121 bytes), seq=99 time=1.63 ms
--- http://app-link/ok ping statistics ---
100 connects, 100 ok, 0.00% failed, time 1050ms
round-trip min/avg/max = 0.4/0.5/1.6 ms
- hostname
httping -c100 -i 0.01 -g 'http://app/ok'
...
connected to 172.18.1.3:80 (121 bytes), seq=96 time=10.80 ms
connected to 172.18.1.3:80 (121 bytes), seq=97 time=0.43 ms
connected to 172.18.1.3:80 (121 bytes), seq=98 time=0.44 ms
connected to 172.18.1.3:80 (121 bytes), seq=99 time=0.46 ms
--- http://app/ok ping statistics ---
100 connects, 100 ok, 0.00% failed, time 1073ms
round-trip min/avg/max = 0.4/0.7/10.8 ms
- Service discovery
httping -c100 -i 0.01 -g 'http://app.local/ok'
...
connected to 172.18.1.2:80 (219 bytes), seq=96 time=0.69 ms
connected to 172.18.1.2:80 (219 bytes), seq=97 time=0.67 ms
connected to 172.18.1.2:80 (219 bytes), seq=98 time=0.74 ms
connected to 172.18.1.2:80 (219 bytes), seq=99 time=0.65 ms
--- http://app.local/ok ping statistics ---
100 connects, 100 ok, 0.00% failed, time 1090ms
round-trip min/avg/max = 0.6/0.9/6.0 ms
- HTTP SLB
httping -c100 -i 0.01 -g 'http://192.168.1.10/ok'
...
connected to 192.168.1.10:80 (140 bytes), seq=96 time=1.19 ms
connected to 192.168.1.10:80 (140 bytes), seq=97 time=1.08 ms
connected to 192.168.1.10:80 (140 bytes), seq=98 time=1.15 ms
connected to 192.168.1.10:80 (140 bytes), seq=99 time=1.30 ms
--- http://192.168.1.10/ok ping statistics ---
100 connects, 100 ok, 0.00% failed, time 1123ms
round-trip min/avg/max = 1.0/1.2/2.9 ms
- TCP SLB
httping -c100 -i 0.01 -g 'http://192.168.1.40/ok'
...
connected to 192.168.1.40:80 (121 bytes), seq=96 time=1.18 ms
connected to 192.168.1.40:80 (121 bytes), seq=97 time=1.25 ms
connected to 192.168.1.40:80 (121 bytes), seq=98 time=1.06 ms
connected to 192.168.1.40:80 (121 bytes), seq=99 time=1.34 ms
--- http://192.168.1.40/ok ping statistics ---
100 connects, 100 ok, 0.00% failed, time 1137ms
round-trip min/avg/max = 1.0/1.3/2.7 ms
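Test results
Each access method was tested with 100 HEAD requests. The average latency is shown in the table below:
Access method | Latency (ms) |
---|---|
docker link | 0.5 |
hostname | 0.7 |
Service discovery | 0.9 |
HTTP SLB | 1.2 |
TCP SLB | 1.3 |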
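The averages in this table come from the `round-trip min/avg/max` line that httping prints at the end of each run. As a rough sketch, the table could be rebuilt from those lines as follows. The names `SUMMARIES`, `parse_summary` and `average_latencies` are made up for illustration, and the pattern assumes the exact summary format shown in this article; other httping versions may print the line differently.

```python
import re

# Summary lines copied from the httping runs in this article.
SUMMARIES = {
    "docker link": "round-trip min/avg/max = 0.4/0.5/1.6 ms",
    "hostname": "round-trip min/avg/max = 0.4/0.7/10.8 ms",
    "service discovery": "round-trip min/avg/max = 0.6/0.9/6.0 ms",
    "HTTP SLB": "round-trip min/avg/max = 1.0/1.2/2.9 ms",
    "TCP SLB": "round-trip min/avg/max = 1.0/1.3/2.7 ms",
}

PATTERN = re.compile(r"min/avg/max = ([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_summary(line):
    """Return (min, avg, max) in milliseconds from an httping summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not an httping summary line: {line!r}")
    return tuple(float(value) for value in match.groups())


def average_latencies(summaries):
    """Map each access method to its average latency in milliseconds."""
    return {name: parse_summary(line)[1] for name, line in summaries.items()}


if __name__ == "__main__":
    for name, avg in average_latencies(SUMMARIES).items():
        print(f"{name} | {avg} |")
```

Keeping the minimum and maximum alongside the average also makes outliers visible, such as the single 10.8 ms probe in the hostname run.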
Throughput tests:
- link
ab -lkc 10000 -n 10000 'http://app-link/ok'
Concurrency Level: 10000
Time taken for tests: 0.864 seconds
Complete requests: 10000
Failed requests: 0
Keep-Alive requests: 10000
Total transferred: 2020000 bytes
HTML transferred: 610000 bytes
Requests per second: 11571.74 [#/sec](mean)
Time per request: 864.174 [ms](mean)
Time per request: 0.086 [ms](mean, across all concurrent requests)
Transfer rate: 2282.71 [Kbytes/sec] received
- hostname
ab -lkc 10000 -n 10000 'http://app/ok'
Concurrency Level: 10000
Time taken for tests: 1.055 seconds
Complete requests: 10000
Failed requests: 0
Keep-Alive requests: 10000
Total transferred: 2020000 bytes
HTML transferred: 610000 bytes
Requests per second: 9476.49 [#/sec](mean)
Time per request: 1055.243 [ms](mean)
Time per request: 0.106 [ms](mean, across all concurrent requests)
Transfer rate: 1869.39 [Kbytes/sec] received
- Service discovery
ab -lkc 10000 -n 10000 'http://app.local/ok'
Concurrency Level: 10000
Time taken for tests: 4.276 seconds
Complete requests: 10000
Failed requests: 0
Keep-Alive requests: 10000
Total transferred: 3000000 bytes
HTML transferred: 610000 bytes
Requests per second: 2338.60 [#/sec](mean)
Time per request: 4276.066 [ms](mean)
Time per request: 0.428 [ms](mean, across all concurrent requests)
Transfer rate: 685.14 [Kbytes/sec] received
- HTTP SLB
ab -lkc 10000 -n 10000 'http://192.168.1.10/ok'
Concurrency Level: 10000
Time taken for tests: 6.308 seconds
Complete requests: 10000
Failed requests: 0
Non-2xx responses: 580
Keep-Alive requests: 10000
Total transferred: 2141800 bytes
HTML transferred: 732380 bytes
Requests per second: 1585.41 [#/sec](mean)
Time per request: 6307.517 [ms](mean)
Time per request: 0.631 [ms](mean, across all concurrent requests)
Transfer rate: 331.60 [Kbytes/sec] received
With 10000 concurrent requests, the SLB's nginx returned a 504 error for 580 of them. The response was:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<h1>504 Gateway Time-out</h1>
<p>The gateway did not receive a timely response from the upstream server or application.</body>
</html>
WARNING: Response code not 2xx (504)
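Note that ab still reports `Failed requests: 0` for this run: responses with a non-2xx status are counted separately, so the error rate has to be read from the `Non-2xx responses` line. A minimal sketch of that calculation, with hypothetical helper names (`ab_field`, `non_2xx_ratio`) and a report excerpt copied from the run above:

```python
import re

# Excerpt of the ab report for the HTTP SLB run in this article.
AB_REPORT = """\
Complete requests: 10000
Failed requests: 0
Non-2xx responses: 580
"""


def ab_field(report, label):
    """Return the integer on a 'label: value' line, or 0 if the line is absent.

    The other ab reports in this article contain no 'Non-2xx responses' line,
    so a missing line is treated as zero.
    """
    match = re.search(rf"^{re.escape(label)}:\s*(\d+)", report, re.MULTILINE)
    return int(match.group(1)) if match else 0


def non_2xx_ratio(report):
    """Fraction of completed requests that received a non-2xx status."""
    complete = ab_field(report, "Complete requests")
    if complete == 0:
        raise ValueError("report has no completed requests")
    return ab_field(report, "Non-2xx responses") / complete


if __name__ == "__main__":
    print(f"{non_2xx_ratio(AB_REPORT):.1%} of requests were not 2xx")
```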
- TCP SLB
ab -lkc 10000 -n 10000 'http://192.168.1.40/ok'
Concurrency Level: 10000
Time taken for tests: 1.891 seconds
Complete requests: 10000
Failed requests: 0
Keep-Alive requests: 10000
Total transferred: 2020000 bytes
HTML transferred: 610000 bytes
Requests per second: 5287.14 [#/sec](mean)
Time per request: 1891.383 [ms](mean)
Time per request: 0.189 [ms](mean, across all concurrent requests)
Transfer rate: 1042.97 [Kbytes/sec] received
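Test results
With 10000 concurrent requests, the number of requests handled per second is shown in the table below:
Access method | Throughput (RPS) |
---|---|
docker link | 11571.74 |
hostname | 9476.49 |
Service discovery | 2338.60 |
HTTP SLB | 1585.41 |
TCP SLB | 5287.14 |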
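The three timing figures in each ab report are different views of the same measurement. Because the concurrency here equals the total number of requests, the mean `Time per request` is simply the total run time in milliseconds, and `Requests per second` follows from it. The sketch below checks that relationship against the reported numbers. `AB_RUNS` and the function names are illustrative only, and the formulas reflect my understanding of how ab derives its report rather than anything stated in the original tests.

```python
# (concurrency, mean "Time per request" in ms, reported "Requests per second"),
# copied from the ab reports in this article.
AB_RUNS = {
    "docker link": (10000, 864.174, 11571.74),
    "hostname": (10000, 1055.243, 9476.49),
    "service discovery": (10000, 4276.066, 2338.60),
    "HTTP SLB": (10000, 6307.517, 1585.41),
    "TCP SLB": (10000, 1891.383, 5287.14),
}


def rps_from_time_per_request(concurrency, time_per_request_ms):
    """Requests per second implied by the mean time per request.

    Assumes time_per_request = concurrency * total_time / completed_requests,
    which gives completed / total_time = concurrency / time_per_request.
    """
    return concurrency * 1000.0 / time_per_request_ms


def time_across_all_ms(rps):
    """Mean time per request across all concurrent requests, in ms."""
    return 1000.0 / rps


if __name__ == "__main__":
    for name, (concurrency, tpr_ms, reported) in AB_RUNS.items():
        derived = rps_from_time_per_request(concurrency, tpr_ms)
        print(f"{name}: derived {derived:.2f} RPS, reported {reported:.2f} RPS")
```

One consequence is that the large `Time per request` values (for example 6307.517 ms for HTTP SLB) describe how long the whole batch of 10000 concurrent requests took, not the latency of a single isolated request.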
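Conclusions:
- Judging from the data, accessing services through the docker link mechanism gives the best latency and the best throughput, with the hostname method second
- Service discovery and HTTP SLB, which both work in HTTP mode, have the lowest throughput
- The concurrency performance of HTTP SLB does not look ideal: out of 10000 requests, 5.8% returned a 504 Gateway Time-out error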
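To make the comparison concrete, the two result tables can be ranked and normalised against docker link. This is a small sketch using only the figures reported in this article; `LATENCY_MS`, `THROUGHPUT_RPS`, `relative_to` and `rank` are hypothetical names.

```python
# Figures copied from the two result tables in this article.
LATENCY_MS = {
    "docker link": 0.5,
    "hostname": 0.7,
    "service discovery": 0.9,
    "HTTP SLB": 1.2,
    "TCP SLB": 1.3,
}
THROUGHPUT_RPS = {
    "docker link": 11571.74,
    "hostname": 9476.49,
    "service discovery": 2338.60,
    "HTTP SLB": 1585.41,
    "TCP SLB": 5287.14,
}


def relative_to(baseline, values):
    """Express every value as a multiple of the baseline entry."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


def rank(values, reverse=False):
    """Return the access methods sorted by value (ascending by default)."""
    return sorted(values, key=values.get, reverse=reverse)


if __name__ == "__main__":
    print("lowest latency first:", rank(LATENCY_MS))
    print("highest throughput first:", rank(THROUGHPUT_RPS, reverse=True))
    for name, ratio in relative_to("docker link", THROUGHPUT_RPS).items():
        print(f"{name}: {ratio:.0%} of docker link throughput")
```

The two rankings do not agree completely: TCP SLB has the highest average latency of the five, yet its throughput is well above both HTTP-mode options.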
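Open questions:
Although docker link and hostname deliver the best network performance, it is unclear how well they handle load balancing. During testing we found that the hostname method does balance load to some extent. However, the official help documentation lists hostname access under "access methods without load-balancing capability" while also describing it as "able to provide a certain degree of load balancing". Evidently Alibaba Cloud does not emphasize its load-balancing capability. For production use, load-balancing capability is also a very important criterion.
Finally, two questions still need to be confirmed with Alibaba Cloud:
- How capable are the link and hostname modes at load balancing in practice?
- Given three requirements (load-balancing capability, HTTP mode, and intranet use), is there a recommended network mode?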