The C10K problem
It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1200 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to become increasingly popular with large business customers.
And the thin client model of computing appears to be coming back in style -- this time with the server out on the Internet, serving thousands of clients.
With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, as that's my personal area of interest, but Windows is also covered a bit.
Contents
- Related Sites
- Book to Read First
- I/O frameworks
- I/O Strategies
  1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
     § The traditional select()
     § The traditional poll()
     § /dev/poll (Solaris 2.7+)
     § kqueue (FreeBSD, NetBSD)
  2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
     § kqueue (FreeBSD, NetBSD)
     § epoll (Linux 2.6+)
     § Polyakov's kevent (Linux 2.6+)
     § Drepper's New Network Interface (proposal for Linux 2.6+)
     § Realtime Signals (Linux 2.4+)
     § Signal-per-fd
  3. Serve many clients with each server thread, and use asynchronous I/O and completion notification
  4. Serve one client with each server thread
     § LinuxThreads (Linux 2.0+)
     § NGPT (Linux 2.4+)
     § NPTL (Linux 2.6, Red Hat 9)
     § FreeBSD threading support
     § NetBSD threading support
     § Solaris threading support
     § Java threading support in JDK 1.3.x and earlier
     § Note: 1:1 threading vs. M:N threading
  5. Build the server code into the kernel
- Comments
- Limits on open filehandles
- Limits on threads
- Java issues [Updated 27 May 2001]
- Other tips
  o Zero-Copy
  o The sendfile() system call can implement zero-copy networking.
  o Avoid small frames by using writev (or TCP_CORK)
  o Behave sensibly on overload.
  o Some programs can benefit from using non-Posix threads.
  o Caching your own data can sometimes be a win.
- Other limits
- Interesting select()-based servers
- Interesting /dev/poll-based servers
- Interesting kqueue()-based servers
- Interesting realtime signal-based servers
- Interesting thread-based servers
- Interesting in-kernel servers
Related Sites
In October 2003, Felix von Leitner put together an excellent web page and presentation about network scalability, complete with benchmarks comparing various networking system calls and operating systems. One of his observations is that the 2.6 Linux kernel really does beat the 2.4 kernel, but there are many, many good graphs that will give the OS developers food for thought for some time. (See also the Slashdot comments; it'll be interesting to see whether anyone does followup benchmarks improving on Felix's results.)
Book to Read First
If you haven't read it already, go out and get a copy of Unix Network Programming: Networking APIs: Sockets and XTI (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem. And while you're at it, go read Jeff Darcy's notes on high-performance server design.
(Another book which might be more helpful for those who are *using* rather than *writing* a web server is Building Scalable Web Sites by Cal Henderson.)
(Note: the 'thundering herd' problem, briefly: multiple threads or processes (on Linux there is little difference between the two) wait on the same socket event; when the event fires, they are all woken up at once. The kernel reschedules and wakes many of them to respond to a single event, only one actually handles it, and the rest go back to sleep after failing. That wasted wakeup-and-reschedule work is the thundering herd.)
I/O frameworks
Prepackaged libraries are available that abstract some of the techniques presented below, insulating your code from the operating system and making it more portable.
- ACE , a heavyweight C++ I/O framework, contains object-oriented implementations of some of these I/O strategies and many other useful things. In particular, his Reactor is an OO way of doing nonblocking I/O, and Proactor is an OO way of doing asynchronous I/O.
- ASIO is a C++ I/O framework which is becoming part of the Boost library. It's like ACE updated for the STL era.
- libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon will support poll and epoll. It's level-triggered only, I think, which has both good and bad sides. Niels has a nice graph of time to handle one event as a function of the number of connections. It shows kqueue and sys_epoll as clear winners.
- My own attempts at lightweight frameworks (sadly, not kept up to date):
o Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API using whatever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It's useful for benchmarks that compare the performance of the various APIs. This document links to Poller subclasses below to illustrate how each of the readiness APIs can be used.
o rn is a lightweight C I/O framework that was my second try after Poller. It's lgpl (so it's easier to use in commercial apps) and C (so it's easier to use in non-C++ apps). It was used in some commercial products.
- Matt Welsh wrote a paper in April 2000 about how to balance the use of worker thread and event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O framework.
- Cory Nelson's Scale! library - an async socket, file, and pipe I/O library for Windows
I/O Strategies
Designers of networking software have many options. Here are a few:
- Whether and how to issue multiple I/O calls from a single thread
o Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency
o Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and readiness notification (e.g. poll() or /dev/poll) to know when it's OK to start the next I/O on that channel. Generally only usable with network I/O, not disk I/O.
o Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g. signals or completion ports) to know when the I/O finishes. Good for both network and disk I/O.
- How to control the code servicing each client
o one process for each client (classic Unix approach, used since 1980 or so)
o one OS-level thread handles many clients; each client is controlled by:
§ a user-level thread (e.g. GNU state threads, classic Java with green threads)
§ a state machine (a bit esoteric, but popular in some circles; my favorite)
§ a continuation (a bit esoteric, but popular in some circles)
o one OS-level thread for each client (e.g. classic Java with native threads)
o one OS-level thread for each active client (e.g. Tomcat with apache front end; NT completion ports; thread pools)
- Whether to use standard O/S services, or put some code into the kernel (e.g. in a custom driver, kernel module, or VxD)
The following five combinations seem to be popular:
1 Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
2 Serve many clients with each thread, and use nonblocking I/O and readiness change notification
3 Serve many clients with each server thread, and use asynchronous I/O
4 Serve one client with each server thread, and use blocking I/O
5 Build the server code into the kernel
1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, whether or not you've done anything with that file descriptor since the last time the kernel told you about it. (The name 'level triggered' comes from computer hardware design; it's the opposite of 'edge triggered'. Jonathan Lemon introduced the terms in his BSDCON 2000 paper on kqueue().)
Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That's why it's important to use nonblocking mode when using readiness notification.
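To make the level-triggered pattern above concrete, here is a minimal sketch of a select()-based loop over nonblocking sockets. It is illustrative only, not code from this page: listen_fd, the buffer size, and the error handling are assumptions, and a real server would also watch for writability and track per-client state.

/* Illustrative sketch: level-triggered readiness with select() and O_NONBLOCK. */
#include <sys/select.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

static void set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

void select_loop(int listen_fd)   /* listen_fd: a bound, listening socket */
{
    int client[FD_SETSIZE], nclients = 0, maxfd = listen_fd;
    set_nonblocking(listen_fd);

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(listen_fd, &rfds);
        for (int i = 0; i < nclients; i++)
            FD_SET(client[i], &rfds);

        /* Level-triggered: a descriptor is reported as long as it stays readable. */
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR) continue;
            break;
        }

        if (FD_ISSET(listen_fd, &rfds)) {
            int c = accept(listen_fd, NULL, NULL);
            if (c >= 0 && c < FD_SETSIZE && nclients < FD_SETSIZE) {
                set_nonblocking(c);
                client[nclients++] = c;
                if (c > maxfd) maxfd = c;
            } else if (c >= 0) {
                close(c);   /* select() can't track fds >= FD_SETSIZE */
            }
        }

        for (int i = 0; i < nclients; i++) {
            if (!FD_ISSET(client[i], &rfds)) continue;
            char buf[4096];
            ssize_t n = read(client[i], buf, sizeof buf);
            if (n > 0) {
                /* ... parse the request, queue a nonblocking write() ... */
            } else if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
                /* readiness was only a hint, so EAGAIN is simply tolerated */
                close(client[i]);
                client[i--] = client[--nclients];
            }
        }
    }
}

The FD_SETSIZE ceiling and the O(n) scan over descriptors are exactly the limitations the select() and poll() items below discuss.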
An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or processes that do the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's 1999 Flash web server uses this trick; they gave a talk at Usenix '99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.
But in November 2003 on the freebsd-hackers list, Vivek Pai et al reported very good results using system-wide profiling of their Flash web server to attack bottlenecks. One bottleneck they found was mincore (guess that wasn't such a good idea after all). Another was the fact that sendfile blocks on disk access; they improved performance by introducing a modified sendfile() that returns something like EWOULDBLOCK when the disk page it's fetching is not yet in core. (Not sure how you tell the user the page is now resident... seems to me what's really needed here is aio_sendfile().) The end result of their optimizations is a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than anything on file at spec.org.
There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:
- The traditional select()
Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)
See Poller_select (cc , h ) for an example of how to use select() interchangeably with other readiness notification schemes.
- The traditional poll()
There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow about a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.
Some OS's (e.g. Solaris 8) speed up poll() et al by use of techniques like poll hinting, which wasimplemented and benchmarked by Niels Provos for Linux in 1999.
See Poller_poll (cc , h , benchmarks ) for an example of how to use poll() interchangeably with other readiness notification schemes.
- /dev/poll
This is the recommended poll replacement for Solaris.
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle.
It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().
Various implementations of /dev/poll were tried on Linux, but none of them perform as well as epoll, and were never really completed. /dev/poll use on Linux is not recommended.
See Poller_devpoll (cc, h, benchmarks) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution - the example is for Linux /dev/poll, might not work right on Solaris.)
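A rough sketch of the /dev/poll usage pattern just described, for Solaris. This is illustrative, not code from this page; the structure and ioctl names follow sys/devpoll.h, and fds/nfds, MAXFDS, and the omitted error handling are assumptions.

/* Illustrative sketch: Solaris /dev/poll -- register interest once, then
   repeatedly ask for the currently ready descriptors. */
#include <sys/devpoll.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <poll.h>

#define MAXFDS 10000

void devpoll_loop(int *fds, int nfds)   /* fds: nonblocking sockets to watch */
{
    int dpfd = open("/dev/poll", O_RDWR);
    struct pollfd ready[MAXFDS];

    /* Tell the OS once what we're interested in, by writing pollfd records. */
    for (int i = 0; i < nfds; i++) {
        struct pollfd pfd = { .fd = fds[i], .events = POLLIN, .revents = 0 };
        write(dpfd, &pfd, sizeof pfd);
    }

    for (;;) {
        struct dvpoll dp;
        dp.dp_fds = ready;
        dp.dp_nfds = MAXFDS;
        dp.dp_timeout = -1;            /* block until at least one fd is ready */

        int n = ioctl(dpfd, DP_POLL, &dp);
        for (int i = 0; i < n; i++) {
            /* ready[i].fd is readable; do nonblocking I/O on it */
        }
    }
}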
- kqueue()
This is the recommended poll replacement for FreeBSD (and, soon, NetBSD).
See below. kqueue() can specify either edge triggering or level triggering.
2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later, when that descriptor transitions from not ready to ready , the kernel notifies you somehow. It then assumes you know the file descriptor is ready, and will not send any more readiness notifications of that type for that file descriptor until you do something that causes the file descriptor to no longer be ready (e.g. until you receive the EWOULDBLOCK error on a send, recv, or accept call, or a send or recv transfers less than the requested number of bytes).
When you use readiness change notification, you must be prepared for spurious events, since one common implementation is to signal readiness whenever any packets are received, regardless of whether the file descriptor was already ready.
This is the opposite of "level-triggered " readiness notification. It's a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking clients with OpenSSL easier, so it's worth trying.
[Banga, Mogul, Druschel '99] described this kind of scheme in 1999.
There are several APIs which let the application retrieve 'file descriptor became ready' notifications:
- kqueue() This is the recommended edge-triggered poll replacement for FreeBSD (and, soon, NetBSD).
FreeBSD 4.3 and later, and NetBSD-current as of Oct 2002 , support a generalized alternative to poll() called kqueue()/kevent() ; it supports both edge-triggering and level-triggering. (See also Jonathan Lemon's page and his BSDCon 2000 paper on kqueue() .)
Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.
Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.
See Poller_kqueue (cc , h , benchmarks ) for an example of how to use kqueue() interchangeably with many other readiness notification schemes.
Examples and libraries using kqueue():
o PyKQueue -- a Python binding for kqueue()
o Ronald F. Guilmette's example echo server ; see also his 28 Sept 2000 post on freebsd.questions.
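As a rough illustration of the kqueue()/kevent() pattern (not code from this page), the sketch below registers a listening socket with EV_CLEAR to get edge-triggered behavior; listen_fd and MAXEVENTS are assumptions and error handling is omitted.

/* Illustrative sketch: kqueue()/kevent() with edge triggering (EV_CLEAR). */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

#define MAXEVENTS 1024

void kqueue_loop(int listen_fd)   /* listen_fd: nonblocking listening socket */
{
    int kq = kqueue();
    struct kevent change, events[MAXEVENTS];

    /* EV_CLEAR asks for edge-triggered behavior; drop it for level-triggered. */
    EV_SET(&change, listen_fd, EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);

    for (;;) {
        int n = kevent(kq, NULL, 0, events, MAXEVENTS, NULL);
        for (int i = 0; i < n; i++) {
            int fd = (int)events[i].ident;
            if (fd == listen_fd) {
                /* accept() until EWOULDBLOCK; set each new socket nonblocking
                   and register it with another EV_SET()/kevent() call */
            } else {
                /* read() until EWOULDBLOCK -- with edge triggering you must
                   drain the socket or you may not be told about it again */
            }
        }
    }
}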
- epoll
This is the recommended edge-triggered poll replacement for the 2.6 Linux kernel.
On 11 July 2001, Davide Libenzi proposed an alternative to realtime signals; his patch provides what he now calls /dev/epoll (www.xmailserver.org/linux-patches/nio-improve.html). This is just like the realtime signal readiness notification, but it coalesces redundant events, and has a more efficient scheme for bulk event retrieval.
Epoll was merged into the 2.5 kernel tree as of 2.5.46 after its interface was changed from a special file in /dev to a system call, sys_epoll. A patch for the older version of epoll is available for the 2.4 kernel.
There was a lengthy debate about unifying epoll, aio, and other event sources on the linux-kernel mailing list around Halloween 2002. It may yet happen, but Davide is concentrating on firming up epoll in general first.
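For comparison with the kqueue() sketch above, here is a minimal, illustrative edge-triggered epoll loop for Linux 2.6 (this is not from the original article; listen_fd and MAXEVENTS are assumptions and error handling is omitted).

/* Illustrative sketch: edge-triggered epoll (Linux 2.6+). */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define MAXEVENTS 1024

void epoll_loop(int listen_fd)   /* listen_fd: nonblocking listening socket */
{
    int epfd = epoll_create(1024);   /* size is only a hint on modern kernels */
    struct epoll_event ev, events[MAXEVENTS];

    ev.events = EPOLLIN | EPOLLET;   /* EPOLLET selects edge-triggered mode */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, MAXEVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int c;
                /* Accept until EAGAIN, since the edge won't be re-reported. */
                while ((c = accept(listen_fd, NULL, NULL)) >= 0) {
                    fcntl(c, F_SETFL, fcntl(c, F_GETFL, 0) | O_NONBLOCK);
                    ev.events = EPOLLIN | EPOLLET;
                    ev.data.fd = c;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);
                }
            } else {
                /* read() until EAGAIN, then process whatever was gathered. */
            }
        }
    }
}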
- Polyakov's kevent (Linux 2.6+) News flash: On 9 Feb 2006, and again on 9 July 2006, Evgeniy Polyakov posted patches which seem to unify epoll and aio; his goal is to support network AIO. See:
o the LWN article about kevent
- Drepper's New Network Interface (proposal for Linux 2.6+)
At OLS 2006, Ulrich Drepper proposed a new high-speed asynchronous networking API. See:
o his paper, "The Need for Asynchronous, Zero-Copy Network I/O "
- Realtime Signals
This is the recommended edge-triggered poll replacement for the 2.4 Linux kernel.
The 2.4 linux kernel can deliver socket readiness events via a particular realtime signal. Here's how to turn this behavior on:
/* Mask off SIGIO and the signal you want to use. */
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigaddset(&sigset, SIGIO);
sigprocmask(SIG_BLOCK, &sigset, NULL);
/* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
fcntl(fd, F_SETOWN, (int) getpid());
fcntl(fd, F_SETSIG, signum);
flags = fcntl(fd, F_GETFL);
flags |= O_NONBLOCK|O_ASYNC;
fcntl(fd, F_SETFL, flags);
This sends that signal when a normal I/O function like read() or write() completes. To use this, write a normal poll() outer loop, and inside it, after you've handled all the fd's noticed by poll(), you loop calling sigwaitinfo().
If sigwaitinfo or sigtimedwait returns your realtime signal, siginfo.si_fd and siginfo.si_band give almost the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you handle the i/o, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed, so you flush the signal queue by temporarily changing the signal handler to SIG_DFL , and break back to the outer poll() loop.
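A sketch of the loop structure just described, continuing the setup snippet above. It is illustrative only: fds/nfds stand for the ordinary pollfd array of the outer loop, signum/sigset come from the setup code, and the overflow recovery is left as a comment.

/* Illustrative sketch of the rtsignal event loop (Linux 2.4). */
#include <poll.h>
#include <signal.h>

void rtsig_loop(struct pollfd *fds, int nfds, int signum, sigset_t *sigset)
{
    for (;;) {
        /* Outer loop: one ordinary poll() pass; service everything it reports. */
        poll(fds, nfds, -1);
        /* ... handle each fds[i] whose revents is set ... */

        /* Inner loop: consume queued per-fd readiness signals. */
        for (;;) {
            siginfo_t info;
            int sig = sigwaitinfo(sigset, &info);

            if (sig == signum) {
                /* info.si_fd and info.si_band play the roles of pollfd.fd and
                   pollfd.revents; do nonblocking I/O on that descriptor. */
            } else if (sig == SIGIO) {
                /* The signal queue overflowed: flush it (e.g. by temporarily
                   setting the handler to SIG_DFL, as described above) and fall
                   back to the outer poll() pass to recover. */
                break;
            }
        }
    }
}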
See Poller_sigio (cc , h ) for an example of how to use rtsignals interchangeably with many other readiness notification schemes.
See Zach Brown's phhttpd for example code that uses this feature directly. (Or don't; phhttpd is a bit hard to figure out...)
[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)
Signal-per-fd
Chandra and Mosberger proposed a modification to the realtime signal approach called "signal-per-fd" which reduces or eliminates realtime signal queue overflow by coalescing redundant events. It doesn't outperform epoll, though. Their paper ( www.hpl.hp.com/techreports/2000/HPL-2000-174.html ) compares performance of this scheme with select() and /dev/poll.
Vitaly Luban announced a patch implementing this scheme on 18 May 2001; his patch lives at www.luban.org/GPL/gpl.html . (Note: as of Sept 2001, there may still be stability problems with this patch under heavy load. dkftpbench at about 4500 users may be able to trigger an oops.)
See Poller_sigfd (cc , h ) for an example of how to use signal-per-fd interchangeably with many other readiness notification schemes.
3. Serve many clients with each server thread, and use asynchronous I/O
This has not yet become popular in Unix, probably because few operating systems support asynchronous I/O, also possibly because it (like nonblocking I/O) requires rethinking your application. Under standard Unix, asynchronous I/O is provided by the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2.
AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the operation is complete. (It can also be used with level triggered completion notification by calling aio_suspend() , though I suspect few people do this.)
glibc 2.1 and later provide a generic implementation written for standards compliance rather than performance.
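To illustrate the aio_ interface described above (a sketch under assumptions, not the article's code): an aio_read() is started with a queued realtime signal requested on completion. RTSIG, the buffer, and the helper names are made up for the example, and error handling is omitted; on Linux, link with -lrt.

/* Illustrative sketch: POSIX AIO with signal-based completion notification. */
#include <aio.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define RTSIG (SIGRTMIN + 1)     /* assumption: a free realtime signal */

static char buf[4096];
static struct aiocb cb;          /* must stay valid until the I/O completes */

void start_async_read(int fd)
{
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    /* Ask for a queued realtime signal on completion; sival_ptr lets the
       consumer find the control block again. */
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = RTSIG;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;

    aio_read(&cb);
}

/* Called once the signal arrives (e.g. sigwaitinfo() returned RTSIG and
   si_value.sival_ptr pointed here). */
void on_completion(struct aiocb *acb)
{
    if (aio_error(acb) == 0) {
        ssize_t n = aio_return(acb);   /* number of bytes transferred */
        (void)n;
        /* ... process buf ... */
    }
}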
Ben LaHaise's implementation for Linux AIO was merged into the main Linux kernel as of 2.5.32. It doesn't use kernel threads, and has a very efficient underlying api, but (as of 2.6.0-test2) doesn't yet support sockets. (There is also an AIO patch for the 2.4 kernels, but the 2.5/2.6 implementation is somewhat different.) More info:
- The page "Kernel Asynchronous I/O (AIO) Support for Linux " which tries to tie together all info about the 2.6 kernel's implementation of AIO (posted 16 Sept 2003)
- Round 3: aio vs /dev/epoll by Benjamin C.R. LaHaise (presented at 2002 OLS)
- Asynchronous I/O Support in Linux 2.5, by Bhattacharya, Pratt, Pulaverty, and Morgan, IBM; presented at OLS '2003
- Design Notes on Asynchronous I/O (aio) for Linux by Suparna Bhattacharya -- compares Ben's AIO with SGI's KAIO and a few other AIO projects
- Linux AIO home page - Ben's preliminary patches, mailing list, etc.
- linux-aio mailing list archives
- libaio-oracle - library implementing standard Posix AIO on top of libaio. First mentioned by Joel Becker on 18 Apr 2003 .
Suparna also suggests having a look at the DAFS API's approach to AIO.
Red Hat AS and Suse SLES both provide a high-performance implementation on the 2.4 kernel; it is related to, but not completely identical to, the 2.6 kernel implementation.
In February 2006, a new attempt is being made to provide network AIO; see the note above about Evgeniy Polyakov's kevent-based AIO .
In 1999, SGI implemented high-speed AIO for Linux . As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads. It is still useful for people who can't wait for Ben's AIO to support sockets.
The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.
A tutorial for the earlier, nonstandard, aio implementation on Solaris is online at Sunsite . It's probably worth a look, but keep in mind you'll need to mentally convert "aioread" to "aio_read", etc.
Note that AIO doesn't provide a way to open files without blocking for disk I/O; if you care about the sleep caused by opening a disk file, Linus suggests you should simply do the open() in a different thread rather than wishing for an aio_open() system call.
Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and IOCP or "I/O Completion Port". Microsoft's IOCP combines techniques from the prior art like asynchronous I/O (like aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant. For more information, see Inside I/O Completion Ports by Mark Russinovich at sysinternals.com, Jeffrey Richter's book "Programming Server-Side Applications for Microsoft Windows 2000" (Amazon , MSPress ), U.S. patent #06223207 , or MSDN .
4. Serve one client with each server thread
... and let read() and write() block. Has the disadvantage of using a whole stack frame for each client, which costs memory. Many OS's also have trouble handling more than a few hundred threads. If each thread gets a 2MB stack (not an uncommon default value), you run out of *virtual memory* at (2^30 / 2^21) = 512 threads on a 32 bit machine with 1GB user-accessible VM (like, say, Linux as normally shipped on x86). You can work around this by giving each thread a smaller stack, but since most thread libraries don't allow growing thread stacks once created, doing this means designing your program to minimize stack use. You can also work around this by moving to a 64 bit processor.
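A minimal sketch of this model (illustrative; not from the original text): one detached thread per accepted connection, blocking read()/write(), and a deliberately small stack set with pthread_attr_setstacksize() to push back the virtual-memory ceiling discussed above. The 64KB figure, the echo-style handler, and the omitted error handling are assumptions.

/* Illustrative sketch: one blocking-I/O thread per client, small stacks. */
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>
#include <stdint.h>

static void *serve_client(void *arg)
{
    int fd = (int)(intptr_t)arg;
    char buf[4096];
    ssize_t n;
    /* Plain blocking read()/write(): the simplicity this model buys... */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(fd, buf, (size_t)n);
    close(fd);
    return NULL;
}

void accept_loop(int listen_fd)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* ...and its cost: one stack per client. Shrinking the stack from the
       2MB-ish default raises the ceiling imposed by virtual memory. */
    pthread_attr_setstacksize(&attr, 64 * 1024);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    for (;;) {
        int c = accept(listen_fd, NULL, NULL);
        if (c < 0)
            continue;
        pthread_t tid;
        if (pthread_create(&tid, &attr, serve_client, (void *)(intptr_t)c) != 0)
            close(c);   /* out of threads: shed the connection */
    }
}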
The thread support in Linux, FreeBSD, and Solaris is improving, and 64 bit processors are just around the corner even for mainstream users. Perhaps in the not-too-distant future, those who prefer using one thread per client will be able to use that paradigm even for 10000 clients. Nevertheless, at the current time, if you actually want to support that many clients, you're probably better off using some other paradigm.
For an unabashedly pro-thread viewpoint, see Why Events Are A Bad Idea (for High-concurrency Servers) by von Behren, Condit, and Brewer, UCB, presented at HotOS IX. Anyone from the anti-thread camp care to point out a paper that rebuts this one? :-)
LinuxThreads
LinuxThreads is the name for the standard Linux thread library. It has been integrated into glibc since glibc 2.0, and is mostly Posix-compliant, but with less than stellar performance and signal support.
NGPT: Next Generation Posix Threads for Linux
NGPT is a project started by IBM to bring good Posix-compliant thread support to Linux. It's at stable version 2.2 now, and works well... but the NGPT team has announced that they are putting the NGPT codebase into support-only mode because they feel it's "the best way to support the community for the long term". The NGPT team will continue working to improve Linux thread support, but now focused on improving NPTL. (Kudos to the NGPT team for their good work and the graceful way they conceded to NPTL.)
NPTL: Native Posix Thread Library for Linux
NPTL is a project by Ulrich Drepper (the benevolent dict^H^H^H^Hmaintainer of glibc ) and Ingo Molnar to bring world-class Posix threading support to Linux.
As of 5 October 2003, NPTL is now merged into the glibc cvs tree as an add-on directory (just like linuxthreads), so it will almost certainly be released along with the next release of glibc.
The first major distribution to include an early snapshot of NPTL was Red Hat 9. (This was a bit inconvenient for some users, but somebody had to break the ice...)
NPTL links:
- Mailing list for NPTL discussion
- NPTL source code
- Initial announcement for NPTL
- Original whitepaper describing the goals for NPTL
- Revised whitepaper describing the final design of NPTL
- Ingo Molnar's first benchmark showing it could handle 10^6 threads
- Ulrich's benchmark comparing performance of LinuxThreads, NPTL, and IBM's NGPT . It seems to show NPTL is much faster than NGPT.
Here's my try at describing the history of NPTL (see also Jerry Cooperstein's article):
In March 2002, Bill Abt of the NGPT team, the glibc maintainer Ulrich Drepper, and others met to figure out what to do about LinuxThreads. One idea that came out of the meeting was to improve mutex performance; Rusty Russell et al subsequently implemented fast userspace mutexes (futexes), which are now used by both NGPT and NPTL. Most of the attendees figured NGPT should be merged into glibc.
Ulrich Drepper, though, didn't like NGPT, and figured he could do better. (For those who have ever tried to contribute a patch to glibc, this may not come as a big surprise :-) Over the next few months, Ulrich Drepper, Ingo Molnar, and others contributed glibc and kernel changes that make up something called the Native Posix Threads Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT, and takes advantage of a few new ones. Ingo Molnar described the kernel enhancements as follows:
While NPTL uses the three kernel features introduced by NGPT: getpid() returns PID, CLONE_THREAD and futexes; NPTL also uses (and relies on) a much wider set of new kernel features, developed as part of this project.
Some of the items NGPT introduced into the kernel around 2.5.8 got modified, cleaned up and extended, such as thread group handling (CLONE_THREAD). [the CLONE_THREAD changes which impacted NGPT's compatibility got synced with the NGPT folks, to make sure NGPT does not break in any unacceptable way.]
The kernel features developed for and used by NPTL are described in the design whitepaper, http://people.redhat.com/drepper/nptl-design.pdf ...
A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and support for detached threads.
There was also work put into extending the PID space - eg. procfs crashed due to 64K PID assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-only improvements were done as well.
In essence the new features are a no-compromises approach to 1:1 threading - the kernel now helps in everything where it can improve threading, and we precisely do the minimally necessary set of context switches and kernel calls for every basic threading primitive.
One big difference between the two is that NPTL is a 1:1 threading model, whereas NGPT is an M:N threading model (see below). In spite of this, Ulrich's initial benchmarks seem to show that NPTL is indeed much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich's benchmark code to verify the result.)
FreeBSD threading support
FreeBSD supports both LinuxThreads and a userspace threading library. Also, an M:N implementation called KSE was introduced in FreeBSD 5.0.
On 25 Mar 2003, Jeff Roberson posted on freebsd-arch :
... Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen, and everyone else who has participated with KSE and libpthread development Mini and I have developed a 1:1 threading implementation. This code works in parallel with KSE and does not break it in any way. It actually helps bring M:N threading closer by testing out shared bits. ...
And in July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in FreeBSD 7.x:
I know this has been discussed in the past, but I figured with 7.x trundling forward, it was time to think about it again. In benchmarks for many common applications and scenarios, libthr demonstrates significantly better performance over libpthread... libthr is also implemented across a larger number of our platforms, and is already libpthread on several. The first recommendation we make to MySQL and other heavy thread users is "Switch to libthr", which is suggestive, also! ... So the strawman proposal is: make libthr the default threading library on 7.x.
NetBSD threading support
According to a note from Noriyuki Soda:
Kernel supported M:N thread library based on the Scheduler Activations model is merged into NetBSD-current on Jan 18 2003.
For details, see An Implementation of Scheduler Activations on the NetBSD Operating System by Nathan J. Williams, Wasabi Systems, Inc., presented at FREENIX '02.
Solaris threading support
The thread support in Solaris is evolving... from Solaris 2 to Solaris 8, the default threading library used an M:N model, but Solaris 9 defaults to 1:1 model thread support. See Sun's multithreaded programming guide and Sun's note about Java and Solaris threading.
Java threading support in JDK 1.3.x and earlier
As is well known, Java up to JDK1.3.x did not support any method of handling network connections other than one thread per client. Volanomark is a good microbenchmark which measures throughput in messages per second at various numbers of simultaneous connections. As of May 2003, JDK 1.3 implementations from various vendors are in fact able to handle ten thousand simultaneous connections -- albeit with significant performance degradation. See Table 4 for an idea of which JVMs can handle 10000 connections, and how performance suffers as the number of connections increases.
Note: 1:1 threading vs. M:N threading
There is a choice when implementing a threading library: you can either put all the threading support in the kernel (this is called the 1:1 threading model), or you can move a fair bit of it into userspace (this is called the M:N threading model). At one point, M:N was thought to be higher performance, but it's so complex that it's hard to get right, and most people are moving away from it.
- Why Ingo Molnar prefers 1:1 over M:N
- Sun is moving to 1:1 threads
- NGPT is an M:N threading library for Linux.
- Although Ulrich Drepper planned to use M:N threads in the new glibc threading library, he has since switched to the 1:1 threading model.
- MacOSX appears to use 1:1 threading.
- FreeBSD and NetBSD appear to still believe in M:N threading... The lone holdouts? Looks like freebsd 7.0 might switch to 1:1 threading (see above), so perhaps M:N threading's believers have finally been proven wrong everywhere.
5. Build the server code into the kernel
Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux , and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown's remarks about userland vs. kernel http servers. It appears that the 2.4 linux kernel provides sufficient power to user programs, as the X15 server runs about as fast as Tux, but doesn't use any kernel modifications.
Comments
Richard Gooch has written a paper discussing I/O options .
In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers. Their data is worth a look.
In 2003, Tim Brecht posted source code for userver , a small web server put together from several servers written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(), poll(), epoll(), or sigio.
Back in March 1999, Dean Gaudet posted :
I keep getting asked "why don't you guys use a select/event based model like Zeus? It's clearly the fastest." ...
His reasons boiled down to "it's really hard, and the payoff isn't clear". Within a few months, though, it became clear that people were willing to work on it.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even if he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, Ingo Molnar's rebuttal of 30 April 1999, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.
See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events if other threads that had picked up completion events from this port are sleeping (perhaps doing blocking I/O).
See also OS/400's support for I/O completion ports .
There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections " (and the second week of the thread). Highlights:
- Ed Hall posted a few notes on his experiences; he's achieved >1000 connects/second on a UP P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU) each managing a large number of clients using "an event-based model".
- Mike Jagdis posted an analysis of poll/select overhead , and said "The current select/poll implementation can be improved significantly, especially in the blocking case, but the overhead will still increase with the number of descriptors because select/poll does not, and cannot, remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions are welcome..."
- Mike posted about his work on improving select() and poll() .
- Mike posted a bit about a possible API to replace poll()/select() : "How about a 'device like' API where you write 'pollfd like' structs, the 'device' listens for events and delivers 'pollfd like' structs representing them when you read it? ... "
- Rogier Wolff suggested using "the API that the digital guys suggested", http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
- Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives should certainly be able to do what Win32's WaitForMultipleObjects can, at least.
- Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and sigwaitinfo() was a superset of the API proposed in http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal blocked at all times if you're interested in performance; instead of the signal being delivered asynchronously, the process grabs the next one from the queue with sigwaitinfo().
- Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and concluded they're pretty similar.
- Alan Cox noted that an older rev of SCT's SIGIO patch is included in 2.3.18ac.
- Jordan Mendelson posted some example code showing how to use F_SETSIG.
- Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted: "With a signal dequeuing mechanism, your application is going to get signals destined for various library components if libraries are using the same mechanism," but the library can set up its own signal handler, so this shouldn't affect the program (much).
- Doug Royer noted that he'd gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server. Others chimed in with estimates of how much RAM that would require on Linux, and what bottlenecks would be hit.
Interesting reading!
Limits on open filehandles
- Any Unix: the limits set by ulimit or setrlimit.
- Solaris: see the Solaris FAQ, question 3.46 (or thereabouts; they renumber the questions periodically).
- FreeBSD:
Edit /boot/loader.conf, add the line
set kern.maxfiles=XXXX
where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and says
"FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially, via sysctl.... You have to do it in the /boot/loader.conf file.
The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable.
You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive."
- Another reader says
"As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have to worry about one mbuf being chewed up per connection."
- See also:
o SYSCTL TUNING , LOADER TUNABLES , and KERNEL CONFIG TUNING in 'man tuning'
o The Effects of Tuning a FreeBSD 4.3 Box for High Performance , Daemon News, Aug 2001
o postfix.org tuning notes , covering FreeBSD 4.2 and 4.4
o the Measurement Factory's notes , circa FreeBSD 4.3
- OpenBSD: A reader says
"In OpenBSD, an additional tweak is required to increase the number of open filehandles available per process: the openfiles-cur parameter in /etc/login.conf needs to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf but it has no effect. This matters because as shipped, the login.conf limits are a quite low 64 for nonprivileged processes, 128 for privileged."
- Linux: See Bodo Bauer's /proc documentation. On 2.4 kernels:
echo 32768 > /proc/sys/fs/file-max
increases the system limit on open files, and
ulimit -n 32768
increases the current process' limit.
On 2.2.x kernels,
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max
increases the system limit on open files, and
ulimit -n 32768
increases the current process' limit.
I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with the above changes.
See also Oskar's 1998 post , which talks about the per-process and system-wide limits on file descriptors in the 2.0.36 kernel.
Limits on threads
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. You can set this at runtime with pthread_attr_init() if you're using pthreads.
- Solaris: it supports as many threads as will fit in memory, I hear.
- Linux 2.6 kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above 32000 or so threads. (You'll need to use very small stack threads to get anywhere near that number of threads, though, unless you're on a 64 bit processor.) See the NPTL mailing list, e.g. the thread with subject "Cannot create more than 32K threads? ", for more info.
- Linux 2.4: /proc/sys/kernel/threads-max is the max number of threads; it defaults to 2047 on my Red Hat 8 system. You can increase this as usual by echoing new values into that file, e.g. "echo 4000 > /proc/sys/kernel/threads-max"
- Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don't know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit. It appears to be integrated into 2.3.20.
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel . Wow. This document steps you through a lot of stuff that would be hard to figure out yourself, but is somewhat dated.
- Java: See Volano's detailed benchmark info , plus their info on how to tune various systems to handle lots of threads.
Java issues
Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model . There was a way to do nonblocking reads, but no way to do nonblocking writes.
In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!
HP's java also includes a Thread Polling API .
In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio ; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.
See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.
Before NIO, there were several proposals for improving Java's networking APIs:
- Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory management changes to allow the use of asynchronous I/O with Java.
- Interfacing Java to the Virtual Interface Architecture , by C-C. Chang and T. von Eicken, proposes memory management changes to allow the use of asynchronous I/O with Java.
- JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who says Sun doesn't listen?).
Other tips
- Zero-Copy
Normally, data gets copied many times on its way from here to there. Any scheme that eliminates these copies to the bare physical minimum is called "zero-copy".
o Thomas Ogrisegg's zero-copy send patch for mmaped files under Linux 2.4.17-2.4.20. Claims it's faster than sendfile().
o IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
o Alan Cox noted that zero-copy is sometimes not worth the trouble back in 1999. (He did like sendfile(), though.)
o Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and says he'll make it available to userspace soon.
o Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD ; the idea seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the amount of data transferred is at least a page, *and* you don't immediately reuse the buffer, memory management tricks will be used to avoid copies. But see followups to this message on linux-kernel for people's misgivings about the speed of those memory management tricks.
According to a note from Noriyuki Soda:
Sending side zero-copy is supported since NetBSD-1.6 release by specifying "SOSEND_LOAN" kernel option. This option is now default on NetBSD-current (you can disable this feature by specifying "SOSEND_NO_LOAN" in the kernel option on NetBSD_current). With this feature, zero-copy is automatically enabled, if data more than 4096 bytes are specified as data to be sent.
o The sendfile() system call can implement zero-copy networking.
The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file. This lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using nonblocking I/O. (In Linux, it's poorly documented at the moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this. See also Exploring The sendfile System Call by Jeff Tranter in Linux Gazette issue 91.) Rumor has it, ftp.cdrom.com benefitted noticeably from sendfile(). (A minimal Linux sketch combining sendfile() with TCP_CORK appears after this list.)
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001 .
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of POLLOUT makes a big difference.
Solaris 8 (as of the July 2001 update) has a new system call 'sendfilev'. A copy of the man page is here. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most useful when sending to a socket in blocking mode; it'd be a bit of a pain to use with a nonblocking socket.
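As a rough illustration only (not code from any server mentioned here), a Linux server might push a file out a possibly-nonblocking socket roughly like this; send_whole_file(), sock_fd, and file_fd are made-up names, and FreeBSD's sendfile() takes different arguments, so treat this as a Linux-flavored sketch:

    #include <errno.h>
    #include <sys/sendfile.h>   /* Linux-specific header */
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send an entire open file down a socket with sendfile().
       sock_fd and file_fd are assumed to be set up by the caller.
       Returns 0 on success, -1 on error. */
    int send_whole_file(int sock_fd, int file_fd)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset,
                                 (size_t)(st.st_size - offset));
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                if (errno == EAGAIN)
                    continue;   /* nonblocking socket isn't ready; a real server
                                   would go back to its select()/poll() loop here
                                   rather than spin */
                return -1;      /* genuine error */
            }
            if (n == 0)
                break;          /* file shrank underneath us; give up */
            /* the kernel already advanced 'offset' by n bytes */
        }
        return (offset == st.st_size) ? 0 : -1;
    }

In a real event-driven server you'd return to the readiness loop on EAGAIN and resume when the socket becomes writable again, rather than looping as this sketch does.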
- Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some reason. Unsetting the option flushes the buffer. Better to use writev(), though...
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.
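As a hedged sketch (the function and buffer names are invented for illustration, not taken from any server above): you can either gather the pieces into one writev() call, or, on Linux, bracket the separate writes with TCP_CORK:

    #include <netinet/in.h>    /* IPPROTO_TCP */
    #include <netinet/tcp.h>   /* TCP_CORK (Linux-specific) */
    #include <sys/socket.h>
    #include <sys/uio.h>       /* writev() */
    #include <unistd.h>

    /* Option 1: gather the header and body into one writev() call so the
       kernel sees them together and can fill frames properly. */
    ssize_t send_reply_writev(int sock_fd,
                              const char *hdr, size_t hdr_len,
                              const char *body, size_t body_len)
    {
        struct iovec iov[2];
        iov[0].iov_base = (void *)hdr;  iov[0].iov_len = hdr_len;
        iov[1].iov_base = (void *)body; iov[1].iov_len = body_len;
        return writev(sock_fd, iov, 2);   /* may be a short write; caller checks */
    }

    /* Option 2 (Linux only): cork the socket around writes you can't bundle,
       then uncork to flush whatever partial frame is left. */
    void send_reply_corked(int sock_fd,
                           const char *hdr, size_t hdr_len,
                           const char *body, size_t body_len)
    {
        int on = 1, off = 0;
        setsockopt(sock_fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
        write(sock_fd, hdr, hdr_len);     /* error handling omitted in this sketch */
        write(sock_fd, body, body_len);
        setsockopt(sock_fd, IPPROTO_TCP, TCP_CORK, &off, sizeof off);
    }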
- Behave sensibly on overload.
[Provos, Lever, and Tweedie 2000] notes that dropping incoming connections when the server is overloaded improved the shape of the performance curve and reduced the overall error rate. They used a smoothed version of "number of clients with I/O ready" as a measure of overload. This technique should be easily applicable to servers written with select, poll, or any system call that returns a count of readiness events per call (e.g. /dev/poll or sigtimedwait()). A rough sketch of the idea appears at the end of this list of tips.
- Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of native threads rather than pthreads. (A minimal clone() sketch also appears at the end of this list of tips.)
- Caching your own data can sometimes be a win.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai (vivek@cs.rice.edu) on new-httpd , May 9th, states:
"I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available athttp://www.cs.rice.edu/~vivek/flash99/ "
Other limits
- Old system libraries might use 16-bit variables to hold file handles, which causes trouble above 32767 handles. glibc 2.1 should be OK.
- Many systems use 16-bit variables to hold process or thread IDs. It would be interesting to port the Volano scalability benchmark to C, and see what the upper limit on number of threads is for the various operating systems.
- Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
- Look at the performance comparison graph at the bottom of http://www.acme.com/software/thttpd/benchmarks.html . Notice how various servers have trouble above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open, you would expect exactly this behavior. There may be other causes.
Kernel Issues
For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project . They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem .
See also Mike Jagdis' work on improving select() and poll() ; here's Mike's post about it.
Measuring Server Performance
Two tests in particular are simple, interesting, and hard:
1. raw connections per second (how many 512-byte files per second can you serve?)
2. total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)
Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking . It's worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.
Examples
Interesting select()-based servers
- thttpd. Very simple. Uses a single process. It has good performance, but doesn't scale with the number of CPUs. Can also use kqueue.
- mathopd . Similar to thttpd.
- fhttpd
- boa
- Roxen
- Zeus , a commercial server that tries to be the absolute fastest. See their tuning guide .
- The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
- BetaFTPd
- Flash-Lite - web server using IO-Lite.
- Flash: An efficient and portable Web server -- uses select(), mmap(), mincore()
- The Flash web server as of 2003 -- uses select(), modified sendfile(), async open()
- xitami - uses select() to implement its own thread abstraction for portability to systems without threads.
- Medusa - a server-writing toolkit in Python that tries to deliver very high performance.
- userver - a small http server that can use select, poll, epoll, or sigio
Interesting /dev/poll-based servers
- N. Provos, C. Lever , "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to support /dev/poll. Performance is compared with phhttpd.
Interesting kqueue()-based servers
- thttpd (as of version 2.21?)
- Adrian Chadd says "I'm doing a lot of work to make squid actually LIKE a kqueue IO system"; it's an official Squid subproject; see http://squid.sourceforge.net/projects.html#commloops . (This is apparently newer than Benno's patch.)
Interesting realtime signal-based servers
- Chromium's X15. This uses the 2.4 kernel's SIGIO feature together with sendfile() and TCP_CORK, and reportedly achieves higher speed than even TUX. The source is available under a community source (not open source) license. See the original announcement by Fabio Riccardi.
- Zach Brown's phhttpd - "a quick web server that was written to showcase the sigio/siginfo event model. consider this code highly experimental and yourself highly mental if you try and use it in a production environment." Uses the siginfo features of 2.3.21 or later, and includes the needed patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for some notes.
Interesting thread-based servers
- Hoser FTPD . See their benchmark page .
- Peter Eriksson's phttpd and
- pftpd
- The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
- Sun's Java Web Server (which has been reported to handle 500 simultaneous clients )
Interesting in-kernel servers
- khttpd
- "TUX" (Threaded linUX webserver) by Ingo Molnar et al. For 2.4 kernel.
Other interesting links
- Jeff Darcy's notes on high-performance server design
- Ericsson's ARIES project -- benchmark results for Apache 1 vs. Apache 2 vs. Tomcat on 1 to 12 processors
- Prof. Peter Ladkin's Web Server Performance page.
- Novell's FastCache -- claims 10000 hits per second. Quite the pretty performance graph.
- Rik van Riel's Linux Performance Tuning site