kafka design (一)

4.design

4.1 Motivation

We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.
It would have to have high-throughput to support high volume event streams such as real-time log aggregation.
It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.
It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.

我们将 Kafka 设计为能够作为一个统一平台来处理大型公司可能拥有的所有实时数据feed流。 为此,我们必须考虑一系列相当广泛的用例。
它必须具有高吞吐量才能支持大容量事件流,例如实时日志聚合。
它需要优雅地处理大量数据积压,以便能够支持来自离线系统的定期数据加载。
这也意味着系统必须处理低延迟交付以处理更传统的消息传递用例。

We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.
Finally in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault-tolerance in the presence of machine failures.
Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.

我们希望支持分区的、分布式的、实时处理的feeds,来创建新的派生的feeds。 这激发了我们的分区和消费者模型。
最后,在流被“喂”送到其他数据系统以供服务的情况下,我们知道系统必须能够在出现机器故障时保证容错。
支持这些用途使我们的设计具有许多独特的元素,比传统的消息传递系统更类似于数据库日志。 我们将在以下部分概述设计的一些元素。

4.2 Persistence

Don‘t fear the filesystem!

Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow" which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.

Kafka 严重依赖文件系统来存储和缓存消息。 人们普遍认为“磁盘速度很慢”,这让人们怀疑持久结构能否提供具有竞争力的性能。 事实上,磁盘比人们预期的要慢得多和快得多,这取决于它们的使用方式。 一个设计合理的磁盘结构通常可以和网络一样快。

The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec but the performance of random writes is only about 100k/sec—a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!

关于磁盘性能的关键事实是,在过去十年中,硬盘驱动器的吞吐量与磁盘寻道的延迟有所不同。 因此,在具有 6 个 7200rpm SATA RAID-5 阵列的 JBOD 配置上,线性写入的性能约为 600MB/秒,而随机写入的性能仅为约 100k/秒,相差超过 6000 倍。 这些线性读取和写入是所有使用模式中最可预测的,并由操作系统进行了大量优化。 现代操作系统提供预读和后写技术,这些技术“成倍地”以大块预取数据,并将较小的“逻辑写入”分组为“较大的物理写入”。 关于这个问题的进一步讨论可以在这篇 ACM 队列文章中找到; 他们实际上发现在某些情况下顺序磁盘访问比随机内存访问更快!

To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

为了弥补这种性能差异,现代操作系统在使用主存来作为“磁盘缓存”方面变得越来越积极。 现代操作系统很乐意将所有空闲内存作为磁盘缓存,因为在回收内存时性能损失很小。 所有磁盘读写都会经过这个统一的缓存。 如果不使用直接 I/O,则无法轻易关闭此功能,因此即使进程维护数据的进程内缓存,此数据也可能会在操作系统页面缓存中复制,从而有效地将所有内容存储两次。

Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:
The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.

此外,我们是在 JVM 之上构建的,了解 Java 内存使用的人都知道两件事:

  • 对象的内存开销非常高,通常会使存储的数据大小增加一倍(或更糟)。
  • 随着堆内数据的增加,Java 垃圾收集变得越来越繁琐和缓慢。

As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure—we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads then read-ahead is effectively pre-populating this cache with useful data on each disk read.

由于这些因素,使用文件系统和依赖页面缓存优于维护内存缓存或其他结构——我们通过自动访问所有空闲内存至少使可用缓存加倍,并且可能通过存储一个紧凑的字节结构而不是单个对象再次加倍。这样做将在 32GB 机器上产生高达 28-30GB 的缓存,而不会受到 GC 惩罚。此外,即使服务重新启动,此缓存也会保持温暖,而进程内缓存需要在内存中重建(对于 10GB 缓存可能需要 10 分钟),否则它将需要以完全冷的缓存启动(这可能意味着糟糕的初始性能)。这也极大地简化了代码,因为用于维护缓存和文件系统之间一致性的所有逻辑现在都在操作系统中,这往往比一次性的进程内尝试更有效、更正确。如果您的磁盘使用偏好线性读取,那么预读实际上是在每次磁盘读取时使用有用的数据预先填充此缓存。

This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel‘s pagecache.

这表明了一种非常简单的设计:与其在内存中尽可能多地维护并在空间不足时恐慌地将其全部刷写到文件系统中,不如将其反转。 所有数据都会立即写入文件系统上的持久日志,而不必刷新到磁盘。 实际上,这只是意味着它被传输到内核的页面缓存中。

This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).

这种以页面缓存为中心的设计风格在一篇关于 Varnish 设计的文章中有所描述(以及适度的傲慢)。

Constant Time Suffices

The persistent data structure used in messaging systems are often a per-consumer queue with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: Btree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache--i.e. doubling your data makes things much worse than twice as slow.

消息传递系统中使用的持久数据结构通常是每个消费者队列,带有关联的 BTree 或其他通用随机访问数据结构,以维护有关消息的元数据。 BTrees 是可用的最通用的数据结构,可以在消息传递系统中支持各种事务性和非事务性语义。不过,它们确实带来了相当高的成本:Btree 操作是 O(log N)。通常认为 O(log N) 本质上等同于常数时间,但对于磁盘操作而言并非如此。磁盘寻道的时间为 10 毫秒,每个磁盘一次只能进行一次寻道,因此并行性受到限制。因此,即使是少量的磁盘搜索也会导致非常高的开销。由于存储系统混合了非常快的缓存操作和非常慢的物理磁盘操作,当数据随着固定缓存的增加而增加时,观察到的树结构的性能通常是超线性的——即将您的数据加倍会使事情变得比速度减慢两倍还要糟糕。

Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size—one server can now take full advantage of a number of cheap, low-rotational speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.

直观上,持久队列可以建立在简单的读取和附加到文件的基础上,这是日志解决方案的常见情况。 这种结构的优点是所有操作都是 O(1) 并且读取不会阻塞写入或相互阻塞。 这具有明显的性能优势,因为性能与数据大小完全分离——一台服务器现在可以充分利用大量廉价、低转速的 1+TB SATA 驱动器。 尽管它们的寻道性能很差,但这些驱动器对于大型读取和写入具有可接受的性能,并且价格是其 1/3,容量是其 3 倍。

Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.

在没有任何性能损失的情况下访问几乎无限的磁盘空间意味着我们可以提供一些通常在消息传递系统中找不到的功能。 例如,在 Kafka 中,我们可以将消息保留相对较长的时间(比如一周),而不是尝试在消息被消费后立即删除。 正如我们将要描述的,这为消费者带来了极大的灵活性。

kafka design (一)

上一篇:OpenCV里的SIFT可以放心用了


下一篇:YBTOJ:拔河比赛