论文解析 -- A Cloud-native Architecture for Replicated Data Services (HotCloud '20)

Introduction

传统的服务是正对本地服务器和本地磁盘设计的,所以在迁移到cloud上的时候会存储不匹配的问题

尤其对于分布式数据系统,传统的方式是他们都需要自己管理local的磁盘

在云上的方案是,云存储服务被封装成块存储或文件存储,让系统可以按照原来的方式访问。

Infrastructure-as-a-Service (or, cloud) providers have become the defacto(实际上的) standard for deploying services of all kinds.

However, migrating services engineered for bare metal servers to the cloud can come with varying degrees of difficulty, specifically when leveraging the highly available cloud storage.

When these services get deployed on the cloud, they often end up(最终) using cloud storage similar to local on-premise storage, and this mismatch introduces inefficiencies in the system.

In this paper we focus on an important class of data management systems, we refer to as replicated data services.

These services manage state (e.g. database state) and use replication for fault-tolerance and high-availability.

They include a variety of back-end services used to build various applications that power the digital world;

including replicated relational databases (e.g., PostgreSQL/XC), scalable key-value stores (e.g., Cassandra [20 ]), and ingest pipelines (e.g., Kafka [20 ]).

Many of these services were originally engineered for on-premise deployments and share a common property of their monolithic(大而复杂的) architecture: they all manage their own disk.

In particular, they manage their own copy of some shared state. It is certainly possible to deploy an existing replicated service to the cloud without changes to the service itself.

Cloud providers have gone to great lengths to make this an easy task.

Storage services can be exposed as block devices or network attached file systems, giving the abstraction of the local disk our services were designed to manage.

However, if we analyze the end-to-end architecture of this type of deployment there are two significant problems.

 

这种使用云存储的方式有两种问题,首先是多余的副本,在云存储中本身就做了副本保证一致性和可用性,但是replicated数据系统在应用层又做了一层副本。

这种应用层的副本在大部分情况下都是无用的和多余的。

 Redundant replication of storage: Cloud storage services provide fault-tolerance and high-availability using their own internal data replication.

Replicated services also replicate data to provide the same properties.

This additional application-level replication provided by the replicated service has little advantage over what the cloud storage already provides.

For example, consider running multiple copies of a service within a single availability zone (AZ) to tolerate host failure or network partitions.

If the storage service is available to all hosts within the AZ, then storing multiple copies of data within that storage service does not increase data availability.

Furthermore, the storage systems themselves already guarantee durability of stored data under various types of failure scenarios.

In some cases, the application-level replication is still needed.

For example, if a storage service is not available across multiple AZs within a geographic region, then application-level replication is required to preserve data availability under AZ failure.

However, if a storage service is available in all AZs, then storing multiple copies of data within that storage service again becomes redundant.

 

再者,云存储的性能特点不同,往往云存储的延时要远高于local disk,而传统的数据系统设计时,都是针对local disk的IO模型做的优化,是否适用于不同的云存储的IO?

不同的云存储的可用性特征也是不同的,比如EBS,GCF,EFS,这种不同的可用性会极大的影响数据系统的设计。

 Storage service characteristics: Cloud provided storage services have significantly different performance characteristics compared to each other and the on-premise physical disks.

For example, storage I/O latency for on-premise deployment (using local disk) is orders of magnitude lower than the I/O latency when using cloud storage.

Data centric services, like RDBMSs and scalable key-value stores, have gone to great lengths to optimize I/O performance on local disks.

Furthermore, different storage services have different availability properties.

Some are only available to a single host at a time  (e.g., AWS EBS [1 ]), some can be shared among hosts within a single availability zone or data centre (e.g., Google Cloud Filestore [15 ]),

and others can be shared among hosts across availability zones in a geographic region (e.g., AWS EFS [2 ]).

These availability properties can influence the architecture of a cloud-native replicated data service design, and can influence requirements on when data needs to be replicated in order to achieve particular fault-tolerance guarantees.

Each cloud storage service presents a new class of storage device with its own unique performance and availability characteristics.

 

Addressing these problems poses(提出) some difficult challenges.

To overcome the redundant replication problem, we need to reconsider the storage architecture of our replicated services in order to provide service availability without unnecessary data replication.

While reconsidering the storage architecture, we must also consider the different performance characteristics of cloud storage services to best exploit their properties.

 

 Contributions:

本文借用main-delta模型架构来解决上面的问题,并且在kafka和cassandra上实际实现

In this paper, we provide a general classification of the replication approaches used by a selection of popular replicated services

and analyze how the approaches fit the characteristics of the various types of storage services provided by the major cloud providers.

We then describe how a well-known architectural pattern originally designed for efficiently handling mixed read/update workloads, the main-delta model [19],

can be adapted to various classes of replicated services to solve the redundant replication problem when engineering these services for cloud deployment.

We refer to the new architecture as cloud-native replicated services. 

The main-delta architecture naturally lends itself to the decoupled compute and storage model available in the cloud.

Furthermore, this design allows us to tune the size of I/Os by adapting the policy a cloud-native replicated service uses to merge its deltas.

This allows us to simultaneously address the performance problem of using cloud storage services.

 

We support our analytical solution by implementing it in two popular replicated services, falling into two very different categories of replication type.

Apache Kafka (Section 3.1) and Apache Cassandra (Section 3.2).

We find that the implementation overhead to adapt these existing monolithic architectures to our proposed cloud-native design is very small, requiring only a few hundred lines of Java code modifications.

 

Problem & Solution Overview 

Problem of redundant replication

Replicated data services provide application-level replication of data for high read throughput, fault-tolerance, and high-availability.

On the other hand, cloud storage provides storage-level replication of the data for the same reasons.

When replicated services are deployed on the cloud the data replication quadruples(四倍) due to the two independent levels of replication that become the part of the whole system.

We call this the problem of redundant replication .

In Figure 1 , we demonstrate this problem in an example scenario where an application is using cloud storage, with application-level and storage-level replication factors of 3, and the data (e.g., key-value (a; 1) ) is stored nine times.

To solve the problem, we propose a solution that ensures only one application-level replica of the data is stored on the cloud storage, while maintaining the high-availability guarantees of the systems. 

Note that, our assumption is that this single replica is highly-available and durable as per(按照) the guarantees of the cloud service provider and hence is not a single point of failure.

Our solution is based on the main-delta architecture, which we describe next.

Following that are the details of our solution.

Main-Delta architecture overview :

As the name suggests, the architecture has two partitions where data is stored: a main and a delta.

The main data partition is static and read only, hence it is generally read-optimized.

Whereas the delta partition is write-optimized, and allows for insert and read operations.

Data in the main is physically modified by creating a new main.

A new main is created by merging outstanding delta operations. The merging of the delta is done periodically.

 

Solution: Cloud-native architecture

解决多余副本的问题,将main存储在cloud上单副本,但是delta在local做replicated,多个delta副本共享一个main

In our solution, the main partition is stored on persistent cloud storage, and the delta partition is stored locally within the application, either in-memory or on-disk.

We allow the delta partition to be replicated at the application-level whereas the main partition is replicated only once at the application-level.

This is outlined in Figure 2  where triangles inside the nodes N 0, ... Nk  represent the deltas.

All the nodes in the replicaset access the same main partition stored on cloud storage (which is internally replicated).

The deltas from one or more replica nodes are merged with the main periodically.

论文解析 -- A Cloud-native Architecture for Replicated Data Services (HotCloud '20)

 

This design exploits the fault-tolerance properties of cloud storage services and avoids unnecessary data redundancy, as such we call this architecture a cloud-native architecture for replicated data services.

Modelling the replicated services as main-delta systems is fortunately straightforward.

While not explicitly designed this way, many existing replicated data services already have this model internally, for example Kafka and Cassandra.

Generally speaking we treat their in-memory data buffers as deltas and on-disk files as main.

The challenge is in how the deltas should be merged with the main.

This is highly dependent on the replication semantics and the failure guarantees of a particular system; we discuss this in Section 3 .

 

Cloud-native architecture

There are several potential solutions to the problem of redundant replication.

We generally cannot control storage-level replication in the cloud, as these services are controlled by the cloud provider.

However, we can change how we do application-level replication.

One simple approach is to remove application-level replication.

Since the storage service already provides fault-tolerance via replication, there is no need for the application to replicate for fault-tolerance.

The drawback of this approach is that it results in the loss of availability of the service. (单个instance,单点)

If the single running instance is unreachable, e.g., due to process crash, machine failure, or network partition, then the entire service is unavailable.

An alternative solution is to have multiple copies of the service share a single primary copy of the data on the cloud storage service.

This way availability is maintained by having multiple instances of the services running without actually using application-level replication.

There are two drawbacks to this approach. (换成多个instance共享一份data,解决单点,这个方案的问题,buffer数据也会丢,需要落到cloud,延时大;多写问题)

The first is that all writes to the system need to be persisted to the cloud storage service to ensure no data is lost if a service fails.

For systems like Kafka and Cassandra that buffer writes in memory and flush them to storage in batches this introduces significant latency.

The second drawback is for services that are engineered as shared-nothing architectures and that have multi-writer designs (e.g. Cassandra).

These services would require concurrent writes to shared data, which would require re-engineering the storage of the system to coordinate concurrent updates to the shared storage, and introduces contention in the system.

 

We observe that main-delta systems have a desirable(理想的) property that can be exploited in cloud deployments:

they have a large read-only main data segment, which is periodically rebuilt to incorporate a batch of deltas  in a process called the delta merge.

Because the data in main is read-only, it would be possible for multiple instances of a replicated service to share a single main,

without introducing the same contention described above with multiple instances sharing a single primary copy of the data.

Only the occasional delta-merge process would need to be coordinated.

Deltas, on the other hand, are kept relatively small by the recurring(定期的) delta merges and are expunged(清除) after the merge takes place.

Each replica can maintain its delta using the application-level replication.

Delta could be kept on a local disk, in a private area of the cloud storage service, or in-memory depending on the environment and durability guarantees of the system.

Deciding which node within a replica-set merges the delta to main depends on the replication policy used by the application.

 

副本的类型,分为,单写单读,单写多读,多写多读,不同的副本策略,相应的delta-merge的策略也不相同

Therefore, before providing the details about the delta-merge strategy, it is important to understand the replication strategies used in practice.

Application-level replication strategies:  We classify these into three categories based on the strategies seen in practice.

(1) Single-writer/single-reader:  this has a single master node in a replica-set, and all the read and write operations are handled by the master node.

The role of replicas in this strategy is only to provide fault-tolerance and high-availability. For example, Kafka [18 ].

(2) Single-writer/multi-reader:  the writes are handled by a single master but the reads can be handled by any replica node.

The role of replicas here is to provide fault-tolerance, high-availability, and read scale-out.

Examples are MongoDB [9 ], Redis [22 ], PostgreSQL, MemSQL [21 ], Aurora [23 ], and Pnuts [10 ].

(3) Multi-writer/multi-reader:  reads and writes can be serviced by multiple nodes in a replica-set.

The role of replicas here is to provide fault-tolerance, high-availability, and read and write scale-out.

In some multi-writer/multi-reader systems, quorums of nodes are used to accept writes, which means that not all replicas in a replica-set are exact replicas of each other.

Examples are Cassandra [20 ], CouchDB [12 ], PostgresXC, Dynamo [13 ], Spanner [11 ].

Depending on the replication strategy, the delta-merge strategy could be as simple as master node always merging the delta, as in the case of Kafka and for other single-writer/single-reader

systems (in Section 3.1 ), or a more complex one, involving deltas of all replica nodes, as we will see in the case study of Cassandra or any other multi-writer/multi-reader system (in Section 3.2 ).

 

云存储分为3类,这里给出分类和每个的代表产品,好评啊

只有单写单读可以用1,其他的case都需要多个节点同时访问。

In addition to the delta-merge strategy, different replication strategies also determine which type of cloud storage can be used when using main-delta architecture for replicated services.

Cloud storage can be classified into the following three categories:

(1) Network attached block devices : This storage is similar to an on-premise disk; 

the storage is bound or attached to a single compute instance. That means only one instance at a time can mount the storage for reading and writing.

Examples are Elastic Block Store (EBS) [1 ], Google Cloud’s Persistent disk [14 ], and Disk Storage [6 ] by Azure. 等同于本地磁盘,同时被单个实例mount

 

(2) NFS shared storage : This is similar to a Network Files System (NFS) shared across multiple compute instances.

Any number of compute instances can mount the storage, hence allowing multiple instances to simultaneously read and write the data.

Examples are Elastic File System (EFS) [2 ] , Google Filestore [15 ], and Azure Files [7 ]. 网络文件系统,多个实例共享

 

(3) Object Stores : This type of storage allows reading and writing named objects.

This storage does not allow for in-place updates, data can be deleted and inserted again with new values.

Examples are Amazon S3 [3 ], Google Cloud Storage Buckets [14 ], and Azure Blob Storage [5 ]. 对象存储

 

For single-writer/single-reader  replication the delta can be merged only by the master node and the reads are also served by the master node.

Therefore, any cloud storage which allows for one or more compute nodes to read and write data is suitable.

That is, all of the above types of storage can be used.

For the single-writer/multi-reader  replication the delta is merged only by master but the reads are served by all the replica nodes,

therefore each node should have read access to the main. Hence, only NFS shared storage and Object Stores  can be used to store the main.

Similarly for multi-writer/multireader replication, the delta from all the nodes needs to be merged, and each node serves the reads.

Therefore, all nodes need read and write access to the storage. Hence, only NFS shared storage  and Object Stores can be used to store the main.

 

Case Study: Cloud-native Kafka

 Main-delta in Kafka:

在内存中的buffer就是delta,写入磁盘的文件就是main

Kafka internally implements an append only data model.

Updates are treated as new values and appended to the existing data.

Compaction runs in the background and deletes older versions of the same data.

To support the append only architecture Kafka has in-memory buffers to which new values are appended,

and these buffers are flushed to the persistent storage regularly where it is merged with the rest of the data.

The append only architecture lends itself naturally to the main-delta architecture,

where the in-memory buffers are equivalent to the deltas and the data stored on persistent storage is equivalent to the main.

 

Delta-merge: 

改成多个副本中只有主副本会persistent数据到磁盘

For every data partition, there is a fixed set of Kafka brokers, called replica-sets, owning the replicas of the partition.

For every replica-set only a single copy of the main is stored on persistent storage, but every broker in the replica-set maintains its own delta.

The multiple copies of the delta are kept synchronized by Kafka’s synchronous replication protocol.

The delta-merge strategy naturally follows from the fact that there is a single master broker per replica-set (i.e. single-writer/single-reader  replication strategy).

We employ the master replica to perform the delta-merge (flush the log tail) and read the main (persisted log) to/from persistent storage,

and block all the other brokers from accessing the storage. This ensures that only the master broker’s delta is merged to the main.

To decide when the in-memory buffer should be flushed to main we explicitly manage the in-memory buffers.

In our implementation we maintain fixed size Byte Buffers in memory as deltas.

Once a buffer is full it is flushed by a background thread to the file located on a persistent storage.

We maintain two buffers such that when one buffer is being flushed, the other buffer is available for writes.

 

Failure guarantees: 

In the case of master failure, Kafka has an algorithm to elect the new master broker.

Write permissions to the persistent storage and also the permission to merge delta are enabled only on the master.

The new master broker can also read the existing log from storage to answer read requests.

The guarantees provided by Kafka at the time of master node failure depends on the replication policy configured for Kafka.

Replication in Kafka can be configured to either synchronous or asynchronous replication.

Our modifications towards main-delta architecture allow for same replication policies to hold, because the deltas still get updates according to the specified replication policy.

Once the delta is merged it is persisted on fault-tolerant storage and all the guarantees of the storage service apply.

 

Case Study: Cloud-native Cassandra

 Main-delta in Cassandra:

Cassandra的主要特征就是,节点间非集中式,p2p;存储使用sstable;那么自然memtable就是delta,而sstable就是main

那么现在的问题在于,Cassandra是多写,所以比单写复杂,一个replica-set的每个节点都要可以访问同一个storage

Cassandra, in contrast to Kafka, is a peer-to-peer system with no notion of master or slaves.

Cassandra supports quorum reads and quorum writes, where n (i.e. quorum count) out of the k (number of replicas) nodes must respond to the read or write request.

Each Cassandra node writes data to an in-memory data structure called memtable, which are regularly flushed to the disk, and merged with on-disk structure called sstable.

Similar to Kafka, Cassandra does not support in-place updates.

The memtable and sstable structures of Cassandra [20] naturally lends itself to the main-delta architecture.

In-memory memtables are logically equal to the deltas, and on-disk multiple sstables together form the main.

However, what is required is to ensure that only a single copy of the main exists for every partition.

Ensuring that is not straightforward in systems like Cassandra, where the replication policy is driven  by quorum writes.

To implement the main-delta architecture within Cassandra we first need to decouple the storage from its processing engine.

We move from per-node-storage to per-replica-set-storage.

In per-node-storage, each node has its own set of memtables and dedicated storage for its sstables.

In per-replica-set-storage, each node still has its own set of memtables but the sstables (which together form the main) have only a single copy per replica-set.

These sstables, are stored on a shared storage (NFS or object store), and each node in the replica-set has read and write permissions to the sstables.

 

Delta-merge: 

多写,先独立写,后续进行统一的compaction

Due to quorum writes it is uncertain that any single node in the replica-set has all the recent writes.

Therefore, unlike Kafka, the delta merge needs to reconcile(调和) deltas (i.e. memtables in Cassandra) of all the nodes in the replicaset.

To achieve this, we allow each node to independently flush its delta to the cloud storage whenever their delta is full, 

and a background compaction task merges the multiple copies of flushed deltas into a single combined delta and then append it to the main (which contains remaining sstables).

Since each node can write its delta to the storage, this approach needs the main to be stored on shared storage that allows for multiple writes (i.e., NFS storage or object storage).

The quorum reads are handled the same as in the original Cassandra, i.e., each node in the replica-set searches their deltas and the single shared main,

and as soon as the quorum is satisfied the result is returned to the user.

 

Failure guarantees: 

Our modifications in Cassandra are mainly focused on the compaction job, which is a background process that is already part of Cassandra.

This keeps rest of the architecture including failure handling unchanged.

Therefore, the failure guarantees of the cloud-native Cassandra are same as original Cassandra.

 

 

 

 

上一篇:info replication


下一篇:springboot