Table of Contents
- Preface
- The original Ozone Datanode chunk file layout
- Ozone Datanode Chunk Layout: FILE_PER_CHUNK and FILE_PER_BLOCK
- On-disk comparison of the old and new chunk layouts
- References
Preface
In Ozone, file object data is organized in the form of Blocks, similar to HDFS. However, when Ozone physically stores a Block, it does so at a finer granularity, as chunk files. In short, a Block is further divided into multiple chunks, and each chunk file corresponds to a relative offset within the Block. In this post I discuss the Ozone chunk layout, i.e. how chunk files are stored on the Datanode. In the original implementation, storing the chunks of a Block as separate files produced a large number of chunk files, which is not very efficient. With a recent community optimization, a Block can now be stored as a single chunk file. The rest of this post walks through these two layouts.
The original Ozone Datanode chunk file layout
Let's first look at the original layout of Ozone chunk files. How does the write process work?
1) A file is first split into multiple Blocks according to the block size.
2) BlockOutputStream then splits each Block into multiple chunk files according to the chunk size and writes them out.
This approach has a few drawbacks (see the sketch after this list):
- If the block size is configured to be large, a large number of chunk files will be generated.
- Every Block read or write involves multiple chunk files, so I/O is not very efficient.
- The read/write path has to perform multiple file checks, which also hurts efficiency.
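To make the file-count issue concrete, here is a minimal standalone sketch (not Ozone code; the class and method names are my own) that computes how many chunk files a single Block produces under FILE_PER_CHUNK for a given block size and chunk size:

// Hypothetical illustration only: count the chunk files one Block would
// produce under FILE_PER_CHUNK, given a block size and a chunk size.
public final class ChunkLayoutMath {

  private ChunkLayoutMath() { }

  // Number of chunk files needed to cover one Block (ceiling division).
  static long chunkCount(long blockSize, long chunkSize) {
    return (blockSize + chunkSize - 1) / chunkSize;
  }

  public static void main(String[] args) {
    long blockSize = 256L * 1024 * 1024;  // e.g. a 256 MB block
    long chunkSize = 4L * 1024 * 1024;    // e.g. a 4 MB chunk
    System.out.println("One block => " + chunkCount(blockSize, chunkSize)
        + " chunk files under FILE_PER_CHUNK");
    // Each chunk i covers the block-relative offset range
    // [i * chunkSize, (i + 1) * chunkSize).
    for (long i = 0; i < 3; i++) {
      System.out.printf("chunk_%d starts at block offset %d%n",
          i + 1, i * chunkSize);
    }
  }
}

With these example sizes a single Block already needs 64 chunk files, which is where the "too many small files" problem comes from.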
As a result, the community implemented a one-chunk-file-per-Block layout to improve the efficiency of Ozone file reads and writes. The original chunk layout is called FILE_PER_CHUNK, and the new one FILE_PER_BLOCK.
Ozone Datanode Chunk Layout: FILE_PER_CHUNK and FILE_PER_BLOCK
These two chunk layout modes are implemented inside the Datanode by FilePerChunkStrategy and FilePerBlockStrategy respectively.
The essential difference between the two policies is how the Datanode handles the ChunkBuffer data sent by BlockOutputStream:
FilePerChunkStrategy (the original behavior): write the data out as a new chunk file.
FilePerBlockStrategy: write the data at the chunk's offset within the existing chunk file, i.e. append to it.
The process is shown more intuitively in the figure below:
In the figure, chunk files 1, 2 and 3 on the FILE_PER_CHUNK side correspond to different segments within a single chunk file on the FILE_PER_BLOCK side; under FILE_PER_BLOCK, a Block has only one chunk file.
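As a rough illustration of that difference, the sketch below (a simplification, not Ozone's actual path-resolution code; the file-naming scheme is an assumption) shows which file a chunk write targets under each layout:

import java.io.File;

// Simplified, hypothetical sketch of where a chunk write lands under each layout.
// Real Ozone resolves paths via ChunkUtils and the container data; the names
// used here are assumptions for illustration.
final class LayoutTargetSketch {

  // FILE_PER_CHUNK: every chunk gets its own file, always written from offset 0.
  static File filePerChunkTarget(File chunksDir, long blockId, int chunkIndex) {
    return new File(chunksDir, blockId + "_chunk_" + chunkIndex);
  }

  // FILE_PER_BLOCK: one file per block; the data is written at the chunk's
  // block-relative offset inside that single file.
  static File filePerBlockTarget(File chunksDir, long blockId) {
    return new File(chunksDir, blockId + ".block");
  }
}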
Now let's look at the concrete implementation logic, mainly the writeChunk method.
FilePerChunkStrategy (FilePerChunkStrategy.java):
public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
ChunkBuffer data, DispatcherContext dispatcherContext)
throws StorageContainerException {
checkLayoutVersion(container);
Preconditions.checkNotNull(dispatcherContext);
DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
try {
KeyValueContainerData containerData = (KeyValueContainerData) container
.getContainerData();
HddsVolume volume = containerData.getVolume();
VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
// 1) Resolve the path of this chunk file
File chunkFile = ChunkUtils.getChunkFile(containerData, info);
boolean isOverwrite = ChunkUtils.validateChunkForOverwrite(
chunkFile, info);
// 2) Resolve the temporary file path for this chunk
File tmpChunkFile = getTmpChunkFile(chunkFile, dispatcherContext);
if (LOG.isDebugEnabled()) {
LOG.debug(
"writing chunk:{} chunk stage:{} chunk file:{} tmp chunk file:{}",
info.getChunkName(), stage, chunkFile, tmpChunkFile);
}
long len = info.getLen();
// The offset is ignored, because each chunk is written to its own new file
long offset = 0; // ignore offset in chunk info
switch (stage) {
case WRITE_DATA:
if (isOverwrite) {
// if the actual chunk file already exists here while writing the temp
// chunk file, then it means the same ozone client request has
// generated two raft log entries. This can happen either because
// retryCache expired in Ratis (or log index mismatch/corruption in
// Ratis). This can be solved by two approaches as of now:
// 1. Read the complete data in the actual chunk file ,
// verify the data integrity and in case it mismatches , either
// 2. Delete the chunk File and write the chunk again. For now,
// let's rewrite the chunk file
// TODO: once the checksum support for write chunks gets plugged in,
// the checksum needs to be verified for the actual chunk file and
// the data to be written here which should be efficient and
// it matches we can safely return without rewriting.
LOG.warn("ChunkFile already exists {}. Deleting it.", chunkFile);
FileUtil.fullyDelete(chunkFile);
}
if (tmpChunkFile.exists()) {
// If the tmp chunk file already exists it means the raft log got
// appended, but later on the log entry got truncated in Ratis leaving
// behind garbage.
// TODO: once the checksum support for data chunks gets plugged in,
// instead of rewriting the chunk here, let's compare the checkSums
LOG.warn("tmpChunkFile already exists {}. Overwriting it.",
tmpChunkFile);
}
// 3) In the WRITE_DATA stage, write the data to the temporary file
ChunkUtils.writeData(tmpChunkFile, data, offset, len, volumeIOStats,
doSyncWrite);
// No need to increment container stats here, as still data is not
// committed here.
break;
case COMMIT_DATA:
...
// 4) In the COMMIT_DATA stage, rename the temporary file to the final chunk file
commitChunk(tmpChunkFile, chunkFile);
// Increment container stats here, as we commit the data.
containerData.updateWriteStats(len, isOverwrite);
break;
case COMBINED:
// directly write to the chunk file
ChunkUtils.writeData(chunkFile, data, offset, len, volumeIOStats,
doSyncWrite);
containerData.updateWriteStats(len, isOverwrite);
break;
default:
throw new IOException("Can not identify write operation.");
}
} catch (StorageContainerException ex) {
throw ex;
} catch (IOException ex) {
throw new StorageContainerException("Internal error: ", ex,
IO_EXCEPTION);
}
}
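The WRITE_DATA/COMMIT_DATA stages above follow a classic write-to-temp-file-then-rename pattern. A minimal standalone sketch of that pattern in plain java.nio (illustrative only; the real logic lives in ChunkUtils.writeData and commitChunk) could look like this:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Sketch of the temp-file write + rename commit pattern used by FILE_PER_CHUNK.
final class TmpFileCommitSketch {

  // WRITE_DATA stage: write the chunk bytes to a temporary file.
  static void writeData(Path tmpChunkFile, ByteBuffer data) throws IOException {
    try (FileChannel ch = FileChannel.open(tmpChunkFile,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      while (data.hasRemaining()) {
        ch.write(data);
      }
      ch.force(false);  // flush data to disk before commit (when sync is required)
    }
  }

  // COMMIT_DATA stage: atomically rename the temp file to the final chunk file,
  // so readers never observe a half-written chunk.
  static void commit(Path tmpChunkFile, Path chunkFile) throws IOException {
    Files.move(tmpChunkFile, chunkFile, StandardCopyOption.ATOMIC_MOVE);
  }
}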
Now let's look at the implementation of the other chunk layout policy:
FilePerBlockStrategy.java:
@Override
public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
ChunkBuffer data, DispatcherContext dispatcherContext)
throws StorageContainerException {
checkLayoutVersion(container);
Preconditions.checkNotNull(dispatcherContext);
DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
...
KeyValueContainerData containerData = (KeyValueContainerData) container
.getContainerData();
// 1) Likewise, resolve the chunk file path (here, one file per block)
File chunkFile = getChunkFile(containerData, blockID);
boolean overwrite = validateChunkForOverwrite(chunkFile, info);
long len = info.getLen();
// 2) Get the chunk's offset within the Block, i.e. the write offset inside this block's chunk file
long offset = info.getOffset();
if (LOG.isDebugEnabled()) {
LOG.debug("Writing chunk {} (overwrite: {}) in stage {} to file {}",
info, overwrite, stage, chunkFile);
}
HddsVolume volume = containerData.getVolume();
VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
// 3) Get the FileChannel of this chunk file from the open-file cache
FileChannel channel = files.getChannel(chunkFile, doSyncWrite);
// 4) Write the data at the specified offset
ChunkUtils.writeData(channel, chunkFile.getName(), data, offset, len,
volumeIOStats);
containerData.updateWriteStats(len, overwrite);
}
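Conceptually, step 4 places the chunk's bytes at its block-relative offset inside the single block file. A minimal java.nio sketch of such a positioned write (an illustration, not the actual ChunkUtils.writeData implementation) is:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative positioned write: place a chunk's bytes at its block-relative
// offset inside the per-block file, as FILE_PER_BLOCK does conceptually.
final class PositionedWriteSketch {

  static void writeAt(Path blockFile, ByteBuffer data, long offset)
      throws IOException {
    try (FileChannel ch = FileChannel.open(blockFile,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      long pos = offset;
      while (data.hasRemaining()) {
        pos += ch.write(data, pos);  // write at an explicit position, no seek state
      }
    }
  }
}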
Because under FILE_PER_BLOCK a block file may keep being written to over a continuous period of time, a FileChannel cache is implemented here to avoid repeatedly closing and reopening the same file within a short time window.
private static final class OpenFiles {
private static final RemovalListener<String, OpenFile> ON_REMOVE =
event -> close(event.getKey(), event.getValue());
// Cache of open chunk files (OpenFile entries)
private final Cache<String, OpenFile> files = CacheBuilder.newBuilder()
.expireAfterAccess(Duration.ofMinutes(10))
.removalListener(ON_REMOVE)
.build();
/**
* Returns the FileChannel of a chunk file, opening the file if necessary.
*/
public FileChannel getChannel(File file, boolean sync)
throws StorageContainerException {
try {
return files.get(file.getAbsolutePath(),
() -> open(file, sync)).getChannel();
} catch (ExecutionException e) {
if (e.getCause() instanceof IOException) {
throw new UncheckedIOException((IOException) e.getCause());
}
throw new StorageContainerException(e.getCause(),
ContainerProtos.Result.CONTAINER_INTERNAL_ERROR);
}
}
private static OpenFile open(File file, boolean sync) {
try {
return new OpenFile(file, sync);
} catch (FileNotFoundException e) {
throw new UncheckedIOException(e);
}
}
/**
* Evicts the file from the cache so that it gets closed; entries that expire from the cache are closed the same way via the removal listener.
*/
public void close(File file) {
if (file != null) {
files.invalidate(file.getAbsolutePath());
}
}
...
}
According to the community's benchmarks comparing data writes under these two layouts, the new chunk layout is considerably more efficient than FILE_PER_CHUNK, and FILE_PER_BLOCK has already been made the default chunk layout. The related configuration is as follows:
<property>
<name>ozone.scm.chunk.layout</name>
<value>FILE_PER_BLOCK</value>
<tag>OZONE, SCM, CONTAINER, PERFORMANCE</tag>
<description>
Chunk layout defines how chunks, blocks and containers are stored on disk.
Each chunk is stored separately with FILE_PER_CHUNK. All chunks of a
block are stored in the same file with FILE_PER_BLOCK. The default is
FILE_PER_BLOCK.
</description>
</property>
On-disk comparison of the old and new chunk layouts
I tested both chunk layouts on a test cluster to see how the chunks are actually stored on disk; the results are below.
FILE_PER_BLOCK layout:
[hdfs@lyq containerDir0]$ cd 11/chunks/
[hdfs@lyq chunks]$ ll
total 16384
-rw-rw-r-- 1 hdfs hdfs 16777216 Mar 14 08:32 103822128652419072.block
FILE_PER_CHUNK layout:
[hdfs@lyq ~]$ ls -l /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
-rw-r--r-- 1 hdfs hdfs 12 Dec 24 07:56 /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
References
[1] https://issues.apache.org/jira/browse/HDDS-2717