Ozone Block Chunk File Layout

Preface


In Ozone, file object data is organized in the form of Blocks, similar to HDFS. When a Block is actually stored, however, Ozone physically persists it as finer-grained chunk files. In short, a Block is further divided into multiple chunks, and each chunk file corresponds to a relative offset within the Block. In this article I want to talk about the chunk layout in Ozone, i.e., how chunk files are stored on the Datanode. In the original implementation, storing each Block as many separate chunk files produced a large number of small files, which is not efficient. In a recent community optimization, a Block can be stored as a single chunk file. Below we walk through these two layout modes.

The Original Ozone Datanode Chunk File Layout


Let's start with the original chunk file layout in Ozone. What does the write process look like?

1) A file is first divided into multiple Blocks according to the block size.
2) BlockOutputStream then splits each Block by the chunk size and writes the data out as multiple chunk files, as sketched right below.
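
To make the offset relationship concrete, here is a minimal, self-contained sketch (plain Java, not Ozone code; the block and chunk sizes are invented for illustration) of how a block is conceptually cut into chunks, each carrying its relative offset inside the block:

  // Illustrative only: split one block into fixed-size chunks and print
  // each chunk's relative offset within the block.
  public class ChunkSplitSketch {
    public static void main(String[] args) {
      long blockSize = 16L * 1024 * 1024; // e.g. a 16 MB block
      long chunkSize = 4L * 1024 * 1024;  // e.g. a 4 MB chunk

      int chunkCount = (int) ((blockSize + chunkSize - 1) / chunkSize);
      for (int i = 0; i < chunkCount; i++) {
        long offset = i * chunkSize;                        // chunk's offset in the block
        long len = Math.min(chunkSize, blockSize - offset); // the last chunk may be shorter
        System.out.printf("chunk_%d -> offset=%d, len=%d%n", i + 1, offset, len);
      }
    }
  }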

This per-chunk-file approach has a few drawbacks:

  • If the block size is set relatively large, many chunk files will be generated.
  • Every Block read or write touches multiple chunk files, so I/O is not very efficient.
  • Reads and writes also need to perform multiple file checks, which further hurts efficiency.

To address this, the community implemented a file layout based on one chunk per Block to improve the efficiency of Ozone reads and writes [1]. The original chunk layout is named FILE_PER_CHUNK, and the new one FILE_PER_BLOCK.

Ozone Datanode Chunk Layout: FILE_PER_CHUNK and FILE_PER_BLOCK


These two chunk layout modes are implemented inside the Ozone Datanode by FilePerChunkStrategy and FilePerBlockStrategy, respectively.

The essential difference between the two policies lies in how the Datanode handles the ChunkBuffer data sent by BlockOutputStream:

How FilePerChunkStrategy handles it (the original way): the data is written out as a new chunk file.
How FilePerBlockStrategy handles it: the data is appended to the existing chunk file at the current offset (see the java.nio sketch right below).
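
Boiled down, the difference looks like the following hedged sketch (plain java.nio, no Ozone types; the paths are invented, and the file-name patterns simply mirror the on-disk listings shown later in this article): FILE_PER_CHUNK opens a brand-new file per chunk, while FILE_PER_BLOCK performs a positional write into one shared block file.

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  public class LayoutWriteSketch {

    // FILE_PER_CHUNK: every chunk gets its own file and is written from offset 0.
    static void writeChunkAsNewFile(Path chunkDir, long blockId, int chunkIndex,
        ByteBuffer data) throws IOException {
      Path chunkFile = chunkDir.resolve(blockId + "_chunk_" + chunkIndex);
      try (FileChannel ch = FileChannel.open(chunkFile,
          StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        ch.write(data, 0);
      }
    }

    // FILE_PER_BLOCK: all chunks of a block share one file; each chunk is
    // written at its own offset inside that file.
    static void writeChunkIntoBlockFile(Path chunkDir, long blockId, long offset,
        ByteBuffer data) throws IOException {
      Path blockFile = chunkDir.resolve(blockId + ".block");
      try (FileChannel ch = FileChannel.open(blockFile,
          StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        ch.write(data, offset);
      }
    }
  }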

The figure below shows the process more intuitively:
(Figure: chunk file layout under FILE_PER_CHUNK vs. FILE_PER_BLOCK)
The chunk1, chunk2, and chunk3 files on the FILE_PER_CHUNK side of the figure correspond to different segments inside a single chunk file under FILE_PER_BLOCK; in FILE_PER_BLOCK mode, each Block has only one chunk file.

Now let's look at the concrete implementation logic, mainly in the writeChunk method:

The FilePerChunkStrategy policy (FilePerChunkStrategy.java):

  public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
      ChunkBuffer data, DispatcherContext dispatcherContext)
      throws StorageContainerException {

    checkLayoutVersion(container);

    Preconditions.checkNotNull(dispatcherContext);
    DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
    try {

      KeyValueContainerData containerData = (KeyValueContainerData) container
          .getContainerData();
      HddsVolume volume = containerData.getVolume();
      VolumeIOStats volumeIOStats = volume.getVolumeIOStats();

      // 1) Resolve the path of this chunk file
      File chunkFile = ChunkUtils.getChunkFile(containerData, info);

      boolean isOverwrite = ChunkUtils.validateChunkForOverwrite(
          chunkFile, info);
      // 2) Resolve the temporary file path for this chunk
      File tmpChunkFile = getTmpChunkFile(chunkFile, dispatcherContext);
      if (LOG.isDebugEnabled()) {
        LOG.debug(
            "writing chunk:{} chunk stage:{} chunk file:{} tmp chunk file:{}",
            info.getChunkName(), stage, chunkFile, tmpChunkFile);
      }

      long len = info.getLen();
      // Ignore the offset value, since each chunk is written as a new standalone file
      long offset = 0; // ignore offset in chunk info
      switch (stage) {
      case WRITE_DATA:
        if (isOverwrite) {
          // if the actual chunk file already exists here while writing the temp
          // chunk file, then it means the same ozone client request has
          // generated two raft log entries. This can happen either because
          // retryCache expired in Ratis (or log index mismatch/corruption in
          // Ratis). This can be solved by two approaches as of now:
          // 1. Read the complete data in the actual chunk file ,
          //    verify the data integrity and in case it mismatches , either
          // 2. Delete the chunk File and write the chunk again. For now,
          //    let's rewrite the chunk file
          // TODO: once the checksum support for write chunks gets plugged in,
          // the checksum needs to be verified for the actual chunk file and
          // the data to be written here which should be efficient and
          // it matches we can safely return without rewriting.
          LOG.warn("ChunkFile already exists {}. Deleting it.", chunkFile);
          FileUtil.fullyDelete(chunkFile);
        }
        if (tmpChunkFile.exists()) {
          // If the tmp chunk file already exists it means the raft log got
          // appended, but later on the log entry got truncated in Ratis leaving
          // behind garbage.
          // TODO: once the checksum support for data chunks gets plugged in,
          // instead of rewriting the chunk here, let's compare the checkSums
          LOG.warn("tmpChunkFile already exists {}. Overwriting it.",
                  tmpChunkFile);
        }
        // 3) In the data-writing stage, write the data into the temporary file
        ChunkUtils.writeData(tmpChunkFile, data, offset, len, volumeIOStats,
            doSyncWrite);
        // No need to increment container stats here, as still data is not
        // committed here.
        break;
      case COMMIT_DATA:
         ...
        // 4) In the commit stage, rename the temporary file to the final chunk file
        commitChunk(tmpChunkFile, chunkFile);
        // Increment container stats here, as we commit the data.
        containerData.updateWriteStats(len, isOverwrite);
        break;
      case COMBINED:
        // directly write to the chunk file
        ChunkUtils.writeData(chunkFile, data, offset, len, volumeIOStats,
            doSyncWrite);
        containerData.updateWriteStats(len, isOverwrite);
        break;
      default:
        throw new IOException("Can not identify write operation.");
      }
    } catch (StorageContainerException ex) {
      throw ex;
    } catch (IOException ex) {
      throw new StorageContainerException("Internal error: ", ex,
          IO_EXCEPTION);
    }
  }
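
The COMMIT_DATA branch above is abridged. Conceptually, commitChunk follows the classic write-to-a-temp-file-then-rename pattern, so readers never observe a half-written chunk. Here is a minimal, hedged sketch of that pattern in plain java.nio (an assumed illustration, not the actual Ozone helper):

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.StandardCopyOption;
  import java.nio.file.StandardOpenOption;

  public class TmpFileCommitSketch {

    // WRITE_DATA stage: write the chunk payload into a temporary file first.
    static void writeTmp(Path tmpChunkFile, ByteBuffer data) throws IOException {
      try (FileChannel ch = FileChannel.open(tmpChunkFile,
          StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        ch.write(data);
        ch.force(true); // make sure the bytes are on disk before committing
      }
    }

    // COMMIT_DATA stage: atomically move the temp file to the final chunk file name.
    static void commit(Path tmpChunkFile, Path chunkFile) throws IOException {
      Files.move(tmpChunkFile, chunkFile,
          StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }
  }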

Next, let's look at the implementation of the other chunk layout policy:

FilePerBlockStrategy.java

  @Override
  public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
      ChunkBuffer data, DispatcherContext dispatcherContext)
      throws StorageContainerException {

    checkLayoutVersion(container);

    Preconditions.checkNotNull(dispatcherContext);
    DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
    ...

    KeyValueContainerData containerData = (KeyValueContainerData) container
        .getContainerData();

    // 1) Likewise, resolve the path of the chunk file
    File chunkFile = getChunkFile(containerData, blockID);
    boolean overwrite = validateChunkForOverwrite(chunkFile, info);
    long len = info.getLen();
    // 2) Get the offset of this chunk within the Block, i.e. the write offset inside the block's chunk file
    long offset = info.getOffset();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Writing chunk {} (overwrite: {}) in stage {} to file {}",
          info, overwrite, stage, chunkFile);
    }

    HddsVolume volume = containerData.getVolume();
    VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
    // 3) Get the FileChannel for this chunk file from the open-file cache
    FileChannel channel = files.getChannel(chunkFile, doSyncWrite);
    // 4) Write the data at the specified offset
    ChunkUtils.writeData(channel, chunkFile.getName(), data, offset, len,
        volumeIOStats);

    containerData.updateWriteStats(len, overwrite);
  }

Because in FILE_PER_BLOCK mode a Block file may keep being written to over a continuous period of time, a FileChannel cache is implemented here to avoid repeatedly closing and reopening the file within a short time window.

  private static final class OpenFiles {

    private static final RemovalListener<String, OpenFile> ON_REMOVE =
        event -> close(event.getKey(), event.getValue());

    // Cache of OpenFile entries
    private final Cache<String, OpenFile> files = CacheBuilder.newBuilder()
        .expireAfterAccess(Duration.ofMinutes(10))
        .removalListener(ON_REMOVE)
        .build();

    /**
     * Gets the FileChannel of a chunk file, opening the file if necessary.
     */
    public FileChannel getChannel(File file, boolean sync)
        throws StorageContainerException {
      try {
        return files.get(file.getAbsolutePath(),
            () -> open(file, sync)).getChannel();
      } catch (ExecutionException e) {
        if (e.getCause() instanceof IOException) {
          throw new UncheckedIOException((IOException) e.getCause());
        }
        throw new StorageContainerException(e.getCause(),
            ContainerProtos.Result.CONTAINER_INTERNAL_ERROR);
      }
    }

    private static OpenFile open(File file, boolean sync) {
      try {
        return new OpenFile(file, sync);
      } catch (FileNotFoundException e) {
        throw new UncheckedIOException(e);
      }
    }
    /**
     * Invalidates the cache entry of an open file; the removal listener then closes the underlying file.
     */
    public void close(File file) {
      if (file != null) {
        files.invalidate(file.getAbsolutePath());
      }
    }
    ...
}

According to the community's comparison of write test results for the two modes, the new chunk layout is considerably more efficient than the original FILE_PER_CHUNK mode, and FILE_PER_BLOCK has since been made the default chunk layout. The related configuration is as follows:

  <property>
    <name>ozone.scm.chunk.layout</name>
    <value>FILE_PER_BLOCK</value>
    <tag>OZONE, SCM, CONTAINER, PERFORMANCE</tag>
    <description>
      Chunk layout defines how chunks, blocks and containers are stored on disk.
      Each chunk is stored separately with FILE_PER_CHUNK.  All chunks of a
      block are stored in the same file with FILE_PER_BLOCK.  The default is
      FILE_PER_BLOCK.
    </description>
  </property>
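
For completeness, the configured layout can also be read programmatically via the standard Hadoop Configuration API (a small illustrative snippet, assuming OzoneConfiguration from hadoop-hdds-common is on the classpath; the key and default come from the property above):

  import org.apache.hadoop.hdds.conf.OzoneConfiguration;

  public class ChunkLayoutConfigSketch {
    public static void main(String[] args) {
      OzoneConfiguration conf = new OzoneConfiguration();
      // Falls back to FILE_PER_BLOCK, the default in recent Ozone versions.
      String layout = conf.get("ozone.scm.chunk.layout", "FILE_PER_BLOCK");
      System.out.println("Configured chunk layout: " + layout);
    }
  }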

On-Disk Comparison of the Old and New Chunk Layouts


I tested both chunk layout modes on a test cluster to see how chunks are actually stored on disk. The results are as follows:

FILE_PER_BLOCK layout mode:

[hdfs@lyq containerDir0]$ cd 11/chunks/
[hdfs@lyq chunks]$ ll
total 16384
-rw-rw-r-- 1 hdfs hdfs 16777216 Mar 14 08:32 103822128652419072.block

FILE_PER_CHUNK layout mode:

[hdfs@lyq ~]$ ls -l /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
-rw-r--r-- 1 hdfs hdfs 12 Dec 24 07:56 /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
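
The naming difference above also shapes the read path. As a rough, hedged illustration (not Ozone code; the file-name patterns follow the listings above), a read at a logical offset within a block maps to a single positional read in the .block file under FILE_PER_BLOCK, whereas FILE_PER_CHUNK must first locate the covering chunk file and then read at an offset relative to that file:

  // Illustrative mapping of a logical block offset to a physical file + offset.
  public class ReadMappingSketch {

    // FILE_PER_BLOCK: one file per block, read directly at the logical offset.
    static String locatePerBlock(long blockId, long logicalOffset) {
      return blockId + ".block @ offset " + logicalOffset;
    }

    // FILE_PER_CHUNK: find the chunk file covering the offset, then read at
    // an offset relative to that chunk file.
    static String locatePerChunk(long blockId, long chunkSize, long logicalOffset) {
      long chunkIndex = logicalOffset / chunkSize;    // 0-based chunk index
      long offsetInChunk = logicalOffset % chunkSize; // offset inside that chunk file
      return blockId + "_chunk_" + (chunkIndex + 1) + " @ offset " + offsetInChunk;
    }

    public static void main(String[] args) {
      long blockId = 103822128652419072L;
      long chunkSize = 4L * 1024 * 1024;
      long logicalOffset = 6L * 1024 * 1024; // 6 MB into the block
      System.out.println(locatePerBlock(blockId, logicalOffset));
      System.out.println(locatePerChunk(blockId, chunkSize, logicalOffset));
    }
  }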

References


[1] https://issues.apache.org/jira/browse/HDDS-2717
