Presto Core Data Structures: Slice, Block & Page

Presto Core Data Structures: Slice, Block & Page

Overview

In Presto there are some very essential data structure we need to understand, Slice, Block & Page are three of them, I will introduce these data structures in this article.

Slice

From user point of view, Slice is a more developer friendly virtual memory, it defines a set of getters and setters, so you can use the memory like a piece of structured data:

Presto Core Data Structures: Slice, Block & Page

The typical usage for Slice is to represent a String:

// use it as utf8 encoded string
Slice slice = Slices.utf8Slice("hello");
Slice subSlice = SliceUtf8.substring(slice, 1, 2);

You can see we are using Slice just like a String, the reason that Presto prefers Slice over Strings is:

  • Strings are expensive to build(String concatenation, StringBuilder etc).
  • Slice is mutable while String is not, so it is more efficient when we need to do string calculation.
  • Strings are encoded as UTF16 in memory, while Slice use UTF8, which is more memory efficient.

    • UTF16 use minimum two bytes to represent a character, while UTF8 use minimum one bytes, so if the String content is mainly ascii characters, UTF8 can save a lot of memory.

Thanks @dain for sharing these insigts.

Another usage for Slice(in Presto) is to represent raw bytes(VARBINARY type in SQL):

// use it as raw bytes
block.getSlice().getBytes()

You can get a Slice from a Block just like primitive types Int, Long etc.

Block

Since Page is consisted of Blocks, so we will introduce Block first. Block can be think of an array of same class(int, long, Slice etc.) of data. Each data item occupies a position, the total position count represents the total number of rows of data the Block is holding (Block only holds one column of these rows).

Presto Core Data Structures: Slice, Block & Page

Block has defined several sets of APIs, one of them is the getXXX methods, lets take at getInt as an sample:

/**
    * Gets a little endian int at {@code offset} in the value at {@code position}.
    */
default int getInt(int position, int offset)
{
    throw new UnsupportedOperationException(getClass().getName());
}

Typically one Block only supports one getXxx method, because all the data in one Block belongs to one column which has one certain type.

Another method Block defined is copyPositions, instead of get one value from Block, it gets a list of values specified by a list of positions as a new Block, I suspect this is used in projection.

/**
 * Returns a block containing the specified positions.
 * All specified positions must be valid for this block.
 * <p>
 * The returned block must be a compact representation of the original block.
 */
Block copyPositions(List<Integer> positions);

Along with Block, Presto also defined BlockEncoding which determines how Block is serialized & deserialized:

public interface BlockEncoding
{
    ...

    /**
     * Read a block from the specified input.  The returned
     * block should begin at the specified position.
     */
    Block readBlock(SliceInput input);

    /**
     * Write the specified block to the specified output
     */
    void writeBlock(SliceOutput sliceOutput, Block block);
    ...
}

Lets take the simplest BlockEncoding: IntArrayBlockEncoding as an example, its readBlock looks like this:

int positionCount = block.getPositionCount();
sliceOutput.appendInt(positionCount);

encodeNullsAsBits(sliceOutput, block);

for (int position = 0; position < positionCount; position++) {
    if (!block.isNull(position)) {
        sliceOutput.writeInt(block.getInt(position, 0));
    }
}

As we can see from the code, it first encode the position, then encode data isNull as a bitmap, and finally write all the ints into the output. Looks very similar as network protocol code.

Page

Page is consisted of blocks:

public class Page
{
    private final Block[] blocks;
    private final int positionCount;
    ...
}

Besides blocks, Page has another concept called channel: every Block is a channel for the Page, the total count of blocks is the channel count. So lets summarize how data is structured here, When there's some rows to send, Presto will

  • Put every column into a separate Block.
  • Put these Blocks into a Page.
  • Send the Page

Page is the data structure which holds data and being transferred between Presto physical execution operators: upstream operator produce output through getOutput():

/**
 * Gets an output page from the operator.  If no output data is currently
 * available, return null.
 */
Page getOutput();

Downstream operators get input through addInput() method:

/**
 * Adds an input page to the operator.  This method will only be called if
 * {@code needsInput()} returns true.
 */
void addInput(Page page);

Presto Core Data Structures: Slice, Block & Page

Just like Block, Page also needs to serialized & deserialized, serialization happens when data needs to transferred between workers. When Page is serialized, it will first encode the Blocks using corresponding BlockEncoding, and then if a compressor is available, it will try to compress the encoded block data, if compression works well(have encoding rate lower than: 0.8), it will use the compressed the data, otherwise the uncompress data is used. The encoded block data will be put into a class named SerializedPage along with some statistic information: the byte size of the page before & after compression.

Summary

Today we introduced the three core data structures in Presto: Slice, Block and Page. In short, Slice is more developer friendly virtual memory, Block represents a column, Page represents a row group. Hope you enjoyed this article.

上一篇:DLA支持Parquet/ORC/OTS表的Alter Table Add Column


下一篇:带你领略Java运算符之美 | 带你学《Java编程入门》之四