varints

2022-11-18 09:56:20

Protocol Buffer技术详解(数据编码) - Stephen_Liu - 博客园 https://www.cnblogs.com/stephen-liu74/archive/2013/01/08/2845994.html

Posted on 2013-01-08 08:52 Stephen_Liu 阅读(16562) 评论(2) 编辑收藏

之前已经发了三篇有关Protocol Buffer的技术博客，其中第一篇介绍了Protocol Buffer的语言规范，而后两篇则分别基于C++和Java给出了一些相对比较实用而又简单的示例。由于近期工作压力很大，因此对于是否继续写本篇博客也确实让我纠结了几天。但每每想到善终如始则无败事这句话时，最终的决定还是既然开始了，就要尽自己最大的努力去做，而不要留有丝毫的遗憾。
      该篇Blog的内容将完全取自于Google的官方文档，只是为一些相对难以理解的技术点加入了适当的注解。但因技术能力有限，如解释有误，欢迎指正。
      这是一篇让你对Protocol Buffer知其然亦知其所以然的文档，即便你在并不了解这其中的技术细节和处理机制的情况下，仍然能够在你的应用程序中正常的使用Protocol Buffer，然而我相信，通过对这些细节和机制的深入了解，不仅可以让你更好的使用和驾驭Protocol Buffer，而且还能深深地感受到Google工程师的智慧和高超的编程技艺，因此在我看来，深入的研习对我们编程能力的提高和思路的拓宽都是大有裨益的。不积跬步无以致千里。

      一、简单消息编码布局：
让我们先看一下下面的消息定义示例：
message Test1 {
          required int32 a = 1;
      }
      假设我们在应用程序中将字段a的值设置为150（十进制），此后再将该对象序列化到Binary文件中，你可以看到文件的数据为：
      08 96 01
      这3个字节的含义又是什么呢？它们又是按照什么样的编码规则生成的呢？让我们拭目以待。

      二、Base 128 Varints：
      在理解Protocol Buffer的编码规则之前，你首先需要了解varints。varints是一种使用一个或多个字节表示整型数据的方法。其中数值本身越小，其所占用的字节数越少。
      在varint中，除了最后一个字节之外的每个字节中都包含一个msb(most significant bit)设置(使用最高位)，这意味着其后的字节是否和当前字节一起来表示同一个整型数值。而字节中的其余七位将用于存储数据本身。由此我们可以简单的解释一下Base 128，通常而言，整数数值都是由字节表示，其中每个字节为8位，即Base 256。然而在Protocol Buffer的编码中，最高位成为了msb，只有后面的7位存储实际的数据，因此我们称其为Base 128（2的7次方）。
      比如数字1，它本身只占用一个字节即可表示，所以它的msb没有被设置，如：
      0000 0001
      再比如十进制数字300，它的编码后表示形式为：
      1010 1100 0000 0010
      对于Protocol Buffer而言又是如何将上面的字节布局还原成300呢？这里我们需要做的第一步是drop掉每个字节的msb。从上例中可以看出第一个字节（1010 1100）的msb（最高位）被设置为1，这说明后面的字节将连同该字节表示同一个数值，而第二个字节（0000 0010）的msb为0，因此该字节将为表示该数值的最后一个字节了，后面如果还有其他的字节数据，将表示其他的数据。
      1010 1100 0000 0010
      -> 010 1100 000 0010
      上例中的第二行已经将第一行中每一个字节的msb去除。由于Protocol Buffer是按照Little Endian的方式进行数据布局的，因此我们这里需要将两个字节的位置进行翻转。
      010 1100 000 0010
      -> 000 0010 010 1100           //翻转第一行的两个字节
      -> 100101100                         //将翻转后的两个字节直接连接并去除高位0
      -> 256 + 32 + 8 + 4 = 300    //将上一行的二进制数据换算成十进制，其值为300

      三、消息结构：
      Protocol Buffer中的消息都是由一系列的键值对构成的。每个消息的二进制版本都是使用标签号作为key，而每一个字段的名字和类型均是在解码的过程中根据目标类型（反序列化后的对象类型）进行配对的。在进行消息编码时，key/value被连接成字节流。在解码时，解析器可以直接跳过不识别的字段，这样就可以保证新老版本消息定义在新老程序之间的兼容性，从而有效的避免了使用older消息格式的older程序在解析newer程序发来的newer消息时，一旦遇到未知（新添加的）字段时而引发的解析和对象初始化的错误。最后，我们介绍一下字段标号和字段类型是如何进行编码的。下面先列出Protocol Buffer可以支持的字段类型。

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	groups (deprecated)
4	End group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

由于在编码后每一个字段的key都是varint类型，key的值是由字段标号和字段类型合成编码所得，其公式如下：
      field_number << 3 | field_type
      由此看出，key的最后3个bits用于存储字段的类型信息。那么在使用该编码时，Protocol Buffer所支持的字段类型将不会超过8种。这里我们可以进一步计算出Protocol Buffer在一个消息中可以支持的字段数量为2的29次方减一。现在我们再来回顾一下之前给出的Test1消息被序列化后的第一个字节08的由来。
      0000 1000
      -> 000 1000                  //drop掉msb（最高位）
      最低的3位表示字段类型，即0为varint。我们再将结果右移3位( >> 3)，此时得到的结果为1，即字段a在消息Test1中的标签号。通过这样的结果，Protocol Buffer的解码器可以获悉当前字段的标签号是1，其后所跟随数据的类型为varint。现在我们可以继续利用上面讲到的知识分析出后两个字节(96 01)的由来。
      96 01 = 1001 0110 0000 0001
          -> 001 0110 000 0001   //drop两个字节的msb
          -> 000 0001 001 0110   //翻转高低字节
          -> 10010110                   //去掉最高位中没用的0
          -> 128 + 16 + 4 + 2 = 150

四、更多的值类型：
      1. 有符号整型
如前所述，类型0表示varint，其中包含int32/int64/uint32/uint64/sint32/sint64/bool/enum。在实际使用中，如果当前字段可以表示为负数，那么对于int32/int64和sint32/sint64而言，它们在进行编码时将存在着较大的差别。如果使用int32/int64表示一个负数，该字段的值无论是-1还是-2147483648，其编码后长度将始终为10个字节，就如同对待一个很大的无符号整型一样。反之，如果使用的是sint32/sint64，Protocol Buffer将会采用ZigZag编码方式，其编码后的结果将会更加高效。
      这里简单讲述一下ZigZag编码，该编码会将有符号整型映射为无符号整型，以便绝对值较小的负数仍然可以有较小的varint编码值，如-1。下面是ZigZag对照表：

Signed Original	Encoded As
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295

其公式为：
(n << 1) ^ (n >> 31)    //sint32
      (n << 1> ^ (n >> 63)   //sint64
      需要补充说明的是，Protocol Buffer在实现上述位移操作时均采用的算术位移，因此对于(n >> 31)和(n >> 63)而言，如果n为负值位移后的结果就是-1，否则就是0。
      注：简单解释一下C语言中的算术位移和逻辑位移。他们的左移操作都是相同的，即低位补0，高位直接移除。不同的是右移操作，逻辑位移比较简单，高位全部补0。而算术位移则需要视当前值的符号位而定，补进的位和符号位相同，即正数全补0，负数全补1。换句话说，算术位移右移时要保证符号位的一致性。在C语言中，如果使用 int变量位移时就是算术位移，uint变量位移时是逻辑位移。
      2. Non-varint数值型
      double/fixed64始终都占用8个字节，float/fixed32始终占用4个字节。
3. Strings
其类型值为2，key信息之后是字节数组的长度信息，最后在紧随指定长度的实际数据值信息。如：
      message Test2 {
          required string b = 2;
      }
      现在我们设置b的值为"testing"。其编码后数据如下：
      12 07 74 65 73 74 69 6E 67
      第一个字节0x12表示key，通过解码可以得到字段类型2和字段标号2。第二个字节07表示testing的长度。后面7个红色高亮的字节则表示testing。

      五、嵌入消息：
这里是一个包含嵌入消息的消息定义。
message Test3 {
          required Test1 c = 3;
      }
      此时我们先将Test1的a字段值设置为150，其编码结果如下：
      1A 03 08 96 01
从上面的结果可以看出08 96 01和之前直接编码Test1时是完全一致的，只是在前面增加了key(字段类型 + 标号)和长度信息。新增信息的解码方式和含义与前面的Strings完全相同，这里不再重复解释了。

      六、Packed Repeated Fields：
Protocol Buffer从2.1.0版本开始引入了[pack = true]的字段级别选项。如果设置该选项，那么元素数量为0的repeated字段将不会被编码，否则数组中的所有元素会被编码成一个单一的key/value形式。毕竟数组中的每一个元素都具有相同的字段类型和标号。该编码形式，对包含较小值的整型元素而言，优化后的编码结果可以节省更多的空间。如：
message Test4 {
          repeated int32 d = 4 [pack=true];
}
      这里我们假设d字段包含3个元素，值分别为3,270,86942。编码结果如下：
      22             //key (字段标号4，类型为2)
      06             //数据中所有元素所占用的字节数量
      03             //第一个元素(varint 3)
      8E 02        //第二个元素(varint 270)
9E A7 05  //第三个元素(varint 86942)

七、字段顺序:
在.proto文件中定义消息的字段标号时，可以是不连续的，但是如果将其定义为连续递增的数值，将获得更好的编码和解码性能。

结束语：
本篇博客是Protocol Buffer技术详解系列的最后一篇博客，同时该系列博客又将是开源学习之旅系列主题中的第一个系列，希望今后能够借此平台与大家进行更多的技术交流，共同提高。如有意见或问题，欢迎留言。

https://developers.google.com/protocol-buffers/docs/encoding

Encoding

This document describes the binary wire format for protocol buffer messages. You don't need to understand this to use protocol buffers in your applications, but it can be very useful to know how different protocol buffer formats affect the size of your encoded messages.

A Simple Message

Let's say you have the following very simple message definition:

message Test1 {

  optional int32 a = 1;

}

In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded message, you'd see three bytes:

08 96 01

So far, so small and numeric – but what does it mean? Read on...

Base 128 Varints

To understand your simple protocol buffer encoding, you first need to understand varints. Varints are a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.

Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

So, for example, here is the number 1 – it's a single byte, so the msb is not set:

0000 0001

And here is 300 – this is a bit more complicated:

1010 1100 0000 0010

How do you figure out that this is 300? First you drop the msb from each byte, as this is just there to tell us whether we've reached the end of the number (as you can see, it's set in the first byte as there is more than one byte in the varint):

1010 1100 0000 0010

→ 010 1100  000 0010

You reverse the two groups of 7 bits because, as you remember, varints store numbers with the least significant group first. Then you concatenate them to get your final value:

000 0010  010 1100

→  000 0010 ++ 010 1100

→  100101100

→  256 + 32 + 8 + 4 = 300

Message Structure

As you know, a protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field's number as the key – the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values – the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value. In most language implementations this key is referred to as a tag.

The available wire types are as follows:

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	groups (deprecated)
4	End group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.

Now let's look at our simple example again. You now know that the first number in the stream is always a varint key, and here it's 08, or (dropping the msb):

000 1000

You take the last three bits to get the wire type (0) and then right-shift by three to get the field number (1). So you now know that the field number is 1 and the following value is a varint. Using your varint-decoding knowledge from the previous section, you can see that the next two bytes store the value 150.

96 01 = 1001 0110  0000 0001

       → 000 0001  ++  001 0110 (drop the msb and reverse the groups of 7 bits)

       → 10010110

       → 128 + 16 + 4 + 2 = 150

More Value Types

Signed Integers

As you saw in the previous section, all the protocol buffer types associated with wire type 0 are encoded as varints. However, there is an important difference between the signed int types (sint32 and sint64) and the "standard" int types (int32 and int64) when it comes to encoding negative numbers. If you use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long – it is, effectively, treated like a very large unsigned integer. If you use one of the signed types, the resulting varint uses ZigZag encoding, which is much more efficient.

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as you can see in the following table:

Signed Original	Encoded As
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295

In other words, each value n is encoded using

(n << 1) ^ (n >> 31)

for sint32s, or

(n << 1) ^ (n >> 63)

for the 64-bit version.

Note that the second shift – the (n >> 31) part – is an arithmetic shift. So, in other words, the result of the shift is either a number that is all zero bits (if n is positive) or all one bits (if n is negative).

When the sint32 or sint64 is parsed, its value is decoded back to the original, signed version.

Non-varint Numbers

Non-varint numeric types are simple – double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit lump of data; similarly float and fixed32 have wire type 5, which tells it to expect 32 bits. In both cases the values are stored in little-endian byte order.

Strings

A wire type of 2 (length-delimited) means that the value is a varint encoded length followed by the specified number of bytes of data.

message Test2 {

  optional string b = 2;

}

Setting the value of b to "testing" gives you:

12 07 74 65 73 74 69 6e 67

The red bytes are the UTF8 of "testing". The key here is 0x12 → field number = 2, type = 2. The length varint in the value is 7 and lo and behold, we find seven bytes following it – our string.

Embedded Messages

Here's a message definition with an embedded message of our example type, Test1:

message Test3 {

  optional Test1 c = 3;

}

And here's the encoded version, again with the Test1's a field set to 150:

 1a 03 08 96 01

As you can see, the last three bytes are exactly the same as our first example (08 96 01), and they're preceded by the number 3 – embedded messages are treated in exactly the same way as strings (wire type = 2).

Optional And Repeated Elements

If a proto2 message definition has repeated elements (without the [packed=true] option), the encoded message has zero or more key-value pairs with the same field number. These repeated values do not have to appear consecutively; they may be interleaved with other fields. The order of the elements with respect to each other is preserved when parsing, though the ordering with respect to other fields is lost. In proto3, repeated fields use packed encoding, which you can read about below.

For any non-repeated fields in proto3, or optional fields in proto2, the encoded message may or may not have a key-value pair with that field number.

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the Message::MergeFrom method – that is, all singular scalar fields in the latter instance replace those in the former, singular embedded messages are merged, and repeated fields are concatenated. The effect of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as if you had parsed the two messages separately and merged the resulting objects. That is, this:

MyMessage message;

message.ParseFromString(str1 + str2);

is equivalent to this:

MyMessage message, message2;

message.ParseFromString(str1);

message2.ParseFromString(str2);

message.MergeFrom(message2);

This property is occasionally useful, as it allows you to merge two messages even if you do not know their types.

Packed Repeated Fields

Version 2.1.0 introduced packed repeated fields, which in proto2 are declared like repeated fields but with the special [packed=true] option. In proto3, repeated fields are packed by default. These function like repeated fields, but are encoded differently. A packed repeated field containing zero elements does not appear in the encoded message. Otherwise, all of the elements of the field are packed into a single key-value pair with wire type 2 (length-delimited). Each element is encoded the same way it would be normally, except without a key preceding it.

For example, imagine you have the message type:

message Test4 {

  repeated int32 d = 4 [packed=true];

}

Now let's say you construct a Test4, providing the values 3, 270, and 86942 for the repeated field d. Then, the encoded form would be:

22        // key (field number 4, wire type 2)

06        // payload size (6 bytes)

03        // first element (varint 3)

8E 02     // second element (varint 270)

9E A7 05  // third element (varint 86942)

Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

Note that although there's usually no reason to encode more than one key-value pair for a packed repeated field, encoders must be prepared to accept multiple key-value pairs. In this case, the payloads should be concatenated. Each pair must contain a whole number of elements.

Protocol buffer parsers must be able to parse repeated fields that were compiled as packed as if they were not packed, and vice versa. This permits adding [packed=true] to existing fields in a forward- and backward-compatible way.

Field Order

While you can use field numbers in any order in a .proto, when a message is serialized its known fields should be written sequentially by field number, as in the provided C++, Java, and Python serialization code. This allows parsing code to use optimizations that rely on field numbers being in sequence. However, protocol buffer parsers must be able to parse fields in any order, as not all messages are created by simply serializing an object – for instance, it's sometimes useful to merge two messages by simply concatenating them.

If a message has unknown fields, the current Java and C++ implementations write them in arbitrary order after the sequentially-ordered known fields. The current Python implementation does not track unknown fields.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated May 8, 2018.

码农公寓