Apache Avro 简介

2022-10-15 12:23:33

一、引言

1、简介

　　Avro是Hadoop中的一个子项目，也是Apache中一个独立的项目，Avro是一个基于二进制数据传输高性能的中间件。在Hadoop的其他项目中例如HBase(Ref)和Hive(Ref)的Client端与服务端的数据传输也采用了这个工具。Avro是一个数据序列化的系统，可以将数据结构或对象转化成便于存储或传输的格式。Avro设计的初衷是用来解决Hadoop中Writable类型的不足：缺乏语言的可移值性。

2、特点

Ø 丰富的数据结构类型；

Ø 快速可压缩的二进制数据形式，对数据二进制序列化后可以节约数据存储空间和网络传输带宽；

Ø 存储持久数据的文件容器；

Ø 可以实现远程过程调用RPC；

Ø 简单的动态语言结合功能。

　　avro支持跨编程语言实现（C, C++, C#，Java, Python, Ruby, PHP），类似于Thrift，但是avro的显著特征是：avro依赖于模式，动态加载相关数据的模式，Avro数据的读写操作很频繁，而这些操作使用的都是模式，这样就减少写入每个数据文件的开销，使得序列化快速而又轻巧。这种数据及其模式的自我描述方便了动态脚本语言的使用。当Avro数据存储到文件中时，它的模式也随之存储，这样任何程序都可以对文件进行处理。如果读取数据时使用的模式与写入数据时使用的模式不同，也很容易解决，因为读取和写入的模式都是已知的。

　　Avro和动态语言结合后，读/写数据文件和使用RPC协议都不需要生成代码，而代码生成作为一种可选的优化只需要在静态类型语言中实现。Avro依赖于模式(Schema)。通过模式定义各种数据结构，只有确定了模式才能对数据进行解释，所以在数据的序列化和反序列化之前，必须先确定模式的结构。正是模式的引入，使得数据具有了自描述的功能，同时能够实现动态加载，另外与其他的数据序列化系统如Thrift相比，数据之间不存在其他的任何标识，有利于提高数据处理的效率。

二、技术要领

1、类型

　　数据类型标准化的意义：一方面使不同系统对相同的数据能够正确解析，另一方面，数据类型的标准定义有利于数据序列化/反序列化。简单的数据类型：Avro定义了几种简单数据类型，下表是其简单说明：

类型	说明
null	no value
boolean	a binary value
int	32-bit signed integer
long	64-bit signed integer
float	single precision (32-bit) IEEE 754 floating-point number
double	double precision (64-bit) IEEE 754 floating-point number
bytes	sequence of 8-bit unsigned bytes
string	unicode character sequence

　　简单数据类型由类型名称定义，不包含属性信息，例如字符串定义如下：{"type": "string"}。

　　复杂数据类型：Avro定义了六种复杂数据类型，每一种复杂数据类型都具有独特的属性，下表就每一种复杂数据类型进行说明。

类型	属性	说明
Records	type name	record
	name	a JSON string providing the name of the record (required).
	namespace	a JSON string that qualifies the name(optional).
	doc	a JSON string providing documentation to the user of this schema (optional).
	aliases	a JSON array of strings, providing alternate names for this record (optional).
	fields	a JSON array, listing fields (required).
	name	a JSON string.
	type	a schema/a string of defined record.
	default	a default value for field when lack.
	order	ordering of this field.
Enums	type name	enum
	name	a JSON string providing the name of the enum (required).
	namespace	a JSON string that qualifies the name.
	doc	a JSON string providing documentation to the user of this schema (optional).
	aliases	a JSON array of strings, providing alternate names for this enum (optional)
	symbols	a JSON array, listing symbols, as JSON strings (required). All symbols in an enum must be unique.
Arrays	type name	array
	items	the schema of the array’s items.
Maps	type name	map
	values	the schema of the map’s values.
Fixed	type name	fixed
	name	a string naming this fixed (required).
	namespace	a string that qualifies the name.
	aliases	a JSON array of strings, providing alternate names for this enum (optional).
	size	an integer, specifying the number of bytes per value (required).
Unions		a JSON arrays

　　每一种复杂数据类型都含有各自的一些属性，其中部分属性是必需的，部分是可选的。

　　这里需要说明Record类型中field属性的默认值，当Record Schema实例数据中某个field属性没有提供实例数据时，则由默认值提供，具体值见下表。Union的field默认值由Union定义中的第一个Schema决定。

https://www.cnblogs.com/duanxz/p/3799256.html

Apache Avro 简介

码农公寓

相关文章