背景
用户有个场景,就是所有JSON数据对象连续存放在同一行。例如:
{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}
如何对这种数据文件进行分析、查询。通常的做法是用Spark、Hive等大数据平台,编写UDF函数,或者自定义实现Hive TextInputFormat,来处理这种类型的数据。阿里云Data Lake Analytics内置化支持对这种数据的处理能力,无需编写UDF,无需自定义实现Hive TextInputFormat,仅仅需要3步,就可以在云上轻松处理、查询、分析这种JSON数据。如果您还有其他特殊JSON数据存储格式需要最为便捷的处理方法,请到阿里云Data Lake Analytics(https://www.aliyun.com/product/datalakeanalytics)云产品提工单,我们以最快的速度满足您的需求。
第一步:JSON数据到阿里云OSS
利用各种手段,将JSON数据投递到OSS(https://www.aliyun.com/product/oss)中。
通常,对于云上日志链路,还有一种JSON到OSS的投递链路,可以参考“云原生日志数据分析上手指南”(https://zhuanlan.zhihu.com/p/69399941)其中的JSON部分。
第二步:DLA中建表
参考上述“云原生日志数据分析上手指南”(https://zhuanlan.zhihu.com/p/69399941),其中已经有海量JSON数据的分区模式建表方法了。本例中,以非分区表为例,假设,数据文件中每一行一个JSON数据,JSON数据放置的OSS路径为:
oss://your_bucket/json_data/...
则,在DLA中执行建表:
CREATE EXTERNAL TABLE single_line_json (
data STRING,
)
STORED AS
INPUTFORMAT 'com.alibaba.cloud.dla.inputformat.JsonSingleLineInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://your_bucket/json_data/';
第三步:利用DLA函数进行处理
利用如下函数对JSON数据进行补全:
wrap_ifnotexist(target_string, '{', '}')
然后即可使用Data Lake Analytics支持的各种JSON函数进行处理:
- DLA JSON函数:https://help.aliyun.com/document_detail/71064.html
- 云上最便捷的JSON数据处理:https://yq.aliyun.com/articles/706812
- DLA新增函数的JSON数组处理部分:https://yq.aliyun.com/articles/706150
例如:
select json_parse(wrap_ifnotexist(data, '{', '}')) as a from single_line_json;
->
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| a |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]} |
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]} |
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]} |
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}} |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
Data Lake Analytics首购和流量包优惠
首购用户1元10TB,流量包阶梯折扣优惠:https://et.aliyun.com/bdad/datalake
产品详情:https://www.aliyun.com/product/datalakeanalytics
云栖社区:https://yq.aliyun.com/teams/396
知乎社区:https://zhuanlan.zhihu.com/data-lake-analytics