DLA处理JSON对象连续存放到同一行的场景

背景

用户有个场景,就是所有JSON数据对象连续存放在同一行。例如:

{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}{"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}

如何对这种数据文件进行分析、查询。通常的做法是用Spark、Hive等大数据平台,编写UDF函数,或者自定义实现Hive TextInputFormat,来处理这种类型的数据。阿里云Data Lake Analytics内置化支持对这种数据的处理能力,无需编写UDF,无需自定义实现Hive TextInputFormat,仅仅需要3步,就可以在云上轻松处理、查询、分析这种JSON数据。如果您还有其他特殊JSON数据存储格式需要最为便捷的处理方法,请到阿里云Data Lake Analytics(https://www.aliyun.com/product/datalakeanalytics)云产品提工单,我们以最快的速度满足您的需求。

第一步:JSON数据到阿里云OSS

利用各种手段,将JSON数据投递到OSS(https://www.aliyun.com/product/oss)中。
通常,对于云上日志链路,还有一种JSON到OSS的投递链路,可以参考“云原生日志数据分析上手指南”(https://zhuanlan.zhihu.com/p/69399941)其中的JSON部分。

第二步:DLA中建表

参考上述“云原生日志数据分析上手指南”(https://zhuanlan.zhihu.com/p/69399941),其中已经有海量JSON数据的分区模式建表方法了。本例中,以非分区表为例,假设,数据文件中每一行一个JSON数据,JSON数据放置的OSS路径为:

oss://your_bucket/json_data/...

则,在DLA中执行建表:

CREATE EXTERNAL TABLE single_line_json (
   data STRING,
)
STORED AS 
INPUTFORMAT 'com.alibaba.cloud.dla.inputformat.JsonSingleLineInputFormat'  
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://your_bucket/json_data/';

第三步:利用DLA函数进行处理

利用如下函数对JSON数据进行补全:

wrap_ifnotexist(target_string, '{', '}')

然后即可使用Data Lake Analytics支持的各种JSON函数进行处理:

例如:

select json_parse(wrap_ifnotexist(data, '{', '}')) as a from single_line_json;

->

+-----------------------------------------------------------------------------------------------------------------------------------------------+
| a                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}   |
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}   |
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}   |
| {"employees":[{"firstName":"Bill","lastName":"Gates"},{"firstName":"George","lastName":"Bush"},{"firstName":"Thomas","lastName":"Carter"}]}} |
+-----------------------------------------------------------------------------------------------------------------------------------------------+

Data Lake Analytics首购和流量包优惠

首购用户1元10TB,流量包阶梯折扣优惠:https://et.aliyun.com/bdad/datalake
产品详情:https://www.aliyun.com/product/datalakeanalytics
云栖社区:https://yq.aliyun.com/teams/396
知乎社区:https://zhuanlan.zhihu.com/data-lake-analytics

上一篇:云上最便捷的JSON数据处理


下一篇:MaxCompute数据仓库在更新插入、直接加载、全量历史表三大算法中的数据转换实践