Reference: https://www.soinside.com/question/JjhZCytMUFpTNyk6W7ixZa
(I couldn't find the real original; judging by the clumsy translation there, it was most likely scraped from Stack Overflow.)
Flatten a DataFrame holding the following data:
val json_string = """{ "Total Value": 3, "Topic": "Example", "values": [ { "value1": "#example1", "points": [ [ "123", "156" ] ], "properties": { "date": "12-04-19", "model": "Model example 1" } }, {"value2": "#example2", "points": [ [ "124", "157" ] ], "properties": { "date": "12-05-19", "model": "Model example 2" } } ] }"""
The desired output:
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
|Total Value|   Topic   | values 1 | values.points[0] | values.points[1] | values.properties.date |   values.properties.model   |
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
|     3     |  Example  | example1 |       123        |       156        |        12-04-19        |       Model example 1       |
|     3     |  Example  | example2 |       124        |       157        |        12-05-19        |       Model example 2       |
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
Solution:
Use Spark's built-in functions, in particular explode (it lives in the org.apache.spark.sql.functions package, which spark-shell imports automatically).
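For reference, explode turns an array column into one output row per array element, duplicating the other columns alongside. A minimal self-contained sketch with toy data (not the question's JSON; the output is roughly what spark-shell would print):

scala> Seq((1, Seq("a", "b"))).toDF("id", "xs")
     |   .select($"id", explode($"xs").as("x")).show
+---+---+
| id|  x|
+---+---+
|  1|  a|
|  1|  b|
+---+---+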
First, "explode" open the first level:
scala> val df = spark.read.json(Seq(json_string).toDS)
scala> val dfd = df.select($"topic", $"total value", explode($"values").as("values"))
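Note that $"topic" and $"total value" resolve against the JSON's "Topic" and "Total Value" because Spark matches column names case-insensitively by default (spark.sql.caseSensitive=false). After the explode, each element of the values array becomes its own row holding a single struct, which is why the next step can address fields with dotted paths. A schema check makes this visible (the output below is a sketch of what Spark would typically infer for this JSON; exact field casing and nullability may vary by version):

scala> dfd.printSchema
root
 |-- topic: string (nullable = true)
 |-- total value: long (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- points: array (nullable = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- properties: struct (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- model: string (nullable = true)
 |    |-- value1: string (nullable = true)
 |    |-- value2: string (nullable = true)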
Then select the columns you want from the second level:
scala> dfd.select(
     |   $"topic",
     |   $"total value",
     |   $"values.points".getItem(0)(0).as("point_0"),
     |   $"values.points".getItem(0)(1).as("point_1"),
     |   $"values.properties.date".as("_date"),
     |   $"values.properties.model".as("_model")
     | ).show
+-------+-----------+-------+-------+--------+---------------+
|  topic|total value|point_0|point_1|   _date|         _model|
+-------+-----------+-------+-------+--------+---------------+
|Example|          3|    123|    156|12-04-19|Model example 1|
|Example|          3|    124|    157|12-05-19|Model example 2|
+-------+-----------+-------+-------+--------+---------------+
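The desired output also had a values 1 column, which the answer above leaves out. Because the sample data uses two different keys (value1 in the first element, value2 in the second), Spark infers them as two separate nullable string fields. One way to merge them into a single column and strip the leading # is sketched below; coalesce and regexp_replace are standard functions from org.apache.spark.sql.functions, and the alias values_1 is my own naming:

scala> dfd.select(
     |   regexp_replace(coalesce($"values.value1", $"values.value2"), "^#", "").as("values_1")
     | ).show
+--------+
|values_1|
+--------+
|example1|
|example2|
+--------+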