Joining two files on HDFS with Spark SQL

order_created.txt: tab-separated, two columns per row: order_no (order number) and create_date (order creation time, a timestamp with millisecond precision and a timezone offset). The sample file has five rows.

order_picked.txt: same layout, with columns order_no and picked_date (order pickup time). The sample file has three rows.

Upload both files to HDFS:

hadoop fs -put order_created.txt /data/order_created.txt
hadoop fs -put order_picked.txt /data/order_picked.txt
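
If you want to confirm the upload before running the job, the standard HDFS commands work (an optional check, not part of the original steps):

hadoop fs -ls /data
hadoop fs -cat /data/order_created.txt
hadoop fs -cat /data/order_picked.txt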

Join the two files with Spark SQL:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._   // brings in the implicit conversion that lets case-class RDDs be registered as tables

case class OrderCreated(order_no: String, create_date: String)
case class OrderPicked(order_no: String, picked_date: String)

val order_created = sc.textFile("/data/order_created.txt").map(_.split("\t")).map(d => OrderCreated(d(0), d(1)))
val order_picked = sc.textFile("/data/order_picked.txt").map(_.split("\t")).map(d => OrderPicked(d(0), d(1)))
order_created.registerTempTable("t_order_created")
order_picked.registerTempTable("t_order_picked")

// manually set the number of Spark SQL shuffle tasks (example value; adjust as needed)
hiveContext.setConf("spark.sql.shuffle.partitions", "10")
hiveContext.sql("select a.order_no, a.create_date, b.picked_date from t_order_created a join t_order_picked b on a.order_no = b.order_no").collect.foreach(println)

The result:

The query returns three matched rows, one for each order that appears in both files; each row has the form [order_no, create_date, picked_date].
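
On Spark 2.x and later, HiveContext and registerTempTable are deprecated. A rough spark-shell equivalent using SparkSession is sketched below (the session name spark is what the shell provides, and the shuffle-partition value is just an example; neither is from the original post):

import spark.implicits._

case class OrderCreated(order_no: String, create_date: String)
case class OrderPicked(order_no: String, picked_date: String)

// build DataFrames from the same tab-separated files
val orderCreated = spark.sparkContext.textFile("/data/order_created.txt")
  .map(_.split("\t")).map(d => OrderCreated(d(0), d(1))).toDF()
val orderPicked = spark.sparkContext.textFile("/data/order_picked.txt")
  .map(_.split("\t")).map(d => OrderPicked(d(0), d(1))).toDF()

// temp views replace the old registerTempTable call
orderCreated.createOrReplaceTempView("t_order_created")
orderPicked.createOrReplaceTempView("t_order_picked")

spark.conf.set("spark.sql.shuffle.partitions", "10")  // example value

spark.sql("select a.order_no, a.create_date, b.picked_date from t_order_created a join t_order_picked b on a.order_no = b.order_no").show(false)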