Taking Spark as an example: divide and conquer
Split a table with tens of billions of rows into several smaller tables by partition, join each of them with the stream independently, and merge the results at the end.
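The split-join-merge idea can be sketched without a cluster in plain Scala. This is only an illustration of the technique; the names (`bigTable`, `streamBatch`, `numChunks`) are illustrative, not from the original code.

```scala
// Divide and conquer: split a huge "table" into N chunks by key hash,
// join the stream batch with each chunk independently, then merge.
val bigTable: Map[Int, String]    = Map(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
val streamBatch: Seq[(Int, Long)] = Seq((1, 100L), (3, 300L), (4, 400L))
val numChunks = 2

// Step 1: partition the big table into smaller tables by key hash
val chunks: Seq[Map[Int, String]] =
  (0 until numChunks).map { i =>
    bigTable.filter { case (k, _) => math.abs(k.hashCode) % numChunks == i }
  }

// Step 2: join the stream batch with each chunk independently
val partials: Seq[Seq[(Int, Long, String)]] =
  chunks.map { chunk =>
    streamBatch.collect { case (k, v) if chunk.contains(k) => (k, v, chunk(k)) }
  }

// Step 3: merge the partial results
val joined = partials.flatten.sortBy(_._1)
```

On a real cluster each per-chunk join would run as its own Spark job against a much smaller table, and the final merge would be a `union`.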
Joining a stream with an offline dimension table
Core code:
// KafkaManager is the author's wrapper around the Kafka direct-stream API
val kafkaManager = new KafkaManager(kafkaParams)
val kafkaDirectStream = kafkaManager
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

// foreachRDD returns Unit, so there is no result value to capture
kafkaDirectStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // The direct stream yields (key, value) pairs; keep only the message value
    val ds = spark.createDataset(rdd.map(_._2))
    val schema = new WechatLogSchema().schema
    // Parse the JSON payload and expose it as a view for SQL joins
    ds.select(from_json(col("value"), schema).alias("l"))
      .createOrReplaceTempView("FullLogView")
  }
}
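The snippet above only registers the parsed log as `FullLogView`; the dimension-table join itself is not shown. When the offline dimension table is small, the essence of such a join is a broadcast-style lookup: the dimension data is shipped to every task and each streaming record is enriched locally. A minimal sketch of that idea in plain Scala, with all names (`dimTable`, `batch`, `userId`) being illustrative assumptions:

```scala
// A streaming log record (hypothetical schema for illustration)
case class LogRow(userId: Int, action: String)

// Small offline dimension table, e.g. userId -> city, loaded once per batch
val dimTable: Map[Int, String] = Map(1 -> "Beijing", 2 -> "Shanghai")
val batch: Seq[LogRow] = Seq(LogRow(1, "click"), LogRow(2, "view"), LogRow(9, "click"))

// Left-join semantics: keep records with no dimension match, mark them "unknown"
val enriched: Seq[(LogRow, String)] =
  batch.map(row => (row, dimTable.getOrElse(row.userId, "unknown")))
```

In Spark this step would typically be expressed as a SQL join between `FullLogView` and a view over the dimension table, relying on Spark's broadcast join when the dimension side is small.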