This post mainly records how to write files with dynamic partitions in streaming mode. In SQL mode you can simply write them out, but with the streaming (DataStream) API you have to implement the partitioning yourself. The rough data flow is: consume from Kafka, unpack the JSON, and write the data to an HDFS path as SequenceFiles.
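For context, here is a minimal sketch of the upstream part of the job (Kafka source -> parse -> Tuple2), assuming the universal FlinkKafkaConsumer; the broker address, group id, topic name and the JSON flattening step are placeholders rather than the original job's code.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// StreamingFileSink only moves part files from pending to finished on checkpoints
env.enableCheckpointing(60_000L);

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
props.setProperty("group.id", "dwd-writer");              // placeholder group id

DataStream<Tuple2<LongWritable, Text>> stream = env
        .addSource(new FlinkKafkaConsumer<>("source_topic", new SimpleStringSchema(), props))
        .map(new MapFunction<String, Tuple2<LongWritable, Text>>() {
            @Override
            public Tuple2<LongWritable, Text> map(String json) {
                // parse the JSON and flatten it into a caret-delimited column string here;
                // the sketch just assumes the message already arrives in that shape
                return Tuple2.of(new LongWritable(System.currentTimeMillis()), new Text(json));
            }
        });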
1. The partitioning has to be customized: here we read each streaming record and pull the partition fields out of it.
package partitionassigner;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import java.text.SimpleDateFormat;
import java.util.Date;

public class PartitionAssigner implements BucketAssigner<Tuple2<LongWritable, Text>, String> {

    @Override
    public String getBucketId(Tuple2<LongWritable, Text> textTuple2, Context context) {
        // the value is a caret-delimited column string; split out the partition fields
        String allCol = textTuple2.getField(1).toString();
        String[] strings = allCol.split("\\^");
        // extract the millisecond timestamp field
        long createTime = Long.parseLong(strings[10]);
        // extract the rank partition field
        int rankPt = Integer.parseInt(strings[11]);
        // timestamp -> yyyyMMdd
        Date date = new Date(createTime);
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyyMMdd");
        return "day=" + simpleDateFormat.format(date) + "/rank_pt=" + rankPt;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}
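A quick way to sanity-check the assigner locally; the row and its values below are made up for illustration, with field index 10 holding the millisecond timestamp and index 11 holding rank_pt.
String row = "c0^c1^c2^c3^c4^c5^c6^c7^c8^c9^1696118400000^3";
String bucketId = new PartitionAssigner()
        .getBucketId(Tuple2.of(new LongWritable(0L), new Text(row)), null);
// prints day=20231001/rank_pt=3 (SimpleDateFormat uses the JVM timezone, here assumed UTC+8)
System.out.println(bucketId);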
2. The file sink plugs in the bucket-assigner when building the output.
StreamingFileSink<Tuple2<LongWritable, Text>> sink = StreamingFileSink
        .forBulkFormat(
                path,
                new SequenceFileWriterFactory<>(
                        hadoopConf, LongWritable.class, Text.class,
                        "org.apache.hadoop.io.compress.GzipCodec",
                        SequenceFile.CompressionType.BLOCK))
        .withBucketAssigner(new PartitionAssigner())
        .build();
3. I pointed the path at the local filesystem so the partition folders are easy to inspect; for production just change the path to the real HDFS location.
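For completeness, a sketch of how the pieces fit together; the output directory, the Hadoop configuration and the job name are placeholders, stream is the DataStream from the first sketch, and path/hadoopConf are declared before the sink builder in step 2.
// output root: a local directory while testing, an hdfs:// path in production
Path path = new Path("file:///tmp/dwd_out");
// plain Hadoop configuration; on a cluster it is usually loaded from the environment
org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();

stream.addSink(sink); // attach the StreamingFileSink built in step 2
env.execute("kafka-to-dwd-sequencefile");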
Summary: the idea here is to finish the processing on the stream and write the results straight into the DWD layer of the offline warehouse, skipping the offline processing logic. True stream-batch unification still takes quite a bit of work, and the maintenance cost looks high for now.