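A dimension table join means that records entering Flink must be joined with data held in an external store before the result can be computed. The table kept in that external store is called the dimension table; it may live in MySQL, HBase, or a similar system. Dimension tables typically change slowly.
Requirement: the input records from Kafka have the format name,city_id, for example zhangsan,1001.
Expected output: name,city_id,city_name, for example zhangsan,1001,北京.
Create the city table in MySQL to serve as the dimension table: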
create table city(
city_id varchar(50) primary key,
city_name varchar(50)
);
insert into city values('1001','北京'),('1002','上海'),('1003','郑州') ;
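The requirement boils down to a lookup keyed by city id. As a rough illustration, here is a minimal sketch in plain Java with no Flink or MySQL involved; the three rows of the city table are copied into an in-memory map, and the class name EnrichSketch and the method enrich are made up for this example. Returning an empty city name for an unknown id is an assumption of the sketch, not something the requirement specifies.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: EnrichSketch and enrich() are hypothetical names.
class EnrichSketch {
    // In-memory copy of the rows inserted into the city table.
    static final Map<String, String> CITY = new HashMap<>();
    static {
        CITY.put("1001", "北京");
        CITY.put("1002", "上海");
        CITY.put("1003", "郑州");
    }

    // Turns "name,city_id" into "name,city_id,city_name".
    // Assumption: an unknown city id yields an empty city name.
    static String enrich(String line) {
        String[] parts = line.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected name,city_id but got: " + line);
        }
        String name = parts[0].trim();
        String cityId = parts[1].trim();
        String cityName = CITY.getOrDefault(cityId, "");
        return name + "," + cityId + "," + cityName;
    }

    public static void main(String[] args) {
        System.out.println(enrich("zhangsan,1001")); // zhangsan,1001,北京
        System.out.println(enrich("lisi,9999"));     // lisi,9999,
    }
}
```

The approaches in the rest of this section differ only in where that map lives and how it is kept up to date.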
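1. Preloading the dimension table
Define a class that extends RichMapFunction, load the dimension table into memory in open(), and join each record of the stream against the in-memory data in map().
Loading the dimension table into memory in open() has the following characteristics:
- Pros: simple to implement.
- Cons: because the data is held in memory, this only suits small dimension tables that are updated infrequently. A timer can be started in open() to reload the table periodically, but updates are still not picked up immediately. Note that a dimension table changes slowly; it is not immutable.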
package com.bigdata.day06;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class _04PreLoadDataDemo {
    public static void main(String[] args) throws Exception {
        // 1. env: set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
        // 2. source: read lines from a socket (stands in for Kafka in this demo)
        DataStreamSource<String> dataStreamSource = env.socketTextStream("localhost", 9999);
        // 3. transformation: join each record with the preloaded dimension table
        dataStreamSource.map(new RichMapFunction<String, Tuple3<String, Integer, String>>() {
            // In-memory copy of the city table: city_id -> city_name
            Map<Integer, String> cityMap = new HashMap<>();
            Connection connection;
            PreparedStatement statement;

            @Override
            public void open(Configuration parameters) throws Exception {
                // Load the whole MySQL table into the map once, when the operator starts
                connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/test1", "root", "123456");
                statement = connection.prepareStatement("select city_id, city_name from city");
                try (ResultSet resultSet = statement.executeQuery()) {
                    while (resultSet.next()) {
                        // city_id is a varchar column that holds numeric strings
                        int cityId = resultSet.getInt("city_id");
                        String cityName = resultSet.getString("city_name");
                        cityMap.put(cityId, cityName);
                    }
                }
            }

            @Override
            public void close() throws Exception {
                if (statement != null) {
                    statement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            }

            // Input looks like: zhangsan,1001
            @Override
            public Tuple3<String, Integer, String> map(String s) throws Exception {
                String[] arr = s.split(",");
                Integer cityId = Integer.valueOf(arr[1]);
                // Debug output: shows the preloaded map for every record
                System.out.println("+++++++++++++++" + cityMap);
                // cityName is null when the id is not in the preloaded table
                String cityName = cityMap.get(cityId);
                return Tuple3.of(arr[0], cityId, cityName);
            }
        }).print();
        // 4. sink: the joined records are written by print() above
        // 5. execute: run the job
        env.execute();
    }
}
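Test:
Type the following lines into the terminal that is listening on port 9999 (for example one started with nc -lk 9999):
zhangsan,1001
lisi,1001
wangwu,1002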
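The preload approach mentions reloading the table on a timer. One possible shape for that idea is sketched below in plain Java, without Flink. The class RefreshingDimTable, its loader argument, and the refresh period are all hypothetical; in a real job the object would be created in open(), closed in close(), and the loader would run the JDBC query. The sketch swaps in a complete new snapshot on each reload, so readers never see a half-filled map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of the original demo.
class RefreshingDimTable implements AutoCloseable {
    private final Supplier<Map<Integer, String>> loader;
    private final ScheduledExecutorService scheduler;
    // volatile: readers always see a complete snapshot after a swap
    private volatile Map<Integer, String> snapshot;

    RefreshingDimTable(Supplier<Map<Integer, String>> loader, long period, TimeUnit unit) {
        this.loader = loader;
        this.snapshot = loader.get(); // initial load, as open() does in the demo
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dim-table-refresh");
            t.setDaemon(true); // the refresh thread must not keep the JVM alive
            return t;
        });
        this.scheduler.scheduleAtFixedRate(this::refresh, period, period, unit);
    }

    // Reloads the table; on failure the previous snapshot stays in use.
    void refresh() {
        try {
            snapshot = loader.get();
        } catch (RuntimeException e) {
            // An uncaught exception would cancel all future scheduled runs.
            System.err.println("dimension table reload failed: " + e.getMessage());
        }
    }

    String get(int cityId) {
        return snapshot.get(cityId);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        // Stands in for the MySQL city table.
        Map<Integer, String> table = new ConcurrentHashMap<>();
        table.put(1001, "北京");
        try (RefreshingDimTable dim =
                     new RefreshingDimTable(() -> new HashMap<>(table), 1, TimeUnit.HOURS)) {
            table.put(1002, "上海");           // row added after the initial load
            System.out.println(dim.get(1002)); // null: the snapshot is stale
            dim.refresh();                     // what the timer would do
            System.out.println(dim.get(1002)); // 上海
        }
    }
}
```

The stale read in main is exactly the "updates are not picked up immediately" drawback: a shorter period narrows the window but also reloads the whole table more often.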
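What if the dimension table grows too large for memory, or its rows are updated? One option is to query the database for every record, as in the following example: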
package com.bigdata.day06;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class _05SelectDBDemo {
    public static void main(String[] args) throws Exception {
        // 1. env: set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
        // 2. source: read lines from a socket (stands in for Kafka in this demo)
        DataStreamSource<String> dataStreamSource = env.socketTextStream("localhost", 9999);
        // 3. transformation: look up the city name in MySQL for each record
        dataStreamSource.map(new RichMapFunction<String, Tuple3<String, Integer, String>>() {
            Connection connection;
            PreparedStatement statement;

            @Override
            public void open(Configuration parameters) throws Exception {
                // Open the connection and prepare the lookup query once per subtask
                connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/test1", "root", "123456");
                statement = connection.prepareStatement("select city_name from city where city_id = ?");
            }

            @Override
            public void close() throws Exception {
                if (statement != null) {
                    statement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            }

            // Input looks like: zhangsan,1001
            @Override
            public Tuple3<String, Integer, String> map(String s) throws Exception {
                String[] arr = s.split(",");
                Integer cityId = Integer.valueOf(arr[1]);
                statement.setInt(1, cityId);
                // cityName stays null when the id is not in the table
                String cityName = null;
                try (ResultSet resultSet = statement.executeQuery()) {
                    if (resultSet.next()) {
                        cityName = resultSet.getString("city_name");
                    }
                }
                return Tuple3.of(arr[0], cityId, cityName);
            }
        }).print();
        // 4. sink: the joined records are written by print() above
        // 5. execute: run the job
        env.execute();
    }
}
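This approach addresses both earlier problems: rows that are updated, and tables too large to hold in memory.
The drawback is that every record triggers a database query, which adds a network round trip per record and puts load on the database.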
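2. Dimension table in hot storage
The approach above keeps the dimension data in external storage such as Redis, HBase, or MySQL, and the stream queries that storage in real time whenever it needs to join. Its characteristics are:
- Pros: the size of the dimension data is not limited by memory, so very large tables can be used.
- Cons: because the dimension data lives in external storage, lookup speed is bounded by the read speed of that storage; synchronizing the dimension table into the store also introduces latency.
(1) Using a cache to reduce lookup pressure
A cache can hold the frequently accessed part of the dimension table to reduce the number of calls to the external system, for example Guava Cache.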
package com.bigdata.day06;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
// The shaded Guava package name depends on the Flink version (guava18 here); adjust it to match your distribution
import org.apache.flink.shaded.guava18.com.google.common.cache.*;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.TimeUnit;

public class _06GuavaCacheDemo {
    public static void main(String[] args) throws Exception {
        // 1. env: set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
        // Parallelism 1 makes the cache behavior easier to observe
        env.setParallelism(1);
        // 2. source: read lines from a socket (stands in for Kafka in this demo)
        DataStreamSource<String> dataStreamSource = env.socketTextStream("localhost", 9999);
        // 3. transformation: look up the city name through a cache in front of MySQL
        dataStreamSource.map(new RichMapFunction<String, Tuple3<String, Integer, String>>() {
            Connection connection;
            PreparedStatement statement;
            // Cache of city_id -> city_name
            LoadingCache<Integer, String> cache;

            @Override
            public void open(Configuration parameters) throws Exception {
                // Open the connection and prepare the lookup query used by the cache loader
                connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/test1", "root", "123456");
                statement = connection.prepareStatement("select city_name from city where city_id = ?");

                cache = CacheBuilder.newBuilder()
                        // Maximum number of entries; beyond that, entries are evicted in roughly LRU order
                        .maximumSize(1000)
                        // Expire an entry a fixed time after it was written.
                        // Expiry is lazy: no background thread removes the entry; it is cleaned up
                        // during later cache operations, for example when the expired key is read again.
                        .expireAfterWrite(10, TimeUnit.SECONDS)
                        // Called whenever an entry is removed
                        .removalListener(new RemovalListener<Integer, String>() {
                            @Override
                            public void onRemoval(RemovalNotification<Integer, String> removalNotification) {
                                System.out.println(removalNotification.getKey() + " was removed, value: " + removalNotification.getValue());
                            }
                        })
                        // Loader: runs on a cache miss, and its result is stored in the cache
                        .build(new CacheLoader<Integer, String>() {
                            @Override
                            public String load(Integer cityId) throws Exception {
                                System.out.println("Querying the database...");
                                statement.setInt(1, cityId);
                                // A CacheLoader must not return null (Guava throws InvalidCacheLoadException),
                                // so unknown ids are mapped to an empty string
                                String cityName = "";
                                try (ResultSet resultSet = statement.executeQuery()) {
                                    if (resultSet.next()) {
                                        System.out.println("Row found");
                                        cityName = resultSet.getString("city_name");
                                    }
                                }
                                return cityName;
                            }
                        });
            }

            @Override
            public void close() throws Exception {
                if (statement != null) {
                    statement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            }

            // Input looks like: zhangsan,1001
            @Override
            public Tuple3<String, Integer, String> map(String s) throws Exception {
                String[] arr = s.split(",");
                Integer cityId = Integer.valueOf(arr[1]);
                // Served from the cache when present, otherwise loaded from MySQL; "" for unknown ids
                String cityName = cache.get(cityId);
                return Tuple3.of(arr[0], cityId, cityName);
            }
        }).print();
        // 4. sink: the joined records are written by print() above
        // 5. execute: run the job
        env.execute();
    }
}
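Each parallel subtask has its own Guava cache, and the caches are not shared between subtasks. Setting the parallelism to 1 therefore makes the caching effect easier to observe.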
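To make the cache semantics concrete without a Flink cluster or Guava on the classpath, the following dependency-free sketch imitates the three behaviors used above: a size bound with least-recently-used eviction, expiry a fixed time after write, and loading on a miss. It is not Guava's implementation; TtlLruCache, Slot, and the injected clock are hypothetical names, the class is not thread-safe, and two instances stand in for two parallel subtasks.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a loading cache; NOT Guava and not thread-safe.
class TtlLruCache<K, V> {
    // A cached value plus the time it was written.
    private static final class Slot<V> {
        final V value;
        final long writtenAtMillis;
        Slot(V value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long ttlMillis;
    private final Function<K, V> loader;
    private final LongSupplier clock;
    private final LinkedHashMap<K, Slot<V>> slots;
    int loads = 0; // how many times the loader (the "database") was called

    TtlLruCache(int maxSize, long ttlMillis, Function<K, V> loader, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
        this.clock = clock;
        // accessOrder = true keeps the least recently used entry first
        this.slots = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > maxSize;
            }
        };
    }

    V get(K key) {
        long now = clock.getAsLong();
        Slot<V> slot = slots.get(key);
        // Miss or expired: reload lazily, only when the key is actually requested
        if (slot == null || now - slot.writtenAtMillis >= ttlMillis) {
            loads++;
            slot = new Slot<>(loader.apply(key), now);
            slots.put(key, slot);
        }
        return slot.value;
    }

    public static void main(String[] args) {
        Map<Integer, String> db = new HashMap<>(); // stands in for the city table
        db.put(1001, "北京");
        db.put(1002, "上海");
        long[] now = {0}; // fake clock so the demo is deterministic

        TtlLruCache<Integer, String> subtaskA = new TtlLruCache<>(1000, 10_000, db::get, () -> now[0]);
        TtlLruCache<Integer, String> subtaskB = new TtlLruCache<>(1000, 10_000, db::get, () -> now[0]);

        subtaskA.get(1001);
        subtaskA.get(1001);                 // second call is a cache hit
        subtaskB.get(1001);                 // separate cache: must load again
        System.out.println(subtaskA.loads); // 1
        System.out.println(subtaskB.loads); // 1

        now[0] = 10_000;                    // ten seconds later the entry has expired
        subtaskA.get(1001);
        System.out.println(subtaskA.loads); // 2
    }
}
```

With parallelism n, the same key can therefore be loaded up to n times, once per subtask, which is why the demo above fixes the parallelism at 1.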