1、Flink状态去重场景
在Flink运行的时候,往往是无休止的运行,在整个Flink程序运行的长河中,往往会出现很多状态的出现,那么状态的生命周期,也就是创建、使用和销毁,那么在我们写flink程序过程中,往往不需要关注flink 状态的清理,flink内部就会对我们的状态进行清理,例如我们开一个10分钟的窗口,那么在这十分钟的窗口中,这个状态也就是会发生创建、使用和销毁,那么我这里问大家一个问题?就是窗口结束后,状态会销毁吗。这里有一个场景,也就是说当我们开一个一天的窗口,计算当天的消费人数,那么这个时候一个用户可能会有多次消费,那么肯定要去重,肯定很多人想到了Redis,Flink State。这个时候有个疑问诞生了,我们用Redis是可以去重的,并且会持久化,还可以根据key来进行判断当天日期,这是很多人的一个操作,那么flink state也是可以的,我们可以在不管是process方法还是其他的一些方法中开辟一个状态,例如MapState。类似于下面的这段代码
private transient MapState<Student, Object> mapState; private transient final Object OBJECT_VALUE = new Object();
这个时候我们可以在窗口上开辟state,将userid放入map中来进行去重,当然使用KeyedState的时候一定要key by。当然还有non-keyed state 这里我们就不细细讨论了。那么我们现在有个想法,那就是我不开窗口我怎么实现去重的操作,因为开辟一天的窗口,在数据量很大的情况下,并且重复数据很多,真实数据并不多的场景下,开辟一天的窗口,可能压力会大一些。我的想法是,在一个keyby算子之后process方法中对数据进行去重,并不开窗口,那么随着时间的流动,状态会很大,我们要的是一分钟的数据去重,下一分钟再去重下一个分钟内的,这个操作我们怎么解决这个问题?
2、StateTtlConfig的引入
可以将生存时间(TTL)分配给任何类型的keyed State。如果配置了 TTL 并且状态值已过期,则将尽最大努力清理存储的值。
为了使用statettl,必须首先构建StateTtlConfig配置对象。然后,通过以下配置,可以在任何StateDescript 中启用TTL功能:
import org.apache.flink.api.common.state.StateTtlConfig; import org.apache.flink.api.common.state.ValueStateDescriptor; import org.apache.flink.api.common.time.Time; StateTtlConfig ttlConfig = StateTtlConfig .newBuilder(Time.seconds(1)) .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) .build(); ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class); stateDescriptor.enableTimeToLive(ttlConfig);
-
Time.seconds(1) 周期过期时间
-
setUpdateType 更新类型
-
setStateVisibility 是否在访问state的时候返回过期值
setUpdateType:
-
StateTtlConfig.UpdateType.OnCreateAndWrite
- 只在创建和写的时候清除 (默认) -
StateTtlConfig.UpdateType.OnReadAndWrite
- 在读和写的时候清除
setStateVisibility:
-
StateTtlConfig.StateVisibility.NeverReturnExpired
- 从不返回过期值 -
StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp
- 如果仍然可用,则返回
在 NeverReturnExpired 的情况下,过期状态的就好像它不再存在一样,即使它未被删除。这个选项对于 TTL 之后之前的数据数据不可用。
另一个选项 ReturnExpiredIfNotCleanedUp 允许在清理之前返回数据,也就是说他ttl过期了,数据还没有被删除的时候,仍然可以访问。
那么接下来我们就看看怎么用吧:
3、实践场景
我现在有这样的一个需求,我想不开窗口在一分钟,一小时,一天的时候清理数据,也就是清理去重器。
因为是企业场景,将代码进行转换了其他业务类型不进行展示。
Hash 去重
public class HashDuplicationStrategy implements DuplicationStrategy, Serializable { private transient MapState<Student, Object> studentSetState; private transient MapState<TEACHERDO, Object> teacherSetState; private transient final Object OBJECT_VALUE = new Object(); public HashDuplicationStrategy(RuntimeContext context, int cleanTime) { StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(Time.seconds(cleanTime)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) .build(); MapStateDescriptor<Student, Object> studentSetStateDes = new MapStateDescriptor<>(StateDescriptor.STUDENT_SET_NAME_PREFIX + cleanTime, Student.class, Object.class); MapStateDescriptor<Teacher, Object> teacherSetStateDes = new MapStateDescriptor<>(StateDescriptor.TEACHER_SET_NAME_PREFIX+ cleanTime, Teacher.class, Object.class); studentSetStateDes.enableTimeToLive(stateTtlConfig); teacherSetStateDes.enableTimeToLive(stateTtlConfig); studentSetState = context.getMapState(studentSetStateDes); teacherSetState = context.getMapState(teacherSetStateDes); } @Override public boolean doDuplications(Person person) throws Exception { if (person instanceof Student) { Student student = (Student) person; if (!studentSetState.contains(student)) { studentSetState.put(student, OBJECT_VALUE); return true; } } if (person instanceof TEACHERDO) { Teacher teacher = (Teacher) person; if (!teacherSetState.contains(teacher)) { teacherSetState.put(teacher, OBJECT_VALUE); return true; } } return false; } }
BoomFilter 去重
public class BloomfilterDuplicationStrategy implements DuplicationStrategy { private transient ValueState<BloomFilter> studentBmState; private transient ValueState<BloomFilter> teacherBmState; private int defaultExpect = 1000000; private double defaultFpp = 0.01; public BloomfilterDuplicationStrategy(RuntimeContext context, int cleanTime) { StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(Time.seconds(cleanTime)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) .build(); ValueStateDescriptor<BloomFilter> studentBmStateDes = new ValueStateDescriptor<BloomFilter>(StateDescriptor.STUDENT_BM_NAME_PREFIX + cleanTime, BloomFilter.class); ValueStateDescriptor<BloomFilter> teacherBmStateDes = new ValueStateDescriptor<BloomFilter>(StateDescriptor.TEACHER_BM_NAME_PREFIX + cleanTime, BloomFilter.class); studentBmStateDes.enableTimeToLive(stateTtlConfig); teacherBmStateDes.enableTimeToLive(stateTtlConfig); studentBmState = context.getState(studentBmStateDes); teacherBmState = context.getState(teacherBmStateDes); } @Override public boolean doDuplications(Person person) throws Exception { BloomFilter studentBloomFilter = studentBmState.value(); if (studentBloomFilter == null) { studentBloomFilter = BloomFilter.create(Funnels.integerFunnel(), defaultExpect, defaultFpp); studentBmState.update(studentBloomFilter); } BloomFilter teacherBloomFilter = teacherBmState.value(); if (teacherBloomFilter == null) { teacherBloomFilter = BloomFilter.create(Funnels.integerFunnel(), defaultExpect, defaultFpp); teacherBmState.update(teacherBloomFilter); } if (Person instanceof Student) { Student student = (Student) person; //可以判断一定不包含 if (!studentBloomFilter.mightContain(student.hashCode())) { studentBloomFilter.put(student.hashCode()); studentBmState.update(studentBloomFilter); return true; } } if (person instanceof Teacher) { Teacher teacher = (Teacher) person; //可以判断一定不包含 if (!teacherBloomFilter.mightContain(teacher.hashCode())) { teacherBloomFilter.put(teacher.hashCode()); teacherBmState.update(teacherBloomFilter); return true; } } return false; } }