//求平均 方法一: groupByKey
textFile.mapToPair(line -> new Tuple2<>(line.split(" ")[0], Integer.parseInt(line.split(" ")[1])))
.groupByKey()
.mapToPair(info -> {
double sum = 0;
double count = 0;
Iterator<Integer> it = info._2().iterator();
while (it.hasNext()) {
sum += it.next();
count++;
}
double ave = sum / count;
return new Tuple2<>(info._1(), ave);
})
.collect()
.forEach(System.out::println);
//求平均 方法二: combineByKey
textFile.mapToPair(line -> new Tuple2<>(line.split(" ")[0], Integer.parseInt(line.split(" ")[1])))
.combineByKey(score -> new Tuple2<>(score, 1), // 将score映射为一个元组,作为分区内聚合初始值
(t, score) -> new Tuple2<>(t._1() + score, t._2() + 1), //分区内聚合,
(a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2())) //分区间聚合
.mapToPair(info -> new Tuple2<>(info._1(), info._2()._1()/info._2()._2()))
.collect()
.forEach(System.out::println);
data
.map(a => (a._1, (a._2, 1)))
.reduceByKey((a,b) => (a._1+b._1,a._2+b._2))
.map(t => (t._1,t._2._1/t._2._2))
总结:
方法一可读性更好,但是可能存在线程不安全
方法二写法简便
方法三可读性和性能都不错,推荐
PS:也可以采用spark sql进行聚合操作