Spark: group by key and compute the max, min, and average for each key, using groupByKey or reduceByKey

groupByKey gives you back an object that lets you iterate over the results for each key. You can turn the results of groupByKey into a list by calling list() on the values, e.g.

example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])

example.groupByKey().collect()
# Gives [(0, <pyspark.resultiterable.ResultIterable object ......]

example.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]

# OR:
example.groupByKey().mapValues(list)
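
Following the same idea, the grouped values can be fed straight into max, min, and an average, which is the groupByKey version of what the title asks for. A minimal sketch, assuming a hypothetical RDD of (key, numeric value) pairs and reusing the same sc as above:

# Hypothetical numeric data, just for illustration.
nums = sc.parallelize([(0, 2.0), (0, 4.0), (1, 1.0), (2, 3.0), (2, 5.0)])

def key_stats(vs):
    values = list(vs)                    # materialize the grouped values
    return (max(values), min(values), sum(values) / len(values))

nums.groupByKey().mapValues(key_stats).collect()
# Gives [(0, (4.0, 2.0, 3.0)), (1, (1.0, 1.0, 1.0)), (2, (5.0, 3.0, 4.0))] (order may vary)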
 
Hey Ron,

It was pretty much exactly as Sean had depicted. I just needed to provide
count with an anonymous function telling it which elements to count. Since I
wanted to count them all, the function simply always returns true.

        val grouped = rdd.groupByKey().mapValues { mcs =>
          val values = mcs.map(_.foo.toDouble)
          val n = values.count(x => true)
          val sum = values.sum
          val sumSquares = values.map(x => x * x).sum
          val stddev = math.sqrt(n * sumSquares - sum * sum) / n
          print("stddev: " + stddev)
          stddev
        }


I hope that helps
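
For reference, a rough PySpark equivalent of the Scala snippet above, assuming a hypothetical rdd of plain (key, numeric value) pairs rather than records with a .foo field; it uses the same population formula sqrt(n * sum(x^2) - (sum(x))^2) / n:

import math

def stddev(vs):
    values = [float(v) for v in vs]
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return math.sqrt(n * ss - s * s) / n   # population standard deviation

rdd.groupByKey().mapValues(stddev)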

Just don't (use groupByKey for this). Use reduceByKey:

lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
    .mapValues(lambda v: (v, v)) \
    .reduceByKey(lambda x, y: (
        min(x[0], y[0], key=lambda p: p[1]),
        max(x[1], y[1], key=lambda p: p[1])))
# Each key ends up with ((id, min value), (id, max value)).
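
The same single-pass idea extends to the average promised in the title: carry a (min, max, sum, count) tuple through reduceByKey and finish with one mapValues. A minimal sketch on a hypothetical (key, numeric value) RDD:

pairs = sc.parallelize([("a", 2.0), ("a", 4.0), ("b", 3.0)])   # hypothetical data

stats = pairs \
    .mapValues(lambda v: (v, v, v, 1)) \
    .reduceByKey(lambda x, y: (min(x[0], y[0]),
                               max(x[1], y[1]),
                               x[2] + y[2],
                               x[3] + y[3])) \
    .mapValues(lambda t: (t[0], t[1], t[2] / t[3]))   # (min, max, average)

stats.collect()
# Gives [('a', (2.0, 4.0, 3.0)), ('b', (3.0, 3.0, 3.0))] (order may vary)

Unlike groupByKey, this never materializes the full list of values for a key, so it also combines partial results on the map side.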

This article is reposted from the 张昺华-sky blog on cnblogs; original link: http://www.cnblogs.com/bonelee/p/7156188.html. To republish, please contact the original author.