Spark WordCount 文档词频计数

一.使用数据

Apache Spark is a fast and general-purpose cluster computing system.It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

二.实现代码

package big.data.analyse.wordcount

import org.apache.spark.sql.SparkSession

/**
* Created by zhen on 2019/3/9.
*/
object WordCount {
def main(args: Array[String]) {
val spark = SparkSession.builder().appName("WordCount")
.master("local[2]")
.getOrCreate()
// 加载数据
val textRDD = spark.sparkContext.textFile("src/big/data/analyse/wordcount/wordcount.txt")
val result = textRDD.map(row => row.replace(",", ""))//去除文字中的,防止出现歧义
.flatMap(row => row.split(" "))//把字符串转换为字符集合
.map(row => (row, ))//把每个字符串转换为map,便于计数
.reduceByKey(_+_)//计数
// 打印结果
result.foreach(println)
}
}

三.计算结果

(Spark,)
(GraphX,)
(graphs.,)
(learning,)
(general-purpose,)
(Python,)
(APIs,)
(provides,)
(that,)
(is,)
(a,)
(R,)
(high-level,)
(general,)
(processing,)
(fast,)
(including,)
(higher-level,)
(optimized,)
(Apache,)
(in,)
(SQL,)
(system.,)
(Java,)
(of,)
(data,)
(tools,)
(cluster,)
(also,)
(graph,)
(structured,)
(execution,)
(It,)
(MLlib,)
(for,)
(Scala,)
(an,)
(computing,)
(machine,)
(supports,)
(and,)
(engine,)
(set,)
(rich,)
(Streaming.,)
上一篇:火狐浏览器开始支持3D游戏和视屏通话


下一篇:Metapackage包