1. 说明
本文基于:spark-2.4.0-hadoop2.7-高可用(HA)安装部署
2. 启动Spark Shell
在任意一台有spark的机器上执行
# --master spark://mini02:7077 连接spark的master,这个master的状态为alive,而不是standby
# --total-executor-cores 总共占用2核CPU
# --executor-memory 512m 每个woker占用512m内存
[yun@mini03 ~]$ spark-shell --master spark://mini02:7077 --total-executor-cores 2 --executor-memory 512m
-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://mini03:4040
Spark context available as 'sc' (master = spark://mini02:7077, app id = app-20181125120746-0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.
/_/ Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information. scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77e1b84c
注意:
如果启动spark shell时没有指定master地址,但是也可以正常启动spark shell和执行spark shell中的程序,其实是启动了spark的local模式,该模式仅在本机启动一个进程,没有与集群建立联系。
2.1. 相关截图
3. 执行第一个spark程序
该算法是利用蒙特•卡罗算法求PI
[yun@mini03 ~]$ spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://mini02:7077 \
--total-executor-cores \
--executor-memory 512m \
/app/spark/examples/jars/spark-examples_2.-2.4..jar
# 打印的信息如下:
-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-- :: INFO SparkContext: - Running Spark version 2.4.
………………
-- :: INFO TaskSetManager: - Finished task 97.0 in stage 0.0 (TID ) in ms on 172.16.1.14 (executor ) (/)
-- :: INFO TaskSetManager: - Finished task 98.0 in stage 0.0 (TID ) in ms on 172.16.1.13 (executor ) (/)
-- :: INFO TaskSetManager: - Finished task 99.0 in stage 0.0 (TID ) in ms on 172.16.1.14 (executor ) (/)
-- :: INFO TaskSchedulerImpl: - Removed TaskSet 0.0, whose tasks have all completed, from pool
-- :: INFO DAGScheduler: - ResultStage (reduce at SparkPi.scala:) finished in 3.881 s
-- :: INFO DAGScheduler: - Job finished: reduce at SparkPi.scala:, took 4.042591 s
Pi is roughly 3.1412699141269913
………………
4. Spark shell求Word count 【结合Hadoop】
1、启动Hadoop
2、将文件放到Hadoop中
[yun@mini05 sparkwordcount]$ cat wc.info
zhang linux
linux tom
zhan kitty
tom linux
[yun@mini05 sparkwordcount]$ hdfs dfs -ls /
Found items
drwxr-xr-x - yun supergroup -- : /hbase
drwx------ - yun supergroup -- : /tmp
drwxr-xr-x - yun supergroup -- : /wordcount
-rw-r--r-- yun supergroup -- : /zookeeper-3.4..tar.gz
[yun@mini05 sparkwordcount]$ hdfs dfs -mkdir -p /sparkwordcount/input
[yun@mini05 sparkwordcount]$ hdfs dfs -put wc.info /sparkwordcount/input/.info
[yun@mini05 sparkwordcount]$ hdfs dfs -put wc.info /sparkwordcount/input/.info
[yun@mini05 sparkwordcount]$ hdfs dfs -put wc.info /sparkwordcount/input/.info
[yun@mini05 sparkwordcount]$ hdfs dfs -put wc.info /sparkwordcount/input/.info
[yun@mini05 sparkwordcount]$ hdfs dfs -ls /sparkwordcount/input
Found items
-rw-r--r-- yun supergroup -- : /sparkwordcount/input/.info
-rw-r--r-- yun supergroup -- : /sparkwordcount/input/.info
-rw-r--r-- yun supergroup -- : /sparkwordcount/input/.info
-rw-r--r-- yun supergroup -- : /sparkwordcount/input/.info
3、进入spark shell命令行,并计算
[yun@mini03 ~]$ spark-shell --master spark://mini02:7077 --total-executor-cores 2 --executor-memory 512m
# 计算完毕后,打印在命令行
scala> sc.textFile("hdfs://mini01:9000/sparkwordcount/input").flatMap(_.split(" ")).map((_, )).reduceByKey(_+_).sortBy(_._2, false).collect
res6: Array[(String, Int)] = Array((linux,), (tom,), (kitty,), (zhan,), ("",), (zhang,))
# 计算完毕后,保存在HDFS【因为有多个文件组成,则有多个reduce,所以输出有多个文件】
scala> sc.textFile("hdfs://mini01:9000/sparkwordcount/input").flatMap(_.split(" ")).map((_, )).reduceByKey(_+_).sortBy(_._2, false).saveAsTextFile("hdfs://mini01:9000/sparkwordcount/output")
# 计算完毕后,保存在HDFS【将reduce设置为1,输出就只有一个文件】
scala> sc.textFile("hdfs://mini01:9000/sparkwordcount/input").flatMap(_.split(" ")).map((_, )).reduceByKey(_+_, ).sortBy(_._2, false).saveAsTextFile("hdfs://mini01:9000/sparkwordcount/output1")
4、在HDFS的查看结算结果
[yun@mini05 sparkwordcount]$ hdfs dfs -ls /sparkwordcount/
Found items
drwxr-xr-x - yun supergroup -- : /sparkwordcount/input
drwxr-xr-x - yun supergroup -- : /sparkwordcount/output
drwxr-xr-x - yun supergroup -- : /sparkwordcount/output1
[yun@mini05 sparkwordcount]$ hdfs dfs -ls /sparkwordcount/output
Found items
-rw-r--r-- yun supergroup -- : /sparkwordcount/output/_SUCCESS
-rw-r--r-- yun supergroup -- : /sparkwordcount/output/part-
-rw-r--r-- yun supergroup -- : /sparkwordcount/output/part-
-rw-r--r-- yun supergroup -- : /sparkwordcount/output/part-
-rw-r--r-- yun supergroup -- : /sparkwordcount/output/part-
[yun@mini05 sparkwordcount]$
[yun@mini05 sparkwordcount]$ hdfs dfs -cat /sparkwordcount/output/part*
(linux,)
(tom,)
(,)
(zhang,)
(kitty,)
(zhan,)
###############################################
[yun@mini05 sparkwordcount]$ hdfs dfs -ls /sparkwordcount/output1
Found items
-rw-r--r-- yun supergroup -- : /sparkwordcount/output1/_SUCCESS
-rw-r--r-- yun supergroup -- : /sparkwordcount/output1/part-
[yun@mini05 sparkwordcount]$ hdfs dfs -cat /sparkwordcount/output1/part-
(linux,)
(tom,)
(,)
(zhang,)
(kitty,)
(zhan,)