Spark package: spark-0.8.1-incubating-bin-hadoop2.tgz
Operating system: CentOS 6.4
JDK version: jdk1.7.0_21
1. Cluster Mode
1.1 Install Hadoop
Use VMware Workstation to create three CentOS virtual machines, with hostnames set to master, slaver01, and slaver02. Set up passwordless SSH login (a quick sketch follows), install Hadoop, and start the Hadoop cluster. For details, see my earlier post on the hadoop-2.2.0 distributed installation.
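For reference, passwordless SSH from master to each slave can be set up roughly like this (a minimal sketch; the hostnames and root account follow the setup above):
$ ssh-keygen -t rsa
# press Enter at every prompt to accept the defaults
$ ssh-copy-id root@slaver01
$ ssh-copy-id root@slaver02
$ ssh slaver01
# should log you in without asking for a password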
1.2 Install Scala
Install Scala 2.10.3 on all three machines, following the steps in my Spark installation post. The JDK was already installed as part of the Hadoop setup. Then, on the master node:
$ cd
$ scp -r scala-2.10.3 root@slaver01:~
$ scp -r scala-2.10.3 root@slaver02:~
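To confirm the copies worked, you can check the Scala version on each node (assuming Scala was unpacked to ~/scala-2.10.3 as above):
$ ssh slaver01 '~/scala-2.10.3/bin/scala -version'
$ ssh slaver02 '~/scala-2.10.3/bin/scala -version'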
1.3 Install and Configure Spark on master
Extract the package:
$ tar -zxf spark-0.8.1-incubating-bin-hadoop2.tgz
$ mv spark-0.8.1-incubating-bin-hadoop2 spark-0.8.1
Set SCALA_HOME in conf/spark-env.sh:
$ cd ~/spark-0.8.1/conf
$ mv spark-env.sh.template spark-env.sh
$ vi spark-env.sh
# add the following lines
export SCALA_HOME=/root/scala-2.10.3
export JAVA_HOME=/usr/java/jdk1.7.0_21
# save and exit
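spark-env.sh also accepts optional tuning variables; the values below are purely illustrative and not required for this walkthrough:
# optional tuning, illustrative values
export SPARK_MASTER_IP=master
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g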
In conf/slaves, add the hostnames of the Spark workers, one per line:
$ vim slaves
slaver01
slaver02
# save and exit
(Optional) Set the SPARK_HOME environment variable and add SPARK_HOME/bin to PATH:
$ vim /etc/profile
# add the following lines at the end
export SPARK_HOME=$HOME/spark-0.8.1
export PATH=$PATH:$SPARK_HOME/bin
# save and exit vim
# make the bash profile take effect immediately
$ source /etc/profile
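A quick check that the variables took effect:
$ echo $SPARK_HOME
# should print /root/spark-0.8.1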
1.4 Install and Configure Spark on All Workers
Since this directory on master is already fully configured, simply copy it to every worker. Note that Spark must live in the same directory on all three machines, because master logs in to the workers and runs commands there assuming their Spark path is identical to its own.
$ cd
$ scp -r spark-0.8.1 root@slaver01:~
$ scp -r spark-0.8.1 root@slaver02:~
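A quick sanity check that the path is identical everywhere:
$ ssh slaver01 'ls ~/spark-0.8.1/bin'
$ ssh slaver02 'ls ~/spark-0.8.1/bin'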
1.5 Start the Spark Cluster
On master, run:
$ cd ~/spark-0.8.1
$ bin/start-all.sh
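bin/start-all.sh starts a Master process locally and then launches a Worker on every host listed in conf/slaves; if you prefer, the two halves can also be run separately:
$ bin/start-master.sh
$ bin/start-slaves.sh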
Check that the processes have started:
[root@master ~]# jps
9664 Jps
7993 Master
9276 SecondaryNameNode
9108 NameNode
8105 Worker
9416 ResourceManager
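You can also verify from master that each slave is running a Worker (assuming jps is on the remote PATH):
$ ssh slaver01 jps
# the list should include a Worker process
$ ssh slaver02 jps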
Browse the master's web UI (default http://master:8080). At this point you should see all the worker nodes, along with their CPU counts, memory, and other details.
1.6 Run Spark's Built-in Examples
Run SparkPi:
[root@master ~]# cd ~/spark-0.8.1
[root@master spark-0.8.1]# ./run-example org.apache.spark.examples.SparkPi spark://master:7077
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jEventHandler).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Pi is roughly 3.14236
[root@master spark-0.8.1]#
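SparkPi estimates Pi by Monte Carlo sampling, which is why the result is only approximate; an optional trailing argument sets how many tasks (slices) the sampling is split into, for example:
$ ./run-example org.apache.spark.examples.SparkPi spark://master:7077 10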
Run SparkLR (logistic regression):
[root@master spark-0.8.1]# ./run-example org.apache.spark.examples.SparkLR spark://master:7077
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jEventHandler).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Initial w: (-0.8066603352924779, -0.5488747509304204, -0.7351625370864459, 0.8228539509375878, -0.6662446067860872, -0.33245457898921527, 0.9664202269036932, -0.20407887461434115, 0.4120993933386614, -0.8125908063470539)
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Final w: (5816.075967498865, 5222.008066011391, 5754.751978607454, 3853.1772062206846, 5593.565827145932, 5282.387874201054, 3662.9216051953435, 4890.78210340607, 4223.371512250292, 5767.368579668863)
[root@master spark-0.8.1]#
Run SparkKMeans (the trailing arguments are the number of clusters, 2, and the convergence threshold, 1):
[root@master spark-0.8.1]# ./run-example org.apache.spark.examples.SparkKMeans spark://master:7077 ./kmeans_data.txt 2 1
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jEventHandler).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Final centers:
(0.1, 0.1, 0.1)
(9.2, 9.2, 9.2)
[root@master spark-0.8.1]#
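The command above assumes kmeans_data.txt exists in the Spark directory; if yours is missing, an equivalent file can be created by hand (six points forming two obvious clusters, mirroring the sample bundled with Spark):
$ cat > kmeans_data.txt <<EOF
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
EOF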
1.7 Read a File from HDFS and Run WordCount
$ cd ~/spark-0.8.1
$ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
$ hadoop fs -put pg20417.txt ./
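Confirm the upload landed before starting the shell:
$ hadoop fs -ls ./
# pg20417.txt should appear in the listing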
[root@master spark-0.8.1]# MASTER=spark://master:7077 ./spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.8.1
      /_/
Using Scala version 2.9.3 (Java HotSpot(TM) Client VM, Java 1.7.0_21)
Initializing interpreter...
log4j:WARN No appenders could be found for logger (org.eclipse.jetty.util.log).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Creating SparkContext...
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.
scala> val file = sc.textFile("hdfs://master:9000/user/root/pg20417.txt")
file: org.apache.spark.rdd.RDD[String] = MappedRDD[9] at textFile at <console>:12
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
count: org.apache.spark.rdd.RDD[(java.lang.String, Int)] = MapPartitionsRDD[14] at reduceByKey at <console>:14

scala> count.collect()
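collect() pulls the entire result back to the driver, which is fine for one small book but risky for large inputs. Continuing the same session, you can instead print a few pairs or save the result to HDFS (the output path here is just an illustration):
scala> count.take(5).foreach(println)
scala> count.saveAsTextFile("hdfs://master:9000/user/root/wordcount_out")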
1.8 Stop the Spark Cluster
$ cd ~/spark-0.8.1
$ bin/stop-all.sh