1. Experiment Environment
2. Download and Installation
3. Core Configuration Files
4. Starting the Services
----------------------------------------------------------
1. Experiment Environment
The following environment can be set up first, or you can go straight to the installation (a quick verification sketch follows this list):
1.1 Hadoop 2.7 cluster installed and configured
1.2 Anaconda3 installed and configured
1.3 OS: CentOS 7, working as the hadoop user (the same user that runs the Hadoop cluster)
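Before continuing, it may help to confirm the prerequisites are in place. A minimal check, assuming Hadoop and Java are already on the hadoop user's PATH (Anaconda is only needed if you plan to use PySpark):
$ whoami            # should print: hadoop
$ hadoop version    # should report Hadoop 2.7.x
$ java -version     # Spark 2.4.5 runs on Java 8
$ python --version  # Anaconda3 interpreter, for PySpark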
2. Download and Installation
2.1 Download: spark-2.4.5-bin-hadoop2.7.tgz
2.2 Change into the directory where the archive was saved and extract it:
$ sudo tar -zxvf ./spark-2.4.5-bin-hadoop2.7.tgz -C /usr/local/hdfs/
$ cd /usr/local/hdfs/
$ sudo mv ./spark-2.4.5-bin-hadoop2.7 ./spark2.4.5
$ sudo chown -R hadoop ./spark2.4.5
$ sudo ln -s /usr/local/hdfs/spark2.4.5 ~/hdfs/spark
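A quick check that the extraction, ownership change, and symlink all took effect (paths as above):
$ ls -ld /usr/local/hdfs/spark2.4.5 ~/hdfs/spark
$ ls ~/hdfs/spark/bin/spark-submit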
2.3 Configure environment variables
$ vi ~/.bash_profile
SPARK_HOME=/home/hadoop/hdfs/spark
export SPARK_HOME
PATH=$SPARK_HOME/bin:$PATH
export PATH
$ source ~/.bash_profile
In any shell, type spark and press Tab twice; if the completions below appear, the setup succeeded:
$ spark
spark spark-class sparkR spark-shell spark-sql spark-submit
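As a further check, spark-submit can print the version it was built with:
$ spark-submit --version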
3. Core Configuration Files
$ cd ~/hdfs/spark/conf
$ sudo cp ./slaves.template ./slaves
$ sudo cp ./spark-env.sh.template ./spark-env.sh
$ sudo cp ./spark-defaults.conf.template ./spark-defaults.conf
$ sudo chown -R hadoop /usr/local/hdfs/spark2.4.5
3.1 slaves
$ vi ./slaves
List every machine that should run a Spark worker (executors are launched on these hosts):
Master
Slave2
Slave3
....
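The hostnames listed in slaves must be resolvable and reachable over passwordless SSH from the node that runs start-all.sh. A quick check, using the example hostnames above:
$ for i in Master Slave2 Slave3; do ssh $i hostname; done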
3.2 spark-config.sh
$ vi $SPARK_HOME/sbin/spark-config.sh
Add the JAVA_HOME path anywhere in the file, e.g. near the top:
export JAVA_HOME=/usr/jvm/jdk1.8
3.3 spark-env.sh
$ vi ./spark-env.sh
Append the following line at the end, pointing at the directory that holds the Hadoop configuration files (Hadoop 2.x keeps them under etc/hadoop rather than conf):
export HADOOP_CONF_DIR=/usr/local/hdfs/hadoop/etc/hadoop
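HADOOP_CONF_DIR must point at the directory that actually contains core-site.xml and yarn-site.xml (the same $HADOOP_HOME/etc/hadoop edited in 3.5 below). A quick check, assuming that layout:
$ ls /usr/local/hdfs/hadoop/etc/hadoop/core-site.xml /usr/local/hdfs/hadoop/etc/hadoop/yarn-site.xml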
3.4 spark-defaults.conf
$ start-all.sh                       # Hadoop's start-all.sh: HDFS must be running for the next commands
$ hdfs dfs -mkdir /spark_lib
$ hdfs dfs -mkdir /spark-logs
$ hdfs dfs -put ~/hdfs/spark/jars/* /spark_lib
$ # stop-all.sh                      # optional: Hadoop may be stopped again here, but it must be running for Section 4
$ vi ./spark-defaults.conf
Append the following at the end, replacing Master:9000 with your own NameNode hostname and fs.defaultFS port:
spark.master                     yarn
spark.yarn.jars                  hdfs://Master:9000/spark_lib/*.jar
spark.yarn.stagingDir            hdfs://Master:9000/tmp
spark.history.provider           org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory    hdfs://Master:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port            18080
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://Master:9000/spark-logs
spark.master yarn tells Spark to run on YARN; spark.yarn.jars points at the jars uploaded to /spark_lib above; spark.yarn.stagingDir is the temporary directory Spark uses while jobs run; the spark.history.* and spark.eventLog.* settings write event logs to /spark-logs so the history server (port 18080) can read them. Do not keep "#" markers or inline comments in this file, as they would disable or corrupt the settings.
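With the event-log settings above in place, the history server can be started and its web UI opened at http://Master:18080 (HDFS must be running and /spark-logs must already exist):
$ $SPARK_HOME/sbin/start-history-server.sh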
3.5 yarn-site.xml
Set the ResourceManager addresses and disable YARN's physical/virtual memory checks, which would otherwise kill Spark containers on machines with little memory:
$ sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following to the existing Hadoop configuration, inside <configuration>:
<property>
  <name>yarn.resourcemanager.address</name>
  <value>Master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>Master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>Master:8030</value>
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>false</value>
</property>
3.6 mapred-site.xml
$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following to the existing Hadoop configuration, inside <configuration>:
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>Master:54311</value>
  <description>MapReduce job tracker runs at this host and port.
  </description>
</property>
Here Master stands for your own NameNode hostname.
Copy the configured Spark directory to every worker node (hostnames as in the slaves file):
$ for i in Slave2 Slave3; do scp -r /usr/local/hdfs/spark2.4.5 $i:/usr/local/hdfs/; done
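The updated yarn-site.xml and mapred-site.xml also have to reach every node, and YARN must be restarted before the new settings take effect. A sketch, assuming the same $HADOOP_HOME layout on all nodes:
$ for i in Slave2 Slave3; do scp $HADOOP_HOME/etc/hadoop/yarn-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml $i:$HADOOP_HOME/etc/hadoop/; done
$ stop-yarn.sh && start-yarn.sh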
4. Starting the Services
$ start-all.sh                       # Hadoop's start-all.sh (skip if HDFS/YARN are already running)
$ $SPARK_HOME/sbin/start-all.sh      # Spark's start-all.sh; use the full path to avoid the name clash
Check with jps; if the Master and Worker processes appear, the startup succeeded:
$ jps
71601 SecondaryNameNode
71347 DataNode
71827 ResourceManager
72405 Master
71212 NameNode
71964 NodeManager
72508 Worker
72734 Jps
$ spark-shell
After a successful start you get the "scala>" prompt shown below, and "master = yarn" confirms Spark is running on YARN:
Spark context available as 'sc' (master = yarn, app id = application_1628143668230_0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_301)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
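As a final end-to-end check, the bundled SparkPi example can be submitted to YARN (a sketch; the examples jar name assumes the stock 2.4.5 distribution):
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.5.jar 10
In cluster mode the "Pi is roughly ..." result line ends up in the driver's container log, which can be read with yarn logs -applicationId <appId> once the job finishes.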