Spark 3.1.2 Single-Node Installation and Deployment
Overview
Spark is a high-performance cluster computing framework that is widely used in the big data field. It is similar to Hadoop, but improves on it: intermediate results of a computation can be kept in memory instead of being written back to HDFS at every step, which makes it a better fit for iterative algorithms.
Spark focuses on data processing and analysis; for storage it still relies on distributed file systems such as HDFS.
Big data problems generally fall into three scenarios:
- Complex batch processing
- Interactive queries over historical data
- Processing of real-time data streams
The Spark stack can cover all three of these scenarios.
Download
Download links:
http://spark.apache.org/downloads.html
or
https://archive.apache.org/dist/spark/
Choose the version that suits your environment.
Spark 2.x is pre-built with Scala 2.11 (Spark 2.4.2 is pre-built with Scala 2.12).
Spark 3.0+ is pre-built with Scala 2.12.
This tutorial uses Spark 3.1.2, pre-built for Hadoop 3.2 and Scala 2.12; the corresponding package is spark-3.1.2-bin-hadoop3.2.tgz. Note that "pre-built for Hadoop" does not mean a Hadoop installation is bundled; if you need Hadoop, it still has to be installed separately.
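For example, the package can be fetched straight from the Apache archive on the server (a minimal sketch; the target directory is taken from the paths used later in this article and is an assumption about your layout):
cd /home/bigData/softnew    // assumed upload/extract directory
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz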
Installation
Spark supports three deployment modes: standalone, Spark on Mesos, and Spark on YARN. This article covers only the simplest setup, standalone mode on a single machine; the other modes will be covered in later articles.
Standalone mode does not depend on the Hadoop ecosystem, is simple to deploy, and is convenient for learning and testing.
The installation environment used here is CentOS 7.
1. Upload the downloaded package to the target directory on the server and extract it
[root@localhost softnew]# tar zxvf spark-3.1.2-bin-hadoop3.2.tgz
2. Switch to the hadoop user
Following the Hadoop installation guide "Hadoop 3.2.1 Installation: Single-Node and Pseudo-Distributed Modes", the hadoop user is used for the operations here as well.
su hadoop
3. Change the ownership of the extracted Spark directory
sudo chown -R hadoop:hadoop spark-3.1.2-bin-hadoop3.2
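The environment variable below points SPARK_HOME at /home/bigData/softnew/spark rather than at the versioned directory name. One way to make the two match (an assumption, not shown in the original steps) is to create a symlink:
sudo ln -s /home/bigData/softnew/spark-3.1.2-bin-hadoop3.2 /home/bigData/softnew/spark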
4. Edit the configuration files
Edit /etc/profile and add the Spark environment variable:
# Spark Environment Variables
export SPARK_HOME=/home/bigData/softnew/spark
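Optionally, Spark's bin and sbin directories can also be added to PATH so that spark-shell and the start scripts can be run from any directory (this line is an addition, not part of the original snippet):
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin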
After editing, remember to run source /etc/profile to make the change take effect.
Create spark-env.sh from the template under spark/conf and edit it:
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following line:
export JAVA_HOME=/home/wch/jdk1.8.0_151
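Optionally, the master host and ports can be pinned in spark-env.sh as well; these lines are not part of the original walkthrough, but setting them explicitly helps avoid the port-fallback behaviour described in the next step (the values shown are assumptions for this host):
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080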
5. Start Spark
Starting Spark with sbin/start-all.sh is not recommended here. Spark needs to start two processes, a master and a worker, and the worker must connect to the master over the network. When sbin/start-all.sh is used, the worker tries to connect to a master on the default port 7077; but if 7077 happens to be occupied when the master starts, the master falls back to 7078, 7079, and so on, and the worker will then fail to reach it.
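If you want to know in advance whether the default port is free, one way (not covered in the original text) is to check for a listener on 7077 before starting the master:
ss -tlnp | grep 7077    // no output means nothing is listening on port 7077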
You can start it as follows:
[hadoop@localhost sbin]$ jps
111970 Jps
[hadoop@localhost sbin]$ ./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-localhost.localdomain.out
[hadoop@localhost sbin]$ tail -300f /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-localhost.localdomain.out // view the log
Spark Command: /home/wch/jdk1.8.0_151/bin/java -cp /home/bigData/softnew/spark/conf/:/home/bigData/softnew/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/07/07 17:46:01 INFO Master: Started daemon with process name: 112153@localhost.localdomain
21/07/07 17:46:01 INFO SignalUtils: Registering signal handler for TERM
21/07/07 17:46:01 INFO SignalUtils: Registering signal handler for HUP
21/07/07 17:46:01 INFO SignalUtils: Registering signal handler for INT
21/07/07 17:46:01 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.136.53 instead (on interface ens32)
21/07/07 17:46:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/07/07 17:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/07 17:46:02 INFO SecurityManager: Changing view acls to: hadoop
21/07/07 17:46:02 INFO SecurityManager: Changing modify acls to: hadoop
21/07/07 17:46:02 INFO SecurityManager: Changing view acls groups to:
21/07/07 17:46:02 INFO SecurityManager: Changing modify acls groups to:
21/07/07 17:46:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/07/07 17:46:02 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
21/07/07 17:46:02 INFO Utils: Successfully started service 'sparkMaster' on port 7078.
21/07/07 17:46:02 INFO Master: Starting Spark master at spark://localhost:7078
21/07/07 17:46:02 INFO Master: Running Spark version 3.1.2
21/07/07 17:46:02 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
21/07/07 17:46:02 WARN Utils: Service 'MasterUI' could not bind on port 8081. Attempting port 8082.
21/07/07 17:46:03 INFO Utils: Successfully started service 'MasterUI' on port 8082.
21/07/07 17:46:03 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://192.168.136.53:8082
21/07/07 17:46:03 INFO Master: I have been elected leader! New state: ALIVE
^C
[hadoop@localhost sbin]$ ./start-worker.sh spark://localhost:7078
starting org.apache.spark.deploy.worker.Worker, logging to /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
[hadoop@localhost sbin]$ jps
112153 Master
112445 Jps
112350 Worker
The general idea is: start the master process first, read its log to find the actual master URL (spark://localhost:7078 here), then start the worker with that URL.
With that, a standalone Spark instance is up and running, and you can do some simple development work in spark-shell.
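The master web UI is reachable at the address reported in the log above (http://192.168.136.53:8082 in this run). When you are done, the processes can be stopped with the matching scripts in sbin (a sketch, assumed to be run as the hadoop user from the sbin directory):
./stop-worker.sh
./stop-master.sh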
Quick test
Start spark-shell:
[root@localhost bin]# ./spark-shell
When startup completes, the scala> prompt appears:
Read the README.md file in the Spark installation directory, count its lines, and print the first line.
scala> val textFile = sc.textFile("/home/bigData/softnew/spark/README.md") // read README.md
textFile: org.apache.spark.rdd.RDD[String] = /home/bigData/softnew/spark/README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> textFile.count() // count the number of lines
res0: Long = 108
scala> textFile.first() //打印第一行字符
res1: String = # Apache Spark
scala>
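As a slightly richer exercise, the classic word count can be run on the same file from the spark-shell prompt (a sketch reusing the textFile RDD created above; the exact output depends on the README contents):
scala> val counts = textFile.flatMap(line => line.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _) // split lines into words and count occurrences of each word
scala> counts.sortBy(_._2, ascending = false).take(5).foreach(println) // print the five most frequent words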
Conclusion
Spark can be deployed as a cluster or on a single machine for learning and testing. This article walked through the single-node deployment process and its main pitfalls. Developing only at the spark-shell command line is inconvenient, though, so a follow-up article will look at developing Spark applications in IDEA. Stay tuned, and feel free to get in touch to discuss.