Spark 3.1.2 Single-Node Installation and Deployment


Overview

Spark is a high-performance cluster computing framework widely used in the big data field. It is similar to Hadoop but improves on it: intermediate results of a computation can be kept in memory instead of being written back to HDFS after every step, which makes Spark much better suited to iterative algorithms.
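As a minimal illustration of the in-memory idea (a sketch only; it assumes a spark-shell session where sc is the SparkContext, and the input path is hypothetical):

 // cache() keeps the RDD in memory after the first action computes it
 val nums = sc.textFile("/tmp/numbers.txt").map(_.toDouble).cache()
 val total = nums.sum()   // first action: reads the file, materializes the cache
 val peak  = nums.max()   // second action: served from memory, no re-read from disk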

Spark focuses on processing and analyzing data; for storage it still relies on systems such as the Hadoop Distributed File System (HDFS).

Big data workloads generally fall into three scenarios:

  • Complex batch processing
  • Interactive queries over historical data
  • Processing of real-time data streams

The Spark stack covers essentially all three: Spark Core and Spark SQL handle batch processing and interactive queries, while Spark Streaming / Structured Streaming handles real-time streams.

Download

Download links:

http://spark.apache.org/downloads.html

or

https://archive.apache.org/dist/spark/

Choose the version that suits your environment.

Spark 2.x is prebuilt with Scala 2.11 (except Spark 2.4.2, which is prebuilt with Scala 2.12).

Spark 3.0+ is prebuilt with Scala 2.12.

This tutorial uses Spark 3.1.2, prebuilt for Hadoop 3.2 and Scala 2.12; the corresponding package is spark-3.1.2-bin-hadoop3.2.tgz. Note that "prebuilt for Hadoop" only means the matching Hadoop client libraries are bundled; it does not mean Hadoop itself no longer needs to be installed if you want HDFS or YARN.
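For example, the package can be fetched straight from the Apache archive (the URL follows the layout of the archive site linked above):

 wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz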

Installation

Spark supports three deployment modes: standalone, Spark on Mesos, and Spark on YARN. This article covers only the simplest case, standalone mode on a single machine; the other modes will be covered in later articles.

Single-node standalone mode does not depend on any Hadoop ecosystem components; it is simple to deploy and convenient for learning and testing.

The environment used in this article is CentOS 7.

1. Upload the downloaded package to the chosen directory on the server and extract it:

[root@localhost softnew]# tar zxvf spark-3.1.2-bin-hadoop3.2.tgz

2. Switch to the hadoop user

Following the earlier Hadoop installation guide《Hadoop3.2.1安装-单机模式和伪分布式模式》, this article again operates as the hadoop user:

 su hadoop

3. Change the ownership of the extracted Spark directory:

 sudo chown hadoop:hadoop -R spark-3.1.2-bin-hadoop3.2
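Note that SPARK_HOME in the next step points to /home/bigData/softnew/spark rather than the versioned directory name, so the extracted directory was presumably renamed or symlinked; for example (the symlink is an assumption of mine, not shown in the original steps):

 ln -s /home/bigData/softnew/spark-3.1.2-bin-hadoop3.2 /home/bigData/softnew/spark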

4. Edit the configuration files

Edit /etc/profile and add a Spark environment variable:

 # Spark Environment Variables
 export SPARK_HOME=/home/bigData/softnew/spark

After editing, remember to run source /etc/profile for the change to take effect.
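Optionally (an addition of mine, not in the original snippet), you can also put the Spark scripts on the PATH so that spark-shell and the sbin scripts can be invoked from any directory:

 export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH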

Create spark-env.sh from spark/conf/spark-env.sh.template and open it for editing:

 cp spark-env.sh.template spark-env.sh

 vi spark-env.sh 

Then add the following line:

 export JAVA_HOME=/home/wch/jdk1.8.0_151

5. Start Spark

Starting Spark with sbin/start-all.sh is not recommended here. Spark needs to start two processes, a master and a worker, and the worker must connect to the master over the network. The start-all.sh script starts the worker against the master's default port 7077, but if 7077 happens to be occupied when the master starts, the master falls back to 7078, 7079, and so on, and the worker then fails to find it.
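If you do want start-all.sh to behave predictably, one option (my suggestion, using documented spark-env.sh variables) is to pin the master address in spark-env.sh and make sure the chosen ports are actually free before starting:

 # documented spark-env.sh variables; verify the ports are free first
 export SPARK_MASTER_HOST=localhost
 export SPARK_MASTER_PORT=7077
 export SPARK_MASTER_WEBUI_PORT=8080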

Here we start the two processes manually instead:

[hadoop@localhost sbin]$ jps
111970 Jps
[hadoop@localhost sbin]$ ./start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-localhost.localdomain.out
[hadoop@localhost sbin]$ tail -300f /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-localhost.localdomain.out   # view the log
Spark Command: /home/wch/jdk1.8.0_151/bin/java -cp /home/bigData/softnew/spark/conf/:/home/bigData/softnew/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/07/07 17:46:01 INFO Master: Started daemon with process name: 112153@localhost.localdomain
21/07/07 17:46:01 INFO SignalUtils: Registering signal handler for TERM
21/07/07 17:46:01 INFO SignalUtils: Registering signal handler for HUP
21/07/07 17:46:01 INFO SignalUtils: Registering signal handler for INT
21/07/07 17:46:01 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.136.53 instead (on interface ens32)
21/07/07 17:46:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/07/07 17:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/07 17:46:02 INFO SecurityManager: Changing view acls to: hadoop
21/07/07 17:46:02 INFO SecurityManager: Changing modify acls to: hadoop
21/07/07 17:46:02 INFO SecurityManager: Changing view acls groups to: 
21/07/07 17:46:02 INFO SecurityManager: Changing modify acls groups to: 
21/07/07 17:46:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/07/07 17:46:02 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
21/07/07 17:46:02 INFO Utils: Successfully started service 'sparkMaster' on port 7078.
21/07/07 17:46:02 INFO Master: Starting Spark master at spark://localhost:7078
21/07/07 17:46:02 INFO Master: Running Spark version 3.1.2
21/07/07 17:46:02 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
21/07/07 17:46:02 WARN Utils: Service 'MasterUI' could not bind on port 8081. Attempting port 8082.
21/07/07 17:46:03 INFO Utils: Successfully started service 'MasterUI' on port 8082.
21/07/07 17:46:03 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://192.168.136.53:8082
21/07/07 17:46:03 INFO Master: I have been elected leader! New state: ALIVE
^C
[hadoop@localhost sbin]$ ./start-worker.sh spark://localhost:7078
starting org.apache.spark.deploy.worker.Worker, logging to /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
[hadoop@localhost sbin]$ jps
112153 Master
112445 Jps
112350 Worker
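At this point both processes are up. As a quick sanity check (a sketch of mine; the web UI port 8082 comes from the master log above), the master web UI should list one registered worker:

 curl -s http://localhost:8082/ | grep -c Worker   # non-zero once the worker has registered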

The general approach: start the master first, read its log to find the master URL spark://localhost:7078, then start the worker with that URL.
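A quick way to pull that URL out of the log (a one-liner of mine, using standard grep):

 grep -o 'spark://[^ ]*' /home/bigData/softnew/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-localhost.localdomain.out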

With that, a standalone Spark instance is up and running, and you can do some simple development work in spark-shell.
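To shut everything down later, stop the worker first and then the master, using the sbin scripts that match the start scripts used above:

[hadoop@localhost sbin]$ ./stop-worker.sh
[hadoop@localhost sbin]$ ./stop-master.sh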

Simple Test

Start spark-shell:

[root@localhost bin]# ./spark-shell 

Once startup completes, the scala> prompt appears.
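Note that ./spark-shell with no arguments runs Spark in local mode rather than against the standalone master started above. To attach the shell to the master, pass its URL (the port comes from the master log):

 ./spark-shell --master spark://localhost:7078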


Read the README.md file under the Spark installation directory, count its lines, and show the first line.

scala> val textFile = sc.textFile("/home/bigData/softnew/spark/README.md")   // read README.md
textFile: org.apache.spark.rdd.RDD[String] = /home/bigData/softnew/spark/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> textFile.count()   // count the lines (RDD records)
res0: Long = 108                                                                

scala> textFile.first()   // print the first line
res1: String = # Apache Spark

scala> 
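Going one step further, a classic word count over the same file, continuing in the same session (a sketch; output omitted):

scala> val counts = textFile.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.sortBy(_._2, ascending = false).take(5)   // five most frequent tokens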

Conclusion

Spark can be deployed on a cluster or on a single machine for learning and testing; this article covered the single-node deployment process and its pitfalls. Developing purely on the spark-shell command line quickly becomes inconvenient, though, so a follow-up article will look at developing Spark applications in IDEA. Stay tuned, and feel free to get in touch to discuss.
