Spark 3 Big Data Real-Time Processing: Streaming + Structured Streaming in Practice

With the rapid development of cloud computing and big data, real-time big data processing is needed in more and more enterprise scenarios. This course gives an all-round walkthrough of enterprise-grade real-time processing: based on Spark 3, you will learn two real-time processing solutions, Spark Streaming and Structured Streaming, within one and the same project. Beyond the frameworks themselves, the course not only walks you through the complete workflow of a full real-time processing solution, so that what you learn can be applied directly, but also reviews common big data interview questions and the real-time solutions used at large companies, helping you get across the last mile of the interview.

Who this course is for
Learners who want to move into or already work in big data development
Learners with a strong interest in Spark
Learners who want to master real-time big data processing
Prerequisites
Basic Linux commands
Basic Hadoop commands
Basic Scala syntax

The downloaded package is built without a bundled Hadoop; after some simple configuration it can be used with any Hadoop version.

2 Single-machine installation
2.1 Install the Java JDK
2.2 Install Hadoop
Spark uses HDFS and YARN, so Hadoop must be installed before Spark can be used. If your Spark jobs do not touch HDFS, Hadoop does not need to be running, but it still has to be installed.
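When a job does need HDFS, start it and check that the daemons are up before launching Spark. A minimal sketch, assuming Hadoop is installed under /usr/local/hadoop (the same path used in spark-env.sh below):
$ /usr/local/hadoop/sbin/start-dfs.sh   # start NameNode, DataNode and SecondaryNameNode
$ jps                                   # the three HDFS daemons should appear in the list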

2.3 Install Spark
2.3.1 Extract the archive
$ sudo tar -xzvf /home/hadoop/Desktop/spark-2.4.2-bin-without-hadoop.tgz -C /usr/local
$ cd /usr/local
$ sudo mv ./spark-2.4.2-bin-without-hadoop/ ./spark
$ sudo chown -R hadoop:hadoop spark/
$ gedit /home/hadoop/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
$ source /home/hadoop/.bashrc
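To confirm that the new variables have taken effect, a quick check can be run in the same shell (a sketch; the expected output assumes the paths used above):
$ echo $SPARK_HOME
/usr/local/spark
$ which spark-shell
/usr/local/spark/bin/spark-shell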

2.3.2 Edit spark-env.sh
$ cd /usr/local/spark/conf
$ cp ./spark-env.sh.template spark-env.sh
$ gedit spark-env.sh
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
With this setting in place, Spark can store data in the Hadoop distributed file system (HDFS) and read data back from it. Without it, Spark can only read and write local files and cannot access HDFS.
Once this configuration is done, Spark can be used right away; unlike Hadoop, there is no startup command to run.
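One way to verify HDFS access is to read a file from the interactive shell introduced in section 3. This is only a sketch: it assumes HDFS is running, that the NameNode address is localhost:9000 (adjust to match your core-site.xml), and that a file /user/hadoop/test.txt has already been uploaded to HDFS.
scala> val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt")  // RDD backed by an HDFS file
scala> lines.count()                                                          // triggers the read and returns the line count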

2.4 Running the examples
The directory /usr/local/spark/examples/src/main contains Spark example programs in Scala, Java, Python, R and other languages.
Run the SparkPi example, which computes an approximation of pi.
(1) Scala version (run with run-example)
$ run-example SparkPi
Pi is roughly 3.144195720978605
The log output looks like this:
spark.SparkContext: Running Spark version 2.4.2
spark.SparkContext: Submitted application: Spark Pi
Successfully started service 'sparkDriver' on port 44037.
Created local directory at /tmp/blockmgr-ac1cea86-5092-4e7b-92e8-aedb55cd2164
MemoryStore started with capacity 413.9 MB
Started ServerConnector@37eb7628{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Successfully started service 'SparkUI' on port 4040.
Bound SparkUI to 0.0.0.0, and started at http://192.168.1.112:4040
Server created on 192.168.1.112:36417
Stopped Spark@37eb7628{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Stopped Spark web UI at http://192.168.1.112:4040
spark.SparkContext: Successfully stopped SparkContext
Deleting directory /tmp/spark-a1ecab95-ffc0-4bdc-986a-e361082d6bcc
Deleting directory /tmp/spark-57ecb328-079c-409f-aa81-bd0fea255b8b
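run-example also forwards any extra arguments to the example program. For SparkPi the first argument is the number of partitions (slices) to sample over, so a larger value usually gives a more accurate estimate; the value 100 below is arbitrary:
$ run-example SparkPi 100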

(2) Python version (run with spark-submit)
$ spark-submit /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.131640
The log output looks like this:
Running Spark version 2.4.2
spark.SparkContext: Submitted application: PythonPi
Successfully started service 'sparkDriver' on port 39419.
Created local directory at /tmp/blockmgr-7304b67f-e55f-40e0-a376-ee23c8af6656
MemoryStore started with capacity 413.9 MB
Started ServerConnector@6fc8f2a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Successfully started service 'SparkUI' on port 4040.
Bound SparkUI to 0.0.0.0, and started at http://192.168.1.112:4040
Starting executor ID driver on host localhost
Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/hadoop/spark-warehouse').
Warehouse path is 'file:/home/hadoop/spark-warehouse'.
Pi is roughly 3.131640
Stopped Spark@6fc8f2a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Stopped Spark web UI at http://192.168.1.112:4040
Successfully stopped SparkContext
Deleting directory /tmp/spark-24407c80-129f-4c38-a9a7-c3fd7ae29267
Deleting directory /tmp/spark-6da4353c-4072-41ba-b1b5-01747d9faffb
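spark-submit accepts the usual options in addition to the script's own arguments. For example, the master URL and the number of partitions used by pi.py can be set explicitly (a sketch; local[4] simply means four local worker threads, and 10 is an arbitrary partition count):
$ spark-submit --master local[4] /usr/local/spark/examples/src/main/python/pi.py 10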

3 Interactive programming with Spark
Spark's interactive shells support Scala and Python.

3.1 Scala interactive shell
$ spark-shell
spark-shell initially failed with the error jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V. This appears to be caused by a version that is too new; switching back to Spark 2.1.1 resolved the problem.
Spark context Web UI available at http://192.168.1.112:4040
Spark context available as 'sc' (master = local[*], app id = local-1556961357179).
Spark session available as 'spark'.
scala>
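At the scala> prompt, the pre-created SparkContext sc and SparkSession spark can be used directly. A minimal sketch (the numbers are arbitrary):
scala> val data = sc.parallelize(1 to 100)   // build an RDD from a local collection
scala> data.map(_ * 2).reduce(_ + _)         // double every element and sum the results
res0: Int = 10100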

3.2 Python interactive shell
$ pyspark
Using Python version 3.5.2 (default, Nov 12 2018 13:43:14)
SparkSession available as 'spark'.
