1. Installing Java and Scala
1.1 Installing Java
Since this environment is CDH 6.3.1 and the JDK is already installed, this step is skipped.
[root@hp1 ~]# javac -version
javac 1.8.0_181
1.2 Installing Scala
1.2.1 Installation
Commands:
Official download page: https://www.scala-lang.org/download/
wget https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
tar -zxvf scala-2.13.1.tgz
mv scala-2.13.1 scala
Installation log:
[root@hp1 local]# cd /home/
[root@hp1 home]# ls
backup cloudera-host-monitor.bak3 cloudera-service-monitor.moved csv hdfs shell
[root@hp1 home]# mkdir software
[root@hp1 home]# cd software/
[root@hp1 software]# ls
[root@hp1 software]#
[root@hp1 software]# pwd
/home/software
[root@hp1 software]#
[root@hp1 software]# wget https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
--2021-04-08 10:37:47-- https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
Resolving downloads.lightbend.com (downloads.lightbend.com)... 13.35.121.34, 13.35.121.81, 13.35.121.50, ...
Connecting to downloads.lightbend.com (downloads.lightbend.com)|13.35.121.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19685743 (19M) [application/octet-stream]
Saving to: 'scala-2.13.1.tgz'
100%[================================================================================================================================================================>] 19,685,743 9.92MB/s in 1.9s
2021-04-08 10:37:50 (9.92 MB/s) - 'scala-2.13.1.tgz' saved [19685743/19685743]
[root@hp1 software]# tar -zxvf scala-2.13.1.tgz
scala-2.13.1/
scala-2.13.1/lib/
scala-2.13.1/lib/scala-compiler.jar
scala-2.13.1/lib/scalap-2.13.1.jar
scala-2.13.1/lib/scala-reflect.jar
scala-2.13.1/lib/jansi-1.12.jar
scala-2.13.1/lib/jline-2.14.6.jar
scala-2.13.1/lib/scala-library.jar
scala-2.13.1/doc/
scala-2.13.1/doc/licenses/
scala-2.13.1/doc/licenses/mit_jquery.txt
scala-2.13.1/doc/licenses/bsd_scalacheck.txt
scala-2.13.1/doc/licenses/bsd_asm.txt
scala-2.13.1/doc/licenses/apache_jansi.txt
scala-2.13.1/doc/licenses/bsd_jline.txt
scala-2.13.1/doc/LICENSE.md
scala-2.13.1/doc/License.rtf
scala-2.13.1/doc/README
scala-2.13.1/doc/tools/
scala-2.13.1/doc/tools/scaladoc.html
scala-2.13.1/doc/tools/scalap.html
scala-2.13.1/doc/tools/css/
scala-2.13.1/doc/tools/css/style.css
scala-2.13.1/doc/tools/scala.html
scala-2.13.1/doc/tools/index.html
scala-2.13.1/doc/tools/images/
scala-2.13.1/doc/tools/images/scala_logo.png
scala-2.13.1/doc/tools/images/external.gif
scala-2.13.1/doc/tools/scalac.html
scala-2.13.1/doc/tools/fsc.html
scala-2.13.1/bin/
scala-2.13.1/bin/fsc
scala-2.13.1/bin/scalap.bat
scala-2.13.1/bin/scala
scala-2.13.1/bin/scaladoc.bat
scala-2.13.1/bin/fsc.bat
scala-2.13.1/bin/scala.bat
scala-2.13.1/bin/scaladoc
scala-2.13.1/bin/scalap
scala-2.13.1/bin/scalac
scala-2.13.1/bin/scalac.bat
scala-2.13.1/LICENSE
scala-2.13.1/man/
scala-2.13.1/man/man1/
scala-2.13.1/man/man1/scalac.1
scala-2.13.1/man/man1/scala.1
scala-2.13.1/man/man1/scaladoc.1
scala-2.13.1/man/man1/fsc.1
scala-2.13.1/man/man1/scalap.1
scala-2.13.1/NOTICE
[root@hp1 software]#
[root@hp1 software]#
[root@hp1 software]#
[root@hp1 software]# mv scala-2.13.1 scala
1.2.2 Configuration
vim /etc/profile
export SCALA_HOME=/home/software/scala
export PATH=$SCALA_HOME/bin:$PATH
source /etc/profile
1.2.3 Launch
[root@hp1 software]# scala
Welcome to Scala 2.13.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181).
Type in expressions for evaluation. Or try :help.
scala>
scala>
2. Installing Apache Spark
Since this environment is CDH 6.3.1, Spark is already installed, so installation is skipped here as well.
Conveniently, pyspark is already installed too.
Locate the pyspark binary:
[root@hp1 ~]# which pyspark
/usr/bin/pyspark
[root@hp1 ~]#
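The bundled PySpark can also be imported from a plain Python interpreter, without starting the shell, by putting Spark's python directory and the py4j zip on sys.path first. A minimal sketch; the parcel paths are taken from the CDH 6.3.1 layout that shows up in the logs further below:
import sys
# Assumed CDH parcel layout (matches the paths that appear in the Spark logs below).
sys.path.insert(0, '/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python')
sys.path.insert(0, '/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/lib/py4j-0.10.7-src.zip')

import pyspark
print(pyspark.__version__)   # expect a 2.4.x version string here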
Running pyspark fails with an error:
[root@hp1 software]# pyspark
Python 2.7.5 (default, Apr 2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/08 11:03:36 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1042 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:346)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:186)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:511)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
21/04/08 11:03:36 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/04/08 11:03:36 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
21/04/08 11:03:36 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
21/04/08 11:03:36 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1042 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:346)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:186)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:511)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
21/04/08 11:03:36 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/04/08 11:03:36 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/shell.py", line 41, in <module>
spark = SparkSession._create_shell_session()
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/sql/session.py", line 594, in _create_shell_session
return SparkSession.builder.getOrCreate()
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 354, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 123, in __init__
conf, jsc, profiler_cls)
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 185, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 293, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
answer, self._gateway_client, None, self._fqn)
File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1042 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:346)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:186)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:511)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
[root@hp1 software]#
Searching online turned up a fix. The requested 1024 MB of executor memory plus 384 MB of overhead (1408 MB in total) exceeds the cluster's 1042 MB cap, so, given how little memory this test machine has, the YARN limits were raised as follows (an alternative that avoids touching YARN at all is sketched after this list):
yarn.app.mapreduce.am.resource.mb = 2g
yarn.nodemanager.resource.memory-mb = 4g
yarn.scheduler.maximum-allocation-mb = 2g
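Alternatively, the memory request itself can be shrunk so that executor heap plus overhead stays under the existing 1042 MB cap, leaving the YARN settings alone. A minimal sketch, using standard Spark 2.4 configuration keys; the application name and the exact values are illustrative:
# Sketch: request a smaller executor so heap + overhead fits under
# yarn.scheduler.maximum-allocation-mb (1042 MB on this cluster).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('small_executor_demo')               # illustrative name
        .setMaster('yarn')                               # client-mode YARN on this node
        .set('spark.executor.memory', '512m')            # executor heap
        .set('spark.executor.memoryOverhead', '384m'))   # off-heap overhead; 512 + 384 < 1042
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())                   # trivial job to prove executors start
sc.stop()
The same settings can also be passed when launching the shell, e.g. pyspark --conf spark.executor.memory=512m.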
After changing these values, restart the affected services so the configuration takes effect, then retest:
[root@hp1 software]# pyspark
Python 2.7.5 (default, Apr 2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/08 11:08:07 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
21/04/08 11:08:07 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0-cdh6.3.1
/_/
Using Python version 2.7.5 (default, Apr 2 2020 13:16:51)
SparkSession available as 'spark'.
>>>
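With the shell up, a quick sanity check confirms the session actually works. This one-liner only assumes the spark session that the shell itself just created (see the banner above):
# Typed at the >>> prompt; 'spark' is the SparkSession reported by the shell.
spark.range(100).selectExpr("sum(id) AS total").show()
# Expected: a one-row table with total = 4950.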
3. A PySpark example
Write a word-count program with pyspark.
Code:
wordcount.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import time

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set environment variables
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_181'  # Java installation
    os.environ['HADOOP_HOME'] = '/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop'  # Hadoop installation
    os.environ['SPARK_HOME'] = '/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark'  # Spark installation

    # Where the Spark program runs: local mode here, with 2 threads
    spark_conf = SparkConf()\
        .setAppName('Python_Spark_WordCount')\
        .setMaster('local[2]')

    # SparkContext instance, used to read the input data and run jobs
    sc = SparkContext(conf=spark_conf)
    # Set the log level. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
    sc.setLogLevel('WARN')
    # <SparkContext master=local[2] appName=Python_Spark_WordCount>
    print (sc)

    """
    Create the RDD holding the data to analyse:
    -1. Option 1: parallelize a local collection (list, tuple, dict)
    -2. Option 2: read from an external file system (HDFS, local FS)
    """
    # Option 1: create the RDD by parallelizing a local collection
    def local_rdd(spark_context):
        datas = ['hadoop spark', 'spark hive spark spark', 'spark hadoop python hive', ' ']
        return spark_context.parallelize(datas)  # create RDD

    # Option 2: read from an external file system (HDFS here)
    def hdfs_rdd(spark_context):
        return spark_context.textFile("/user/rdedu/wc.data")  # read the data from a file

    # rdd = local_rdd(sc)  # option 1
    rdd = hdfs_rdd(sc)     # option 2
    print rdd.count()
    print rdd.first()

    # ============= word count =======================================
    word_count_rdd = rdd\
        .filter(lambda line: len(line.strip()) != 0)\
        .flatMap(lambda line: line.strip().split(" "))\
        .map(lambda word: (word, 1))\
        .reduceByKey(lambda a, b: a + b)  # merge the values of identical keys

    for word, count in word_count_rdd.collect():  # collect() brings the RDD back as a list
        print word, ', ', count
    print "===================================="

    # Sort by count in descending order
    sort_rdd = word_count_rdd\
        .map(lambda (word, count): (count, word))\
        .sortByKey(ascending=False)
    print sort_rdd.collect()

    # def top(self, num, key=None):
    print word_count_rdd.top(3, key=lambda (word, count): count)
    # def takeOrdered(self, num, key=None) -> bottom N
    print word_count_rdd.takeOrdered(3, key=lambda (word, count): count)

    # Sleep for a while so the Spark web UI can be inspected while the app is still running
    time.sleep(100)

    # Stop the SparkContext
    sc.stop()
Test log:
[root@hp1 software]# spark-submit wordcount.py
21/04/08 14:13:13 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
21/04/08 14:13:13 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-11001905-0583-4ed6-a7f5-787ae6d9565c/__driver_logs__/driver.log
21/04/08 14:13:13 INFO spark.SparkContext: Submitted application: Python_Spark_WordCount
21/04/08 14:13:13 INFO spark.SecurityManager: Changing view acls to: root
21/04/08 14:13:13 INFO spark.SecurityManager: Changing modify acls to: root
21/04/08 14:13:13 INFO spark.SecurityManager: Changing view acls groups to:
21/04/08 14:13:13 INFO spark.SecurityManager: Changing modify acls groups to:
21/04/08 14:13:13 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/04/08 14:13:13 INFO util.Utils: Successfully started service 'sparkDriver' on port 42666.
21/04/08 14:13:13 INFO spark.SparkEnv: Registering MapOutputTracker
21/04/08 14:13:13 INFO spark.SparkEnv: Registering BlockManagerMaster
21/04/08 14:13:13 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/04/08 14:13:13 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/04/08 14:13:13 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-190fe2cf-05dc-415b-ba34-168b596ddbfd
21/04/08 14:13:13 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
21/04/08 14:13:13 INFO spark.SparkEnv: Registering OutputCommitCoordinator
21/04/08 14:13:13 INFO util.log: Logging initialized @1925ms
21/04/08 14:13:13 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
21/04/08 14:13:13 INFO server.Server: Started @2002ms
21/04/08 14:13:13 INFO server.AbstractConnector: Started ServerConnector@575fe4eb{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/04/08 14:13:13 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3cf93f90{/jobs,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f6c4275{/jobs/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@40aaa4fa{/jobs/job,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c969bc1{/jobs/job/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@313403cc{/stages,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@242cec64{/stages/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@51a6582b{/stages/stage,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b01516d{/stages/stage/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@38853d26{/stages/pool,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5c4278a{/stages/pool/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@73315507{/storage,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@22eae835{/storage/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@33178d43{/storage/rdd,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@17cb14dc{/storage/rdd/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6eabf24f{/environment,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b7514ef{/environment/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1407ff57{/executors,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ba564ff{/executors/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6aadcc4e{/executors/threadDump,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@59cc6f98{/executors/threadDump/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@721adeb1{/static,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e08dd87{/,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@29849cf7{/api,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3c51f401{/jobs/job/kill,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@252c4495{/stages/stage/kill,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hp1:4040
21/04/08 14:13:13 INFO executor.Executor: Starting executor ID driver on host localhost
21/04/08 14:13:13 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33083.
21/04/08 14:13:13 INFO netty.NettyBlockTransferService: Server created on hp1:33083
21/04/08 14:13:13 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/04/08 14:13:13 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:13 INFO storage.BlockManagerMasterEndpoint: Registering block manager hp1:33083 with 366.3 MB RAM, BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:13 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:13 INFO storage.BlockManager: external shuffle service port = 7337
21/04/08 14:13:13 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:14 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fba4081{/metrics/json,null,AVAILABLE,@Spark}
21/04/08 14:13:14 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1617862393786
21/04/08 14:13:14 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
21/04/08 14:13:14 INFO util.Utils: Extension com.cloudera.spark.lineage.NavigatorAppListener not being initialized.
21/04/08 14:13:14 INFO logging.DriverLogger$DfsAsyncWriter: Started driver log file sync to: /user/spark/driverLogs/local-1617862393786_driver.log
<SparkContext master=local[2] appName=Python_Spark_WordCount>
1
'hadoop spark', 'spark hive spark spark', 'spark hadoop python hive', ' '
python , 1
'spark , 2
spark , 1
hive , 1
' , 2
hive', , 1
'hadoop , 1
spark', , 2
hadoop , 1
====================================
[(2, u"'spark"), (2, u"'"), (2, u"spark',"), (1, u'python'), (1, u'spark'), (1, u'hive'), (1, u"hive',"), (1, u"'hadoop"), (1, u'hadoop')]
[(u"'spark", 2), (u"'", 2), (u"spark',", 2)]
[(u'python', 1), (u'spark', 1), (u'hive', 1)]
[root@hp1 software]#
[root@hp1 software]#
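The script above is Python 2 only (print statements and tuple-unpacking lambdas), which matches the Python 2.7.5 interpreter shown in the logs but will not run under Python 3. A Python 3 compatible sketch of the same word count, reusing the HDFS path from the original script, might look like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 3 compatible sketch of the word count above; only the input path
# is carried over from the original script.
from operator import add
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName('Python_Spark_WordCount_py3').setMaster('local[2]')
    sc = SparkContext(conf=conf)
    sc.setLogLevel('WARN')

    counts = (sc.textFile('/user/rdedu/wc.data')
                .filter(lambda line: len(line.strip()) != 0)
                .flatMap(lambda line: line.strip().split(' '))
                .map(lambda word: (word, 1))
                .reduceByKey(add))

    for word, count in counts.collect():
        print(word, count)

    # Replacement for the Python 2 tuple-unpacking lambdas: index the pair instead.
    print(counts.top(3, key=lambda wc: wc[1]))
    print(counts.takeOrdered(3, key=lambda wc: wc[1]))

    sc.stop()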
References:
1.https://blog.csdn.net/u013227399/article/details/102897606
2.https://www.cnblogs.com/erlou96/p/12933548.html