Spark job optimization — uploading dependencies to HDFS (using spark.yarn.jars and spark.yarn.archive)

1. Overview

When a Spark application is submitted to YARN with neither spark.yarn.archive nor spark.yarn.jars configured, the client logs the warning "Neither spark.yarn.jars nor spark.yarn.archive is set" and then uploads every local jar to HDFS one by one, as shown below. This upload happens on every submission and can be very slow. Setting spark.yarn.archive or spark.yarn.jars in spark-defaults.conf avoids the repeated uploads and shortens application startup time.

 Will allocate AM container, with 896 MB memory including 384 MB overhead
2020-12-01 11:16:11 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 11:16:11 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 11:16:11 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 11:16:12 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2020-12-01 11:16:14 INFO  Client:54 - Uploading resource file:/tmp/spark-897c6291-e0bd-47e6-8d42-7f67225c4819/__spark_libs__5294834939010995385.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/__spark_libs__5294834939010995385.zip
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/wordcount.jar
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/zookeeper-3.4.6.jar
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xz-1.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/xz-1.0.jar
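The log above comes from an ordinary YARN submission. A minimal sketch of such a command follows; the application jar path matches the log, but the main class name is a hypothetical placeholder:

```shell
# Hypothetical submission; com.example.WordCount is a placeholder class name.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  /home/workspace/wordcount/wordcount.jar
```

Without the configuration described below, every such run re-uploads the whole SPARK_HOME/jars directory to the staging area on HDFS.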

2. What the official Spark docs say about these two settings

The "Running on YARN" configuration page describes them roughly as follows (paraphrased):

spark.yarn.jars: a list of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN uses the Spark jars installed locally, but the jars can also be placed in a world-readable location on HDFS so that YARN can cache them on the nodes and does not need to distribute them each time an application runs. To point to jars on HDFS, set this to something like hdfs:///some/path; globs are allowed.

spark.yarn.archive: an archive containing the needed Spark jars for distribution to the YARN cache. If set, this replaces spark.yarn.jars and the archive is used in all of the application's containers. The jars must be in the root directory of the archive. Like the previous option, the archive can be hosted on HDFS to speed up file distribution.

3. Using spark.yarn.jars

3.1 Upload all jars under the Spark installation's jars directory to HDFS

 hadoop fs -mkdir -p  /spark-yarn/jars
 hadoop fs -put /opt/module/spark-2.3.2-bin-hadoop2.7/jars/* /spark-yarn/jars/

3.2 Edit spark-defaults.conf

spark.yarn.jars hdfs://hadoop122:9000/spark-yarn/jars/*.jar

3.3 Result

2020-12-01 13:53:52 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 13:53:52 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 13:53:52 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 13:53:53 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/JavaEWAH-0.3.2.jar
2020-12-01 13:53:53 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/RoaringBitmap-0.5.11.jar

3.4 Possible error

ERROR client.TransportClient: Failed to send RPC

Caused by: java.io.IOException: Failed to send RPC 5353749227723805834 to /192.168.10.122:58244: java.nio.channels.ClosedChannelException
	at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
	at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

The ClosedChannelException looks like a timeout, but it is usually the NodeManager killing the container for exceeding its physical or virtual memory limit. The same error can also appear when running spark-shell --master yarn-client. Adding the following to yarn-site.xml disables those memory checks and works around it:

<property>
		<name>yarn.nodemanager.pmem-check-enabled</name>
		<value>false</value>
</property>
<property>
		<name>yarn.nodemanager.vmem-check-enabled</name>
		<value>false</value>
</property>
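Disabling the checks hides the symptom. An alternative that keeps the checks on is to give the containers more headroom in spark-defaults.conf (the values below are illustrative, in MiB):

```
# Illustrative values; tune to your workload.
spark.driver.memoryOverhead    1024
spark.executor.memoryOverhead  1024
```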

4. Using spark.yarn.archive

4.1 Package all jars under the Spark installation's jars directory and upload the archive to HDFS

When building the zip, make sure all the jars sit at the root of the archive:

cd /opt/module/spark-2.3.2-bin-hadoop2.7/jars/
zip -q -r spark_jars_2.3.2.zip *
hadoop fs -mkdir /spark-yarn/zip
hadoop fs -put spark_jars_2.3.2.zip /spark-yarn/zip/
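The root-directory requirement can be checked locally before uploading. The sketch below uses throwaway files under /tmp (hypothetical stand-ins for the real jars directory) to show what a correctly built archive looks like:

```shell
# Create dummy jars to stand in for the real SPARK_HOME/jars contents.
mkdir -p /tmp/demo-jars
touch /tmp/demo-jars/a.jar /tmp/demo-jars/b.jar

# cd into the directory first so the entries land at the zip root.
cd /tmp/demo-jars
rm -f /tmp/demo-root.zip
zip -q -r /tmp/demo-root.zip *

# zipinfo -1 lists entry names only: bare jar names, no directories.
zipinfo -1 /tmp/demo-root.zip
```

If the listing shows any path components, YARN will not find the Spark classes at the archive root (see 4.4 for the resulting error).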

4.2 Edit spark-defaults.conf

spark.yarn.archive hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip

4.3 Result

2020-12-01 14:41:53 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 14:41:53 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 14:41:53 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 14:41:54 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip
2020-12-01 14:41:54 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/wordcount.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/zstd-jni-1.3.2-2.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/zookeeper-3.4.6.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xz-1.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xz-1.0.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xmlenc-0.52.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xmlenc-0.52.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xml-apis-1.3.04.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xml-apis-1.3.04.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xercesImpl-2.9.1.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xercesImpl-2.9.1.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xbean-asm5-shaded-4.4.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xbean-asm5-shaded-4.4.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/spark-core_2.11-2.3.2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/spark-core_2.11-2.3.2.jar

4.4 Possible error

The application's driver log shows the following error:

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

If the archive is built as follows, the directory hierarchy is preserved inside the zip, and the error above occurs:

zip -q -r spark_jars_2.3.2.zip /opt/module/spark-2.3.2-bin-hadoop2.7/jars/*


5. Comparison

The official docs state that if both are configured, spark.yarn.archive takes precedence: it replaces spark.yarn.jars.
Both settings produce the message

Source and destination file systems are the same. Not copying ...

at submission time. With spark.yarn.jars, every jar that has already been uploaded gets its own

Source and destination file systems are the same. Not copying

line, while with spark.yarn.archive there is only the single line

Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip

and the application's own jars are still uploaded from the local machine (see the log in 4.3).
So how exactly does the spark.yarn.archive approach speed up file distribution?
Or should it be understood this way:
both approaches speed up the distribution of dependencies, and spark.yarn.jars additionally skips the local upload for jars that are already on HDFS?
Readers who know are welcome to discuss.

