Notes on some PySpark pitfalls.
After reading files with textFile, I wanted to see how much data there was, so I tried to count it:
rdd = sc.textFile("/home/parastor/backup/datum/bus/gps/2017-07-17/*/*.gz").filter(lambda x: x is not None)
print(rdd.count())
That raised an error:
Traceback (most recent call last):
  File "/root/hxj/tmp/pycharm/fmm4bus.py", line 98, in <module>
    print(rdd.count())
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 1055, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 1046, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 917, in fold
    vals = self.mapPartitions(func).collect()
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
	at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
	at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
	at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
	at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
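The fix is to set JAVA_HOME explicitly at the top of the program. Class file major version 55 corresponds to Java 11, so Spark was evidently running on a newer JDK than this Spark build can handle; pointing it at a Java 8 JDK makes the error go away.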
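For reference, the number in "Unsupported class file major version 55" can be mapped back to a Java release. The sketch below relies on the fact that, from Java 5 onward, the class file major version is the release number plus 44 (52 is Java 8, 55 is Java 11). The function name is made up for illustration and is not part of PySpark.

```python
def java_release_from_major(major: int) -> int:
    """Map a class file major version to the Java release that produces it.

    Hypothetical helper for illustration; covers Java 5 (major 49) and later.
    """
    if major < 49:
        raise ValueError(f"major version {major} predates Java 5")
    return major - 44


print(java_release_from_major(55))  # the version from the error message
print(java_release_from_major(52))  # the version produced by a Java 8 JDK
```

Applied to the traceback above, 55 maps to Java 11, while the JDK chosen later in this note (jdk1.8.0_212) produces major version 52.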
I use a remote interpreter, so I first logged in to the server to check the Java installation.
Run
java -verbose
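and look at the last two lines of the output. With a Java 8 JDK they show the path its classes were loaded from, which tells you where that JDK is installed.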
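If you would rather not read the path off by eye, the same lookup can be sketched in Python. This is only a rough sketch under assumptions: it expects Java 8 style lines of the form "[Loaded <class> from <path>]", the sample text is illustrative rather than captured output, and the helper name guess_java_home is hypothetical. Newer JDKs print class loading information in a different format, so the pattern would need adjusting there.

```python
import re
from typing import Optional

# Assumed Java 8 format: "[Loaded java.lang.Shutdown from /some/jdk/jre/lib/rt.jar]"
_LOADED = re.compile(r"^\[Loaded \S+ from (?P<path>/.+)\]$")


def guess_java_home(verbose_output: str) -> Optional[str]:
    """Guess the JDK directory from the tail of `java -verbose` output."""
    # Walk backwards so the last "[Loaded ...]" lines are checked first.
    for line in reversed(verbose_output.strip().splitlines()):
        match = _LOADED.match(line.strip())
        if match is None:
            continue
        path = match.group("path")
        marker = "/jre/lib/"
        if marker in path:
            # Everything before /jre/lib/ is taken to be the JDK directory.
            return path.split(marker)[0]
    return None


# Illustrative sample, not real captured output.
sample = (
    "[Loaded java.lang.Shutdown from /usr/lib/java/jdk1.8.0_212/jre/lib/rt.jar]\n"
    "[Loaded java.lang.Shutdown$Lock from /usr/lib/java/jdk1.8.0_212/jre/lib/rt.jar]\n"
)
print(guess_java_home(sample))
```

In practice the text would come from running java -verbose on the server and passing its output to the function.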
Then set it in the code:
import os
# Set this before the SparkContext is created, so that the JVM is launched from this JDK.
os.environ['JAVA_HOME'] = '/usr/lib/java/jdk1.8.0_212'