PySpark: count() fails on an RDD read with textFile

A note on one of the pitfalls I have hit with PySpark.

After reading some files with textFile, I wanted to know how many records there were, so I called count():

rdd = sc.textFile("/home/parastor/backup/datum/bus/gps/2017-07-17/*/*.gz").filter(lambda x: x is not None)
print(rdd.count())

This raised an error:

Traceback (most recent call last):
  File "/root/hxj/tmp/pycharm/fmm4bus.py", line 98, in <module>
    print(rdd.count())
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 1055, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 1046, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 917, in fold
    vals = self.mapPartitions(func).collect()
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)

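The fix is to set JAVA_HOME explicitly at the start of the program. Class file major version 55 corresponds to Java 11, so Spark was being launched on a newer JDK than the ASM library in the traceback can read. Pointing it at JDK 8 avoids the error.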
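The number in the error message identifies the Java release that compiled the classes being read. As a minimal sketch (the helper name java_release_from_major is my own, not part of PySpark), the mapping for Java 5 and later is simply the major version minus 44:

```python
def java_release_from_major(major: int) -> int:
    """Map a class file major version to the Java release that produces it.

    Only Java 5 (major version 49) and later are handled, where the
    release number equals the major version minus 44.
    """
    if major < 49:
        raise ValueError(f"major version {major} predates Java 5")
    return major - 44


print(java_release_from_major(55))  # the version named in the error message
print(java_release_from_major(52))  # the version JDK 1.8 produces
```

So 55 means Java 11, while the JDK 1.8 used in the fix corresponds to 52.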

I use a remote interpreter, so I first logged in to the server to check how Java is installed there.

Run

java -verbose

and look at the last two lines of the output; the paths they print show where the JDK is installed.

Then set that path in the code.

import os
# Must run before the SparkContext is created, because that is when the JVM is launched.
os.environ['JAVA_HOME'] = '/usr/lib/java/jdk1.8.0_212'
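A slightly more defensive variant is sketched below. It assumes a Linux-style JDK layout with the launcher at bin/java, and the helper name set_java_home is hypothetical. It refuses to set JAVA_HOME when the path does not look like a JDK, which makes a mistyped path fail early instead of producing another confusing Spark error:

```python
import os


def set_java_home(jdk_path: str) -> None:
    """Set JAVA_HOME for processes started later from this interpreter.

    Assumes a Linux-style JDK layout where the launcher is bin/java.
    """
    java_bin = os.path.join(jdk_path, "bin", "java")
    if not os.path.isfile(java_bin):
        raise FileNotFoundError(f"no java executable at {java_bin}")
    os.environ["JAVA_HOME"] = jdk_path
```

In my case this would be called as set_java_home('/usr/lib/java/jdk1.8.0_212') at the top of the script, before the SparkContext is created.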

 
