python – PySpark使用dict创建新列

使用Spark 1.6,我有一个Spark DataFrame列(名为let,比如col1),其值为A,B,C,DS,DNS,E,F,G和H,我想用值创建一个新列(比如col2)从下面的词典中,我该如何映射? (所以f.i.’A’需要映射到’S’等……)

dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

解决方法:

使用UDF的低效解决方案(版本无关):

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def translate(mapping):
    def translate_(col):
        return mapping.get(col)
    return udf(translate_, StringType())

df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])
mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

df.withColumn("value", translate(mapping)("key"))

结果:

+-------+-----+
|    key|value|
+-------+-----+
|     DS|    S|
|      G|   NS|
|INVALID| null|
+-------+-----+

更高效(仅限Spark 2.0)是创建MapType文字:

from pyspark.sql.functions import col, create_map, lit
from itertools import chain

mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("value", mapping_expr.getItem(col("key")))

结果相同:

+-------+-----+
|    key|value|
+-------+-----+
|     DS|    S|
|      G|   NS|
|INVALID| null|
+-------+-----+

但更有效的执行计划:

== Physical Plan ==
*Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
+- Scan ExistingRDD[key#15]

与UDF版本相比:

== Physical Plan ==
*Project [key#15, pythonUDF0#61 AS value#57]
+- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
   +- Scan ExistingRDD[key#15]
上一篇:python – 遇到丢失的功能时,Apache Spark会抛出NullPointerException


下一篇:python – PySpark.将Dataframe传递给pandas_udf并返回一个系列