This document contains examples of creating a user-defined function (UDF) in Python and registering it for use in Spark SQL.
What is a PySpark UDF?
A PySpark UDF is a user-defined function that executes in the Python runtime.
There are two types:

Row UDF:
    lambda x: x + 1
    lambda date1, date2: (date1 - date2).days

Group UDF:
    lambda values: np.mean(np.array(values))
A row UDF is similar to using Spark's map operator and passing it a function: the function is applied to one row at a time. A group UDF is like a row UDF, except it is applied to a whole list of values at once, so it is analogous to a groupBy followed by a map in Scala. A sketch contrasting the two styles follows.
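As a rough illustration, here is a minimal sketch of the two styles expressed with plain RDD operations (this assumes a live SparkContext named sc, as in a notebook environment; the variable names are hypothetical):

import numpy as np

# Row style: the function is applied to one value at a time, like map
rdd = sc.parallelize([1, 2, 3])
rdd.map(lambda x: x + 1).collect()  # [2, 3, 4]

# Group style: the function is applied to a list of values per key,
# like groupBy followed by map
pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 2.0)])
pairs.groupByKey().mapValues(lambda values: float(np.mean(list(values)))).collect()
# e.g. [('a', 2.0), ('b', 2.0)]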
Define a function, register it as a UDF, and set the UDF's return type.
from pyspark.sql.types import LongType

def squared(s):
    return s * s

sqlContext.udf.register("squaredWithPython", squared, LongType())
Call the UDF in Spark SQL
sqlContext.range(1, 20).registerTempTable("test")
result = sqlContext.sql("select id, squaredWithPython(id) as id_squared from test")
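The query returns a DataFrame, so the computed column can be previewed directly; the output shown below is illustrative:

result.show(3)
# +---+----------+
# | id|id_squared|
# +---+----------+
# |  1|         1|
# |  2|         4|
# |  3|         9|
# +---+----------+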
Use the UDF with DataFrames
%sql select * FROM test

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
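A minimal sketch of applying the same function through the DataFrame API rather than SQL, assuming the squared function defined above (squared_udf and df are illustrative names):

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Wrap the Python function as a DataFrame-API UDF
squared_udf = udf(squared, LongType())

# Apply it as a column expression on the registered temp table
df = sqlContext.table("test")
df.select("id", squared_udf("id").alias("id_squared")).show()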