Spark-UDF

User Defined Functions - Python

This document contains an examples of creating a UDF in Python and registering it for use in Spark SQL.

What is Pyspark UDF
  • Pyspark UDF is a user defined function executed in Python runtime.
  • Two types
    1. Row UDF
      • lambda x: x+1
      • lambda data1, data2: (data1 - data2).year
  1. Group UDF :
    • lambda values: np.mean(np.array(values))

So the row UDF, it’s similar to what you do in Spark with the map operator and pressing a function.

Group UDF, while it’s similar to a row UDF, except you want to do it on a list of values. So it’s similar to do a groupBy followed by a map operator in Scala.

Define a function , register the function as a UDF and set the return type of your UDF.
1
2
3
def (s):
return s * s
sqlContext.udf.register("squaredWithPython", squared, LongType())
Call the UDF in Spark SQL
1
2
sqlContext.range(1, 20).registerTempTable("test")
result = spark.sql("select id, squaredWithPython(id) as id_squared from test")
Use UDF with DataFrames
1
2
3
4
5
6
7
8
9
10
11
12
13
14
%sql select * FROM test
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession
.builder
.appName("Python Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()


squared_udf = udf(squared, LongType())
df = sqlContext.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))

User-Defined Aggregate Functions - Scala

1
2
3
4
5
6
7
大专栏
上一篇:sqarkSQL中的UDF用户自定义函数理解及应用


下一篇:05hive函数