Hive系列(四)函数

Hive函数

一、常用内置函数

数学运算函数

字符串函数

时间函数

条件控制函数

集和函数

分组聚合函数

表生成函数

JSON解析函数

窗口分析函数

二、自定义函数

为什么要自定义函数?
有时候 hive 自带的函数不能满足当前需要,需要自定义函数来解决问题

UDF、UDAF、UDTF比较

  • UDF 操作作用于单个数据行,并且产生一个数据行作为输出。大多数函数都属于这一类(比如数学函数和字符串函数)。返回对应值,一对一。
  • UDAF 接受多个输入数据行,并产生一个输出数据行。像 COUNT 和 MAX 这样的函数就是聚集函数。返回聚合值,多对一。
  • UDTF 操作作用于单个数据行,并且产生多个数据行,一个表作为输出。lateral view explore(),返回拆分值,一对多。

自定义UDF

创建临时函数

编写UDF代码

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hive-udf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.7.3</hadoop.version>
        <hive.version>2.3.6</hive.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.3.6</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>hive-udf</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <!-- 添加此项后,可直接使用mvn package | mvn install -->
                <!-- 不添加此项,需直接使用mvn package assembly:single -->
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

自定义标准函数需要继承实现抽象类org.apache.hadoop.hive.ql.udf.generic.GenericUDF
BASE64加密UDF

package udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import sun.misc.BASE64Encoder;

import java.io.UnsupportedEncodingException;

/**
 * @Author: H.w
 * @Date: 2021/1/7 下午3:45
 * @Description: Base64加密UDF
 **/
public class Base64Encrypt extends UDF {

    public String evaluate(String message) throws Exception {
        //判断传进来的参数是否为空
        if(StringUtils.isBlank(message)){
            return "";
        }
        //base64 加密
        byte[] bt = null;
        String newMsg = null;
        try {
            bt = message.getBytes("utf-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        if( bt != null){
            newMsg = new BASE64Encoder().encode(bt);
        }
        if(newMsg.contains("\r\n")){
            newMsg = newMsg.replace("\r\n","");
        }else if(newMsg.contains("\r")){
            newMsg = newMsg.replace("\r","");}else if(newMsg.contains("\n")){
            newMsg = newMsg.replace("\n","");
        }
        return newMsg;
    }
}

BASE64解密UDF

package udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import sun.misc.BASE64Decoder;

/**
 * @Author: H.w
 * @Date: 2021/1/7 下午3:56
 * @Description: Base64解压UDF
 **/
public class Base64Decrypt extends UDF {
    public String evaluate(String msg) throws Exception {
        //判断传进来的参数是否为空
        if (StringUtils.isBlank(msg)) {
            return "";
        }
        //base64 解密
        byte[] bt = null;
        String result = null;
        if (msg != null) {
            BASE64Decoder decoder = new BASE64Decoder();
            try {
                bt = decoder.decodeBuffer(msg);
                result = new String(bt, "utf-8");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return result;
    }
}

打包
将jar添加到hive服务器

hive> add jar /home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar;
Added [/home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [/home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar]

创建临时函数

hive> create temporary function base_en as 'udf.Base64Encrypt';
OK
Time taken: 0.281 seconds
hive> create temporary function base_de as 'udf.Base64Decrypt';
OK
Time taken: 0.004 seconds

测试

hive> select base_en('hello');
OK
aGVsbG8=
Time taken: 0.462 seconds, Fetched: 1 row(s)
hive> select base_de('aGVsbG8=');
OK
hello
Time taken: 0.078 seconds, Fetched: 1 row(s)

注意:这种只是创建临时函数,重启Hive就会时效

创建永久函数

将jar上传到指定位置

[root@localhost lib]# hdfs dfs -mkdir -p /hive/lib;
[root@localhost lib]# hdfs dfs -put /home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar /hive/lib

创建永久生效的UDF函数

hive> create function base64en as 'udf.Base64Encrypt' using jar 'hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar';
Added [/opt/hive/tmp/217fe48a-a64c-42d4-8696-e07c57e6ee13_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
Time taken: 0.371 seconds
hive> create function base64de as 'udf.Base64Decrypt' using jar 'hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar';
Added [/opt/hive/tmp/689060e4-4c5e-4126-a685-c9d008ea75cb_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
Time taken: 2.665 seconds

检测:

hive> select base64en('hello');
Added [/opt/hive/tmp/580da8d8-3cfa-4b2f-a862-03c37fc66ff4_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
aGVsbG8=
Time taken: 2.86 seconds, Fetched: 1 row(s)
hive> select base64de('aGVsbG8=');
Added [/opt/hive/tmp/580da8d8-3cfa-4b2f-a862-03c37fc66ff4_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
hello
Time taken: 0.141 seconds, Fetched: 1 row(s)

检查MySQL中元数据

mysql> SELECT * FROM FUNCS;
+---------+-------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME        | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+-------------------+-------------+-------+-----------+-----------+------------+------------+
|       1 | udf.Base64Encrypt |  1610011833 |     1 | base64en  |         1 | NULL       | USER       |
|       6 | udf.Base64Decrypt |  1610011914 |     1 | base64de  |         1 | NULL       | USER       |
+---------+-------------------+-------------+-------+-----------+-----------+------------+------------+
2 rows in set (0.00 sec)

可以看到函数的信息已经注册到元数据中了。

上一篇:mysql的UDF提权


下一篇:[源码分析]从"UDF不应有状态" 切入来剖析Flink SQL代码生成 (修订版)