Functions

1. Hive function categories
   Standard functions: take one row of data and return one result.
     User-defined standard functions are called UDFs.
   Aggregate functions: take multiple rows and return one result (used with GROUP BY, e.g. sum, count).
     User-defined aggregate functions are called UDAFs.
   Table-generating functions: take one row and return multiple rows, e.g. explode.
     User-defined table-generating functions are called UDTFs.

2. Function documentation
   SHOW FUNCTIONS;                  -- list all functions in the current environment
   DESC FUNCTION length;            -- show a function's help text
   DESC FUNCTION EXTENDED length;   -- show extended information about a function

3. Math functions
   round  -- round to the nearest integer (round half up)
   rand   -- random number
   conv(num, from_base, to_base)  -- base conversion: num is the value, from_base the source base, to_base the target base
   cast   -- type conversion
   select cast(id as STRING) from student;   -- convert the INT column to STRING

4. Date functions
   from_unixtime  -- convert a Unix timestamp to a string
   select from_unixtime(1567995900, 'yyyyMMdd HHmmss');
   unix_timestamp -- convert a string to a Unix timestamp
   select unix_timestamp('2011-12-10 13:10:12');

5. Conditional functions
   IF(expr1, expr2, expr3)  !!!
   If expr1 is true, return expr2; otherwise return expr3.
   SELECT name, age, if(age > 20, 'A', 'B') as type FROM student;

   coalesce  -- return the first non-NULL argument
   SELECT coalesce(null, 12, 3, 14, null);

   CASE  !!!  -- bucket rows into groups
   SELECT id,
          CASE WHEN id <= 2 THEN 'a'
               WHEN id > 2 AND id <= 10 THEN 'b'
               ELSE 'c'
          END
   FROM student;

   SELECT mycol, sum(id)
   FROM (SELECT id,
                (CASE WHEN id <= 2 THEN 'a'
                      WHEN id > 2 AND id <= 10 THEN 'b'
                      ELSE 'c'
                 END) as mycol
         FROM student) t
   GROUP BY mycol;

   -- simple CASE form: compares one expression against fixed values
   SELECT CASE id WHEN 2 THEN 'a' ELSE 'b' END from student;

6. String functions
   Concatenation
   concat     -- concatenates values of any type; no separator can be specified
   select id, concat(name, age) from student;
   concat_ws  -- strings only; a separator can be specified
   select concat_ws('_', 'asdfasd', '18');

   instr(string str, string substr)  -- 1-based position of substr in str; 0 if not found
   select instr('candle', 'dle');

   length         -- string length
   lower / lcase  -- to lowercase
   upper / ucase  -- to uppercase

   Regular expressions
   regexp_extract(str, regexp[, idx])  -- extract by regular expression
     str     the string to process
     regexp  the pattern to match against
     idx     group index: 0 = the whole match, 1 = the first capture group, 2 = the second, and so on
   select regexp_extract('x=asdfasdf12345asdf', 'x=([a-z]+)([0-9]+)([a-z]+)', 3);

   regexp_replace  -- replace by regular expression
   select regexp_replace('x=asdfasdf12345asdf3456345', '([0-9]+)', 'xxxx');

   URL
   parse_url  -- dedicated URL parser
   Example URL: http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere
   select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'REF');
   Supported parts: HOST (domain), PATH (path), QUERY (query string), REF (anchor/fragment), PROTOCOL (protocol), FILE, AUTHORITY (host + port), USERINFO

   Output:
   hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'HOST');
   OK
   _c0
   www.candle.com
   Time taken: 0.07 seconds, Fetched: 1 row(s)
   hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'PATH');
   OK
   _c0
   /dir1/dir2/dir3/file.html
   Time taken: 0.093 seconds, Fetched: 1 row(s)
   hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'QUERY');
   OK
   _c0
   key1=value1&key2=value2
   Time taken: 0.051 seconds, Fetched: 1 row(s)
   hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'QUERY', 'key1');
   OK
   _c0
   value1
   Time taken: 0.105 seconds, Fetched: 1 row(s)
   hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'REF');
   OK
   _c0
   imhere

   JSON
   get_json_object  -- extract a value from a JSON string by path
   Sample document (pretty-printed):
   {
     "store": {
       "fruit": [{"weight":8,"type":"apple"}, {"weight":9,"type":"pear"}],
       "bicycle": {"price":19.95, "color":"red"}
     },
     "email": "amy@only_for_json_udf_test.net",
     "owner": "amy"
   }
   hive (hadoop)> select get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.store.bicycle.price');
   OK
   _c0
   19.95
   hive (hadoop)> select get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.store.fruit[0].type');
   OK
   _c0
   apple

7. Wide/long table conversion  !!!
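The parsing semantics above can be experimented with outside Hive. As a rough illustration (not Hive's implementation), Python's standard library mirrors parse_url, regexp_extract, and get_json_object closely; the string literals below are taken from the examples above.

```python
import re
import json
from urllib.parse import urlparse, parse_qs

url = 'http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere'

# parse_url(url, 'HOST' / 'PATH' / 'QUERY' / 'REF') equivalents
parts = urlparse(url)
print(parts.hostname)   # www.candle.com
print(parts.path)       # /dir1/dir2/dir3/file.html
print(parts.query)      # key1=value1&key2=value2
print(parts.fragment)   # imhere

# parse_url(url, 'QUERY', 'key1') equivalent
print(parse_qs(parts.query)['key1'][0])   # value1

# regexp_extract(str, pattern, idx): idx selects the capture group
m = re.search(r'x=([a-z]+)([0-9]+)([a-z]+)', 'x=asdfasdf12345asdf')
print(m.group(3))       # asdf

# get_json_object(json_str, '$.store.bicycle.price') equivalent
doc = json.loads('{"store":{"bicycle":{"price":19.95,"color":"red"}},"owner":"amy"}')
print(doc['store']['bicycle']['price'])   # 19.95
```

This is only for building intuition about the extraction rules; in Hive the same lookups run per row over table columns.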
explode  -- table-generating function
hive (hadoop)> select explode(city) from student1;

Error case:
hive (hadoop)> select id, name, explode(city) from student1;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

Arrays
In Hive, a table-generating function cannot be selected directly alongside ordinary columns.
A lateral view has to be generated first and joined back to the row:
select id, name, mycity from student1 lateral view explode(city) tmp_view AS mycity;
Pattern: SELECT id, name, mycity FROM student1 LATERAL VIEW explode(city) <view name> AS <column name>

Maps (key-value pairs)
explode on a map generates two columns, key and value:
> select id, name, mykey, myvalue from student3 LATERAL VIEW explode(pairs) tmp_view AS mykey, myvalue;
OK
id   name     mykey   myvalue
1    candle   k1      v1
1    candle   k2      v2
2    jack     k1      v1
2    jack     k2      v2
3    tom      k1      v1

Structs
hive (hadoop)> select explode(addr) from student2;
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
Note: struct members cannot be turned into rows directly; explode only accepts an array or a map.
Turning each struct member into its own row is rarely meaningful anyway.
Structs are usually handled like this:
hive (hadoop)> select id, name, addr.city, addr.area, addr.streetid from student2;
OK
id   name     city       area      streetid
1    candle   shanghai   minhang   35
2    jack     beijing    haidian   100
3    tom      hefei      shushan   103
The following form is generally not used:
hive (hadoop)> select id, name, info from student2 lateral view explode(array(addr.city, addr.area, addr.streetid)) tmp_view as info;
OK
id   name     info
1    candle   shanghai
1    candle   minhang
1    candle   35
2    jack     beijing
2    jack     haidian
2    jack     100
3    tom      hefei
3    tom      shushan
3    tom      103

Long to wide: aggregate functions with GROUP BY
collect_list  -- merge values from multiple rows into one array; duplicates are kept
select age, collect_list(name), collect_list(id) from student group by age;
age    _c1                   _c2
18     ["candle","candle"]   [1,1]
19     ["aa","jack"]         [10,2]
20     ["c2","tom"]          [1,4]
100    ["cc"]                [12]
200    ["dd"]                [13]
collect_set  -- same, but duplicates are removed
select age, collect_set(name), collect_set(id) from student group by age;
age    _c1             _c2
18     ["candle"]      [1]
19     ["aa","jack"]   [10,2]
20     ["c2","tom"]    [1,4]
100    ["cc"]          [12]
200    ["dd"]          [13]

select map(id, name) from student;           -- build a map
select struct(id, name, age) from student;   -- build a struct

8. Window functions  !!!
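Both directions of the wide/long conversion can be sketched outside Hive. The snippet below (with made-up sample rows, not the tables above) mirrors LATERAL VIEW explode on an array column and collect_list / collect_set under GROUP BY.

```python
from collections import defaultdict

# Rows shaped like student1(id, name, city ARRAY<STRING>) -- illustrative data only
rows = [
    (1, 'candle', ['shanghai', 'beijing']),
    (2, 'jack',   ['hefei']),
]

# Wide -> long: LATERAL VIEW explode(city) tmp_view AS mycity
# Each array element becomes its own row, joined back to id and name.
long_rows = [(id_, name, city) for id_, name, cities in rows for city in cities]
print(long_rows)
# [(1, 'candle', 'shanghai'), (1, 'candle', 'beijing'), (2, 'jack', 'hefei')]

# Long -> wide: SELECT age, collect_list(name) FROM student GROUP BY age
students = [(1, 'candle', 18), (1, 'candle', 18), (10, 'aa', 19), (2, 'jack', 19)]
by_age = defaultdict(list)
for id_, name, age in students:
    by_age[age].append(name)          # collect_list keeps duplicates
print(dict(by_age))                   # {18: ['candle', 'candle'], 19: ['aa', 'jack']}

# collect_set removes duplicates (element order is not guaranteed in Hive)
sets_by_age = {age: sorted(set(names)) for age, names in by_age.items()}
print(sets_by_age)                    # {18: ['candle'], 19: ['aa', 'jack']}
```

The key point the error messages above make is that explode changes the row count, which is why Hive forces it through a lateral view before it can sit next to ordinary columns.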
Ranking within groups
1) Create the table
CREATE TABLE windows (
  id INT,
  class STRING,
  score INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

2) row_number
partition by  -- the grouping column
order by      -- the sort column
SELECT class, score, row_number() over(partition by class order by score desc) from windows;
Every row gets a distinct, consecutive number; tied scores do not change the numbering:
class    score   paiming
class1   78      1
class1   78      2
class1   66      3
class1   66      4
class1   66      5
class1   56      6
class1   56      7
class1   56      8
class1   56      9
class1   56      10

3) rank
Tied scores share a rank, and the next rank skips past the tied block:
SELECT class, score, rank() over(partition by class order by score desc) from windows;
class    score   _wcol0
class1   78      1
class1   78      1
class1   66      3
class1   66      3
class1   66      3
class1   56      6
class1   56      6
class1   56      6
class1   56      6
class1   56      6

4) dense_rank
Tied scores share a rank, and the next rank does not skip:
SELECT class, score, dense_rank() over(partition by class order by score desc) from windows;
class    score   _wcol0
class1   78      1
class1   78      1
class1   66      2
class1   66      2
class1   66      2
class1   56      3
class1   56      3
class1   56      3
class1   56      3
class1   56      3

5) Per-group top-N
The window column cannot be filtered directly in the same query:
hive (hadoop)> SELECT class, score, row_number() over(partition by class order by score desc) as paiming from windows WHERE paiming <= 5;
FAILED: SemanticException [Error 10004]: Line 1:110 Invalid table alias or column reference 'paiming': (possible column names are: id, class, score)
Use a sub-query instead:
SELECT class, score
FROM (SELECT class, score,
             row_number() over(partition by class order by score desc) as paiming
      FROM windows) t
WHERE paiming <= 5;
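The difference between the three ranking functions can be checked by hand. This sketch reproduces the class1 partition from the outputs above (scores already sorted descending, as ORDER BY score DESC would produce) and computes each numbering the way the functions are defined.

```python
# One partition's scores, sorted descending -- taken from the class1 example above
scores = [78, 78, 66, 66, 66, 56, 56, 56, 56, 56]

# row_number(): distinct consecutive numbers, ties notwithstanding
row_number = list(range(1, len(scores) + 1))

# rank(): 1 + the count of strictly greater values, so ties share a rank
# and the next rank skips the tied block
rank = [1 + sum(1 for other in scores if other > s) for s in scores]
print(rank)         # [1, 1, 3, 3, 3, 6, 6, 6, 6, 6]

# dense_rank(): position among *distinct* values, so no gaps after ties
distinct_desc = sorted(set(scores), reverse=True)
dense_rank = [distinct_desc.index(s) + 1 for s in scores]
print(dense_rank)   # [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]

# Top-N per group: filter on the computed row_number,
# mirroring the sub-query form Hive requires
top5 = [s for s, rn in zip(scores, row_number) if rn <= 5]
print(top5)         # [78, 78, 66, 66, 66]
```

The filter on `row_number` at the end is the Python analogue of the sub-query: the ranking must be materialized first, then filtered, which is exactly why the direct WHERE on the window alias fails in Hive.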