Hive Query Functions

unadlib, 2022-03-13 (82 views)

0. Looking up functions

1. List all built-in functions

show functions;

2. Find functions related to dates

show functions like "*date*";

3. View a function's description

desc function 'current_date';

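I. Substituting a default for NULL fields (NVL)

NVL assigns a value to data that is NULL. Its form is NVL(value, default_value): if value is NULL, NVL returns default_value; otherwise it returns value. If both arguments are NULL, it returns NULL.

Replace NULL values in the comm column with -1:
select *, nvl(comm, -1) from emp;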
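
The rule NVL follows can be modelled in a few lines of Python. This is only a sketch of the semantics described above, not Hive code; `None` stands in for SQL NULL and the `comm_column` values are made up for illustration.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def nvl(value: Optional[T], default_value: Optional[T]) -> Optional[T]:
    """Return default_value when value is None (NULL), otherwise value."""
    return default_value if value is None else value


# Hypothetical comm values; None plays the role of NULL.
comm_column = [300.0, None, 0.0, None]
filled = [nvl(c, -1) for c in comm_column]
print(filled)
```

Note that `0.0` is kept as is: only NULL triggers the default, not a zero or an empty value.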

II. CASE WHEN: similar to switch/case

Create the table and load the data

create table emp_sex(
name string,
dept_id string,
sex string)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;

1. First count how many people are in each department

select dept_id,
count(*) total
from emp_sex
group by dept_id;

 

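2. Count the men and women in each department

select dept_id,
count(*) total,
sum(case sex when '男' then 1 else 0 end) male,   -- '男' is the value stored for male
sum(case sex when '女' then 1 else 0 end) female  -- '女' is the value stored for female
from emp_sex
group by dept_id;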
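
The `sum(case ... then 1 else 0 end)` pattern is a conditional count: each row contributes 1 only when it matches. A minimal Python sketch of the same idea follows; the rows are hypothetical, since the contents of emp_sex.txt are not shown in this post.

```python
from collections import defaultdict

# Hypothetical rows of (name, dept_id, sex); not the real emp_sex.txt.
rows = [
    ("a", "A", "男"),
    ("b", "A", "男"),
    ("c", "B", "男"),
    ("d", "A", "女"),
    ("e", "B", "女"),
    ("f", "B", "女"),
]


def count_by_dept(rows):
    # dept_id -> [total, male, female]
    counts = defaultdict(lambda: [0, 0, 0])
    for _name, dept_id, sex in rows:
        c = counts[dept_id]
        c[0] += 1                          # count(*)
        c[1] += 1 if sex == "男" else 0    # sum(case sex when '男' then 1 else 0 end)
        c[2] += 1 if sex == "女" else 0    # sum(case sex when '女' then 1 else 0 end)
    return {dept: tuple(c) for dept, c in counts.items()}


result = count_by_dept(rows)
print(result)
```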

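III. Row to column

Functions used:

CONCAT(string A/col, string B/col, ...): returns the concatenation of its input strings; it accepts any number of input strings.

CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments; it can be any string, just like the remaining arguments. If the separator is NULL, the result is NULL. The function skips any NULL values after the separator argument, but it does not skip empty strings.

COLLECT_SET(col): accepts only primitive data types. It deduplicates the values of a column and aggregates them into an array-typed field.

COLLECT_LIST(col): the same aggregation without deduplication.

Requirement: group together the people who share the same constellation and blood type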

射手座,A            大海|凤姐

白羊座,A            孙悟空|猪八戒

白羊座,B            宋宋|苍老师

Create the table and load the data

create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/datas/constellation.txt" into table person_info;

1. First count how many people share each constellation and blood type, using count()

select constellation,blood_type,count(*)
from person_info
group by constellation,blood_type;

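2. Concatenate constellation and blood type, and gather the names into one column, using concat and collect_list

select concat(constellation,",",blood_type) CB,  -- concatenate the two string columns into one
collect_list(name) -- gather the values of a column into an array
from person_info
group by constellation,blood_type;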

3. Join the names with the separator "|", using concat_ws

select concat(constellation,",",blood_type) CB,
concat_ws("|",collect_list(name)) -- takes an array as input and returns a string joined with "|"
from person_info
group by constellation,blood_type;
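
To make the three steps concrete, here is a Python sketch of what the final query computes. It is a model of the semantics, not Hive code. The input rows are an assumption chosen to match the expected output listed in the requirement, because constellation.txt itself is not shown. Hive does not promise the order of names inside collect_list; this sketch simply keeps input order.

```python
from collections import defaultdict


def concat_ws(sep, values):
    """Model of CONCAT_WS on an array: NULL separator gives NULL, NULL elements are skipped."""
    if sep is None:
        return None
    return sep.join(v for v in values if v is not None)


# Assumed rows of (name, constellation, blood_type).
person_info = [
    ("孙悟空", "白羊座", "A"),
    ("大海", "射手座", "A"),
    ("宋宋", "白羊座", "B"),
    ("猪八戒", "白羊座", "A"),
    ("凤姐", "射手座", "A"),
    ("苍老师", "白羊座", "B"),
]

# group by constellation, blood_type; the list plays the role of collect_list(name)
groups = defaultdict(list)
for name, constellation, blood_type in person_info:
    groups[(constellation, blood_type)].append(name)

# concat(constellation, ",", blood_type) and concat_ws("|", collect_list(name))
result = {c + "," + b: concat_ws("|", names) for (c, b), names in groups.items()}
for key, names in result.items():
    print(key, names)
```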

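IV. Column to row

Functions used:

EXPLODE(col): splits a complex array or map stored in one Hive column into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used together with a UDTF such as explode, usually applied to the array that split returns. It expands one column of data into multiple rows, and the expanded rows can then be aggregated.

1. Create the table and load the data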

create table movie_info
(
movie string,
category string
)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

select * from movie_info;

2. Use split() to turn the category column from a string into an array, splitting on ","

select split(category,",") from movie_info;

 

3. Then use explode to expand the array into rows

select explode(split(category,",")) from movie_info;

 

4. Use a lateral view to turn the exploded data into a one-column virtual table named tbl, with column name cate, and join it back to each source row

select m.movie,tbl.cate
from movie_info m
lateral view
explode(split(category,",")) tbl as cate;
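
The effect of split, explode and lateral view together can be sketched in Python as a nested loop: every source row is paired with each element of its own array. The movie titles and categories below are hypothetical placeholders, since movie.txt is not shown.

```python
# Hypothetical rows of (movie, category).
movie_info = [
    ("Movie A", "suspense,action,sci-fi"),
    ("Movie B", "suspense,crime"),
]


def lateral_view_explode(rows):
    for movie, category in rows:
        # split(category, ",") yields an array; explode emits one row per element;
        # lateral view keeps the originating movie next to each element.
        for cate in category.split(","):
            yield movie, cate


exploded = list(lateral_view_explode(movie_info))
for row in exploded:
    print(row)
```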

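V. Combining row-to-column and column-to-row

1. Group by the cate column, then aggregate and concatenate the movie column.

-- with a subquery: explode first, then group
select cate, collect_list(movie)
from (select m.movie, tbl.cate
from movie_info m
lateral view
explode(split(category, ",")) tbl as cate) t1
group by cate;

-- without a subquery, joining the movie names into one string
select tbl.cate,
concat_ws(",",collect_list(movie))
from movie_info m
lateral view
explode(split(category, ",")) tbl as cate
group by cate;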
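
Put together, the query inverts the mapping: from "movie to categories" to "category to movies". A rough Python equivalent, again with hypothetical movie data, looks like this:

```python
from collections import defaultdict

# Hypothetical rows of (movie, category).
movie_info = [
    ("Movie A", "suspense,action,sci-fi"),
    ("Movie B", "suspense,crime"),
]

by_cate = defaultdict(list)
for movie, category in movie_info:
    for cate in category.split(","):      # lateral view explode(split(...))
        by_cate[cate].append(movie)       # group by cate + collect_list(movie)

# concat_ws(",", collect_list(movie))
result = {cate: ",".join(movies) for cate, movies in by_cate.items()}
print(result)
```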

 

 

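VI. Window functions

OVER(): specifies the data window that the analytic function operates on; the window can change from row to row.

CURRENT ROW: the current row

n PRECEDING: the n rows before the current row

n FOLLOWING: the n rows after the current row

UNBOUNDED: the boundary of the partition. UNBOUNDED PRECEDING means starting from the first row; UNBOUNDED FOLLOWING means ending at the last row

LAG(col, n, default_val): the value of col from the nth row before the current row

LEAD(col, n, default_val): the value of col from the nth row after the current row

NTILE(n): distributes the rows of an ordered window into n groups. The groups are numbered starting from 1, and for each row NTILE returns the number of the group the row belongs to. Note: n must be of type int.

1. Create the table and load the data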

create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;

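2. Requirements

Query the customers who made a purchase in April 2017, together with the total number of such customers. Because group by runs before over(), count(*) over () counts the groups (one per name) rather than the individual orders.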

select name, 
count(*) over ()
from business
where substring(orderdate, 1, 7) = '2017-04'
group by name;
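
The interplay of where, group by and over() in this query can be traced step by step in Python. This is a sketch of the logical evaluation order only; the rows and customer names are hypothetical, since business.txt is not shown.

```python
# Hypothetical rows of (name, orderdate, cost).
business = [
    ("jack", "2017-04-06", 42),
    ("mart", "2017-04-08", 62),
    ("mart", "2017-04-09", 68),
    ("jack", "2017-01-01", 10),
]

# where substring(orderdate, 1, 7) = '2017-04'
filtered = [r for r in business if r[1][:7] == "2017-04"]

# group by name: one output row per distinct name (first-seen order kept here)
names = list(dict.fromkeys(r[0] for r in filtered))

# count(*) over (): the window is the whole grouped result, so it counts groups
total = len(names)

result = [(name, total) for name in names]
print(result)
```

Three orders fall in April 2017, but the count is 2 because the window sees the grouped rows, not the raw orders.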

 

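Query each customer's purchase details together with the monthly purchase total

-- over(partition by substring(orderdate,1,7))  -- partition by the first 7 characters of orderdate (yyyy-MM) and sum cost over the rows that share them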

select
name,cost,orderdate,
sum(cost) over(partition by substring(orderdate,1,7))
from
business;

 

 

Building on the query above, also accumulate each customer's cost in date order

select
name,cost,orderdate,
-- partition by the first 7 characters of orderdate
sum(cost) over(partition by substring(orderdate,1,7)) mc,
-- partition by name, then order by orderdate
sum(cost) over(partition by name order by orderdate asc
-- lc sums from the first row of the partition to the current row
rows between unbounded preceding and current row ) lc
from
business;

 

 

Execution order of a SQL statement

from
where
group by
having
select
over()
order by
limit

Several window functions evaluated in one query

select name,orderdate,cost,
sum(cost) over() as sample1, -- sum over all rows
sum(cost) over(partition by name) as sample2, -- partition by name and sum within each partition
sum(cost) over(partition by name order by orderdate) as sample3, -- partition by name and accumulate within each partition
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4, -- same as sample3 when orderdate has no ties within a name: aggregate from the start to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- the current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6, -- the current row plus the previous and the next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- the current row plus all following rows
from business;
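
The ROWS frames used in sample4 through sample7 can be modelled with a small Python helper. This is a sketch under the assumption that the rows of one partition are already sorted by orderdate; `frame_sum` and the cost values are illustrative names and data, not part of Hive.

```python
from typing import List, Optional


def frame_sum(values: List[int], preceding: Optional[int], following: Optional[int]) -> List[int]:
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED; 0 means CURRENT ROW.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(values[start:end + 1]))
    return out


# Hypothetical costs of one customer, sorted by orderdate.
costs = [10, 15, 29, 46]

running_total = frame_sum(costs, None, 0)     # sample4: unbounded preceding .. current row
prev_and_current = frame_sum(costs, 1, 0)     # sample5: 1 preceding .. current row
neighbours = frame_sum(costs, 1, 1)           # sample6: 1 preceding .. 1 following
to_the_end = frame_sum(costs, 0, None)        # sample7: current row .. unbounded following

print(running_total, prev_and_current, neighbours, to_the_end)
```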

Query the order details together with the customers who made purchases in each month

select
name,cost,orderdate,
concat_ws(",",collect_set(name) over(partition by substring(orderdate,1,7)))
from
business;

 

 

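Query each customer's previous purchase date, using lag() over() on an ordered window

select name,
orderdate,
cost,
-- lag returns the value of a column from n rows back: 1 means the previous row, and "1970-01" is shown when there is no previous row
-- if the default is omitted, NULL is returned
lag(orderdate, 1, "1970-01") over (partition by name order by orderdate) last_order
from business;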

Query each customer's next purchase date, using lead() over()

select name,
orderdate,
cost,
lag(orderdate, 1, "1970-01") over (partition by name order by orderdate) last_order,
lead(orderdate, 1, "1970-01") over (partition by name order by orderdate) next_order
from business;
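
LAG and LEAD simply look a fixed number of rows backwards or forwards inside the ordered partition and fall back to the default at the edges. A Python sketch of that behaviour, assuming n is non-negative and using hypothetical order dates for a single customer:

```python
def lag(values, n=1, default=None):
    """Value from n rows before each row, or default when there is no such row."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value from n rows after each row, or default when there is no such row."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]


# Hypothetical order dates of one customer, sorted ascending.
dates = ["2017-01-01", "2017-01-05", "2017-04-06"]

last_order = lag(dates, 1, "1970-01")
next_order = lead(dates, 1, "1970-01")
print(last_order)
print(next_order)
```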

Query the earliest 20% of orders by date, using NTILE() on an ordered window

1. First split the data into groups

select
name,cost,orderdate,
-- use ntile to split the rows into 5 groups
ntile(5) over(order by orderdate)
from
business;

 

2. Take the first group

select *
from (select name,
cost,
orderdate,
ntile(5) over (order by orderdate) n
from business) t1
where n = 1;
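
When the row count is not a multiple of the number of groups, the groups cannot all be the same size. The sketch below models the usual SQL convention, in which the earlier groups each receive one extra row; treat that detail as an assumption to verify against your Hive version. The function name `ntile` here is just a Python helper.

```python
def ntile(num_rows: int, n: int):
    """Group number (1-based) for each of num_rows ordered rows split into n groups."""
    base, extra = divmod(num_rows, n)
    groups = []
    for group in range(1, n + 1):
        # The first `extra` groups take one additional row.
        size = base + (1 if group <= extra else 0)
        groups.extend([group] * size)
    return groups


assignment = ntile(7, 5)
print(assignment)
first_group_rows = [i for i, g in enumerate(assignment) if g == 1]
print(first_group_rows)
```

With 7 rows and 5 groups, "where n = 1" keeps 2 rows, which is about 29% of the data rather than exactly 20%.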

 

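Query the earliest 7% of orders by date, using percent_rank() on an ordered window

1. Use percent_rank to assign each row its relative position as a fraction between 0 and 1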

select name,
cost,
orderdate,
percent_rank() over (order by orderdate)
from business;

2. Take the earliest 7% of the data

select *
from (select name,
cost,
orderdate,
percent_rank() over (order by orderdate) n
from business) t1
where n < 0.08;
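
percent_rank is defined as (rank - 1) / (number of rows - 1), so the first row always gets 0 and the last distinct value gets 1. A Python sketch of that formula on hypothetical, already sorted order dates:

```python
def percent_rank(sorted_keys):
    """(rank - 1) / (rows - 1) for each row of an already sorted list."""
    n = len(sorted_keys)
    if n <= 1:
        return [0.0] * n
    ranks = []
    for i, key in enumerate(sorted_keys):
        if i > 0 and key == sorted_keys[i - 1]:
            ranks.append(ranks[-1])   # tied rows share the same rank
        else:
            ranks.append(i + 1)
    return [(r - 1) / (n - 1) for r in ranks]


# Hypothetical order dates, sorted ascending, with one tie.
dates = ["2017-01-01", "2017-01-02", "2017-01-02", "2017-02-03", "2017-04-06"]
pr = percent_rank(dates)
print(pr)
```

Because the values move in steps of 1 / (rows - 1), a threshold such as 0.08 selects whole rows, so the share actually returned depends on how many rows the table has.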

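VII. Rank

RANK(): tied rows share the same rank, and the numbering skips ahead after a tie, so it stays aligned with the row count

DENSE_RANK(): tied rows share the same rank, and no ranks are skipped, so the highest rank can be smaller than the row count

ROW_NUMBER(): numbers the rows consecutively in order, ignoring ties

Compute the score ranking within each subject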

select *,
rank() over (partition by subject order by score desc ) r,
dense_rank() over (partition by subject order by score desc ) dr,
row_number() over (partition by subject order by score desc ) rn
from score;
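
The difference between the three functions only shows up when there are ties. The following Python sketch computes all three for one partition whose scores are already sorted in descending order; the scores are hypothetical.

```python
def rankings(scores_desc):
    """Return (rank, dense_rank, row_number) for scores sorted in descending order."""
    out = []
    rank = dense = 0
    for i, score in enumerate(scores_desc):
        row_number = i + 1
        if i == 0 or score != scores_desc[i - 1]:
            rank = row_number   # rank jumps to the row position after a tie
            dense += 1          # dense_rank only ever increases by one
        out.append((rank, dense, row_number))
    return out


# Hypothetical scores for one subject.
scores = [90, 85, 85, 70]
result = rankings(scores)
for score, r in zip(scores, result):
    print(score, r)
```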

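VIII. Date functions

current_date returns the current date

select current_date;

date_add and date_sub add days to or subtract days from a date

The date 90 days after today

select date_add(current_date(), 90);

The date 90 days before today

select date_sub(current_date(), 90);

datediff computes the difference between two dates (first argument minus second) and returns the number of days

select datediff(current_date(), "1970-01-01");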
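
For readers who want to check the arithmetic outside Hive, the same three operations can be sketched with Python's standard `datetime` module. The helper names mirror the Hive functions but are plain Python, and the fixed date (the date of this post) is used instead of the real current date so that the results are reproducible.

```python
from datetime import date, timedelta


def date_add(start: date, days: int) -> date:
    return start + timedelta(days=days)


def date_sub(start: date, days: int) -> date:
    return start - timedelta(days=days)


def datediff(end: date, start: date) -> int:
    """Number of days from start to end (end minus start)."""
    return (end - start).days


today = date(2022, 3, 13)  # fixed stand-in for current_date

later = date_add(today, 90)
earlier = date_sub(today, 90)
print(later, earlier, datediff(later, today))
```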