HIVE-SQL经典⾯试题
HIVE-SQL经典⾯试题
1. 需求:
我们有如下的⽤户访问数据
userId visitDate visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
U02 2017/1/23 6
U01 2017/2/22 4
要求使⽤SQL统计出每个⽤户每个⽉的累积访问次数以及总的累计次数,如下表所⽰:
⽤户id ⽉份 ⼩计 累积
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
2.数据准备
CREATE TABLE st1 (
userId string,
visitDate string,
visitCount INT)
ROW format delimited FIELDS TERMINATED BY"\t";
INSERT INTO TABLE st1
VALUES
('u01','2017/1/21',5),
('u02','2017/1/23',6),
('u03','2017/1/22',8),
('u04','2017/1/20',3),
('u01','2017/1/23',6),
('u01','2017/2/21',8),
('u02','2017/1/23',6),
('u01','2017/2/22',4);
思路分析:
1.利⽤⽇期函数与政策表达式替换,将原始数据中的⽇转换成需要的⽉份 date_formate(regexp_replace(visitdate,'\','-'),'yyyy-MM')
sql查询面试题及答案2.利⽤userid和visitdate(⽉份)分组,将每个⽉访问总数求和
3. 在上述⼦查询基础上,利⽤窗⼝函数,对uersid进⾏分组求和
over(partition by userid order by visitmonth)
最终实现
select userid,visitmonth,subTotal,
sum(subTotal)over(partition by userid order by visitmonth)as total
from
(select userid,
visitmonth,
sum(visitcount)as subTotal
from
(select userid,
date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM')as visitmonth,
visitcount
from test1)as t1
group by userid,visitmonth)as t2
结果:
2. 需求:
有50W个京东店铺,每个顾客访客访问任何⼀个店铺的任何⼀个商品时都会产⽣⼀条访问⽇志,
访问⽇志存储的表名为Visit,访客的⽤户id为user_id,被访问的店铺名称为shop,数据如下:
u1  a
u2  b
u1  b
u1  a
u3  c
u4  b
u1  a
u2  c
u5  b
u4  b
u6  c
u2  c
u1  b
u2  a
u2  a
u3  a
u5  a
u5  a
u5  a
请统计:
(1)每个店铺的UV(访客数)
(2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
数据准备:
CREATE TABLE st2 (
user_id string,
shop string )
ROW format delimited FIELDS TERMINATED BY'\t';
INSERT INTO TABLE st2 VALUES
('u1','a'),
('u2','b'),
('u1','b'),
('u1','a'),
('u3','c'),
('u4','b'),
('u1','a'),
('u2','c'),
('u5','b'),
('u4','b'),
('u6','c'),
('u2','c'),
('u1','b'),
('u2','a'),
('u2','a'),
('u3','a'),
('u5','a'),
('u5','a'),
('u5','a');
思路分析:
(1)注意访客数⽬要去重
(2)第⼆题先按⼦查询,把每个店铺每个⼈有次访问该店铺求出来,然后利⽤排名函数,根据店铺shop进⾏分组,对次数进⾏排序
(1)SELECT shop,
count(DISTINCT user_id)
FROM st2
GROUP BY shop
(2)select shop,userid,visitNum,rankOrder from
(select shop,userid,visitNum,
row_number()over(partition by shop order by visitNum desc)as rankOrder
from
(select shop,userid,count(userid)as visitNum from test2
group by shop,userid)as t1)
as t2
where rankOrder<=3
3.需求:
已知⼀个表STG.ORDER,有如下字段:Date,Order_id,User_id,amount。
数据样例:2017-01-01,10029028,1000003251,33.57。
请给出sql进⾏统计:
(1)给出 2017年每个⽉的订单数、⽤户数、总成交⾦额。
(2)给出2017年11⽉的新客数(指在11⽉才有第⼀笔订单)
数据准备:
CREATE TABLE st3 (
dt string,
order_id string,
user_id string,
amount DECIMAL(10,2))
ROW format delimited FIELDS TERMINATED BY'\t';
INSERT INTO TABLE st3 VALUES('2017-01-01','10029028','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-01-01','10029029','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-01-01','100290288','1000003252',33.57); INSERT INTO TABLE st3 VALUES('2017-02-02','10029088','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-02-02','100290281','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-02-02','100290282','1000003253',33.57); INSERT INTO TABLE st3 VALUES('2017-11-02','10290282','100003253',234); INSERT INTO TABLE st3 VALUES('2018-11-02','10290284','100003243',234);
思路分析
(1) ⾸先将⽇期数据转换成需要的格式,然后根据⽉份分组,分别求总⼈数、⾦额、订单数(2)巧妙利⽤group by 会进⾏默认排序,在having⾥⾯⽤dt进⾏筛选
实现:
CREATE TABLE st3 (
dt string,
order_id string,
user_id string,
amount DECIMAL(10,2))
ROW format delimited FIELDS TERMINATED BY'\t';
INSERT INTO TABLE st3 VALUES('2017-01-01','10029028','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-01-01','10029029','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-01-01','100290288','1000003252',33.57); INSERT INTO TABLE st3 VALUES('2017-02-02','10029088','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-02-02','100290281','1000003251',33.57); INSERT INTO TABLE st3 VALUES('2017-02-02','100290282','1000003253',33.57); INSERT INTO TABLE st3 VALUES('2017-11-02','10290282','100003253',234); INSERT INTO TABLE st3 VALUES('2018-11-02','10290284','100003243',234);
select count(distinct user_id)from test3 group by user_id
having date_format(min(dt),'yyyy-MM')='2017-11';
错误写法:(原因:hive 不⽀持⼦查询,如果是in的话可以转化为left join 来写)
select*from test3 where dt>='2017-11-01'and user_id not in
(select distinct user_id from test3 where dt<'2017-11-01')
4 需求:
有⼀个5000万的⽤户⽂件(user_id,name,age),⼀个2亿记录的⽤户看电影的记录⽂件(user_id,url),根据年龄段观看电影的次数进⾏排序?
数据准备
CREATE TABLE st4user
(user_id string,
name string,
age int);
CREATE TABLE st4log
(user_id string,
url string);
insert into test4user values
('001','u1',10),
('002','u2',15),
('003','u3',15),
('004','u4',20),
('005','u5',25),
('006','u6',35),
('007','u7',40),
('008','u8',45),
('009','u9',50),
('0010','u10',65);
insert into test4log values
('001','url1'),
('002','url1'),
('003','url2'),
('004','url3'),
('005','url3'),
('006','url1'),
('007','url5'),
('008','url7'),
('009','url5'),
('0010','url1');
思路分析:
1. 最后⼀层⼀定是个排序函数
2. 要排序⾸先要求,每个年龄段的观看次数
3. 先求每个⼈观看次数,关联年龄
select t2.age_phase,count(user_id)as num
from
(select distinct test4user.user_id,test4user.name,test4user.age,t1.num,
case when age<=10and age>0then'0-10'
WHEN age <=20AND age >10THEN'10-20'
WHEN age >20AND age <=30THEN'20-30'
WHEN age >30AND age <=40THEN'30-40'
WHEN age >40AND age <=50THEN'40-50'
WHEN age >50AND age <=60THEN'50-60'
WHEN age >60AND age <=70THEN'60-70'
ELSE'more than 70'END as age_phase
from
test4user  left join
(select user_id,count(1)as num from test4log group by user_id) t1
on t1.user_id = test4user.user_id) t2
group by age_phase;
标准答案(这⾥中⽂会报错,需要看下怎么写中⽂)
(SELECT user_id,
count(*) cnt
FROM st4log
GROUP BY user_id) t1
JOIN
(SELECT user_id,
CASE WHEN age <=10AND age >0THEN'0-10'
WHEN age <=20AND age >10THEN'10-20'
WHEN age >20AND age <=30THEN'20-30'
WHEN age >30AND age <=40THEN'30-40'
WHEN age >40AND age <=50THEN'40-50'
WHEN age >50AND age <=60THEN'50-60'
WHEN age >60AND age <=70THEN'60-70'
ELSE'70以上'END as age_phase
FROM st4user) t2 ON t1.user_id = t2.user_id
GROUP BY t2.age_phase```
5.需求:
请⽤sql写出所有⽤户中在今年10⽉份第⼀次购买商品的⾦额,
表ordertable字段:
(购买⽤户:userid,⾦额:money,购买时间:paymenttime(格式:2017-10-01),订单id:orderid
数据准备
```sql
create table test6 (
userid string,
money decimal(10,2),
paymenttime string,
orderid string
)
row format delimited fields terminated by"\t";
insert into test6
values
('001',100,'2017-10-01','123'),
('001',200,'2017-10-02','124'),
('002',500,'2017-10-01','125'),
('001',100,'2017-11-01','126');
思路分析
1.出10⽉份的所有订单,⽇期函数
date_format(paymenttime,'yyyy-MM')
2.对范围内订单进⾏排名row_number 并且利⽤窗⼝函数over(),以userid为分区,以paymenttime为排序
3.选区row_number为1的
select userid,paymenttime,money,orderid from(
select userid,paymenttime,money,orderid,
row_number()over(partition by userid order by paymenttime)as rowNum
from test6
where date_format(paymenttime,'yyyy-MM')='2017-10') t1 where rowNum =1;