目录
0 问题描述
用户行为日志表tb_user_log
| id | uid | artical_id | in_time | out_time | sign_in | 
| 1 | 101 | 0 | 2021-07-07 10:00:00 | 2021-07-07 10:00:09 | 1 | 
| 2 | 101 | 0 | 2021-07-08 10:00:00 | 2021-07-08 10:00:09 | 1 | 
| 3 | 101 | 0 | 2021-07-09 10:00:00 | 2021-07-09 10:00:42 | 1 | 
| 4 | 101 | 0 | 2021-07-10 10:00:00 | 2021-07-10 10:00:09 | 1 | 
| 5 | 101 | 0 | 2021-07-11 23:59:55 | 2021-07-11 23:59:59 | 1 | 
| 6 | 101 | 0 | 2021-07-12 10:00:28 | 2021-07-12 10:00:50 | 1 | 
| 7 | 101 | 0 | 2021-07-13 10:00:28 | 2021-07-13 10:00:50 | 1 | 
| 8 | 102 | 0 | 2021-10-01 10:00:28 | 2021-10-01 10:00:50 | 1 | 
| 9 | 102 | 0 | 2021-10-02 10:00:01 | 2021-10-02 10:01:50 | 1 | 
| 10 | 102 | 0 | 2021-10-03 10:00:55 | 2021-10-03 11:00:59 | 1 | 
| 11 | 102 | 0 | 2021-10-04 10:00:45 | 2021-10-04 11:00:55 | 0 | 
| 12 | 102 | 0 | 2021-10-05 10:00:53 | 2021-10-05 11:00:59 | 1 | 
| 13 | 102 | 0 | 2021-10-06 10:00:45 | 2021-10-06 11:00:55 | 1 | 
(uid-用户ID, artical_id-文章ID, in_time-进入时间, out_time-离开时间, sign_in-是否签到)
场景逻辑说明:
- artical_id-文章ID代表用户浏览的文章的ID,特殊情况artical_id-文章ID为0表示用户在非文章内容页(比如App内的列表页、活动页等)。注意:只有artical_id为0时sign_in值才有效。
- 从2021年7月7日0点开始,用户每天签到可以领1金币,并可以开始累积签到天数,连续签到的第3、7天分别可额外领2、6金币。
- 每连续签到7天后重新累积签到天数(即重置签到天数:连续第8天签到时记为新的一轮签到的第一天,领1金币)
问题:计算每个用户2021年7月以来每月获得的金币数(该活动到10月底结束,11月1日开始的签到不再获得金币)。结果按月份、ID升序排序。
注:如果签到记录的in_time-进入时间和out_time-离开时间跨天了,也只记作in_time对应的日期签到了。
输出示例:
示例数据的输出结果如下:
| uid | month | coin | 
| 101 | 202107 | 15 | 
| 102 | 202110 | 7 | 
解释:
101在活动期内连续签到了7天,因此获得1*7+2+6=15金币;
102在10.01~10.03连续签到3天获得5金币
10.04断签了,10.05~10.06连续签到2天获得2金币,共得到7金币。
1 数据准备
DROP TABLE IF EXISTS tb_user_log;
CREATE TABLE tb_user_log (
    uid INT COMMENT '用户ID',
    artical_id INT COMMENT '视频ID',
    in_time string COMMENT '进入时间',
    out_time string COMMENT '离开时间',
    sign_in int COMMENT '是否签到'
) ;
INSERT INTO tb_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
  (101, 0, '2021-07-07 10:00:00', '2021-07-07 10:00:09', 1),
  (101, 0, '2021-07-08 10:00:00', '2021-07-08 10:00:09', 1),
  (101, 0, '2021-07-09 10:00:00', '2021-07-09 10:00:42', 1),
  (101, 0, '2021-07-10 10:00:00', '2021-07-10 10:00:09', 1),
  (101, 0, '2021-07-11 23:59:55', '2021-07-11 23:59:59', 1),
  (101, 0, '2021-07-12 10:00:28', '2021-07-12 10:00:50', 1),
  (101, 0, '2021-07-13 10:00:28', '2021-07-13 10:00:50', 1),
  (101, 0, '2021-07-14 11:00:28', '2021-07-14 11:00:50', 1),
  (101, 0, '2021-07-15 11:59:28', '2021-07-16 00:01:20', 1),
  (102, 0, '2021-10-01 10:00:28', '2021-10-01 10:00:50', 1),
  (102, 0, '2021-10-02 10:00:01', '2021-10-02 10:01:50', 1),
  (102, 0, '2021-10-03 11:00:55', '2021-10-03 11:00:59', 1),
  (102, 0, '2021-10-04 11:00:45', '2021-10-04 11:00:55', 0),
  (102, 0, '2021-10-05 11:00:53', '2021-10-05 11:00:59', 1),
  (102, 0, '2021-10-06 11:00:45', '2021-10-06 11:00:55', 1);2 问题分析
本题在牛客网上算比较难的一道题目,本地所要求的是每个用户每月获得的金币数:
领取金币规则:
- (1)只要用户签到就能获取一枚金币
- (2)连续签到第三天和第七天分别额外获取2枚和6金币
- (3)连续签到7天后进行重置,按照规则(1)和(2)进行新一轮领币
分析:本题的维度是用户和月份,也就是最终结果是按照这两个来分组
第一步:根据领金币的规则,我们容易想到按照签到的连续性先进行分组,将连续的进行标记分成一组
第二步:在每一个连续的组里进行row_number()标记(rn),目的是获取连续的第三天和第七个天特殊位置。
第三步:根据第二步连续的天数中是否有超过7天的,然后以此进行分割,再进行分组
整体计算规则如下草图:
 
  
  
步骤三种,首先需要对为7天的地方进行标记(打标记的目的就是为了辅助计算)
标记的方法:rn % 7==0.如果为0则记为1,不为0则记为0
分组方法:按照滑动窗口求和的方法。此类分俩种一种是窗口变得越来越大的方法,另一种是窗口越来越小的方法。具体如下图所示


此两种算法对应的目标不一样:第一种窗口越来越大的方法是希望1标记位分组的时候能分到下面组中(标记位向下分组),第二种窗口越来越小的形式是希望标记为1能分到上一组数据中(或理解为进入到上一个数据桶中)(标记位向上)。具体整体分组如下图
 
  
  
步骤一:先判断连续性,做第一个分组标签
注意此处不要将是否签到值sign_in在where条件中过滤,此题连续真正的含义是:时间连续+签到的连续,如果sign_in在where条件中过滤则不能构成分组条件
select  uid
      ,in_time
      ,sign_in
      ,month
      -- 时间和签到同时的连续性
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg
from(
select uid
      ,in_time
      ,sign_in
      ,month
    --求出当前值与上一个值的差值,如果为1具备连续的条件(时间的连续性)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --构造签到的连续性条件
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--求出上一行值
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1+------+----------------------+----------+----------+-------------+
| uid  |       in_time        | sign_in  |  month   | series_flg  |
+------+----------------------+----------+----------+-------------+
| 101  | 2021-07-07 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-08 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-09 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-10 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-11 23:59:55  | 1        | 2021-07  | 1           |
| 101  | 2021-07-12 10:00:28  | 1        | 2021-07  | 1           |
| 101  | 2021-07-13 10:00:28  | 1        | 2021-07  | 1           |
| 101  | 2021-07-14 11:00:28  | 1        | 2021-07  | 1           |
| 101  | 2021-07-15 11:59:28  | 1        | 2021-07  | 1           |
| 102  | 2021-10-01 10:00:28  | 1        | 2021-10  | 1           |
| 102  | 2021-10-02 10:00:01  | 1        | 2021-10  | 1           |
| 102  | 2021-10-03 11:00:55  | 1        | 2021-10  | 1           |
| 102  | 2021-10-05 11:00:53  | 1        | 2021-10  | 0           |
| 102  | 2021-10-06 11:00:45  | 1        | 2021-10  | 0           |
+------+----------------------+----------+----------+-------------+错误的分组标签生成:sign_in在where条件中过滤
select *
      -- 签到连续性标签
     , sum(if(datediff(to_date(in_time),to_date(lag_in_time)) < 1,1,0)) over(partition by uid,month order by in_time) as    series_flg
from
  (select uid
        ,in_time
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10' and sign_in=1
  ) t生成的结果如下:

步骤二:生成第二个标签,连续7天的分组标记
select *
      --生成连续7天的分组标记
      ,sum(case when rn % 7=0 then 1 else 0 end) over(partition by uid,month,series_flg order by in_time rows between current row and unbounded following) as series_7_flg
from 
(select *
--生成每组中的顺序标记
     , row_number() over(partition by uid,month,series_flg) as rn
from  
(select  uid
      ,in_time
      ,sign_in
      ,month
      -- 时间和签到同时的连续性
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg
from(
select uid
      ,in_time
      ,sign_in
      ,month
    --求出当前值与上一个值的差值,如果为1具备连续的条件(时间的连续性)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --构造签到的连续性条件
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--求出上一行值
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
) t
) t对应结果如下:

步骤三:根据领取金币规则,在每个条件分组中求金币数
select uid
      ,month
      ,series_flg
      ,series_7_flg
      --按照领取金币规则求金币数
      ,case when cnt>=1 and cnt<3 then cnt
            when cnt>=3 and cnt<7 then cnt + 2
       else cnt + 2 + 6 end coin        
from
(select uid
      ,month
      ,series_flg
      ,series_7_flg
      ,count(1) as cnt
from
(select *
      --生成连续7天的分组标记
      ,sum(case when rn % 7=0 then 1 else 0 end) over(partition by uid,month,series_flg order by in_time rows between current row and unbounded following) as series_7_flg
from 
(select *
--生成每组中的顺序标记
     , row_number() over(partition by uid,month,series_flg) as rn
from  
(select  uid
      ,in_time
      ,sign_in
      ,month
      -- 时间和签到同时的连续性
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg
from(
select uid
      ,in_time
      ,sign_in
      ,month
    --求出当前值与上一个值的差值,如果为1具备连续的条件(时间的连续性)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --构造签到的连续性条件
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--求出上一行值
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
) t
) t
) t
group by uid,month,series_flg,series_7_flg
) t+------+----------+-------------+---------------+-------+
| uid  |  month   | series_flg  | series_7_flg  | coin  |
+------+----------+-------------+---------------+-------+
| 102  | 2021-10  | 1           | 0             | 5     |
| 102  | 2021-10  | 0           | 0             | 2     |
| 101  | 2021-07  | 1           | 1             | 15    |
| 101  | 2021-07  | 1           | 0             | 2     |
+------+----------+-------------+---------------+-------+分组1:series_flg表示连续签到分组。分组2:series_7_flg表示连续领取金币天数中有超过7天的,并以7天为分割点重新分组。
此处领取金币的算法为:1-2天为等于连续的天数,3-6天为连续的天数加2,当为7天时为连续天数加8
步骤四:获取用户每月的金币数,并按用户及月份升序排序
最终SQL如下
select uid,month,sum(coin) as coin
from(
select uid
      ,month
      ,series_flg
      ,series_7_flg
      --按照领取金币规则求金币数
      ,case when cnt>=1 and cnt<3 then cnt
            when cnt>=3 and cnt<7 then cnt + 2
       else cnt + 2 + 6 end coin        
from
(select uid
      ,month
      ,series_flg
      ,series_7_flg
      ,count(1) as cnt
from
(select *
      --生成连续7天的分组标记
      ,sum(case when rn % 7=0 then 1 else 0 end) over(partition by uid,month,series_flg order by in_time rows between current row and unbounded following) as series_7_flg
from 
(select *
--生成每组中的顺序标记
     , row_number() over(partition by uid,month,series_flg) as rn
from  
(select  uid
      ,in_time
      ,sign_in
      ,month
      -- 时间和签到同时的连续性
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg
from(
select uid
      ,in_time
      ,sign_in
      ,month
    --求出当前值与上一个值的差值,如果为1具备连续的条件(时间的连续性)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --构造签到的连续性条件
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--求出上一行值
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
) t
) t
) t
group by uid,month,series_flg,series_7_flg
) t
)t
group by uid,month 
order by uid,month最终结果如下:
+------+----------+-------+
| uid  |  month   | coin  |
+------+----------+-------+
| 101  | 2021-07  | 17    |
| 102  | 2021-10  | 7     |
+------+----------+-------+注意此处与所给的答案有出入是因为造的数据中增加了2条数据,便于测试。
3 小结
本文给出了一种连续签到领取金币的一种通用解法,通过窗口函数生成标签值进行辅助计算。通过本题可以收获以下知识点:
- (1)连续签到的判断方法(注意时间连续+签到的连续)
- (2)断点(按照某种条件)重分组的方法(如本题的连续7天后为一个断点)
         1) 断点同上:order by in_time rows between current row and unbounded   
                     following(上边界滑动,下边界固定)
         2)断点同下: order by in_time(下边界滑动,上边界固定)- (3)周期的计算方法,如本题7天一个循环(采用求余的方法:rn%7==0)











