Recently we needed to compute daily metrics such as active users and retention from our delivery logs, so I did some detailed research.
Our retention rule is: for a user newly added on day 1, if that user has any activity on day i, the user counts as retained on day i.
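The rule above can be sketched in a few lines of Python. This is a minimal illustration over hypothetical in-memory data, not the production tables: a day-1 user counts as retained on day i if they appear in that day's activity.

```python
# Hypothetical example data: the day-1 cohort and per-day active user ids.
new_users_day1 = {1, 2, 3}      # users first seen on day 1
activity = {                    # day -> set of user ids with any log entry
    1: {1, 2, 3},
    2: {1, 2},
    3: {1, 3},
}

def retained(day_i: int) -> set:
    """Users from the day-1 cohort who were active again on day_i."""
    return new_users_day1 & activity.get(day_i, set())

print(sorted(retained(2)))  # [1, 2]
```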
To that end, we simplify it into a statistics task over the following three tables:
the log table, the full user table, and the daily distinct-user table.
Table structures:
The log table
The log table has the following fields:
id -- user id
action -- the user's action
day -- date the log entry was produced (partitioned by day)
The log table stores the full raw logs for each day; users are not deduplicated. Structure and data:
0: jdbc:hive2://cdh-manager:10000> show create table org_log;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `org_log`(                            |
|   `id` int,                                        |
|   `action` string)                                 |
| PARTITIONED BY (                                   |
|   `day` date)                                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/org_log' |
| TBLPROPERTIES (                                    |
|   'last_modified_by'='hive',                       |
|   'last_modified_time'='1554816805',               |
|   'transient_lastDdlTime'='1554816805')            |
+----------------------------------------------------+
Data in the log table:
0: jdbc:hive2://cdh-manager:10000> select * from org_log;
+-------------+-----------------+--------------+
| org_log.id  | org_log.action  | org_log.day  |
+-------------+-----------------+--------------+
| 1           | find            | 2019-04-09   |
| 2           | touch           | 2019-04-09   |
| 1           | touch           | 2019-04-09   |
| 3           | touch           | 2019-04-09   |
| 4           | touch           | 2019-04-10   |
| 1           | touch           | 2019-04-10   |
| 2           | find            | 2019-04-10   |
| 2           | key             | 2019-04-10   |
| 5           | touch           | 2019-04-10   |
| 1           | touch           | 2019-04-11   |
| 2           | touch           | 2019-04-11   |
| 3           | touch           | 2019-04-11   |
| 1           | touch           | 2019-04-12   |
| 2           | touch           | 2019-04-12   |
| 3           | touch           | 2019-04-12   |
+-------------+-----------------+--------------+
15 rows selected (1.16 seconds)
----------------------------------------------------------------
The full user table
The full user table stores all deduplicated users.
It has 2 fields:
id -- user id
create_day -- date the user was first seen
0: jdbc:hive2://cdh-manager:10000> show create table all_user;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `all_user`(                           |
|   `id` int,                                        |
|   `create_day` date)                               |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/all_user' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1554819131')            |
+----------------------------------------------------+
13 rows selected (0.091 seconds)
Data:
0: jdbc:hive2://cdh-manager:10000> select * from all_user;
+--------------+----------------------+
| all_user.id  | all_user.create_day  |
+--------------+----------------------+
| 1            | 2019-04-09           |
| 2            | 2019-04-09           |
| 3            | 2019-04-09           |
| 4            | 2019-04-10           |
| 5            | 2019-04-10           |
+--------------+----------------------+
5 rows selected (0.158 seconds)
Sample data-insertion statement:
insert into org_log partition(day='2019-04-11') values (1,'touch'),(2,'touch'),(3,'touch');
---------------------------------------------
The daily distinct-user table
The daily distinct-user table stores the deduplicated users for each day.
It has 2 fields:
id -- user id
day -- date (partitioned by day)
0: jdbc:hive2://cdh-manager:10000> show create table daily_active;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `daily_active`(                       |
|   `id` int)                                        |
| PARTITIONED BY (                                   |
|   `day` date)                                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/daily_active' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1554866026')            |
+----------------------------------------------------+
14 rows selected (0.105 seconds)
Data:
0: jdbc:hive2://cdh-manager:10000> select * from daily_active;
+------------------+-------------------+
| daily_active.id  | daily_active.day  |
+------------------+-------------------+
| 1                | 2019-04-09        |
| 2                | 2019-04-09        |
| 3                | 2019-04-09        |
| 1                | 2019-04-10        |
| 2                | 2019-04-10        |
| 4                | 2019-04-10        |
| 5                | 2019-04-10        |
| 1                | 2019-04-11        |
| 2                | 2019-04-11        |
| 3                | 2019-04-11        |
| 1                | 2019-04-12        |
| 2                | 2019-04-12        |
| 3                | 2019-04-12        |
+------------------+-------------------+
13 rows selected (0.259 seconds)
--------------------------------------------
Workflow and HQL scripts:
The workflow is:
Step 1. Compute the distinct users for the day.
Step 2. Update the full user table with the day's distinct users.
Step 3. Take the users added in the 1-30 days before the current day and compute retention.
HQL and shell script for computing the day's distinct users:
HQL:
[root@cdh-manager active_test]# cat insert_daily_active.hql
use keep_active_test;

INSERT OVERWRITE table daily_active partition(day = '${cal_day}')
SELECT DISTINCT ID FROM org_log WHERE day = '${cal_day}';
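What the INSERT OVERWRITE ... SELECT DISTINCT does can be simulated in plain Python over hypothetical in-memory log rows (id, action, day), to make the dedup-per-partition behavior concrete:

```python
# Hypothetical raw log rows mirroring org_log's (id, action, day) schema.
org_log = [
    (1, "find",  "2019-04-09"),
    (2, "touch", "2019-04-09"),
    (1, "touch", "2019-04-09"),   # duplicate user 1 on the same day
    (3, "touch", "2019-04-09"),
]

def daily_distinct(rows, day):
    """Deduplicated user ids for one day partition (SELECT DISTINCT id)."""
    return {uid for uid, _action, d in rows if d == day}

print(sorted(daily_distinct(org_log, "2019-04-09")))  # [1, 2, 3]
```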
Shell script:
[root@cdh-manager active_test]# cat insert_daily_active.sh
#!/bin/bash
beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_daily_active.hql --hivevar cal_day=$1
HQL and shell script for updating the full user table:
HQL:
[root@cdh-manager active_test]# cat insert_distinct_alluser.hql
use keep_active_test;

INSERT INTO table all_user (id, create_day)
SELECT tmp.id, '${cal_day}' AS create_day
FROM (
    SELECT DISTINCT id FROM org_log WHERE day = '${cal_day}'
) AS tmp
WHERE tmp.id NOT IN (SELECT id FROM all_user);
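The incremental update is: insert today's distinct users that are not yet in all_user, stamped with today's date. A minimal Python sketch of that logic, over hypothetical in-memory copies of the tables (note that SQL's NOT IN behaves surprisingly when the subquery can yield NULL; that cannot happen here since id is the key):

```python
# Hypothetical state: all_user as id -> create_day.
all_user = {1: "2019-04-09", 2: "2019-04-09", 3: "2019-04-09"}
todays_distinct = {1, 4, 5}     # SELECT DISTINCT id ... WHERE day = '${cal_day}'
cal_day = "2019-04-10"

for uid in todays_distinct:
    if uid not in all_user:     # WHERE tmp.id NOT IN (SELECT id FROM all_user)
        all_user[uid] = cal_day # new user, first seen today

print(sorted(all_user.items()))
```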
Shell script:
[root@cdh-manager active_test]# cat insert_distinct_alluser.sh
#!/bin/bash
echo $1
beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_distinct_alluser.hql --hivevar cal_day=$1
Computing retention:
The main idea: select the users added over the last 30 days from the full user table, JOIN them with the daily distinct-user table on user id, compute the day difference, and then group and count by that difference.
There are two queries. The first is a basic one, kept to aid understanding; it has no grouping:
[root@cdh-manager active_test]# cat cal_daily_active2.hql
use keep_active_test;

SELECT all_user.id, all_user.create_day, daily_active.day,
       datediff(daily_active.day, all_user.create_day)
FROM all_user
INNER JOIN daily_active ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09';
Result:
+--------------+----------------------+-------------------+------+
| all_user.id  | all_user.create_day  | daily_active.day  | _c3  |
+--------------+----------------------+-------------------+------+
| 1            | 2019-04-09           | 2019-04-11        | 2    |
| 1            | 2019-04-09           | 2019-04-09        | 0    |
| 1            | 2019-04-09           | 2019-04-12        | 3    |
| 1            | 2019-04-09           | 2019-04-10        | 1    |
| 2            | 2019-04-09           | 2019-04-11        | 2    |
| 2            | 2019-04-09           | 2019-04-09        | 0    |
| 2            | 2019-04-09           | 2019-04-12        | 3    |
| 2            | 2019-04-09           | 2019-04-10        | 1    |
| 3            | 2019-04-09           | 2019-04-11        | 2    |
| 3            | 2019-04-09           | 2019-04-09        | 0    |
| 3            | 2019-04-09           | 2019-04-12        | 3    |
+--------------+----------------------+-------------------+------+
11 rows selected (54.082 seconds)
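The join-and-datediff step can be sketched in plain Python over hypothetical in-memory copies of the two tables. In Hive, datediff(a, b) is the day difference a minus b, which the sketch reproduces:

```python
from datetime import date

# Hypothetical in-memory copies of all_user (id, create_day) and
# daily_active (id, day), matching the sample data above.
all_user = [(1, "2019-04-09"), (2, "2019-04-09"), (3, "2019-04-09"),
            (4, "2019-04-10"), (5, "2019-04-10")]
daily_active = [(1, "2019-04-09"), (2, "2019-04-09"), (3, "2019-04-09"),
                (1, "2019-04-10"), (2, "2019-04-10"), (4, "2019-04-10"),
                (5, "2019-04-10"),
                (1, "2019-04-11"), (2, "2019-04-11"), (3, "2019-04-11"),
                (1, "2019-04-12"), (2, "2019-04-12"), (3, "2019-04-12")]

def datediff(a: str, b: str) -> int:
    """Day difference a - b, like Hive's datediff."""
    return (date.fromisoformat(a) - date.fromisoformat(b)).days

# INNER JOIN on id, restricted to the 2019-04-09 cohort.
rows = [(uid, create, day, datediff(day, create))
        for uid, create in all_user if create == "2019-04-09"
        for aid, day in daily_active if aid == uid]

print(len(rows))  # 11, matching the 11 rows selected above
```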
And finally the actual statistics query:
[root@cdh-manager active_test]# cat cal_daily_active.hql
use keep_active_test;

SELECT datediff(daily_active.day, all_user.create_day) AS day,
       COUNT(daily_active.id) AS total
FROM all_user
INNER JOIN daily_active ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09'
GROUP BY datediff(daily_active.day, all_user.create_day);
Result:
+------+--------+
| day  | total  |
+------+--------+
| 0    | 3      |
| 1    | 2      |
| 2    | 3      |
| 3    | 3      |
+------+--------+
The day column is the number of days since the user was first seen (the retention day); the total column is the number of retained users on that day.
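The grouped query can be simulated end to end in Python with a Counter over the datediff buckets, again over hypothetical in-memory copies of the tables:

```python
from collections import Counter
from datetime import date

# Hypothetical in-memory tables matching the sample data.
all_user = {1: "2019-04-09", 2: "2019-04-09", 3: "2019-04-09",
            4: "2019-04-10", 5: "2019-04-10"}
daily_active = [(1, "2019-04-09"), (2, "2019-04-09"), (3, "2019-04-09"),
                (1, "2019-04-10"), (2, "2019-04-10"), (4, "2019-04-10"),
                (5, "2019-04-10"),
                (1, "2019-04-11"), (2, "2019-04-11"), (3, "2019-04-11"),
                (1, "2019-04-12"), (2, "2019-04-12"), (3, "2019-04-12")]

def datediff(a: str, b: str) -> int:
    """Day difference a - b, like Hive's datediff."""
    return (date.fromisoformat(a) - date.fromisoformat(b)).days

cohort = "2019-04-09"
# GROUP BY datediff(...), COUNT(id), restricted to the cohort.
retention = Counter(
    datediff(day, all_user[uid])
    for uid, day in daily_active
    if all_user.get(uid) == cohort
)
print(sorted(retention.items()))  # [(0, 3), (1, 2), (2, 3), (3, 3)]
```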