Hive: Counting Daily Active Users, Retention, and Related Metrics - a Design and Implementation Guide (by 结实星星)

 

We recently needed to compute daily active users, retention, and similar metrics from our ad delivery logs, so I looked into the problem in some detail.

Our retention rule is as follows: for a user who is new on day 1, if that user has any activity on day i, the user counts as a day-i retained user.
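Under this rule, checking day-i retention reduces to a set-membership test between the day-1 new users and the users active on day i. A minimal sketch in Python (illustrative only; the sample ids happen to match the test tables shown below):

```python
def retained_users(new_users, active_users):
    """Users from the new-user cohort who are also active on the later day."""
    return set(new_users) & set(active_users)

# Example matching the sample data shown later in this article:
new_on_day1 = {1, 2, 3}        # users first seen on 2019-04-09
active_on_day2 = {1, 2, 4, 5}  # users with any activity on 2019-04-10
print(retained_users(new_on_day1, active_on_day2))  # -> {1, 2}
```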

 

To do this, we reduce the problem to a statistics task over three tables:

the log table, the all-users table, and the daily distinct users table.

 

Table structures:

The fields of the three tables are as follows.

 

Log table (org_log)

The log table has the following fields:

id: the user id
action: the user's action
day: the date the log entry was produced (partition column, one partition per day)

 

The log table stores the full daily logs without deduplicating by user. Its structure and data are shown below:

```
0: jdbc:hive2://cdh-manager:10000> show create table org_log;
INFO : Compiling command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea): show create table org_log
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea); Time taken: 0.053 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea): show create table org_log
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea); Time taken: 0.044 seconds
INFO : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `org_log`(                            |
|   `id` int,                                        |
|   `action` string)                                 |
| PARTITIONED BY (                                   |
|   `day` date)                                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/org_log'  |
| TBLPROPERTIES (                                    |
|   'last_modified_by'='hive',                       |
|   'last_modified_time'='1554816805',               |
|   'transient_lastDdlTime'='1554816805')            |
+----------------------------------------------------+
```

 

Data in the log table:

```
0: jdbc:hive2://cdh-manager:10000> select * from org_log;
INFO : Compiling command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5): select * from org_log
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:org_log.id, type:int, comment:null), FieldSchema(name:org_log.action, type:string, comment:null), FieldSchema(name:org_log.day, type:date, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5); Time taken: 0.623 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5): select * from org_log
INFO : Completed executing command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5); Time taken: 0.001 seconds
INFO : OK
+-------------+-----------------+--------------+
| org_log.id  | org_log.action  | org_log.day  |
+-------------+-----------------+--------------+
| 1           | find            | 2019-04-09   |
| 2           | touch           | 2019-04-09   |
| 1           | touch           | 2019-04-09   |
| 3           | touch           | 2019-04-09   |
| 4           | touch           | 2019-04-10   |
| 1           | touch           | 2019-04-10   |
| 2           | find            | 2019-04-10   |
| 2           | key             | 2019-04-10   |
| 5           | touch           | 2019-04-10   |
| 1           | touch           | 2019-04-11   |
| 2           | touch           | 2019-04-11   |
| 3           | touch           | 2019-04-11   |
| 1           | touch           | 2019-04-12   |
| 2           | touch           | 2019-04-12   |
| 3           | touch           | 2019-04-12   |
+-------------+-----------------+--------------+
15 rows selected (1.16 seconds)
```

 

 

----------------------------------------------------------------

 

 

All-users table (all_user)

The all-users table stores every user, deduplicated. It has two fields:

id: the user id
create_day: the date of the user's first recorded activity

```
0: jdbc:hive2://cdh-manager:10000> show create table all_user;
INFO : Compiling command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02): show create table all_user
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02); Time taken: 0.035 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02): show create table all_user
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02); Time taken: 0.016 seconds
INFO : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `all_user`(                           |
|   `id` int,                                        |
|   `create_day` date)                               |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/all_user'  |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1554819131')            |
+----------------------------------------------------+
13 rows selected (0.091 seconds)
```

 

Data in the table:

```
0: jdbc:hive2://cdh-manager:10000> select * from all_user;
INFO : Compiling command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac): select * from all_user
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:all_user.id, type:int, comment:null), FieldSchema(name:all_user.create_day, type:date, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac); Time taken: 0.086 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac): select * from all_user
INFO : Completed executing command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac); Time taken: 0.001 seconds
INFO : OK
+--------------+----------------------+
| all_user.id  | all_user.create_day  |
+--------------+----------------------+
| 1            | 2019-04-09           |
| 2            | 2019-04-09           |
| 3            | 2019-04-09           |
| 4            | 2019-04-10           |
| 5            | 2019-04-10           |
+--------------+----------------------+
5 rows selected (0.158 seconds)
```

 

Script for inserting test data into the log table:

```
insert into org_log partition(day='2019-04-11') values (1,'touch'),(2,'touch'),(3,'touch');
```

 

 

---------------------------------------------

Daily distinct users table (daily_active)

The daily distinct users table stores the deduplicated users for each day. It has the following two fields:

id: the user id
day: the date (partition column, one partition per day)

```
0: jdbc:hive2://cdh-manager:10000> show create table daily_active;
INFO : Compiling command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5): show create table daily_active
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5); Time taken: 0.037 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5): show create table daily_active
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5); Time taken: 0.02 seconds
INFO : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `daily_active`(                       |
|   `id` int)                                        |
| PARTITIONED BY (                                   |
|   `day` date)                                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/daily_active'  |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1554866026')            |
+----------------------------------------------------+
14 rows selected (0.105 seconds)
```

Data in the table:

```
0: jdbc:hive2://cdh-manager:10000> select * from daily_active;
INFO : Compiling command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c): select * from daily_active
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:daily_active.id, type:int, comment:null), FieldSchema(name:daily_active.day, type:date, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c); Time taken: 0.126 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c): select * from daily_active
INFO : Completed executing command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c); Time taken: 0.001 seconds
INFO : OK
+------------------+-------------------+
| daily_active.id  | daily_active.day  |
+------------------+-------------------+
| 1                | 2019-04-09        |
| 2                | 2019-04-09        |
| 3                | 2019-04-09        |
| 1                | 2019-04-10        |
| 2                | 2019-04-10        |
| 4                | 2019-04-10        |
| 5                | 2019-04-10        |
| 1                | 2019-04-11        |
| 2                | 2019-04-11        |
| 3                | 2019-04-11        |
| 1                | 2019-04-12        |
| 2                | 2019-04-12        |
| 3                | 2019-04-12        |
+------------------+-------------------+
13 rows selected (0.259 seconds)
```

 

 

--------------------------------------------

 

 

Computation flow and HQL scripts:

The flow is as follows:

Step 1. Compute the distinct users for the day.

Step 2. Update the all-users table with the day's distinct users.

Step 3. Take the users created 1 to 30 days before the current day and compute their retention.
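Before looking at the HQL, the first two steps can be sanity-checked in plain Python over the first two days of the sample log data (a sketch only; the tables are modeled as in-memory dicts and sets, not Hive tables):

```python
from collections import defaultdict

# (user_id, action, day) rows, copied from the org_log sample for 04-09 and 04-10.
org_log = [
    (1, 'find', '2019-04-09'), (2, 'touch', '2019-04-09'),
    (1, 'touch', '2019-04-09'), (3, 'touch', '2019-04-09'),
    (4, 'touch', '2019-04-10'), (1, 'touch', '2019-04-10'),
    (2, 'find', '2019-04-10'), (2, 'key', '2019-04-10'),
    (5, 'touch', '2019-04-10'),
]

daily_active = defaultdict(set)  # day -> distinct user ids  (step 1)
all_user = {}                    # user id -> create_day     (step 2)

for day in ('2019-04-09', '2019-04-10'):
    # Step 1: SELECT DISTINCT id FROM org_log WHERE day = ...
    daily_active[day] = {uid for uid, _, d in org_log if d == day}
    # Step 2: insert only ids not already present (the NOT IN anti-join).
    for uid in daily_active[day] - set(all_user):
        all_user[uid] = day

# all_user now maps 1, 2, 3 to 2019-04-09 and 4, 5 to 2019-04-10,
# matching the all_user sample table; step 3 is handled by the join queries.
```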

 

HQL and script for computing the day's distinct users:

HQL:

```
[root@cdh-manager active_test]# cat insert_daily_active.hql
use keep_active_test;

INSERT OVERWRITE table daily_active partition(day = '${cal_day}')
SELECT DISTINCT ID
FROM org_log
WHERE day = '${cal_day}';
```

Script:

```
[root@cdh-manager active_test]# cat insert_daily_active.sh
#!/bin/bash

beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_daily_active.hql --hivevar cal_day=$1
```

 

HQL and script for updating the all-users table:

HQL:

```
[root@cdh-manager active_test]# cat insert_distinct_alluser.hql
use keep_active_test;

INSERT INTO table all_user (id, create_day)
SELECT
  tmp.id,
  '${cal_day}' AS create_day
FROM (
  SELECT DISTINCT id
  FROM org_log
  WHERE day = '${cal_day}'
) AS tmp
WHERE tmp.id NOT IN (SELECT id FROM all_user);
```

Script:

```
[root@cdh-manager active_test]# cat insert_distinct_alluser.sh
#!/bin/bash

echo $1
beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_distinct_alluser.hql --hivevar cal_day=$1
```

 

Computing retention:

The main idea is this: from the all-users table, select the users added during the last 30 days; JOIN that set with the daily distinct users table on user id; compute the day difference between the activity date and the creation date; then group by that difference and count.
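As a cross-check of this idea, the same join-and-group computation can be replayed in Python over the sample tables (a sketch; `datediff` here imitates Hive's `datediff(end, start)`):

```python
from datetime import date
from collections import Counter

# (id, create_day) and (id, day) rows, copied from the sample tables above.
all_user = [(1, '2019-04-09'), (2, '2019-04-09'), (3, '2019-04-09'),
            (4, '2019-04-10'), (5, '2019-04-10')]
daily_active = [(1, '2019-04-09'), (2, '2019-04-09'), (3, '2019-04-09'),
                (1, '2019-04-10'), (2, '2019-04-10'), (4, '2019-04-10'),
                (5, '2019-04-10'), (1, '2019-04-11'), (2, '2019-04-11'),
                (3, '2019-04-11'), (1, '2019-04-12'), (2, '2019-04-12'),
                (3, '2019-04-12')]

def datediff(end, start):
    """Whole days between two 'YYYY-MM-DD' strings, like Hive's datediff."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

# WHERE create_day = '2019-04-09' (the 30-day window collapsed to one cohort).
cohort = {uid: d for uid, d in all_user if d == '2019-04-09'}
# INNER JOIN on id, then GROUP BY datediff and COUNT(id).
retention = Counter(datediff(day, cohort[uid])
                    for uid, day in daily_active if uid in cohort)
print(sorted(retention.items()))  # [(0, 3), (1, 2), (2, 3), (3, 3)]
```

This reproduces the counts that the final HQL query returns for the 2019-04-09 cohort.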

 

 

There are two queries. The first is the basic, raw version: it has no GROUP BY and exists only to make the join easy to understand:

```
[root@cdh-manager active_test]# cat cal_daily_active2.hql
use keep_active_test;

SELECT
  all_user.id,
  all_user.create_day,
  daily_active.day,
  datediff(daily_active.day, all_user.create_day)
FROM all_user
INNER JOIN daily_active ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09';
```

Result:

```
+--------------+----------------------+-------------------+------+
| all_user.id  | all_user.create_day  | daily_active.day  | _c3  |
+--------------+----------------------+-------------------+------+
| 1            | 2019-04-09           | 2019-04-11        | 2    |
| 1            | 2019-04-09           | 2019-04-09        | 0    |
| 1            | 2019-04-09           | 2019-04-12        | 3    |
| 1            | 2019-04-09           | 2019-04-10        | 1    |
| 2            | 2019-04-09           | 2019-04-11        | 2    |
| 2            | 2019-04-09           | 2019-04-09        | 0    |
| 2            | 2019-04-09           | 2019-04-12        | 3    |
| 2            | 2019-04-09           | 2019-04-10        | 1    |
| 3            | 2019-04-09           | 2019-04-11        | 2    |
| 3            | 2019-04-09           | 2019-04-09        | 0    |
| 3            | 2019-04-09           | 2019-04-12        | 3    |
+--------------+----------------------+-------------------+------+
11 rows selected (54.082 seconds)
```

 

 

Finally, the actual aggregation query:

```
[root@cdh-manager active_test]# cat cal_daily_active.hql
use keep_active_test;

SELECT
  datediff(daily_active.day, all_user.create_day) AS day,
  COUNT(daily_active.id) AS total
FROM all_user
INNER JOIN daily_active ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09'
GROUP BY datediff(daily_active.day, all_user.create_day);
```

Result:

```
+------+--------+
| day  | total  |
+------+--------+
| 0    | 3      |
| 1    | 2      |
| 2    | 3      |
| 3    | 3      |
+------+--------+
```

 

The first column is the number of days since the cohort's creation day (the retention day); the second column is the number of users from the 2019-04-09 cohort who were active on that day.
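From these counts one can also derive retention rates by dividing each day's retained count by the day-0 cohort size; a small sketch (the counts are taken from the result table above):

```python
# Retained-user counts per retention day, from the result table above.
retention_counts = {0: 3, 1: 2, 2: 3, 3: 3}

cohort_size = retention_counts[0]  # day 0 covers the whole new-user cohort
retention_rates = {d: n / cohort_size for d, n in retention_counts.items()}
# Day-1 retention is 2/3 (about 67%); days 2 and 3 are back to 100%.
```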

 
