Problem Description
During development, we often need to add a column to a Hive partitioned table. After adding the column with the usual ALTER TABLE statement and re-importing the current day's data, the new column turns out to be all NULL.
ALTER TABLE test.partition_test ADD COLUMNS (id STRING);
A real example:
Create a partitioned table, insert a row, and query it; then add a column, re-import the data, and query again:
CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);
INSERT INTO TABLE test.partition_test PARTITION (dt='2022-09-04') VALUES ("Daniel");
SELECT * FROM test.partition_test;
ALTER TABLE test.partition_test ADD COLUMNS (id STRING);
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt='2022-09-04') VALUES ('1', 'Daniel');
SELECT * FROM test.partition_test;
The results are as follows:
hive> CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);
OK
Time taken: 0.676 seconds
hive> INSERT INTO TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ("Daniel");
Query ID = ***
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id ***)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.68 s
----------------------------------------------------------------------------------------------
Loading data to table test.partition_test partition (dt=2022-09-04)
OK
Time taken: 8.661 seconds
hive> SELECT * FROM test.partition_test;
OK
Daniel 2022-09-04
Time taken: 0.196 seconds, Fetched: 1 row(s)
hive> ALTER TABLE test.partition_test ADD columns(id string);
OK
Time taken: 0.089 seconds
hive> INSERT overwrite TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ('1', 'Daniel');
Query ID = ***
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id ***)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 0.64 s
----------------------------------------------------------------------------------------------
Loading data to table test.partition_test partition (dt=2022-09-04)
OK
Time taken: 1.82 seconds
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.127 seconds, Fetched: 1 row(s)
Trying to update the partition information with MSCK:
MSCK REPAIR TABLE test.partition_test;
SELECT * FROM test.partition_test;
This turns out to have no effect; the new column is still NULL:
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.127 seconds, Fetched: 1 row(s)
hive> MSCK REPAIR TABLE test.partition_test;
OK
Time taken: 3.451 seconds
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.186 seconds, Fetched: 1 row(s)
Cause Analysis:
When we add a column to a partitioned table with a plain ALTER TABLE ... ADD COLUMNS, the data files under the table's location do contain the new values after re-importing, but Hive reads the schema from its own metadata. The ALTER only updates the table-level schema; every existing partition keeps its old column list in the metastore, so the new column is read back as NULL even though the data has actually been written.
At this point one might reach for MSCK REPAIR TABLE to repair the partitions, but as the test above shows, it does not help.
The MSCK REPAIR TABLE command solves a different problem: it registers partitions whose data was written directly into the table's directory via hdfs dfs -put or the HDFS API, and which are therefore invisible to Hive queries.
Hive has a service called the metastore that stores metadata such as database names, table names, and partition information. Partitions whose data was not loaded through a normal INSERT are missing from it, which is why ALTER TABLE table_name DROP/ADD PARTITION naturally comes to mind. That approach does work here, but it becomes unwieldy when a large number of partitions need to be fixed.
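To see the mismatch directly, you can compare the table-level schema with the schema recorded for an existing partition (a diagnostic sketch against the test table from above; existing partitions will not list the new column):

```sql
-- Table-level schema: includes the new id column after the plain ALTER.
DESCRIBE test.partition_test;

-- Partition-level schema: an existing partition still carries the old
-- column list, which is why id reads back as NULL for its rows.
DESCRIBE test.partition_test PARTITION (dt='2022-09-04');
```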
Solutions:
- CASCADE
- ALTER TABLE TABLE_NAME DROP/ADD PARTITION
- DROP/CREATE TABLE
- CASCADE (recommended)
Replace the column-adding SQL with:
ALTER TABLE test.partition_test ADD COLUMNS (id STRING) CASCADE;
Re-import the data and the new column can now be queried. The CASCADE keyword cascades the change: it updates the table schema and the schema of every existing partition in one statement.
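Putting the whole fix together on the test table above (a sketch; ALTER TABLE ... CASCADE requires Hive 1.1.0 or later):

```sql
-- Add the column and cascade the schema change to all existing partitions.
ALTER TABLE test.partition_test ADD COLUMNS (id STRING) CASCADE;

-- Re-import the day's data; the new column is now visible in queries.
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt='2022-09-04')
VALUES ('1', 'Daniel');

SELECT * FROM test.partition_test;
```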
- ALTER TABLE table_name DROP/ADD PARTITION
Drop the current partition first, then re-add it (suitable when there are only a few partitions):
ALTER TABLE test.partition_test DROP PARTITION (dt='2022-09-04');
ALTER TABLE test.partition_test ADD PARTITION (dt='2022-09-04');
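Note that on a managed (non-EXTERNAL) table, DROP PARTITION also deletes the partition's data files, so the data must be re-imported after the partition is re-added. A sketch of the full sequence for the test table:

```sql
-- Drop and re-create the partition so it picks up the current
-- table-level schema, including the new id column.
ALTER TABLE test.partition_test DROP PARTITION (dt='2022-09-04');
ALTER TABLE test.partition_test ADD PARTITION (dt='2022-09-04');

-- Re-import the partition's data (required on a managed table,
-- since the drop removed the old files).
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt='2022-09-04')
VALUES ('1', 'Daniel');
```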
- drop/create table (brute force, not recommended)
Drop the table and recreate it with the new column, then re-import all of the data:
DROP TABLE test.partition_test;
CREATE TABLE test.partition_test (value STRING, id STRING) PARTITIONED BY (dt STRING);
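Because DROP TABLE on a managed table deletes the underlying files, it is safer to back the data up first and reload it afterwards. A sketch of that flow (test.partition_test_bak is a hypothetical backup table name, and dynamic partitioning must be enabled for the reload):

```sql
-- Back up the current data before dropping the managed table.
CREATE TABLE test.partition_test_bak AS SELECT * FROM test.partition_test;

DROP TABLE test.partition_test;
CREATE TABLE test.partition_test (value STRING, id STRING)
PARTITIONED BY (dt STRING);

-- Reload all partitions from the backup via a dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt)
SELECT value, id, dt FROM test.partition_test_bak;

DROP TABLE test.partition_test_bak;
```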