Problem Description
During development, we often need to add a column to a Hive partitioned table. After adding the column with the usual ALTER TABLE statement and re-importing the current day's data, the new column turns out to be all NULL.
ALTER TABLE test.partition_test ADD COLUMNS (id STRING);
A real example:
Create a partitioned table, insert a row, and query it; then add a column, re-import the data, and query again:
CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);
INSERT INTO TABLE test.partition_test PARTITION (dt='2022-09-04') VALUES ("Daniel");
SELECT * FROM test.partition_test;
ALTER TABLE test.partition_test ADD COLUMNS (id STRING);
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt='2022-09-04') VALUES ('1', 'Daniel');
SELECT * FROM test.partition_test;
The results are as follows:
hive> CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);
OK
Time taken: 0.676 seconds
hive> INSERT INTO TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ("Daniel");
Query ID = ***
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id ***)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.68 s
----------------------------------------------------------------------------------------------
Loading data to table test.partition_test partition (dt=2022-09-04)
OK
Time taken: 8.661 seconds
hive> SELECT * FROM test.partition_test;
OK
Daniel 2022-09-04
Time taken: 0.196 seconds, Fetched: 1 row(s)
hive> ALTER TABLE test.partition_test ADD columns(id string);
OK
Time taken: 0.089 seconds
hive> INSERT overwrite TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ('1', 'Daniel');
Query ID = ***
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id ***)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 0.64 s
----------------------------------------------------------------------------------------------
Loading data to table test.partition_test partition (dt=2022-09-04)
OK
Time taken: 1.82 seconds
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.127 seconds, Fetched: 1 row(s)
Trying to update the partition information with MSCK:
MSCK REPAIR TABLE test.partition_test;
SELECT * FROM test.partition_test;
This turns out to have no effect; the new column is still NULL:
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.127 seconds, Fetched: 1 row(s)
hive> MSCK REPAIR TABLE test.partition_test;
OK
Time taken: 3.451 seconds
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.186 seconds, Fetched: 1 row(s)
Cause Analysis:
When we add a column to a partitioned table with a plain ALTER TABLE ... ADD COLUMNS, the data files under the table's location do contain the new values after re-importing, but Hive reads the schema from its own metadata. The ALTER only updates the table-level schema; every existing partition keeps its old column list in the metastore, so the new column is read back as NULL even though the data has actually been written.
At this point one might reach for MSCK REPAIR TABLE to repair the partitions, but as the test above shows, it does not help.
The MSCK REPAIR TABLE command solves a different problem: it registers partitions whose data was written directly into the table's directory via hdfs dfs -put or the HDFS API, and which are therefore invisible to Hive queries.
Hive has a service called the metastore that stores metadata such as database names, table names, and partition information. Partitions whose data was not loaded through a normal INSERT are missing from it, which is why ALTER TABLE table_name DROP/ADD PARTITION naturally comes to mind. That approach does work here, but it becomes unwieldy when a large number of partitions need to be fixed.
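To see the mismatch directly, you can compare the table-level schema with the schema recorded for an existing partition (a diagnostic sketch against the test table from above; existing partitions will not list the new column):

```sql
-- Table-level schema: includes the new id column after the plain ALTER.
DESCRIBE test.partition_test;

-- Partition-level schema: an existing partition still carries the old
-- column list, which is why id reads back as NULL for its rows.
DESCRIBE test.partition_test PARTITION (dt='2022-09-04');
```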
Solutions:
- CASCADE
- ALTER TABLE TABLE_NAME DROP/ADD PARTITION
- DROP/CREATE TABLE
- CASCADE (recommended)
Replace the column-adding SQL with:
ALTER TABLE test.partition_test ADD COLUMNS (id STRING) CASCADE;
Re-import the data and the new column can now be queried. The CASCADE keyword cascades the change: it updates the table schema and the schema of every existing partition in one statement.
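Putting the whole fix together on the test table above (a sketch; ALTER TABLE ... CASCADE requires Hive 1.1.0 or later):

```sql
-- Add the column and cascade the schema change to all existing partitions.
ALTER TABLE test.partition_test ADD COLUMNS (id STRING) CASCADE;

-- Re-import the day's data; the new column is now visible in queries.
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt='2022-09-04')
VALUES ('1', 'Daniel');

SELECT * FROM test.partition_test;
```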
- ALTER TABLE table_name DROP/ADD PARTITION
Drop the current partition first, then re-add it (suitable when there are only a few partitions):
ALTER TABLE test.partition_test DROP PARTITION (dt='2022-09-04');
ALTER TABLE test.partition_test ADD PARTITION (dt='2022-09-04');
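Note that on a managed (non-EXTERNAL) table, DROP PARTITION also deletes the partition's data files, so the data must be re-imported after the partition is re-added. A sketch of the full sequence for the test table:

```sql
-- Drop and re-create the partition so it picks up the current
-- table-level schema, including the new id column.
ALTER TABLE test.partition_test DROP PARTITION (dt='2022-09-04');
ALTER TABLE test.partition_test ADD PARTITION (dt='2022-09-04');

-- Re-import the partition's data (required on a managed table,
-- since the drop removed the old files).
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt='2022-09-04')
VALUES ('1', 'Daniel');
```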
- drop/create table (brute force, not recommended)
Drop the table and recreate it with the new column, then re-import all of the data:
DROP TABLE test.partition_test;
CREATE TABLE test.partition_test (value STRING, id STRING) PARTITIONED BY (dt STRING);
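Because DROP TABLE on a managed table deletes the underlying files, it is safer to back the data up first and reload it afterwards. A sketch of that flow (test.partition_test_bak is a hypothetical backup table name, and dynamic partitioning must be enabled for the reload):

```sql
-- Back up the current data before dropping the managed table.
CREATE TABLE test.partition_test_bak AS SELECT * FROM test.partition_test;

DROP TABLE test.partition_test;
CREATE TABLE test.partition_test (value STRING, id STRING)
PARTITIONED BY (dt STRING);

-- Reload all partitions from the backup via a dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE test.partition_test PARTITION (dt)
SELECT value, id, dt FROM test.partition_test_bak;

DROP TABLE test.partition_test_bak;
```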