0
点赞
收藏
分享

微信扫一扫

Hive表新增字段值为NULL问题

hwwjian 2022-09-10 阅读 339


问题描述

​开发中,经常会遇到Hive分区表需要加字段的问题。在我们使用常规手段alter table 加上字段后,重新导入当天的数据,会发现新加字段的值全为空。​

ALTER TABLE test.partition_test ADD columns(id string);

实际案例:

创建一个分区表,并插入数据,查看分区信息

CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);
INSERT INTO TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ("Daniel");
SELECT * FROM test.partition_test;
ALTER TABLE test.partition_test ADD columns(id string);
INSERT overwrite TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ('1', 'Daniel');
SELECT * FROM test.partition_test;

结果如下

hive> CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);
OK
Time taken: 0.676 seconds
hive> INSERT INTO TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ("Daniel");
Query ID = ***
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id ***)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.68 s
----------------------------------------------------------------------------------------------
Loading data to table test.partition_test partition (dt=2022-09-04)
OK
Time taken: 8.661 seconds
hive> SELECT * FROM test.partition_test;
OK
Daniel 2022-09-04
Time taken: 0.196 seconds, Fetched: 1 row(s)
hive> ALTER TABLE test.partition_test ADD columns(id string);
OK
Time taken: 0.089 seconds
hive> INSERT overwrite TABLE test.partition_test PARTITION(dt='2022-09-04')VALUES ('1', 'Daniel');
Query ID = ***
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id ***)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 0.64 s
----------------------------------------------------------------------------------------------
Loading data to table test.partition_test partition (dt=2022-09-04)
OK
Time taken: 1.82 seconds
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.127 seconds, Fetched: 1 row(s)

使用MSCK语法更新分区信息

MSCK REPAIR TABLE test.partition_test;
SELECT * FROM test.partition_test;

发现无效,新增的字段仍然为空

hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.127 seconds, Fetched: 1 row(s)
hive> MSCK REPAIR TABLE test.partition_test;
OK
Time taken: 3.451 seconds
hive> SELECT * FROM test.partition_test;
OK
1 NULL 2022-09-04
Time taken: 0.186 seconds, Fetched: 1 row(s)

原因分析:

在修改分区表的字段的时候,我们使用常规的手段来添加字段,其实该表对应的location上的数据已经更新了,但是Hive用的是自身的元数据,所以查出来为空,其实数据已经插入进去了。

这个时候,可能大家会想到用​​MSCK REPAIR TABLE​​的办法来修复分区,如上面测试下来,会发现是无效的。

​MACK REPAIR TABLE​​​命令主要是用来解决通过​​hdfs dfs -put​​或者hdfs api写入hive分区表的数据在hive中无法被查询到的问题。

Hive里面有一个叫metastore的服务,主要存储一些源数据信息,例如数据库名,表名,分区信息等。如果不是通过常规的insert导入的数据,有很多分区信息在这里面是没有的,所以很容易想到用​​ALTER TABLE table_name DROP/ADD PARTITION​​。这种方式是可行的,但是如果需要大量修改分区,就显得不太友好。

解决方案:

  1. CASCADE
  2. ALTER TABLE TABLE_NAME DROP/ADD PARTITION
  3. DROP/CREATE TABLE
  1. cascade(推荐)
    将修改表的SQL替换如下

ALTER TABLE test.partition_test ADD columns(id string)CASCADE;

重新导入数据,就可以查询到了。加上cascade关键字,会级联更新,同时刷新表与分区。

  1. ALTER TABLE table_name DROP/ADD PARTITION
    先删除当前分区,再重新添加(适用于分区少的情况)

ALTER TABLE test.partition_test DROP partition(dt = '2022-09-04');
ALTER TABLE test.partition_test ADD partition(dt = '2022-09-04');

  1. drop/create table(暴力方式,不推荐)

DROP TABLE test.partition_test;
CREATE TABLE test.partition_test (value STRING) PARTITIONED BY (dt STRING);


举报

相关推荐

0 条评论