For enabling LZO support in Hadoop, see the previous post:
https://blog.csdn.net/qq_45494908/article/details/122518940?spm=1001.2014.3001.5501
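Reference: https://blog.csdn.net/TomAndersen/article/details/106892522
1. In core-site.xml, add the codec classes for lzo and lzop compression to the io.compression.codecs parameter, and set the io.compression.codec.lzo.class parameter
<!-- Declare the codecs of the available compression algorithms -->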
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
org.apache.hadoop.io.compress.Lz4Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
<description>
A comma-separated list of the compression codec classes that can
be used for compression/decompression. In addition to any classes specified
with this property (which take precedence), codec classes on the classpath
are discovered using a Java ServiceLoader.
</description>
</property>
<!-- Settings for the lzo codec -->
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
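A quick way to confirm that the edit above took effect is to parse the file and check that both LZO classes are listed. The following is a minimal sketch, assuming Python is available on the node; the embedded `CORE_SITE` sample and the helper names `read_properties` and `missing_lzo_codecs` are made up for illustration, and in practice the text would be read from your own core-site.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; in practice, read the text of your own core-site.xml.
CORE_SITE = """<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>
      org.apache.hadoop.io.compress.GzipCodec,
      com.hadoop.compression.lzo.LzoCodec,
      com.hadoop.compression.lzo.LzopCodec
    </value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>"""

# The two classes this post adds to io.compression.codecs.
REQUIRED = (
    "com.hadoop.compression.lzo.LzoCodec",
    "com.hadoop.compression.lzo.LzopCodec",
)


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


def missing_lzo_codecs(xml_text):
    """List the required LZO classes that io.compression.codecs does not name."""
    codecs = read_properties(xml_text).get("io.compression.codecs", "")
    listed = {c.strip() for c in codecs.split(",") if c.strip()}
    return [c for c in REQUIRED if c not in listed]


print(missing_lzo_codecs(CORE_SITE))
```

An empty list means both classes are present. This only checks the text of the file; it does not verify that the hadoop-lzo jar and native library are actually installed.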
2. In mapred-site.xml, set the compression used when MR jobs run
<!-- Whether to compress map output -->
<!-- Default: false -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
<description>
Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<!-- Codec used to compress map output; set to LzoCodec here, which produces files with the .lzo_deflate suffix -->
<!-- Default: org.apache.hadoop.io.compress.DefaultCodec -->
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>
If the map outputs are compressed, how should they be compressed?
</description>
</property>
<!-- Whether to compress the final output files of an MR job -->
<!-- Default: false -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<!-- Codec used to compress the final output files of an MR job; set to LzoCodec here, which produces files with the .lzo_deflate suffix -->
<!-- Default: org.apache.hadoop.io.compress.DefaultCodec -->
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
<!-- Compression type for SequenceFile output -->
<!-- Default: RECORD -->
<property>
<name>mapreduce.output.fileoutputformat.compress.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
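The codec chosen above determines the suffix of the output files, which matters later when deciding how a file can be read back. The sketch below is a simplified stand-in for a lookup by file suffix; the extension table reflects my understanding of what each codec reports as its default extension, and the function name `codec_for_file` is hypothetical.

```python
# Default file extension per codec class (listed from memory of each codec's
# default extension; verify against the versions you run).
CODEC_EXTENSIONS = {
    "org.apache.hadoop.io.compress.DefaultCodec": ".deflate",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.SnappyCodec": ".snappy",
    "org.apache.hadoop.io.compress.Lz4Codec": ".lz4",
    "com.hadoop.compression.lzo.LzoCodec": ".lzo_deflate",
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
}


def codec_for_file(filename):
    """Return the codec class whose default extension matches the file name."""
    for codec, extension in CODEC_EXTENSIONS.items():
        if filename.endswith(extension):
            return codec
    return None  # no known suffix, so treat the file as uncompressed


print(codec_for_file("part-r-00000.lzo_deflate"))
print(codec_for_file("emp.txt.lzo"))
```

Note that `.lzo_deflate` (LzoCodec) and `.lzo` (LzopCodec, the format the `lzop` command-line tool writes) map to different classes, so job output written with the settings above is not the same container as a file compressed with `lzop`.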
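Configuring Hive
In $HIVE_HOME/conf/hive-site.xml, set the following parameters so that Hive compresses its query output; by default, the compression algorithm used is the one configured in Hadoop
<!-- Whether to compress the output files of Hive statements; the compression algorithm and
format depend on the corresponding Hadoop settings -->
<!-- Default: false -->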
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
<description>
This controls whether the final outputs of a query (to a local/HDFS file or a Hive table)
is compressed.
The compression codec and other options are determined from Hadoop config variables
mapred.output.compress*
</description>
</property>
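<!-- Whether to compress the intermediate result files passed between multiple MR jobs; the
compression algorithm and format depend on the corresponding Hadoop settings -->
<!-- Default: false -->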
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description>
This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed.
The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
</description>
</property>
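The Hive descriptions above refer to `mapred.output.compress*`, the older property names, while the mapred-site.xml settings earlier in this post use the `mapreduce.*` names. As far as I know, Hadoop treats the old names as deprecated aliases of the new ones. The sketch below lists the pairs relevant here; the helper name `current_name` is hypothetical.

```python
# Legacy (mapred.*) compression properties and their current (mapreduce.*) names.
LEGACY_TO_CURRENT = {
    "mapred.compress.map.output": "mapreduce.map.output.compress",
    "mapred.map.output.compression.codec": "mapreduce.map.output.compress.codec",
    "mapred.output.compress": "mapreduce.output.fileoutputformat.compress",
    "mapred.output.compression.codec": "mapreduce.output.fileoutputformat.compress.codec",
    "mapred.output.compression.type": "mapreduce.output.fileoutputformat.compress.type",
}


def current_name(key):
    """Translate a legacy property name; unknown keys are returned unchanged."""
    return LEGACY_TO_CURRENT.get(key, key)


print(current_name("mapred.output.compression.codec"))
```

So when hive.exec.compress.output is true, the codec Hive picks up is the one set under mapreduce.output.fileoutputformat.compress.codec in the Hadoop configuration, unless it is overridden for the session.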
Create a table that supports LZO-compressed data and load data into it
[liqiang@Gargantua data]$ lzop emp.txt
# Start hive
hive (default)> CREATE TABLE emp_lzo like emp_hive
> STORED AS
> INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
> OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Prepare a file in lzo format and load it
hive (default)> LOAD DATA LOCAL INPATH '/home/liqiang/data/emp.txt.lzo' INTO TABLE emp_lzo;
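LOAD DATA moves the file into the table directory as-is, so a file that merely carries the .lzo suffix would only fail later at query time. A small pre-check is to compare the first bytes of the file against the lzop magic number. The following is a sketch under that assumption; the function name `looks_like_lzop` is hypothetical, and the temporary file in the demo stands in for a real file such as emp.txt.lzo.

```python
import os
import tempfile

# Magic bytes at the start of files written by the lzop tool.
LZOP_MAGIC = bytes([0x89, 0x4C, 0x5A, 0x4F, 0x00, 0x0D, 0x0A, 0x1A, 0x0A])


def looks_like_lzop(path):
    """Return True if the file begins with the lzop magic number."""
    with open(path, "rb") as f:
        return f.read(len(LZOP_MAGIC)) == LZOP_MAGIC


# Demo with a throwaway file that only imitates an lzop header.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "emp.txt.lzo")
    with open(sample, "wb") as f:
        f.write(LZOP_MAGIC + b"not real compressed data")
    print(looks_like_lzop(sample))
```

This checks the header only; it does not decompress the file or validate its contents.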
Query
hive (default)> select * from emp_lzo;
OK
emp_lzo.empno emp_lzo.ename emp_lzo.job emp_lzo.mgr emp_lzo.hiredate emp_lzo.sal emp_lzo.comm emp_lzo.deptno
7369 SMITH CLERK 7902 NULL 800 NULL 20
7499 ALLEN SALESMAN 7698 NULL 1600 300 30
7521 WARD SALESMAN 7698 NULL 1250 500 30
7566 JONES MANAGER 7839 NULL 2975 NULL 20
7654 MARTIN SALESMAN 7698 NULL 1250 1400 30
7698 BLAKE MANAGER 7839 NULL 2850 NULL 30