Posted to user@hive.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/05/02 15:00:55 UTC

external table or gz compressed file

Hello,

Can somebody please explain to me, or point me in the right direction on,
how Hive handles gz compressed files if I create an external table
pointing to a .gz compressed file stored on AWS S3.
Does Hive copy the file to HDFS and decompress it before it uses the
file, or does it use the file directly?
If we use a decompressed file stored on S3, does Hive still copy the file
to HDFS, or does it read records directly from S3?
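
For concreteness, the setup I mean looks roughly like this (a sketch; the bucket, table and column names are made up):

CREATE EXTERNAL TABLE logs (
  ts      STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/logs/';  -- directory containing one or more .gz files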

Please help me understand how this works.

Thanking You,

-- 
Regards,
Ouch Whisper
010101010101

Re: external table or gz compressed file

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi

INPUT
=====
Hive can read gz files out of the box, with NO additional configuration: Hadoop picks the codec from the .gz file extension and decompresses the records on the fly.
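
For example, given an external table whose LOCATION holds .gz files, a plain query just works (table name is hypothetical):

-- No compression-related settings are needed for reading;
-- the .gz extension is enough for Hadoop to pick the codec.
SELECT COUNT(*) FROM logs;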

OUTPUT
======
If you want Hive to write compressed output files (say gz), then add the following at the beginning of the Hive SQL:
SET hive.exec.compress.output=true;
SET mapred.reduce.tasks=16;    -- this will create at most 16 gzip files as the output of your Hive query
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
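
Put together, a query that writes gzipped output might look like this (a sketch; the table and output directory are made up):

SET hive.exec.compress.output=true;
SET mapred.reduce.tasks=16;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Writes at most 16 part-*.gz files under the target directory.
INSERT OVERWRITE DIRECTORY '/user/hive/output/logs_gz'
SELECT * FROM logs
DISTRIBUTE BY ts;  -- forces a reduce stage so mapred.reduce.tasks takes effect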


Side note (may or may not be relevant to you, but nevertheless):
You may know that GZIP is not splittable, so unless you have a definite reason to use GZIP (for example, multiple lines in a log file actually constitute one logical object or record), I would recommend LZO.
A little bit of plumbing is required, since LZO was discontinued from Hadoop out of the box, but it's pretty straightforward. And remember to run the LZO indexer on your output so that the LZO files can be split going forward. A sketch of both steps follows.
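
On the Hive side, LZO output is just a different codec (settings as above, output path is made up):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Then, from the shell, index the output so the files become splittable (the jar path is an assumption; point it at your hadoop-lzo build):

hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/output/logs_lzo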


Thanks

sanjay

From: Panshul Whisper <ou...@gmail.com>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Thursday, May 2, 2013 6:00 AM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: external table or gz compressed file

