Posted to user@hive.apache.org by Mark Grover <gr...@gmail.com> on 2012/11/13 19:54:19 UTC

Re: Hive Loading Zip CSV Files

bcc: cdh-user

This question might be more appropriate for the Apache Hive user list, so
redirecting it there.

However, to answer your question:
From the little I've read about PKZip, it follows the standard zip format.
So the question you are really asking is if Hive supports reading from zip
files. As far as I know, the answer is no. This is because Hadoop doesn't
have an InputFormat for reading zip files:
https://issues.apache.org/jira/browse/MAPREDUCE-210
There is also a Hive user email thread that tackles the same question:
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCAENxBwxkF--3PzCkpz1HX21=Gb9YVASr2JL0U3yUL2tfGu010Q@mail.gmail.com%3E

Having said that, a possible workaround would be to unzip the zip files and
use a different compression codec (e.g. Snappy) on SequenceFiles for
storing your files on HDFS.
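
The unzip step of that workaround needs nothing beyond the JDK. What follows is only a minimal sketch, assuming the archive holds plain-text UTF-8 CSV entries. The class name ZipCsvExtractor and its methods are hypothetical and are not part of Hadoop or Hive. Writing the extracted lines out as a Snappy-compressed SequenceFile is not shown, since that depends on the Hadoop libraries.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Hadoop or Hive.
final class ZipCsvExtractor {

    // Returns the lines of every non-directory entry in the zip stream.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // The reader is not closed here: closing it would also
                // close the underlying zip stream before later entries.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    // Builds a one-entry zip archive in memory, used only for the demo.
    static byte[] zipOf(String name, String content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = zipOf("sample.csv", "1,foo\n2,bar\n");
        for (String line : readLines(new ByteArrayInputStream(archive))) {
            System.out.println(line);
        }
    }
}
```

In practice the input would be a stream opened on the real zip file rather than an in-memory archive.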

Good luck!
Mark



On Tue, Nov 13, 2012 at 9:17 AM, ben <bb...@gmail.com> wrote:

> Anybody ever try to load CSV files compressed using PKZip into a Hive
> table stored as Sequence Files? Is there a SerDe out there for this?
>
> Thanks,
> Ben

Re: Hive Loading Zip CSV Files

Posted by Mark Grover <gr...@gmail.com>.
Hi Ben,
If you have specific questions, I'd suggest you post them on the
appropriate user mailing lists (user@hadoop.apache.org, user@hive.apache.org,
etc.).

Mark

On Tue, Nov 13, 2012 at 5:51 PM, ben <bb...@gmail.com> wrote:

> Hi Mark,
>
> I'm trying to incorporate the patch, but I'm having a hard time with
> deprecated classes and methods. If you have the time, can you give me some
> advice on how to modify it?
>
> Thanks,
> Ben
>
>
> On Tuesday, November 13, 2012 3:32:34 PM UTC-8, Mark Grover wrote:
>
>> Ben,
>> The JIRA I mentioned in the previous email
>> (https://issues.apache.org/jira/browse/MAPREDUCE-210)
>> has a patch attached to it. I haven't reviewed the patch myself but it
>> seems like that might be a good starting point.
>>
>> Feel free to search for other blogs/articles/wiki pages describing how to
>> write your own InputFormat.
>>
>> Mark
>>
>>
>> On Tue, Nov 13, 2012 at 2:57 PM, ben <bb...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> Can you direct me to a resource for creating a ZipInputFormat class for
>>> use as a Hive InputFormat?
>>>
>>> Thanks,
>>> Ben

Re: Hive Loading Zip CSV Files

Posted by Mark Grover <gr...@gmail.com>.
Ben,
The JIRA I mentioned in the previous email (
https://issues.apache.org/jira/browse/MAPREDUCE-210) has a patch attached
to it. I haven't reviewed the patch myself but it seems like that might be
a good starting point.

Feel free to search for other blogs/articles/wiki pages describing how to
write your own InputFormat.

Mark

On Tue, Nov 13, 2012 at 2:57 PM, ben <bb...@gmail.com> wrote:

> Mark,
>
> Can you direct me to a resource for creating a ZipInputFormat class for
> use as a Hive InputFormat?
>
> Thanks,
> Ben

Re: Hive Loading Zip CSV Files

Posted by Mark Grover <gr...@gmail.com>.
Adding user@hive.apache.org

Ben,
That's great to hear. It would be awesome if you'd like to contribute this
back to Hive so others in the community could use it too. Let us know what
you think!

Mark

On Wed, Nov 28, 2012 at 8:05 PM, ben <bb...@gmail.com> wrote:

> Mark,
>
> Just wanted to let you know that I got it to work. By implementing the
> mapred InputFormat and RecordReader classes, I got Hive to change the error
> to something like "must be BytesWritable or Text" that needs to be
> returned. All I had to do was take out the line that sets the key in the
> "next" method of the RecordReader. Then, voila. It worked! Now, I can
> manipulate the data all I want using just Hive scripts. This removes all
> the custom extraction and conversion map reduce jobs making it just a 2
> step process. Plus, Hive does it all faster and cleaner.
>
> Thanks for all your help.
>
> Cheers,
> Ben
>
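
For readers following along, the shape of the fix Ben describes might look roughly like the sketch below. It is not his code. It mirrors the next(key, value) contract of the old org.apache.hadoop.mapred.RecordReader interface, but uses StringBuilder in place of Hadoop's Text so that it runs with only the JDK. The class name ZipLineReader is hypothetical, and the assumption is that each line of each zip entry becomes one record.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for a mapred RecordReader. StringBuilder plays the
// role of a reusable Hadoop Text object so the sketch needs only the JDK.
final class ZipLineReader implements Closeable {
    private final ZipInputStream zip;
    private BufferedReader current; // reader over the entry being consumed

    ZipLineReader(InputStream in) {
        this.zip = new ZipInputStream(in);
    }

    // Mirrors next(key, value): fills value with the next line and returns
    // true, or returns false once the archive is exhausted. The key is
    // never modified, following the fix described in the message above.
    boolean next(StringBuilder key, StringBuilder value) throws IOException {
        while (true) {
            if (current == null) {
                ZipEntry entry = zip.getNextEntry();
                if (entry == null) {
                    return false; // no more entries
                }
                if (entry.isDirectory()) {
                    continue;
                }
                current = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
            }
            String line = current.readLine();
            if (line != null) {
                value.setLength(0);
                value.append(line);
                return true;
            }
            current = null; // entry exhausted, move on to the next one
        }
    }

    @Override
    public void close() throws IOException {
        zip.close();
    }

    public static void main(String[] args) throws IOException {
        // Build a small archive in memory so the demo is self-contained.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.csv"));
            out.write("1,foo\n2,bar\n".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        StringBuilder key = new StringBuilder();
        StringBuilder value = new StringBuilder();
        try (ZipLineReader reader = new ZipLineReader(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            while (reader.next(key, value)) {
                System.out.println(value);
            }
        }
    }
}
```

A real implementation would implement the Hadoop interface itself, including createKey, createValue, getPos and getProgress, and would be paired with an InputFormat that does not split zip files.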

Re: Hive Loading Zip CSV Files

Posted by Mark Grover <gr...@gmail.com>.
bcc: cdh-user

Hi Ben,
My apologies for the delayed response.

I don't have any other specific resources I can direct you to, sorry. Your
best bet is to search online to see examples.

I did a quick search. This looks like a good one:
https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive
However, again, I haven't personally used it, so I can't vouch for it.

Here is an example from the Hive source code:
http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64/Base64TextInputFormat.java?view=markup

Hope that helps.
Mark


On Tue, Nov 13, 2012 at 1:47 PM, ben <bb...@gmail.com> wrote:

> Hi Mark,
>
> Can you direct me to where I could create my own InputFormat for Zip
> Files? To create a ZipFileInputFormat for Hive?
>
> Thanks,
> Ben
>
>
> On Tuesday, November 13, 2012 10:54:25 AM UTC-8, Mark Grover wrote:
>
>> bcc: cdh-user
>>
>> This question might be more appropriate for the Apache Hive user list, so
>> redirecting it there.
>>
>> However to answer your question:
>> From the little I've read about PKZip, they follow the standard zip
>> format. So the question you are really asking is if Hive supports reading
>> from zip files. As far as I know, the answer is no. This is because Hadoop
>> doesn't have an InputFormat for reading zip files:
>> https://issues.apache.org/jira/browse/MAPREDUCE-210
>> There is also a Hive user email thread that tackles the same question:
>> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCAENxBwxkF--3PzCkpz1HX21=Gb9YVASr2JL0U3yUL2tfGu010Q@mail.gmail.com%3E
>>
>> Having said that, a possible workaround would be to unzip the zip files
>> and use a different compression codec (e.g. Snappy) on SequenceFiles for
>> storing your files on HDFS.
>>
>> Good luck!
>> Mark
>>
>>
>>
>> On Tue, Nov 13, 2012 at 9:17 AM, ben <bb...@gmail.com> wrote:
>>
>>> Anybody ever try to load CSV files compressed using PKZip into a Hive
>>> table stored as Sequence Files? Is there a SerDe out there for this?
>>>
>>> Thanks,
>>> Ben