Posted to user@hive.apache.org by zuohua zhang <zu...@gmail.com> on 2012/10/02 21:53:28 UTC
best way to load millions of gzip files in hdfs to one table in hive?
I have millions of gzip files in HDFS (all with the same fields), and I would like to
load them into one Hive table with a specified schema.
What is the most efficient way to do that?
Given that my data is already in HDFS, and already gzipped, does that mean I
could simply set up the table in a way that bypasses the unnecessary
overhead of the typical load approach?
Thanks!
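(For reference, the external-table route suggested in the replies can be sketched directly. Everything here is illustrative: the table name, columns, delimiter, and HDFS path are hypothetical, and it assumes the gzip files contain newline-delimited text. Hive's TEXTFILE storage decompresses .gz files transparently, so no copy or LOAD DATA step is needed.)

```sql
-- Hypothetical schema and path; adjust to the real fields.
CREATE EXTERNAL TABLE my_table (
  id    STRING,
  ts    STRING,
  value STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/data/gzfiles';
```

Because the table is EXTERNAL, dropping it later leaves the underlying files in place.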
Re: best way to load millions of gzip files in hdfs to one table in hive?
Posted by Abhishek <ab...@gmail.com>.
Hi Edward,
I am kind of interested in this. For crush to work, do we need to install anything?
How can it be used on a cluster?
Regards
Abhi
Sent from my iPhone
Re: best way to load millions of gzip files in hdfs to one table in hive?
Posted by Edward Capriolo <ed...@gmail.com>.
You may want to use:
https://github.com/edwardcapriolo/filecrush
We use this to deal with pathological cases, although the best idea is
to avoid the problem altogether.
Edward
Re: best way to load millions of gzip files in hdfs to one table in hive?
Posted by Alexander Pivovarov <ap...@gmail.com>.
Options
1. Create a managed table and put the files under the table's directory.
2. Create an external table and point it at the directory holding the files.
3. If the files are small, I recommend creating a new set of files using a
simple MR program, specifying the number of reduce tasks. The goal is to make
each file larger than the HDFS block size (this saves NameNode memory, and reads will be faster).
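(Option 3 can also be done from within Hive instead of a hand-written MR job. A hedged sketch, assuming a source table `events` is already defined over the small files; the table names are hypothetical and the `mapred.*` property names are the Hadoop 1.x-era ones in use at the time.)

```sql
-- Re-emit the data through a reduce stage so the number of output
-- files equals the reducer count; pick the count so each file
-- exceeds the HDFS block size.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.reduce.tasks=32;

CREATE TABLE events_compacted LIKE events;

INSERT OVERWRITE TABLE events_compacted
SELECT * FROM events
DISTRIBUTE BY id;  -- forces a shuffle, so reducers (not mappers) write the files
```

One caveat: large .gz output files are not splittable, so for files much bigger than a block, a splittable codec or SequenceFile output may be preferable.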