Posted to common-user@hadoop.apache.org by C G <pa...@yahoo.com> on 2007/08/30 17:21:36 UTC

Compression using Hadoop...

Hello All:
   
  I think I must be missing something fundamental.  Is it possible to load compressed data into HDFS, and then operate on it directly with map/reduce?  I see a lot of stuff in the docs about writing compressed outputs, but nothing about reading compressed inputs.
   
  Am I being ponderously stupid here?
   
  Any help/comments appreciated...
   
  Thanks,
  C G

       

Re: Compression using Hadoop...

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote:
>Arun C Murthy wrote:
>>One way to reap benefits of both compression and better parallelism is to 
>>use compressed SequenceFiles: 
>>http://wiki.apache.org/lucene-hadoop/SequenceFile
>>
>>Of course this means you will have to do a conversion from .gzip to .seq 
>>file and load it onto HDFS for your job, which should be a fairly simple 
>>piece of code.
>
>We really need someone to contribute an InputFormat for bzip files. 
>This has come up before: bzip is a standard compression format that is 
>splittable.
>
>Another InputFormat that would be handy is zip.  Zip archives, unlike 
>tar files, can be split by reading the table of contents.  So one could 
>package a bunch of tiny files as a zip file, then the input format could 
>split the zip file into splits that each contain a number of files 
>inside the zip.  Each map task would then have to read the table of 
>contents from the file, but could then seek directly to the files in its 
>split without scanning the entire file.
>
>Should we file jira issues for these?  Any volunteers who're interested 
>in implementing these?
>

Please file the bzip and zip issues, Doug. I'll try to get to them in the short term unless someone is more interested and wants to scratch that itch right away.

thanks,
Arun

>Doug

Re: Compression using Hadoop...

Posted by Doug Cutting <cu...@apache.org>.
Ted Dunning wrote:
> I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel.  I needed to add 10 source roots in IntelliJ to get a clean compile.  In this process, I noticed some circular dependencies.
> 
> Would the committers be open to a small set of changes to remove cyclic dependencies?  

It doesn't hurt to propose them.  Even if they get shot down, we'll gain 
a log of the rationale.  There are great benefits to clean layers, but 
we also can't gratuitously break source-code compatibility.

Doug

RE: Compression using Hadoop...

Posted by Ted Dunning <td...@veoh.com>.
I was just looking at this.  It looks relatively easy to simply create a new codec and register it in the config files.
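
Something like this for the registration side (just a sketch; com.example.BzipCodec is a placeholder for a class implementing org.apache.hadoop.io.compress.CompressionCodec, and the same comma-separated value could equally be set in hadoop-site.xml, assuming the io.compression.codecs property the codec factory reads):

import org.apache.hadoop.conf.Configuration;

// Registration sketch only -- the codec class itself is the real work.
// Hadoop builds its codec list from the comma-separated
// io.compression.codecs property; adding a class name here (or in the
// site config) makes the codec available by file extension.
public class RegisterCodec {
  public static Configuration withExtraCodec(Configuration conf) {
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "org.apache.hadoop.io.compress.GzipCodec,"
        + "com.example.BzipCodec");   // hypothetical new codec class
    return conf;
  }
}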

I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel.  I needed to add 10 source roots in IntelliJ to get a clean compile.  In this process, I noticed some circular dependencies.

Would the committers be open to a small set of changes to remove cyclic dependencies?  


-----Original Message-----
From: Milind Bhandarkar [mailto:milindb@yahoo-inc.com]
Sent: Fri 8/31/2007 11:53 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
 

On 8/31/07 10:43 AM, "Doug Cutting" <cu...@apache.org> wrote:

>
> We really need someone to contribute an InputFormat for bzip files.
> This has come up before: bzip is a standard compression format that is
> splittable.
 
+1


- milind
--
Milind Bhandarkar
408-349-2136
(milindb@yahoo-inc.com)



Re: Compression using Hadoop...

Posted by Milind Bhandarkar <mi...@yahoo-inc.com>.
On 8/31/07 10:43 AM, "Doug Cutting" <cu...@apache.org> wrote:

>
> We really need someone to contribute an InputFormat for bzip files.
> This has come up before: bzip is a standard compression format that is
> splittable.
 
+1


- milind
--
Milind Bhandarkar
408-349-2136
(milindb@yahoo-inc.com)


Re: Compression using Hadoop...

Posted by Doug Cutting <cu...@apache.org>.
Arun C Murthy wrote:
> One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile
> 
> Of course this means you will have to do a conversion from .gzip to .seq file and load it onto HDFS for your job, which should be a fairly simple piece of code.

We really need someone to contribute an InputFormat for bzip files. 
This has come up before: bzip is a standard compression format that is 
splittable.

Another InputFormat that would be handy is zip.  Zip archives, unlike 
tar files, can be split by reading the table of contents.  So one could 
package a bunch of tiny files as a zip file, then the input format could 
split the zip file into splits that each contain a number of files 
inside the zip.  Each map task would then have to read the table of 
contents from the file, but could then seek directly to the files in its 
split without scanning the entire file.

Should we file jira issues for these?  Any volunteers who're interested 
in implementing these?

Doug
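
A minimal illustration of the table-of-contents idea (not an InputFormat, just the seek-to-entry part it would rely on), using java.util.zip against a local copy of the archive:

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Idea sketch: the zip central directory lists every entry up front, so a
// reader can jump straight to the entries in its split instead of scanning
// the whole archive.  A real InputFormat would enumerate entries once at
// job-submission time, group them into splits, and have each record reader
// open only its own entries (and read the archive through HDFS rather than
// the local filesystem, as done here for simplicity).
public class ZipTableOfContents {
  public static void main(String[] args) throws Exception {
    ZipFile zip = new ZipFile(args[0]);
    Enumeration<? extends ZipEntry> entries = zip.entries();
    while (entries.hasMoreElements()) {
      ZipEntry e = entries.nextElement();
      // Each entry (or group of entries) could become one split; reading it
      // later is a direct zip.getInputStream(e), no full scan required.
      System.out.println(e.getName() + "\t" + e.getSize());
    }
    zip.close();
  }
}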

Re: Compression using Hadoop...

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Fri, Aug 31, 2007 at 10:22:18AM -0700, Ted Dunning wrote:
>
>They will only be a non-issue if you have enough of them to get the parallelism you want.  If the number of gzip files is > 10 * the number of task nodes, you should be fine.
>

One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile

Of course this means you will have to do a conversion from .gzip to .seq file and load it onto HDFS for your job, which should be a fairly simple piece of code.

Arun
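
A minimal sketch of that conversion (class name, argument handling, and the LongWritable/Text key-value choice are placeholders for illustration): read the local .gz line by line and write a block-compressed SequenceFile to HDFS, which the job can then split.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads a local gzipped text file and writes its lines into a
// block-compressed SequenceFile on HDFS, which map/reduce can split.
public class GzipToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK);

    BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(args[0]))));

    String line;
    long lineNo = 0;
    while ((line = in.readLine()) != null) {
      writer.append(new LongWritable(lineNo++), new Text(line));
    }
    in.close();
    writer.close();
  }
}

Run it with the local .gz path and the HDFS output path as its two arguments.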

>
>-----Original Message-----
>From: jason.gessner@gmail.com on behalf of jason gessner
>Sent: Fri 8/31/2007 9:38 AM
>To: hadoop-user@lucene.apache.org
>Subject: Re: Compression using Hadoop...
> 
>ted, will the gzip files be a non-issue as far as splitting goes if
>they are under the default block size?
>
>C G, glad i could help a little.
>
>-jason
>
>On 8/31/07, C G <pa...@yahoo.com> wrote:
>> Thanks Ted and Jason for your comments.  Ted, your comments about gzip not being splittable were very timely...I'm watching my 8 node cluster saturate one node (with one gz file) and was wondering why.  Thanks for the "answer in advance" :-).
>>
>> Ted Dunning <td...@veoh.com> wrote:
>> With gzipped files, you do face the problem that your parallelism in the map
>> phase is pretty much limited to the number of files you have (because
>> gzip'ed files aren't splittable). This is often not a problem since most
>> people can arrange to have dozens to hundreds of input files more easily than
>> they can arrange to have dozens to hundreds of CPU cores working on their
>> data.


RE: Compression using Hadoop...

Posted by Ted Dunning <td...@veoh.com>.
My 10x was very rough.

I based it on:

a) you want a few files per map task
b) you want a map task per core

I tend to use quad core machines and so I used 2 x 4 = 8, or roughly 10.

On EC2, you don't have multi-core machines (I think) so you might be fine with 2-4 files per CPU.


-----Original Message-----
From: C G [mailto:parallelguy@yahoo.com]
Sent: Fri 8/31/2007 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Compression using Hadoop...
 
> Ted, from what you are saying I should be using at least 80 files given the cluster size, and I should modify the loader to be aware 
> of the number of nodes and split accordingly. Do you concur?

RE: Compression using Hadoop...

Posted by C G <pa...@yahoo.com>.
My input is typical row-based stuff over which I run a large stack of aggregations/rollups.  After reading earlier posts on this thread, I modified my loader to split the input up into 1M-row partitions (literally gunzip -cd input.gz | split...).  I then ran an experiment using 50M rows (i.e. 50 gz files loaded into HDFS) on an 8 node cluster. Ted, from what you are saying I should be using at least 80 files given the cluster size, and I should modify the loader to be aware of the number of nodes and split accordingly. Do you concur?
   
  Load time to HDFS may be the next challenge.  My HDFS configuration on 8 nodes uses a replication factor of 3.  Sequentially copying my data to HDFS using -copyFromLocal took 23 minutes to move 266M in individual files of 5.7M each.  Does anybody find this result surprising?  Note that this is on EC2, where there is no such thing as rack-level or switch-level locality.  Should I expect dramatically better performance on real iron?  Once I get this prototyping/education under my belt, my plan is to deploy a 64 node grid of 4-way machines with a terabyte of local storage on each node.
   
  Thanks for the discussion...the Hadoop community is very helpful!
   
  C G 
    

Ted Dunning <td...@veoh.com> wrote:
  
They will only be a non-issue if you have enough of them to get the parallelism you want. If the number of gzip files is > 10 * the number of task nodes, you should be fine.


-----Original Message-----
From: jason.gessner@gmail.com on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...

ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G 
wrote:
> Thanks Ted and Jason for your comments. Ted, your comments about gzip not being splittable were very timely...I'm watching my 8 node cluster saturate one node (with one gz file) and was wondering why. Thanks for the "answer in advance" :-).
>
> Ted Dunning wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem since most
> people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.


       

RE: Compression using Hadoop...

Posted by Ted Dunning <td...@veoh.com>.
They will only be a non-issue if you have enough of them to get the parallelism you want.  If the number of gzip files is > 10 * the number of task nodes, you should be fine.


-----Original Message-----
From: jason.gessner@gmail.com on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
 
ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G <pa...@yahoo.com> wrote:
> Thanks Ted and Jason for your comments.  Ted, your comments about gzip not being splittable were very timely...I'm watching my 8 node cluster saturate one node (with one gz file) and was wondering why.  Thanks for the "answer in advance" :-).
>
> Ted Dunning <td...@veoh.com> wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem since most
> people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.

Re: Compression using Hadoop...

Posted by jason gessner <ja...@multiply.org>.
ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G <pa...@yahoo.com> wrote:
> Thanks Ted and Jason for your comments.  Ted, your comments about gzip not being splittable were very timely...I'm watching my 8 node cluster saturate one node (with one gz file) and was wondering why.  Thanks for the "answer in advance" :-).
>
> Ted Dunning <td...@veoh.com> wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem since most
> people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.
>
>
> On 8/30/07 8:46 AM, "jason gessner" wrote:
>
> > if you put .gz files up on your HDFS cluster you don't need to do
> > anything to read them. I see lots of extra control via the API, but i
> > have simply put the files up and run my jobs on them.
> >
> > -jason
> >
> > On 8/30/07, C G
> wrote:
> >> Hello All:
> >>
> >> I think I must be missing something fundamental. Is it possible to load
> >> compressed data into HDFS, and then operate on it directly with map/reduce?
> >> I see a lot of stuff in the docs about writing compressed outputs, but
> >> nothing about reading compressed inputs.
> >>
> >> Am I being ponderously stupid here?
> >>
> >> Any help/comments appreciated...
> >>
> >> Thanks,
> >> C G
> >>
> >>

Re: Compression using Hadoop...

Posted by C G <pa...@yahoo.com>.
Thanks Ted and Jason for your comments.  Ted, your comments about gzip not being splittable were very timely...I'm watching my 8 node cluster saturate one node (with one gz file) and was wondering why.  Thanks for the "answer in advance" :-).

Ted Dunning <td...@veoh.com> wrote:  
With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzip'ed files aren't splittable). This is often not a problem since most
people can arrange to have dozens to hundreds of input files more easily than
they can arrange to have dozens to hundreds of CPU cores working on their
data. 


On 8/30/07 8:46 AM, "jason gessner" wrote:

> if you put .gz files up on your HDFS cluster you don't need to do
> anything to read them. I see lots of extra control via the API, but i
> have simply put the files up and run my jobs on them.
> 
> -jason
> 
> On 8/30/07, C G 
wrote:
>> Hello All:
>> 
>> I think I must be missing something fundamental. Is it possible to load
>> compressed data into HDFS, and then operate on it directly with map/reduce?
>> I see a lot of stuff in the docs about writing compressed outputs, but
>> nothing about reading compressed inputs.
>> 
>> Am I being ponderously stupid here?
>> 
>> Any help/comments appreciated...
>> 
>> Thanks,
>> C G
>> 
>> 

Re: Compression using Hadoop...

Posted by Ted Dunning <td...@veoh.com>.
With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzip'ed files aren't splittable).  This is often not a problem since most
people can arrange to have dozens to hundreds of input files more easily than
they can arrange to have dozens to hundreds of CPU cores working on their
data. 


On 8/30/07 8:46 AM, "jason gessner" <ja...@multiply.org> wrote:

> if you put .gz files up on your HDFS cluster you don't need to do
> anything to read them.  I see lots of extra control via the API, but i
> have simply put the files up and run my jobs on them.
> 
> -jason
> 
> On 8/30/07, C G <pa...@yahoo.com> wrote:
>> Hello All:
>> 
>>   I think I must be missing something fundamental.  Is it possible to load
>> compressed data into HDFS, and then operate on it directly with map/reduce?
>> I see a lot of stuff in the docs about writing compressed outputs, but
>> nothing about reading compressed inputs.
>> 
>>   Am I being ponderously stupid here?
>> 
>>   Any help/comments appreciated...
>> 
>>   Thanks,
>>   C G
>> 
>> 


Re: Compression using Hadoop...

Posted by jason gessner <ja...@multiply.org>.
if you put .gz files up on your HDFS cluster you don't need to do
anything to read them.  I see lots of extra control via the API, but i
have simply put the files up and run my jobs on them.

-jason
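
For anyone who wants to see it spelled out, a bare-bones job along those lines (paths, job name, and the identity mapper/reducer are just placeholders, not from the thread); note that nothing gzip-specific is configured -- TextInputFormat recognizes the .gz extension and decompresses on the fly, at the cost of one unsplit map per file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Runs a pass over a directory of .gz text files with no special handling:
// TextInputFormat hands each decompressed line to the mapper as
// (byte offset, line) pairs, exactly as it would for plain text.
public class ReadGzippedInput {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReadGzippedInput.class);
    conf.setJobName("read-gz-input");

    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/data/input"));    // dir of .gz files
    FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}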

On 8/30/07, C G <pa...@yahoo.com> wrote:
> Hello All:
>
>   I think I must be missing something fundamental.  Is it possible to load compressed data into HDFS, and then operate on it directly with map/reduce?  I see a lot of stuff in the docs about writing compressed outputs, but nothing about reading compressed inputs.
>
>   Am I being ponderously stupid here?
>
>   Any help/comments appreciated...
>
>   Thanks,
>   C G