Posted to mapreduce-user@hadoop.apache.org by Niels Basjes <Ni...@basjes.nl> on 2012/02/28 16:50:27 UTC

Should splittable Gzip be a "core" hadoop feature?

Hi,

Some time ago I had an idea and implemented it.

Normally you can only run a single gzipped input file through a single
mapper and thus only on a single CPU core.
What I created makes it possible to process a Gzipped file in such a way
that it can run on several mappers in parallel.

I've put the javadoc I created on my homepage so you can read more about
the details.
http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
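
To give an impression of the intended usage, here is a minimal sketch of how
such a codec would be enabled for a job. The class name below is an
assumption made for illustration; the real name is in the javadoc linked
above:

    import org.apache.hadoop.conf.Configuration;

    // A sketch: register the splittable codec so that the
    // CompressionCodecFactory can select it for .gz input files.
    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
             "nl.basjes.hadoop.io.compress.SplittableGzipCodec"); // assumed name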

Now the question that was raised by one of the people reviewing this code
was: Should this implementation be part of the core Hadoop feature set?
The main reason that was given is that this needs a bit more understanding
of what is happening and as such cannot be enabled by default.

I would like to hear from the Hadoop Core/Map reduce users what you think.

Should this be
- a part of the default Hadoop feature set so that anyone can simply enable
it by setting the right configuration?
- a separate library?
- a nice idea I had fun building but that no one needs?
- ... ?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Wed, Feb 29, 2012 at 13:10, Michel Segel <mi...@hotmail.com> wrote:

> Let's play devil's advocate for a second?
>

I always like that :)


> Why?


Because then datafiles from other systems (like the Apache HTTP webserver)
can be processed more efficiently, without any preprocessing.

> Snappy exists.
>

Compared to gzip: Snappy is faster, compresses a bit less and is
unfortunately not splittable.

> The only advantage is that you don't have to convert from gzip to snappy
> and can process gzip files natively.
>

Yes, that and the fact that the files are smaller.
Note that I've described some of these considerations in the javadoc.

> Next question is how large are the gzip files in the first place?
>

I work for the biggest webshop in the Netherlands and I'm facing a set of
logfiles that are very often > 1 GB each... and are gzipped.
The first thing we do with them is parse and dissect each line in the very
first mapper. Then we store the result in (snappy compressed) Avro files.

> I don't disagree, I just want to have a solid argument in favor of it...
>

:)

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Michel Segel <mi...@hotmail.com>.
I do agree that a GitHub project is the way to go, unless you could convince Cloudera, Hortonworks or MapR to pick it up and support it.  They have enough committers.

Is this potentially worthwhile? Maybe; it depends on how the cluster is integrated into the overall environment. Companies that have standardized on using gzip would find it useful.



Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 29, 2012, at 3:17 PM, Niels Basjes <Ni...@basjes.nl> wrote:

> Hi,
> 
> On Wed, Feb 29, 2012 at 19:13, Robert Evans <ev...@yahoo-inc.com> wrote:
> 
> 
>> What I really want to know is how well does this new CompressionCodec
>> perform in comparison to the regular gzip codec in
> 
>> various different conditions and what type of impact does it have on
>> network traffic and datanode load.  My gut feeling is that
> 
>> the speedup is going to be relatively small except when there is a lot of
>> computation happening in the mapper
> 
> 
> I agree, I made the same assessment.
> In the javadoc I wrote under "When is this useful?"
> *"Assume you have a heavy map phase for which the input is a 1GiB Apache
> httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
> 
> 
>> and the added load and network traffic outweighs the speedup in most
>> cases,
> 
> 
> No, the trick to solve that one is to upload the gzipped files with an HDFS
> blocksize equal to (or 1 byte larger than) the filesize.
> This setting will help in speeding up Gzipped input files in any situation
> (no more network overhead).
> From there the HDFS file replication factor of the file dictates the
> optimal number of splits for this codec.
> 
> 
>> but like all performance on a complex system gut feelings are
> 
>> almost worthless and hard numbers are what is needed to make a judgment
>> call.
> 
> 
> Yes
> 
> 
>> Niels, I assume you have tested this on your cluster(s).  Can you share
>> with us some of the numbers?
>> 
> 
> No, I haven't tested it beyond a multi-core system.
> The simple reason for that is that when this was under review last summer
> the whole "YARN" thing happened
> and I was unable to run it at all for a long time.
> I only got it running again last December when the restructuring of the
> source tree was mostly done.
> 
> At this moment I'm building an experimentation setup at work that can be
> used for various things.
> Given the current state of Hadoop 2.0 I think it's time to produce some
> actual results.
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Wed, Feb 29, 2012 at 19:13, Robert Evans <ev...@yahoo-inc.com> wrote:


> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in

> various different conditions and what type of impact does it have on
> network traffic and datanode load.  My gut feeling is that

> the speedup is going to be relatively small except when there is a lot of
> computation happening in the mapper


I agree, I made the same assessment.
In the javadoc I wrote under "When is this useful?"
*"Assume you have a heavy map phase for which the input is a 1GiB Apache
httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
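
To make that concrete with some back-of-the-envelope numbers (mine, not from
the javadoc): with this codec the mapper for split i out of R has to gunzip
and discard everything before its split, so across all mappers

    total decompression work = sum(i=1..R) (i/R) * S = ((R+1)/2) * S

for a compressed file of size S. If gunzipping the whole 1GiB file takes on
the order of half a minute, then even at R=10 the extra discard work stays
well under a minute per task, while the 60 minutes of map CPU drops to
roughly 6 minutes of wall clock time.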


> and the added load and network traffic outweighs the speedup in most
> cases,


No, the trick to solve that one is to upload the gzipped files with an HDFS
blocksize equal to (or 1 byte larger than) the filesize.
This setting will help in speeding up Gzipped input files in any situation
(no more network overhead).
From there the HDFS file replication factor of the file dictates the
optimal number of splits for this codec.
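
A sketch of such an upload (assuming the classic FileSystem API; the path
names and the replication factor are just example values):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class UploadGzWithOneBlock {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        File local = new File("access.log.gz");      // example input
        // Round the blocksize up to a multiple of the 512 byte checksum
        // chunk so HDFS accepts it; the result is >= the filesize, so
        // the whole file lands in a single block.
        long blockSize = ((local.length() / 512) + 1) * 512;
        short replication = 10; // example: more replicas -> more local splits
        Path dst = new Path("/logs/access.log.gz");  // example destination
        try (FSDataOutputStream out =
                 fs.create(dst, true, 4096, replication, blockSize);
             InputStream in = new FileInputStream(local)) {
          IOUtils.copyBytes(in, out, 4096);
        }
      }
    }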


> but like all performance on a complex system gut feelings are

> almost worthless and hard numbers are what is needed to make a judgment
> call.


Yes


> Niels, I assume you have tested this on your cluster(s).  Can you share
> with us some of the numbers?
>

No, I haven't tested it beyond a multi-core system.
The simple reason for that is that when this was under review last summer
the whole "YARN" thing happened
and I was unable to run it at all for a long time.
I only got it running again last December when the restructuring of the
source tree was mostly done.

At this moment I'm building an experimentation setup at work that can be
used for various things.
Given the current state of Hadoop 2.0 I think it's time to produce some
actual results.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Robert Evans <ev...@yahoo-inc.com>.
If many people are going to use it then by all means put it in.  If there is only one person, or a very small handful of people that are going to use it then I personally would prefer to see it as a separate project.  However, Edward, you have convinced me that I am trying to make a logical judgment based only on a gut feeling and the response rate to an email chain.  Thanks for that.  What I really want to know is how well does this new CompressionCodec perform in comparison to the regular gzip codec in various different conditions and what type of impact does it have on network traffic and datanode load.  My gut feeling is that the speedup is going to be relatively small except when there is a lot of computation happening in the mapper and the added load and network traffic outweighs the speedup in most cases, but like all performance on a complex system gut feelings are almost worthless and hard numbers are what is needed to make a judgment call.  Niels, I assume you have tested this on your cluster(s).  Can you share with us some of the numbers?

--Bobby Evans

On 2/29/12 11:06 AM, "Edward Capriolo" <ed...@gmail.com> wrote:

Too bad we cannot up the replication on the first few blocks of the
file, or put it in the distributed cache.

The contrib statement is arguable. I could make a case that the
majority of stuff should not be in hadoop-core. NLineInputFormat, for
example, is nice to have. Took a long time to get ported to the new
mapreduce format. DBInputFormat and DataDrivenDBInputFormat are sexy
for sure but do not need to be part of core. I could see Hadoop as just
coming with TextInputFormat and SequenceFileInputFormat and everything
else is aftermarket from GitHub.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans <ev...@yahoo-inc.com> wrote:
> I can see a use for it, but I have two concerns about it.  My biggest concern is maintainability.  We have had lots of things get thrown into contrib in the past, very few people use them, and inevitably they start to suffer from bit rot.  I am not saying that it will happen with this, but if you have to ask if people will use it and there has been no overwhelming yes, it makes me nervous about it.  My second concern is with knowing when to use this.  Anything that adds this in would have to come with plenty of documentation about how it works, how it is different from the normal gzip format, explanations about what type of a load it might put on data nodes that hold the start of the file, etc.
>
> From both of these I would prefer to see this as a GitHub project for a while first, and once it shows that it has a significant following, or a community with it, then we can pull it in.  But if others disagree I am not going to block it.  I am a -0 on pulling this in now.
>
> --Bobby
>
> On 2/29/12 10:00 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:
>
> Hi,
>
> On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <ed...@gmail.com> wrote:
> ...
>
>> But being able to generate split info for them and processing them
>> would be good as well. I remember that was a hot thing to do with lzo
>> back in the day. The pain of doing a pass over the gz files to generate
>> the split info is a drawback, but it is nice to know it is there if you
>> want it.
>>
>
> Note that the solution I created (HADOOP-7076) does not require any
> preprocessing.
> It can split ANY gzipped file as-is.
> The downside is that this effectively costs some additional performance
> because the task has to decompress the first part of the file that is to be
> discarded.
>
> The other two ways of splitting gzipped files either require
> - creating some kind of "compression index" before actually using the file
> (HADOOP-6153)
> - creating a file that is generated in such a way that it is
> really a set of concatenated gzipped files. (HADOOP-7909)
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>


Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Edward Capriolo <ed...@gmail.com>.
Too bad we cannot up the replication on the first few blocks of the
file, or put it in the distributed cache.

The contrib statement is arguable. I could make a case that the
majority of stuff should not be in hadoop-core. NLineInputFormat, for
example, is nice to have. Took a long time to get ported to the new
mapreduce format. DBInputFormat and DataDrivenDBInputFormat are sexy
for sure but do not need to be part of core. I could see Hadoop as just
coming with TextInputFormat and SequenceFileInputFormat and everything
else is aftermarket from GitHub.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans <ev...@yahoo-inc.com> wrote:
> I can see a use for it, but I have two concerns about it.  My biggest concern is maintainability.  We have had lots of things get thrown into contrib in the past, very few people use them, and inevitably they start to suffer from bit rot.  I am not saying that it will happen with this, but if you have to ask if people will use it and there has been no overwhelming yes, it makes me nervous about it.  My second concern is with knowing when to use this.  Anything that adds this in would have to come with plenty of documentation about how it works, how it is different from the normal gzip format, explanations about what type of a load it might put on data nodes that hold the start of the file, etc.
>
> From both of these I would prefer to see this as a GitHub project for a while first, and once it shows that it has a significant following, or a community with it, then we can pull it in.  But if others disagree I am not going to block it.  I am a -0 on pulling this in now.
>
> --Bobby
>
> On 2/29/12 10:00 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:
>
> Hi,
>
> On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <ed...@gmail.com> wrote:
> ...
>
>> But being able to generate split info for them and processing them
>> would be good as well. I remember that was a hot thing to do with lzo
>> back in the day. The pain of doing a pass over the gz files to generate
>> the split info is a drawback, but it is nice to know it is there if you
>> want it.
>>
>
> Note that the solution I created (HADOOP-7076) does not require any
> preprocessing.
> It can split ANY gzipped file as-is.
> The downside is that this effectively costs some additional performance
> because the task has to decompress the first part of the file that is to be
> discarded.
>
> The other two ways of splitting gzipped files either require
> - creating some kind of "compression index" before actually using the file
> (HADOOP-6153)
> - creating a file that is generated in such a way that it is
> really a set of concatenated gzipped files. (HADOOP-7909)
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Robert Evans <ev...@yahoo-inc.com>.
I can see a use for it, but I have two concerns about it.  My biggest concern is maintainability.  We have had lots of things get thrown into contrib in the past, very few people use them, and inevitably they start to suffer from bit rot.  I am not saying that it will happen with this, but if you have to ask if people will use it and there has been no overwhelming yes, it makes me nervous about it.  My second concern is with knowing when to use this.  Anything that adds this in would have to come with plenty of documentation about how it works, how it is different from the normal gzip format, explanations about what type of a load it might put on data nodes that hold the start of the file, etc.

From both of these I would prefer to see this as a GitHub project for a while first, and once it shows that it has a significant following, or a community with it, then we can pull it in.  But if others disagree I am not going to block it.  I am a -0 on pulling this in now.

--Bobby

On 2/29/12 10:00 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:

Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <ed...@gmail.com> wrote:
...

> But being able to generate split info for them and processing them
> would be good as well. I remember that was a hot thing to do with lzo
> back in the day. The pain of doing a pass over the gz files to generate
> the split info is a drawback, but it is nice to know it is there if you
> want it.
>

Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.

The other two ways of splitting gzipped files either require
- creating some kind of "compression index" before actually using the file
(HADOOP-6153)
- creating a file that is generated in such a way that it is
really a set of concatenated gzipped files. (HADOOP-7909)

--
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <ed...@gmail.com> wrote:
...

> But being able to generate split info for them and processing them
> would be good as well. I remember that was a hot thing to do with lzo
> back in the day. The pain of doing a pass over the gz files to generate
> the split info is a drawback, but it is nice to know it is there if you
> want it.
>

Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.

The other two ways of splitting gzipped files either require
- creating some kind of "compression index" before actually using the file
(HADOOP-6153)
- creating a file that is generated in such a way that it is
really a set of concatenated gzipped files. (HADOOP-7909)
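
For the third option it helps to know that the gzip format allows multiple
gzip "members" back to back in one file; even "cat a.gz b.gz" yields a valid
gzip file. A quick sketch of writing such a file with plain java.util.zip
(file name and contents are just examples):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class WriteConcatenatedGzip {
      public static void main(String[] args) throws Exception {
        try (OutputStream raw = new FileOutputStream("parts.gz")) {
          for (String part : new String[] {"part one\n", "part two\n"}) {
            // Each iteration emits one complete gzip member; finish()
            // writes the member trailer without closing 'raw', so the
            // members end up simply concatenated.
            GZIPOutputStream gz = new GZIPOutputStream(raw);
            gz.write(part.getBytes(StandardCharsets.UTF_8));
            gz.finish();
          }
        }
      }
    }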

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Edward Capriolo <ed...@gmail.com>.
Mike,

Snappy is cool and all, but I was not overly impressed with it.

GZ zips much better than Snappy. Last time I checked, for our log files
gzip took them down from 100MB -> 40MB, while snappy compressed them
from 100MB -> 55MB. That was only with sequence files. But still that is
pretty significant if you are considering long term storage. Also,
since the delta in the file size was large, I could not actually make
the argument that using sequence+snappy was faster than sequence+gz.
Sure the MB/s rate was probably faster but since I had more MB I was
not able to prove snappy a win. I use it for intermediate compression
only.

Actually the raw formats (gz vs sequence gz) are significantly smaller
and faster than their sequence file counterparts.

Believe it or not, I commonly use mapred.output.compress without
sequence files. As long as I have a large number of reducers I do not
have to worry about files being splittable because N mappers process N
files. Generally I am happy with, say, N mappers, because the input
formats tend to create more mappers than I want, which makes more
overhead and more shuffle.
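
For reference, that setup in the old mapred API looks roughly like this
(the reducer count is an example value):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // Compressed plain-text output, no sequence files: each of the N
    // reducers writes its own gzipped part file.
    JobConf job = new JobConf();
    job.setNumReduceTasks(20);                      // example value
    FileOutputFormat.setCompressOutput(job, true);  // mapred.output.compress
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);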

But being able to generate split info for them and processing them
would be good as well. I remember that was a hot thing to do with lzo
back in the day. The pain of doing a pass over the gz files to generate
the split info is a drawback, but it is nice to know it is there if you
want it.

Edward
On Wed, Feb 29, 2012 at 7:10 AM, Michel Segel <mi...@hotmail.com> wrote:
> Let's play devil's advocate for a second?
>
> Why? Snappy exists.
> The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively.
>
> Next question is how large are the gzip files in the first place?
>
> I don't disagree, I just want to have a solid argument in favor of it...
>
>
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 28, 2012, at 9:50 AM, Niels Basjes <Ni...@basjes.nl> wrote:
>
>> Hi,
>>
>> Some time ago I had an idea and implemented it.
>>
>> Normally you can only run a single gzipped input file through a single
>> mapper and thus only on a single CPU core.
>> What I created makes it possible to process a Gzipped file in such a way
>> that it can run on several mappers in parallel.
>>
>> I've put the javadoc I created on my homepage so you can read more about
>> the details.
>> http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
>>
>> Now the question that was raised by one of the people reviewing this code
>> was: Should this implementation be part of the core Hadoop feature set?
>> The main reason that was given is that this needs a bit more understanding
>> of what is happening and as such cannot be enabled by default.
>>
>> I would like to hear from the Hadoop Core/Map reduce users what you think.
>>
>> Should this be
>> - a part of the default Hadoop feature set so that anyone can simply enable
>> it by setting the right configuration?
>> - a separate library?
>> - a nice idea I had fun building but that no one needs?
>> - ... ?
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes

Re: Should splittable Gzip be a "core" hadoop feature?

Posted by Michel Segel <mi...@hotmail.com>.
Let's play devil's advocate for a second?

Why? Snappy exists.
The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively.

Next question is how large are the gzip files in the first place?

I don't disagree, I just want to have a solid argument in favor of it...




Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 28, 2012, at 9:50 AM, Niels Basjes <Ni...@basjes.nl> wrote:

> Hi,
> 
> Some time ago I had an idea and implemented it.
> 
> Normally you can only run a single gzipped input file through a single
> mapper and thus only on a single CPU core.
> What I created makes it possible to process a Gzipped file in such a way
> that it can run on several mappers in parallel.
> 
> I've put the javadoc I created on my homepage so you can read more about
> the details.
> http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
> 
> Now the question that was raised by one of the people reviewing this code
> was: Should this implementation be part of the core Hadoop feature set?
> The main reason that was given is that this needs a bit more understanding
> of what is happening and as such cannot be enabled by default.
> 
> I would like to hear from the Hadoop Core/Map reduce users what you think.
> 
> Should this be
> - a part of the default Hadoop feature set so that anyone can simply enable
> it by setting the right configuration?
> - a separate library?
> - a nice idea I had fun building but that no one needs?
> - ... ?
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes