Posted to user@crunch.apache.org by Som Satpathy <so...@gmail.com> on 2013/08/03 00:58:31 UTC

Writing compressed sequence files

Hi all,

I am trying to write compressed sequence files at the end of my Crunch
pipeline. I'm calling pipeline.write(mycollection, To.sequenceFile(path))
for that.
However, Crunch writes an uncompressed sequence file by default. How do
I pass the codec that I want to use to Crunch?

Looking forward to your input.

Thanks,
Som

Re: Writing compressed sequence files

Posted by Som Satpathy <so...@gmail.com>.

It worked! Thanks Josh, appreciate it.

All I had to do was:

    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type", "BLOCK");
    conf.setClass("mapred.output.compression.codec", SnappyCodec.class,
        CompressionCodec.class);

instead of:

    conf.set("mapred.compress.output", "true");
    conf.set("mapred.output.compression.type", "BLOCK");
    conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
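For anyone comparing the two snippets above: Configuration.setBoolean and
Configuration.setClass store plain strings ("true" and the class name), so
the value types are probably not what made the difference. The key name is:
the working version sets mapred.output.compress, while the earlier attempt
set mapred.compress.output. Below is a minimal plain-Java sketch of the
working settings. It assumes the Hadoop 1.x style property names used in
this thread. The class CompressionSettings and its methods are hypothetical
names for illustration, and a Map stands in for Hadoop's Configuration so
the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or Crunch.
public class CompressionSettings {

    /** Returns the three properties from the working snippet, in order. */
    public static Map<String, String> snappyBlockCompression() {
        Map<String, String> props = new LinkedHashMap<>();
        // Note the key: "mapred.output.compress", not "mapred.compress.output".
        props.put("mapred.output.compress", "true");
        props.put("mapred.output.compression.type", "BLOCK");
        props.put("mapred.output.compression.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        return props;
    }

    /** Renders the properties as "-D key=value" flags for the command line. */
    public static String toGenericOptions(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toGenericOptions(snappyBlockCompression()));
    }
}
```

Each entry can be copied into the real configuration with conf.set(key,
value), for example on the object returned by pipeline.getConfiguration().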



On Fri, Aug 2, 2013 at 6:24 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Som,
>
> Something seems amiss-- I use this trick in Cloudera ML to handle output
> compression, viz.:
>
>
> https://github.com/cloudera/ml/blob/master/client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java
>
> Can you send me a gist of what you're trying if that doesn't work?
>
> J

Re: Writing compressed sequence files

Posted by Josh Wills <jw...@cloudera.com>.

Hey Som,

Something seems amiss-- I use this trick in Cloudera ML to handle output
compression, viz.:

https://github.com/cloudera/ml/blob/master/client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java

Can you send me a gist of what you're trying if that doesn't work?

J



On Fri, Aug 2, 2013 at 5:33 PM, Som Satpathy <so...@gmail.com> wrote:

> Thanks Josh. I tried setting compression parameters via the Configuration
> object and also via command line, but the output sequence file never seems
> to get compressed. I'm trying to Snappy compress it.
>
> If I try creating a sequence file outside of Crunch using
> SequenceFile.createWriter, I see the file getting compressed with my
> compression codec (i.e. Snappy).
>
> I was wondering if this is a known issue with Crunch.
>
> Thanks,
> Som


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Writing compressed sequence files

Posted by Som Satpathy <so...@gmail.com>.

Thanks Josh. I tried setting compression parameters via the Configuration
object and also via command line, but the output sequence file never seems
to get compressed. I'm trying to Snappy compress it.

If I try creating a sequence file outside of Crunch using
SequenceFile.createWriter, I see the file getting compressed with my
compression codec (i.e. Snappy).

I was wondering if this is a known issue with Crunch.

Thanks,
Som

On Fri, Aug 2, 2013 at 4:56 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Som,
>
> The Pipeline object that coordinates the flow has a getConfiguration()
> method where you can set any options you might like and they will propagate
> to all of your jars.
>
> I usually implement Hadoop's Tool interface and then specify these
> configuration options on the command line so I can play with them
> independent of the logic of my runtime, and I end up w/something like:
>
> hadoop jar <crunch-job.jar> -D mapred.compress.output=true -D mapred.output.compression.type=block etc.
>
> I think that having some syntactic sugar for compressing Target objects
> (like To.sequenceFile or To.avroFile) would be a nice JIRA.
>
> J

Re: Writing compressed sequence files

Posted by Josh Wills <jw...@cloudera.com>.

Hey Som,

The Pipeline object that coordinates the flow has a getConfiguration()
method where you can set any options you might like and they will propagate
to all of your jars.

I usually implement Hadoop's Tool interface and then specify these
configuration options on the command line so I can play with them
independent of the logic of my runtime, and I end up w/something like:

    hadoop jar <crunch-job.jar> -D mapred.compress.output=true -D mapred.output.compression.type=block etc.
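
As a rough illustration of what the -D flags do: when a job is launched
through Hadoop's ToolRunner, generic options such as -D key=value are split
off from the remaining arguments and placed in the Configuration before the
tool's run method sees them. The plain-Java sketch below mimics only that
splitting step. DashDArgs is a hypothetical class, not part of Hadoop, and it
ignores the other generic options (-conf, -fs, and so on). Note that the
reply at the top of this thread reports success with the key
mapred.output.compress rather than mapred.compress.output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the "-D key=value" handling of Hadoop's generic options.
public class DashDArgs {
    public final Map<String, String> properties = new LinkedHashMap<>();
    public final List<String> remaining = new ArrayList<>();

    /** Splits args into -D properties and everything else. */
    public static DashDArgs parse(String[] args) {
        DashDArgs out = new DashDArgs();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // form: -Dkey=value
            }
            if (pair == null) {
                out.remaining.add(arg);    // not a -D option, pass it through
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                out.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // A pair without '=' is silently dropped in this sketch.
        }
        return out;
    }

    public static void main(String[] args) {
        DashDArgs parsed = parse(args);
        System.out.println("properties: " + parsed.properties);
        System.out.println("remaining:  " + parsed.remaining);
    }
}
```

In a real job the parsed properties would already be present in the
Configuration handed to the Tool, so no manual parsing is needed there.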

I think that having some syntactic sugar for compressing Target objects
(like To.sequenceFile or To.avroFile) would be a nice JIRA.

J

On Fri, Aug 2, 2013 at 3:58 PM, Som Satpathy <so...@gmail.com> wrote:

> Hi all,
>
> I am trying to write compressed sequence files at the end of my Crunch
> pipeline. I'm calling pipeline.write(mycollection, To.sequenceFile(path))
> for that.
> However, Crunch writes an uncompressed sequence file by default. How
> do I pass the codec that I want to use to Crunch?
>
> Looking forward to your input.
>
> Thanks,
> Som
>
>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>