Posted to user@avro.apache.org by David Ginzburg <da...@inner-active.com> on 2013/10/13 17:16:28 UTC

Generating snappy compressed avro files as hadoop map reduce input files

Hi,

I am writing an application that produces Avro record files, to be stored
on AWS S3 as possible input to EMR.
I would like to compress them with the Snappy codec before storing them on S3.
It is my understanding that Hadoop currently uses a different Snappy codec,
mostly as an intermediate map output format.
My question is: how can I generate Snappy-compressed Avro files within my
application logic (not MR)?




Re: Generating snappy compressed avro files as hadoop map reduce input files

Posted by Bertrand Dechoux <de...@gmail.com>.
David wants to generate those files from outside Hadoop, so InputFormat and
OutputFormat may not be the most appropriate tools.

The aim of Avro files is to be easily readable and writable without Hadoop
MapReduce. The Stack Overflow link, in short, only discusses a limitation of
most compression algorithms: they are not splittable (by Hadoop or anything
else). That's the case for Snappy.

It is a known limitation, and it is why the Avro data file format exists: a
specific format where the file as a whole is not compressed but its parts
(blocks) are. That way there is no issue with splittability.

The Stack Overflow question is about a text file (logs) that was compressed
with Snappy as a whole. That is not a good practice. Once again, there is a
specific format for this: the Avro data file.

http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29
should be what you need. If you are not using a Java program to generate
your data, other languages are also supported.
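For instance, here is a minimal sketch in plain Java (no Hadoop involved).
The schema, field names, and output path are only illustrative, and
snappy-java must be on the classpath:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class SnappyAvroWriter {
      public static void main(String[] args) throws Exception {
        // Illustrative schema: a record with a long and a string field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"message\",\"type\":\"string\"}]}");

        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                 new GenericDatumWriter<GenericRecord>(schema))) {
          // The codec must be set before create(); every block of the
          // resulting file is then Snappy-compressed.
          writer.setCodec(CodecFactory.snappyCodec());
          writer.create(schema, new File("events.avro"));

          GenericRecord record = new GenericData.Record(schema);
          record.put("id", 1L);
          record.put("message", "hello");
          writer.append(record);
        }
      }
    }

The codec is recorded in the file's header, so any Avro-aware reader will
decompress the blocks transparently.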

Then, when/if you want to process those files with MapReduce, there are
AvroInputFormat and AvroOutputFormat.
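For the MapReduce side, a rough sketch with the old org.apache.avro.mapred
API (the job name and input path are illustrative; if I recall correctly,
setInputSchema also registers AvroInputFormat for the job):

    import org.apache.avro.Schema;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class AvroInputJobSetup {
      public static JobConf configure(Schema schema) {
        JobConf conf = new JobConf();
        conf.setJobName("read-snappy-avro");
        // Declares the input schema; the mappers then receive the
        // deserialized Avro records.
        AvroJob.setInputSchema(conf, schema);
        // Illustrative S3 location of the generated files.
        FileInputFormat.setInputPaths(conf, new Path("s3://my-bucket/events/"));
        return conf;
      }
    }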

Regards

Bertrand

PS: from the Cloudera blog post linked by the Stack Overflow answer

"One thing to note is that Snappy is intended to be used with a container
format, like Sequence Files or Avro Data Files, rather than being used
directly on plain text, for example, since the latter is not splittable and
can’t be processed in parallel using MapReduce. "







-- 
Bertrand Dechoux

Re: Generating snappy compressed avro files as hadoop map reduce input files

Posted by graham sanderson <gr...@vast.com>.
I haven't actually tried writing, but look at AvroSequenceFileOutputFormat (and obviously make sure the native Snappy libraries are on your box).

Also, the javadoc is IMHO a bit ambiguous about AvroJob setup: you can totally use NullWritable (or any other Hadoop-supported serializable type) as a key.
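Roughly, a job-configuration sketch along those lines, using the new-style
org.apache.avro.mapreduce classes (class and method names should be
double-checked against your avro-mapred version; the value schema is
whatever your records use):

    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroSequenceFileOutputFormat;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class SnappySequenceFileSetup {
      public static void configure(Job job, Schema valueSchema) {
        job.setOutputFormatClass(AvroSequenceFileOutputFormat.class);
        // NullWritable keys, Avro records as values.
        job.setOutputKeyClass(NullWritable.class);
        AvroJob.setOutputValueSchema(job, valueSchema);
        // Block-level Snappy compression (needs the native libraries).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
            SequenceFile.CompressionType.BLOCK);
      }
    }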



RE: Generating snappy compressed avro files as hadoop map reduce input files

Posted by David Ginzburg <da...@inner-active.com>.
Thanks,
I am not generating the Avro files with Hadoop MR, but with a different process.
I plan to just store the files on S3 for potential archive processing with EMR.
Can I use AvroSequenceFile from a non-MR process to generate the sequence
files, with my Avro records as the values and null keys?


Re: Generating snappy compressed avro files as hadoop map reduce input files

Posted by graham sanderson <gr...@vast.com>.
If you're using Hadoop, why not use AvroSequenceFileOutputFormat? It works fine with Snappy (block-level compression may be best, depending on your data).



RE: Generating snappy compressed avro files as hadoop map reduce input files

Posted by David Ginzburg <da...@inner-active.com>.
As mentioned in http://stackoverflow.com/a/15821136, Hadoop's Snappy codec
just doesn't work with externally generated files.

Can files generated by DataFileWriter<http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29>
serve as input files for a MapReduce job, especially EMR jobs?

Re: Generating snappy compressed avro files as hadoop map reduce input files

Posted by Bertrand Dechoux <de...@gmail.com>.
I am not sure I understand the relation between your problem and the way
the temporary data is stored after the map phase.

However, I guess you are looking for DataFileWriter and its setCodec
method:
http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29

Regards

Bertrand

PS: A Snappy-compressed Avro file is not an ordinary file that has been
compressed afterwards, but a specific file format containing compressed
blocks. The principle is similar to the SequenceFile's. Maybe that's what
you mean by a different Snappy codec?
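For example, this is visible when reading such a file back: the codec is
recorded in the file's metadata, and decompression happens transparently.
A small sketch (the file name is illustrative):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class InspectAvroFile {
      public static void main(String[] args) throws Exception {
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                 new File("events.avro"),
                 new GenericDatumReader<GenericRecord>())) {
          // Prints "snappy" for a Snappy-compressed data file.
          System.out.println("codec: " + reader.getMetaString("avro.codec"));
          for (GenericRecord record : reader) {
            System.out.println(record);  // blocks are decompressed on the fly
          }
        }
      }
    }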
