Posted to user@avro.apache.org by nir_zamir <ni...@gmail.com> on 2013/04/22 11:26:47 UTC

map/reduce of compressed Avro

Hi,

Does anyone know if/how it's possible to use compressed Avro files as
input to an M/R job?
If so, which codecs are supported?

Thanks,
Nir



--
View this message in context: http://apache-avro.679487.n3.nabble.com/map-reduce-of-compressed-Avro-tp4026947.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: map/reduce of compressed Avro

Posted by Martin Kleppmann <ma...@rapportive.com>.
To my knowledge, LZO is not a supported codec for Avro data files. It's
possible that you have an LZO-compressed Hadoop sequence file containing
Avro records, but that would be a format you defined yourself, not an
Avro data file.

Avro data files are designed to be splittable regardless of the codec they
use, so you can have multiple mappers that each consume a portion of the
input file. The format achieves that by breaking the data into blocks, and
compressing each block separately; hence it can be split at block
boundaries.
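
As a rough illustration of the block idea (a toy sketch in Python, not
Avro's actual on-disk format or API): each block is compressed on its own
and followed by a 16-byte sync marker, so a reader handed an arbitrary byte
range can scan forward to the next marker and start decoding there. The
names `write_blocks` and `read_split`, the 4-byte length prefix and the
newline-joined records are assumptions made up for this example; zlib
stands in for whatever codec is used.

```python
import os
import zlib

SYNC_SIZE = 16  # Avro data files also use a 16-byte sync marker


def write_blocks(records, records_per_block, sync):
    """Toy writer: sync marker, then (length, compressed block, sync marker) repeated."""
    out = bytearray(sync)  # stands in for the marker that ends a real file header
    for i in range(0, len(records), records_per_block):
        chunk = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        payload = zlib.compress(chunk)  # each block is compressed on its own
        out += len(payload).to_bytes(4, "big") + payload + sync
    return bytes(out)


def read_split(data, start, end, sync):
    """Return the records of every block whose preceding sync marker starts in [start, end)."""
    records = []
    pos = data.find(sync, start)  # skip ahead to the first block boundary in this split
    while pos != -1 and pos < end:
        block_start = pos + SYNC_SIZE
        if block_start >= len(data):  # that was the marker after the last block
            break
        length = int.from_bytes(data[block_start:block_start + 4], "big")
        payload = data[block_start + 4:block_start + 4 + length]
        records.extend(zlib.decompress(payload).decode("utf-8").split("\n"))
        pos = block_start + 4 + length  # the next sync marker starts here
    return records


if __name__ == "__main__":
    sync = os.urandom(SYNC_SIZE)
    recs = [f"record-{i}" for i in range(1000)]
    data = write_blocks(recs, 100, sync)
    third = len(data) // 3
    for s, e in [(0, third), (third, 2 * third), (2 * third, len(data))]:
        print(f"split [{s}, {e}) yields {len(read_split(data, s, e, sync))} records")
```

Each block is read by exactly one split, namely the one containing the
start of the marker that precedes it, which is why the codec itself does
not need to be splittable.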

Best,
Martin


On 22 April 2013 23:47, nir_zamir <ni...@gmail.com> wrote:

> Thanks Martin.
>
> What will happen if I try to use an indexed LZO-compressed avro file? Will
> it work and utilize the index to allow multiple mappers?
>
> I think that for Snappy for example, the file is splittable and can use
> multiple mappers, but I haven't tested it yet - would be glad if anyone has
> any experience with that.
>
> Thanks!
> Nir.

Re: map/reduce of compressed Avro

Posted by nir_zamir <ni...@gmail.com>.
Thanks.

If the compression codec doesn't matter, what does it mean that Avro added
support for the Snappy codec?
If I need the files to be used as input for an M/R job, I guess the Avro
module should be able to decompress each block and extract the objects.
Does that make sense?

So are you saying that in this case I can use a non-splittable codec (like
deflate)?

Thanks
Nir




Re: map/reduce of compressed Avro

Posted by "Enns, Steven" <sa...@a9.com>.
Out of curiosity, are there any other file formats that provide splittable
gzip compression like Avro object containers?  I can only think of
Sequence Files.

On 4/29/13 3:47 PM, "Scott Carey" <sc...@apache.org> wrote:

>Martin said it already, but I will emphasize:
>
>Avro data files are splittable and can support multiple mappers no matter
>what codec is used for compression.  This is because Avro files are
>block-based, and compression is applied only within each block.  I
>recommend starting with the deflate codec (gzip-style compression), and
>moving to snappy only if deflate at compression level 1 is not fast enough.
>
>For more information on avro data files, see:
>http://avro.apache.org/docs/current/spec.html#Object+Container+Files
>


Re: map/reduce of compressed Avro

Posted by Scott Carey <sc...@apache.org>.
Martin said it already, but I will emphasize:

Avro data files are splittable and can support multiple mappers no matter
what codec is used for compression.  This is because Avro files are
block-based, and compression is applied only within each block.  I
recommend starting with the deflate codec (gzip-style compression), and
moving to snappy only if deflate at compression level 1 is not fast enough.
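
To get a feel for that trade-off, one can time deflate at different levels
on a block of sample data. The sketch below uses Python's zlib module as a
stand-in for Avro's deflate codec; the helper name `measure_deflate` and
the sample records are made up for illustration, and absolute numbers will
differ from what Avro's Java implementation produces.

```python
import time
import zlib


def measure_deflate(block, level):
    """Compress one block at the given level; return (compressed size, seconds taken)."""
    started = time.perf_counter()
    compressed = zlib.compress(block, level)
    elapsed = time.perf_counter() - started
    assert zlib.decompress(compressed) == block  # sanity check: lossless round trip
    return len(compressed), elapsed


if __name__ == "__main__":
    # Hypothetical sample data: repetitive JSON-like records, one per line.
    block = "\n".join(
        f'{{"id": {i}, "name": "user-{i % 50}"}}' for i in range(20000)
    ).encode("utf-8")
    for level in (1, 6, 9):
        size, seconds = measure_deflate(block, level)
        print(f"level {level}: {len(block)} -> {size} bytes in {seconds * 1000:.1f} ms")
```

If level 1 already keeps up with the job's I/O, there is little reason to
give up the better compression ratio for a faster codec.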

For more information on avro data files, see:
http://avro.apache.org/docs/current/spec.html#Object+Container+Files



On 4/22/13 11:47 PM, "nir_zamir" <ni...@gmail.com> wrote:

>Thanks Martin.
>
>What will happen if I try to use an indexed LZO-compressed avro file? Will
>it work and utilize the index to allow multiple mappers?
>
>I think that for Snappy for example, the file is splittable and can use
>multiple mappers, but I haven't tested it yet - would be glad if anyone
>has
>any experience with that.
>
>Thanks!
>Nir.



Re: map/reduce of compressed Avro

Posted by nir_zamir <ni...@gmail.com>.
Thanks Martin.

What will happen if I try to use an indexed LZO-compressed Avro file? Will
it work and use the index to allow multiple mappers?

I think that with Snappy, for example, the file is splittable and can be
read by multiple mappers, but I haven't tested it yet. I would be glad to
hear from anyone who has experience with that.

Thanks!
Nir.




Re: map/reduce of compressed Avro

Posted by Martin Kleppmann <ma...@rapportive.com>.
You don't need to do anything special to accept compressed Avro files as
input, as it's detected automatically and decompressed transparently. M/R
jobs support all codecs that the Java implementation supports; at the
moment I think that's deflate, snappy and bzip2.
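
The automatic detection works because the codec name travels with the file:
Avro data files record it in the file metadata under the `avro.codec` key,
and a missing key means `null`, i.e. no compression. Below is a toy sketch
in Python of a reader dispatching on that name. The JSON header, the 4-byte
length prefix and the function names are assumptions for this example and
not Avro's real layout, and zlib/bz2 are only stand-ins for the actual
codec implementations.

```python
import bz2
import json
import zlib

# Toy codec tables keyed by codec name; "null" means "store uncompressed".
ENCODERS = {"null": lambda b: b, "deflate": zlib.compress, "bzip2": bz2.compress}
DECODERS = {"null": lambda b: b, "deflate": zlib.decompress, "bzip2": bz2.decompress}


def write_file(payload, codec):
    """Write a length-prefixed JSON header naming the codec, then the encoded payload."""
    header = json.dumps({"avro.codec": codec}).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + ENCODERS[codec](payload)


def read_file(data):
    """Read the header, look up the codec it names, and decode the payload."""
    header_len = int.from_bytes(data[:4], "big")
    meta = json.loads(data[4:4 + header_len].decode("utf-8"))
    codec = meta.get("avro.codec", "null")  # absent key is treated as "null"
    return DECODERS[codec](data[4 + header_len:])
```

The caller of `read_file` never names a codec, which mirrors why nothing
special has to be configured on the input side of the job.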

If you want to generate compressed output, use
FileOutputFormat.setCompressOutput(job, true);

Martin


On 22 April 2013 02:26, nir_zamir <ni...@gmail.com> wrote:

> Hi,
>
> Does anyone know if/how it's possible to get compressed Avro files as an
> input to a M/R job?
> If so, which codecs are supported?
>
> Thanks,
> Nir