Posted to mapreduce-user@hadoop.apache.org by JOAQUIN GUANTER GONZALBEZ <xi...@tid.es> on 2012/06/15 08:04:26 UTC

LzopCodec and SequenceFile?

Hello,

I have a sequence of MR jobs that use SequenceFile as their output and input format. If I run them without any compression enabled, they work fine. If I use the LzoCodec they also work just fine (but then the output is not lzop-compatible, which is inconvenient).

If I try using the LzopCodec, then the first MR job (which reads from a TextFile and outputs to a SequenceFile) runs OK, but when the second job tries to read what the first job wrote, I get the following exception:

java.io.EOFException: Premature EOF from inputStream
        at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:75)
        at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:114)
        at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:54)
        at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:83)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1591)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1493)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1480)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:451)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.ha

Does anyone know why this could be happening? I'm using the latest Cloudera CDH3 distribution, and I'm configuring the compression through the mapred.output.compression.codec property in the mapred-site.xml file.

Thanks!
Ximo.

________________________________
Este mensaje se dirige exclusivamente a su destinatario. Puede consultar nuestra política de envío y recepción de correo electrónico en el enlace situado más abajo.
This message is intended exclusively for its addressee. We only send and receive email on the basis of the terms set out at
http://www.tid.es/ES/PAGINAS/disclaimer.aspx

RE: LzopCodec and SequenceFile?

Posted by JOAQUIN GUANTER GONZALBEZ <xi...@tid.es>.
Hi Harsh,

Thanks for the super-quick answer! It would be great to have this in the official documentation, since neither the SequenceFile documentation nor the LzopCodec documentation mentions that SequenceFile cannot be used with LzopCodec.

Thanks again!
Ximo.

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Friday, June 15, 2012 12:59
To: mapreduce-user@hadoop.apache.org
Subject: Re: LzopCodec and SequenceFile?

Hey Joaquin,

When using SequenceFiles, use LzoCodec. SequenceFile is a container format of its own, just as LZOP files are, so it does not make sense to combine the two.

For reading sequence files, use the SequenceFile.Reader class
(http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/io/SequenceFile.Reader.html)
and it will automatically handle decompressing the K/V fields for you. You don't have to run lzop etc. first to be able to read it, as the compression is applied internally rather than over the entire file.

Here is also a good link on the difference at Quora:
http://www.quora.com/Whats-the-difference-between-the-LzoCodec-and-the-LzopCodec-in-Hadoop-LZO




--
Harsh J


Re: LzopCodec and SequenceFile?

Posted by Harsh J <ha...@cloudera.com>.
Hey Joaquin,

When using SequenceFiles, use LzoCodec. SequenceFile is a container
format of its own, just as LZOP files are, so it does not make sense
to combine the two.
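
As a rough sketch of what that looks like in configuration (property names are the Hadoop 1.x / CDH3-era ones from your mail; adjust them for other versions), the output-compression settings might be:

```xml
<!-- Sketch: LZO-compressed SequenceFile output, Hadoop 1.x-era property names. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <!-- Note: LzoCodec, not LzopCodec, for SequenceFile output. -->
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <!-- BLOCK compression usually gives the best ratio for SequenceFiles. -->
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```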

For reading sequence files, use the SequenceFile.Reader class
(http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/io/SequenceFile.Reader.html)
and it will automatically handle decompressing the K/V fields for
you. You don't have to run lzop etc. first to be able to read it, as
the compression is applied internally rather than over the entire file.
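
A minimal reader sketch, assuming the classic Hadoop 1.x SequenceFile.Reader API and the hadoop-lzo jar on the classpath (the path argument and class name here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The reader detects the compression codec from the file header,
    // so no external lzop step is needed before reading.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Writable key =
        (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value =
        (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}
```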

Here is also a good link on the difference at Quora:
http://www.quora.com/Whats-the-difference-between-the-LzoCodec-and-the-LzopCodec-in-Hadoop-LZO




-- 
Harsh J