Posted to common-user@hadoop.apache.org by edward choi <mp...@gmail.com> on 2012/01/02 06:34:24 UTC

How to read LZO compressed files?

Hi,

I'm having trouble handling LZO-compressed files.
The input files were compressed with the LzopCodec provided by the hadoop-lzo package.
I am using the Hadoop version from Cloudera 3 update 2.

I don't need to split the input files, so there is no need to tell me to
index them and use LzoTextInputFormat, unless that is the only
way to handle LZO-compressed files.

I thought all I needed to do was set the job input format to
"TextInputFormat" and Hadoop would take care of the rest.
When I do that, I don't get any error messages, but the log files tell me that
the input files are not decompressed at all. They are being handled as
raw text files.

Is there a specific way to read files with lzo extension?

Regards,
Ed

Re: How to read LZO compressed files?

Posted by edward choi <mp...@gmail.com>.
Harsh, your comment just saved me from several wasted hours of aimless
labor.
I had added LzoCodec to core-site.xml but forgot to add LzopCodec.
Now everything works. Thanks for the reply!

Regards,
Ed

2012/1/2 Harsh J <ha...@cloudera.com>

> Hello Edward,
>
> On Mon, Jan 2, 2012 at 11:04 AM, edward choi <mp...@gmail.com> wrote:
> > Hi,
> >
> > I'm having trouble trying to handle lzo compressed files.
> > The input files are compressed by LzopCodec provided by hadoop-lzo package.
> > And I am using Cloudera 3 update 2 version Hadoop.
> >
> > I don't need to split the input file, so there is no need telling me to
> > index the input file and to use LzoTextInputFormat, unless that is the only
> > way to handle lzo-compressed files.
>
> It's OK to use LZO without splitting. There are no issues in doing that.
>
> > I thought all I needed to do was set the job input format as
> > "TextInputFormat" and hadoop will take care of the rest.
> > When I do that, I don't get any error messages but log files tell me that
> > input files are not decompressed at all. Input files are being handled as
> > raw text files.
>
> By 'Input files are being handled as raw text files.' I assume you
> mean that your mappers are receiving garbage (compressed) input,
> without being decoded?
>
> Have you ensured that your io.compression.codecs property in
> core-site.xml carries LzoCodec and LzopCodec canonical classnames, and
> that your MR cluster was restarted with this change added?
>
> > Is there a specific way to read files with lzo extension?
>
> The above config makes Hadoop recognize the ".lzo" extension and decode
> LZO files automatically, so you shouldn't need an explicit way.
>
> --
> Harsh J
>

Re: How to read LZO compressed files?

Posted by Harsh J <ha...@cloudera.com>.
Hello Edward,

On Mon, Jan 2, 2012 at 11:04 AM, edward choi <mp...@gmail.com> wrote:
> Hi,
>
> I'm having trouble trying to handle lzo compressed files.
> The input files are compressed by LzopCodec provided by hadoop-lzo package.
> And I am using Cloudera 3 update 2 version Hadoop.
>
> I don't need to split the input file, so there is no need telling me to
> index the input file and to use LzoTextInputFormat, unless that is the only
> way to handle lzo-compressed files.

It's OK to use LZO without splitting. There are no issues in doing that.

> I thought all I needed to do was set the job input format as
> "TextInputFormat" and hadoop will take care of the rest.
> When I do that, I don't get any error messages but log files tell me that
> input files are not decompressed at all. Input files are being handled as
> raw text files.

By 'Input files are being handled as raw text files.' I assume you
mean that your mappers are receiving garbage (compressed) input,
without being decoded?

Have you ensured that your io.compression.codecs property in
core-site.xml carries LzoCodec and LzopCodec canonical classnames, and
that your MR cluster was restarted with this change added?

> Is there a specific way to read files with lzo extension?

The above config makes Hadoop recognize the ".lzo" extension and decode
LZO files automatically, so you shouldn't need an explicit way.

-- 
Harsh J
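
For anyone checking the same setting: below is a minimal sketch, not taken from the thread, of a script that reports which of the two hadoop-lzo codec class names are missing from io.compression.codecs. The XML fragment is a hypothetical stand-in for a real core-site.xml, and missing_codecs is an illustrative helper name, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Codec class names from the hadoop-lzo package.
REQUIRED_CODECS = (
    "com.hadoop.compression.lzo.LzoCodec",
    "com.hadoop.compression.lzo.LzopCodec",
)

# Hypothetical core-site.xml content; a real file carries other properties too.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>
"""


def missing_codecs(core_site_xml, required=REQUIRED_CODECS):
    """Return the required codec class names absent from io.compression.codecs."""
    root = ET.fromstring(core_site_xml)
    configured = []
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "io.compression.codecs":
            value = prop.findtext("value", default="")
            configured = [c.strip() for c in value.split(",") if c.strip()]
    return [c for c in required if c not in configured]


# The sample lists LzoCodec but not LzopCodec, so LzopCodec is reported.
print(missing_codecs(SAMPLE_CORE_SITE))
```

To check a real cluster, read the contents of its core-site.xml into a string and pass that in. After changing the file, the MR cluster still needs a restart, as noted above.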

Re: How to read LZO compressed files?

Posted by edward choi <mp...@gmail.com>.
Hi,

The first solution is my last resort. There are so many LZO files that
manual decompression would take quite a while.

As you suggested, I have used LzoTextInputFormat, but I get the following
error:

2012-01-02 16:15:16,668 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2012-01-02 16:15:16,765 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2012-01-02 16:15:16,858 INFO
com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl
library
2012-01-02 16:15:16,860 INFO com.hadoop.compression.lzo.LzoCodec:
Successfully loaded & initialized native-lzo library [hadoop-lzo rev
8aa060526bc6778c971775b832751d2894c73b5f]
2012-01-02 16:15:16,906 INFO
org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs'
truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-01-02 16:15:16,908 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Codec for file
hdfs://lp182:54310/user/hadoop/blog_result/20111106_20111112/part-m-00000.lzo
not found, cannot run
	at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:451)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-01-02 16:15:16,910 INFO org.apache.hadoop.mapred.Task: Runnning
cleanup for the task

I don't understand this, because I do have the LZO codec.
Could you tell me what I am doing wrong here?

Regards,
Ed
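
A note on the error above, offered as a sketch and not as a description of Hadoop's actual code: Hadoop picks a codec by matching the file name against the default extension of each registered codec. In hadoop-lzo, LzoCodec reports ".lzo_deflate" and LzopCodec reports ".lzo", so with only LzoCodec registered a ".lzo" file matches nothing. That would be consistent with both the "Codec for file ... not found" exception here and with TextInputFormat silently reading the files as raw text. The names find_codec and DEFAULT_EXTENSION are hypothetical and used only for illustration.

```python
# Default file name extension reported by each codec class.
DEFAULT_EXTENSION = {
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "com.hadoop.compression.lzo.LzoCodec": ".lzo_deflate",
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
}


def find_codec(filename, registered_codecs):
    """Return the registered codec whose extension ends the file name, else None."""
    for codec in registered_codecs:
        if filename.endswith(DEFAULT_EXTENSION[codec]):
            return codec
    return None


path = "part-m-00000.lzo"

# Registration with LzoCodec only, and with LzopCodec added as well.
lzo_only = [
    "org.apache.hadoop.io.compress.GzipCodec",
    "com.hadoop.compression.lzo.LzoCodec",
]
lzo_and_lzop = lzo_only + ["com.hadoop.compression.lzo.LzopCodec"]

print(find_codec(path, lzo_only))      # no codec matches ".lzo"
print(find_codec(path, lzo_and_lzop))  # LzopCodec matches ".lzo"
```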

2012/1/2 Shi Yu <sh...@uchicago.edu>

> You could decompress the LZO file manually into plain text and then
> use TextInputFormat.
>
> Alternatively, you don't need to index the LZO-compressed file:
> just use LZOInputFormat on the non-indexed files, and the LZO
> file will not be split.
>

Re: How to read LZO compressed files?

Posted by Shi Yu <sh...@uchicago.edu>.
You could decompress the LZO file manually into plain text and then
use TextInputFormat.

Alternatively, you don't need to index the LZO-compressed file:
just use LZOInputFormat on the non-indexed files, and the LZO
file will not be split.