Posted to common-user@hadoop.apache.org by Kevin Weil <ke...@gmail.com> on 2010/01/01 00:44:33 UTC

Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files

Steve, glad you got it figured out.  Interested to hear how it goes, and of
course feel free to post bugs/requests to the github page
www.github.com/kevinweil/hadoop-lzo.

Kevin

On Thu, Dec 31, 2009 at 12:21 PM, Steve Kuo <ku...@gmail.com> wrote:

> Digging around the new Job API with a rested brain, I came up with
>
>             job.setInputFormatClass(LzoTextInputFormat.class);
>
> that solved the problem.
>
> On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo <ku...@gmail.com> wrote:
>
> > I have followed
> > http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
> > and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the
> > requisite hadoop-lzo jar and native .so files.  (The jar and .so files were
> > built from Kevin Weil's git repository.  Thanks Kevin.)  I have configured
> > core-site.xml and mapred-site.xml as instructed to enable lzo for both map
> > and reduce output.  The creation of the lzo index also worked. The last step
> > was to replace TextInputFormat with LzoTextInputFormat.  As I only have
> >
> >     FileInputFormat.addInputPath(jobConf, new Path(inputPath));
> >
> > it was replaced with
> >
> >      LzoTextInputFormat.addInputPath(job, new Path(inputPath));
> >
> > When I ran my MR job, I noticed that the new code was able to read in .lzo
> > input files and decompress them fine.  The output was also lzo compressed.
> > However, only one map task was created for each input .lzo file, indicating
> > that input splitting was not done by LzoTextInputFormat but more likely by
> > its parent, such as FileInputFormat.  There must be a way to ensure
> > LzoTextInputFormat is used in the Map task.  How can this be done?
> >
> > Thanks in advance.
> >
> >
>
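
Editor's sketch of why the first attempt in the quoted message had no effect. In Hadoop, addInputPath is a static method on FileInputFormat, so calling it through the LzoTextInputFormat subclass only registers the path; it does not tell the job which input format to use. The job then falls back to its default (TextInputFormat), which likely explains the one-map-per-.lzo-file behavior. The classes below are hypothetical stand-ins written for illustration, not the real Hadoop or hadoop-lzo classes, and the path is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the Hadoop classes; NOT the real API.
class InputFormatDemo {
    // Minimal stand-in for a job: records input paths and an input format name.
    static class Job {
        final List<String> inputPaths = new ArrayList<>();
        String inputFormat = "TextInputFormat"; // default when nothing is set

        void setInputFormatClass(Class<?> cls) {
            inputFormat = cls.getSimpleName();
        }
    }

    static class FileInputFormat {
        // Static helper: it only records the path on the job.
        static void addInputPath(Job job, String path) {
            job.inputPaths.add(path);
        }
    }

    // Inherits the static addInputPath unchanged.
    static class LzoTextInputFormat extends FileInputFormat { }

    // First attempt from the thread: call the static helper via the subclass.
    static String afterAddInputPathOnly() {
        Job job = new Job();
        LzoTextInputFormat.addInputPath(job, "/data/input.lzo");
        return job.inputFormat;
    }

    // The fix from the thread: set the input format class explicitly.
    static String afterSetInputFormatClass() {
        Job job = new Job();
        FileInputFormat.addInputPath(job, "/data/input.lzo");
        job.setInputFormatClass(LzoTextInputFormat.class);
        return job.inputFormat;
    }

    public static void main(String[] args) {
        System.out.println(afterAddInputPathOnly());
        System.out.println(afterSetInputFormatClass());
    }
}
```

In a real job the equivalent of the second method is the job.setInputFormatClass(LzoTextInputFormat.class) call quoted above, with the input path still added in the usual way.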

Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files

Posted by Kevin Weil <ke...@gmail.com>.

Ted, this might be tough -- the underlying LZO compression algorithm creates
the block offsets.  You can specify the LZO block size, but I don't think
it's exact enough for what you're looking for.

Kevin
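
Editor's sketch of the constraint described above. As far as I understand the indexed-LZO approach, the index is a list of byte offsets in the compressed file where blocks begin, and a split boundary can only be moved to one of those offsets. The compressor picks the offsets, so nothing ties them to a blank line in the uncompressed text. The class name, method name, and offsets below are hypothetical and are not the hadoop-lzo API.

```java
import java.util.Arrays;

// Hypothetical illustration of snapping a split boundary to an LZO block start.
class LzoSplitAlignDemo {
    // Returns the first block offset >= pos, or -1 if there is none.
    // blockOffsets must be sorted in ascending order.
    static long nextBlockStart(long[] blockOffsets, long pos) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // not found: convert to the insertion point
        }
        return i < blockOffsets.length ? blockOffsets[i] : -1;
    }

    public static void main(String[] args) {
        // Made-up compressed-file offsets; real ones come from the index file.
        long[] offsets = {0L, 65_000L, 131_500L, 190_200L};
        // A nominal split boundary at byte 100000 moves to the next block start.
        System.out.println(nextBlockStart(offsets, 100_000L));
    }
}
```

The only candidates for a split start are the entries in the offsets array. Aligning them with household boundaries would mean controlling where the compressor ends each block, which is what the reply above says is hard to do exactly.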


On Mon, Jan 18, 2010 at 11:12 AM, Ted Yu <yu...@gmail.com> wrote:

> For our custom text-based file format, we use an empty line to separate the
> data for different households.
> Can we make each LZO block start align with the start of a new household,
> possibly by modifying LzoIndexRecordWriter?
>
> Thanks

Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files

Posted by Ted Yu <yu...@gmail.com>.

For our custom text-based file format, we use an empty line to separate the
data for different households.
Can we make each LZO block start align with the start of a new household,
possibly by modifying LzoIndexRecordWriter?

Thanks
