Posted to user@spark.apache.org by Gurvinder Singh <gu...@uninett.no> on 2014/07/03 18:24:27 UTC

reading compressed LZO files

Hi all,

I am trying to read LZO files. It seems Spark recognizes that the
input file is compressed and gets the decompressor:

14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev
ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
[.lzo]

But there are two issues:

1. It just gets stuck here without doing anything; I waited 15 minutes for
a small file.
2. I used hadoop-lzo to create the index so that Spark can split the input
into multiple maps, but Spark creates only one mapper.

I am using Python, reading with sc.textFile(). The Spark version is from
the git master.

Regards,
Gurvinder

Re: reading compressed LZO files

Posted by Sean Owen <so...@cloudera.com>.
Pardon, I was wrong about this. There is actually code distributed
under com.hadoop, and that's where this class is. Oops.

https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java
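
For anyone still after the Scala equivalent: a minimal, untested sketch,
assuming the hadoop-lzo jar is on the driver and executor classpath. Note
that the new-API method takes Class objects, so you pass classOf[...]
rather than the bare type names used in the snippet quoted below:

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Returns an RDD of (byte offset, line) pairs, like the Python call.
val records = sc.newAPIHadoopFile(
  "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
records.count()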

On Sun, Jul 6, 2014 at 6:37 AM, Sean Owen <so...@cloudera.com> wrote:
> The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop
> class, it starts with org.apache.hadoop.
>
> On Jul 6, 2014 4:20 AM, "Nicholas Chammas" <ni...@gmail.com>
> wrote:
>>
>> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>> <gu...@uninett.no> wrote:
>>>
>>> csv =
>>>
>>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>>
>> Does anyone know what the rough equivalent of this would be in the Scala
>> API?
>>
>> I am trying the following, but the first import yields an error on my
>> spark-ec2 cluster:
>>
>> import com.hadoop.mapreduce.LzoTextInputFormat
>> import org.apache.hadoop.io.LongWritable
>> import org.apache.hadoop.io.Text
>>
>>
>> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>> LzoTextInputFormat, LongWritable, Text)
>>
>> scala> import com.hadoop.mapreduce.LzoTextInputFormat
>> <console>:12: error: object hadoop is not a member of package com
>>        import com.hadoop.mapreduce.LzoTextInputFormat
>>
>> Nick

Re: reading compressed LZO files

Posted by Sean Owen <so...@cloudera.com>.
The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop
class, it starts with org.apache.hadoop.
On Jul 6, 2014 4:20 AM, "Nicholas Chammas" <ni...@gmail.com>
wrote:

> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh <
> gurvinder.singh@uninett.no> wrote:
>
> csv =
>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>>
> Does anyone know what the rough equivalent of this would be in the Scala
> API?
>
> I am trying the following, but the first import yields an error on my
> spark-ec2 cluster:
>
> import com.hadoop.mapreduce.LzoTextInputFormat
> import org.apache.hadoop.io.LongWritable
> import org.apache.hadoop.io.Text
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", LzoTextInputFormat, LongWritable, Text)
>
> scala> import com.hadoop.mapreduce.LzoTextInputFormat
> <console>:12: error: object hadoop is not a member of package com
>        import com.hadoop.mapreduce.LzoTextInputFormat
>
> Nick
>

Re: reading compressed LZO files

Posted by Nicholas Chammas <ni...@gmail.com>.
I found it quite painful to figure out all the steps required and have
filed SPARK-2394 <https://issues.apache.org/jira/browse/SPARK-2394> to
track improving this. Perhaps I have been going about it the wrong way, but
it seems way more painful than it should be to set up a Spark cluster built
using spark-ec2 to read LZO-compressed input.

Nick

Re: reading compressed LZO files

Posted by Andrew Ash <an...@andrewash.com>.
Hi Nick,

The cluster I was working on in those linked messages was a private data
center cluster, not on EC2.  I'd imagine that the setup would be pretty
similar, but I'm not familiar with the EC2 init scripts that Spark uses.

Also, I upgraded that cluster to Spark 1.0 recently and am continuing to use
LZO-compressed data, so I know it's not a version issue.

Andrew


On Sun, Jul 6, 2014 at 12:02 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> I’ve been reading through several pages trying to figure out how to set up
> my spark-ec2 cluster to read LZO-compressed files from S3.
>
>    - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=sAw_mgrmnw@mail.gmail.com%3E
>    - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGA6f86qcSOwP7k_r+8R-DGBmj3gz+4xLJZjpr90DbNxg@mail.gmail.com%3E
>    - https://github.com/twitter/hadoop-lzo
>    - http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> It seems that several things may have changed since the above pages were
> put together, so getting this to work is more work than I expected.
>
> Is there a simple set of instructions somewhere one can follow to get a
> Spark EC2 cluster reading LZO-compressed input files correctly?
>
> Nick
> ​
>
>
> On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> Ah, indeed it looks like I need to install this separately
>> <https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
>> as it is not part of the core.
>>
>> Nick
>>
>>
>>
>> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <
>> gurvinder.singh@uninett.no> wrote:
>>
>>> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
>>> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>>> > <gurvinder.singh@uninett.no <ma...@uninett.no>>
>>> wrote:
>>> >
>>> >     csv =
>>> >
>>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>>> >
>>> > Does anyone know what the rough equivalent of this would be in the
>>> Scala
>>> > API?
>>> >
>>> I am not sure; I haven't tested it using Scala. The
>>> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
>>> https://github.com/twitter/hadoop-lzo
>>>
>>> I have installed it from the Cloudera "hadoop-lzo" package, along with the
>>> liblzo2-2 Debian package, on all of my workers. Make sure you have
>>> hadoop-lzo.jar on your classpath for Spark.
>>>
>>> - Gurvinder
>>>
>>> > I am trying the following, but the first import yields an error on my
>>> > |spark-ec2| cluster:
>>> >
>>> > |import com.hadoop.mapreduce.LzoTextInputFormat
>>> > import org.apache.hadoop.io.LongWritable
>>> > import org.apache.hadoop.io.Text
>>> >
>>> >
>>> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>>> LzoTextInputFormat, LongWritable, Text)
>>> > |
>>> >
>>> > |scala> import com.hadoop.mapreduce.LzoTextInputFormat
>>> > <console>:12: error: object hadoop is not a member of package com
>>> >        import com.hadoop.mapreduce.LzoTextInputFormat
>>> > |
>>> >
>>> > Nick
>>> >
>>>
>>>
>>>
>>
>

Re: reading compressed LZO files

Posted by Nicholas Chammas <ni...@gmail.com>.
I’ve been reading through several pages trying to figure out how to set up
my spark-ec2 cluster to read LZO-compressed files from S3.

   - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=sAw_mgrmnw@mail.gmail.com%3E
   - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGA6f86qcSOwP7k_r+8R-DGBmj3gz+4xLJZjpr90DbNxg@mail.gmail.com%3E
   - https://github.com/twitter/hadoop-lzo
   - http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

It seems that several things may have changed since the above pages were
put together, so getting this to work is more work than I expected.

Is there a simple set of instructions somewhere one can follow to get a
Spark EC2 cluster reading LZO-compressed input files correctly?

Nick


On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> Ah, indeed it looks like I need to install this separately
> <https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
> as it is not part of the core.
>
> Nick
>
>
>
> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <
> gurvinder.singh@uninett.no> wrote:
>
>> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
>> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>> > <gurvinder.singh@uninett.no <ma...@uninett.no>> wrote:
>> >
>> >     csv =
>> >
>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>> >
>> > Does anyone know what the rough equivalent of this would be in the Scala
>> > API?
>> >
>> I am not sure; I haven't tested it using Scala. The
>> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
>> https://github.com/twitter/hadoop-lzo
>>
>> I have installed it from the Cloudera "hadoop-lzo" package, along with the
>> liblzo2-2 Debian package, on all of my workers. Make sure you have
>> hadoop-lzo.jar on your classpath for Spark.
>>
>> - Gurvinder
>>
>> > I am trying the following, but the first import yields an error on my
>> > |spark-ec2| cluster:
>> >
>> > |import com.hadoop.mapreduce.LzoTextInputFormat
>> > import org.apache.hadoop.io.LongWritable
>> > import org.apache.hadoop.io.Text
>> >
>> >
>> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>> LzoTextInputFormat, LongWritable, Text)
>> > |
>> >
>> > |scala> import com.hadoop.mapreduce.LzoTextInputFormat
>> > <console>:12: error: object hadoop is not a member of package com
>> >        import com.hadoop.mapreduce.LzoTextInputFormat
>> > |
>> >
>> > Nick
>> >
>>
>>
>>
>

Re: reading compressed LZO files

Posted by Nicholas Chammas <ni...@gmail.com>.
Ah, indeed it looks like I need to install this separately
<https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
as it is not part of the core.

Nick



On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <gu...@uninett.no>
wrote:

> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
> > <gurvinder.singh@uninett.no <ma...@uninett.no>> wrote:
> >
> >     csv =
> >
> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
> >
> > Does anyone know what the rough equivalent of this would be in the Scala
> > API?
> >
> I am not sure; I haven't tested it using Scala. The
> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
> https://github.com/twitter/hadoop-lzo
>
> I have installed it from the Cloudera "hadoop-lzo" package, along with the
> liblzo2-2 Debian package, on all of my workers. Make sure you have
> hadoop-lzo.jar on your classpath for Spark.
>
> - Gurvinder
>
> > I am trying the following, but the first import yields an error on my
> > |spark-ec2| cluster:
> >
> > |import com.hadoop.mapreduce.LzoTextInputFormat
> > import org.apache.hadoop.io.LongWritable
> > import org.apache.hadoop.io.Text
> >
> >
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
> LzoTextInputFormat, LongWritable, Text)
> > |
> >
> > |scala> import com.hadoop.mapreduce.LzoTextInputFormat
> > <console>:12: error: object hadoop is not a member of package com
> >        import com.hadoop.mapreduce.LzoTextInputFormat
> > |
> >
> > Nick
> >
>
>
>

Re: reading compressed LZO files

Posted by Gurvinder Singh <gu...@uninett.no>.
On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
> <gurvinder.singh@uninett.no <ma...@uninett.no>> wrote:
> 
>     csv =
>     sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
> 
> Does anyone know what the rough equivalent of this would be in the Scala
> API?
> 
I am not sure; I haven't tested it using Scala. The
com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
https://github.com/twitter/hadoop-lzo

I have installed it from the Cloudera "hadoop-lzo" package, along with the
liblzo2-2 Debian package, on all of my workers. Make sure you have
hadoop-lzo.jar on your classpath for Spark.

- Gurvinder
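
For illustration, one way to get that jar onto the classpath is through
SparkConf when constructing the context. This is a minimal sketch; the jar
path is hypothetical and depends on where your hadoop-lzo package installs
it, and the extraClassPath settings assume Spark 1.0 or later:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical install location; check where your distribution puts the jar.
val lzoJar = "/usr/lib/hadoop/lib/hadoop-lzo.jar"

val conf = new SparkConf()
  .setAppName("lzo-read")
  .set("spark.executor.extraClassPath", lzoJar)  // workers
  .set("spark.driver.extraClassPath", lzoJar)    // driver
val sc = new SparkContext(conf)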

> I am trying the following, but the first import yields an error on my
> |spark-ec2| cluster:
> 
> |import com.hadoop.mapreduce.LzoTextInputFormat
> import org.apache.hadoop.io.LongWritable
> import org.apache.hadoop.io.Text
> 
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", LzoTextInputFormat, LongWritable, Text)
> |
> 
> |scala> import com.hadoop.mapreduce.LzoTextInputFormat
> <console>:12: error: object hadoop is not a member of package com
>        import com.hadoop.mapreduce.LzoTextInputFormat
> |
> 
> Nick
> 



Re: reading compressed LZO files

Posted by Nicholas Chammas <ni...@gmail.com>.
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh <gu...@uninett.no>
wrote:

csv =
> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>
Does anyone know what the rough equivalent of this would be in the Scala
API?

I am trying the following, but the first import yields an error on my
spark-ec2 cluster:

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
LzoTextInputFormat, LongWritable, Text)

scala> import com.hadoop.mapreduce.LzoTextInputFormat
<console>:12: error: object hadoop is not a member of package com
       import com.hadoop.mapreduce.LzoTextInputFormat

Nick

Re: reading compressed LZO files

Posted by Gurvinder Singh <gu...@uninett.no>.
An update on this issue: Spark is now able to read the LZO file, recognize
that it has an index, and start multiple map tasks. You need to use the
following function instead of textFile:

csv =
sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()

- Gurvinder
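
As a rough Scala sketch of the same read (untested; it assumes the
hadoop-lzo classes are on the classpath and that a .lzo.index file,
created with hadoop-lzo's LzoIndexer, sits next to each .lzo file so the
input can be split):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Hypothetical input path. Each record is a (byte offset, line) pair;
// copy the Text out as a String, since Hadoop reuses Writable objects
// between records.
val count = sc.newAPIHadoopFile("hdfs:///data/logs.lzo",
    classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) => line.toString }
  .count()
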
On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
> Hi all,
> 
> I am trying to read LZO files. It seems Spark recognizes that the
> input file is compressed and gets the decompressor:
> 
> 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
> native-lzo library [hadoop-lzo rev
> ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
> 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
> deprecated. Instead, use io.native.lib.available
> 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
> [.lzo]
> 
> But there are two issues:
> 
> 1. It just gets stuck here without doing anything; I waited 15 minutes
> for a small file.
> 2. I used hadoop-lzo to create the index so that Spark can split
> the input into multiple maps, but Spark creates only one mapper.
> 
> I am using Python, reading with sc.textFile(). The Spark version is
> from the git master.
> 
> Regards,
> Gurvinder
>