Posted to user@spark.apache.org by Gurvinder Singh <gu...@uninett.no> on 2014/07/03 18:24:27 UTC
reading compress lzo files
Hi all,
I am trying to read LZO files. It seems Spark recognizes that the
input file is compressed and picks up the decompressor:
14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev
ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
[.lzo]
But there are two issues:
1. It just gets stuck here without doing anything; I waited 15 minutes
for a small file.
2. I used hadoop-lzo to create the index so that Spark can split the
input into multiple maps, but Spark creates only one mapper.
I am using Python, reading with sc.textFile(). The Spark version is
from git master.
Regards,
Gurvinder
Re: reading compress lzo files
Posted by Sean Owen <so...@cloudera.com>.
Pardon, I was wrong about this. There is actually code distributed
under com.hadoop, and that's where this class is. Oops.
https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java
On Sun, Jul 6, 2014 at 6:37 AM, Sean Owen <so...@cloudera.com> wrote:
> The package com.hadoop.mapreduce certainly looks wrong. If it were a Hadoop
> class, it would start with org.apache.hadoop.
>
> On Jul 6, 2014 4:20 AM, "Nicholas Chammas" <ni...@gmail.com>
> wrote:
>>
>> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>> <gu...@uninett.no> wrote:
>>>
>>> csv =
>>>
>>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>>
>> Does anyone know what the rough equivalent of this would be in the Scala
>> API?
>>
>> I am trying the following, but the first import yields an error on my
>> spark-ec2 cluster:
>>
>> import com.hadoop.mapreduce.LzoTextInputFormat
>> import org.apache.hadoop.io.LongWritable
>> import org.apache.hadoop.io.Text
>>
>>
>> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>> LzoTextInputFormat, LongWritable, Text)
>>
>> scala> import com.hadoop.mapreduce.LzoTextInputFormat
>> <console>:12: error: object hadoop is not a member of package com
>> import com.hadoop.mapreduce.LzoTextInputFormat
>>
>> Nick
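Given Sean's pointer that LzoTextInputFormat is distributed under com.hadoop, a rough Scala equivalent of the Python call might look like the sketch below. This is untested here and assumes the hadoop-lzo jar is on both the driver and executor classpaths; note that the typed Scala API takes Class objects (classOf[...]) rather than the bare type names used in the failing snippet above.

```scala
import org.apache.spark.SparkContext
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat

// Sketch: count lines in an LZO-compressed file via the new Hadoop API.
// Assumes the hadoop-lzo jar is on the driver and executor classpaths.
def countLzoLines(sc: SparkContext, path: String): Long =
  sc.newAPIHadoopFile(
    path,
    classOf[LzoTextInputFormat],   // splittable input format from hadoop-lzo
    classOf[LongWritable],         // key: byte offset of each line
    classOf[Text]                  // value: the line itself
  ).count()
```

The path would be the same s3n:// URL used above; the count forces the read so any classpath problem surfaces immediately.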
Re: reading compress lzo files
Posted by Sean Owen <so...@cloudera.com>.
The package com.hadoop.mapreduce certainly looks wrong. If it were a Hadoop
class, it would start with org.apache.hadoop.
On Jul 6, 2014 4:20 AM, "Nicholas Chammas" <ni...@gmail.com>
wrote:
> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh <
> gurvinder.singh@uninett.no> wrote:
>
> csv =
>> sc.newAPIHadoopFile(opts.input,"com.hadoop
>> .mapreduce.LzoTextInputFormat","org.apache.hadoop
>> .io.LongWritable","org.apache.hadoop.io.Text").count()
>>
> Does anyone know what the rough equivalent of this would be in the Scala
> API?
>
> I am trying the following, but the first import yields an error on my
> spark-ec2 cluster:
>
> import com.hadoop.mapreduce.LzoTextInputFormat
> import org.apache.hadoop.io.LongWritable
> import org.apache.hadoop.io.Text
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", LzoTextInputFormat, LongWritable, Text)
>
> scala> import com.hadoop.mapreduce.LzoTextInputFormat
> <console>:12: error: object hadoop is not a member of package com
> import com.hadoop.mapreduce.LzoTextInputFormat
>
> Nick
>
>
Re: reading compress lzo files
Posted by Nicholas Chammas <ni...@gmail.com>.
I found it quite painful to figure out all the steps required and have
filed SPARK-2394 <https://issues.apache.org/jira/browse/SPARK-2394> to
track improving this. Perhaps I have been going about it the wrong way, but
it seems way more painful than it should be to set up a Spark cluster built
using spark-ec2 to read LZO-compressed input.
Nick
Re: reading compress lzo files
Posted by Andrew Ash <an...@andrewash.com>.
Hi Nick,
The cluster I was working on in those linked messages was a private data
center cluster, not on EC2. I'd imagine that the setup would be pretty
similar, but I'm not familiar with the EC2 init scripts that Spark uses.
Also I upgraded that cluster to 1.0 recently and am continuing to use
LZO-compressed data, so I know there's not a version issue.
Andrew
On Sun, Jul 6, 2014 at 12:02 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:
> I’ve been reading through several pages trying to figure out how to set up
> my spark-ec2 cluster to read LZO-compressed files from S3.
>
> -
> http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=sAw_mgrmnw@mail.gmail.com%3E
> -
> http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGA6f86qcSOwP7k_r+8R-DGBmj3gz+4xLJZjpr90DbNxg@mail.gmail.com%3E
> - https://github.com/twitter/hadoop-lzo
> -
> http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> It seems that several things may have changed since the above pages were
> put together, so getting this to work is more work than I expected.
>
> Is there a simple set of instructions somewhere one can follow to get a
> Spark EC2 cluster reading LZO-compressed input files correctly?
>
> Nick
>
>
>
> On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> Ah, indeed it looks like I need to install this separately
>> <https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
>> as it is not part of the core.
>>
>> Nick
>>
>>
>>
>> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <
>> gurvinder.singh@uninett.no> wrote:
>>
>>> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
>>> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>>> > <gurvinder.singh@uninett.no <ma...@uninett.no>>
>>> wrote:
>>> >
>>> > csv =
>>> >
>>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>>> >
>>> > Does anyone know what the rough equivalent of this would be in the
>>> Scala
>>> > API?
>>> >
>>> I am not sure; I haven't tested it using Scala. The
>>> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
>>> https://github.com/twitter/hadoop-lzo
>>>
>>> I have installed it from the Cloudera "hadoop-lzo" package along with the
>>> liblzo2-2 Debian package on all of my workers. Make sure you have
>>> hadoop-lzo.jar on the classpath for Spark.
>>>
>>> - Gurvinder
>>>
>>> > I am trying the following, but the first import yields an error on my
>>> > |spark-ec2| cluster:
>>> >
>>> > |import com.hadoop.mapreduce.LzoTextInputFormat
>>> > import org.apache.hadoop.io.LongWritable
>>> > import org.apache.hadoop.io.Text
>>> >
>>> >
>>> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>>> LzoTextInputFormat, LongWritable, Text)
>>> > |
>>> >
>>> > |scala> import com.hadoop.mapreduce.LzoTextInputFormat
>>> > <console>:12: error: object hadoop is not a member of package com
>>> > import com.hadoop.mapreduce.LzoTextInputFormat
>>> > |
>>> >
>>> > Nick
>>> >
>>> >
>>>
>>>
>>>
>>
>
Re: reading compress lzo files
Posted by Nicholas Chammas <ni...@gmail.com>.
I’ve been reading through several pages trying to figure out how to set up
my spark-ec2 cluster to read LZO-compressed files from S3.
-
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=sAw_mgrmnw@mail.gmail.com%3E
-
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGA6f86qcSOwP7k_r+8R-DGBmj3gz+4xLJZjpr90DbNxg@mail.gmail.com%3E
- https://github.com/twitter/hadoop-lzo
-
http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
It seems that several things may have changed since the above pages were
put together, so getting this to work is more work than I expected.
Is there a simple set of instructions somewhere one can follow to get a
Spark EC2 cluster reading LZO-compressed input files correctly?
Nick
On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:
> Ah, indeed it looks like I need to install this separately
> <https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
> as it is not part of the core.
>
> Nick
>
>
>
> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <
> gurvinder.singh@uninett.no> wrote:
>
>> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
>> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>> > <gurvinder.singh@uninett.no <ma...@uninett.no>> wrote:
>> >
>> > csv =
>> >
>> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>> >
>> > Does anyone know what the rough equivalent of this would be in the Scala
>> > API?
>> >
>> I am not sure; I haven't tested it using Scala. The
>> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
>> https://github.com/twitter/hadoop-lzo
>>
>> I have installed it from the Cloudera "hadoop-lzo" package along with the
>> liblzo2-2 Debian package on all of my workers. Make sure you have
>> hadoop-lzo.jar on the classpath for Spark.
>>
>> - Gurvinder
>>
>> > I am trying the following, but the first import yields an error on my
>> > |spark-ec2| cluster:
>> >
>> > |import com.hadoop.mapreduce.LzoTextInputFormat
>> > import org.apache.hadoop.io.LongWritable
>> > import org.apache.hadoop.io.Text
>> >
>> >
>> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>> LzoTextInputFormat, LongWritable, Text)
>> > |
>> >
>> > |scala> import com.hadoop.mapreduce.LzoTextInputFormat
>> > <console>:12: error: object hadoop is not a member of package com
>> > import com.hadoop.mapreduce.LzoTextInputFormat
>> > |
>> >
>> > Nick
>> >
>> >
>>
>>
>>
>
Re: reading compress lzo files
Posted by Nicholas Chammas <ni...@gmail.com>.
Ah, indeed it looks like I need to install this separately
<https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
as it is not part of the core.
Nick
On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <gu...@uninett.no>
wrote:
> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
> > <gurvinder.singh@uninett.no <ma...@uninett.no>> wrote:
> >
> > csv =
> >
> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
> >
> > Does anyone know what the rough equivalent of this would be in the Scala
> > API?
> >
> I am not sure; I haven't tested it using Scala. The
> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
> https://github.com/twitter/hadoop-lzo
>
> I have installed it from the Cloudera "hadoop-lzo" package along with the
> liblzo2-2 Debian package on all of my workers. Make sure you have
> hadoop-lzo.jar on the classpath for Spark.
>
> - Gurvinder
>
> > I am trying the following, but the first import yields an error on my
> > |spark-ec2| cluster:
> >
> > |import com.hadoop.mapreduce.LzoTextInputFormat
> > import org.apache.hadoop.io.LongWritable
> > import org.apache.hadoop.io.Text
> >
> >
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
> LzoTextInputFormat, LongWritable, Text)
> > |
> >
> > |scala> import com.hadoop.mapreduce.LzoTextInputFormat
> > <console>:12: error: object hadoop is not a member of package com
> > import com.hadoop.mapreduce.LzoTextInputFormat
> > |
> >
> > Nick
> >
> >
>
>
>
Re: reading compress lzo files
Posted by Gurvinder Singh <gu...@uninett.no>.
On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
> <gurvinder.singh@uninett.no <ma...@uninett.no>> wrote:
>
> csv =
> sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
>
> Does anyone know what the rough equivalent of this would be in the Scala
> API?
>
I am not sure; I haven't tested it using Scala. The
com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
https://github.com/twitter/hadoop-lzo
I have installed it from the Cloudera "hadoop-lzo" package along with the
liblzo2-2 Debian package on all of my workers. Make sure you have
hadoop-lzo.jar on the classpath for Spark.
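As a concrete illustration of the classpath advice above: one way to wire it up on Spark of this era might be the spark-env.sh fragment below. The paths are hypothetical, not from the thread; adjust them to wherever your distribution installs hadoop-lzo and its native libraries, and mirror the setup on every worker.

```shell
# Hypothetical paths -- point these at your actual hadoop-lzo install.
# The jar and the native libs must be present on every worker as well.
export SPARK_CLASSPATH=/usr/lib/hadoop/lib/hadoop-lzo.jar
export LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:$LD_LIBRARY_PATH
```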
- Gurvinder
> I am trying the following, but the first import yields an error on my
> |spark-ec2| cluster:
>
> |import com.hadoop.mapreduce.LzoTextInputFormat
> import org.apache.hadoop.io.LongWritable
> import org.apache.hadoop.io.Text
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", LzoTextInputFormat, LongWritable, Text)
> |
>
> |scala> import com.hadoop.mapreduce.LzoTextInputFormat
> <console>:12: error: object hadoop is not a member of package com
> import com.hadoop.mapreduce.LzoTextInputFormat
> |
>
> Nick
>
>
Re: reading compress lzo files
Posted by Nicholas Chammas <ni...@gmail.com>.
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh <gu...@uninett.no>
wrote:
csv =
> sc.newAPIHadoopFile(opts.input,"com.hadoop
> .mapreduce.LzoTextInputFormat","org.apache.hadoop
> .io.LongWritable","org.apache.hadoop.io.Text").count()
>
Does anyone know what the rough equivalent of this would be in the Scala
API?
I am trying the following, but the first import yields an error on my
spark-ec2 cluster:
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
LzoTextInputFormat, LongWritable, Text)
scala> import com.hadoop.mapreduce.LzoTextInputFormat
<console>:12: error: object hadoop is not a member of package com
import com.hadoop.mapreduce.LzoTextInputFormat
Nick
Re: reading compress lzo files
Posted by Gurvinder Singh <gu...@uninett.no>.
An update on this issue: Spark is now able to read the LZO file, recognize
that it has an index, and start multiple map tasks. You need to use the
following function instead of textFile:
csv =
sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text").count()
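That one-liner can be expanded into the sketch below, with each argument spelled out. The input path and variable names here are illustrative (not from the thread), and the call assumes the hadoop-lzo jar plus native liblzo2 are available on every worker.

```python
# Sketch: read an indexed .lzo file so Spark can split it across tasks.
# Path is illustrative; requires hadoop-lzo + liblzo2 on all workers.
pairs = sc.newAPIHadoopFile(
    "hdfs:///data/logs.lzo",                    # indexed LZO input
    "com.hadoop.mapreduce.LzoTextInputFormat",  # splittable LZO input format
    "org.apache.hadoop.io.LongWritable",        # key: byte offset of each line
    "org.apache.hadoop.io.Text",                # value: the line of text
)
lines = pairs.map(lambda kv: kv[1])  # drop the offset keys, keep the text
print(lines.count())
```

The returned RDD holds (offset, line) pairs rather than plain strings, which is why the map to kv[1] is needed before treating it like the output of textFile.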
- Gurvinder
On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
> Hi all,
>
> I am trying to read LZO files. It seems Spark recognizes that the
> input file is compressed and picks up the decompressor:
>
> 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
> native-lzo library [hadoop-lzo rev
> ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
> 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
> deprecated. Instead, use io.native.lib.available
> 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
> [.lzo]
>
> But there are two issues:
>
> 1. It just gets stuck here without doing anything; I waited 15 minutes
> for a small file.
> 2. I used hadoop-lzo to create the index so that Spark can split the
> input into multiple maps, but Spark creates only one mapper.
>
> I am using Python, reading with sc.textFile(). The Spark version is
> from git master.
>
> Regards,
> Gurvinder
>