Posted to user@spark.apache.org by Vipul Pandey <vi...@gmail.com> on 2014/01/22 07:56:54 UTC

Re: reading LZO compressed file in spark

Hi Rajeev,

Did you get past this exception?

Thanks,
Vipul


On Dec 26, 2013, at 12:48 PM, Rajeev Srivastava <ra...@silverline-da.com> wrote:

> Hi Andrew,
>      Thanks for your example
> I used your command and I get the following errors from the workers (missing codec on the workers, I guess).
> How do I get the codecs over to the worker machines?
> regards
> Rajeev
> *******************************************************************
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run
>         at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
>         at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
>         at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>         at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
>         at spark.RDD.iterator(RDD.scala:196)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> 13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on executor 4: hadoop02 (preferred)
> 13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
> 13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run [duplicate 1]
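> 
> (My guess is that the hadoop-lzo jar and its native libraries need to be visible on every worker, not just on the driver, and that the LZO codecs need to be registered in the Hadoop configuration the executors see. Something along these lines in conf/spark-env.sh on each worker might do it -- the jar path is the one from my streaming job, the native-lib path is only a guess for the CDH parcel layout, and I have not verified any of this yet:
> 
> export SPARK_CLASSPATH=/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
> export SPARK_LIBRARY_PATH=/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
> 
> plus io.compression.codecs listing the LZO codecs in core-site.xml on every node.)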
> 
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
> 
> 
> On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <an...@andrewash.com> wrote:
> Hi Berkeley,
> 
> By RF=3 I mean a replication factor of 3 on the files in HDFS, so each block is stored 3 times across the cluster.  It's a pretty standard choice for the replication factor because it gives a hardware team time to replace bad hardware in the case of failure.  With RF=3 the cluster can sustain the failure of any two nodes without data loss, but losing a third node may cause data loss.
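> 
> (For context, the replication factor is a per-file HDFS setting; for example, hadoop fs -setrep -w 3 /path/to/myfile.lzo sets it to 3 on an existing file.)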
> 
> When reading the LZO files with the newAPIHadoopFile() call I showed below, the data in the RDD is already decompressed -- it transparently looks the same to my Spark program as if I was operating on an uncompressed file.
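> 
> In other words you can go straight to plain strings -- e.g. (a sketch, same call as below, just mapping the Text values out):
> 
> val lines = sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>     classOf[org.apache.hadoop.io.LongWritable],
>     classOf[org.apache.hadoop.io.Text]).map(_._2.toString)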
> 
> Cheers,
> Andrew
> 
> 
> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <be...@firestickgames.com> wrote:
> Andrew, This is great. 
> 
> Excuse my ignorance, but what do you mean by RF=3? Also, after reading the LZO files, are you able to access the contents directly, or do you have to decompress them after reading them?
> 
> Sent from my iPhone
> 
> On Dec 24, 2013, at 12:03 AM, Andrew Ash <an...@andrewash.com> wrote:
> 
>> Hi Rajeev,
>> 
>> I'm not sure if you ever got it working, but I just got mine up and going.  If you just use sc.textFile(...) the file will be read but the LZO index won't be used so a .count() on my 1B+ row file took 2483s.  When I ran it like this though:
>> 
>> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>     classOf[org.apache.hadoop.io.LongWritable],
>>     classOf[org.apache.hadoop.io.Text]).count
>> 
>> the LZO index file was used and the .count() took just 101s.  For reference this file is 43GB when .gz compressed and 78.4GB when .lzo compressed.  I have RF=3 and this is across 4 pretty beefy machines with Hadoop DataNodes and Spark both running on each machine.
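>> 
>> (In case it helps anyone: if you don't already have the .index files sitting next to the .lzo files, hadoop-lzo ships an indexer. I believe the invocation is roughly the following -- the jar path depends on your install:
>> 
>> hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer hdfs:///path/to/myfile.lzo
>> 
>> There is also a single-process com.hadoop.compression.lzo.LzoIndexer for smaller files.)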
>> 
>> Cheers!
>> Andrew
>> 
>> 
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <ra...@silverline-da.com> wrote:
>> Thanks for your suggestion. I will try this and update by late evening.
>> 
>> regards
>> Rajeev
>> 
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>> 
>> 
>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <an...@andrewash.com> wrote:
>> Hi Rajeev,
>> 
>> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above, while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>> 
>> I think the way to use this in Spark would be to use the SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with the path and the InputFormat as parameters.  Can you give those a shot?
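>> 
>> Roughly something like this (untested on my side; the path is just the one from your streaming job):
>> 
>> sc.newAPIHadoopFile("hdfs:///tmp/ldpc.sstv3.lzo",
>>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>     classOf[org.apache.hadoop.io.LongWritable],
>>     classOf[org.apache.hadoop.io.Text])
>> 
>> or, with the old-API format you already used for streaming:
>> 
>> sc.hadoopFile("hdfs:///tmp/ldpc.sstv3.lzo",
>>     classOf[com.hadoop.mapred.DeprecatedLzoTextInputFormat],
>>     classOf[org.apache.hadoop.io.LongWritable],
>>     classOf[org.apache.hadoop.io.Text])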
>> 
>> Andrew
>> 
>> 
>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <ra...@silverline-da.com> wrote:
>> Hi Stephen,
>>      I tried the same lzo file with a simple hadoop script,
>> and it seems to work fine:
>> 
>> HADOOP_HOME=/usr/lib/hadoop
>> /usr/bin/hadoop  jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>> -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>> -input /tmp/ldpc.sstv3.lzo \
>> -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>> -output wc_test \
>> -mapper 'cat' \
>> -reducer 'wc -l'
>> 
>> This means Hadoop is able to handle the lzo file correctly.
>> 
>> Can you suggest what I should do in Spark to make it work?
>> 
>> regards
>> Rajeev
>> 
>> 
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>> 
>> 
>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <st...@gmail.com> wrote:
>> 
>> > System.setProperty("spark.io.compression.codec",
>> > "com.hadoop.compression.lzo.LzopCodec")
>> 
>> This spark.io.compression.codec is a completely different setting than the
>> codecs that are used for reading/writing from HDFS. (It is for compressing
>> Spark's internal/non-HDFS intermediate output.)
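>> 
>> For reference, the HDFS-side codecs are picked up from the Hadoop configuration instead (io.compression.codecs in core-site.xml); for LZO that usually looks something like this, though I'm going from memory:
>> 
>> <property>
>>   <name>io.compression.codecs</name>
>>   <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
>> </property>
>> <property>
>>   <name>io.compression.codec.lzo.class</name>
>>   <value>com.hadoop.compression.lzo.LzoCodec</value>
>> </property>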
>> 
>> > Hope this helps and someone can help read a LZO file
>> 
>> Spark just uses the regular Hadoop File System API, so any issues with reading
>> LZO files would be Hadoop issues. I would search in the Hadoop issue tracker
>> and look for information on using LZO files with Hadoop/Hive; whatever works
>> for them should magically work for Spark as well.
>> 
>> This looks like a good place to start:
>> 
>> https://github.com/twitter/hadoop-lzo
>> 
>> IANAE, but I would try passing one of these:
>> 
>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>> 
>> To the SparkContext.hadoopFile method.
>> 
>> - Stephen
>> 
>> 
>> 
>> 
>> 
> 
> 


Re: reading LZO compressed file in spark

Posted by Rajeev Srivastava <ra...@silverline-da.com>.
Hi Vipul,
Andrew Ash suggested an answer, which I have yet to try.
Apparently his approach worked for his LZO files. I don't think I will be
able to try his suggestions before Feb.

Do share if his solution works for you.
regards
Rajeev


Rajeev Srivastava
Silverline Design Inc
2118 Walsh ave, suite 204
Santa Clara, CA, 95050
cell : 408-409-0940

