Posted to user@spark.apache.org by Mahebub Sayyed <ma...@gmail.com> on 2014/07/13 15:43:40 UTC

Error in JavaKafkaWordCount.java example

Hello,

I am referring to the following example:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

I am getting the following compilation error:
\example\JavaKafkaWordCount.java:[62,70] error: cannot access ClassTag

Please help me.
Thanks in advance.

-- 
Regards,
Mahebub Sayyed

Re: Error in JavaKafkaWordCount.java example

Posted by Tathagata Das <ta...@gmail.com>.
Are you compiling it within Spark using Spark's recommended build (see the docs
web page)? Or are you compiling it in your own project? In the latter case,
make sure you are using Scala 2.10.4.
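
For the standalone-project case, here is a minimal build.sbt sketch I'd check
against (the Spark artifact version below is illustrative and an assumption on
my part; match it to your cluster). ClassTag was introduced in Scala 2.10, so a
2.9.x scala-library on the compile classpath typically produces exactly this
"cannot access ClassTag" error.

// build.sbt (sketch, unverified)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // core streaming API; "provided" if you submit to an installed Spark
  "org.apache.spark" %% "spark-streaming" % "1.0.1" % "provided",
  // Kafka receiver used by JavaKafkaWordCount
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.1"
)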

TD


On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed <ma...@gmail.com>
wrote:

> Hello,
>
> I am referring to the following example:
>
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>
> I am getting the following compilation error:
> \example\JavaKafkaWordCount.java:[62,70] error: cannot access ClassTag
>
> Please help me.
> Thanks in advance.
>
> --
> Regards,
> Mahebub Sayyed
>

Re: Problem reading in LZO compressed files

Posted by Ognen Duzlevski <og...@gmail.com>.
Nicholas, thanks nevertheless! I am going to spend some time to try and 
figure this out and report back :-)
Ognen

On 7/13/14, 7:05 PM, Nicholas Chammas wrote:
>
> I actually never got this to work, which is part of the reason why I 
> filed that JIRA. Apart from using --jars when starting the shell, I 
> don’t have any more pointers for you. :(
>
>
>
> On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski 
> <ognen.duzlevski@gmail.com> wrote:
>
>     Nicholas,
>
>     Thanks!
>
>     How do I make spark assemble against a local version of Hadoop?
>
>     I have 2.4.1 running on a test cluster and I did
>     "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly" but all it did was
>     pull in hadoop-2.4.1 dependencies via sbt (which is sufficient for
>     using a 2.4.1 HDFS). I am guessing my local version of Hadoop
>     libraries/jars is not used. Alternatively, how do I add the
>     hadoop-gpl-compression-0.1.0.jar (responsible for the lzo stuff)
>     to this hand assembled Spark?
>
>     I am running the spark-shell like this:
>     bin/spark-shell --jars
>     /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
>
>     and getting this:
>
>     scala> val f =
>     sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo
>     <http://10.10.0.98:54310/data/1gram.lzo>",classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
>     14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called
>     with curMem=0, maxMem=311387750
>     14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as
>     values to memory (estimated size 211.0 KB, free 296.8 MB)
>     f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable,
>     org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile
>     at <console>:12
>
>     scala> f.take(1)
>     14/07/13 16:53:08 INFO FileInputFormat: Total input paths to
>     process : 1
>     java.lang.IncompatibleClassChangeError: Found interface
>     org.apache.hadoop.mapreduce.JobContext, but class was expected
>         at
>     com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)
>
>     which makes me think something is not linked to something properly
>     (not a Java expert unfortunately).
>
>     Thanks!
>     Ognen
>
>
>
>     On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
>>
>>     If you’re still seeing gibberish, it’s because Spark is not using
>>     the LZO libraries properly. In your case, I believe you should be
>>     calling newAPIHadoopFile() instead of textFile().
>>
>>     For example:
>>
>>     sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>>        classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>        classOf[org.apache.hadoop.io.LongWritable],
>>        classOf[org.apache.hadoop.io.Text])
>>
>>     On a side note, here’s a related JIRA issue: SPARK-2394: Make it
>>     easier to read LZO-compressed files from EC2 clusters
>>     <https://issues.apache.org/jira/browse/SPARK-2394>
>>
>>     Nick
>>
>>
>>
>>     On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski
>>     <ognen.duzlevski@gmail.com> wrote:
>>
>>         Hello,
>>
>>         I have been trying to play with the Google ngram dataset
>>         provided by Amazon in form of LZO compressed files.
>>
>>         I am having trouble understanding what is going on ;). I have
>>         added the compression jar and native library to the
>>         underlying Hadoop/HDFS installation, restarted the name node
>>         and the datanodes, Spark can obviously see the file but I get
>>         gibberish on a read. Any ideas?
>>
>>         See output below:
>>
>>         14/07/13 14:39:19 INFO SparkContext: Added JAR
>>         file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
>>         at
>>         http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with
>>         timestamp 1405262359777
>>         14/07/13 14:39:20 INFO SparkILoop: Created spark context..
>>         Spark context available as sc.
>>
>>         scala> val f =
>>         sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo
>>         <http://10.10.0.98:54310/data/1gram.lzo>")
>>         14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793)
>>         called with curMem=0, maxMem=311387750
>>         14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored
>>         as values to memory (estimated size 160.0 KB, free 296.8 MB)
>>         f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at
>>         textFile at <console>:12
>>
>>         scala> f.take(10)
>>         14/07/13 14:39:43 INFO SparkContext: Job finished: take at
>>         <console>:15, took 0.419708348 s
>>         res0: Array[String] =
>>         Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
>>         ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
>>         �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �?
>>         �? �? �?
>>         �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
>>         �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>>
>>         Thanks!
>>         Ognen
>>
>>
>
>


Re: Problem reading in LZO compressed files

Posted by Nicholas Chammas <ni...@gmail.com>.
I actually never got this to work, which is part of the reason why I filed
that JIRA. Apart from using --jars when starting the shell, I don’t have any
more pointers for you. :(
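
For what it's worth, a sketch of the invocation I would try, with the jar on
both the executor and driver classpaths (--driver-class-path is a spark-submit
option that spark-shell should forward in 1.0.x; treat that as an assumption,
not something I have verified with this particular jar):

bin/spark-shell \
  --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar \
  --driver-class-path /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar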


On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski <ognen.duzlevski@gmail.com> wrote:

>  Nicholas,
>
> Thanks!
>
> How do I make spark assemble against a local version of Hadoop?
>
> I have 2.4.1 running on a test cluster and I did
> "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly" but all it did was pull in
> hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1
> HDFS). I am guessing my local version of Hadoop libraries/jars is not used.
> Alternatively, how do I add the hadoop-gpl-compression-0.1.0.jar
> (responsible for the lzo stuff) to this hand assembled Spark?
>
> I am running the spark-shell like this:
> bin/spark-shell --jars
> /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
>
> and getting this:
>
> scala> val f = sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo
> ",classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
> 14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with
> curMem=0, maxMem=311387750
> 14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 211.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable,
> org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at
> <console>:12
>
> scala> f.take(1)
> 14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
> java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.JobContext, but class was expected
>     at
> com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)
>
> which makes me think something is not linked to something properly (not a
> Java expert unfortunately).
>
> Thanks!
> Ognen
>
>
>
> On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
>
>  If you’re still seeing gibberish, it’s because Spark is not using the
> LZO libraries properly. In your case, I believe you should be calling
> newAPIHadoopFile() instead of textFile().
>
> For example:
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text])
>
> On a side note, here’s a related JIRA issue: SPARK-2394: Make it easier
> to read LZO-compressed files from EC2 clusters
> <https://issues.apache.org/jira/browse/SPARK-2394>
>
> Nick
>
>
> On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlevski@gmail.com> wrote:
>
>> Hello,
>>
>> I have been trying to play with the Google ngram dataset provided by
>> Amazon in form of LZO compressed files.
>>
>> I am having trouble understanding what is going on ;). I have added the
>> compression jar and native library to the underlying Hadoop/HDFS
>> installation, restarted the name node and the datanodes, Spark can
>> obviously see the file but I get gibberish on a read. Any ideas?
>>
>> See output below:
>>
>> 14/07/13 14:39:19 INFO SparkContext: Added JAR
>> file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at
>> http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with
>> timestamp 1405262359777
>> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
>> Spark context available as sc.
>>
>> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
>> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with
>> curMem=0, maxMem=311387750
>> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to
>> memory (estimated size 160.0 KB, free 296.8 MB)
>> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
>> <console>:12
>>
>> scala> f.take(10)
>> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15,
>> took 0.419708348 s
>> res0: Array[String] =
>> Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
>> ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
>> �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
>> �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
>> �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>>
>> Thanks!
>> Ognen
>>
>
>
>

Re: Problem reading in LZO compressed files

Posted by Ognen Duzlevski <og...@gmail.com>.
Nicholas,

Thanks!

How do I make Spark assemble against a local version of Hadoop?

I have Hadoop 2.4.1 running on a test cluster and I ran 
"SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly", but all it did was pull in 
hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1 
HDFS). I am guessing my local Hadoop libraries/jars are not used. 
Alternatively, how do I add the hadoop-gpl-compression-0.1.0.jar 
(responsible for the LZO stuff) to this hand-assembled Spark?

I am running the spark-shell like this:
bin/spark-shell --jars 
/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar

and getting this:

scala> val f = 
sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with 
curMem=0, maxMem=311387750
14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 211.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, 
org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at 
<console>:12

scala> f.take(1)
14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.JobContext, but class was expected
     at 
com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)

which makes me think something is not linked properly (I am not 
a Java expert, unfortunately).

Thanks!
Ognen


On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
>
> If you’re still seeing gibberish, it’s because Spark is not using the 
> LZO libraries properly. In your case, I believe you should be calling 
> newAPIHadoopFile() instead of textFile().
>
> For example:
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>    classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>    classOf[org.apache.hadoop.io.LongWritable],
>    classOf[org.apache.hadoop.io.Text])
>
> On a side note, here’s a related JIRA issue: SPARK-2394: Make it 
> easier to read LZO-compressed files from EC2 clusters 
> <https://issues.apache.org/jira/browse/SPARK-2394>
>
> Nick
>
>
>
> On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski 
> <ognen.duzlevski@gmail.com> wrote:
>
>     Hello,
>
>     I have been trying to play with the Google ngram dataset provided
>     by Amazon in form of LZO compressed files.
>
>     I am having trouble understanding what is going on ;). I have
>     added the compression jar and native library to the underlying
>     Hadoop/HDFS installation, restarted the name node and the
>     datanodes, Spark can obviously see the file but I get gibberish on
>     a read. Any ideas?
>
>     See output below:
>
>     14/07/13 14:39:19 INFO SparkContext: Added JAR
>     file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at
>     http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar
>     with timestamp 1405262359777
>     14/07/13 14:39:20 INFO SparkILoop: Created spark context..
>     Spark context available as sc.
>
>     scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo
>     <http://10.10.0.98:54310/data/1gram.lzo>")
>     14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called
>     with curMem=0, maxMem=311387750
>     14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as
>     values to memory (estimated size 160.0 KB, free 296.8 MB)
>     f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
>     <console>:12
>
>     scala> f.take(10)
>     14/07/13 14:39:43 INFO SparkContext: Job finished: take at
>     <console>:15, took 0.419708348 s
>     res0: Array[String] =
>     Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
>     ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
>     �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �?
>     �? �?
>     �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
>     �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>
>     Thanks!
>     Ognen
>
>


Re: Problem reading in LZO compressed files

Posted by Nicholas Chammas <ni...@gmail.com>.
If you’re still seeing gibberish, it’s because Spark is not using the LZO
libraries properly. In your case, I believe you should be calling
newAPIHadoopFile() instead of textFile().

For example:

sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
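
The call above gives you an RDD of (LongWritable, Text) pairs rather than
strings. A small follow-up sketch, assuming the LZO input format classes are on
the classpath, to pull out readable lines:

val pairs = sc.newAPIHadoopFile(
  "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])

// keys are byte offsets; copy the reused Text values out as Strings right away
val lines = pairs.map { case (_, text) => text.toString }
lines.take(10).foreach(println)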

On a side note, here’s a related JIRA issue: SPARK-2394: Make it easier to
read LZO-compressed files from EC2 clusters
<https://issues.apache.org/jira/browse/SPARK-2394>

Nick


On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlevski@gmail.com> wrote:

> Hello,
>
> I have been trying to play with the Google ngram dataset provided by
> Amazon in form of LZO compressed files.
>
> I am having trouble understanding what is going on ;). I have added the
> compression jar and native library to the underlying Hadoop/HDFS
> installation, restarted the name node and the datanodes, Spark can
> obviously see the file but I get gibberish on a read. Any ideas?
>
> See output below:
>
> 14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/
> lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/
> hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
> Spark context available as sc.
>
> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with
> curMem=0, maxMem=311387750
> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 160.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
> <console>:12
>
> scala> f.take(10)
> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15,
> took 0.419708348 s
> res0: Array[String] = Array(SEQ?!org.apache.hadoop.
> io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.
> compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
> ????????????????????????????????????????????????????????????
> ????????????????????????????????????????????????????????????
> ?????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
> �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
> �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?
> 4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?
> H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?
> \�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?
> p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
> �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>
> Thanks!
> Ognen
>

Problem reading in LZO compressed files

Posted by Ognen Duzlevski <og...@gmail.com>.
Hello,

I have been trying to play with the Google ngram dataset provided by 
Amazon in the form of LZO-compressed files.

I am having trouble understanding what is going on ;). I have added the 
compression jar and native library to the underlying Hadoop/HDFS 
installation and restarted the name node and the datanodes. Spark can 
obviously see the file, but I get gibberish on a read. Any ideas?

See output below:

14/07/13 14:39:19 INFO SparkContext: Added JAR 
file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at 
http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with 
timestamp 1405262359777
14/07/13 14:39:20 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with 
curMem=0, maxMem=311387750
14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 160.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at 
<console>:12

scala> f.take(10)
14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15, 
took 0.419708348 s
res0: Array[String] = 
Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�?????? 
?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�? 
�?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �? 
�?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?�?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...

Thanks!
Ognen