Posted to issues@spark.apache.org by "Igor Berman (JIRA)" <ji...@apache.org> on 2015/06/05 15:22:00 UTC

[jira] [Commented] (SPARK-1018) take and collect don't work on HadoopRDD

    [ https://issues.apache.org/jira/browse/SPARK-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574472#comment-14574472 ] 

Igor Berman commented on SPARK-1018:
------------------------------------

Hi Patrick, 
We spent some time understanding why data was "corrupted" when working with Avro objects that weren't copied at our layer. Yes, I've seen the note above newAPIHadoopFile, but the note doesn't describe the consequences of not copying the objects, at least not for people who come to Spark without deep knowledge of Hadoop input formats (and I think there are plenty of those).

Do you think that even a read-Avro -> transform -> write-Avro chain can be corrupted by not copying the Avro objects at the start?
For example, we saw several objects holding the same data when they shouldn't have; this was solved by deep-copying the whole Avro object.
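
For illustration, a minimal sketch of the kind of copying I mean, assuming Avro GenericRecords read through AvroKeyInputFormat (the path and record type here are just examples, not our actual job):

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// The RecordReader reuses one AvroKey instance for every record, so copy each
// record before anything that holds on to it (cache, collect, groupBy, ...).
val records = sc.newAPIHadoopFile(
    "/path/to/data.avro",                       // illustrative path
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  .map { case (key, _) =>
    val rec = key.datum()
    // deepCopy detaches the record from the reader's reused buffer
    GenericData.get().deepCopy(rec.getSchema, rec)
  }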

If I may suggest it, it would be nice to have a section about working with HadoopRDD or NewHadoopRDD that advises on best practices and the dos and don'ts of working with Hadoop files.
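
For the plain LongWritable/Text case in the report quoted below, the same reuse problem applies, and the fix is simply to copy into plain Scala types before take or collect; a sketch (path taken from the report):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// hadoopFile hands back the single LongWritable/Text pair that the RecordReader
// keeps rewriting, so copy to primitive/String values before take or collect.
val lines = sc.hadoopFile("/home/training/testdata.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (offset, text) => (offset.get, text.toString) }

lines.take(4)  // now returns the distinct (offset, line) pairs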


> take and collect don't work on HadoopRDD
> ----------------------------------------
>
>                 Key: SPARK-1018
>                 URL: https://issues.apache.org/jira/browse/SPARK-1018
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.8.1
>            Reporter: Diana Carroll
>              Labels: hadoop
>
> I am reading a simple text file using hadoopFile as follows:
> var hrdd1 = sc.hadoopFile("/home/training/testdata.txt",classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
> Testing using this simple text file:
> 001 this is line 1
> 002 this is line two
> 003 yet another line
> the data read is correct, as I can tell using println 
> scala> hrdd1.foreach(println):
> (0,001 this is line 1)
> (19,002 this is line two)
> (40,003 yet another line)
> But neither collect nor take works properly.  take repeatedly prints out the key (byte offset) of the last (non-existent) line:
> scala> hrdd1.take(4):
> res146: Array[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = Array((61,), (61,), (61,))
> Collect is even worse: it complains:
> java.io.NotSerializableException: org.apache.hadoop.io.LongWritable at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
> The problem appears to be the LongWritable in both cases, because if I map to a new RDD, converting the values from Text objects to strings, it works:
> scala> hrdd1.map(pair => (pair._1.toString,pair._2.toString)).take(4)
> res148: Array[(java.lang.String, java.lang.String)] = Array((0,001 this is line 1), (19,002 this is line two), (40,003 yet another line))
> It seems to me that either rdd.collect and rdd.take ought to handle non-serializable types gracefully, or hadoopFile should return a mapped RDD that converts the Hadoop types into the appropriate serializable Java objects.  (Or at the very least the docs for the API should indicate that the usual RDD methods don't work on HadoopRDDs.)
> BTW, this behavior is the same for both the old and new API versions of hadoopFile.  It is also the same whether the file is from HDFS or a plain old text file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org