Posted to issues@spark.apache.org by "Liquan Pei (JIRA)" <ji...@apache.org> on 2014/10/07 21:33:33 UTC

[jira] [Comment Edited] (SPARK-3828) Spark returns inconsistent results when built with different Hadoop versions

    [ https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162374#comment-14162374 ] 

Liquan Pei edited comment on SPARK-3828 at 10/7/14 7:33 PM:
------------------------------------------------------------

It seems that this is a bug in LineRecordReader. For Spark built with Hadoop 1.0.4, running
{code}
sc.textFile("text8").map(_.size).collect()
{code}
it returns
{code}
Array[Int] = Array(100000000)
{code} 
which is consistent with the text8 file size. However, for Spark built with Hadoop 2.4.0, the same code returns
{code}
Array[Int] = Array(100000000, 32891136)
{code}
Note that the second entry in the Hadoop 2.4.0 result equals the size of the second partition of text8, which means the first record of that partition is not correctly skipped.
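
A quick way to confirm where the extra record comes from (a minimal sketch for the Spark shell; the text8 path is taken from above, and the partition layout is assumed from the result, not verified):
{code}
// Report each partition's total record size, to see which split emits
// the record that should have been skipped.
sc.textFile("text8")
  .mapPartitionsWithIndex { (idx, iter) =>
    Iterator((idx, iter.map(_.length.toLong).sum))
  }
  .collect()
// On a Hadoop 2.4.0 build this should attribute the spurious
// 32891136-character record to the later partition.
{code}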

Will try to reproduce it with Hadoop.   
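
Something like the following should exercise LineRecordReader directly on a split that starts mid-line (a rough sketch, not a confirmed reproduction; assumes hadoop-client 2.4.0 on the classpath and text8 on the local filesystem):
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, JobConf, LineRecordReader}

val conf = new JobConf()
val path = new Path("text8")
val len  = path.getFileSystem(conf).getFileStatus(path).getLen

// A split that starts mid-line: a correct reader skips ahead to the next
// newline, and since text8 is a single line it should emit no records.
val split  = new FileSplit(path, len / 2, len - len / 2, Array.empty[String])
val reader = new LineRecordReader(conf, split)
val (key, value) = (reader.createKey(), reader.createValue())
var records = 0
while (reader.next(key, value)) records += 1
reader.close()
println(s"records in second split: $records") // expect 0; the bug would yield 1
{code}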


> Spark returns inconsistent results when built with different Hadoop versions
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-3828
>                 URL: https://issues.apache.org/jira/browse/SPARK-3828
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>         Environment: OSX 10.9, Spark master branch
>            Reporter: Liquan Pei
>
> To reproduce, download the text8 data from http://mattmahoney.net/dc/text8.zip and unzip it first.
> Spark built with different Hadoop versions returns different results.
> {code}
> val data = sc.textFile("text8")
> data.count()
> {code}
> returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built with SPARK_HADOOP_VERSION=2.4.0.
> Looking through the RDD code, it seems that textFile uses hadoopFile, which creates a HadoopRDD; we should probably create a NewHadoopRDD when building Spark with SPARK_HADOOP_VERSION >= 2.0.0.
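>
> One way to cross-check this (a minimal sketch for the Spark shell; the "text8" path is taken from above, and this is an assumption to test, not a confirmed fix):
> {code}
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>
> // Old (mapred) API: this is what sc.textFile uses internally.
> sc.textFile("text8").count()
>
> // New (mapreduce) API: if the bug lives in the old LineRecordReader,
> // this count may come out differently.
> sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("text8")
>   .map(_._2.toString)
>   .count()
> {code}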


