
Posted to issues@spark.apache.org by "Shivaram Venkataraman (JIRA)" <ji...@apache.org> on 2016/10/03 18:04:21 UTC

[jira] [Commented] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

    [ https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543008#comment-15543008 ] 

Shivaram Venkataraman commented on SPARK-17752:
-----------------------------------------------

Thanks for the bug report. I can't reproduce this on a build from the master branch or on 2.0.1 RC4, which just passed the vote, but I am not sure which change actually fixed it.

It would be great if you could also verify whether 2.0.1 fixes your problem; if so, we can mark this issue as resolved.

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17752
>                 URL: https://issues.apache.org/jira/browse/SPARK-17752
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kevin Ushey
>            Priority: Critical
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, the underlying error seems more likely to be in Spark's Java / Scala sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)


