Posted to issues@spark.apache.org by "Kevin Ushey (JIRA)" <ji...@apache.org> on 2016/09/30 22:17:20 UTC

[jira] [Updated] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

     [ https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Ushey updated SPARK-17752:
--------------------------------
    Description: 
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):

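```
# Point SPARK_HOME at a Spark 2.0.0 installation.
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

# Load SparkR from that installation and start a local session.
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

# Build a local data frame with one row and 1000 integer columns, X1..X1000.
n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

# Write the same data to a CSV file (the file is not read again below).
path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)

# Round-trip the data through a cached Spark DataFrame.
tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE
```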
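
To see which columns diverge, rather than only that the two data frames differ, a column-wise comparison can help. The following is a minimal sketch in base R: `mismatched_columns` is a hypothetical helper name (it is not part of SparkR), and the two small data frames are stand-ins for `df` and `cl`.

```r
# Hypothetical helper (not part of SparkR): return the names of the columns
# whose contents differ between two data frames with the same column names.
mismatched_columns <- function(expected, actual) {
  stopifnot(identical(names(expected), names(actual)))
  # Compare the data frames column by column.
  same <- mapply(identical, expected, actual)
  names(expected)[!same]
}

# Stand-in data: `actual` differs from `expected` only in column X2.
expected <- data.frame(X1 = 1L, X2 = 1L, X3 = 1L)
actual   <- data.frame(X1 = 1L, X2 = 0L, X3 = 1L)
print(mismatched_columns(expected, actual))
```

With the reproduction above, `mismatched_columns(df, cl)` would list the affected columns, assuming the column names themselves survive the round trip.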

Although the reproduction uses SparkR, the error more likely lies in the Java / Scala Spark sources.
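
One way to narrow the problem down might be to find the smallest column count at which the round trip stops matching. The sketch below assumes nothing about where Spark actually fails: `first_failing_width`, `roundtrip`, and `fake_roundtrip` are hypothetical names, and the stand-in round trip simulates a failure above an arbitrary width purely for illustration.

```r
# Hypothetical helper: try each column count in `widths` and return the first
# one at which roundtrip(df) is no longer identical to df (NA if none fails).
first_failing_width <- function(roundtrip, widths) {
  for (n in widths) {
    # Same construction as the reproduction: one row, n integer columns.
    df <- as.data.frame(replicate(n, 1L, FALSE))
    names(df) <- paste("X", seq_len(n), sep = "")
    if (!identical(df, roundtrip(df))) return(n)
  }
  NA_integer_
}

# Stand-in for the Spark round trip, for illustration only: it corrupts the
# last column once there are more than 200 columns (an arbitrary number).
fake_roundtrip <- function(df) {
  if (ncol(df) > 200) df[[ncol(df)]] <- 0L
  df
}

print(first_failing_width(fake_roundtrip, c(10, 100, 500, 1000)))
```

With a running SparkR session, `roundtrip` could instead be `function(d) { t <- as.DataFrame(d); cache(t); collect(t) }`, which reuses only the calls from the reproduction.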

For posterity:

> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)



> Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17752
>                 URL: https://issues.apache.org/jira/browse/SPARK-17752
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kevin Ushey
>            Priority: Critical


