Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/08/29 14:59:20 UTC

[jira] [Resolved] (SPARK-17294) Caching invalidates data on mildly wide dataframes

     [ https://issues.apache.org/jira/browse/SPARK-17294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-17294.
-------------------------------
    Resolution: Duplicate

Duplicate #5, popular issue

> Caching invalidates data on mildly wide dataframes
> --------------------------------------------------
>
>                 Key: SPARK-17294
>                 URL: https://issues.apache.org/jira/browse/SPARK-17294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Kalle Jepsen
>
> Caching a dataframe with more than 200 columns causes the data within it to simply vanish under certain circumstances.
> Consider the following code, where we create a one-row dataframe whose 201 columns contain the numbers 0 through 200.
> {code}
> from pyspark.sql import functions as F
>
> n_cols = 201
> rng = range(n_cols)
> # One row whose 201 columns hold the values 0..200
> df = spark.createDataFrame(
>     data=[list(rng)]  # list() keeps this working under Python 3
> )
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> +--------+
> |greatest|
> +--------+
> |     200|
> +--------+
> {noformat}
> As expected, column {{_201}} contains the number 200, and the greatest value within that single row is 200.
> Now if we introduce a {{.cache()}} on {{df}}:
> {code}
> n_cols = 201
> rng = range(n_cols)
> # Identical to the code above, except for the added .cache()
> df = spark.createDataFrame(
>     data=[list(rng)]
> ).cache()
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> +--------+
> |greatest|
> +--------+
> |       0|
> +--------+
> {noformat}
> The last column {{_201}} still seems to contain the correct value, but when I try to select the greatest value within the row, 0 is returned. When I issue {{.show()}} on the dataframe, all values are displayed as zero. As soon as I limit the selection to fewer than 200 columns, everything looks fine again.
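> That last observation can be checked directly on the cached {{df}} from above (a minimal sketch; the 199-column cutoff is just an illustrative value below the apparent threshold):
> {code}
> # Selecting fewer than 200 columns from the cached df reportedly
> # returns correct values again; the greatest of columns 0..198 is 198.
> df.select(F.greatest(*df.columns[:199]).alias('greatest')).show()
> {code}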
> When the dataframe has fewer than 200 columns from the beginning, even the cache does not break anything and everything works as expected.
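> A small sweep makes the threshold visible (a sketch assuming the {{spark}} session and the {{F}} import from above; {{cache_is_correct}} is a hypothetical helper name):
> {code}
> def cache_is_correct(n_cols):
>     # Cache a one-row dataframe with n_cols columns holding 0..n_cols-1
>     # and check whether the greatest cached value is still n_cols - 1.
>     d = spark.createDataFrame(data=[list(range(n_cols))]).cache()
>     got = d.select(F.greatest(*d.columns).alias('g')).collect()[0]['g']
>     d.unpersist()
>     return got == n_cols - 1
>
> for n in (199, 200, 201, 202):
>     print(n, cache_is_correct(n))
> {code}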
> It doesn't matter whether the data is loaded from disk or created on the fly, and this happens in both Spark 1.6.2 and 2.0.0 (I haven't tested other versions).
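> The on-disk case can be sketched the same way (the Parquet path here is hypothetical; any writable location works):
> {code}
> path = '/tmp/spark17294_wide.parquet'  # hypothetical scratch path
> spark.createDataFrame(data=[list(range(201))]).write.mode('overwrite').parquet(path)
> df_disk = spark.read.parquet(path).cache()
> # Reportedly shows 0 instead of 200, just like the in-memory case
> df_disk.select(F.greatest(*df_disk.columns).alias('greatest')).show()
> {code}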
> Can anyone confirm this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org