Posted to issues@spark.apache.org by "Shea Parkes (JIRA)" <ji...@apache.org> on 2016/08/24 15:42:20 UTC

[jira] [Updated] (SPARK-17218) Caching a DataFrame with >200 columns ~nulls the contents

     [ https://issues.apache.org/jira/browse/SPARK-17218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shea Parkes updated SPARK-17218:
--------------------------------
    Environment: 
Microsoft Windows 10
Python v3.5.x
Standalone Spark Cluster

  was:
Microsoft Windows 10
Python v3.5.x


> Caching a DataFrame with >200 columns ~nulls the contents
> ---------------------------------------------------------
>
>                 Key: SPARK-17218
>                 URL: https://issues.apache.org/jira/browse/SPARK-17218
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.2
>         Environment: Microsoft Windows 10
> Python v3.5.x
> Standalone Spark Cluster
>            Reporter: Shea Parkes
>
> Caching a DataFrame with >200 columns causes the contents to be ~nulled.  This is quite a painful bug for us; it recently forced us to put all sorts of band-aid workarounds into our production work.
> Minimally reproducible example:
> {code:python}
> from pyspark.sql import SQLContext
> import tempfile
> sqlContext = SQLContext(sc)
> path_fail_parquet = tempfile.mkdtemp() + '/fail_parquet.parquet'
> list_df_varnames = []
> list_df_values = []
> for i in range(210):
>     list_df_varnames.append('var'+str(i))
>     list_df_values.append(str(i))
> test_df = sqlContext.createDataFrame([list_df_values], list_df_varnames)
> test_df.show() # Still looks okay
> print(test_df.collect()) # Still looks okay
> test_df.cache() # When everything goes awry
> test_df.show() # All values have been ~nulled
> print(test_df.collect()) # Still looks okay
> # Serialize and read back from parquet now
> test_df.write.parquet(path_fail_parquet)
> loaded_df = sqlContext.read.parquet(path_fail_parquet)
> loaded_df.show() # All values have been ~nulled
> print(loaded_df.collect()) # All values have been ~nulled
> {code}
> As shown in the example above, the underlying RDD seems to survive the caching, but as soon as we serialize to Parquet, the data corruption becomes complete.
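> A quick way to check that (a minimal sketch, assuming the repro above was just run in the same shell, so test_df and list_df_values are still in scope):
> {code:python}
> # Compare the driver-side values against the generated inputs.
> # Per the collect() output above, this still matches even though show() prints ~nulled rows.
> print(list(test_df.collect()[0]) == list_df_values)
> # The RDD underneath the cached DataFrame also appears to survive.
> print(list(test_df.rdd.first()) == list_df_values)
> {code}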
> This is occurring on Windows 10 with Python 3.5.x.  We're running a Spark Standalone cluster.  Everything works fine with <200 columns/fields.  We have Kryo serialization turned on at the moment, but the same error manifested when we turned it off.
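> For completeness, the Kryo toggle above is just the standard spark.serializer setting; a rough sketch of how we flip it when testing (the app name is illustrative and the exact conf for our cluster may differ):
> {code:python}
> from pyspark import SparkConf, SparkContext
>
> conf = SparkConf().setAppName('spark-17218-repro')
> # Enable Kryo; remove this line to fall back to the default JavaSerializer.
> conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
> sc = SparkContext(conf=conf)
> {code}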
> I will try to get this tested on Spark 2.0.0 in the near future, but I generally steer clear of x.0.0 releases as best I can.
> I tried to search for another issue related to this and came up with nothing.  My apologies if I missed it; there doesn't seem to be a good combination of keywords to describe this glitch.
> Happy to provide more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
