Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/08/28 15:40:00 UTC

[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

    [ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143913#comment-16143913 ] 

Sean Owen commented on SPARK-21851:
-----------------------------------

Likely the same as https://issues.apache.org/jira/browse/SPARK-16664 -- please test against a later version.
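
For reference, a minimal (untested) sketch of how the reporter's script could be re-run against a newer, locally installed PySpark (e.g. via pip) instead of the 2.0.0 cluster build; the parquet round-trip from the report is omitted here for brevity:

{code}
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Hypothetical local session on a newer Spark, instead of the 2.0.0 YARN cluster build
spark = SparkSession.builder.master("local[4]").appName("SPARK-21851-repro").getOrCreate()

num_rows = 200
num_cols = 204
df1 = spark.range(num_rows, numPartitions=100)
for i in range(num_cols - 1):
    df1 = df1.withColumn("a" + str(i), F.lit("a"))
df2 = spark.range(num_rows, numPartitions=100)

df3 = df1.join(df2, "id", how="left").cache()
df4 = df3.filter("id<10")
# All three printed numbers should be 10 on a fixed version
print(len(df4.columns), df4.count(), df4.cache().count())
{code}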

> Spark 2.0 data corruption with cache and 200 columns
> ----------------------------------------------------
>
>                 Key: SPARK-21851
>                 URL: https://issues.apache.org/jira/browse/SPARK-21851
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
>     # create data frame with id and some dummy cols
>     df1=spark.range(num_rows, numPartitions=100)
>     for i in range(num_cols-1):
>         df1=df1.withColumn("a"+str(i), F.lit("a"))
>     # create data frame with id to join
>     df2=spark.range(num_rows, numPartitions=100)
>     # write and read to start "fresh"
>     df1.write.parquet("delme_1.parquet", mode="overwrite")
>     df2.write.parquet("delme_2.parquet", mode="overwrite")
>     df1=spark.read.parquet("delme_1.parquet")
>     df2=spark.read.parquet("delme_2.parquet")
>     df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
>     df4=df3.filter("id<10")
>     print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result) more often than shown. The last number may vary between runs, but 12 and 16 are common. Sometimes a slightly higher num_rows is needed to trigger this behaviour.
> The Spark version is 2.0.0.2.5.0.0-1245 on a Red Hat system on a multi-node YARN cluster.
> I am happy to provide more information if you let me know what would be helpful.
> It is not strictly `cache` that is the problem, since `toPandas` and `collect` show the same behaviour, so I can hardly get at the data at all.
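> For illustration, a minimal variant of the same check using `collect` instead of the second `cache` (hypothetical; df1 and df2 as read back from the parquet files above):
> {code}
> df3 = df1.join(df2, "id", how="left").cache()
> df4 = df3.filter("id<10")
> rows = df4.collect()   # per the above, collect returns the same inflated number of rows
> print(len(df4.columns), df4.count(), len(rows))
> {code}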



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org