Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/08/21 10:08:00 UTC

[jira] [Resolved] (SPARK-39833) Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true

     [ https://issues.apache.org/jira/browse/SPARK-39833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39833.
----------------------------------
    Fix Version/s: 3.3.1
                   3.2.3
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 37419
[https://github.com/apache/spark/pull/37419]

> Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39833
>                 URL: https://issues.apache.org/jira/browse/SPARK-39833
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.0
>            Reporter: Michael Allman
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> One of our data scientists discovered a problem wherein a data frame's `.show()` call printed non-empty results but its `.count()` printed 0. I've narrowed the issue down to a small, reproducible test case that exhibits this aberrant behavior. In PySpark, run the following code:
> {code:python}
> # Create a single-row data frame whose column name is upper case: COL0.
> from pyspark.sql.types import StructType, StructField, IntegerType
> parquet_pushdown_bug_df = spark.createDataFrame([{"COL0": int(0)}], schema=StructType(fields=[StructField("COL0", IntegerType(), True)]))
> # Write it under a partition directory whose column name is lower case: col0=0.
> parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")
> # Re-read the partitioned data and filter on the (lower-case) partition column.
> reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
> {code}
> In my usage, this prints a data frame with 1 row and a count of 0. However, disabling `spark.sql.parquet.filterPushdown` produces consistent results:
> {code:python}
> # Disable parquet filter pushdown and re-run the same filter.
> spark.conf.set("spark.sql.parquet.filterPushdown", False)
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
> {code}
> This will print the same data frame; however, it will now print a count of 1. The key to triggering this bug is not just enabling `spark.sql.parquet.filterPushdown` (which is enabled by default): the case of the column in the data frame (before writing) must also differ from the case of the partition column in the file path, e.g. COL0 in the data frame versus col0 in the path, or vice versa.
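>
> A minimal sketch of that case contrast, assuming only an active `spark` session (the paths and names below are illustrative, not taken from the report): the first write puts the data frame's upper-case COL0 under a lower-case col0 partition directory, which is the layout that reproduces the mismatch, while the second uses matching case and, per the description above, should give consistent results.
> {code:python}
> from pyspark.sql.types import StructType, StructField, IntegerType
>
> df = spark.createDataFrame([{"COL0": 0}],
>                            schema=StructType([StructField("COL0", IntegerType(), True)]))
>
> # Mismatched case: data column COL0, partition directory col0=0 -> reproduces the inconsistency.
> df.repartition(1).write.mode("overwrite").parquet("pushdown_mismatch/col0=0/data.parquet")
>
> # Matching case: data column COL0, partition directory COL0=0 -> expected to stay consistent.
> df.repartition(1).write.mode("overwrite").parquet("pushdown_match/COL0=0/data.parquet")
>
> for path in ["pushdown_mismatch", "pushdown_match"]:
>     filtered = spark.read.parquet(path).filter("col0 = 0")
>     filtered.show()
>     print(path, filtered.count())
> {code}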


