Posted to issues@spark.apache.org by "Benoit Roy (Jira)" <ji...@apache.org> on 2022/07/27 20:26:00 UTC

[jira] [Updated] (SPARK-39900) Incorrect result when querying dataframe produced by 'binaryFile' format

     [ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Roy updated SPARK-39900:
-------------------------------
    Description: 
When creating a dataframe using the 'binaryFile' format, I am encountering incorrect results when filtering/querying with the 'not' operator.
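
For context, here is a minimal sketch of the kind of dataframe involved (the binaryFile source produces a fixed set of columns, and the filters in question are on the string-typed path column):
{code:java}
// Minimal sketch: load a directory with the 'binaryFile' data source.
// The resulting dataframe has the fixed columns path, modificationTime,
// length and content; the predicates below filter on 'path'.
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.printSchema()
{code}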

 

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]
{code:java}
git@github.com:cccs-br/spark-binaryfile-issue.git {code}
 

Here's a very simple test case that illustrates what's going on:

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
{code:java}
  test("binary file dataframe") {
    // Load files directly into a DataFrame using the 'binaryFile' format.
    val df = spark
      .read
      .format("binaryFile")
      .load("src/test/resources/files")

    df.createOrReplaceTempView("files")

    // This works as expected.
    val like_count = spark.sql("select * from files where path like '%.csv'").count()
    assert(like_count === 1)

    // This does not work as expected.
    val not_like_count = spark.sql("select * from files where path not like '%.csv'").count()
    assert(not_like_count === 2)

    // This used to work in 3.2.1:
    // df.filter(col("path").endsWith(".csv") === false).show()
  }
{code}
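
For reference, the same negated predicate can also be written with the DataFrame API instead of SQL; a rough sketch of the equivalent check (based on the commented-out line above, not part of the test suite):
{code:java}
// Rough equivalent of the failing SQL, expressed through the DataFrame API.
// Mirrors the commented-out line above; col/not come from
// org.apache.spark.sql.functions.
import org.apache.spark.sql.functions.{col, not}

val notCsvCount = df.filter(not(col("path").endsWith(".csv"))).count()
// With one .csv file and two other files this should be 2,
// matching the 'not like' assertion above.
{code}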

> Incorrect result when querying dataframe produced by 'binaryFile' format
> ------------------------------------------------------------------------
>
>                 Key: SPARK-39900
>                 URL: https://issues.apache.org/jira/browse/SPARK-39900
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.0
>            Reporter: Benoit Roy
>            Priority: Minor
>


