Posted to issues@spark.apache.org by "Dmitry Gorbatsevich (Jira)" <ji...@apache.org> on 2022/05/20 15:10:00 UTC
[jira] [Updated] (SPARK-39241) Spark SQL 'Like' operator behaves wrongly while filtering on partitioned column after Spark 3.1

     [ https://issues.apache.org/jira/browse/SPARK-39241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Gorbatsevich updated SPARK-39241:
----------------------------------------
    Description: 
It seems that the introduction of "like any" in Spark 3.1 broke the behaviour of "like" when filtering on a partition column. Here is an example:

1. Create test table:
{code:java}
scala> spark.sql(
     | """
     | CREATE EXTERNAL TABLE tmp(
     |         f1 STRING
     |     )
     |     PARTITIONED BY (dt STRING)
     |     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     |     LINES TERMINATED BY '\n'
     |     STORED AS TEXTFILE
     |     LOCATION 's3://vlg-data-us-east-1/tmp/tmp/';
     | """) 
res2: org.apache.spark.sql.DataFrame = []{code}
2. Insert a row into it:
{code:java}
scala> spark.sql(
     | """
     |     insert into table tmp partition(dt="2022051000") values("1")
     | """
     | )
res3: org.apache.spark.sql.DataFrame = [] {code}
3. Run a select using 'like':
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like '202205100%'
     |     """
     |     ).show()
+---+---+
| f1| dt|
+---+---+
+---+---+ {code}
4. Run a select using 'like any':
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like any ('202205100%')
     |     """
     |     ).show()
22/05/20 14:50:26 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
+---+----------+
| f1|        dt|
+---+----------+
|  1|2022051000|
+---+----------+ {code}
The expectation is that results 3 and 4 are identical. They are not: the row in partition dt=2022051000 matches the pattern '202205100%', so the empty result in step 3 is wrong.
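
For reference, the expected match can be checked outside Spark. The following is a minimal sketch in plain Scala of standard SQL LIKE semantics ('%' matches any sequence of characters, '_' matches exactly one). The names LikeCheck, likeToRegex and like are hypothetical helpers introduced here for illustration; this is not Spark's implementation, and escape characters are not handled.

```scala
import java.util.regex.Pattern

object LikeCheck {
  // Translate a SQL LIKE pattern into a Java regex (assumption: no escape character).
  // '%' becomes ".*", '_' becomes ".", every other character is matched literally.
  def likeToRegex(pattern: String): String =
    pattern.toList.map {
      case '%' => ".*"
      case '_' => "."
      case c   => Pattern.quote(c.toString)
    }.mkString

  // True when the whole value matches the LIKE pattern.
  def like(value: String, pattern: String): Boolean =
    Pattern.compile(likeToRegex(pattern), Pattern.DOTALL).matcher(value).matches()

  def main(args: Array[String]): Unit = {
    // Partition value and pattern taken from steps 2 and 3 above.
    println(like("2022051000", "202205100%")) // prints "true"
  }
}
```

Under these semantics the partition value from step 2 satisfies the pattern from step 3, which is why the row is expected in both results.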

 

*Environment: EMR*
Release label: emr-6.5.0
Hadoop distribution: Amazon 3.2.1
Applications: {*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
 



> Spark SQL 'Like' operator behaves wrongly while filtering on partitioned column after Spark 3.1
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39241
>                 URL: https://issues.apache.org/jira/browse/SPARK-39241
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>         Environment: *Environment: EMR*
> Release label: emr-6.5.0
> Hadoop distribution: Amazon 3.2.1
> Applications: {*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
>            Reporter: Dmitry Gorbatsevich
>            Priority: Major
>


