Posted to issues@spark.apache.org by "Dmitry Gorbatsevich (Jira)" <ji...@apache.org> on 2022/05/20 15:10:00 UTC
[jira] [Updated] (SPARK-39241) Spark SQL 'Like' operator behaves wrongly while filtering on partitioned column after Spark 3.1
[ https://issues.apache.org/jira/browse/SPARK-39241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Gorbatsevich updated SPARK-39241:
----------------------------------------
Description:
It seems that the introduction of "like any" in Spark 3.1 breaks "like" behaviour when filtering on a partition column. Here is an example:
1. Create test table:
{code:java}
scala> spark.sql(
| """
| CREATE EXTERNAL TABLE tmp(
| f1 STRING
| )
| PARTITIONED BY (dt STRING)
| ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
| LINES TERMINATED BY '\n'
| STORED AS TEXTFILE
| LOCATION 's3://vlg-data-us-east-1/tmp/tmp/';
| """)
res2: org.apache.spark.sql.DataFrame = []{code}
2. Insert a row into it:
{code:java}
scala> spark.sql(
| """
| insert into table tmp partition(dt="2022051000") values("1")
| """
| )
res3: org.apache.spark.sql.DataFrame = [] {code}
3. Run a select using 'like':
{code:java}
scala> spark.sql(
| """
| select * from tmp
| where dt like '202205100%'
| """
| ).show()
+---+---+
| f1| dt|
+---+---+
+---+---+ {code}
4. Run a select using 'like any':
{code:java}
scala> spark.sql(
| """
| select * from tmp
| where dt like any ('202205100%')
| """
| ).show()
22/05/20 14:50:26 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
+---+----------+
| f1| dt|
+---+----------+
| 1|2022051000|
+---+----------+ {code}
The expectation is that results 3 and 4 are identical. They are not: query 3 returns no rows, which is wrong, since the partition value dt = '2022051000' matches the pattern '202205100%'.
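To make the expected result concrete, here is a minimal sketch in plain Scala (no Spark involved) of the usual LIKE semantics, where % matches any run of characters and _ matches exactly one. The object LikeCheck and its helpers are hypothetical names for illustration only, not Spark's implementation, and escape characters are not handled.

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration; this is NOT how Spark evaluates LIKE.
object LikeCheck {
  // Translate a SQL LIKE pattern into a Java regex:
  // '%' becomes ".*", '_' becomes ".", every other character is matched literally.
  // Assumption: no escape character support in this sketch.
  def likeToRegex(pattern: String): String =
    pattern.toList.map {
      case '%' => ".*"
      case '_' => "."
      case c   => Pattern.quote(c.toString)
    }.mkString

  // True when the whole value matches the LIKE pattern.
  def like(value: String, pattern: String): Boolean =
    Pattern.compile(likeToRegex(pattern), Pattern.DOTALL).matcher(value).matches()

  def main(args: Array[String]): Unit = {
    val dt = "2022051000" // the partition value inserted in step 2
    println(like(dt, "202205100%")) // the predicate used in step 3
  }
}
```

Under these semantics the partition value from step 2 satisfies the predicate from step 3, so the inserted row should appear in both results.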
*Environment: EMR*
Release label: emr-6.5.0
Hadoop distribution: Amazon 3.2.1
Applications: *Spark 3.1.2*, Hive 3.1.2, Livy 0.7.1
> Spark SQL 'Like' operator behaves wrongly while filtering on partitioned column after Spark 3.1
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-39241
> URL: https://issues.apache.org/jira/browse/SPARK-39241
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Environment: *Environment: EMR*
> Release label: emr-6.5.0
> Hadoop distribution: Amazon 3.2.1
> Applications: *Spark 3.1.2*, Hive 3.1.2, Livy 0.7.1
> Reporter: Dmitry Gorbatsevich
> Priority: Major