Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/03/10 02:52:00 UTC

[jira] [Updated] (SPARK-38454) Partition Data Type Prevents Filtering Sporadically

     [ https://issues.apache.org/jira/browse/SPARK-38454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-38454:
---------------------------------
    Priority: Major  (was: Critical)

> Partition Data Type Prevents Filtering Sporadically
> ---------------------------------------------------
>
>                 Key: SPARK-38454
>                 URL: https://issues.apache.org/jira/browse/SPARK-38454
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Christopher
>            Priority: Major
>
> A pipeline (an Airflow DAG) that had been running successfully in +production+ for 72+ hours has started failing with the same error on two different queries, the only difference between them being the table queried. We believe the root of the error is
> {quote}Caused by: MetaException(message:Filtering is supported only on partition keys of type string){quote}
>  
> We've seen this error resolve itself on task retry attempts, but the latest occurrence was not resolved on retries, and all subsequent Airflow DAG runs failed. The queries that trigger this error are
> {quote}select * from db.cleansed_layer_table where (`dataset`='20220305185000_4d' AND `date_partition`=CAST('2022-03-05' as DATE));
> select * from db.raw_layer_table  where (`date_partition`=CAST('2022-03-05' as DATE) AND `dataset`='20220305185000_4d')
> {quote}
>  
> The date_partition field was a DATE type when this error started occurring. The task writes and queries the raw layer before the cleansed layer is written and queried.
>  
> The first task failure was caused by the cleansed layer query, and the subsequent ones all failed on the raw layer query. The inconsistent behavior of the pipeline is our highest concern: this pipeline had 35 successful DAG runs in Airflow before the failures began.
>  
> The error suggests
> {quote}{{You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem}}
> {quote}
> which resulted in too large a performance hit to keep in place.
>  
> We've changed the field to a STRING in our +development+ environment and have had 78 consecutive successful task runs. We've paused that test for now in favor of filtering only on dataset, which we just started running.
>  
> Is our assessment that we will experience higher reliability by changing the data type of date_partition to STRING reasonable?
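For context on the STRING mitigation described above: the Hive metastore can only push partition-predicate filtering down for string-typed partition keys, so storing date_partition as a 'yyyy-MM-dd' STRING turns the filter into a plain string equality the metastore accepts. A minimal sketch of building the equivalent string-typed predicate (the helper name is hypothetical, not from the reporter's pipeline):

```python
from datetime import date

def string_partition_predicate(dataset: str, d: date) -> str:
    """Build a WHERE clause against a STRING-typed date_partition.

    With date_partition stored as a 'yyyy-MM-dd' string, the comparison
    is a plain string equality, which the Hive metastore can filter on
    (it supports partition-predicate pushdown only for string keys).
    """
    return (
        f"`dataset` = '{dataset}' "
        f"AND `date_partition` = '{d.isoformat()}'"
    )

print(string_partition_predicate("20220305185000_4d", date(2022, 3, 5)))
# `dataset` = '20220305185000_4d' AND `date_partition` = '2022-03-05'
```

The same predicate string could then be interpolated into the `spark.sql(...)` queries quoted above in place of the `CAST('2022-03-05' as DATE)` comparison.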



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org