You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Eyal Farago (JIRA)" <ji...@apache.org> on 2018/11/03 07:50:00 UTC

[jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned

    [ https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673957#comment-16673957 ] 

Eyal Farago commented on SPARK-25548:
-------------------------------------

[~eaton], I think there are two possible approaches to handle this:

first would be extracting the partitions predicate and _And_ing it with the original predicate:
{code:java}
select * from src_par where 
(P_d in (2,3)) and 
((p_d=2 and key=2) or (p_d=3 and key=3))
{code}
second approach would be transforming this into a union:
{code:java}
select * from src_par where (p_d=2 and key=2) 
UNION ALL
select * from src_par where (p_d=3 and key=3)
{code}
I think second approach is easier to implement but it'd require additional rules to make sure partitioned are not scanned multiple times, ie. consider what'd happen if your predicate looked like this:
{code:java}
 (p_d=2 and key=2) or (p_d=3 and key=3) or (p_d=2 and key=33)
{code}
a naive approach would scan partition #2 twice while it's pretty obvious this can be avoided by _OR_ing the first and third conditions.

The first approach seems a bit more complicated by I think it somewhat resembles what you've started implementing in your pr, [~cloud_fan] your thoughts?

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25548
>                 URL: https://issues.apache.org/jira/browse/SPARK-25548
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: eaton
>            Assignee: Apache Spark
>            Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be pruned if we use partition filter and non partition filter together, for example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as value")
> The sql below will scan all the partition files, in which, the partition **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org