Posted to dev@spark.apache.org by Guy Khazma <Gu...@ibm.com> on 2019/12/30 14:41:08 UTC

[Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning

Hi,

It seems that Hive-style partition pruning is not working for DataSourceV2
file-based data sources such as Parquet and ORC.
This causes serious performance degradation for non-Hive tables.

The reason is that the FileScan
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala>
abstract class is not aware of the partition and data filters.
The method that computes selectedPartitions calls the FileIndex listFiles
method with an empty sequence for both - see here
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L74>.

In the v1 data source, the FileSourceScanExec
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160>
class receives the partition and data filters and uses them to prune unnecessary
partitions by passing them to the listFiles function - see here
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210>.
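
As a rough illustration of the difference between the two code paths, here is
a toy Python model. It is not Spark code: PartitionDir, parse_partition_path
and list_files are hypothetical names invented for this sketch, and the
directory layout is made up. It only assumes that listing with an empty filter
sequence returns every partition directory, while listing with a partition
predicate returns the matching directories only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Toy model only: these names are hypothetical and are not Spark APIs.
PartitionValues = Dict[str, str]
PartitionFilter = Callable[[PartitionValues], bool]


@dataclass
class PartitionDir:
    path: str
    values: PartitionValues


def parse_partition_path(path: str) -> PartitionValues:
    """Extract Hive-style key=value segments from a directory path."""
    values: PartitionValues = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def list_files(
    dirs: Sequence[PartitionDir],
    partition_filters: Sequence[PartitionFilter],
) -> List[PartitionDir]:
    """Keep the directories whose partition values satisfy every filter.

    An empty filter sequence keeps every directory, i.e. nothing is pruned.
    """
    return [d for d in dirs if all(f(d.values) for f in partition_filters)]


# Made-up Hive-style layout for the example.
paths = [
    "/table/year=2018/month=12",
    "/table/year=2019/month=11",
    "/table/year=2019/month=12",
]
dirs = [PartitionDir(p, parse_partition_path(p)) for p in paths]

# Analogue of the v2 behaviour described above: no filters reach the listing.
unpruned = list_files(dirs, [])

# Analogue of the v1 behaviour: the partition predicate is passed down.
pruned = list_files(dirs, [lambda v: v["year"] == "2019"])

print(len(unpruned), len(pruned))
```

In this model the unfiltered call returns all three directories, whereas the
filtered call returns only the two under year=2019, which is the pruning that
is missing on the v2 path.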

Are there any ongoing plans to add support for this?

Thanks,
Guy



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning

Posted by Guy Khazma <Gu...@ibm.com>.

Thanks Gengliang.

Please let me know if I can help.




Re: [Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning

Posted by Gengliang <lt...@gmail.com>.

Hi Guy,

Thanks for reporting the issue. I am working on it and there will be a PR
this week.

Gengliang
