Posted to dev@spark.apache.org by Ovidiu-Cristian MARCU <ov...@inria.fr> on 2016/07/25 10:01:09 UTC

orc/parquet sql conf

Hi,

Assuming I have data in both ORC and Parquet formats, and a complex workflow that eventually combines the results of queries on these datasets, I would like to get the best execution. Looking at the default configs I noticed:

1) Vectorized query execution seems possible with Parquet only; can you confirm whether it is also possible with the ORC format?

parameter spark.sql.parquet.enableVectorizedReader
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Hive assumes ORC, via the parameter hive.vectorized.execution.enabled
[2] https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

2) Filter pushdown is enabled by default for Parquet only; why not also for ORC?
spark.sql.parquet.filterPushdown=true
spark.sql.orc.filterPushdown=false
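
For reference, these defaults can be overridden without rebuilding Spark; a minimal spark-defaults.conf sketch, assuming the property names listed in SQLConf.scala above:

```
# spark-defaults.conf -- sketch only; property names as in SQLConf.scala
spark.sql.parquet.enableVectorizedReader  true
spark.sql.parquet.filterPushdown          true
# ORC filter pushdown is off by default and must be enabled explicitly
spark.sql.orc.filterPushdown              true
```

The same keys can also be set per session, e.g. spark.conf.set("spark.sql.orc.filterPushdown", "true") in spark-shell, or passed via --conf on spark-submit.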

3) Should I even try to process the ORC format with Spark, as it seems there is native support for Parquet?


Thank you!

Best,
Ovidiu

Re: orc/parquet sql conf

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Thank you! Any chance of this work being reviewed and integrated into the next Spark release?

Best,
Ovidiu
> On 25 Jul 2016, at 12:20, Hyukjin Kwon <gu...@gmail.com> wrote:
> 
> Regarding question 1: it is possible but not yet supported. Please refer to https://github.com/apache/spark/pull/13775
> 
> Thanks!
> 


Re: orc/parquet sql conf

Posted by Hyukjin Kwon <gu...@gmail.com>.
Regarding question 1: it is possible but not yet supported. Please refer to
https://github.com/apache/spark/pull/13775

Thanks!
