Posted to issues@spark.apache.org by "Ahmed Hussein (Jira)" <ji...@apache.org> on 2022/03/10 15:58:00 UTC

[jira] [Comment Edited] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC

    [ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504374#comment-17504374 ] 

Ahmed Hussein edited comment on SPARK-34960 at 3/10/22, 3:57 PM:
-----------------------------------------------------------------

Thanks [~chengsu] for putting up the optimization on pushed aggregates.
I am concerned that the changes introduced in this jira lead to inconsistent behavior in the following scenario (a minimal repro sketch follows the list):
 * Assume an ORC file with empty column statistics ([^file_no_stats-orc.tar.gz]).
 * Run a read job such as {{spark.read.orc(path).selectExpr('count(p)')}} with the default configuration. This works fine.
 * Now enable {{'spark.sql.orc.aggregatePushdown': 'true'}} and re-run. This throws an exception because the new code assumes that an ORC file must have file statistics.
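
A minimal PySpark sketch of the scenario; the path and session setup are illustrative, and the attached file stands in for any ORC file written without column statistics:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-agg-pushdown-repro").getOrCreate()

# Illustrative path: any ORC file whose footer carries no column statistics.
path = "/tmp/file_no_stats.orc"

# Default configuration: the aggregate is computed by scanning the data,
# so this succeeds.
spark.conf.set("spark.sql.orc.aggregatePushdown", "false")
spark.read.orc(path).selectExpr("count(p)").show()

# With push down enabled, the new code tries to answer count(p) from the
# file statistics alone and throws because the statistics are missing.
spark.conf.set("spark.sql.orc.aggregatePushdown", "true")
spark.read.orc(path).selectExpr("count(p)").show()
{code}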

In other words, enabling {{spark.sql.orc.aggregatePushdown}} causes read jobs to fail on any ORC file with empty statistics.
This is going to be problematic for users: they would have to identify all affected ORC files up front, or risk failing their jobs at runtime.

Note that according to the [ORC specification|https://orc.apache.org/specification], the statistics are optional even in the upcoming ORCv2.

I second [~tgraves]: there should be a way to recover safely when those fields are missing, along the lines of the sketch below.
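
A sketch of the kind of guard I have in mind. The helper and the statistics accessors below are hypothetical and do not match the actual Spark internals; the point is only that a file with missing statistics should fall back to a regular scan rather than throw:
{code:python}
# Hypothetical helper, not the actual Spark code: decide per file whether the
# aggregate can be answered from footer statistics.
def try_push_down_aggregate(footer_stats, agg_columns):
    if footer_stats is None:
        return None  # no statistics at all: caller runs the regular scan
    stats = {s.column: s for s in footer_stats}  # hypothetical stat objects
    if any(c not in stats or stats[c].is_empty() for c in agg_columns):
        return None  # a referenced column lacks statistics: fall back
    # All statistics present: safe to answer Min/Max/Count from the footer.
    return {c: (stats[c].minimum, stats[c].maximum, stats[c].count)
            for c in agg_columns}
{code}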


> Aggregate (Min/Max/Count) push down for ORC
> -------------------------------------------
>
>                 Key: SPARK-34960
>                 URL: https://issues.apache.org/jira/browse/SPARK-34960
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.3.0
>
>         Attachments: file_no_stats-orc.tar.gz
>
>
> Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we can also push down certain aggregations into ORC. ORC exposes column statistics through the interface `org.apache.orc.Reader` ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118]), which Spark can utilize for aggregate push down.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org