Posted to issues@orc.apache.org by "Panagiotis Garefalakis (Jira)" <ji...@apache.org> on 2020/05/27 15:14:00 UTC
[jira] [Updated] (ORC-577) Allow row-level filtering

     [ https://issues.apache.org/jira/browse/ORC-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Panagiotis Garefalakis updated ORC-577:
---------------------------------------
    Description: 
Currently, ORC filters at three levels:
 * File level
 * Stripe (64 to 256 MB) level
 * Row group (10k row) level

The filters are specified as Sargs (Search Arguments), which have a relatively small vocabulary. Furthermore, they can only skip a set of rows when they can guarantee that none of its rows can pass the filter.
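
To illustrate that guarantee, here is a minimal sketch of min/max pruning for an equality test on one column. It is not ORC's implementation: the Stats class and the canSkip and groupsToRead helpers are hypothetical names made up for this example. A row group is skipped only when its statistics prove no row can match; otherwise every row in it is read, including the ones that fail the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowGroupPruning {
    /** Hypothetical stand-in for the min/max statistics of one row group. */
    static final class Stats {
        final long min;
        final long max;
        Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** True only when no value in [min, max] can equal target. */
    static boolean canSkip(Stats s, long target) {
        return target < s.min || target > s.max;
    }

    /** Returns the indices of the row groups that cannot be ruled out. */
    static List<Integer> groupsToRead(List<Stats> groups, long target) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            if (!canSkip(groups.get(i), target)) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<Stats> groups = Arrays.asList(
                new Stats(0, 99), new Stats(100, 199), new Stats(150, 400));
        // Group 0 is provably empty for "col = 160"; groups 1 and 2 are read in full.
        System.out.println(groupsToRead(groups, 160));
    }
}
```

Groups 1 and 2 are both kept even if only a handful of their rows equal 160, which is the gap a row-level filter would close.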

There are some use cases where the user needs to read a subset of the columns and apply more detailed row-level filters. I'd suggest that we add a new method in Reader.Options:

{{setRowFilter(String[] filterColumnNames, Consumer<VectorizedRowBatch> filterCallback)}}

where the columns named in filterColumnNames are read and expanded first, then the filter callback is run, and the rest of the data is read only for the rows that pass the filter.
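
A rough sketch of the two-phase read described above, under loud assumptions: Batch is a hypothetical stand-in for VectorizedRowBatch with a single filter column, and read and keepGreaterThanTen are made-up helpers, not ORC code. It also assumes the callback reports the surviving rows by filling a selected index array and updating size, which is one plausible way for a Consumer (which returns nothing) to communicate its result.

```java
import java.util.function.Consumer;

public class RowFilterSketch {
    /** Hypothetical stand-in for VectorizedRowBatch with one filter column. */
    static final class Batch {
        final long[] filterCol; // values of the filter column, read first
        int[] selected;         // indices of rows that passed; null means all rows
        int size;               // number of rows still selected

        Batch(long[] filterCol) {
            this.filterCol = filterCol;
            this.size = filterCol.length;
        }
    }

    /**
     * Phase 1: run the callback against the filter column only.
     * Phase 2: materialize the other column for the selected rows only.
     */
    static String[] read(long[] filterCol, String[] otherCol, Consumer<Batch> filterCallback) {
        Batch batch = new Batch(filterCol);
        filterCallback.accept(batch);
        String[] out = new String[batch.size];
        for (int i = 0; i < batch.size; i++) {
            int row = (batch.selected == null) ? i : batch.selected[i];
            out[i] = otherCol[row];
        }
        return out;
    }

    /** Example callback: keep rows whose filter value is greater than 10. */
    static void keepGreaterThanTen(Batch b) {
        int[] sel = new int[b.filterCol.length];
        int n = 0;
        for (int r = 0; r < b.filterCol.length; r++) {
            if (b.filterCol[r] > 10) {
                sel[n++] = r;
            }
        }
        b.selected = sel;
        b.size = n;
    }

    public static void main(String[] args) {
        String[] rows = read(new long[] {5, 20, 7, 30},
                new String[] {"a", "b", "c", "d"},
                RowFilterSketch::keepGreaterThanTen);
        // Only rows 1 and 3 pass, so only "b" and "d" are materialized.
        System.out.println(String.join(",", rows));
    }
}
```

In a real reader the second phase is where the savings would come from, since the non-filter columns would not need to be decoded for rows the callback rejected.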



  was:
Currently, ORC filters at three levels:
 * File level
 * Stripe (64 to 256mb) level
 * Row group (10k row) level

The filters are specified as Sargs (Search Arguments), which have a relatively small vocabulary. Furthermore, they only filter sets of rows if they can guarantee that none of the rows can pass the filter.

There are some use cases where the user needs to read a subset of the columns and apply more detailed row level filters. I'd suggest that we add a new method in Reader.Options

{{setFilter(String columnNames, Predicate<VectorizedRowBatch> filter)}}

Where the columns named in columnNames are read expanded first, then the filter is run and the rest of the data is read only if the predicate returns true.




> Allow row-level filtering
> -------------------------
>
>                 Key: ORC-577
>                 URL: https://issues.apache.org/jira/browse/ORC-577
>             Project: ORC
>          Issue Type: New Feature
>            Reporter: Owen O'Malley
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>             Fix For: 1.7.0
>
>         Attachments: RowFilterBenchBoolean.out, RowFilterBenchDecimal.out, RowFilterBenchDouble.out, RowFilterBenchString.out, RowFilterBenchTimestamp.out
>
>          Time Spent: 20m
>  Remaining Estimate: 0h


