Posted to dev@orc.apache.org by "Pavan Lanka (Jira)" <ji...@apache.org> on 2021/10/12 16:56:00 UTC

[jira] [Created] (ORC-1027) Filter processing to allow filter injections that cannot be represented via SArgs

Pavan Lanka created ORC-1027:
--------------------------------

             Summary: Filter processing to allow filter injections that cannot be represented via SArgs
                 Key: ORC-1027
                 URL: https://issues.apache.org/jira/browse/ORC-1027
             Project: ORC
          Issue Type: Improvement
          Components: Java
    Affects Versions: 1.7.0, 1.8.0
            Reporter: Pavan Lanka
            Assignee: Pavan Lanka


Currently in the ORCRecordReader, the filter logic that performs LazyIO receives the following inputs:
 * SearchArgument as passed by the client using `Reader.Options.getSearchArgument`
 * Input filter as passed by the client using `Reader.Options.getFilterCallback`

The SearchArgument is particularly convenient because it integrates easily with existing engines such as Spark without requiring any code changes in the engine. However, this pushdown is limited to what can be represented via SearchArguments. For example, any predicate that uses a function cannot be pushed down.
{quote}SELECT * FROM table WHERE lower(f1) IN ... OR f2 IN ... OR f3 IN ...
{quote}
For the above query, none of the filters are pushed down to ORC from the engine: we have no means of representing functions, and because the predicates are combined with OR, the remaining ones cannot be pushed down on their own.
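
To make the gap concrete, the predicate above is easy to state as ordinary code even though it cannot be expressed as a SearchArgument. The following is a minimal sketch in plain Java, not ORC code: the class name `LowerInOrFilter` and the modelling of a row as a `Map<String, String>` are assumptions made for illustration, and the IN-list values are supplied by the caller because the query elides them.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of the ORC API.
// Evaluates: lower(f1) IN (...) OR f2 IN (...) OR f3 IN (...)
final class LowerInOrFilter implements Predicate<Map<String, String>> {
  private final Set<String> f1Values; // compared against lower(f1)
  private final Set<String> f2Values;
  private final Set<String> f3Values;

  LowerInOrFilter(Set<String> f1Values, Set<String> f2Values, Set<String> f3Values) {
    this.f1Values = f1Values;
    this.f2Values = f2Values;
    this.f3Values = f3Values;
  }

  // A null (missing) column value never matches an IN list.
  private static boolean in(Set<String> values, String value) {
    return value != null && values.contains(value);
  }

  @Override
  public boolean test(Map<String, String> row) {
    String f1 = row.get("f1");
    String loweredF1 = (f1 == null) ? null : f1.toLowerCase(Locale.ROOT);
    return in(f1Values, loweredF1)
        || in(f2Values, row.get("f2"))
        || in(f3Values, row.get("f3"));
  }
}
```

A row is kept if any one of the three branches matches, which is why no single branch can be pushed down by itself as a SearchArgument.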

An additional input mechanism is requested for supplying filters, one that is pluggable and does not require a direct change in the clients. We propose using **ServiceLoader** to dynamically determine the desired filters for a given fully qualified file path.
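
One way such a lookup could work is sketched below using only `java.util.ServiceLoader` from the JDK. The interface `PluginFilterProvider`, its method `filterFor`, and the helper class `PluginFilters` are hypothetical names chosen for this sketch and are not the ORC API; rows are again modelled as `Map<String, String>` purely for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.function.Predicate;

// Hypothetical service interface; name and signature are assumptions.
interface PluginFilterProvider {
  // Returns a row filter for the given file, or empty if none applies.
  Optional<Predicate<Map<String, String>>> filterFor(String fullyQualifiedPath);
}

final class PluginFilters {
  private PluginFilters() {}

  // Discovers providers registered under
  // META-INF/services/<fully qualified interface name> on the classpath.
  static Optional<Predicate<Map<String, String>>> discover(String path) {
    return resolve(ServiceLoader.load(PluginFilterProvider.class), path);
  }

  // Asks each provider for a filter for this path and ANDs together the
  // ones that are present. Kept separate from discover() so it can be
  // exercised with an ordinary list of providers.
  static Optional<Predicate<Map<String, String>>> resolve(
      Iterable<? extends PluginFilterProvider> providers, String path) {
    Predicate<Map<String, String>> combined = null;
    for (PluginFilterProvider provider : providers) {
      Optional<Predicate<Map<String, String>>> filter = provider.filterFor(path);
      if (filter.isPresent()) {
        combined = (combined == null) ? filter.get() : combined.and(filter.get());
      }
    }
    return Optional.ofNullable(combined);
  }
}
```

Because the provider is found on the classpath at run time, an engine such as Spark would not need code changes; deploying a jar containing the provider and its service registration would be enough under this design.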

This filter, if determined, is applied as an AND in conjunction with the other available filters. It is understood that the plugin filter cannot differentiate between multiple aliases of the same table.
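
The AND combination can be sketched as follows. This is a simplification under stated assumptions: each filter is modelled as a per-row `Predicate`, any of them may be absent, and the helper `FilterConjunction` is a hypothetical name rather than anything in ORC.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper; models each filter as a per-row predicate for clarity.
final class FilterConjunction {
  private FilterConjunction() {}

  // ANDs together whichever filters are present; absent filters are skipped.
  static <T> Predicate<T> allOf(List<Optional<Predicate<T>>> filters) {
    Predicate<T> combined = row -> true; // with no filters, every row passes
    for (Optional<Predicate<T>> filter : filters) {
      if (filter.isPresent()) {
        combined = combined.and(filter.get());
      }
    }
    return combined;
  }
}
```

Since the filters are only ever ANDed, adding a plugin filter can narrow the set of rows read but never widen it.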

This generic capability will allow us to represent complex filters that currently cannot be pushed down to the storage layer from the existing engines, so that we can reap the benefits of LazyIO in many cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)