Posted to dev@orc.apache.org by Elliot West <te...@gmail.com> on 2015/07/15 12:43:41 UTC

Search argument scope

Hello, I have a question regarding the design of search arguments.

As I understand it, search arguments are used in conjunction with ORC file
indexes to identify files that need not be read. I presume that in practice
the search argument is derived from some higher-level filter (e.g. a
condition in a Hive statement) that is also applied by the processing
framework (typically Hive) once records are read.

Is there any reason why search arguments could/should not also be used to
filter out non-matching records in the OrcRecordReader in addition to
filtering out stripes? This would remove irrelevant records earlier in the
data processing pipeline, and possibly remove the need for the downstream
filter.

Thanks - Elliot.

Re: Search argument scope

Posted by Owen O'Malley <om...@apache.org>.
Elliot,
  Yes, we could apply the search arguments (sargs) at the row level in
addition to the levels at which we currently use them for filtering:

* file level
* stripe level
* row group (10k rows)

Sargs are the subset of the query's filters that the indexes are likely to
be able to answer, so many queries will still run additional filters. So
yes, the reader could enforce the sargs row by row, but no one has done
that work yet. The biggest bang for the buck is throwing out the larger
units of work.

.. Owen
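
To make the distinction in this thread concrete, here is a minimal sketch
of coarse pruning by min/max statistics versus additional row-by-row
enforcement. It is not ORC's implementation or API: RowGroup, may_match,
and read are hypothetical names, the predicate is a simple inclusive range
on one integer column, and real row groups hold about 10k rows, not three.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RowGroup:
    # Hypothetical stand-in for an ORC row group: the values of one column,
    # with min/max computed on demand to mimic stored index statistics.
    values: List[int]

    @property
    def minimum(self) -> int:
        return min(self.values)

    @property
    def maximum(self) -> int:
        return max(self.values)


def may_match(group: RowGroup, lo: int, hi: int) -> bool:
    # Index-level check: could any value in [lo, hi] exist in this group?
    return group.maximum >= lo and group.minimum <= hi


def read(groups: List[RowGroup], lo: int, hi: int,
         row_level: bool) -> Iterator[int]:
    for group in groups:
        if not may_match(group, lo, hi):
            continue  # the whole group is skipped without reading its rows
        for value in group.values:
            # Without row-level enforcement, non-matching rows in a
            # surviving group are still returned to the caller.
            if not row_level or lo <= value <= hi:
                yield value


groups = [RowGroup([1, 5, 9]), RowGroup([20, 25, 40]), RowGroup([50, 60])]
coarse = list(read(groups, 22, 30, row_level=False))
exact = list(read(groups, 22, 30, row_level=True))
```

With row_level=False the middle group survives pruning and all of its rows
are returned, so a downstream filter is still needed. With row_level=True
only the rows inside the range come back.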
