You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@orc.apache.org by Xinyu Z <xz...@gmail.com> on 2023/01/16 09:40:57 UTC

Question about Sargs and row index

Hi,

I know that in ORC with SearchArguments and row index, we can skip
reading and decoding row groups that are out of the range of
predicate. But does ORC have late materialization functionality?
Basically after decoding and evaluating the predicate column(s), we
can only read and decode the row groups of projection columns where
the matching rows reside. This can further reduce IO and decoding
overhead. It seems the C++ version does not have this. I am asking
because parquet-rs recently add this:
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

Another question is about row index. Since each row group is logically
10000 rows and may not align with CompressionChunk boundaries, does
this cause issue for predicate pushdown? E.g, even we can skip one row
group, we may still need to do IO on the boundary CompressionChunks.

Thanks a lot,
Xinyu

Re: Question about Sargs and row index

Posted by Gang Wu <us...@gmail.com>.

No, it only records the start offset. So it doesn't matter how many
compressed chunks in the row group.

On Tue, Feb 7, 2023 at 2:01 PM Xinyu Z <xz...@gmail.com> wrote:

> I see. So for example, if one row group of a compressed stream spans
> two compression chunks, the positions in that RowIndex are [(byte
> offset of chunk1, decompressed size, # of values), (byte offset of
> chunk2, decompressed size, # of values)]. Is that correct?
>
> On Tue, Feb 7, 2023 at 1:04 PM Gang Wu <us...@gmail.com> wrote:
> >
> > Not exactly. It starts with the byte offset of the compression chunk and
> appends offset to values in the chunk based on the encoding type.
> >
> > I copied the description from the specs as below:
> >
> > To record positions, each stream needs a sequence of numbers. For
> uncompressed streams, the position is the byte offset of the RLE run’s
> start location followed by the number of values that need to be consumed
> from the run. In compressed streams, the first number is the start of the
> compression chunk in the stream, followed by the number of decompressed
> bytes that need to be consumed, and finally the number of values consumed
> in the RLE.
> >
> > For columns with multiple streams, the sequences of positions in each
> stream are concatenated. That was an unfortunate decision on my part that
> we should fix at some point, because it makes code that uses the indexes
> error-prone.
> >
> > Because dictionaries are accessed randomly, there is not a position to
> record for the dictionary and the entire dictionary must be read even if
> only part of a stripe is being read.
> >
> > More details can be found here:
> https://orc.apache.org/specification/ORCv1/
> >
> > Best,
> > Gang
> >
> > On Tue, Feb 7, 2023 at 11:57 AM Xinyu Z <xz...@gmail.com> wrote:
> >>
> >> Hi Gang,
> >>
> >> Thanks for your reply.
> >> A follow up question on Row Index, what is the exact meaning of
> >> 'position' in RowIndexEntry? Is it the byte offset of the starting
> >> position of the first compression chunk of that row group?
> >>
> >> On Thu, Feb 2, 2023 at 4:40 PM Gang Wu <us...@gmail.com> wrote:
> >> >
> >> > Hi Xinyu,
> >> >
> >> > Sorry I am not sure about that.
> >> >
> >> > You may be interested in the implementation of Apache Impala.
> >> >
> >> > Best,
> >> > Gang
> >> >
> >> > On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xz...@gmail.com> wrote:
> >> >>
> >> >> Hi Gang, do you know any upstream system that uses ORC C++ and does
> >> >> vectorized predicate evaluation on the resulting ColumnVectorBatch
> >> >> produced by C++ reader with PPD?
> >> >>
> >> >> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xz...@gmail.com> wrote:
> >> >> >
> >> >> > Hi Gang,
> >> >> >
> >> >> > Thanks for your reply! It helps.
> >> >> >
> >> >> > Xinyu
> >> >> >
> >> >> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
> >> >> > >
> >> >> > > Hi Xinyu,
> >> >> > >
> >> >> > > The C++ library does not provide lazy materialization. The java
> library supports row level filtering, please check it if interested:
> https://issues.apache.org/jira/browse/ORC-577
> >> >> > >
> >> >> > > With regards to the IO magnification introduced by PPD, I think
> we have discussed this earlier and there is a pending work item:
> https://issues.apache.org/jira/browse/ORC-1264
> >> >> > >
> >> >> > > Best,
> >> >> > > Gang
> >> >> > >
> >> >> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com>
> wrote:
> >> >> > >>
> >> >> > >> Hi,
> >> >> > >>
> >> >> > >> I know that in ORC with SearchArguments and row index, we can
> skip
> >> >> > >> reading and decoding row groups that are out of the range of
> >> >> > >> predicate. But does ORC have late materialization functionality?
> >> >> > >> Basically after decoding and evaluating the predicate
> column(s), we
> >> >> > >> can only read and decode the row groups of projection columns
> where
> >> >> > >> the matching rows reside. This can further reduce IO and
> decoding
> >> >> > >> overhead. It seems the C++ version does not have this. I am
> asking
> >> >> > >> because parquet-rs recently add this:
> >> >> > >>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >> >> > >>
> >> >> > >> Another question is about row index. Since each row group is
> logically
> >> >> > >> 10000 rows and may not align with CompressionChunk boundaries,
> does
> >> >> > >> this cause issue for predicate pushdown? E.g, even we can skip
> one row
> >> >> > >> group, we may still need to do IO on the boundary
> CompressionChunks.
> >> >> > >>
> >> >> > >> Thanks a lot,
> >> >> > >> Xinyu
>

Re: Question about Sargs and row index

Posted by Xinyu Z <xz...@gmail.com>.

I see. So for example, if one row group of a compressed stream spans
two compression chunks, the positions in that RowIndex are [(byte
offset of chunk1, decompressed size, # of values), (byte offset of
chunk2, decompressed size, # of values)]. Is that correct?

On Tue, Feb 7, 2023 at 1:04 PM Gang Wu <us...@gmail.com> wrote:
>
> Not exactly. It starts with the byte offset of the compression chunk and appends offset to values in the chunk based on the encoding type.
>
> I copied the description from the specs as below:
>
> To record positions, each stream needs a sequence of numbers. For uncompressed streams, the position is the byte offset of the RLE run’s start location followed by the number of values that need to be consumed from the run. In compressed streams, the first number is the start of the compression chunk in the stream, followed by the number of decompressed bytes that need to be consumed, and finally the number of values consumed in the RLE.
>
> For columns with multiple streams, the sequences of positions in each stream are concatenated. That was an unfortunate decision on my part that we should fix at some point, because it makes code that uses the indexes error-prone.
>
> Because dictionaries are accessed randomly, there is not a position to record for the dictionary and the entire dictionary must be read even if only part of a stripe is being read.
>
> More details can be found here: https://orc.apache.org/specification/ORCv1/
>
> Best,
> Gang
>
> On Tue, Feb 7, 2023 at 11:57 AM Xinyu Z <xz...@gmail.com> wrote:
>>
>> Hi Gang,
>>
>> Thanks for your reply.
>> A follow up question on Row Index, what is the exact meaning of
>> 'position' in RowIndexEntry? Is it the byte offset of the starting
>> position of the first compression chunk of that row group?
>>
>> On Thu, Feb 2, 2023 at 4:40 PM Gang Wu <us...@gmail.com> wrote:
>> >
>> > Hi Xinyu,
>> >
>> > Sorry I am not sure about that.
>> >
>> > You may be interested in the implementation of Apache Impala.
>> >
>> > Best,
>> > Gang
>> >
>> > On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xz...@gmail.com> wrote:
>> >>
>> >> Hi Gang, do you know any upstream system that uses ORC C++ and does
>> >> vectorized predicate evaluation on the resulting ColumnVectorBatch
>> >> produced by C++ reader with PPD?
>> >>
>> >> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xz...@gmail.com> wrote:
>> >> >
>> >> > Hi Gang,
>> >> >
>> >> > Thanks for your reply! It helps.
>> >> >
>> >> > Xinyu
>> >> >
>> >> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
>> >> > >
>> >> > > Hi Xinyu,
>> >> > >
>> >> > > The C++ library does not provide lazy materialization. The java library supports row level filtering, please check it if interested: https://issues.apache.org/jira/browse/ORC-577
>> >> > >
>> >> > > With regards to the IO magnification introduced by PPD, I think we have discussed this earlier and there is a pending work item: https://issues.apache.org/jira/browse/ORC-1264
>> >> > >
>> >> > > Best,
>> >> > > Gang
>> >> > >
>> >> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:
>> >> > >>
>> >> > >> Hi,
>> >> > >>
>> >> > >> I know that in ORC with SearchArguments and row index, we can skip
>> >> > >> reading and decoding row groups that are out of the range of
>> >> > >> predicate. But does ORC have late materialization functionality?
>> >> > >> Basically after decoding and evaluating the predicate column(s), we
>> >> > >> can only read and decode the row groups of projection columns where
>> >> > >> the matching rows reside. This can further reduce IO and decoding
>> >> > >> overhead. It seems the C++ version does not have this. I am asking
>> >> > >> because parquet-rs recently add this:
>> >> > >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>> >> > >>
>> >> > >> Another question is about row index. Since each row group is logically
>> >> > >> 10000 rows and may not align with CompressionChunk boundaries, does
>> >> > >> this cause issue for predicate pushdown? E.g, even we can skip one row
>> >> > >> group, we may still need to do IO on the boundary CompressionChunks.
>> >> > >>
>> >> > >> Thanks a lot,
>> >> > >> Xinyu

Re: Question about Sargs and row index

Posted by Gang Wu <us...@gmail.com>.

Not exactly. It starts with the byte offset of the compression chunk and
appends offset to values in the chunk based on the encoding type.

I copied the description from the specs as below:

*To record positions, each stream needs a sequence of numbers. For
uncompressed streams, the position is the byte offset of the RLE run’s
start location followed by the number of values that need to be consumed
from the run. In compressed streams, the first number is the start of the
compression chunk in the stream, followed by the number of decompressed
bytes that need to be consumed, and finally the number of values consumed
in the RLE.*

*For columns with multiple streams, the sequences of positions in each
stream are concatenated. That was an unfortunate decision on my part that
we should fix at some point, because it makes code that uses the indexes
error-prone.*

*Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even if
only part of a stripe is being read.*

More details can be found here: https://orc.apache.org/specification/ORCv1/

Best,
Gang

On Tue, Feb 7, 2023 at 11:57 AM Xinyu Z <xz...@gmail.com> wrote:

> Hi Gang,
>
> Thanks for your reply.
> A follow up question on Row Index, what is the exact meaning of
> 'position' in RowIndexEntry? Is it the byte offset of the starting
> position of the first compression chunk of that row group?
>
> On Thu, Feb 2, 2023 at 4:40 PM Gang Wu <us...@gmail.com> wrote:
> >
> > Hi Xinyu,
> >
> > Sorry I am not sure about that.
> >
> > You may be interested in the implementation of Apache Impala.
> >
> > Best,
> > Gang
> >
> > On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xz...@gmail.com> wrote:
> >>
> >> Hi Gang, do you know any upstream system that uses ORC C++ and does
> >> vectorized predicate evaluation on the resulting ColumnVectorBatch
> >> produced by C++ reader with PPD?
> >>
> >> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xz...@gmail.com> wrote:
> >> >
> >> > Hi Gang,
> >> >
> >> > Thanks for your reply! It helps.
> >> >
> >> > Xinyu
> >> >
> >> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
> >> > >
> >> > > Hi Xinyu,
> >> > >
> >> > > The C++ library does not provide lazy materialization. The java
> library supports row level filtering, please check it if interested:
> https://issues.apache.org/jira/browse/ORC-577
> >> > >
> >> > > With regards to the IO magnification introduced by PPD, I think we
> have discussed this earlier and there is a pending work item:
> https://issues.apache.org/jira/browse/ORC-1264
> >> > >
> >> > > Best,
> >> > > Gang
> >> > >
> >> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:
> >> > >>
> >> > >> Hi,
> >> > >>
> >> > >> I know that in ORC with SearchArguments and row index, we can skip
> >> > >> reading and decoding row groups that are out of the range of
> >> > >> predicate. But does ORC have late materialization functionality?
> >> > >> Basically after decoding and evaluating the predicate column(s), we
> >> > >> can only read and decode the row groups of projection columns where
> >> > >> the matching rows reside. This can further reduce IO and decoding
> >> > >> overhead. It seems the C++ version does not have this. I am asking
> >> > >> because parquet-rs recently add this:
> >> > >>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >> > >>
> >> > >> Another question is about row index. Since each row group is
> logically
> >> > >> 10000 rows and may not align with CompressionChunk boundaries, does
> >> > >> this cause issue for predicate pushdown? E.g, even we can skip one
> row
> >> > >> group, we may still need to do IO on the boundary
> CompressionChunks.
> >> > >>
> >> > >> Thanks a lot,
> >> > >> Xinyu
>

Re: Question about Sargs and row index

Posted by Xinyu Z <xz...@gmail.com>.

Hi Gang,

Thanks for your reply.
A follow up question on Row Index, what is the exact meaning of
'position' in RowIndexEntry? Is it the byte offset of the starting
position of the first compression chunk of that row group?

On Thu, Feb 2, 2023 at 4:40 PM Gang Wu <us...@gmail.com> wrote:
>
> Hi Xinyu,
>
> Sorry I am not sure about that.
>
> You may be interested in the implementation of Apache Impala.
>
> Best,
> Gang
>
> On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xz...@gmail.com> wrote:
>>
>> Hi Gang, do you know any upstream system that uses ORC C++ and does
>> vectorized predicate evaluation on the resulting ColumnVectorBatch
>> produced by C++ reader with PPD?
>>
>> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xz...@gmail.com> wrote:
>> >
>> > Hi Gang,
>> >
>> > Thanks for your reply! It helps.
>> >
>> > Xinyu
>> >
>> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
>> > >
>> > > Hi Xinyu,
>> > >
>> > > The C++ library does not provide lazy materialization. The java library supports row level filtering, please check it if interested: https://issues.apache.org/jira/browse/ORC-577
>> > >
>> > > With regards to the IO magnification introduced by PPD, I think we have discussed this earlier and there is a pending work item: https://issues.apache.org/jira/browse/ORC-1264
>> > >
>> > > Best,
>> > > Gang
>> > >
>> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> I know that in ORC with SearchArguments and row index, we can skip
>> > >> reading and decoding row groups that are out of the range of
>> > >> predicate. But does ORC have late materialization functionality?
>> > >> Basically after decoding and evaluating the predicate column(s), we
>> > >> can only read and decode the row groups of projection columns where
>> > >> the matching rows reside. This can further reduce IO and decoding
>> > >> overhead. It seems the C++ version does not have this. I am asking
>> > >> because parquet-rs recently add this:
>> > >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>> > >>
>> > >> Another question is about row index. Since each row group is logically
>> > >> 10000 rows and may not align with CompressionChunk boundaries, does
>> > >> this cause issue for predicate pushdown? E.g, even we can skip one row
>> > >> group, we may still need to do IO on the boundary CompressionChunks.
>> > >>
>> > >> Thanks a lot,
>> > >> Xinyu

Re: Question about Sargs and row index

Posted by Gang Wu <us...@gmail.com>.

Hi Xinyu,

Sorry I am not sure about that.

You may be interested in the implementation of Apache Impala.

Best,
Gang

On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xz...@gmail.com> wrote:

> Hi Gang, do you know any upstream system that uses ORC C++ and does
> vectorized predicate evaluation on the resulting ColumnVectorBatch
> produced by C++ reader with PPD?
>
> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xz...@gmail.com> wrote:
> >
> > Hi Gang,
> >
> > Thanks for your reply! It helps.
> >
> > Xinyu
> >
> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
> > >
> > > Hi Xinyu,
> > >
> > > The C++ library does not provide lazy materialization. The java
> library supports row level filtering, please check it if interested:
> https://issues.apache.org/jira/browse/ORC-577
> > >
> > > With regards to the IO magnification introduced by PPD, I think we
> have discussed this earlier and there is a pending work item:
> https://issues.apache.org/jira/browse/ORC-1264
> > >
> > > Best,
> > > Gang
> > >
> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> I know that in ORC with SearchArguments and row index, we can skip
> > >> reading and decoding row groups that are out of the range of
> > >> predicate. But does ORC have late materialization functionality?
> > >> Basically after decoding and evaluating the predicate column(s), we
> > >> can only read and decode the row groups of projection columns where
> > >> the matching rows reside. This can further reduce IO and decoding
> > >> overhead. It seems the C++ version does not have this. I am asking
> > >> because parquet-rs recently add this:
> > >>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > >>
> > >> Another question is about row index. Since each row group is logically
> > >> 10000 rows and may not align with CompressionChunk boundaries, does
> > >> this cause issue for predicate pushdown? E.g, even we can skip one row
> > >> group, we may still need to do IO on the boundary CompressionChunks.
> > >>
> > >> Thanks a lot,
> > >> Xinyu
>

Re: Question about Sargs and row index

Posted by Xinyu Z <xz...@gmail.com>.

Hi Gang, do you know any upstream system that uses ORC C++ and does
vectorized predicate evaluation on the resulting ColumnVectorBatch
produced by C++ reader with PPD?

On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xz...@gmail.com> wrote:
>
> Hi Gang,
>
> Thanks for your reply! It helps.
>
> Xinyu
>
> On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
> >
> > Hi Xinyu,
> >
> > The C++ library does not provide lazy materialization. The java library supports row level filtering, please check it if interested: https://issues.apache.org/jira/browse/ORC-577
> >
> > With regards to the IO magnification introduced by PPD, I think we have discussed this earlier and there is a pending work item: https://issues.apache.org/jira/browse/ORC-1264
> >
> > Best,
> > Gang
> >
> > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I know that in ORC with SearchArguments and row index, we can skip
> >> reading and decoding row groups that are out of the range of
> >> predicate. But does ORC have late materialization functionality?
> >> Basically after decoding and evaluating the predicate column(s), we
> >> can only read and decode the row groups of projection columns where
> >> the matching rows reside. This can further reduce IO and decoding
> >> overhead. It seems the C++ version does not have this. I am asking
> >> because parquet-rs recently add this:
> >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >>
> >> Another question is about row index. Since each row group is logically
> >> 10000 rows and may not align with CompressionChunk boundaries, does
> >> this cause issue for predicate pushdown? E.g, even we can skip one row
> >> group, we may still need to do IO on the boundary CompressionChunks.
> >>
> >> Thanks a lot,
> >> Xinyu

Re: Question about Sargs and row index

Posted by Xinyu Z <xz...@gmail.com>.

Hi Gang,

Thanks for your reply! It helps.

Xinyu

On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <us...@gmail.com> wrote:
>
> Hi Xinyu,
>
> The C++ library does not provide lazy materialization. The java library supports row level filtering, please check it if interested: https://issues.apache.org/jira/browse/ORC-577
>
> With regards to the IO magnification introduced by PPD, I think we have discussed this earlier and there is a pending work item: https://issues.apache.org/jira/browse/ORC-1264
>
> Best,
> Gang
>
> On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:
>>
>> Hi,
>>
>> I know that in ORC with SearchArguments and row index, we can skip
>> reading and decoding row groups that are out of the range of
>> predicate. But does ORC have late materialization functionality?
>> Basically after decoding and evaluating the predicate column(s), we
>> can only read and decode the row groups of projection columns where
>> the matching rows reside. This can further reduce IO and decoding
>> overhead. It seems the C++ version does not have this. I am asking
>> because parquet-rs recently add this:
>> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>>
>> Another question is about row index. Since each row group is logically
>> 10000 rows and may not align with CompressionChunk boundaries, does
>> this cause issue for predicate pushdown? E.g, even we can skip one row
>> group, we may still need to do IO on the boundary CompressionChunks.
>>
>> Thanks a lot,
>> Xinyu

Re: Question about Sargs and row index

Posted by Gang Wu <us...@gmail.com>.

Hi Xinyu,

The C++ library does not provide lazy materialization. The java library
supports row level filtering, please check it if interested:
https://issues.apache.org/jira/browse/ORC-577

With regards to the IO magnification introduced by PPD, I think we have
discussed this earlier and there is a pending work item:
https://issues.apache.org/jira/browse/ORC-1264

Best,
Gang

On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xz...@gmail.com> wrote:

> Hi,
>
> I know that in ORC with SearchArguments and row index, we can skip
> reading and decoding row groups that are out of the range of
> predicate. But does ORC have late materialization functionality?
> Basically after decoding and evaluating the predicate column(s), we
> can only read and decode the row groups of projection columns where
> the matching rows reside. This can further reduce IO and decoding
> overhead. It seems the C++ version does not have this. I am asking
> because parquet-rs recently add this:
>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>
> Another question is about row index. Since each row group is logically
> 10000 rows and may not align with CompressionChunk boundaries, does
> this cause issue for predicate pushdown? E.g, even we can skip one row
> group, we may still need to do IO on the boundary CompressionChunks.
>
> Thanks a lot,
> Xinyu
>