Posted to dev@orc.apache.org by Gang Wu <ga...@apache.org> on 2019/06/02 12:43:29 UTC

Re: C++ API seekToRow() performance.

I can open a JIRA for the issue and port our fix back.

For the last suggestion, we can add the optimization as a writer option if
anyone is interested.

Gang

On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai <xn...@live.com> wrote:

> Hi Shankar,
>
> This is a known issue. As far as I know, there are two issues here -
>
> 1. The reader doesn’t use the row group index to skip unnecessary rows.
> Instead, it reads through every row until the cursor reaches the desired
> position. [1]
> 2. We could skip an entire compression block whenever current offset +
> decompressed size <= desired offset, but we are currently not doing that.
> [2]
>
> These issues can be fixed. Feel free to open a JIRA.
>
> There’s one more thing we could discuss here. Currently a compression
> block or an RLE run can span two row groups, which means that even seeking
> to the beginning of a row group may require decompression and decoding.
> This is undesirable in latency-sensitive cases. In our setup, we modified
> the writer to close RLE runs and compression blocks at the end of each row
> group, so seeking to a row group doesn’t require any decompression. The
> difference in storage efficiency is barely noticeable (< 1%). I would
> suggest we make this change in ORC v2. The other benefit is that we could
> greatly simplify the current row position index design.
>
>
> [1]
> https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
> [2]
> https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
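The block-skipping condition in point 2 can be sketched as follows. The struct and function names here are illustrative, not the actual ORC C++ internals:

```cpp
#include <cstdint>

// Sketch of the block-skip test: while seeking, a whole compression block
// can be bypassed without decompressing it whenever the desired position
// lies at or beyond the block's decompressed end.
// BlockSkipCursor and canSkipBlock are illustrative names, not ORC internals.
struct BlockSkipCursor {
  uint64_t currentOffset;  // decompressed bytes already consumed
};

// True when the entire block can be skipped: the caller then advances past
// the compressed bytes and adds blockDecompressedSize to currentOffset,
// never invoking the decompressor for this block.
bool canSkipBlock(const BlockSkipCursor& cursor,
                  uint64_t blockDecompressedSize,
                  uint64_t desiredOffset) {
  return cursor.currentOffset + blockDecompressedSize <= desiredOffset;
}
```

For example, with currentOffset = 100 and a 50-byte block, a seek to offset 150 or beyond skips the block entirely, while a seek to 149 still requires decompressing it.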
>
>
>
>
> On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
>
> Hello,
>
> We are developing a data store based on ORC files, using the C++ API. We
> use min/max statistics from the row index, bloom filters, and our own
> partitioning logic to read only the required rows from the ORC files. The
> implementation relies on the seekToRow() method of the RowReader class to
> seek to the appropriate row groups and then read the batch. I am noticing
> that seekToRow() is not efficient and degrades performance, even when only
> a few row groups have to be read. Some numbers from my testing:
>
> Number of rows in ORC file : 30 million
> File size : 845 MB (7 stripes)
> Number of columns : 16 (TPC-H lineitem table)
>
> Sequential read of all rows/all columns : 10 seconds
> Read only 1% of the row groups using seek (forward direction only) : 1.5 seconds
> Read only 3% of the row groups using seek (forward direction only) : 12 seconds
> Read only 4% of the row groups using seek (forward direction only) : 20 seconds
> Read only 5% of the row groups using seek (forward direction only) : 33 seconds
>
>
> I tried the Java API and implemented the same filtering logic via
> predicate pushdown, and got good numbers with the same ORC file:
>
> Sequential read of all rows/all columns : 18 seconds
> Match & read 20% of row groups : 7 seconds
> Match & read 33% of row groups : 11 seconds
> Match & read 50% of row groups : 13.5 seconds
>
> I think the seekToRow() implementation needs to use the row index
> positions and read only the appropriate stream portions (like the Java
> API does). The current seekToRow() implementation starts over from the
> beginning of the stripe for each seek. I would like to work on changing
> the seekToRow() implementation, if no one is actively working on it right
> now. Seek is critical for us, as we have multiple feature paths that need
> to read only portions of the ORC file.
>
> I am looking for opinions from the community and contributors.
>
> Thanks,
> Shankar
>
>
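One way to read Shankar's proposal: given the per-stripe row counts from the file footer and the writer's row index stride, a seek target can be mapped directly to a stripe and row group, after which only the recorded index positions for that row group need to be consulted. A minimal sketch of that mapping follows; the names and interface are illustrative, not the actual ORC C++ internals:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative result of locating a target row: the stripe that contains it,
// the row group within that stripe, and the row offset inside the stripe.
struct SeekTarget {
  std::size_t stripe;
  uint64_t rowGroup;
  uint64_t rowInStripe;
};

// Map an absolute row number to (stripe, row group) using the per-stripe row
// counts and the row index stride (10,000 rows by default in ORC writers).
SeekTarget locateRowGroup(const std::vector<uint64_t>& stripeRowCounts,
                          uint64_t rowIndexStride,
                          uint64_t targetRow) {
  uint64_t firstRowOfStripe = 0;
  for (std::size_t i = 0; i < stripeRowCounts.size(); ++i) {
    if (targetRow < firstRowOfStripe + stripeRowCounts[i]) {
      uint64_t rowInStripe = targetRow - firstRowOfStripe;
      return {i, rowInStripe / rowIndexStride, rowInStripe};
    }
    firstRowOfStripe += stripeRowCounts[i];
  }
  // Target is past the last row: signal end-of-file with a one-past-the-end
  // stripe index.
  return {stripeRowCounts.size(), 0, 0};
}
```

Once the row group is known, the reader can use the index positions to jump each stream to the row group's start instead of replaying the stripe from its beginning, which is what makes the current implementation's cost grow with the seek distance.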

Re: C++ API seekToRow() performance.

Posted by Owen O'Malley <ow...@gmail.com>.

> On Jun 2, 2019, at 5:43 AM, Gang Wu <ga...@apache.org> wrote:
> 
> I can open a JIRA for the issue and port our fix back.

That would be great.

> 
> For the last suggestion, we can add the optimization as a writer option if
> anyone is interested.

It does significantly hurt compression to flush the streams every 10k rows.

.. Owen


Re: C++ API seekToRow() performance.

Posted by Shankar Iyer <sh...@gmail.com>.
Great, thank you very much Gang! I will test this over the next couple of
days and get back.

Regards,
Shankar

On Wed, Jun 19, 2019 at 7:30 PM Gang Wu <ga...@apache.org> wrote:

> Hi Shankar,
>
> Can you test this PR to see if it works:
> https://github.com/apache/orc/pull/401
>
> Thanks!
> Gang

Re: C++ API seekToRow() performance.

Posted by us...@gmail.com.
Thanks for the testing! I will proceed with this PR this week.

Best,
Gang

Sent from my iPhone

> On Jun 24, 2019, at 14:49, Shankar Iyer <sh...@gmail.com> wrote:
> 
> Hi Gang,
> 
>     I tested with a TPC-H lineitem-style schema and with
> zlib/zstd/no-compression, typically 30M rows & 3000 row groups. Results
> are good and I did not hit any issues.
> 
>     Thanks again!
> 
> -Shankar

Re: C++ API seekToRow() performance.

Posted by Shankar Iyer <sh...@gmail.com>.
Hi Gang,

     I tested with a TPC-H lineitem-style schema and with
zlib/zstd/no-compression, typically 30M rows & 3000 row groups. Results
are good and I did not hit any issues.

     Thanks again!

-Shankar

On Wed, Jun 19, 2019 at 7:30 PM Gang Wu <ga...@apache.org> wrote:

> Hi Shankar,
>
> Can you test this PR to see if it works:
> https://github.com/apache/orc/pull/401
>
> Thanks!
> Gang

Re: C++ API seekToRow() performance.

Posted by Gang Wu <ga...@apache.org>.
Hi Shankar,

Can you test this PR to see if it works:
https://github.com/apache/orc/pull/401

Thanks!
Gang

On Sun, Jun 9, 2019 at 9:49 PM Shankar Iyer <sh...@gmail.com> wrote:

> Hi Gang,
>
>     Is it possible to give an update or time frame for this?
>
> Thanks,
> Shankar
> > > > >
> > > > >
> > > >
> > >
> >
>
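The index-based seek discussed in the thread can be sketched with a toy model. The types below (RowIndexEntry, ColumnReaderSim) are invented for illustration and are not the real ORC C++ API; the point is only that seeking through the row index decodes at most one row group's worth of rows, while the old behavior decodes every row between the current position and the target.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model of issue [1]: seekToRow() without vs. with the row index.
// RowIndexEntry and ColumnReaderSim are illustrative names, not ORC's API.
struct RowIndexEntry {
    uint64_t streamOffset;  // where this row group's data begins
};

class ColumnReaderSim {
public:
    ColumnReaderSim(uint64_t rowIndexStride, std::vector<RowIndexEntry> index)
        : stride_(rowIndexStride), index_(std::move(index)) {}

    // Old behavior: decode and discard every row between the current
    // position and the target. Returns the number of rows decoded.
    uint64_t naiveSeekCost(uint64_t fromRow, uint64_t toRow) const {
        return toRow - fromRow;
    }

    // Index-based seek: restore the stream position recorded for the
    // target row group, then decode only the remainder within the group.
    uint64_t indexedSeekCost(uint64_t toRow) const {
        uint64_t group = toRow / stride_;
        (void)index_.at(group).streamOffset;  // jump the stream here
        return toRow % stride_;               // rows decoded and discarded
    }

private:
    uint64_t stride_;
    std::vector<RowIndexEntry> index_;
};
```

With the default stride of 10,000 rows, seeking to row 123,456 from the stripe start costs 123,456 decoded rows under the naive scheme but only 3,456 via the index, which matches the slowdowns reported in the benchmark above.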

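Point [2] of the discussion, skipping an entire compression block when current offset + decompressed size <= desired offset, can likewise be sketched. The toy model below assumes the decompressed size of each block is known before decompressing (in the real reader this information comes from the recorded index positions, not from the block alone), and counts how many blocks actually need decompression during a seek.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of issue [2]: seek to a decompressed offset, decompressing
// only the block that actually contains it.
struct BlockMeta {
    uint64_t decompressedSize;
};

struct SeekResult {
    size_t blockIndex;          // block containing the target offset
    uint64_t offsetInBlock;     // position within that block
    size_t blocksDecompressed;  // work done: should be at most 1
};

SeekResult seekSkippingBlocks(const std::vector<BlockMeta>& blocks,
                              uint64_t target) {
    uint64_t current = 0;
    for (size_t i = 0; i < blocks.size(); ++i) {
        // Skip the entire block while current offset + decompressed size
        // <= desired offset: no decompression is needed for it.
        if (current + blocks[i].decompressedSize <= target) {
            current += blocks[i].decompressedSize;
            continue;
        }
        return {i, target - current, 1};
    }
    return {blocks.size(), 0, 0};  // target is at or past end of stream
}
```

For example, with four blocks of 256 decompressed bytes each, seeking to offset 700 lands in block 2 at offset 188 and decompresses exactly one block; the reader the thread describes instead decompressed every intermediate block before discarding its bytes.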
Re: C++ API seekToRow() performance.

Posted by Gang Wu <ga...@apache.org>.
Hi Shankar,

I will work on it this week.

Thanks!
Gang


Re: C++ API seekToRow() performance.

Posted by Shankar Iyer <sh...@gmail.com>.
Hi Gang,

    Is it possible to give an update or time frame for this?

Thanks,
Shankar


Re: C++ API seekToRow() performance.

Posted by Shankar Iyer <sh...@gmail.com>.
Thanks again! Please let me know when the patch is ready to test. We have a
couple of features for minimizing file access that are completely dependent
on the seek mechanism.

Regards,
Shankar


Re: C++ API seekToRow() performance.

Posted by Gang Wu <ga...@apache.org>.
Hi Shankar,

The fix is in our internal repo at the moment. I will let you know when it
is ready to test.

Thanks,
Gang


Re: C++ API seekToRow() performance.

Posted by Shankar Iyer <sh...@gmail.com>.
Thanks Gang. Since you mentioned back-porting, is the fix already available
in some branch/commit? I can test it. Please let me know!

Regards
Shankar
