Posted to user@orc.apache.org by Korry Douglas <ko...@me.com> on 2019/03/11 20:19:00 UTC

Row index questions

I’m making progress on predicate pushdown using the C++ ORC API.

However, I have not been able to find any sample data that contains more than one row group for any given stripe.

In other words, StripeStatistics::getNumberOfRowIndexStats() always returns 1.

My understanding is that a given stripe contains a “row index” that I can use to skip over unnecessary rows.  If I have a single row group (that is, if getNumberOfRowIndexStats() returns 1) the row index stats are equal to the stripe stats.  I suspect that will not always be the case, otherwise there would be no need for the row index.
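
As an illustration of the skipping idea described above, here is a minimal sketch in plain Python. It does not use the ORC library: RowGroupStats and groups_to_read are hypothetical names, and the min/max values are made up. It only shows how per-row-group statistics could rule out row groups for a range predicate. With a single row group the result can only be "read everything" or "read nothing", which is the same decision the stripe statistics already allow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max of one column in one row group."""
    minimum: int
    maximum: int


def groups_to_read(stats: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    """Return the indexes of row groups whose [min, max] overlaps [lo, hi].

    Row groups that do not overlap the predicate range cannot contain a
    matching row, so they can be skipped.
    """
    return [i for i, s in enumerate(stats) if s.maximum >= lo and s.minimum <= hi]


# Made-up statistics for a stripe with three row groups.
stats = [
    RowGroupStats(0, 9999),
    RowGroupStats(10000, 19999),
    RowGroupStats(20000, 24999),
]

# Predicate: value BETWEEN 12000 AND 13000
print(groups_to_read(stats, 12000, 13000))
```

Only the indexes returned by groups_to_read would need to be decoded; the others can be skipped.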

Can someone please point me to some sample data that contains multiple row groups per stripe?

Or point me to some simple code that I can use to create such a file?  Something in Python preferably.

Also, is there a way to find out how many rows are in a row group?  I can call StripeStatistics::getNumberOfRowIndexStats(), but that requires a columnId, implying that I might get different counts depending on the columnId that I provide. 

Can I rely on the value returned by Reader::RowIndexStride()?  The last row group may hold fewer rows than RowIndexStride(), but I can handle that.
Remember that I want to use the row index to skip row groups that I don’t need, and to do that I need an accurate count of the number of rows in each group.
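
The row counts asked about here can be sketched as simple arithmetic, assuming that every row group except the last one in a stripe holds exactly one stride of rows. That assumption, and the function names row_group_sizes and row_group_range, are not confirmed anywhere in this thread. The stripe row count and the stride are plain integers in this sketch; in real code they would come from the reader.

```python
from typing import List, Tuple


def row_group_sizes(stripe_rows: int, stride: int) -> List[int]:
    """Rows per row group, assuming every group but the last is full."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    full, rest = divmod(stripe_rows, stride)
    return [stride] * full + ([rest] if rest else [])


def row_group_range(group: int, stripe_rows: int, stride: int) -> Tuple[int, int]:
    """Half-open [first, end) row range of one row group within its stripe."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    first = group * stride
    if not 0 <= first < stripe_rows:
        raise IndexError("row group index out of range")
    return first, min(first + stride, stripe_rows)


print(row_group_sizes(25000, 10000))     # three groups, the last one short
print(row_group_range(2, 25000, 10000))  # rows covered by the last group
```

The half-open range makes it easy to translate a surviving row group into "seek to first, read end - first rows".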

Thanks.


		— Korry


Re: Row index questions

Posted by Gang Wu <us...@gmail.com>.
You are right. These files were created for unit tests of specific
scenarios. You may need to create ORC files by yourself.

Gang

On Mon, Mar 11, 2019 at 3:18 PM Korry Douglas <ko...@me.com> wrote:

> The demo_11 table seems to have a stride of 5000 rows (IIRC) and the whole
> file contains more than a million rows.  But still one row group per stripe.
>
> -- Korry
>
>
> On Mar 11, 2019 6:00 PM, Gang Wu <ga...@apache.org> wrote:
>
> The default number of rows in a row group (the row index stride) is 10000,
> which you can get from Reader::RowIndexStride(). You probably need to
> create more rows of data to verify this.
>
> Thanks
> Gang
>
> On Mon, Mar 11, 2019 at 1:19 PM Korry Douglas <ko...@me.com> wrote:
>
> I’m making progress on predicate pushdown using the C++ ORC api.
>
> However, I have not been able to find any sample data that contains more
> than one row group for any given stripe.
>
> In other words, StripeStatistics::getNumberOfRowIndexStats() always
> returns 1.
>
> My understanding is that a given stripe contains a “row index” that I can
> use to skip over unnecessary rows.  If I have a single row group (that is,
> if getNumberOfRowIndexStats() returns 1) the row index stats are equal to
> the stripe stats.  I suspect that will not always be the case, otherwise
> there would be no need for the row index.
>
> Can someone please point me to some sample data that contains multiple row
> groups per stripe?
>
> Or point me to some simple code that I can use to create such a file?
> Something in Python preferably.
>
> Also, is there a way to find out how many rows are in a row group?  I can
> call StripeStatistics::getNumberOfRowIndexStats(), but that requires a
> columnId, implying that I might get different counts depending on the
> columnId that I provide.
>
> Can I rely on the value returned by Reader::RowIndexStride()?  The last
> row group may be less than the RowIndexStride() but I can handle that.
> Remember that I want to use row index to skip row groups that I don’t need
> and to do that I need an accurate count of the number of rows in each group
> (but I can handle the fact that the *last* row group may differ from the
> RowIndexStride()).
>
> Thanks.
>
>
> — Korry
>
>
>

Re: Row index questions

Posted by Korry Douglas <ko...@me.com>.
The demo_11 table seems to have a stride of 5000 rows (IIRC) and the whole
file contains more than a million rows.  But still one row group per stripe.

-- Korry

On Mar 11, 2019 6:00 PM, Gang Wu <ga...@apache.org> wrote:

> The default number of rows in a row group (the row index stride) is 10000,
> which you can get from Reader::RowIndexStride(). You probably need to
> create more rows of data to verify this.
>
> Thanks
> Gang
>
> On Mon, Mar 11, 2019 at 1:19 PM Korry Douglas <korry@me.com> wrote:
>
>> I’m making progress on predicate pushdown using the C++ ORC API.
>>
>> However, I have not been able to find any sample data that contains more
>> than one row group for any given stripe.
>>
>> In other words, StripeStatistics::getNumberOfRowIndexStats() always
>> returns 1.
>>
>> My understanding is that a given stripe contains a “row index” that I can
>> use to skip over unnecessary rows.  If I have a single row group (that is,
>> if getNumberOfRowIndexStats() returns 1) the row index stats are equal to
>> the stripe stats.  I suspect that will not always be the case, otherwise
>> there would be no need for the row index.
>>
>> Can someone please point me to some sample data that contains multiple row
>> groups per stripe?
>>
>> Or point me to some simple code that I can use to create such a file?
>> Something in Python preferably.
>>
>> Also, is there a way to find out how many rows are in a row group?  I can
>> call StripeStatistics::getNumberOfRowIndexStats(), but that requires a
>> columnId, implying that I might get different counts depending on the
>> columnId that I provide.
>>
>> Can I rely on the value returned by Reader::RowIndexStride()?  The last
>> row group may be less than the RowIndexStride() but I can handle that.
>> Remember that I want to use row index to skip row groups that I don’t need
>> and to do that I need an accurate count of the number of rows in each group
>> (but I can handle the fact that the *last* row group may differ from the
>> RowIndexStride()).
>>
>> Thanks.
>>
>> -- Korry


Re: Row index questions

Posted by Gang Wu <ga...@apache.org>.
The default number of rows in a row group (the row index stride) is 10000,
which you can get from Reader::RowIndexStride(). You probably need to
create more rows of data to verify this.
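
To make "more rows of data" concrete, here is a rough sketch of when a stripe would be expected to report more than one row group. It assumes a new row group starts every stride rows within a stripe. The helper names and the per-stripe row counts are invented for illustration and are not taken from any real file.

```python
from typing import List


def row_group_count(stripe_rows: int, stride: int = 10000) -> int:
    """Expected number of row groups in a stripe: ceil(stripe_rows / stride)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return -(-stripe_rows // stride)  # ceiling division on integers


def stripes_with_multiple_groups(stripe_row_counts: List[int],
                                 stride: int = 10000) -> List[int]:
    """Indexes of stripes that should contain more than one row group."""
    return [i for i, rows in enumerate(stripe_row_counts)
            if row_group_count(rows, stride) > 1]


# Invented per-stripe row counts for a test file.
stripe_rows = [5000, 10000, 10001, 25000]
print([row_group_count(r) for r in stripe_rows])
print(stripes_with_multiple_groups(stripe_rows))
```

Under this assumption a stripe needs at least stride + 1 rows before a second row group appears, so a test file has to put more than one stride of rows into a single stripe, not just into the file as a whole.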

Thanks
Gang

On Mon, Mar 11, 2019 at 1:19 PM Korry Douglas <ko...@me.com> wrote:

> I’m making progress on predicate pushdown using the C++ ORC api.
>
> However, I have not been able to find any sample data that contains more
> than one row group for any given stripe.
>
> In other words, StripeStatistics::getNumberOfRowIndexStats() always
> returns 1.
>
> My understanding is that a given stripe contains a “row index” that I can
> use to skip over unnecessary rows.  If I have a single row group (that is,
> if getNumberOfRowIndexStats() returns 1) the row index stats are equal to
> the stripe stats.  I suspect that will not always be the case, otherwise
> there would be no need for the row index.
>
> Can someone please point me to some sample data that contains multiple row
> groups per stripe?
>
> Or point me to some simple code that I can use to create such a file?
> Something in Python preferably.
>
> Also, is there a way to find out how many rows are in a row group?  I can
> call StripeStatistics::getNumberOfRowIndexStats(), but that requires a
> columnId, implying that I might get different counts depending on the
> columnId that I provide.
>
> Can I rely on the value returned by Reader::RowIndexStride()?  The last
> row group may be less than the RowIndexStride() but I can handle that.
> Remember that I want to use row index to skip row groups that I don’t need
> and to do that I need an accurate count of the number of rows in each group
> (but I can handle the fact that the *last* row group may differ from the
> RowIndexStride()).
>
> Thanks.
>
>
> — Korry
>
>