Posted to user@drill.apache.org by Veera Naranammalpuram <vn...@maprtech.com> on 2016/09/19 14:09:40 UTC

LIMIT push down to parquet row group

Does anyone know whether and how LIMIT push-down to Parquet files works?

I have a Parquet file with 53K records in one row group. When I run a SELECT
* from <table> LIMIT 1, I see the Parquet reader operator process 32768
records. I would have expected either 1 or 53K. So, two questions:

1) Does the Parquet MR library offer the ability to push LIMITs down to
Parquet files? From the above, the answer looks like yes.
2) If so, how does Drill come up with the magic number 32768? Is there a
way I can make it read just 1 row if the query is a LIMIT 1?

-- 
Veera Naranammalpuram
Product Specialist - SQL on Hadoop
*MapR Technologies (www.mapr.com <http://www.mapr.com>)*
*(Email) vnaranammalpuram@maprtech.com <na...@maprtech.com>*
*(Mobile) 917 683 8116 - can text *
*Timezone: ET (UTC -5:00 / -4:00)*

Re: LIMIT push down to parquet row group

Posted by Aman Sinha <am...@apache.org>.
Adding to what Jinfeng said, LIMIT handling relies on a 'kill incoming
input stream' API, which the parent operator (Limit) calls on its child
once it has received the required number of rows. Since the unit of
processing in Drill is the record batch, the downstream operator needs to
wait until at least one full batch (not one row) has been received. In this
case the batch size happens to be 32K records; this is an internal constant
in the Parquet reader when any of the columns is variable-width, see
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java#L69
.
I think we could enhance this behavior for LIMIT so that the internal
batch size is aware of the limit value. Do you want to file an enhancement
JIRA?
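The batch-granularity behavior described above can be sketched roughly as
follows. This is a hypothetical illustration, not Drill's actual classes or
code: the reader hands whole record batches downstream, and the Limit
operator can only stop pulling once an entire batch has arrived, so even
LIMIT 1 materializes a full 32768-row batch:

```python
# Illustrative sketch only; names and sizes are assumptions, not Drill's code.

BATCH_SIZE = 32768  # stand-in for the Parquet reader's internal batch cap


def read_batches(total_rows, batch_size=BATCH_SIZE):
    """Reader yields whole record batches, never individual rows."""
    produced = 0
    while produced < total_rows:
        n = min(batch_size, total_rows - produced)
        yield list(range(produced, produced + n))  # stand-in for a record batch
        produced += n


def limit(batches, n):
    """Limit operator: consumes whole batches, then 'kills' the upstream
    stream by simply not pulling any more batches once n rows have arrived."""
    rows_seen = 0
    out = []
    for batch in batches:
        rows_seen += len(batch)           # a full batch is counted, not 1 row
        out.extend(batch[: n - len(out)])
        if len(out) >= n:
            break                         # stop pulling -> upstream scan stops
    return out, rows_seen


result, scanned = limit(read_batches(53_000), 1)
print(len(result), scanned)  # 1 32768
```

Even here, making the reader's batch size aware of the limit value (as
suggested above) would let `scanned` drop to 1.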

On Mon, Sep 19, 2016 at 7:28 AM, Jinfeng Ni <jn...@apache.org> wrote:

> Drill applies LIMIT filtering at the row group level.  For LIMIT n, it
> will scan the first m row groups that together have at least n rows, and
> discard the rest of the row groups. In your case, since you have only 1
> row group, there is no row group filtering for LIMIT 1.
>
> I'm not sure where 32768 comes from. It's possibly the size of
> the first batch, depending on the column data in your file.
>

Re: LIMIT push down to parquet row group

Posted by Jinfeng Ni <jn...@apache.org>.
Drill applies LIMIT filtering at the row group level.  For LIMIT n, it
will scan the first m row groups that together have at least n rows, and
discard the rest of the row groups. In your case, since you have only 1
row group, there is no row group filtering for LIMIT 1.

I'm not sure where 32768 comes from. It's possibly the size of
the first batch, depending on the column data in your file.
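The row-group pruning described above can be sketched as follows. This is a
hypothetical illustration of the idea, not Drill's planner code: keep the
shortest prefix of row groups whose combined row count covers LIMIT n, and
drop the rest from the scan:

```python
# Illustrative sketch only; not Drill's actual planning code.

def prune_row_groups(row_group_sizes, limit_n):
    """Return the prefix of row groups that together hold at least limit_n rows."""
    kept, covered = [], 0
    for size in row_group_sizes:
        kept.append(size)
        covered += size
        if covered >= limit_n:
            break  # remaining row groups are discarded from the scan
    return kept


# One 53K-row group: nothing can be pruned for LIMIT 1.
print(prune_row_groups([53_000], 1))            # [53000]
# Several small row groups: only the first is scanned for LIMIT 1.
print(prune_row_groups([1000, 1000, 1000], 1))  # [1000]
```

This is why a file written as many small row groups prunes well under LIMIT,
while a single large row group (as in your case) cannot be pruned at all.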
