Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2017/05/01 19:26:36 UTC

Parquet, Arrow, and Drill Roadmap

Hey all - I posted this to both dev and user as I could mentally make the
argument for both.

Sorry if this is answered somewhere already. I know in the past, there have
been discussions around using two different readers for Parquet, and
performance gains/losses, issues, etc.

Right now, store.parquet.use_new_reader is set to false (1.10) and I
was trying to get more information about this and what the eventual roadmap
for Drill will be...

So, in the docs I see use_new_reader is listed as not supported in this
release. What I am looking for is a little information on:

- What the two readers are (is one a special Drill thing, is the other a
standard reader from the Parquet project?)
- What is the eventual goal here... to be able to use and switch between
both? To provide the option? To have code parity with another project?
- Does either of the readers work with Arrow?
- How do the Arrow and Parquet readers fit together?
- Will the readers ever converge?

I have other questions too, but these are examples of where I am coming
from. Is there a good starting place for my research on this subject?

Thanks,
John

Re: Parquet, Arrow, and Drill Roadmap

Posted by John Omernik <jo...@omernik.com>.
I've created a JIRA on this request. The idea here being some higher level
descriptions of these projects (I included Calcite in the JIRA too), what
they do for the project, what the current state of integration is, what
options we have for future states, and what benefits those future states
bring. For Parquet, I think we could go deeper into some of the
settings/tweaks with real world examples to help folks do data better.

Thanks!


https://issues.apache.org/jira/browse/DRILL-5471

On Tue, May 2, 2017 at 1:46 PM, Padma Penumarthy <pp...@mapr.com>
wrote:

> One thing I want to add is that use_new_reader uses the reader from the
> parquet-mr library, whereas
> the default one is Drill’s native reader, which is supposed to be better
> performance-wise.
> But it does not support complex types, so we automatically switch to the
> reader from the parquet library
> when we have to read complex types.
>
> Thanks,
> Padma
>
>
> On May 2, 2017, at 11:09 AM, Jinfeng Ni <jni@apache.org> wrote:
>
>
> - What the two readers are (is one a special drill thing, is the other  a
> standard reader from the parquet project?)
> - What is the eventual goal here... to be able to use and switch between
> both? To provide the option? To have code parity with another project?
>
> Both readers are for reading Parquet data into Drill's value vectors.
> The default one (when store.parquet.use_new_reader is false) was
> faster (based on measurements done by people who worked on the two
> readers), but it could not support complex types like map/array.  The
> new reader would be used by Drill either if you change the option to
> true, or when the Parquet data you are querying contains complex types
> (even with the default option being false). Therefore, both readers
> might be used by Drill code.
>
> There was a Parquet hackathon some time ago, which aimed to get
> people from different projects that use Parquet to work together to
> standardize a vectorized reader. I did not keep track of that effort.
> People with better knowledge of that may share their input.
>
>
> - Do either of the readers work with Arrow?
>
> For now, neither works with Arrow, since Drill has not integrated with
> Arrow yet. See DRILL-4455 for the latest discussion
> (https://issues.apache.org/jira/browse/DRILL-4455).  I would expect
> Drill's parquet reader will work with Arrow, once the integration is
> done.
>
>

Re: Parquet, Arrow, and Drill Roadmap

Posted by Padma Penumarthy <pp...@mapr.com>.
One thing I want to add is that use_new_reader uses the reader from the parquet-mr library, whereas
the default one is Drill’s native reader, which is supposed to be better performance-wise.
But it does not support complex types, so we automatically switch to the reader from the parquet library
when we have to read complex types.

Thanks,
Padma


On May 2, 2017, at 11:09 AM, Jinfeng Ni <jn...@apache.org> wrote:


- What the two readers are (is one a special drill thing, is the other  a
standard reader from the parquet project?)
- What is the eventual goal here... to be able to use and switch between
both? To provide the option? To have code parity with another project?

Both readers are for reading Parquet data into Drill's value vectors.
The default one (when store.parquet.use_new_reader is false) was
faster (based on measurements done by people who worked on the two
readers), but it could not support complex types like map/array.  The
new reader would be used by Drill either if you change the option to
true, or when the Parquet data you are querying contains complex types
(even with the default option being false). Therefore, both readers
might be used by Drill code.

There was a Parquet hackathon some time ago, which aimed to get
people from different projects that use Parquet to work together to
standardize a vectorized reader. I did not keep track of that effort.
People with better knowledge of that may share their input.


- Do either of the readers work with Arrow?

For now, neither works with Arrow, since Drill has not integrated with
Arrow yet. See DRILL-4455 for the latest discussion
(https://issues.apache.org/jira/browse/DRILL-4455).  I would expect
Drill's parquet reader will work with Arrow, once the integration is
done.


Re: Parquet, Arrow, and Drill Roadmap

Posted by Jinfeng Ni <jn...@apache.org>.
>
> - What the two readers are (is one a special drill thing, is the other  a
> standard reader from the parquet project?)
> - What is the eventual goal here... to be able to use and switch between
> both? To provide the option? To have code parity with another project?

Both readers are for reading Parquet data into Drill's value vectors.
The default one (when store.parquet.use_new_reader is false) was
faster (based on measurements done by people who worked on the two
readers), but it could not support complex types like map/array.  The
new reader would be used by Drill either if you change the option to
true, or when the Parquet data you are querying contains complex types
(even with the default option being false). Therefore, both readers
might be used by Drill code.
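
To make the selection rule above concrete, here is a rough Python sketch of
the decision as described in this thread. It is only a model of the behavior,
not Drill's actual code: the names ParquetColumn and choose_reader, the type
strings, and the "native" / "parquet-mr" labels are all made up for
illustration.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumption: only map and array are treated as "complex" here, because those
# are the two examples given in the thread. Real Parquet schemas are richer.
COMPLEX_TYPES = {"map", "array"}


@dataclass(frozen=True)
class ParquetColumn:
    """Hypothetical stand-in for a column in a Parquet schema."""
    name: str
    type_name: str  # e.g. "int32", "varchar", "map", "array"


def choose_reader(columns: Iterable[ParquetColumn],
                  use_new_reader: bool = False) -> str:
    """Model which reader handles a scan, following the rule in the thread.

    use_new_reader mirrors the store.parquet.use_new_reader option
    (false by default).
    """
    # Option explicitly set to true: the parquet-mr based reader is used.
    if use_new_reader:
        return "parquet-mr"
    # Option left at false, but the data has complex types: the native
    # reader cannot handle them, so the parquet-mr based reader is used anyway.
    if any(col.type_name in COMPLEX_TYPES for col in columns):
        return "parquet-mr"
    # Otherwise the faster native reader is used.
    return "native"


flat = [ParquetColumn("id", "int32"), ParquetColumn("name", "varchar")]
nested = flat + [ParquetColumn("tags", "array")]

print(choose_reader(flat))                       # flat data, default option
print(choose_reader(nested))                     # complex type forces the switch
print(choose_reader(flat, use_new_reader=True))  # option overrides the default
```

The point of the sketch is that the option only matters for flat data: once a
complex type is present, the outcome is the same whether the option is true or
false.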

There was a Parquet hackathon some time ago, which aimed to get
people from different projects that use Parquet to work together to
standardize a vectorized reader. I did not keep track of that effort.
People with better knowledge of that may share their input.


> - Do either of the readers work with Arrow?

For now, neither works with Arrow, since Drill has not integrated with
Arrow yet. See DRILL-4455 for the latest discussion
(https://issues.apache.org/jira/browse/DRILL-4455).  I would expect
Drill's parquet reader will work with Arrow, once the integration is
done.
