Posted to dev@arrow.apache.org by Anoop Johnson <an...@gmail.com> on 2019/08/04 21:52:07 UTC

Re: Parquet to Arrow in Java

Thanks for the response Micah. I could implement this and contribute to
Arrow Java. To help me get started, are there any pointers on how the C++
or Rust implementations currently read Parquet into Arrow? Are they reading
Parquet row-by-row and building Arrow batches or are there better ways of
implementing this?
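
For concreteness, the row-by-row route I had in mind would look roughly like
the sketch below - read records with parquet-mr's example Group API and copy
values into Arrow vectors one at a time. The single INT32 column named "id"
and the class name are just assumptions for illustration.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class RowByRowSketch {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ParquetReader<Group> reader =
             ParquetReader.builder(new GroupReadSupport(), new Path(args[0])).build();
         IntVector ids = new IntVector("id", allocator)) {
      ids.allocateNew();
      int row = 0;
      Group record;
      while ((record = reader.read()) != null) {
        // One Parquet record at a time -> one Arrow slot at a time.
        ids.setSafe(row++, record.getInteger("id", 0));
      }
      ids.setValueCount(row);
      System.out.println("Read " + row + " rows into an Arrow IntVector");
    }
  }
}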

On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Anoop,
> There isn't currently anything in the Arrow Java library that does this.
> It is something that I think we want to add at some point.   Dremio [1] has
> some Parquet related code, but I haven't looked at it to understand how
> easy it is to use as a standalone library and whether it supports predicate
> push-down/column selection.
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>
> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <an...@gmail.com>
> wrote:
>
> > Arrow Newbie here.  What is the recommended way to convert Parquet data
> > into Arrow, preferably doing predicate/column pushdown?
> >
> > One can implement this as custom code using the Parquet API, and
> re-encode
> > it in Arrow using the Arrow APIs, but is this supported by Arrow out of
> the
> > box?
> >
> > Thanks,
> > Anoop
> >
>

Re: Parquet to Arrow in Java

Posted by Chao Sun <su...@uber.com.INVALID>.
Thanks, Uwe, for pointing out the Iceberg effort - I will take a look. It would
still be good to have a "standard" Parquet-to-Arrow reader implementation live
in the Arrow project, though, so that in the future different projects can just
refer to it instead of implementing their own.

Chao

On Wed, Sep 4, 2019 at 10:46 AM Uwe L. Korn <ma...@uwekorn.com> wrote:

> Hello,
>
> You may want to interact with the Apache Iceberg community here. They are
> currently doing a similar thing:
> https://lists.apache.org/thread.html/3bb4f89a0b37f474cf67915f91326fa845afa597bdd2463c98a2c8b9@%3Cdev.iceberg.apache.org%3E
> I'm not involved in this, just reading both mailing lists and thought I'd
> share this.
>
> Cheers
> Uwe
>
> On Wed, Sep 4, 2019, at 7:24 PM, Chao Sun wrote:
> > Bumping this.
> >
> > We may have an upcoming use case for this as well. Want to know if anyone
> > is actively working on this? I also heard that Dremio has internally
> > implemented a performant Parquet to Arrow reader. Is there any plan to
> open
> > source it? That could save us a lot of work.
> >
> > Thanks,
> > Chao
> >
> > On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <li...@gmail.com>
> wrote:
> >
> > > Hi:
> > >
> > > I'm working on the Rust part and expect to finish it soon. I'm also
> > > interested in the Java version because we are trying to embed Arrow in
> > > Spark to implement vectorized processing. Maybe we can work together.
> > >
> > > Micah Kornfield <em...@gmail.com> wrote on Monday, August 5, 2019 at 1:50 PM:
> > >
> > > > Hi Anoop,
> > > > I think a contribution would be welcome.  There was a recent
> discussion
> > > > thread on what would be expected from new "readers" for Arrow data in
> > > Java
> > > > [1].  I think it's worth reading through, but my recollections of the
> > > > highlights are:
> > > > 1.  A short design sketch in the JIRA that will track the work.
> > > > 2.  Off-heap data-structures as much as possible
> > > > 3.  An interface that allows predicate push down, column projection
> and
> > > > specifying the batch sizes of reads.  I think there is probably some
> > > > interplay here between RowGroup size and size of batches.  It might be
> > > > worth thinking about this up front and mentioning it in the design.
> > > > 4.  Performant (since we are going from columnar->columnar it should be
> > > > faster than Parquet-MR and on par with or better than Spark's
> > > > implementation, which I believe also goes from columnar to columnar).
> > > >
> > > > Answers to specific questions below.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > To help me get started, are there any pointers on how the C++ or Rust
> > > > > implementations currently read Parquet into Arrow?
> > > >
> > > > I'm not sure about the Rust code, but the C++ code is located at [2];
> > > > it has been undergoing some recent refactoring (and I think Wes might
> > > > have 1 or 2 changes still to make).  It doesn't yet support nested data
> > > > types fully (e.g. structs).
> > > >
> > > > Are they reading Parquet row-by-row and building Arrow batches or are
> > > there
> > > > > better ways of implementing this?
> > > >
> > > > I believe the implementations should be reading a row-group at a time
> > > > column by column.  Spark potentially has an implementation that
> already
> > > > does this.
> > > >
> > > >
> > > > [1]
> > > > https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > > > [2]
> > > > https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> > > >
> > > > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <
> anoop.k.johnson@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for the response Micah. I could implement this and
> contribute to
> > > > > Arrow Java. To help me get started, are there any pointers on how
> the
> > > C++
> > > > > or Rust implementations currently read Parquet into Arrow? Are they
> > > > reading
> > > > > Parquet row-by-row and building Arrow batches or are there better
> ways
> > > of
> > > > > implementing this?
> > > > >
> > > > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <
> emkornfield@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >> Hi Anoop,
> > > > >> There isn't currently anything in the Arrow Java library that does
> > > this.
> > > > >> It is something that I think we want to add at some point.
>  Dremio
> > > [1]
> > > > >> has
> > > > >> some Parquet related code, but I haven't looked at it to
> understand
> > > how
> > > > >> easy it is to use as a standalone library and whether it supports
> > > > >> predicate
> > > > >> push-down/column selection.
> > > > >>
> > > > >> Thanks,
> > > > >> Micah
> > > > >>
> > > > >> [1]
> > > > >> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > > > >>
> > > > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <
> > > > anoop.k.johnson@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Arrow Newbie here.  What is the recommended way to convert
> Parquet
> > > > data
> > > > >> > into Arrow, preferably doing predicate/column pushdown?
> > > > >> >
> > > > >> > One can implement this as custom code using the Parquet API, and
> > > > >> re-encode
> > > > >> > it in Arrow using the Arrow APIs, but is this supported by
> Arrow out
> > > > of
> > > > >> the
> > > > >> > box?
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Anoop
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Parquet to Arrow in Java

Posted by "Uwe L. Korn" <ma...@uwekorn.com>.
Hello,

You may want to interact with the Apache Iceberg community here. They are currently doing a similar thing: https://lists.apache.org/thread.html/3bb4f89a0b37f474cf67915f91326fa845afa597bdd2463c98a2c8b9@%3Cdev.iceberg.apache.org%3E I'm not involved in this, just reading both mailing lists and thought I'd share this.

Cheers
Uwe

On Wed, Sep 4, 2019, at 7:24 PM, Chao Sun wrote:
> Bumping this.
> 
> We may have an upcoming use case for this as well. Want to know if anyone
> is actively working on this? I also heard that Dremio has internally
> implemented a performant Parquet to Arrow reader. Is there any plan to open
> > source it? That could save us a lot of work.
> 
> Thanks,
> Chao
> 
> On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <li...@gmail.com> wrote:
> 
> > Hi:
> >
> > I'm working on the Rust part and expect to finish it soon. I'm
> > also interested in the Java version because we are trying to embed Arrow in
> > Spark to implement vectorized processing. Maybe we can work together.
> >
> > Micah Kornfield <em...@gmail.com> wrote on Monday, August 5, 2019 at 1:50 PM:
> >
> > > Hi Anoop,
> > > I think a contribution would be welcome.  There was a recent discussion
> > > thread on what would be expected from new "readers" for Arrow data in
> > Java
> > > [1].  I think it's worth reading through, but my recollections of the
> > > highlights are:
> > > 1.  A short design sketch in the JIRA that will track the work.
> > > 2.  Off-heap data-structures as much as possible
> > > 3.  An interface that allows predicate push down, column projection and
> > > specifying the batch sizes of reads.  I think there is probably some
> > > interplay here between RowGroup size and size of batches.  It might be
> > > worth thinking about this up front and mentioning it in the design.
> > > 4.  Performant (since we are going from columnar->columnar it should be
> > > faster than Parquet-MR and on par with or better than Spark's implementation,
> > > which I believe also goes from columnar to columnar).
> > >
> > > Answers to specific questions below.
> > >
> > > Thanks,
> > > Micah
> > >
> > > To help me get started, are there any pointers on how the C++ or Rust
> > > > implementations currently read Parquet into Arrow?
> > >
> > > I'm not sure about the Rust code, but the C++ code is located at [2]; it
> > > has been undergoing some recent refactoring (and I think Wes might have
> > > 1 or 2 changes still to make).  It doesn't yet support nested data types
> > > fully (e.g. structs).
> > >
> > > Are they reading Parquet row-by-row and building Arrow batches or are
> > there
> > > > better ways of implementing this?
> > >
> > > I believe the implementations should be reading a row-group at a time
> > > column by column.  Spark potentially has an implementation that already
> > > does this.
> > >
> > >
> > > [1]
> > > https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > > [2]
> > > https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> > >
> > > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <an...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the response Micah. I could implement this and contribute to
> > > > Arrow Java. To help me get started, are there any pointers on how the
> > C++
> > > > or Rust implementations currently read Parquet into Arrow? Are they
> > > reading
> > > > Parquet row-by-row and building Arrow batches or are there better ways
> > of
> > > > implementing this?
> > > >
> > > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfield@gmail.com
> > >
> > > > wrote:
> > > >
> > > >> Hi Anoop,
> > > >> There isn't currently anything in the Arrow Java library that does
> > this.
> > > >> It is something that I think we want to add at some point.   Dremio
> > [1]
> > > >> has
> > > >> some Parquet related code, but I haven't looked at it to understand
> > how
> > > >> easy it is to use as a standalone library and whether it supports
> > > >> predicate
> > > >> push-down/column selection.
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1]
> > > >> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > > >>
> > > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <
> > > anoop.k.johnson@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Arrow Newbie here.  What is the recommended way to convert Parquet
> > > data
> > > >> > into Arrow, preferably doing predicate/column pushdown?
> > > >> >
> > > >> > One can implement this as custom code using the Parquet API, and
> > > >> re-encode
> > > >> > it in Arrow using the Arrow APIs, but is this supported by Arrow out
> > > of
> > > >> the
> > > >> > box?
> > > >> >
> > > >> > Thanks,
> > > >> > Anoop
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Parquet to Arrow in Java

Posted by Chao Sun <su...@uber.com.INVALID>.
Bumping this.

We may have an upcoming use case for this as well. We'd like to know whether
anyone is actively working on this. I also heard that Dremio has internally
implemented a performant Parquet-to-Arrow reader. Is there any plan to open
source it? That could save us a lot of work.

Thanks,
Chao

On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <li...@gmail.com> wrote:

> Hi:
>
> > I'm working on the Rust part and expect to finish it soon. I'm
> > also interested in the Java version because we are trying to embed Arrow in
> > Spark to implement vectorized processing. Maybe we can work together.
>
> Micah Kornfield <em...@gmail.com> wrote on Monday, August 5, 2019 at 1:50 PM:
>
> > Hi Anoop,
> > I think a contribution would be welcome.  There was a recent discussion
> > thread on what would be expected from new "readers" for Arrow data in
> Java
> > [1].  I think it's worth reading through, but my recollections of the
> > highlights are:
> > 1.  A short design sketch in the JIRA that will track the work.
> > 2.  Off-heap data-structures as much as possible
> > 3.  An interface that allows predicate push down, column projection and
> > specifying the batch sizes of reads.  I think there is probably some
> > interplay here between RowGroup size and size of batches.  It might be
> > worth thinking about this up front and mentioning it in the design.
> > 4.  Performant (since we are going from columnar->columnar it should be
> > faster than Parquet-MR and on par with or better than Spark's implementation,
> > which I believe also goes from columnar to columnar).
> >
> > Answers to specific questions below.
> >
> > Thanks,
> > Micah
> >
> > To help me get started, are there any pointers on how the C++ or Rust
> > > implementations currently read Parquet into Arrow?
> >
> > I'm not sure about the Rust code, but the C++ code is located at [2]; it
> > has been undergoing some recent refactoring (and I think Wes might have
> > 1 or 2 changes still to make).  It doesn't yet support nested data types
> > fully (e.g. structs).
> >
> > Are they reading Parquet row-by-row and building Arrow batches or are
> there
> > > better ways of implementing this?
> >
> > I believe the implementations should be reading a row-group at a time
> > column by column.  Spark potentially has an implementation that already
> > does this.
> >
> >
> > [1]
> > https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > [2]
> > https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> >
> > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <an...@gmail.com>
> > wrote:
> >
> > > Thanks for the response Micah. I could implement this and contribute to
> > > Arrow Java. To help me get started, are there any pointers on how the
> C++
> > > or Rust implementations currently read Parquet into Arrow? Are they
> > reading
> > > Parquet row-by-row and building Arrow batches or are there better ways
> of
> > > implementing this?
> > >
> > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > >> Hi Anoop,
> > >> There isn't currently anything in the Arrow Java library that does
> this.
> > >> It is something that I think we want to add at some point.   Dremio
> [1]
> > >> has
> > >> some Parquet related code, but I haven't looked at it to understand
> how
> > >> easy it is to use as a standalone library and whether it supports
> > >> predicate
> > >> push-down/column selection.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1]
> > >> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > >>
> > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <
> > anoop.k.johnson@gmail.com>
> > >> wrote:
> > >>
> > >> > Arrow Newbie here.  What is the recommended way to convert Parquet
> > data
> > >> > into Arrow, preferably doing predicate/column pushdown?
> > >> >
> > >> > One can implement this as custom code using the Parquet API, and
> > >> re-encode
> > >> > it in Arrow using the Arrow APIs, but is this supported by Arrow out
> > of
> > >> the
> > >> > box?
> > >> >
> > >> > Thanks,
> > >> > Anoop
> > >> >
> > >>
> > >
> >
>

Re: Parquet to Arrow in Java

Posted by Renjie Liu <li...@gmail.com>.
Hi:

I'm working on the Rust part and expect to finish it soon. I'm
also interested in the Java version because we are trying to embed Arrow in
Spark to implement vectorized processing. Maybe we can work together.

Micah Kornfield <em...@gmail.com> wrote on Monday, August 5, 2019 at 1:50 PM:

> Hi Anoop,
> I think a contribution would be welcome.  There was a recent discussion
> thread on what would be expected from new "readers" for Arrow data in Java
> [1].  I think it's worth reading through, but my recollections of the
> highlights are:
> 1.  A short design sketch in the JIRA that will track the work.
> 2.  Off-heap data-structures as much as possible
> 3.  An interface that allows predicate push down, column projection and
> specifying the batch sizes of reads.  I think there is probably some
> interplay here between RowGroup size and size of batches.  It might be
> worth thinking about this up front and mentioning it in the design.
> 4.  Performant (since we are going from columnar->columnar it should be
> faster than Parquet-MR and on par with or better than Spark's implementation,
> which I believe also goes from columnar to columnar).
>
> Answers to specific questions below.
>
> Thanks,
> Micah
>
> To help me get started, are there any pointers on how the C++ or Rust
> > implementations currently read Parquet into Arrow?
>
> I'm not sure about the Rust code, but the C++ code is located at [2]; it
> has been undergoing some recent refactoring (and I think Wes might have 1
> or 2 changes still to make).  It doesn't yet support nested data types fully
> (e.g. structs).
>
> Are they reading Parquet row-by-row and building Arrow batches or are there
> > better ways of implementing this?
>
> I believe the implementations should be reading a row-group at a time
> column by column.  Spark potentially has an implementation that already
> does this.
>
>
> [1]
>
> https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> [2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
>
> On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <an...@gmail.com>
> wrote:
>
> > Thanks for the response Micah. I could implement this and contribute to
> > Arrow Java. To help me get started, are there any pointers on how the C++
> > or Rust implementations currently read Parquet into Arrow? Are they
> reading
> > Parquet row-by-row and building Arrow batches or are there better ways of
> > implementing this?
> >
> > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Anoop,
> >> There isn't currently anything in the Arrow Java library that does this.
> >> It is something that I think we want to add at some point.   Dremio [1]
> >> has
> >> some Parquet related code, but I haven't looked at it to understand how
> >> easy it is to use as a standalone library and whether it supports
> >> predicate
> >> push-down/column selection.
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1]
> >>
> >>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> >>
> >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <
> anoop.k.johnson@gmail.com>
> >> wrote:
> >>
> >> > Arrow Newbie here.  What is the recommended way to convert Parquet
> data
> >> > into Arrow, preferably doing predicate/column pushdown?
> >> >
> >> > One can implement this as custom code using the Parquet API, and
> >> re-encode
> >> > it in Arrow using the Arrow APIs, but is this supported by Arrow out
> of
> >> the
> >> > box?
> >> >
> >> > Thanks,
> >> > Anoop
> >> >
> >>
> >
>

Re: Parquet to Arrow in Java

Posted by Micah Kornfield <em...@gmail.com>.
Hi Anoop,
I think a contribution would be welcome.  There was a recent discussion
thread on what would be expected from new "readers" for Arrow data in Java
[1].  I think it's worth reading through, but my recollections of the
highlights are:
1.  A short design sketch in the JIRA that will track the work.
2.  Off-heap data structures as much as possible.
3.  An interface that allows predicate push-down, column projection and
specifying the batch sizes of reads (a rough sketch of such an interface
follows this list).  I think there is probably some interplay here between
RowGroup size and the size of batches.  It might be worth thinking about
this up front and mentioning it in the design.
4.  Performant (since we are going from columnar->columnar it should be
faster than Parquet-MR and on par with or better than Spark's implementation,
which I believe also goes from columnar to columnar).
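
To make point 3 concrete, here is a rough sketch of the kind of interface I
have in mind. Every name below is made up - nothing like this exists in Arrow
Java today - and parquet-mr's FilterPredicate is only one possible choice for
expressing predicates.

import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public interface ParquetArrowReader extends Closeable {

  /** Column projection: only the named columns are materialized. */
  ParquetArrowReader project(List<String> columnNames);

  /** Predicate push-down: row groups/pages not matching the predicate are skipped. */
  ParquetArrowReader filter(FilterPredicate predicate);

  /** Target number of rows per Arrow batch. */
  ParquetArrowReader batchSize(int rows);

  /**
   * Fills the root (backed by off-heap Arrow buffers) with the next batch and
   * returns the number of rows loaded, or 0 once the file is exhausted.
   */
  int readNextBatch(VectorSchemaRoot root) throws IOException;
}

Whether a batch may span row-group boundaries is exactly the RowGroup-size vs.
batch-size interplay mentioned above, so the design should pin that down.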

Answers to specific questions below.

Thanks,
Micah

To help me get started, are there any pointers on how the C++ or Rust
> implementations currently read Parquet into Arrow?

I'm not sure about the Rust code, but the C++ code is located at [2]; it
has been undergoing some recent refactoring (and I think Wes might have 1
or 2 changes still to make).  It doesn't yet support nested data types fully
(e.g. structs).

Are they reading Parquet row-by-row and building Arrow batches or are there
> better ways of implementing this?

I believe the implementations should be reading a row-group at a time
column by column.  Spark potentially has an implementation that already
does this.
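
In Java terms, walking a file one row group at a time with parquet-mr might
look roughly like the sketch below; the per-column decoding into Arrow
vectors is elided, and the class name is made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupSketch {
  public static void main(String[] args) throws Exception {
    HadoopInputFile file =
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration());
    try (ParquetFileReader reader = ParquetFileReader.open(file)) {
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        long rows = rowGroup.getRowCount();
        // For each projected column: decode this row group's pages
        // (definition levels + values) into an Arrow vector of length
        // `rows`, then assemble the vectors into one record batch.
        System.out.println("row group with " + rows + " rows");
      }
    }
  }
}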


[1]
https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow

On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <an...@gmail.com>
wrote:

> Thanks for the response Micah. I could implement this and contribute to
> Arrow Java. To help me get started, are there any pointers on how the C++
> or Rust implementations currently read Parquet into Arrow? Are they reading
> Parquet row-by-row and building Arrow batches or are there better ways of
> implementing this?
>
> On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Anoop,
>> There isn't currently anything in the Arrow Java library that does this.
>> It is something that I think we want to add at some point.   Dremio [1]
>> has
>> some Parquet related code, but I haven't looked at it to understand how
>> easy it is to use as a standalone library and whether it supports
>> predicate
>> push-down/column selection.
>>
>> Thanks,
>> Micah
>>
>> [1]
>>
>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>>
>> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <an...@gmail.com>
>> wrote:
>>
>> > Arrow Newbie here.  What is the recommended way to convert Parquet data
>> > into Arrow, preferably doing predicate/column pushdown?
>> >
>> > One can implement this as custom code using the Parquet API, and
>> re-encode
>> > it in Arrow using the Arrow APIs, but is this supported by Arrow out of
>> the
>> > box?
>> >
>> > Thanks,
>> > Anoop
>> >
>>
>