Posted to dev@parquet.apache.org by Yurui Zhou <yu...@alibaba-inc.com> on 2018/12/12 03:33:43 UTC

Arrow read write support on Java

Hello

I just learned that Arrow now provides a native reader/writer implementation in C++ that lets users read Parquet files directly into Arrow buffers and write Parquet files from Arrow buffers.

I am wondering whether there is any plan to add the same support on the Java side.

I found an implementation in the Dremio codebase that provides the Arrow support mentioned above: https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet

Does the Parquet or Arrow community have any plan to integrate this into the Parquet codebase, or to implement a new version from scratch?

Thanks
Yurui

Re: Arrow read write support on Java

Posted by Masayuki Takahashi <ma...@gmail.com>.
I have created the JIRA.

https://issues.apache.org/jira/browse/PARQUET-1479

On Fri, Dec 14, 2018 at 0:44 Masayuki Takahashi <ma...@gmail.com> wrote:
>
> Hi Ryan,
>
> Which part do you want to discuss? May I create a JIRA issue for it?
>
> thanks.



-- 
高橋 真之

Re: Arrow read write support on Java

Posted by Masayuki Takahashi <ma...@gmail.com>.
Hi Ryan,

Which part do you want to discuss? May I create a JIRA issue for it?

thanks.
On Thu, Dec 13, 2018 at 3:28 Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> We've had a lot of discussion about this in the Iceberg community as well,
> since Parquet to Arrow is going to be the easiest path to vectorized reads
> for Spark. It would be great to have people working on it!



-- 
高橋 真之

Re: Arrow read write support on Java

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
We've had a lot of discussion about this in the Iceberg community as well,
since Parquet to Arrow is going to be the easiest path to vectorized reads
for Spark. It would be great to have people working on it!

On Wed, Dec 12, 2018 at 7:38 AM Wes McKinney <we...@gmail.com> wrote:

> hi Masayuki -- this is great to hear. Since this software was not
> developed in the Apache Parquet community we may need to be careful about
> IP lineage / transfer issues if you do open a pull request.
>
> - Wes


-- 
Ryan Blue
Software Engineer
Netflix

Re: Arrow read write support on Java

Posted by Masayuki Takahashi <ma...@gmail.com>.
Hi Wes,

Thanks for telling me the details!
I am going to check the documentation of projects that have already been donated.

thanks.
On Fri, Dec 14, 2018 at 0:48 Wes McKinney <we...@gmail.com> wrote:
>
> hi,
>
> This software was developed outside of Apache Parquet:
> https://github.com/masayuki038/parquet-to-arrow. It would be different
> if this had been developed as pull requests into apache/parquet-mr,
> for example.
>
> We have a procedure for accepting foreign IP into Apache projects:
> http://incubator.apache.org/ip-clearance/
>
> - Wes



-- 
高橋 真之

Re: Arrow read write support on Java

Posted by Wes McKinney <we...@gmail.com>.
hi,

This software was developed outside of Apache Parquet:
https://github.com/masayuki038/parquet-to-arrow. It would be different
if this had been developed as pull requests into apache/parquet-mr,
for example.

We have a procedure for accepting foreign IP into Apache projects:
http://incubator.apache.org/ip-clearance/

- Wes
On Thu, Dec 13, 2018 at 9:39 AM Masayuki Takahashi
<ma...@gmail.com> wrote:
>
> Hi Wes,
>
> I did not understand what you meant by "IP lineage / transfer issues".
> Could you explain the details?
>
> I will try to conform to the rules of the Parquet community as much as possible.
>
> Thanks.

Re: Arrow read write support on Java

Posted by Masayuki Takahashi <ma...@gmail.com>.
Hi Wes,

I did not understand what you meant by "IP lineage / transfer issues".
Could you explain the details?

I will try to conform to the rules of the Parquet community as much as possible.

Thanks.
On Thu, Dec 13, 2018 at 0:38 Wes McKinney <we...@gmail.com> wrote:
>
> hi Masayuki -- this is great to hear. Since this software was not
> developed in the Apache Parquet community we may need to be careful about
> IP lineage / transfer issues if you do open a pull request.
>
> - Wes



-- 
高橋 真之

Re: Arrow read write support on Java

Posted by Wes McKinney <we...@gmail.com>.
hi Masayuki -- this is great to hear. Since this software was not
developed in the Apache Parquet community we may need to be careful about
IP lineage / transfer issues if you do open a pull request.

- Wes
On Wed, Dec 12, 2018 at 9:23 AM Masayuki Takahashi
<ma...@gmail.com> wrote:
>
> Hi,
>
> I am developing a simple converter from Parquet to Arrow.
>
> https://github.com/masayuki038/parquet-to-arrow
>
> If no one else has started on this yet, may I create a JIRA issue and a pull
> request for the Parquet-to-Arrow converter?
>
> I would also like to develop a converter from Arrow to Parquet and some
> additional features (like the Dremio implementation).
>
> thanks.

Re: Arrow read write support on Java

Posted by Masayuki Takahashi <ma...@gmail.com>.
Hi,

I am developing a simple converter from Parquet to Arrow.

https://github.com/masayuki038/parquet-to-arrow

If no one else has started on this yet, may I create a JIRA issue and a pull
request for the Parquet-to-Arrow converter?

I would also like to develop a converter from Arrow to Parquet and some
additional features (like the Dremio implementation).

thanks.


On Wed, Dec 12, 2018 at 23:49 Wes McKinney <we...@gmail.com> wrote:
>
> hi Yurui,
>
> This has been discussed over the last 3 years, but I haven't seen anyone
> step up to begin to work on this yet. Having vectorized Arrow read and
> write in a reusable Java library would be very useful (it has proven
> popular in C++). We welcome your contributions.
>
> - Wes



-- 
高橋 真之

Re: Arrow read write support on Java

Posted by Wes McKinney <we...@gmail.com>.
hi Yurui,

This has been discussed over the last 3 years, but I haven't seen anyone
step up to begin to work on this yet. Having vectorized Arrow read and
write in a reusable Java library would be very useful (it has proven
popular in C++). We welcome your contributions.

- Wes
On Tue, Dec 11, 2018 at 9:34 PM Yurui Zhou <yu...@alibaba-inc.com> wrote:
>
> Hello
>
> I just learned that Arrow now provides a native reader/writer implementation in C++ that lets users read Parquet files directly into Arrow buffers and write Parquet files from Arrow buffers.
>
> I am wondering whether there is any plan to add the same support on the Java side.
>
> I found an implementation in the Dremio codebase that provides the Arrow support mentioned above: https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>
> Does the Parquet or Arrow community have any plan to integrate this into the Parquet codebase, or to implement a new version from scratch?
>
> Thanks
> Yurui