Posted to dev@beam.apache.org by Jean-Baptiste Onofré <jb...@nanthrax.net> on 2018/02/13 20:31:40 UTC

Plan for a Parquet new release and writing Parquet file with outputstream

Hi guys,

I'm working on the Apache Beam ParquetIO:

https://github.com/apache/beam/pull/1851

In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).

While I was able to implement the read part using AvroParquetReader leveraging Beam
FileIO, I'm struggling with the writing part.

I have to create a ParquetSink implementing FileIO.Sink; in particular, I have to
implement the open(WritableByteChannel channel) method.

It's not possible to use AvroParquetWriter here, as it takes a Path as an argument
(and from the channel, I can only get an OutputStream).
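
To make the constraint concrete, here is a minimal sketch of the FileIO.Sink shape
involved (GenericRecord elements, the class name, and the empty write() body are just
placeholders, not the actual implementation):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;

// Skeleton sink: FileIO.Sink only hands us a WritableByteChannel, so the only
// thing we can derive from it is a plain OutputStream.
class ParquetSink implements FileIO.Sink<GenericRecord> {

  private transient OutputStream out;

  @Override
  public void open(WritableByteChannel channel) throws IOException {
    // No file Path is available here, only a stream view of the channel.
    out = Channels.newOutputStream(channel);
  }

  @Override
  public void write(GenericRecord element) throws IOException {
    // A Parquet writer backed by 'out' would be used here.
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }
}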

As a workaround, I wanted to use org.apache.parquet.hadoop.ParquetFileWriter,
providing my own implementation of org.apache.parquet.io.OutputFile.

Unfortunately, OutputFile (and the updated method in ParquetFileWriter) exists on the
Parquet master branch, but the API is different in Parquet 1.9.0.
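
For reference, the kind of OutputFile implementation I have in mind would look roughly
like the sketch below. It is written against the master-branch
OutputFile/PositionOutputStream interfaces (so it would only compile against a SNAPSHOT
or a later release), and the class names are illustrative:

import java.io.IOException;
import java.io.OutputStream;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Adapter that exposes a plain OutputStream (e.g. obtained from a
// WritableByteChannel) through Parquet's OutputFile abstraction.
class OutputStreamOutputFile implements OutputFile {

  private final OutputStream out;

  OutputStreamOutputFile(OutputStream out) {
    this.out = out;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) {
    return new CountingPositionOutputStream(out);
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) {
    return new CountingPositionOutputStream(out);
  }

  @Override
  public boolean supportsBlockSize() {
    return false;
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }

  // Tracks the write position manually, since a raw OutputStream cannot report it.
  private static class CountingPositionOutputStream extends PositionOutputStream {

    private final OutputStream delegate;
    private long position = 0;

    CountingPositionOutputStream(OutputStream delegate) {
      this.delegate = delegate;
    }

    @Override
    public long getPos() {
      return position;
    }

    @Override
    public void write(int b) throws IOException {
      delegate.write(b);
      position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      delegate.write(b, off, len);
      position += len;
    }

    @Override
    public void flush() throws IOException {
      delegate.flush();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}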

So, I have two questions:
- do you plan a Parquet 1.9.1 release including org.apache.parquet.io.OutputFile
and the updated org.apache.parquet.hadoop.ParquetFileWriter?
- using Parquet 1.9.0, do you have any advice on how to use
AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object that I
can get from a WritableByteChannel)?

Thanks !

Regards
JB
-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Eugene Kirpichov <ki...@google.com>.
Thanks for raising this, JB!
To clarify for people on the Parquet mailing list who are not familiar with
Beam:

Beam supports multiple filesystems (currently: local, HDFS, Google Cloud Storage,
S3) via a pluggable interface (which, among other things, can give you a
Channel for reading/writing a given path), and we'd like to be able to
read and write Parquet files on any of the supported filesystems.

The current AvroParquetReader/Writer API, which takes a Path to the file,
supports only local and HDFS files. We would like to be able to read
Parquet files via a ReadableByteChannel or InputStream, and write via a
WritableByteChannel or OutputStream. (JB raised the issue for writing, but
I just realized that it affects reading to the same extent.)

The ParquetFileWriter constructor that takes an OutputFile seems to help with that;
likewise, ParquetFileReader with InputFile.
Generally this seems to be part of
https://issues.apache.org/jira/browse/PARQUET-1142 (marked "fixed" for 1.10,
but I wasn't able to verify) and https://issues.apache.org/jira/browse/PARQUET-1126.
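
To illustrate the read side, a channel-backed InputFile could look roughly like the
sketch below. It assumes the master-branch InputFile/SeekableInputStream interfaces and
the DelegatingSeekableInputStream helper from parquet-common; the class name and the use
of a SeekableByteChannel (e.g. from Beam's FileIO.ReadableFile#openSeekable()) are my
assumptions:

import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

// Adapter that exposes a seekable channel through Parquet's InputFile abstraction.
class SeekableChannelInputFile implements InputFile {

  private final SeekableByteChannel channel;

  SeekableChannelInputFile(SeekableByteChannel channel) {
    this.channel = channel;
  }

  @Override
  public long getLength() throws IOException {
    return channel.size();
  }

  @Override
  public SeekableInputStream newStream() throws IOException {
    // DelegatingSeekableInputStream implements the bulk-read methods on top of a
    // plain InputStream; only getPos() and seek() need to be supplied here.
    return new DelegatingSeekableInputStream(Channels.newInputStream(channel)) {
      @Override
      public long getPos() throws IOException {
        return channel.position();
      }

      @Override
      public void seek(long newPos) throws IOException {
        channel.position(newPos);
      }
    };
  }
}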


On Tue, Feb 13, 2018 at 12:31 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi guys,
>
> I'm working on the Apache Beam ParquetIO:
>
> https://github.com/apache/beam/pull/1851
>
> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>
> If I was able to implement the Read part using AvroParquetReader
> leveraging Beam
>  FileIO, I'm struggling on the writing part.
>
> I have to create ParquetSink implementing FileIO.Sink. Especially, I have
> to
> implement the open(WritableByteChannel channel) method.
>
> It's not possible to use AvroParquetWriter here as it takes a Path as
> argument
> (and from the channel, I can only have an OutputStream).
>
> As a workaround, I wanted to use
> org.apache.parquet.hadoop.ParquetFileWriter,
> providing my own implementation of org.apache.parquet.io.OutputFile.
>
> Unfortunately OutputFile (and the updated method in ParquetFileWriter)
> exists on
> Parquet master branch, but it was different on Parquet 1.9.0.
>
> So, I have two questions:
> - do you plan a Parquet 1.9.1 release including org.apache.parquet.io
> .OutputFile
> and updated org.apache.parquet.hadoop.ParquetFileWriter ?
> - using Parquet 1.9.0, do you have any advice how to use
> AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
> that I
> can get from WritableByteChannel) ?
>
> Thanks !
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Yup, that's great. I will update the PR when back from vacation.

Regards
JB

On 20 Apr 2018 at 02:26, Eugene Kirpichov <ki...@google.com> wrote:
>Very cool! JB, time to update your PR?
>
>On Thu, Apr 19, 2018 at 9:17 AM Alexey Romanenko
><ar...@gmail.com>
>wrote:
>
>> FYI: Apache Parquet 1.10.0 was released recently.
>> It contains org.apache.parquet.io.OutputFile and updated
>> org.apache.parquet.hadoop.ParquetFileWriter
>>
>> WBR,
>> Alexey
>>
>>
>> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré <jb...@nanthrax.net>
>wrote:
>>
>> Great !!
>>
>> In the mean time, I started to PoC around directly parquet-common to
>see
>> if I
>> can implement a BeamParquetReader and a BeamParquetWriter.
>>
>> I might also propose some PRs.
>>
>> I will continue tomorrow around that.
>>
>> Thanks again !
>> Regards
>> JB
>>
>> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>>
>> Additions to the builders are easy enough that we can get that in.
>There's
>> a PR out there that needs to be fixed:
>> https://github.com/apache/parquet-mr/pull/446
>>
>> I've asked the author for just the builder changes. If we don't hear
>back,
>> we can add another PR but I'd like to give the author some time to
>update.
>>
>> rb
>>
>> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré
><jb...@nanthrax.net>
>> wrote:
>>
>> Hi  Ryan,
>>
>> Thanks for the update.
>>
>> Ideally for Beam, it would be great to have the AvroParquetReader and
>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
>> allow me
>> to directly leverage Beam FileIO.
>>
>> Do you have a rough date for the Parquet release with that ?
>>
>> Thanks
>> Regards
>> JB
>>
>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>
>> Jean-Baptiste,
>>
>> We're planning a release that will include the new OutputFile class,
>>
>> which I
>>
>> think you should be able to use. Is there anything you'd change to
>make
>>
>> this
>>
>> work more easily with Beam?
>>
>> rb
>>
>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré
><jb@nanthrax.net
>> <ma...@nanthrax.net>> wrote:
>>
>>    Hi guys,
>>
>>    I'm working on the Apache Beam ParquetIO:
>>
>>    https://github.com/apache/beam/pull/1851
>>    <https://github.com/apache/beam/pull/1851>
>>
>>    In Beam, thanks to FileIO, we support several filesystems (HDFS,
>S3,
>>
>> ...).
>>
>>
>>    If I was able to implement the Read part using AvroParquetReader
>>
>> leveraging Beam
>>
>>     FileIO, I'm struggling on the writing part.
>>
>>    I have to create ParquetSink implementing FileIO.Sink. Especially,
>I
>>
>> have to
>>
>>    implement the open(WritableByteChannel channel) method.
>>
>>    It's not possible to use AvroParquetWriter here as it takes a Path
>>
>> as argument
>>
>>    (and from the channel, I can only have an OutputStream).
>>
>>    As a workaround, I wanted to use org.apache.parquet.hadoop.
>>
>> ParquetFileWriter,
>>
>>    providing my own implementation of org.apache.parquet.io
>>    <http://org.apache.parquet.io>.OutputFile.
>>
>>    Unfortunately OutputFile (and the updated method in
>>
>> ParquetFileWriter) exists on
>>
>>    Parquet master branch, but it was different on Parquet 1.9.0.
>>
>>    So, I have two questions:
>>    - do you plan a Parquet 1.9.1 release including
>>
>> org.apache.parquet.io
>>
>>    <http://org.apache.parquet.io>.OutputFile
>>    and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>>    - using Parquet 1.9.0, do you have any advice how to use
>>    AvroParquetWriter/ParquetFileWriter with an OutputStream (or any
>>
>> object that I
>>
>>    can get from WritableByteChannel) ?
>>
>>    Thanks !
>>
>>    Regards
>>    JB
>>    --
>>    Jean-Baptiste Onofré
>>    jbonofre@apache.org <ma...@apache.org>
>>    http://blog.nanthrax.net
>>    Talend - http://www.talend.com
>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Eugene Kirpichov <ki...@google.com>.
Very cool! JB, time to update your PR?

On Thu, Apr 19, 2018 at 9:17 AM Alexey Romanenko <ar...@gmail.com>
wrote:

> FYI: Apache Parquet 1.10.0 was released recently.
> It contains org.apache.parquet.io.OutputFile and updated
> org.apache.parquet.hadoop.ParquetFileWriter
>
> WBR,
> Alexey
>
>
> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>
> Great !!
>
> In the mean time, I started to PoC around directly parquet-common to see
> if I
> can implement a BeamParquetReader and a BeamParquetWriter.
>
> I might also propose some PRs.
>
> I will continue tomorrow around that.
>
> Thanks again !
> Regards
> JB
>
> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>
> Additions to the builders are easy enough that we can get that in. There's
> a PR out there that needs to be fixed:
> https://github.com/apache/parquet-mr/pull/446
>
> I've asked the author for just the builder changes. If we don't hear back,
> we can add another PR but I'd like to give the author some time to update.
>
> rb
>
> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> Hi  Ryan,
>
> Thanks for the update.
>
> Ideally for Beam, it would be great to have the AvroParquetReader and
> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
> allow me
> to directly leverage Beam FileIO.
>
> Do you have a rough date for the Parquet release with that ?
>
> Thanks
> Regards
> JB
>
> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>
> Jean-Baptiste,
>
> We're planning a release that will include the new OutputFile class,
>
> which I
>
> think you should be able to use. Is there anything you'd change to make
>
> this
>
> work more easily with Beam?
>
> rb
>
> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb@nanthrax.net
> <ma...@nanthrax.net>> wrote:
>
>    Hi guys,
>
>    I'm working on the Apache Beam ParquetIO:
>
>    https://github.com/apache/beam/pull/1851
>    <https://github.com/apache/beam/pull/1851>
>
>    In Beam, thanks to FileIO, we support several filesystems (HDFS, S3,
>
> ...).
>
>
>    If I was able to implement the Read part using AvroParquetReader
>
> leveraging Beam
>
>     FileIO, I'm struggling on the writing part.
>
>    I have to create ParquetSink implementing FileIO.Sink. Especially, I
>
> have to
>
>    implement the open(WritableByteChannel channel) method.
>
>    It's not possible to use AvroParquetWriter here as it takes a Path
>
> as argument
>
>    (and from the channel, I can only have an OutputStream).
>
>    As a workaround, I wanted to use org.apache.parquet.hadoop.
>
> ParquetFileWriter,
>
>    providing my own implementation of org.apache.parquet.io
>    <http://org.apache.parquet.io>.OutputFile.
>
>    Unfortunately OutputFile (and the updated method in
>
> ParquetFileWriter) exists on
>
>    Parquet master branch, but it was different on Parquet 1.9.0.
>
>    So, I have two questions:
>    - do you plan a Parquet 1.9.1 release including
>
> org.apache.parquet.io
>
>    <http://org.apache.parquet.io>.OutputFile
>    and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>    - using Parquet 1.9.0, do you have any advice how to use
>    AvroParquetWriter/ParquetFileWriter with an OutputStream (or any
>
> object that I
>
>    can get from WritableByteChannel) ?
>
>    Thanks !
>
>    Regards
>    JB
>    --
>    Jean-Baptiste Onofré
>    jbonofre@apache.org <ma...@apache.org>
>    http://blog.nanthrax.net
>    Talend - http://www.talend.com
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Alexey Romanenko <ar...@gmail.com>.
FYI: Apache Parquet 1.10.0 was released recently.
It contains org.apache.parquet.io.OutputFile and the updated org.apache.parquet.hadoop.ParquetFileWriter
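
As a rough illustration (not tested), a write against 1.10.0 could then look like the
following sketch, assuming the OutputFile-based AvroParquetWriter builder is included in
the release and reusing an OutputStreamOutputFile adapter like the one sketched earlier
in the thread:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

class ChannelParquetWrite {

  // Writes Avro GenericRecords as Parquet to a WritableByteChannel.
  static void writeAll(WritableByteChannel channel, Schema schema,
                       Iterable<GenericRecord> records) throws IOException {
    OutputStream out = Channels.newOutputStream(channel);
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new OutputStreamOutputFile(out))
                 .withSchema(schema)
                 .build()) {
      for (GenericRecord record : records) {
        writer.write(record);
      }
    }
  }
}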

WBR,
Alexey

> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> 
> Great !!
> 
> In the mean time, I started to PoC around directly parquet-common to see if I
> can implement a BeamParquetReader and a BeamParquetWriter.
> 
> I might also propose some PRs.
> 
> I will continue tomorrow around that.
> 
> Thanks again !
> Regards
> JB
> 
> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>> Additions to the builders are easy enough that we can get that in. There's
>> a PR out there that needs to be fixed:
>> https://github.com/apache/parquet-mr/pull/446
>> 
>> I've asked the author for just the builder changes. If we don't hear back,
>> we can add another PR but I'd like to give the author some time to update.
>> 
>> rb
>> 
>> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>> 
>>> Hi  Ryan,
>>> 
>>> Thanks for the update.
>>> 
>>> Ideally for Beam, it would be great to have the AvroParquetReader and
>>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
>>> allow me
>>> to directly leverage Beam FileIO.
>>> 
>>> Do you have a rough date for the Parquet release with that ?
>>> 
>>> Thanks
>>> Regards
>>> JB
>>> 
>>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>>> Jean-Baptiste,
>>>> 
>>>> We're planning a release that will include the new OutputFile class,
>>> which I
>>>> think you should be able to use. Is there anything you'd change to make
>>> this
>>>> work more easily with Beam?
>>>> 
>>>> rb
>>>> 
>>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb@nanthrax.net
>>>> <ma...@nanthrax.net>> wrote:
>>>> 
>>>>    Hi guys,
>>>> 
>>>>    I'm working on the Apache Beam ParquetIO:
>>>> 
>>>>    https://github.com/apache/beam/pull/1851
>>>>    <https://github.com/apache/beam/pull/1851>
>>>> 
>>>>    In Beam, thanks to FileIO, we support several filesystems (HDFS, S3,
>>> ...).
>>>> 
>>>>    If I was able to implement the Read part using AvroParquetReader
>>> leveraging Beam
>>>>     FileIO, I'm struggling on the writing part.
>>>> 
>>>>    I have to create ParquetSink implementing FileIO.Sink. Especially, I
>>> have to
>>>>    implement the open(WritableByteChannel channel) method.
>>>> 
>>>>    It's not possible to use AvroParquetWriter here as it takes a Path
>>> as argument
>>>>    (and from the channel, I can only have an OutputStream).
>>>> 
>>>>    As a workaround, I wanted to use org.apache.parquet.hadoop.
>>> ParquetFileWriter,
>>>>    providing my own implementation of org.apache.parquet.io
>>>>    <http://org.apache.parquet.io>.OutputFile.
>>>> 
>>>>    Unfortunately OutputFile (and the updated method in
>>> ParquetFileWriter) exists on
>>>>    Parquet master branch, but it was different on Parquet 1.9.0.
>>>> 
>>>>    So, I have two questions:
>>>>    - do you plan a Parquet 1.9.1 release including
>>> org.apache.parquet.io
>>>>    <http://org.apache.parquet.io>.OutputFile
>>>>    and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>>>>    - using Parquet 1.9.0, do you have any advice how to use
>>>>    AvroParquetWriter/ParquetFileWriter with an OutputStream (or any
>>> object that I
>>>>    can get from WritableByteChannel) ?
>>>> 
>>>>    Thanks !
>>>> 
>>>>    Regards
>>>>    JB
>>>>    --
>>>>    Jean-Baptiste Onofré
>>>>    jbonofre@apache.org <ma...@apache.org>
>>>>    http://blog.nanthrax.net
>>>>    Talend - http://www.talend.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>> 
>> 
>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Great !!

In the meantime, I started a PoC directly on top of parquet-common to see if I
can implement a BeamParquetReader and a BeamParquetWriter.

I might also propose some PRs.

I will continue working on that tomorrow.

Thanks again !
Regards
JB

On 02/14/2018 08:04 PM, Ryan Blue wrote:
> Additions to the builders are easy enough that we can get that in. There's
> a PR out there that needs to be fixed:
> https://github.com/apache/parquet-mr/pull/446
> 
> I've asked the author for just the builder changes. If we don't hear back,
> we can add another PR but I'd like to give the author some time to update.
> 
> rb
> 
> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
> 
>> Hi  Ryan,
>>
>> Thanks for the update.
>>
>> Ideally for Beam, it would be great to have the AvroParquetReader and
>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
>> allow me
>> to directly leverage Beam FileIO.
>>
>> Do you have a rough date for the Parquet release with that ?
>>
>> Thanks
>> Regards
>> JB
>>
>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>> Jean-Baptiste,
>>>
>>> We're planning a release that will include the new OutputFile class,
>> which I
>>> think you should be able to use. Is there anything you'd change to make
>> this
>>> work more easily with Beam?
>>>
>>> rb
>>>
>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb@nanthrax.net
>>> <ma...@nanthrax.net>> wrote:
>>>
>>>     Hi guys,
>>>
>>>     I'm working on the Apache Beam ParquetIO:
>>>
>>>     https://github.com/apache/beam/pull/1851
>>>     <https://github.com/apache/beam/pull/1851>
>>>
>>>     In Beam, thanks to FileIO, we support several filesystems (HDFS, S3,
>> ...).
>>>
>>>     If I was able to implement the Read part using AvroParquetReader
>> leveraging Beam
>>>      FileIO, I'm struggling on the writing part.
>>>
>>>     I have to create ParquetSink implementing FileIO.Sink. Especially, I
>> have to
>>>     implement the open(WritableByteChannel channel) method.
>>>
>>>     It's not possible to use AvroParquetWriter here as it takes a Path
>> as argument
>>>     (and from the channel, I can only have an OutputStream).
>>>
>>>     As a workaround, I wanted to use org.apache.parquet.hadoop.
>> ParquetFileWriter,
>>>     providing my own implementation of org.apache.parquet.io
>>>     <http://org.apache.parquet.io>.OutputFile.
>>>
>>>     Unfortunately OutputFile (and the updated method in
>> ParquetFileWriter) exists on
>>>     Parquet master branch, but it was different on Parquet 1.9.0.
>>>
>>>     So, I have two questions:
>>>     - do you plan a Parquet 1.9.1 release including
>> org.apache.parquet.io
>>>     <http://org.apache.parquet.io>.OutputFile
>>>     and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>>>     - using Parquet 1.9.0, do you have any advice how to use
>>>     AvroParquetWriter/ParquetFileWriter with an OutputStream (or any
>> object that I
>>>     can get from WritableByteChannel) ?
>>>
>>>     Thanks !
>>>
>>>     Regards
>>>     JB
>>>     --
>>>     Jean-Baptiste Onofré
>>>     jbonofre@apache.org <ma...@apache.org>
>>>     http://blog.nanthrax.net
>>>     Talend - http://www.talend.com
>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Additions to the builders are easy enough that we can get that in. There's
a PR out there that needs to be fixed:
https://github.com/apache/parquet-mr/pull/446

I've asked the author for just the builder changes. If we don't hear back,
we can add another PR but I'd like to give the author some time to update.
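
For illustration, the read-side usage these builder changes would enable might look
roughly like the sketch below (assuming an InputFile-based builder on AvroParquetReader;
SeekableChannelInputFile stands for any InputFile implementation, like the one sketched
earlier in the thread):

import java.io.IOException;
import java.nio.channels.SeekableByteChannel;

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

class ChannelParquetRead {

  // Reads Avro GenericRecords from a Parquet file exposed as a seekable channel.
  static void readAll(SeekableByteChannel channel) throws IOException {
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new SeekableChannelInputFile(channel))
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}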

rb

On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi  Ryan,
>
> Thanks for the update.
>
> Ideally for Beam, it would be great to have the AvroParquetReader and
> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
> allow me
> to directly leverage Beam FileIO.
>
> Do you have a rough date for the Parquet release with that ?
>
> Thanks
> Regards
> JB
>
> On 02/14/2018 02:01 AM, Ryan Blue wrote:
> > Jean-Baptiste,
> >
> > We're planning a release that will include the new OutputFile class,
> which I
> > think you should be able to use. Is there anything you'd change to make
> this
> > work more easily with Beam?
> >
> > rb
> >
> > On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb@nanthrax.net
> > <ma...@nanthrax.net>> wrote:
> >
> >     Hi guys,
> >
> >     I'm working on the Apache Beam ParquetIO:
> >
> >     https://github.com/apache/beam/pull/1851
> >     <https://github.com/apache/beam/pull/1851>
> >
> >     In Beam, thanks to FileIO, we support several filesystems (HDFS, S3,
> ...).
> >
> >     If I was able to implement the Read part using AvroParquetReader
> leveraging Beam
> >      FileIO, I'm struggling on the writing part.
> >
> >     I have to create ParquetSink implementing FileIO.Sink. Especially, I
> have to
> >     implement the open(WritableByteChannel channel) method.
> >
> >     It's not possible to use AvroParquetWriter here as it takes a Path
> as argument
> >     (and from the channel, I can only have an OutputStream).
> >
> >     As a workaround, I wanted to use org.apache.parquet.hadoop.
> ParquetFileWriter,
> >     providing my own implementation of org.apache.parquet.io
> >     <http://org.apache.parquet.io>.OutputFile.
> >
> >     Unfortunately OutputFile (and the updated method in
> ParquetFileWriter) exists on
> >     Parquet master branch, but it was different on Parquet 1.9.0.
> >
> >     So, I have two questions:
> >     - do you plan a Parquet 1.9.1 release including
> org.apache.parquet.io
> >     <http://org.apache.parquet.io>.OutputFile
> >     and updated org.apache.parquet.hadoop.ParquetFileWriter ?
> >     - using Parquet 1.9.0, do you have any advice how to use
> >     AvroParquetWriter/ParquetFileWriter with an OutputStream (or any
> object that I
> >     can get from WritableByteChannel) ?
> >
> >     Thanks !
> >
> >     Regards
> >     JB
> >     --
> >     Jean-Baptiste Onofré
> >     jbonofre@apache.org <ma...@apache.org>
> >     http://blog.nanthrax.net
> >     Talend - http://www.talend.com
> >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi  Ryan,

Thanks for the update.

Ideally for Beam, it would be great to have AvroParquetReader and
AvroParquetWriter use the InputFile/OutputFile interfaces. That would allow me
to directly leverage Beam FileIO.

Do you have a rough date for the Parquet release with that?

Thanks
Regards
JB

On 02/14/2018 02:01 AM, Ryan Blue wrote:
> Jean-Baptiste,
> 
> We're planning a release that will include the new OutputFile class, which I
> think you should be able to use. Is there anything you'd change to make this
> work more easily with Beam?
> 
> rb
> 
> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb@nanthrax.net
> <ma...@nanthrax.net>> wrote:
> 
>     Hi guys,
> 
>     I'm working on the Apache Beam ParquetIO:
> 
>     https://github.com/apache/beam/pull/1851
>     <https://github.com/apache/beam/pull/1851>
> 
>     In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
> 
>     If I was able to implement the Read part using AvroParquetReader leveraging Beam
>      FileIO, I'm struggling on the writing part.
> 
>     I have to create ParquetSink implementing FileIO.Sink. Especially, I have to
>     implement the open(WritableByteChannel channel) method.
> 
>     It's not possible to use AvroParquetWriter here as it takes a Path as argument
>     (and from the channel, I can only have an OutputStream).
> 
>     As a workaround, I wanted to use org.apache.parquet.hadoop.ParquetFileWriter,
>     providing my own implementation of org.apache.parquet.io
>     <http://org.apache.parquet.io>.OutputFile.
> 
>     Unfortunately OutputFile (and the updated method in ParquetFileWriter) exists on
>     Parquet master branch, but it was different on Parquet 1.9.0.
> 
>     So, I have two questions:
>     - do you plan a Parquet 1.9.1 release including org.apache.parquet.io
>     <http://org.apache.parquet.io>.OutputFile
>     and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>     - using Parquet 1.9.0, do you have any advice how to use
>     AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object that I
>     can get from WritableByteChannel) ?
> 
>     Thanks !
> 
>     Regards
>     JB
>     --
>     Jean-Baptiste Onofré
>     jbonofre@apache.org <ma...@apache.org>
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Eugene Kirpichov <ki...@google.com>.
Hi Ryan,
It would help to have AvroParquetReader/Writer also support the
InputFile/OutputFile interfaces.
Also: any suggestions as to when this might be officially released?
Thanks.

On Tue, Feb 13, 2018 at 5:02 PM Ryan Blue <rb...@netflix.com> wrote:

> Jean-Baptiste,
>
> We're planning a release that will include the new OutputFile class, which
> I think you should be able to use. Is there anything you'd change to make
> this work more easily with Beam?
>
> rb
>
> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi guys,
>>
>> I'm working on the Apache Beam ParquetIO:
>>
>> https://github.com/apache/beam/pull/1851
>>
>> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>
>> If I was able to implement the Read part using AvroParquetReader
>> leveraging Beam
>>  FileIO, I'm struggling on the writing part.
>>
>> I have to create ParquetSink implementing FileIO.Sink. Especially, I have
>> to
>> implement the open(WritableByteChannel channel) method.
>>
>> It's not possible to use AvroParquetWriter here as it takes a Path as
>> argument
>> (and from the channel, I can only have an OutputStream).
>>
>> As a workaround, I wanted to use
>> org.apache.parquet.hadoop.ParquetFileWriter,
>> providing my own implementation of org.apache.parquet.io.OutputFile.
>>
>> Unfortunately OutputFile (and the updated method in ParquetFileWriter)
>> exists on
>> Parquet master branch, but it was different on Parquet 1.9.0.
>>
>> So, I have two questions:
>> - do you plan a Parquet 1.9.1 release including org.apache.parquet.io
>> .OutputFile
>> and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>> - using Parquet 1.9.0, do you have any advice how to use
>> AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
>> that I
>> can get from WritableByteChannel) ?
>>
>> Thanks !
>>
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Ryan,

sorry to have been quiet, but I was busy traveling recently :)

Just a quick update about this one:

- I asked a guy from my team to work with me on the Beam ParquetIO. We're also
seeing several users expecting this new IO.
- I will update my current PR to use the Parquet SNAPSHOT and verify that
OutputFile/InputFile are convenient for the Beam use case. I should be able to do it
tomorrow.
- Then, if OutputFile/InputFile are OK for ParquetIO, I will let you know and
kindly ask for a Parquet release.

Is that OK for you?

Thanks !
Regards
JB

On 02/14/2018 02:01 AM, Ryan Blue wrote:
> Jean-Baptiste,
> 
> We're planning a release that will include the new OutputFile class, which
> I think you should be able to use. Is there anything you'd change to make
> this work more easily with Beam?
> 
> rb
> 
> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
> 
>> Hi guys,
>>
>> I'm working on the Apache Beam ParquetIO:
>>
>> https://github.com/apache/beam/pull/1851
>>
>> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>
>> If I was able to implement the Read part using AvroParquetReader
>> leveraging Beam
>>  FileIO, I'm struggling on the writing part.
>>
>> I have to create ParquetSink implementing FileIO.Sink. Especially, I have
>> to
>> implement the open(WritableByteChannel channel) method.
>>
>> It's not possible to use AvroParquetWriter here as it takes a Path as
>> argument
>> (and from the channel, I can only have an OutputStream).
>>
>> As a workaround, I wanted to use org.apache.parquet.hadoop.
>> ParquetFileWriter,
>> providing my own implementation of org.apache.parquet.io.OutputFile.
>>
>> Unfortunately OutputFile (and the updated method in ParquetFileWriter)
>> exists on
>> Parquet master branch, but it was different on Parquet 1.9.0.
>>
>> So, I have two questions:
>> - do you plan a Parquet 1.9.1 release including org.apache.parquet.io.
>> OutputFile
>> and updated org.apache.parquet.hadoop.ParquetFileWriter ?
>> - using Parquet 1.9.0, do you have any advice how to use
>> AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
>> that I
>> can get from WritableByteChannel) ?
>>
>> Thanks !
>>
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Plan for a Parquet new release and writing Parquet file with outputstream

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Jean-Baptiste,

We're planning a release that will include the new OutputFile class, which
I think you should be able to use. Is there anything you'd change to make
this work more easily with Beam?

rb

On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi guys,
>
> I'm working on the Apache Beam ParquetIO:
>
> https://github.com/apache/beam/pull/1851
>
> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>
> If I was able to implement the Read part using AvroParquetReader
> leveraging Beam
>  FileIO, I'm struggling on the writing part.
>
> I have to create ParquetSink implementing FileIO.Sink. Especially, I have
> to
> implement the open(WritableByteChannel channel) method.
>
> It's not possible to use AvroParquetWriter here as it takes a Path as
> argument
> (and from the channel, I can only have an OutputStream).
>
> As a workaround, I wanted to use org.apache.parquet.hadoop.
> ParquetFileWriter,
> providing my own implementation of org.apache.parquet.io.OutputFile.
>
> Unfortunately OutputFile (and the updated method in ParquetFileWriter)
> exists on
> Parquet master branch, but it was different on Parquet 1.9.0.
>
> So, I have two questions:
> - do you plan a Parquet 1.9.1 release including org.apache.parquet.io.
> OutputFile
> and updated org.apache.parquet.hadoop.ParquetFileWriter ?
> - using Parquet 1.9.0, do you have any advice how to use
> AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
> that I
> can get from WritableByteChannel) ?
>
> Thanks !
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Ryan Blue
Software Engineer
Netflix
