Posted to dev@beam.apache.org by Alexey Romanenko <ar...@gmail.com> on 2018/04/19 16:17:47 UTC
Re: Plan for a Parquet new release and writing Parquet file with outputstream
FYI: Apache Parquet 1.10.0 was released recently.
It contains org.apache.parquet.io.OutputFile and an updated org.apache.parquet.hadoop.ParquetFileWriter.
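For anyone following the thread, here is a rough sketch of the piece a channel-backed OutputFile needs: an OutputStream over a WritableByteChannel that remembers how many bytes it has written. It uses only the JDK. The class name CountingChannelOutputStream and its getPos() method are made up for illustration. In Parquet 1.10.0, I believe the same logic would live in a subclass of org.apache.parquet.io.PositionOutputStream returned from OutputFile.create(...), but that wiring is an assumption and is not shown here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

/**
 * Hypothetical helper (not part of Beam or Parquet): an OutputStream over a
 * WritableByteChannel that tracks the number of bytes written so far.
 */
class CountingChannelOutputStream extends OutputStream {
    private final OutputStream delegate;
    private long position = 0;

    CountingChannelOutputStream(WritableByteChannel channel) {
        // Channels.newOutputStream adapts the channel to a blocking stream.
        this.delegate = Channels.newOutputStream(channel);
    }

    /** Returns the number of bytes written through this stream. */
    long getPos() {
        return position;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len);
        position += len;
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

The idea would be to wrap the channel passed to FileIO.Sink.open(WritableByteChannel) this way, so the writer can ask for the current offset without needing a Hadoop Path.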
WBR,
Alexey
> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>
> Great !!
>
> In the meantime, I started a PoC working directly with parquet-common to see if I
> can implement a BeamParquetReader and a BeamParquetWriter.
>
> I might also propose some PRs.
>
> I will continue working on that tomorrow.
>
> Thanks again !
> Regards
> JB
>
> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>> Additions to the builders are easy enough that we can get that in. There's
>> a PR out there that needs to be fixed:
>> https://github.com/apache/parquet-mr/pull/446
>>
>> I've asked the author for just the builder changes. If we don't hear back,
>> we can add another PR but I'd like to give the author some time to update.
>>
>> rb
>>
>> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Thanks for the update.
>>>
>>> Ideally for Beam, it would be great to have the AvroParquetReader and
>>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would allow me
>>> to directly leverage Beam FileIO.
>>>
>>> Do you have a rough date for the Parquet release with that ?
>>>
>>> Thanks
>>> Regards
>>> JB
>>>
>>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>>> Jean-Baptiste,
>>>>
>>>> We're planning a release that will include the new OutputFile class, which I
>>>> think you should be able to use. Is there anything you'd change to make this
>>>> work more easily with Beam?
>>>>
>>>> rb
>>>>
>>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I'm working on the Apache Beam ParquetIO:
>>>>
>>>> https://github.com/apache/beam/pull/1851
>>>>
>>>> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>>>
>>>> While I was able to implement the read part using AvroParquetReader on top of
>>>> Beam FileIO, I'm struggling with the write part.
>>>>
>>>> I have to create a ParquetSink implementing FileIO.Sink. In particular, I have to
>>>> implement the open(WritableByteChannel channel) method.
>>>>
>>>> It's not possible to use AvroParquetWriter here as it takes a Path as an argument
>>>> (and from the channel, I can only get an OutputStream).
>>>>
>>>> As a workaround, I wanted to use org.apache.parquet.hadoop.ParquetFileWriter,
>>>> providing my own implementation of org.apache.parquet.io.OutputFile.
>>>>
>>>> Unfortunately, OutputFile (and the updated method in ParquetFileWriter) exists on
>>>> the Parquet master branch, but things are different in Parquet 1.9.0.
>>>>
>>>> So, I have two questions:
>>>> - do you plan a Parquet 1.9.1 release including org.apache.parquet.io.OutputFile
>>>> and the updated org.apache.parquet.hadoop.ParquetFileWriter?
>>>> - using Parquet 1.9.0, do you have any advice on how to use
>>>> AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object that I
>>>> can get from a WritableByteChannel)?
>>>>
>>>> Thanks !
>>>>
>>>> Regards
>>>> JB
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbonofre@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
Re: Plan for a Parquet new release and writing Parquet file with outputstream
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Yup, that's great. I will update the PR when back from vacation.
Regards
JB
On 20 Apr 2018, at 02:26, Eugene Kirpichov <ki...@google.com> wrote:
>Very cool! JB, time to update your PR?
Re: Plan for a Parquet new release and writing Parquet file with outputstream
Posted by Eugene Kirpichov <ki...@google.com>.
Very cool! JB, time to update your PR?