You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Aitozi <gj...@gmail.com> on 2023/04/02 14:21:12 UTC

Re: [DISCUSS] Add support for Apache Arrow format

Hi all,
    Thanks for your input.

@Ran > However, as mentioned in the issue you listed, it may take a lot of work
and the community's consideration for integrating Arrow.

To clarify, this proposal solely aims to introduce flink-arrow as a new format,
similar to flink-csv and flink-protobuf. It will not impact the internal data
structure representation in Flink. For proof of concept, please refer to:
https://github.com/Aitozi/flink/commits/arrow-format.

@Martijn > I'm wondering if there's really much benefit for the Flink project to
add another file format, over properly supporting the format that we already
have in the project.

Maintain the format we already have and introduce new formats should be
orthogonal. The requirement of supporting arrow format originally observed in
our internal usage to deserialize the data(VectorSchemaRoot) from other storage
systems to flink internal RowData and serialize the flink internal RowData to
VectorSchemaRoot out to the storage system.  And the requirement from the
slack[1] is to support the arrow file format. Although, Arrow is not usually
used as the final disk storage format.  But it has a tendency to be used as the
inter-exchange format between different systems or temporary storage for
analysis due to its columnar format and can be memory mapped to other analysis
programs.

So, I think it's meaningful to support arrow formats in Flink.

@Jim >  If the Flink format interface is used there, then it may be useful to
consider Arrow along with other columnar formats.

I am not well-versed with the formats utilized in Paimon. Upon checking [2], it
appears that Paimon does not directly employ flink formats. Instead, it utilizes
FormatWriterFactory and FormatReaderFactory to handle data serialization and
deserialization. Therefore, I believe that the current work may not be
applicable for reuse in Paimon at this time.

Best,
Aitozi.

[1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
[2]: https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format

Jim Hughes <jh...@confluent.io.invalid> 于2023年3月31日周五 00:36写道：
>
> Hi all,
>
> How do Flink formats relate to or interact with Paimon (formerly
> Flink-Table-Store)?  If the Flink format interface is used there, then it
> may be useful to consider Arrow along with other columnar formats.
>
> Separately, from previous experience, I've seen the Arrow format be useful
> as an output format for clients to read efficiently.  Arrow does support
> returning batches of records, so there may be some options to use the
> format in a streaming situation where a sufficient collection of records
> can be gathered.
>
> Cheers,
>
> Jim
>
>
>
> On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <ma...@apache.org>
> wrote:
>
> > Hi,
> >
> > To be honest, I haven't seen that much demand for supporting the Arrow
> > format directly in Flink as a flink-format. I'm wondering if there's really
> > much benefit for the Flink project to add another file format, over
> > properly supporting the format that we already have in the project.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <ch...@gmail.com> wrote:
> >
> > > It is a good point that flink integrates apache arrow as a format.
> > > Arrow can take advantage of SIMD-specific or vectorized optimizations,
> > > which should be of great benefit to batch tasks.
> > > However, as mentioned in the issue you listed, it may take a lot of work
> > > and the community's consideration for integrating Arrow.
> > >
> > > I think you can try to make a simple poc for verification and some
> > specific
> > > plans.
> > >
> > >
> > > Best Regards,
> > > Ran Tao
> > >
> > >
> > > Aitozi <gj...@gmail.com> 于2023年3月29日周三 19:12写道：
> > >
> > > > Hi guys
> > > >      I'm opening this thread to discuss supporting the Apache Arrow
> > > format
> > > > in Flink.
> > > >      Arrow is a language-independent columnar memory format that has
> > > become
> > > > widely used in different systems, and It can also serve as an
> > > > inter-exchange format between other systems.
> > > > So, using it directly in the Flink system will be nice. We also
> > received
> > > > some requests from slack[1][2] and jira[3].
> > > >      In our company's internal usage, we have used flink-python
> > moudle's
> > > > ArrowReader and ArrowWriter to support this work. But it still can not
> > > > integrate with the current flink-formats framework closely.
> > > > So, I'd like to introduce the flink-arrow formats module to support the
> > > > arrow format naturally.
> > > >      Looking forward to some suggestions.
> > > >
> > > >
> > > > Best,
> > > > Aitozi
> > > >
> > > >
> > > > [1]:
> > > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > >
> > > > [2]:
> > > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > >
> > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929
> > > >
> > >
> >

Re: [DISCUSS] Add support for Apache Arrow format

Posted by Aitozi <gj...@gmail.com>.

> Which connectors would be commonly used when reading in Arrow format?
Filesystem?

Currently, yes. The better way is it can be combined used with
different connector,
but I have not figured out how to integrate the Arrow format
deserializer with the
`DecodingFormat` or `DeserializationSchema` interface. So, as a first
step, I want to introduce
it as the file bulk format.

Martijn Visser <ma...@apache.org> 于2023年4月12日周三 22:53写道：

>
> Which connectors would be commonly used when reading in Arrow format?
> Filesystem?
>
> On Wed, Apr 12, 2023 at 4:27 AM Jacky Lau <li...@gmail.com> wrote:
>
> > Hi
> >    I also think arrow format  will be useful when reading/writing with
> > message queue.
> >    Arrow defines a language-independent columnar memory format for flat and
> > hierarchical data, organized for efficient analytic operations on modern
> > hardware like CPUs and GPUs. The Arrow memory format also supports
> > zero-copy reads for lightning-fast data access without serialization
> > overhead. it will bring a lot.
> >    And we  may do some surveys, what other engines support like
> > spark/hive/presto and so on, how that supports and how it be used.
> >
> >    Best,
> >    Jacky.
> >
> > Aitozi <gj...@gmail.com> 于2023年4月2日周日 22:22写道：
> >
> > > Hi all,
> > >     Thanks for your input.
> > >
> > > @Ran > However, as mentioned in the issue you listed, it may take a lot
> > of
> > > work
> > > and the community's consideration for integrating Arrow.
> > >
> > > To clarify, this proposal solely aims to introduce flink-arrow as a new
> > > format,
> > > similar to flink-csv and flink-protobuf. It will not impact the internal
> > > data
> > > structure representation in Flink. For proof of concept, please refer to:
> > > https://github.com/Aitozi/flink/commits/arrow-format.
> > >
> > > @Martijn > I'm wondering if there's really much benefit for the Flink
> > > project to
> > > add another file format, over properly supporting the format that we
> > > already
> > > have in the project.
> > >
> > > Maintain the format we already have and introduce new formats should be
> > > orthogonal. The requirement of supporting arrow format originally
> > observed
> > > in
> > > our internal usage to deserialize the data(VectorSchemaRoot) from other
> > > storage
> > > systems to flink internal RowData and serialize the flink internal
> > RowData
> > > to
> > > VectorSchemaRoot out to the storage system.  And the requirement from the
> > > slack[1] is to support the arrow file format. Although, Arrow is not
> > > usually
> > > used as the final disk storage format.  But it has a tendency to be used
> > > as the
> > > inter-exchange format between different systems or temporary storage for
> > > analysis due to its columnar format and can be memory mapped to other
> > > analysis
> > > programs.
> > >
> > > So, I think it's meaningful to support arrow formats in Flink.
> > >
> > > @Jim >  If the Flink format interface is used there, then it may be
> > useful
> > > to
> > > consider Arrow along with other columnar formats.
> > >
> > > I am not well-versed with the formats utilized in Paimon. Upon checking
> > > [2], it
> > > appears that Paimon does not directly employ flink formats. Instead, it
> > > utilizes
> > > FormatWriterFactory and FormatReaderFactory to handle data serialization
> > > and
> > > deserialization. Therefore, I believe that the current work may not be
> > > applicable for reuse in Paimon at this time.
> > >
> > > Best,
> > > Aitozi.
> > >
> > > [1]:
> > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > [2]:
> > >
> > https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
> > >
> > > Jim Hughes <jh...@confluent.io.invalid> 于2023年3月31日周五 00:36写道：
> > > >
> > > > Hi all,
> > > >
> > > > How do Flink formats relate to or interact with Paimon (formerly
> > > > Flink-Table-Store)?  If the Flink format interface is used there, then
> > it
> > > > may be useful to consider Arrow along with other columnar formats.
> > > >
> > > > Separately, from previous experience, I've seen the Arrow format be
> > > useful
> > > > as an output format for clients to read efficiently.  Arrow does
> > support
> > > > returning batches of records, so there may be some options to use the
> > > > format in a streaming situation where a sufficient collection of
> > records
> > > > can be gathered.
> > > >
> > > > Cheers,
> > > >
> > > > Jim
> > > >
> > > >
> > > >
> > > > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <
> > martijnvisser@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > To be honest, I haven't seen that much demand for supporting the
> > Arrow
> > > > > format directly in Flink as a flink-format. I'm wondering if there's
> > > really
> > > > > much benefit for the Flink project to add another file format, over
> > > > > properly supporting the format that we already have in the project.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <ch...@gmail.com>
> > wrote:
> > > > >
> > > > > > It is a good point that flink integrates apache arrow as a format.
> > > > > > Arrow can take advantage of SIMD-specific or vectorized
> > > optimizations,
> > > > > > which should be of great benefit to batch tasks.
> > > > > > However, as mentioned in the issue you listed, it may take a lot of
> > > work
> > > > > > and the community's consideration for integrating Arrow.
> > > > > >
> > > > > > I think you can try to make a simple poc for verification and some
> > > > > specific
> > > > > > plans.
> > > > > >
> > > > > >
> > > > > > Best Regards,
> > > > > > Ran Tao
> > > > > >
> > > > > >
> > > > > > Aitozi <gj...@gmail.com> 于2023年3月29日周三 19:12写道：
> > > > > >
> > > > > > > Hi guys
> > > > > > >      I'm opening this thread to discuss supporting the Apache
> > Arrow
> > > > > > format
> > > > > > > in Flink.
> > > > > > >      Arrow is a language-independent columnar memory format that
> > > has
> > > > > > become
> > > > > > > widely used in different systems, and It can also serve as an
> > > > > > > inter-exchange format between other systems.
> > > > > > > So, using it directly in the Flink system will be nice. We also
> > > > > received
> > > > > > > some requests from slack[1][2] and jira[3].
> > > > > > >      In our company's internal usage, we have used flink-python
> > > > > moudle's
> > > > > > > ArrowReader and ArrowWriter to support this work. But it still
> > can
> > > not
> > > > > > > integrate with the current flink-formats framework closely.
> > > > > > > So, I'd like to introduce the flink-arrow formats module to
> > > support the
> > > > > > > arrow format naturally.
> > > > > > >      Looking forward to some suggestions.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Aitozi
> > > > > > >
> > > > > > >
> > > > > > > [1]:
> > > > > >
> > > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > > > >
> > > > > > > [2]:
> > > > > >
> > > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > > > >
> > > > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929
> > > > > > >
> > > > > >
> > > > >
> > >
> >

Re: [DISCUSS] Add support for Apache Arrow format

Posted by Martijn Visser <ma...@apache.org>.

Which connectors would be commonly used when reading in Arrow format?
Filesystem?

On Wed, Apr 12, 2023 at 4:27 AM Jacky Lau <li...@gmail.com> wrote:

> Hi
>    I also think arrow format  will be useful when reading/writing with
> message queue.
>    Arrow defines a language-independent columnar memory format for flat and
> hierarchical data, organized for efficient analytic operations on modern
> hardware like CPUs and GPUs. The Arrow memory format also supports
> zero-copy reads for lightning-fast data access without serialization
> overhead. it will bring a lot.
>    And we  may do some surveys, what other engines support like
> spark/hive/presto and so on, how that supports and how it be used.
>
>    Best,
>    Jacky.
>
> Aitozi <gj...@gmail.com> 于2023年4月2日周日 22:22写道：
>
> > Hi all,
> >     Thanks for your input.
> >
> > @Ran > However, as mentioned in the issue you listed, it may take a lot
> of
> > work
> > and the community's consideration for integrating Arrow.
> >
> > To clarify, this proposal solely aims to introduce flink-arrow as a new
> > format,
> > similar to flink-csv and flink-protobuf. It will not impact the internal
> > data
> > structure representation in Flink. For proof of concept, please refer to:
> > https://github.com/Aitozi/flink/commits/arrow-format.
> >
> > @Martijn > I'm wondering if there's really much benefit for the Flink
> > project to
> > add another file format, over properly supporting the format that we
> > already
> > have in the project.
> >
> > Maintain the format we already have and introduce new formats should be
> > orthogonal. The requirement of supporting arrow format originally
> observed
> > in
> > our internal usage to deserialize the data(VectorSchemaRoot) from other
> > storage
> > systems to flink internal RowData and serialize the flink internal
> RowData
> > to
> > VectorSchemaRoot out to the storage system.  And the requirement from the
> > slack[1] is to support the arrow file format. Although, Arrow is not
> > usually
> > used as the final disk storage format.  But it has a tendency to be used
> > as the
> > inter-exchange format between different systems or temporary storage for
> > analysis due to its columnar format and can be memory mapped to other
> > analysis
> > programs.
> >
> > So, I think it's meaningful to support arrow formats in Flink.
> >
> > @Jim >  If the Flink format interface is used there, then it may be
> useful
> > to
> > consider Arrow along with other columnar formats.
> >
> > I am not well-versed with the formats utilized in Paimon. Upon checking
> > [2], it
> > appears that Paimon does not directly employ flink formats. Instead, it
> > utilizes
> > FormatWriterFactory and FormatReaderFactory to handle data serialization
> > and
> > deserialization. Therefore, I believe that the current work may not be
> > applicable for reuse in Paimon at this time.
> >
> > Best,
> > Aitozi.
> >
> > [1]:
> https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > [2]:
> >
> https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
> >
> > Jim Hughes <jh...@confluent.io.invalid> 于2023年3月31日周五 00:36写道：
> > >
> > > Hi all,
> > >
> > > How do Flink formats relate to or interact with Paimon (formerly
> > > Flink-Table-Store)?  If the Flink format interface is used there, then
> it
> > > may be useful to consider Arrow along with other columnar formats.
> > >
> > > Separately, from previous experience, I've seen the Arrow format be
> > useful
> > > as an output format for clients to read efficiently.  Arrow does
> support
> > > returning batches of records, so there may be some options to use the
> > > format in a streaming situation where a sufficient collection of
> records
> > > can be gathered.
> > >
> > > Cheers,
> > >
> > > Jim
> > >
> > >
> > >
> > > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <
> martijnvisser@apache.org
> > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > To be honest, I haven't seen that much demand for supporting the
> Arrow
> > > > format directly in Flink as a flink-format. I'm wondering if there's
> > really
> > > > much benefit for the Flink project to add another file format, over
> > > > properly supporting the format that we already have in the project.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <ch...@gmail.com>
> wrote:
> > > >
> > > > > It is a good point that flink integrates apache arrow as a format.
> > > > > Arrow can take advantage of SIMD-specific or vectorized
> > optimizations,
> > > > > which should be of great benefit to batch tasks.
> > > > > However, as mentioned in the issue you listed, it may take a lot of
> > work
> > > > > and the community's consideration for integrating Arrow.
> > > > >
> > > > > I think you can try to make a simple poc for verification and some
> > > > specific
> > > > > plans.
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > Ran Tao
> > > > >
> > > > >
> > > > > Aitozi <gj...@gmail.com> 于2023年3月29日周三 19:12写道：
> > > > >
> > > > > > Hi guys
> > > > > >      I'm opening this thread to discuss supporting the Apache
> Arrow
> > > > > format
> > > > > > in Flink.
> > > > > >      Arrow is a language-independent columnar memory format that
> > has
> > > > > become
> > > > > > widely used in different systems, and It can also serve as an
> > > > > > inter-exchange format between other systems.
> > > > > > So, using it directly in the Flink system will be nice. We also
> > > > received
> > > > > > some requests from slack[1][2] and jira[3].
> > > > > >      In our company's internal usage, we have used flink-python
> > > > moudle's
> > > > > > ArrowReader and ArrowWriter to support this work. But it still
> can
> > not
> > > > > > integrate with the current flink-formats framework closely.
> > > > > > So, I'd like to introduce the flink-arrow formats module to
> > support the
> > > > > > arrow format naturally.
> > > > > >      Looking forward to some suggestions.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Aitozi
> > > > > >
> > > > > >
> > > > > > [1]:
> > > > >
> > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > > >
> > > > > > [2]:
> > > > >
> > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > > >
> > > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929
> > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] Add support for Apache Arrow format

Posted by Jacky Lau <li...@gmail.com>.

Hi
   I also think arrow format  will be useful when reading/writing with
message queue.
   Arrow defines a language-independent columnar memory format for flat and
hierarchical data, organized for efficient analytic operations on modern
hardware like CPUs and GPUs. The Arrow memory format also supports
zero-copy reads for lightning-fast data access without serialization
overhead. it will bring a lot.
   And we  may do some surveys, what other engines support like
spark/hive/presto and so on, how that supports and how it be used.

   Best,
   Jacky.

Aitozi <gj...@gmail.com> 于2023年4月2日周日 22:22写道：

> Hi all,
>     Thanks for your input.
>
> @Ran > However, as mentioned in the issue you listed, it may take a lot of
> work
> and the community's consideration for integrating Arrow.
>
> To clarify, this proposal solely aims to introduce flink-arrow as a new
> format,
> similar to flink-csv and flink-protobuf. It will not impact the internal
> data
> structure representation in Flink. For proof of concept, please refer to:
> https://github.com/Aitozi/flink/commits/arrow-format.
>
> @Martijn > I'm wondering if there's really much benefit for the Flink
> project to
> add another file format, over properly supporting the format that we
> already
> have in the project.
>
> Maintain the format we already have and introduce new formats should be
> orthogonal. The requirement of supporting arrow format originally observed
> in
> our internal usage to deserialize the data(VectorSchemaRoot) from other
> storage
> systems to flink internal RowData and serialize the flink internal RowData
> to
> VectorSchemaRoot out to the storage system.  And the requirement from the
> slack[1] is to support the arrow file format. Although, Arrow is not
> usually
> used as the final disk storage format.  But it has a tendency to be used
> as the
> inter-exchange format between different systems or temporary storage for
> analysis due to its columnar format and can be memory mapped to other
> analysis
> programs.
>
> So, I think it's meaningful to support arrow formats in Flink.
>
> @Jim >  If the Flink format interface is used there, then it may be useful
> to
> consider Arrow along with other columnar formats.
>
> I am not well-versed with the formats utilized in Paimon. Upon checking
> [2], it
> appears that Paimon does not directly employ flink formats. Instead, it
> utilizes
> FormatWriterFactory and FormatReaderFactory to handle data serialization
> and
> deserialization. Therefore, I believe that the current work may not be
> applicable for reuse in Paimon at this time.
>
> Best,
> Aitozi.
>
> [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> [2]:
> https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
>
> Jim Hughes <jh...@confluent.io.invalid> 于2023年3月31日周五 00:36写道：
> >
> > Hi all,
> >
> > How do Flink formats relate to or interact with Paimon (formerly
> > Flink-Table-Store)?  If the Flink format interface is used there, then it
> > may be useful to consider Arrow along with other columnar formats.
> >
> > Separately, from previous experience, I've seen the Arrow format be
> useful
> > as an output format for clients to read efficiently.  Arrow does support
> > returning batches of records, so there may be some options to use the
> > format in a streaming situation where a sufficient collection of records
> > can be gathered.
> >
> > Cheers,
> >
> > Jim
> >
> >
> >
> > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <martijnvisser@apache.org
> >
> > wrote:
> >
> > > Hi,
> > >
> > > To be honest, I haven't seen that much demand for supporting the Arrow
> > > format directly in Flink as a flink-format. I'm wondering if there's
> really
> > > much benefit for the Flink project to add another file format, over
> > > properly supporting the format that we already have in the project.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <ch...@gmail.com> wrote:
> > >
> > > > It is a good point that flink integrates apache arrow as a format.
> > > > Arrow can take advantage of SIMD-specific or vectorized
> optimizations,
> > > > which should be of great benefit to batch tasks.
> > > > However, as mentioned in the issue you listed, it may take a lot of
> work
> > > > and the community's consideration for integrating Arrow.
> > > >
> > > > I think you can try to make a simple poc for verification and some
> > > specific
> > > > plans.
> > > >
> > > >
> > > > Best Regards,
> > > > Ran Tao
> > > >
> > > >
> > > > Aitozi <gj...@gmail.com> 于2023年3月29日周三 19:12写道：
> > > >
> > > > > Hi guys
> > > > >      I'm opening this thread to discuss supporting the Apache Arrow
> > > > format
> > > > > in Flink.
> > > > >      Arrow is a language-independent columnar memory format that
> has
> > > > become
> > > > > widely used in different systems, and It can also serve as an
> > > > > inter-exchange format between other systems.
> > > > > So, using it directly in the Flink system will be nice. We also
> > > received
> > > > > some requests from slack[1][2] and jira[3].
> > > > >      In our company's internal usage, we have used flink-python
> > > moudle's
> > > > > ArrowReader and ArrowWriter to support this work. But it still can
> not
> > > > > integrate with the current flink-formats framework closely.
> > > > > So, I'd like to introduce the flink-arrow formats module to
> support the
> > > > > arrow format naturally.
> > > > >      Looking forward to some suggestions.
> > > > >
> > > > >
> > > > > Best,
> > > > > Aitozi
> > > > >
> > > > >
> > > > > [1]:
> > > >
> https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > >
> > > > > [2]:
> > > >
> https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > >
> > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929
> > > > >
> > > >
> > >
>