Posted to dev@beam.apache.org by Davor Bonaci <da...@google.com.INVALID> on 2016/04/28 19:03:09 UTC

IO timelines (Was: How to read/write avro data using FlinkKafka Consumer/Producer)

[ Moving over to the dev@ list ]

I think we should be aiming a little higher than "trying out Beam" ;)

Beam SDK currently has built-in IOs for Kafka, as well as for all important
Google Cloud Platform services. Additionally, there are pull requests for
Firebase and Cassandra. This is not bad, particularly taking into account
that we have APIs for users to develop their own IO connectors. Of course,
there's a long way to go, but there should *not* be any users that are
blocked or scenarios that are impossible.

In terms of runner support, the Cloud Dataflow runner supports all IOs,
including any user-written ones. Other runners don't support them as
extensively, but this is a high-priority item to address.

In my mind, we should strive to address the following:

   - Complete conversion of existing IOs to the Source / Sink API. ETA: a
   week or two for full completion.
   - Make sure Spark & Flink runners fully support Source / Sink API, and
   that ties into the new Runner / Fn API discussion.
   - Increase the set of built-in IOs. No ETA; iterative process over time.
   There are 2 pending pull requests, others in development.

I'm hopeful we can address all of these items in a relatively short period
of time -- in a few months or so -- and likely before we can call any
release "stable". (This is why the new Runner / Fn API discussions are so
important.)

In summary, in my mind, "long run" here means "< few months".

---------- Forwarded message ----------
From: Maximilian Michels <mx...@apache.org>
Date: Thu, Apr 28, 2016 at 3:20 AM
Subject: Re: How to read/write avro data using FlinkKafka Consumer/Producer
(Beam Pipeline) ?
To: user@beam.incubator.apache.org

On Wed, Apr 27, 2016 at 11:12 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:
> generally speaking, we have to check that all runners work fine with the
provided IO. I don't think it's a good idea that the runners themselves
implement any IO: they should use "out of the box" IO.

In the long run, a big yes, and I'd like to help make it possible!
However, there is still a gap between what Beam and its Runners
provide and what users want to do. For the time being, I think the
solution we have is fine. It gives users the option to try out Beam
with sources and sinks that they expect to be available in streaming
systems.

Re: IO timelines (Was: How to read/write avro data using FlinkKafka Consumer/Producer)

Posted by Maximilian Michels <mx...@apache.org>.
Hi Davor, hi Dan,

The discussion was partly held in JB's thread about "useNative()" for
Beam transforms. I think we reached consensus that we prefer portable,
out-of-the-box Beam IO over runner-specific IO.

On Thu, Apr 28, 2016 at 10:03 AM, Davor Bonaci <da...@google.com.invalid>
wrote:
>
> Of course, there's a long way to go, but there should *not* be any users that are
> blocked or scenarios that are impossible.


Exactly, while we are transitioning, let's not make any scenarios
impossible. For example, if users want to use Kafka 8, they should be
able to use the Flink Consumer/Producer as long as there is no native
support yet. Exchanging sources/sinks in Beam programs is relatively
easy to do; users still get to keep all the nice semantics. Once we
have decent IO support, these wrappers should go away.
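
The point about exchanging sources/sinks being easy can be sketched as
follows. This is a minimal, hypothetical sketch (the `Source`, `Pipeline`,
`WrappedLegacySource`, and `NativeSource` types are simplified stand-ins,
not the real Beam or Flink APIs): because the pipeline is written against a
source contract, retiring a runner-specific wrapper in favor of a native
connector is a change at a single call site.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified source contract -- a stand-in, not the Beam API.
interface Source<T> {
    Iterator<T> open();
}

// The pipeline logic is written once against the Source contract ...
class Pipeline {
    static <T, R> List<R> run(Source<T> source, Function<T, R> transform) {
        List<R> out = new ArrayList<>();
        Iterator<T> it = source.open();
        while (it.hasNext()) {
            out.add(transform.apply(it.next()));
        }
        return out;
    }
}

// ... so the runner-specific wrapper (e.g. around a Flink consumer) ...
class WrappedLegacySource implements Source<String> {
    public Iterator<String> open() { return Arrays.asList("a", "b").iterator(); }
}

// ... and a native connector are interchangeable behind the same contract.
class NativeSource implements Source<String> {
    public Iterator<String> open() { return Arrays.asList("a", "b").iterator(); }
}

public class SwapSketch {
    public static void main(String[] args) {
        // Only this constructor call changes when the wrapper goes away.
        List<String> result = Pipeline.run(new NativeSource(), String::toUpperCase);
        System.out.println(result);
    }
}
```

The rest of the pipeline, and its semantics, are untouched by the swap.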

>  - Complete conversion of existing IOs to the Source / Sink API. ETA: a
>   week or two for full completion.


Which ones have not been converted yet? At first glance, I see AvroIO,
BigQueryIO, and PubsubIO. Only sources should be affected because they
have a dedicated interface; sinks are ParDos.
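
The "sinks are ParDos" point can be illustrated with a small, hypothetical
sketch (the `CollectingSink` class and the use of `Consumer` are simplified
stand-ins, not Beam's DoFn/ParDo API): a sink needs no dedicated interface
because a write is just a per-element operation applied over the collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Stand-in for a sink: a per-element write function, analogous to a DoFn.
class CollectingSink implements Consumer<String> {
    final List<String> written = new ArrayList<>();

    @Override
    public void accept(String element) {
        written.add(element);  // stand-in for writing to an external system
    }
}

public class SinkAsParDoSketch {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        CollectingSink sink = new CollectingSink();
        // "ParDo": apply the per-element write over the collection.
        input.forEach(sink);
        System.out.println(sink.written.size());
    }
}
```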

>   - Make sure Spark & Flink runners fully support Source / Sink API, and
>   that ties into the new Runner / Fn API discussion.


Yes, it shouldn't be hard to fix those, and we will do so as soon as possible.

Cheers,
Max


Re: IO timelines (Was: How to read/write avro data using FlinkKafka Consumer/Producer)

Posted by Dan Halperin <dh...@google.com.INVALID>.
Coming back from vacation, sorry for delay.

I agree with Davor. While it's nice to have an `UnboundedFlinkSource`
<https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedFlinkSource.java>
wrapper that can be used to convert a Flink-specific source into one that
can be used in Beam, I'd hope this is a temporary stop-gap until we have
Beam-API versions of these for all important connectors. If a
Flink-specific source has major advantages over one that can be implemented
in Beam, we should explore why and see whether we have gaps in our APIs.
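
The wrapper idea above is essentially the adapter pattern. Here is a minimal
sketch under heavy simplification: `FlinkStyleSource` and `BeamStyleSource`
are hypothetical stand-ins, since the real `SourceFunction` and
`UnboundedSource` APIs carry much more (checkpointing, watermarks,
splitting).

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a Flink-style source contract.
interface FlinkStyleSource<T> {
    Iterator<T> run();
}

// Hypothetical stand-in for a Beam-style source contract.
interface BeamStyleSource<T> {
    Iterator<T> createReader();
}

// The wrapper: adapt one contract to the other without
// reimplementing the underlying connector.
class FlinkSourceAdapter<T> implements BeamStyleSource<T> {
    private final FlinkStyleSource<T> wrapped;

    FlinkSourceAdapter(FlinkStyleSource<T> wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public Iterator<T> createReader() {
        return wrapped.run();  // delegate to the wrapped source
    }
}

public class AdapterSketch {
    public static void main(String[] args) {
        FlinkStyleSource<String> flinkSource =
                () -> Arrays.asList("rec1", "rec2").iterator();
        BeamStyleSource<String> beamSource = new FlinkSourceAdapter<>(flinkSource);
        Iterator<String> reader = beamSource.createReader();
        System.out.println(reader.next());
    }
}
```

The stop-gap nature follows from the sketch: the adapter adds no capability
of its own, so once a native connector exists it can replace the wrapped
source wholesale.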

(And similar analogs for Spark and for Sinks, Transforms libraries, and
everything else).

Thanks!
Dan
