You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@distributedlog.apache.org by Khurrum Nasim <kh...@gmail.com> on 2016/08/31 09:02:19 UTC

Add DistributedLog IO

Hello beam folks,

We are evaluating a new solution to unify our streaming and batching data
pipeline, from storage, computing engine to programming model. The idea is
basically to implement the Kappa architecture, using DistributedLog as a
unified stream store for both streaming and batching, using Flink or Spark
(still debating) as the process engine, and using Beam as the programming
model.

We'd like to contribute an IO connector to DistributedLog (both bounded
source/sink and unbounded source/sink).

Is there any special instructions or best practise to add a new IO
connector? Any suggestion is very appreciated.

The jira is here: https://issues.apache.org/jira/browse/BEAM-607

Also, /cc the distributed log team for any helps.

KN

Re: Add DistributedLog IO

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Thanks !

I will do the review today.

Regards
JB

On 11/30/2016 01:04 PM, Khurrum Nasim wrote:
>  I just sent a pull request for adding a bounded source to Beam for reading
> distributedlog streams - https://github.com/apache/incubator-beam/pull/1464
>
> Appreciate any review comments.
>
> - KN
>
> On Wed, Aug 31, 2016 at 2:10 AM, Jean-Baptiste Onofr� <jb...@nanthrax.net>
> wrote:
>
>> Hi Khurrum,
>>
>> I already replied in the Jira this morning.
>>
>> To write the IO, the first question is bounded or unbounded and which
>> features you want to provide.
>>
>> An IO could be a wrapper to a simple DoFn.
>>
>> If you want provide advanced features like:
>> - watermark/skew management for unbounded source
>> - estimated size and split for bounded source
>> then you can use the Source API.
>>
>> You can take a look on the existing IO:
>> - JMS, Kafka, PubSub for unbounded
>> - Bigtable, MongoDB for bounded
>>
>> We are preparing some documentation on the Beam website about that.
>>
>> In the mean time, you can take a look on the Dataflow Custom IO
>> documentation:
>>
>> https://cloud.google.com/dataflow/model/custom-io-java
>>
>> It's basically the same as in Beam.
>>
>> Anyway, please, let me know, I would be more than happy to help you on
>> this !
>>
>> I'm looking forward working with you on this !
>>
>> Regards
>> JB
>>
>>
>> On 08/31/2016 11:02 AM, Khurrum Nasim wrote:
>>
>>> Hello beam folks,
>>>
>>> We are evaluating a new solution to unify our streaming and batching data
>>> pipeline, from storage, computing engine to programming model. The idea is
>>> basically to implement the Kappa architecture, using DistributedLog as a
>>> unified stream store for both streaming and batching, using Flink or Spark
>>> (still debating) as the process engine, and using Beam as the programming
>>> model.
>>>
>>> We'd like to contribute an IO connector to DistributedLog (both bounded
>>> source/sink and unbounded source/sink).
>>>
>>> Is there any special instructions or best practise to add a new IO
>>> connector? Any suggestion is very appreciated.
>>>
>>> The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>>>
>>> Also, /cc the distributed log team for any helps.
>>>
>>> KN
>>>
>>>
>> --
>> Jean-Baptiste Onofr�
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

-- 
Jean-Baptiste Onofr�
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Add DistributedLog IO

Posted by Khurrum Nasim <kh...@gmail.com>.

 I just sent a pull request for adding a bounded source to Beam for reading
distributedlog streams - https://github.com/apache/incubator-beam/pull/1464

Appreciate any review comments.

- KN

On Wed, Aug 31, 2016 at 2:10 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Khurrum,
>
> I already replied in the Jira this morning.
>
> To write the IO, the first question is bounded or unbounded and which
> features you want to provide.
>
> An IO could be a wrapper to a simple DoFn.
>
> If you want provide advanced features like:
> - watermark/skew management for unbounded source
> - estimated size and split for bounded source
> then you can use the Source API.
>
> You can take a look on the existing IO:
> - JMS, Kafka, PubSub for unbounded
> - Bigtable, MongoDB for bounded
>
> We are preparing some documentation on the Beam website about that.
>
> In the mean time, you can take a look on the Dataflow Custom IO
> documentation:
>
> https://cloud.google.com/dataflow/model/custom-io-java
>
> It's basically the same as in Beam.
>
> Anyway, please, let me know, I would be more than happy to help you on
> this !
>
> I'm looking forward working with you on this !
>
> Regards
> JB
>
>
> On 08/31/2016 11:02 AM, Khurrum Nasim wrote:
>
>> Hello beam folks,
>>
>> We are evaluating a new solution to unify our streaming and batching data
>> pipeline, from storage, computing engine to programming model. The idea is
>> basically to implement the Kappa architecture, using DistributedLog as a
>> unified stream store for both streaming and batching, using Flink or Spark
>> (still debating) as the process engine, and using Beam as the programming
>> model.
>>
>> We'd like to contribute an IO connector to DistributedLog (both bounded
>> source/sink and unbounded source/sink).
>>
>> Is there any special instructions or best practise to add a new IO
>> connector? Any suggestion is very appreciated.
>>
>> The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>>
>> Also, /cc the distributed log team for any helps.
>>
>> KN
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: Add DistributedLog IO

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Khurrum,

I already replied in the Jira this morning.

To write the IO, the first question is bounded or unbounded and which 
features you want to provide.

An IO could be a wrapper to a simple DoFn.

If you want provide advanced features like:
- watermark/skew management for unbounded source
- estimated size and split for bounded source
then you can use the Source API.

You can take a look on the existing IO:
- JMS, Kafka, PubSub for unbounded
- Bigtable, MongoDB for bounded

We are preparing some documentation on the Beam website about that.

In the mean time, you can take a look on the Dataflow Custom IO 
documentation:

https://cloud.google.com/dataflow/model/custom-io-java

It's basically the same as in Beam.

Anyway, please, let me know, I would be more than happy to help you on 
this !

I'm looking forward working with you on this !

Regards
JB

On 08/31/2016 11:02 AM, Khurrum Nasim wrote:
> Hello beam folks,
>
> We are evaluating a new solution to unify our streaming and batching data
> pipeline, from storage, computing engine to programming model. The idea is
> basically to implement the Kappa architecture, using DistributedLog as a
> unified stream store for both streaming and batching, using Flink or Spark
> (still debating) as the process engine, and using Beam as the programming
> model.
>
> We'd like to contribute an IO connector to DistributedLog (both bounded
> source/sink and unbounded source/sink).
>
> Is there any special instructions or best practise to add a new IO
> connector? Any suggestion is very appreciated.
>
> The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>
> Also, /cc the distributed log team for any helps.
>
> KN
>

-- 
Jean-Baptiste Onofr�
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Add DistributedLog IO

Posted by Sijie Guo <si...@apache.org>.

Thanks Khurrum. I will try to take a look. (Although I am not an expert of
Beam.)

It would also be awesome if you can share your experience on building this
via soms blog posts, once you finished this.

Sijie

On Nov 30, 2016 4:02 AM, "Khurrum Nasim" <kh...@gmail.com> wrote:

> FYI
>
> I just sent a pull request for adding a bounded source to Beam for reading
> distributedlog streams. I am going to send out the pull request for adding
> an unbounded source and a sink after that.
>
> If you are interested in this and willing to help review it, this is the
> pull request - https://github.com/apache/incubator-beam/pull/1464
>
> - KN
>
> On Thu, Sep 1, 2016 at 9:41 PM, Sijie Guo <si...@apache.org> wrote:
>
>> Wow. This sounds interesting. Look forward to this. Let us know if you
>> need
>> any helps.
>>
>> - Sijie
>>
>> On Wed, Aug 31, 2016 at 2:02 AM, Khurrum Nasim <kh...@gmail.com>
>> wrote:
>>
>> > Hello beam folks,
>> >
>> > We are evaluating a new solution to unify our streaming and batching
>> data
>> > pipeline, from storage, computing engine to programming model. The idea
>> is
>> > basically to implement the Kappa architecture, using DistributedLog as a
>> > unified stream store for both streaming and batching, using Flink or
>> Spark
>> > (still debating) as the process engine, and using Beam as the
>> programming
>> > model.
>> >
>> > We'd like to contribute an IO connector to DistributedLog (both bounded
>> > source/sink and unbounded source/sink).
>> >
>> > Is there any special instructions or best practise to add a new IO
>> > connector? Any suggestion is very appreciated.
>> >
>> > The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>> >
>> > Also, /cc the distributed log team for any helps.
>> >
>> > KN
>> >
>>
>
>

Re: Add DistributedLog IO

Posted by Sijie Guo <si...@apache.org>.

Thanks Khurrum. I will try to take a look. (Although I am not an expert of
Beam.)

It would also be awesome if you can share your experience on building this
via soms blog posts, once you finished this.

Sijie

On Nov 30, 2016 4:02 AM, "Khurrum Nasim" <kh...@gmail.com> wrote:

> FYI
>
> I just sent a pull request for adding a bounded source to Beam for reading
> distributedlog streams. I am going to send out the pull request for adding
> an unbounded source and a sink after that.
>
> If you are interested in this and willing to help review it, this is the
> pull request - https://github.com/apache/incubator-beam/pull/1464
>
> - KN
>
> On Thu, Sep 1, 2016 at 9:41 PM, Sijie Guo <si...@apache.org> wrote:
>
>> Wow. This sounds interesting. Look forward to this. Let us know if you
>> need
>> any helps.
>>
>> - Sijie
>>
>> On Wed, Aug 31, 2016 at 2:02 AM, Khurrum Nasim <kh...@gmail.com>
>> wrote:
>>
>> > Hello beam folks,
>> >
>> > We are evaluating a new solution to unify our streaming and batching
>> data
>> > pipeline, from storage, computing engine to programming model. The idea
>> is
>> > basically to implement the Kappa architecture, using DistributedLog as a
>> > unified stream store for both streaming and batching, using Flink or
>> Spark
>> > (still debating) as the process engine, and using Beam as the
>> programming
>> > model.
>> >
>> > We'd like to contribute an IO connector to DistributedLog (both bounded
>> > source/sink and unbounded source/sink).
>> >
>> > Is there any special instructions or best practise to add a new IO
>> > connector? Any suggestion is very appreciated.
>> >
>> > The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>> >
>> > Also, /cc the distributed log team for any helps.
>> >
>> > KN
>> >
>>
>
>

Re: Add DistributedLog IO

Posted by Khurrum Nasim <kh...@gmail.com>.

FYI

I just sent a pull request for adding a bounded source to Beam for reading
distributedlog streams. I am going to send out the pull request for adding
an unbounded source and a sink after that.

If you are interested in this and willing to help review it, this is the
pull request - https://github.com/apache/incubator-beam/pull/1464

- KN

On Thu, Sep 1, 2016 at 9:41 PM, Sijie Guo <si...@apache.org> wrote:

> Wow. This sounds interesting. Look forward to this. Let us know if you need
> any helps.
>
> - Sijie
>
> On Wed, Aug 31, 2016 at 2:02 AM, Khurrum Nasim <kh...@gmail.com>
> wrote:
>
> > Hello beam folks,
> >
> > We are evaluating a new solution to unify our streaming and batching data
> > pipeline, from storage, computing engine to programming model. The idea
> is
> > basically to implement the Kappa architecture, using DistributedLog as a
> > unified stream store for both streaming and batching, using Flink or
> Spark
> > (still debating) as the process engine, and using Beam as the programming
> > model.
> >
> > We'd like to contribute an IO connector to DistributedLog (both bounded
> > source/sink and unbounded source/sink).
> >
> > Is there any special instructions or best practise to add a new IO
> > connector? Any suggestion is very appreciated.
> >
> > The jira is here: https://issues.apache.org/jira/browse/BEAM-607
> >
> > Also, /cc the distributed log team for any helps.
> >
> > KN
> >
>

Re: Add DistributedLog IO

Posted by Khurrum Nasim <kh...@gmail.com>.

FYI

I just sent a pull request for adding a bounded source to Beam for reading
distributedlog streams. I am going to send out the pull request for adding
an unbounded source and a sink after that.

If you are interested in this and willing to help review it, this is the
pull request - https://github.com/apache/incubator-beam/pull/1464

- KN

On Thu, Sep 1, 2016 at 9:41 PM, Sijie Guo <si...@apache.org> wrote:

> Wow. This sounds interesting. Look forward to this. Let us know if you need
> any helps.
>
> - Sijie
>
> On Wed, Aug 31, 2016 at 2:02 AM, Khurrum Nasim <kh...@gmail.com>
> wrote:
>
> > Hello beam folks,
> >
> > We are evaluating a new solution to unify our streaming and batching data
> > pipeline, from storage, computing engine to programming model. The idea
> is
> > basically to implement the Kappa architecture, using DistributedLog as a
> > unified stream store for both streaming and batching, using Flink or
> Spark
> > (still debating) as the process engine, and using Beam as the programming
> > model.
> >
> > We'd like to contribute an IO connector to DistributedLog (both bounded
> > source/sink and unbounded source/sink).
> >
> > Is there any special instructions or best practise to add a new IO
> > connector? Any suggestion is very appreciated.
> >
> > The jira is here: https://issues.apache.org/jira/browse/BEAM-607
> >
> > Also, /cc the distributed log team for any helps.
> >
> > KN
> >
>

Re: Add DistributedLog IO

Posted by Sijie Guo <si...@apache.org>.

Wow. This sounds interesting. Look forward to this. Let us know if you need
any helps.

- Sijie

On Wed, Aug 31, 2016 at 2:02 AM, Khurrum Nasim <kh...@gmail.com>
wrote:

> Hello beam folks,
>
> We are evaluating a new solution to unify our streaming and batching data
> pipeline, from storage, computing engine to programming model. The idea is
> basically to implement the Kappa architecture, using DistributedLog as a
> unified stream store for both streaming and batching, using Flink or Spark
> (still debating) as the process engine, and using Beam as the programming
> model.
>
> We'd like to contribute an IO connector to DistributedLog (both bounded
> source/sink and unbounded source/sink).
>
> Is there any special instructions or best practise to add a new IO
> connector? Any suggestion is very appreciated.
>
> The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>
> Also, /cc the distributed log team for any helps.
>
> KN
>