You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Khurrum Nasim <kh...@gmail.com> on 2016/11/30 12:04:25 UTC

Re: Add DistributedLog IO

 I just sent a pull request for adding a bounded source to Beam for reading
distributedlog streams - https://github.com/apache/incubator-beam/pull/1464

Appreciate any review comments.

- KN

On Wed, Aug 31, 2016 at 2:10 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Khurrum,
>
> I already replied in the Jira this morning.
>
> To write the IO, the first question is bounded or unbounded and which
> features you want to provide.
>
> An IO could be a wrapper to a simple DoFn.
>
> If you want provide advanced features like:
> - watermark/skew management for unbounded source
> - estimated size and split for bounded source
> then you can use the Source API.
>
> You can take a look on the existing IO:
> - JMS, Kafka, PubSub for unbounded
> - Bigtable, MongoDB for bounded
>
> We are preparing some documentation on the Beam website about that.
>
> In the mean time, you can take a look on the Dataflow Custom IO
> documentation:
>
> https://cloud.google.com/dataflow/model/custom-io-java
>
> It's basically the same as in Beam.
>
> Anyway, please, let me know, I would be more than happy to help you on
> this !
>
> I'm looking forward working with you on this !
>
> Regards
> JB
>
>
> On 08/31/2016 11:02 AM, Khurrum Nasim wrote:
>
>> Hello beam folks,
>>
>> We are evaluating a new solution to unify our streaming and batching data
>> pipeline, from storage, computing engine to programming model. The idea is
>> basically to implement the Kappa architecture, using DistributedLog as a
>> unified stream store for both streaming and batching, using Flink or Spark
>> (still debating) as the process engine, and using Beam as the programming
>> model.
>>
>> We'd like to contribute an IO connector to DistributedLog (both bounded
>> source/sink and unbounded source/sink).
>>
>> Is there any special instructions or best practise to add a new IO
>> connector? Any suggestion is very appreciated.
>>
>> The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>>
>> Also, /cc the distributed log team for any helps.
>>
>> KN
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: Add DistributedLog IO

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Thanks !

I will do the review today.

Regards
JB

On 11/30/2016 01:04 PM, Khurrum Nasim wrote:
>  I just sent a pull request for adding a bounded source to Beam for reading
> distributedlog streams - https://github.com/apache/incubator-beam/pull/1464
>
> Appreciate any review comments.
>
> - KN
>
> On Wed, Aug 31, 2016 at 2:10 AM, Jean-Baptiste Onofr� <jb...@nanthrax.net>
> wrote:
>
>> Hi Khurrum,
>>
>> I already replied in the Jira this morning.
>>
>> To write the IO, the first question is bounded or unbounded and which
>> features you want to provide.
>>
>> An IO could be a wrapper to a simple DoFn.
>>
>> If you want provide advanced features like:
>> - watermark/skew management for unbounded source
>> - estimated size and split for bounded source
>> then you can use the Source API.
>>
>> You can take a look on the existing IO:
>> - JMS, Kafka, PubSub for unbounded
>> - Bigtable, MongoDB for bounded
>>
>> We are preparing some documentation on the Beam website about that.
>>
>> In the mean time, you can take a look on the Dataflow Custom IO
>> documentation:
>>
>> https://cloud.google.com/dataflow/model/custom-io-java
>>
>> It's basically the same as in Beam.
>>
>> Anyway, please, let me know, I would be more than happy to help you on
>> this !
>>
>> I'm looking forward working with you on this !
>>
>> Regards
>> JB
>>
>>
>> On 08/31/2016 11:02 AM, Khurrum Nasim wrote:
>>
>>> Hello beam folks,
>>>
>>> We are evaluating a new solution to unify our streaming and batching data
>>> pipeline, from storage, computing engine to programming model. The idea is
>>> basically to implement the Kappa architecture, using DistributedLog as a
>>> unified stream store for both streaming and batching, using Flink or Spark
>>> (still debating) as the process engine, and using Beam as the programming
>>> model.
>>>
>>> We'd like to contribute an IO connector to DistributedLog (both bounded
>>> source/sink and unbounded source/sink).
>>>
>>> Is there any special instructions or best practise to add a new IO
>>> connector? Any suggestion is very appreciated.
>>>
>>> The jira is here: https://issues.apache.org/jira/browse/BEAM-607
>>>
>>> Also, /cc the distributed log team for any helps.
>>>
>>> KN
>>>
>>>
>> --
>> Jean-Baptiste Onofr�
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

-- 
Jean-Baptiste Onofr�
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com