Posted to dev@beam.apache.org by Sergio Fernández <wi...@apache.org> on 2016/06/14 10:38:55 UTC

newbie question about beam

Hi guys,

I'm a newbie in the Beam community, but as someone who has used DataFlow in
the past, I've been following the podling since it came to the ASF. I'm very
happy to see that 0.1.0-incubating is finally going out; congratulations
on such a great milestone.

I talked with some of you at the last ApacheCon, and it was good to learn
that the Python SDK was just a matter of time and should come to Beam at
some point. So, coming back to the original plans <
http://beam.incubator.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html>,
do you have a timeline for bringing the Python SDK to Beam?

I'd also like to ask how Beam plans to deal with the distribution of
resources across all nodes, something I know is not really clean with some
runners (e.g., Spark). More concretely, we're using Keras <
http://keras.io/>, a deep learning Python library that can run on top of
either TensorFlow or Theano. Historically, I know DataFlow and TensorFlow
have not been very compatible, but I wonder whether the project has already
discussed how to support running Keras (TensorFlow) tasks on Beam. For us
it is more about querying than training, so I'd like to know whether the
Beam Model could natively support the distribution of the models (sometimes
several GB).

Thanks in advance.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: newbie question about beam

Posted by Davor Bonaci <da...@google.com.INVALID>.

We are in the process of porting the Cloud Dataflow documentation to Beam,
so I'll give you a mix of Dataflow and Beam links.

FilesToStage is a pipeline option [1], [2]. Super-easy to use.
Side inputs are a ParDo concept [3].

If you hit any rough edges, please let us know -- I'd be glad to help!

[1]
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
[2]
https://beam.incubator.apache.org/javadoc/0.1.0-incubating/org/apache/beam/runners/dataflow/options/DataflowPipelineWorkerPoolOptions.html#getFilesToStage--
[3] https://cloud.google.com/dataflow/model/par-do#side-inputs

On Thu, Jun 16, 2016 at 1:40 AM, Sergio Fernández <wi...@apache.org> wrote:

> Hi Davor,
>
> On Thu, Jun 16, 2016 at 3:04 AM, Davor Bonaci <da...@google.com.invalid>
> wrote:
>
> > This is a really good question, Sergio. You got right away to the crux of
> > the problem -- how to express such a pattern in the Beam model.
> >
> > The answer depends on whether the data is static, e.g., whether it is
> > known at pipeline construction time / computed in the earlier stages of
> > the pipeline, or perhaps evolving during pipeline execution. I'll give a
> > high-level answer -- feel free to share more information about your use
> > case and we can drill into specific details.
> >
>
> Well, as I said, for us it is more interesting to use Beam at processing
> time than for training purposes. In the past we have experimented a bit
> with approaches like TensorSpark <https://github.com/adatao/tensorspark>,
> but the critical aspect is the exploitation of the models. Therefore we
> could assume the models are static data.
>
> > In the simplest case, Beam supports a "files to stage" concept if the
> > data is known a priori. In this case, runners will distribute the data to
> > all workers before computation starts, and your logic can depend on the
> > data being available locally on each worker.
> >
>
> Oh, cool. Something like that would be more than enough for now. Can you
> please point me to any documentation or code I could use to play with it?
>
> > If this is not sufficient, Beam's side inputs are the right primitive. We
> > support several access patterns for side inputs, including distributed
> > lookup and various types of caching. This can work really well,
> > particularly with a well-optimized runner.
> >
>
> Interesting... is there any (early) documentation (or code) about this
> feature?
>
> > Other alternatives typically include access to shared storage, which is
> > a lower-level approach and often requires more work.
>
> Sure, shared storage is always an option, but for many reasons I'd rather
> not resort to such an approach.
>
> Thanks so much for all the ideas and valuable discussions!
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernandez@redlink.co
> w: http://redlink.co
>

Re: newbie question about beam

Posted by Sergio Fernández <wi...@apache.org>.

Hi Davor,

On Thu, Jun 16, 2016 at 3:04 AM, Davor Bonaci <da...@google.com.invalid>
wrote:

> This is a really good question, Sergio. You got right away to the crux of
> the problem -- how to express such a pattern in the Beam model.
>
> The answer depends on whether the data is static, e.g., whether it is known
> at pipeline construction time / computed in the earlier stages of the
> pipeline, or perhaps evolving during pipeline execution. I'll give a
> high-level answer -- feel free to share more information about your use
> case and we can drill into specific details.
>

Well, as I said, for us it is more interesting to use Beam at processing time
than for training purposes. In the past we have experimented a bit with
approaches like TensorSpark <https://github.com/adatao/tensorspark>, but
the critical aspect is the exploitation of the models. Therefore we could
assume the models are static data.

> In the simplest case, Beam supports a "files to stage" concept if the data
> is known a priori. In this case, runners will distribute the data to all
> workers before computation starts, and your logic can depend on the data
> being available locally on each worker.
>

Oh, cool. Something like that would be more than enough for now. Can you
please point me to any documentation or code I could use to play with it?

> If this is not sufficient, Beam's side inputs are the right primitive. We
> support several access patterns for side inputs, including distributed
> lookup and various types of caching. This can work really well,
> particularly with a well-optimized runner.
>

Interesting... is there any (early) documentation (or code) about this
feature?

> Other alternatives typically include access to shared storage, which is a
> lower-level approach and often requires more work.

Sure, shared storage is always an option, but for many reasons I'd rather
not resort to such an approach.

Thanks so much for all the ideas and valuable discussions!

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: newbie question about beam

Posted by Davor Bonaci <da...@google.com.INVALID>.

This is a really good question, Sergio. You got right away to the crux of
the problem -- how to express such a pattern in the Beam model.

The answer depends on whether the data is static, e.g., whether it is known
at pipeline construction time / computed in the earlier stages of the
pipeline, or perhaps evolving during pipeline execution. I'll give a
high-level answer -- feel free to share more information about your use
case and we can drill into specific details.

In the simplest case, Beam supports a "files to stage" concept if the data
is known a priori. In this case, runners will distribute the data to all
workers before computation starts, and your logic can depend on the data
being available locally on each worker.
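
As a rough illustration of the worker-side half of this, here is a minimal
sketch in plain Python. It does not use any Beam API: the StagedModel class,
the JSON "model" format, and the file name are all made up for the example,
and a temporary directory stands in for wherever a runner would place staged
files. The only point is the pattern of loading the local file once and
reusing it for every element.

```python
import json
import os
import tempfile
import threading


class StagedModel:
    """Hypothetical helper: loads a local model file once, then reuses it."""

    def __init__(self, path):
        self._path = path
        self._model = None
        self._lock = threading.Lock()
        self.load_count = 0  # kept only so the sketch can be checked

    def get(self):
        # Check first without the lock; take the lock only for the first load.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    with open(self._path) as f:
                        self._model = json.load(f)
                    self.load_count += 1
        return self._model


def score(model, features):
    """Toy stand-in for inference: a weighted sum of the features."""
    return sum(model["weights"][name] * value for name, value in features.items())


def run_staged_demo():
    # The temporary directory stands in for the worker's staging location.
    with tempfile.TemporaryDirectory() as staging_dir:
        path = os.path.join(staging_dir, "model.json")
        with open(path, "w") as f:
            json.dump({"weights": {"a": 2.0, "b": -1.0}}, f)

        staged = StagedModel(path)
        elements = [{"a": 1.0, "b": 1.0}, {"a": 0.5, "b": 0.0}]
        results = [score(staged.get(), element) for element in elements]
        return results, staged.load_count
```

With a model of several GB, the single load per worker is what keeps the
per-element cost down; how the file actually reaches the worker is left to
the runner.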

If this is not sufficient, Beam's side inputs are the right primitive. We
support several access patterns for side inputs, including distributed
lookup and various types of caching. This can work really well,
particularly with a well-optimized runner.
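
To give a feel for the shape of this, the following is a toy simulation in
plain Python, not Beam code. The par_do function, the CachedLookup class, and
the sample data are hypothetical names invented for the sketch; a real runner
decides how the side input is materialized and cached. It shows a per-element
function receiving a side input alongside each main-input element, with
lookups cached in front of a backing store.

```python
from functools import lru_cache


def par_do(elements, fn, side_input):
    """Toy stand-in for a ParDo: call fn on each element with the side input."""
    return [fn(element, side_input) for element in elements]


class CachedLookup:
    """Hypothetical map-style view that caches lookups against a backing store."""

    def __init__(self, backing_store, maxsize=128):
        self.fetches = 0  # counts trips to the backing store

        @lru_cache(maxsize=maxsize)
        def fetch(key):
            self.fetches += 1
            return backing_store[key]

        self._fetch = fetch

    def __getitem__(self, key):
        return self._fetch(key)


def enrich(word, labels):
    """Per-element logic: pair the element with its value from the side input."""
    return (word, labels[word])


def run_side_input_demo():
    store = {"cat": "animal", "oak": "plant"}  # stands in for remote side data
    labels = CachedLookup(store)
    enriched = par_do(["cat", "oak", "cat"], enrich, labels)
    return enriched, labels.fetches
```

The repeated "cat" element is served from the cache, so the backing store is
consulted only once per distinct key.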

Other alternatives typically include access to shared storage, which is a
lower-level approach and often requires more work.

--

Back to Ismaël's question -- Beam is great at orchestrating such pipelines.
You can build a pipeline that prepares data for a custom system, manages
its invocation, and processes its output. PTransforms can encapsulate
arbitrary computation, including invocation of outside logic or an outside
system. It would be great to have a set of PTransform libraries that wrap
such computations.
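
A wrapper of that kind might look roughly like the sketch below. It is plain
Python and only mimics the shape of a composite transform: ExternalScoring
and fake_external_system are hypothetical names, and the "outside system" is
just a local function that returns strings, as an external process might.
The three steps are the ones named above: prepare the data, invoke the
system, process its output.

```python
class ExternalScoring:
    """Toy composite step: prepare input, invoke an outside system, parse output."""

    def __init__(self, invoke):
        # `invoke` is any callable standing in for the outside system.
        self._invoke = invoke

    def expand(self, records):
        # 1. Prepare the data in the form the outside system expects.
        prepared = [record.strip().lower() for record in records]
        # 2. Hand the whole batch to the outside system.
        raw_scores = self._invoke(prepared)
        # 3. Turn its raw output back into structured results.
        return [
            {"input": item, "score": float(raw)}
            for item, raw in zip(prepared, raw_scores)
        ]


def fake_external_system(batch):
    """Hypothetical outside system: scores each item by its length, as text."""
    return [str(len(item)) for item in batch]
```

Callers only see ExternalScoring(fake_external_system).expand(records); the
invocation details stay inside the wrapper, which is what would make such
wrappers reusable as a library.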

On Wed, Jun 15, 2016 at 2:45 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> I would say DSL + PTransform should work.
>
> But certainly some PoC to do ;)
>
> Regards
> JB
>

Re: newbie question about beam

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

I would say DSL + PTransform should work.

But certainly some PoC to do ;)

Regards
JB

On 06/15/2016 11:39 AM, Ismaël Mejía wrote:
> One interesting point that Sergio mentions, and that is getting lost in
> the discussion, is how to integrate other dataflow-style frameworks into
> Beam, e.g. TensorFlow. I am really curious about what the others have to
> say about this, since this is probably a question that will come up once
> more users write pipelines on Beam. Any ideas on this? Or is the solution
> just to write some 'integration PTransforms' and that's it?
>
> Regards,
> Ismaël
>
> ps. I forgot to say Hi and welcome Sergio :).

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: newbie question about beam

Posted by Ismaël Mejía <ie...@gmail.com>.

One interesting point that Sergio mentions, and that is getting lost in
the discussion, is how to integrate other dataflow-style frameworks into
Beam, e.g. TensorFlow. I am really curious about what the others have to
say about this, since this is probably a question that will come up once
more users write pipelines on Beam. Any ideas on this? Or is the solution
just to write some 'integration PTransforms' and that's it?

Regards,
Ismaël

ps. I forgot to say Hi and welcome Sergio :).


On Wed, Jun 15, 2016 at 11:18 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Not the Beam Model for sure (the Beam Model is about the pipeline design).
>
> The Beam Runner API can help there, but the final implementation is up to
> the runner itself.
>
> Regards
> JB

Re: newbie question about beam

Posted by Sergio Fernández <wi...@apache.org>.

On Wed, Jun 15, 2016 at 11:18 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Not the Beam Model for sure (the Beam Model is about the pipeline design).
>
> The Beam Runner API can help there, but the final implementation is up to
> the runner itself.
>

Right. I'll take a look at the Beam Runner API documentation and experiment
a bit with it. Thanks!








-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: newbie question about beam

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Not the Beam Model for sure (the Beam Model is about the pipeline design).

The Beam Runner API can help there, but the final implementation is up to
the runner itself.

Regards
JB

On 06/15/2016 10:18 AM, Sergio Fernández wrote:
> Hi Jean-Baptiste,
>
> On Tue, Jun 14, 2016 at 12:45 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>>
>> Welcome aboard, and good to discuss with you during ApacheCon.
>>
>
> It was nice to put faces to all of you ;-)
>
>
>> Distribution of the resources is a point related to the runner, and more
>> specifically to the execution environment of the runner. Each
>> runner/backend will implement its own logic.
>>
>
> Yes, I can understand. But I wonder if the Beam Model provides any
> primitive to deal with such aspects in an abstract way. I guess I'd need to
> go deeper into Beam to approach you with more concrete questions; so for
> now it's fine.
>
>> Regarding the Python SDK, we discussed that last week: it's on the
>> way. We should have the Python SDK very soon (we were busy with the first
>> release).
>
>
> Yep, I knew that was the plan. It's really cool to have it already in
> master for the next release :-)
>
> Thanks.

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: newbie question about beam

Posted by Sergio Fernández <wi...@apache.org>.

Hi Jean-Baptiste,

On Tue, Jun 14, 2016 at 12:45 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:
>
> Welcome aboard, and good to discuss with you during ApacheCon.
>

It was nice to put faces to all of you ;-)


> Distribution of the resources is a point related to the runner, and more
> specifically to the execution environment of the runner. Each
> runner/backend will implement its own logic.
>

Yes, I can understand. But I wonder if the Beam Model provides any
primitive to deal with such aspects in an abstract way. I guess I'd need to
go deeper into Beam to approach you with more concrete questions; so for
now it's fine.

> Regarding the Python SDK, we discussed that last week: it's on the
> way. We should have the Python SDK very soon (we were busy with the first
> release).


Yep, I knew that was the plan. It's really cool to have it already in
master for the next release :-)

Thanks.







-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: newbie question about beam

Posted by Sergio Fernández <wi...@apache.org>.

Awesome, Davor!

BTW, the ApacheCon CFP opened yesterday:
http://events.linuxfoundation.org/events/apache-big-data-europe/program/cfp
so it'd be great to have you presenting there this time ;-)


On Wed, Jun 15, 2016 at 2:51 AM, Davor Bonaci <da...@google.com.invalid>
wrote:

> Hi Sergio,
> It was great talking with you in Vancouver.
>
> As of today, the Python SDK is here, [1], [2]. Wasn't that fast enough ;)
>
> Davor
>
> [1] https://github.com/apache/incubator-beam/pull/461
> [2] https://github.com/apache/incubator-beam/tree/python-sdk/sdks/python
>

Re: newbie question about beam

Posted by Davor Bonaci <da...@google.com.INVALID>.

Hi Sergio,
It was great talking with you in Vancouver.

As of today, the Python SDK is here, [1], [2]. Wasn't that fast enough ;)

Davor

[1] https://github.com/apache/incubator-beam/pull/461
[2] https://github.com/apache/incubator-beam/tree/python-sdk/sdks/python

On Tue, Jun 14, 2016 at 3:45 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Sergio,
>
> Welcome aboard, and good to discuss with you during ApacheCon.
>
> Distribution of the resources is a point related to runner, and more
> specifically to the execution environment of the runner. Each
> runner/backend will implement their own logic.
>
> I don't know Keras enough to provide a strong advice.
>
> Regarding the Python SDK, we discussed about that last week: it's on the
> way. We should have the Python SDK very soon (we were busy with the first
> release).
>
> Regards
> JB
>
>
> On 06/14/2016 12:38 PM, Sergio Fernández wrote:
>
>> Hi guys,
>>
>> I'm newbie in the Beam community, but as someone who has used DataFlow in
>> the past I've been following the podling since you came to ASK. I'm very
>> happy to see that 0.1.0-incubating is finally going out, congratulations
>> for such great milestone.
>>
>> I discussed with some of you guys in the last ApacheCon, and for me was
>> good to know the Python SDK was just a matter of time and should come to
>> Beam at some point. So coming back to the original plans <
>> http://beam.incubator.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html>,
>> do you manage any timeline to bring the Python SDK to Beam?
>>
>> So I'd like to bring a question how Beam plans to deal with the
>> distribution of resources across all nodes, something I know it not really
>> clean with some runners (e.g., Spark). More concretely, we're using Keras <
>> http://keras.io/>, a deep learning Python library that is capable of
>> running on top of either TensorFlow or Theano. Historically I know DataFlow
>> and TensorFlow are not very compatible. But I wonder if the project has
>> already discussed how to support running Keras (TensorFlow) tasks on Beam.
>> For us is more for querying than for training, so I'd like to know if the
>> Beam Model could natively support the distribution of the models (sometimes
>> several GB).
>>
>> Thanks in advance.
>>
>> Cheers,
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: newbie question about beam

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Sergio,

Welcome aboard, and it was good to talk with you at ApacheCon.

Distribution of resources is a runner concern, and more
specifically a concern of the runner's execution environment. Each
runner/backend will implement its own logic.

I don't know Keras well enough to give strong advice.

Regarding the Python SDK, we discussed that last week: it's on the
way. We should have the Python SDK very soon (we were busy with the
first release).

Regards
JB

On 06/14/2016 12:38 PM, Sergio Fernández wrote:
> Hi guys,
>
> I'm newbie in the Beam community, but as someone who has used DataFlow in
> the past I've been following the podling since you came to ASK. I'm very
> happy to see that 0.1.0-incubating is finally going out, congratulations
> for such great milestone.
>
> I discussed with some of you guys in the last ApacheCon, and for me was
> good to know the Python SDK was just a matter of time and should come to
> Beam at some point. So coming back to the original plans <
> http://beam.incubator.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html>,
> do you manage any timeline to bring the Python SDK to Beam?
>
> So I'd like to bring a question how Beam plans to deal with the
> distribution of resources across all nodes, something I know it not really
> clean with some runners (e.g., Spark). More concretely, we're using Keras <
> http://keras.io/>, a deep learning Python library that is capable of
> running on top of either TensorFlow or Theano. Historically I know DataFlow
> and TensorFlow are not very compatible. But I wonder if the project has
> already discussed how to support running Keras (TensorFlow) tasks on Beam.
> For us is more for querying than for training, so I'd like to know if the
> Beam Model could natively support the distribution of the models (sometimes
> several GB).
>
> Thanks in advance.
>
> Cheers,
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com