Posted to dev@beam.apache.org by Sergio Fernández <wi...@apache.org> on 2016/11/22 09:39:20 UTC

DoFn relying on Microservices

Hi,

I'd like to resume the idea of having TensorFlow-based tasks running in a
Beam pipeline. So far the cleanest approach I can imagine would be to have
them running outside (Functions in GCP, Lambdas in AWS, microservices
generally speaking).

Therefore, does the current Beam model provide the notion of a DoFn which
actually runs externally?

Thanks in advance for the feedback.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: DoFn relying on Microservices

Posted by Sergio Fernández <wi...@apache.org>.
Hi JB,

On Fri, Nov 25, 2016 at 2:36 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:
>
> By the way, you can also use TensorFrame, which lets you use TensorFlow
> directly with Spark dataframes, with more direct access. I discussed that
> with Tim Hunter from Databricks, who's working on TensorFrame.
>

Yes, we have been discussing and experimenting a bit with TensorFrame. The
work is very interesting, although it has some limitations. And actually
that would mean taking a step back in our plan of getting away from the
specifics of the concrete processing engine.


> Back on Beam, what you could do:
>
> 1. You expose the service in a microservice container (for instance Apache
> Karaf ;))
> Then, in your pipeline, you have two options:
> 2.a. In your Beam pipeline, in a DoFn, in @Setup you create the REST
> client (using CXF, or whatever), and in @ProcessElement you call the
> service hosted by Karaf
>

Apart from using a different microservice infrastructure, I've already
started to play with DoFn and the concepts around it.


> 2.b. I also have a RestIO (source and sink) that can request a REST
> endpoint. However, for now, this IO acts as a pipeline endpoint
> (PTransform<PBegin, PCollection> or PTransform<PCollection, PDone>). In
> your case, since the service call is a step of your pipeline, ParDo(your
> DoFn) would be easier.
>

Yes, that's what I understood of the Beam design: IO is expected at the
head or the tail of the pipeline.
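
In code, the shape I have in mind would be something like this minimal
sketch (ClassifyFn is just a stub, the paths are placeholders, and the
exact TextIO method names vary across Beam versions):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ClassifyPipeline {

  // Stub standing in for the TensorFlow-backed classification step.
  static class ClassifyFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(c.element()); // the real classification would happen here
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))  // head: source IO
     .apply(ParDo.of(new ClassifyFn()))                  // middle: classification step
     .apply(TextIO.Write.to("gs://my-bucket/output"));   // tail: sink IO

    p.run();
  }
}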


> Is that what you mean by microservice?
>

Yep, exactly that.

Thanks so much!






On 11/25/2016 01:18 PM, Sergio Fernández wrote:

> Hi JB,
>
> On Tue, Nov 22, 2016 at 11:14 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>>
>> DoFn will execute per element (with optional hooks on StartBundle,
>> FinishBundle, and Teardown). It's basically the way it works in the IO
>> WriteFn: we create the connection in StartBundle and send each element
>> (within a batch) to the external resource.
>>
>> PTransform is maybe more flexible when interacting with "outside"
>> resources.
>>
>>
> Probably PTransform would be a better place. I'm still pretty new to some
> of the Beam terms and APIs.
>
> Do you have a use case, to be sure I understand?
>
>
> Yes. Well, it's far more complex, but for this question I can simplify it:
>
> We have a TensorFlow-based classifier. In our pipeline one step performs
> that classification of the data. Currently it's implemented as a Spark
> Function, because TensorFlow models can be directly embedded within
> pipelines using PySpark.
>
> Therefore I'm looking for the best option to move such a classification
> process one level up in abstraction with Beam, so I could make it
> portable. The first idea I'm exploring is relying on an external function
> (i.e., a microservice) that I'd need to scale up and down independently of
> the pipeline. So I'm more than welcome to discuss ideas ;-)
>
> Thanks.
>
> Cheers,
>
>
>
> On 11/22/2016 10:39 AM, Sergio Fernández wrote:
>>
>> Hi,
>>>
>>> I'd like to resume the idea of having TensorFlow-based tasks running in
>>> a Beam pipeline. So far the cleanest approach I can imagine would be to
>>> have them running outside (Functions in GCP, Lambdas in AWS,
>>> microservices generally speaking).
>>>
>>> Therefore, does the current Beam model provide the notion of a DoFn
>>> which actually runs externally?
>>>
>>> Thanks in advance for the feedback.
>>>
>>> Cheers,
>>>
>>>
>>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernandez@redlink.co
>> w: http://redlink.co
>>
>>
>
-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: DoFn relying on Microservices

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Sergio,

By the way, you can also use TensorFrame, which lets you use TensorFlow
directly with Spark dataframes, with more direct access. I discussed that
with Tim Hunter from Databricks, who's working on TensorFrame.

Back on Beam, what you could do:

1. You expose the service in a microservice container (for instance
Apache Karaf ;))
Then, in your pipeline, you have two options:
2.a. In your Beam pipeline, in a DoFn, in @Setup you create the REST
client (using CXF, or whatever), and in @ProcessElement you call the
service hosted by Karaf (see the sketch below)
2.b. I also have a RestIO (source and sink) that can request a REST
endpoint. However, for now, this IO acts as a pipeline endpoint
(PTransform<PBegin, PCollection> or PTransform<PCollection, PDone>). In
your case, since the service call is a step of your pipeline, ParDo(your
DoFn) would be easier.
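
For 2.a, here's a minimal sketch of such a DoFn. The endpoint URL is
hypothetical, and plain HttpURLConnection stands in for CXF or whatever
REST client you prefer:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.transforms.DoFn;

public class ClassifyViaServiceFn extends DoFn<String, String> {

  private transient URL serviceUrl;

  @Setup
  public void setup() throws Exception {
    // Create the reusable client state once per DoFn instance.
    serviceUrl = new URL("http://karaf-host:8181/classifier"); // hypothetical endpoint
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // POST the element to the service and emit the response.
    HttpURLConnection conn = (HttpURLConnection) serviceUrl.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(c.element().getBytes(StandardCharsets.UTF_8));
    }
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String response = in.readLine();
      if (response != null) {
        c.output(response);
      }
    }
  }
}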

Is that what you mean by microservice?

Regards
JB

On 11/25/2016 01:18 PM, Sergio Fernández wrote:
> Hi JB,
>
> On Tue, Nov 22, 2016 at 11:14 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>>
>> DoFn will execute per element (with optional hooks on StartBundle,
>> FinishBundle, and Teardown). It's basically the way it works in the IO
>> WriteFn: we create the connection in StartBundle and send each element
>> (within a batch) to the external resource.
>>
>> PTransform is maybe more flexible when interacting with "outside"
>> resources.
>>
>
> Probably PTransform would be a better place. I'm still pretty new to some
> of the Beam terms and APIs.
>
> Do you have a use case, to be sure I understand?
>
>
> Yes. Well, it's far more complex, but for this question I can simplify it:
>
> We have a TensorFlow-based classifier. In our pipeline one step performs
> that classification of the data. Currently it's implemented as a Spark
> Function, because TensorFlow models can be directly embedded within
> pipelines using PySpark.
>
> Therefore I'm looking for the best option to move such a classification
> process one level up in abstraction with Beam, so I could make it
> portable. The first idea I'm exploring is relying on an external function
> (i.e., a microservice) that I'd need to scale up and down independently of
> the pipeline. So I'm more than welcome to discuss ideas ;-)
>
> Thanks.
>
> Cheers,
>
>
>
>> On 11/22/2016 10:39 AM, Sergio Fernández wrote:
>>
>>> Hi,
>>>
>>> I'd like to resume the idea of having TensorFlow-based tasks running in
>>> a Beam pipeline. So far the cleanest approach I can imagine would be to
>>> have them running outside (Functions in GCP, Lambdas in AWS,
>>> microservices generally speaking).
>>>
>>> Therefore, does the current Beam model provide the notion of a DoFn
>>> which actually runs externally?
>>>
>>> Thanks in advance for the feedback.
>>>
>>> Cheers,
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernandez@redlink.co
>> w: http://redlink.co
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: DoFn relying on Microservices

Posted by Sergio Fernández <wi...@apache.org>.
Hi JB,

On Tue, Nov 22, 2016 at 11:14 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:
>
> DoFn will execute per element (with optional hooks on StartBundle,
> FinishBundle, and Teardown). It's basically the way it works in the IO
> WriteFn: we create the connection in StartBundle and send each element
> (within a batch) to the external resource.
>
> PTransform is maybe more flexible when interacting with "outside"
> resources.
>

Probably PTransform would be a better place. I'm still pretty new to some
of the Beam terms and APIs.

Do you have a use case, to be sure I understand?


Yes. Well, it's far more complex, but for this question I can simplify it:

We have a TensorFlow-based classifier. In our pipeline one step performs
that classification of the data. Currently it's implemented as a Spark
Function, because TensorFlow models can be directly embedded within
pipelines using PySpark.

Therefore I'm looking for the best option to move such a classification
process one level up in abstraction with Beam, so I could make it
portable. The first idea I'm exploring is relying on an external function
(i.e., a microservice) that I'd need to scale up and down independently of
the pipeline. So I'm more than welcome to discuss ideas ;-)

Thanks.

Cheers,



> On 11/22/2016 10:39 AM, Sergio Fernández wrote:
>
>> Hi,
>>
>> I'd like to resume the idea of having TensorFlow-based tasks running in a
>> Beam pipeline. So far the cleanest approach I can imagine would be to have
>> them running outside (Functions in GCP, Lambdas in AWS, microservices
>> generally speaking).
>>
>> Therefore, does the current Beam model provide the notion of a DoFn which
>> actually runs externally?
>>
>> Thanks in advance for the feedback.
>>
>> Cheers,
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernandez@redlink.co
> w: http://redlink.co
>

Re: DoFn relying on Microservices

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Sergio,

DoFn will execute per element (with optional hooks on StartBundle,
FinishBundle, and Teardown). It's basically the way it works in the IO
WriteFn: we create the connection in StartBundle and send each element
(within a batch) to the external resource.

PTransform is maybe more flexible when interacting with "outside"
resources.
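
To illustrate, here's a minimal sketch of that WriteFn pattern. MyClient
is a hypothetical client for the external resource, just to show the
lifecycle:

import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;

public class BatchingWriteFn extends DoFn<String, Void> {

  private static final int BATCH_SIZE = 100;

  private transient MyClient client; // hypothetical client for the resource
  private transient List<String> batch;

  @StartBundle
  public void startBundle() {
    // Open the connection once per bundle.
    client = MyClient.connect("external-resource:1234"); // hypothetical
    batch = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    batch.add(c.element());
    if (batch.size() >= BATCH_SIZE) {
      client.send(batch); // flush a full batch
      batch.clear();
    }
  }

  @FinishBundle
  public void finishBundle() {
    if (!batch.isEmpty()) {
      client.send(batch); // flush the remainder
    }
    client.close();
  }
}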

Do you have a use case, to be sure I understand?

Thanks !
Regards
JB

On 11/22/2016 10:39 AM, Sergio Fernández wrote:
> Hi,
>
> I'd like to resume the idea of having TensorFlow-based tasks running in a
> Beam pipeline. So far the cleanest approach I can imagine would be to have
> them running outside (Functions in GCP, Lambdas in AWS, microservices
> generally speaking).
>
> Therefore, does the current Beam model provide the notion of a DoFn which
> actually runs externally?
>
> Thanks in advance for the feedback.
>
> Cheers,
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com