Posted to dev@beam.apache.org by Jianfeng Qian <qi...@outlook.com> on 2016/04/20 04:23:48 UTC

Will Beam provide a machine learning API in the future?

Hi,

Machine learning is becoming more and more popular today: Spark's MLlib, FlinkML, and Google Cloud Machine Learning.

Will Beam provide a machine learning API in the future?

Is anyone interested in working on it?



Best Regards,

Jianfeng


Re: Will Beam provide a machine learning API in the future?

Posted by Simone Robutti <si...@radicalbit.io>.
I'm new to the Beam project, so ignore me if I repeat things that were
already discussed. I'd like to give my two cents from the last few months,
which I spent scouting different solutions for ML on a distributed
processing engine, so that you have another perspective on the subject.

The feeling, shared by many other ML engineers, is that the open source big
data ecosystem is reinventing the wheel over and over. The same
algorithms are rewritten for every platform, and they fall short because
their quality is not comparable with proper ML-oriented solutions. SparkML
was born as a placeholder and evolved into a decent library just because
interest in Spark skyrocketed. But it wasn't their main focus, it wasn't
the best tool for doing proper ML, and it aimed for a "batteries
included" approach that obviously is not enough for most applications. The
same is true for FlinkML, which is at a very early stage right now.

On the other hand, many libraries and platforms were born to achieve
the same results and appeal to the same audience (big enterprises with
existing infrastructure that want an *easy* way to do ML at scale): Mahout
was one of the first in the Apache foundation, but many others followed.
Most of these libraries were bound to a processing engine (MapReduce or
later Spark) and were really hard to port. A good approach, and one that
would be interesting for Beam, is that of Samsara, a distributed matrix
operations library: it has a set of algorithms expressed in terms of
"simple" primitives that are directly implemented on Spark, Flink,
MapReduce and so on. The others suffer from portability issues and require
a long integration effort.
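
To make the "algorithms over simple primitives" idea concrete, here is a
minimal sketch in plain Python. It is not Samsara's or Beam's API: the
Backend class, its map_reduce primitive, and both backends are hypothetical
names invented for illustration. The point is only that column_means is
written once and runs unchanged on either backend.

```python
from abc import ABC, abstractmethod
from functools import reduce


class Backend(ABC):
    """Hypothetical primitive set that every engine would implement."""

    @abstractmethod
    def map_reduce(self, rows, mapper, reducer):
        """Apply mapper to each row and combine the results with reducer."""


class LocalBackend(Backend):
    """Runs everything in a single pass over an in-memory list."""

    def map_reduce(self, rows, mapper, reducer):
        return reduce(reducer, (mapper(r) for r in rows))


class PartitionedBackend(Backend):
    """Simulates a distributed engine: reduce per partition, then merge."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def map_reduce(self, rows, mapper, reducer):
        # Split the rows round-robin into partitions (some may be empty).
        parts = [rows[i::self.num_partitions] for i in range(self.num_partitions)]
        # Reduce each non-empty partition on its own, then merge the partials.
        partials = [reduce(reducer, (mapper(r) for r in p)) for p in parts if p]
        return reduce(reducer, partials)


def column_means(backend, rows):
    """Algorithm written once, only in terms of the backend primitive.

    Assumes rows is non-empty and every row has the same length.
    """
    def mapper(row):
        return (list(row), 1)  # (per-column sums, row count)

    def reducer(a, b):
        return ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1])

    sums, count = backend.map_reduce(rows, mapper, reducer)
    return [s / count for s in sums]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [6.0, 40.0]]
local_result = column_means(LocalBackend(), data)
partitioned_result = column_means(PartitionedBackend(3), data)
```

The reducer has to be associative for the partitioned backend to agree with
the local one; that constraint is the kind of contract a shared primitive
set would need to state explicitly.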

Another approach is that of H2O: they built their own cloud, with their
own KV storage and communication protocol. They probably didn't mean for it
to be integrated with other platforms in the first place, but they released
Sparkling Water, which basically builds an H2O cluster inside Spark by
instantiating their clients inside Spark's executors. This is not a simple
piece of software, and most of the complexity comes from the translation
back and forth between Spark and H2O data structures.

So, this is all to say that I see a lot of partial solutions to the
problem: platforms try to do the work of ML libraries and ML engines,
while ML libraries and engines try to integrate with existing
distributed processing software. Both try to fill the gap by doing the work
of other software instead of doing what they do best.

I have huge expectations for an ML library/DSL built on Beam because it
has the potential to achieve the separation that is required for a clean
and rational big data ecosystem. It should offer enough primitives (linalg
stuff, optimization algorithms, data structures) and tooling to let people
contribute their own algorithms to Beam rather than to native ML libraries
like SparkML. As I said before, I believe integrating with native libraries
would be a big mistake: it would be really hard to sell in an enterprise
environment, it doesn't really give added value to the user, and it would
probably be a pain to find a unifying model across different libraries.




2016-04-21 1:30 GMT+02:00 Davor Bonaci <da...@google.com.invalid>:

> It seems like there's a lot of community interest in ML running on Beam --
> definitely something that we should eventually have in Beam.
>
> Hopefully, we'll be able to coordinate individual efforts to come up with a
> unified API. It fits right in with Beam goals to have a library of ML
> PTransforms that isn't tied to any particular ML backend. Then, users will
> have portability benefits and will be able to make the right choice for
> them for each execution.
>
> Overall, I think this is a complex feature with a really big impact and
> benefit to Beam. As such, it would be great to write up and discuss
> architecture and design in detail first.
>
> --
>
> In terms of specific questions, a library of PTransforms would probably be
> a better start than a DSL (but that doesn't exclude the possibility of a
> DSL some day). There would be a default implementation, and then each
> runner could override it, as appropriate.
>
> I think Simone's warning should be taken into account, however. Definitely
> something to have in mind as the design progresses.
>

Re: Will Beam provide a machine learning API in the future?

Posted by Davor Bonaci <da...@google.com.INVALID>.
It seems like there's a lot of community interest in ML running on Beam --
definitely something that we should eventually have in Beam.

Hopefully, we'll be able to coordinate individual efforts to come up with a
unified API. It fits right in with Beam goals to have a library of ML
PTransforms that isn't tied to any particular ML backend. Then, users will
have portability benefits and will be able to make the right choice for
them for each execution.

Overall, I think this is a complex feature with a really big impact and
benefit to Beam. As such, it would be great to write up and discuss
architecture and design in detail first.

--

In terms of specific questions, a library of PTransforms would probably be
a better start than a DSL (but that doesn't exclude the possibility of a
DSL some day). There would be a default implementation, and then each
runner could override it, as appropriate.
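
A minimal sketch of the "default implementation, overridable per runner"
idea, in plain Python rather than the Beam SDK. MeanCenter, Runner, and
native_mean_center are hypothetical names made up for this example; nothing
here is an existing Beam class or a settled design.

```python
import statistics


class MeanCenter:
    """Hypothetical engine-neutral ML transform with a portable default."""

    name = "MeanCenter"

    def default_apply(self, values):
        # Portable implementation that any runner can fall back to.
        mean = sum(values) / len(values)
        return [v - mean for v in values]


def native_mean_center(values):
    """Stand-in for an engine-specific implementation of the same transform."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


class Runner:
    """Hypothetical runner holding optional overrides keyed by transform name."""

    def __init__(self, name, overrides=None):
        self.name = name
        self.overrides = overrides or {}

    def uses_override(self, transform):
        return transform.name in self.overrides

    def run(self, transform, values):
        # Prefer the runner's own implementation, else use the default.
        impl = self.overrides.get(transform.name, transform.default_apply)
        return impl(values)


portable_runner = Runner("portable")
native_runner = Runner("native", overrides={"MeanCenter": native_mean_center})

values = [2.0, 4.0, 6.0]
portable_out = portable_runner.run(MeanCenter(), values)
native_out = native_runner.run(MeanCenter(), values)
```

Both runners produce the same centered values here; in a real design the
hard part would be guaranteeing that an override stays semantically
equivalent to the default.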

I think Simone's warning should be taken into account, however. Definitely
something to have in mind as the design progresses.

Re: Will Beam provide a machine learning API in the future?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

At first glance, I would say a first step would be to create ML
functions "translated" for the runtime (Spark, Flink, ...).

At the same time, we can create a generic Beam ML API that we
translate to the corresponding ML backend (Mahout, Spark ML, ...).

Regards
JB

On 04/20/2016 11:27 AM, Jianfeng Qian wrote:
> Hi JB,
> I am quite interested in that and I would like to contribute to this part.
> Does the ML DSL mean we should provide a framework or API to support Spark's MLlib, FlinkML, and Google Cloud Machine Learning?
> Or implement algorithms that run on Spark, Flink, and Dataflow?
>
> Best Regards,
> Jianfeng

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Will Beam provide a machine learning API in the future?

Posted by Jianfeng Qian <qi...@outlook.com>.
Hi JB,
I am quite interested in that and I would like to contribute to this part.
Does the ML DSL mean we should provide a framework or API to support Spark's MLlib, FlinkML, and Google Cloud Machine Learning?
Or implement algorithms that run on Spark, Flink, and Dataflow?

Best Regards,
Jianfeng
________________________________________
From: Jean-Baptiste Onofré <jb...@nanthrax.net>
Sent: Wednesday, April 20, 2016 1:04 AM
To: dev@beam.incubator.apache.org
Subject: Re: Will Beam provide a machine learning API in the future?

Hi Simone,

It sounds good. It's basically what I meant by an ML-oriented DSL.

We love contributions, if you want to work on that with me ;)

Regards
JB


Re: Will Beam provide a machine learning API in the future?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Simone,

It sounds good. It's basically what I meant by an ML-oriented DSL.

We love contributions, if you want to work on that with me ;)

Regards
JB

On 04/20/2016 08:27 AM, Simone Robutti wrote:
> This would be an interesting feature. We are looking forward to developing
> ML integrations on Beam and we are watching what's going on. The idea of ML
> as a higher-level API, or as a proper ML library written in Beam (pretty
> much what SAMOA does), may be interesting, but beware of offering a common
> layer across different algorithmic implementations: assuming that they are
> consistent in nature and implementation is a big assumption, and it could
> lead to a lot of design problems for you and usability problems for the
> end user.


Re: Will Beam provide a machine learning API in the future?

Posted by Simone Robutti <si...@radicalbit.io>.
This would be an interesting feature. We are looking forward to developing
ML integrations on Beam and we are watching what's going on. The idea of ML
as a higher-level API, or as a proper ML library written in Beam (pretty
much what SAMOA does), may be interesting, but beware of offering a common
layer across different algorithmic implementations: assuming that they are
consistent in nature and implementation is a big assumption, and it could
lead to a lot of design problems for you and usability problems for the
end user.
On 20 Apr 2016 06:16 AM, "Jean-Baptiste Onofré" <jb...@nanthrax.net>
wrote:

> Hi Jianfeng,
>
> As you can see in the "Technical Vision" document:
>
>
> https://drive.google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing
>
> I proposed "Machine Learning functions support".
>
> It's not the highest priority right now, but it's something that we plan.
>
> Regards
> JB

Re: Will Beam provide a machine learning API in the future?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Jianfeng,

As you can see in the "Technical Vision" document:

https://drive.google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing

I proposed "Machine Learning functions support".

It's not the highest priority right now, but it's something that we plan.

Regards
JB

On 04/20/2016 04:23 AM, Jianfeng Qian wrote:
> Hi,
>
> Machine learning is becoming more and more popular today: Spark's MLlib, FlinkML, and Google Cloud Machine Learning.
>
> Will Beam provide a machine learning API in the future?
>
> Is anyone interested in working on it?
>
> Best Regards,
>
> Jianfeng
