Posted to issues@spark.apache.org by "Asher Krim (JIRA)" <ji...@apache.org> on 2017/03/14 02:09:41 UTC
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

    [ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923406#comment-15923406 ] 

Asher Krim commented on SPARK-16365:
------------------------------------

Thanks for pointing me to this Jira, [~josephkb]. I somehow missed it!

I recently posted about this exact issue on the dev@ list. I wrote up a [document|https://docs.google.com/document/d/1Ha4DRMio5A7LjPqiHUnwVzbaxbev6ys04myyz6nDgI4/edit?usp=sharing] with some details about my views. The tl;dr is that this is, in my mind, the single most important feature currently missing from the Pipeline API. Training with Spark is all well and good, but if I can't deploy the models without relying on Spark, then using Spark for production systems becomes much less attractive. The work currently required to make Pipeline models run in production without Spark is just not worth it, since it requires both re-implementing the algorithms and rigorously testing them (to avoid small skews caused by differences between the implementations). Maintaining this can easily become nightmarish.

PMML and other export schemes are not a solution to this problem. They are nearly orthogonal to it, since they only describe WHAT to do, not exactly HOW to do it.

I'm mostly just rephrasing [~josephkb], who captured this perfectly under "Local model implementations" above.

I am very eager to see this implemented in Spark, and am happy to start contributing code. [~hollinwilkins] has already done a lot of work on this for MLeap, so I'm hoping he will be on board as well. Other than a few small upfront design decisions, I think the implementation is mostly "Embarrassingly Parallel™".

> Ideas for moving "mllib-local" forward
> --------------------------------------
>
>                 Key: SPARK-16365
>                 URL: https://issues.apache.org/jira/browse/SPARK-16365
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's linear algebra", or "investigate how we will implement local models/pipelines in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation of linalg into a standalone project turned out to be significantly more complex than originally expected. So I vote we devote sufficient discussion and time to planning out the next move :)


