Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2016/04/08 13:10:25 UTC

[jira] [Comment Edited] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

    [ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231948#comment-15231948 ] 

Nick Pentreath edited comment on SPARK-13944 at 4/8/16 11:09 AM:
-----------------------------------------------------------------

What's the reasoning behind making breaking changes in the {{ml}} API but not in {{mllib}}? It seems to me that if we're breaking one API, we may as well break both and make a clean break, rather than keep a bunch of essentially deprecated cruft around (though I guess we could deprecate in 2.0 and remove in, say, 2.2 or 2.3). If we break explicitly without trying to "half-maintain" backward compatibility, it's also very clear to everyone what's broken, whereas converting back and forth may be more error-prone in the long run.

Also, in practice the actual breaking change mostly affects (a) third-party developers writing their own models and Pipeline components; and (b) users creating input datasets (data -> {{LabeledPoint}} or {{Vector}} in the {{mllib}} API, or creating raw {{Vector}} DataFrame columns or working with UDFs over {{Vector}} in the {{ml}} API). Across the board, the only change required is replacing {{mllib}} with {{ml}} in the imports.

Now, if we use the type alias, Scala users don't even need to make that change! As for Java users, ALL of them need to change {{DataFrame}} -> {{Dataset<Row>}} in ALL their code. Changing imports for linalg components from {{mllib}} -> {{ml}} seems just as onerous (or just as "not very onerous").
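
To make the type alias point concrete, here is a minimal, self-contained sketch of the idea. It does not use Spark at all: the {{ml}} and {{mllib}} objects below are hypothetical stand-ins for the real packages, and {{norm1}} and {{AliasDemo}} are made-up names for illustration. It only shows that code importing from the old location keeps compiling when the old names are aliases for the new definitions; it is not a claim about how the final patch should look.

```scala
// Stand-in for the new package; NOT the real Spark classes.
object ml {
  object linalg {
    // Hypothetical minimal vector type living in the new location.
    final case class DenseVector(values: Array[Double]) {
      def size: Int = values.length
    }
    object Vectors {
      def dense(values: Double*): DenseVector = DenseVector(values.toArray)
    }
  }
}

// Stand-in for the old package: the old names are kept as aliases.
object mllib {
  object linalg {
    type DenseVector = ml.linalg.DenseVector // type alias
    val DenseVector = ml.linalg.DenseVector  // companion (apply/unapply)
    val Vectors = ml.linalg.Vectors          // factory methods
  }
}

object AliasDemo {
  // "Legacy" user code: still imports from the old package name.
  import mllib.linalg.{DenseVector, Vectors}

  // Sum of absolute values, written against the old name.
  def norm1(v: DenseVector): Double = v.values.map(x => math.abs(x)).sum

  def main(args: Array[String]): Unit = {
    val v: DenseVector = Vectors.dense(1.0, -2.0, 3.0)
    // The alias and the new type are the same type, so no conversion is needed.
    val same: ml.linalg.DenseVector = v
    println(norm1(same))
  }
}
```

Java has no equivalent of a type alias, which is why Java users would have to change the import either way.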


> Separate out local linear algebra as a standalone module without Spark dependency
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13944
>                 URL: https://issues.apache.org/jira/browse/SPARK-13944
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, ML
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: DB Tsai
>            Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency to simplify production deployment. We can call the new module spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will be changed from `org.apache.spark.mllib.linalg.Vector` to `org.apache.spark.ml.linalg.Vector`. The vector type returned by the new ML pipeline will be the one in the ml package; however, the existing mllib code will not be touched. As a result, this will potentially break the API. Also, when an mllib vector is loaded by Spark SQL, it will automatically be converted into the one in the ml package.


