
Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2016/05/28 13:40:43 UTC

[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/13378

    [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib migration guide from 1.6 to 2.0

    ## What changes were proposed in this pull request?
    Update ```spark.ml``` and ```spark.mllib``` migration guide from 1.6 to 2.0.
    
    ## How was this patch tested?
    Docs update, no tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-13448

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13378.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13378
    
----
commit 182414b89dca0056e5732cb3ddae654ae1379436
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-05-28T12:43:58Z

    Document MLlib deprecations and behavior changes in Spark 2.0

commit fb610d25e58cff765d481f1f15728d42806aa8de
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-05-28T13:36:00Z

    fix typos

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    @MLnick Would you like to update the corresponding migration docs for the changes in [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) in a follow-up PR? I saw you left comments about doing that. If not, please let me know.
    For linear algebra, we can document the changes once we have a final decision. It would also be better to have a converter that scans a DataFrame and updates its schema to use the new vectors. Otherwise, previously stored DataFrames or MLlib models will be loaded incorrectly in Spark 2.0.
    Let's focus on deprecations and changes of behavior and get this in first. We can leave the JIRA open for follow-up work.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r65004105
  
    --- Diff: docs/mllib-guide.md ---
    @@ -102,32 +102,54 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.5 to 1.6
    +## From 1.6 to 2.0
     
     There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
     deprecations and changes of behavior.
     
     Deprecations:
     
    -* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
    - In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
    -* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
    - In `spark.ml.classification.LogisticRegressionModel` and
    - `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
    - the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
    - algorithms.
    +* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
    --- End diff --
    
    @yanboliang  there are breaking changes for removing some deprecated methods in https://issues.apache.org/jira/browse/SPARK-14089 and https://issues.apache.org/jira/browse/SPARK-14952 that we should highlight.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    @MLnick What about merging this PR first and then sending your PR for the breaking changes separately? If that is OK, please go ahead and get it in. Thanks!



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #59734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59734/consoleFull)** for PR 13378 at commit [`235930d`](https://github.com/apache/spark/commit/235930d229021ef21921477478b39fa955aa5294).



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222309519
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59561/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13378#discussion_r65041486

--- Diff: docs/mllib-guide.md ---
@@ -102,32 +102,54 @@ MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

-## From 1.5 to 1.6
+## From 1.6 to 2.0

There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
deprecations and changes of behavior.

Deprecations:

Good points. I forgot to record all removed deprecated methods. It's great that you can do that in a follow up PR. Thanks!


[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    There are also some changes from these 2 JIRAs/PRs which should be noted here:
    * [https://issues.apache.org/jira/browse/SPARK-14810]
    * [https://issues.apache.org/jira/browse/SPARK-14814]
    
    For linear algebra, we should definitely discuss the change in the migration guide.  @mengxr is also thinking about whether we can add a little functionality to make that transition easier.  Documenting/improving this could happen in a follow-up PR.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r65004127
  
    --- Diff: docs/mllib-guide.md ---
    @@ -102,32 +102,54 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.5 to 1.6
    +## From 1.6 to 2.0
     
     There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
     deprecations and changes of behavior.
     
     Deprecations:
     
    -* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
    - In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
    -* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
    - In `spark.ml.classification.LogisticRegressionModel` and
    - `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
    - the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
    - algorithms.
    +* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
    + In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
    +* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
    + In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
    + the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
    +* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
    + In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
    + We move all functionality in overridden methods to the corresponding `transformSchema`.
    +* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
    + In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
    + We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
    +* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
    + In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
     
     Changes of behavior:
     
    -* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
    - `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
    - Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
    - `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
    - previous error); for small errors (`< 0.01`), it uses absolute error.
    -* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
    - `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
    - tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
    - behavior of the simpler `Tokenizer` transformer.
    +* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
    + `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
    + This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
    +    * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
    +    * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
    +* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
    + In order to provide better and consistent result with `spark.ml.classification.LogisticRegression`,
    + the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
    +* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
    + Fix a bug of `PowerIterationClustering` which will likely change its result.
    +* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
    + `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
    +* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
    + `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
    +* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
    + `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
    +* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
    + We remove `expectedType` argument for PySpark `Param`.
    --- End diff --
    
    The expectedType argument ... was removed
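
A minimal sketch for the SPARK-14900 item in the diff above. It assumes the deprecated no-argument `precision`, `recall` and `fMeasure` were totals over all classes (micro-averaged); the thread does not state the rationale, so treat that as an assumption. Under it, all three coincide with accuracy for single-label data, which would make them redundant. The function name `micro_metrics` and the sample data are hypothetical, and this is plain Python, not the Spark API.

```python
from collections import Counter


def micro_metrics(labels, predictions):
    """Return (accuracy, micro_precision, micro_recall, micro_f1).

    Assumes single-label multiclass data: one true label and one
    predicted label per example.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, p in zip(labels, predictions):
        if y == p:
            tp[y] += 1      # correct prediction for class y
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[y] += 1      # class y was the truth but was missed

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    accuracy = total_tp / len(labels)
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    # Guard against 0/0 when nothing was predicted correctly.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return accuracy, precision, recall, f1


# Hypothetical sample data: 5 of 8 predictions are correct.
labels = [0, 1, 2, 2, 1, 0, 2, 1]
predictions = [0, 2, 2, 2, 1, 1, 2, 0]
acc, prec, rec, f1 = micro_metrics(labels, predictions)
print(acc, prec, rec, f1)
```

Every misclassified example adds exactly one false positive and one false negative, so the three totals share the same denominator as accuracy. Per-class precision and recall do differ from accuracy; only the overall totals collapse.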



[GitHub] spark pull request #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spar...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13378



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    @yanboliang how is this coming along? I have a PR ready for the breaking changes. I can either do that separately or push a PR to your branch.
    
    We need to update this PR with a few items mentioned in the JIRA by @jkbradley & @mengxr.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    How do we want to handle the new vectors (i.e. `ml` APIs / VectorUDT only works for `ml.linalg` Vectors and not for old `mllib.linalg` Vectors)? A note about that should probably go into the migration guide (or elsewhere).



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #61364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61364/consoleFull)** for PR 13378 at commit [`5472fb9`](https://github.com/apache/spark/commit/5472fb9e4d1158644c0c4fc22cc02083acc4576f).



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #59809 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59809/consoleFull)** for PR 13378 at commit [`2339200`](https://github.com/apache/spark/commit/23392006c3023e3c95f8bd434e5fe3090b5724b0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spar...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13378#discussion_r65410927

--- Diff: docs/mllib-guide.md ---
@@ -102,32 +102,53 @@ MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

-## From 1.5 to 1.6
+## From 1.6 to 2.0

-There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
-deprecations and changes of behavior.
+The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include:

Deprecations:

-* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
- In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
-* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
- In `spark.ml.classification.LogisticRegressionModel` and
- `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
- the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
- algorithms.
+* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
+ In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
+* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
+ In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
+ the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
+* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
+ In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
+ We move all functionality in overridden methods to the corresponding `transformSchema`.
+* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
+ In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
+ We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
+* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
+ In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.

Changes of behavior:

-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
- `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
- Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
- `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
- previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
- `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
- tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
- behavior of the simpler `Tokenizer` transformer.
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+ `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
+ This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
+ * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
+ * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+ In order to provide better and consistent result with `spark.ml.classification.LogisticRegression`,
+ the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+ Fix a bug of `PowerIterationClustering` which will likely change its result.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+ `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
+ `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+ `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+ The `expectedType` argument for PySpark `Param` was removed.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+ Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
+* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
+ `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
+ The output buckets will differ for same input data and params.
+* [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814):
--- End diff --

I've added it to the list in [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810). We can either remove it here from this PR and I will include it when I do the one for breaking changes, or add it to a breaking changes section in this PR, which I will update with the others later.
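
A rough sketch of why the `convergenceTol` default change quoted in the diff above (SPARK-13429, 1E-4 to 1E-6) counts as a behavior change: a tighter tolerance makes an iterative optimizer run longer and stop closer to the optimum, so fitted values shift slightly. This uses plain gradient descent on a one-dimensional quadratic, not Spark's L-BFGS. The function `minimize`, its stopping rule, and the objective are all assumptions chosen for illustration.

```python
def minimize(tol, step=0.1, x0=0.0, max_iter=10_000):
    """Gradient descent on f(x) = (x - 3)^2.

    Stopping rule (an assumption for illustration only):
        |x_new - x_old| < tol * max(|x_new|, 1)
    Returns (solution, iterations_used).
    """
    x = x0
    for i in range(1, max_iter + 1):
        gradient = 2.0 * (x - 3.0)
        x_new = x - step * gradient
        if abs(x_new - x) < tol * max(abs(x_new), 1.0):
            return x_new, i
        x = x_new
    return x, max_iter


loose_x, loose_iters = minimize(1e-4)   # old default tolerance
tight_x, tight_iters = minimize(1e-6)   # new default tolerance

print("tol=1e-4:", loose_iters, "iterations, error", abs(loose_x - 3.0))
print("tol=1e-6:", tight_iters, "iterations, error", abs(tight_x - 3.0))
```

The same training data can therefore yield slightly different coefficients before and after the upgrade unless the tolerance is set explicitly.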


[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222445538
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61305/
    Test PASSed.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Other than that, this looks good.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59734/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #59734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59734/consoleFull)** for PR 13378 at commit [`235930d`](https://github.com/apache/spark/commit/235930d229021ef21921477478b39fa955aa5294).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222309518
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13378#discussion_r65004142

-## From 1.5 to 1.6
+## From 1.6 to 2.0

There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
deprecations and changes of behavior.

Deprecations:

Changes of behavior:

-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
- `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
- Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
- `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
- previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
- `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
- tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
- behavior of the simpler `Tokenizer` transformer.
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+ `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
+ This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
+ * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
+ * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+ In order to provide better and consistent result with `spark.ml.classification.LogisticRegression`,
+ the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+ Fix a bug of `PowerIterationClustering` which will likely change its result.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+ `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
+ `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+ `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+ We remove `expectedType` argument for PySpark `Param`.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+ We change some default `Param` values which were mismatched between pipelines in Scala and Python.
--- End diff --

Some default Param values, which were ... Scala and Python, have been changed.
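
On the SPARK-7780 entry in the hunk above ("The intercept will not be regularized"): a minimal sketch of what that means for the objective, assuming a plain L2 penalty. This is pure Python and does not use any Spark API; `regularized_log_loss` and its `regularize_intercept` flag are hypothetical names made up for illustration.

```python
import math


def regularized_log_loss(weights, intercept, rows, labels, reg_param,
                         regularize_intercept=False):
    """Mean logistic loss plus an L2 penalty (hypothetical helper, not Spark code)."""
    total = 0.0
    for x, y in zip(rows, labels):
        margin = intercept + sum(w * xi for w, xi in zip(weights, x))
        # Loss is log(1 + exp(-margin)) for label 1 and log(1 + exp(margin)) for label 0.
        signed = margin if y == 1 else -margin
        total += math.log1p(math.exp(-signed))
    loss = total / len(rows)

    # The penalty always covers the feature weights.
    penalty = sum(w * w for w in weights)
    # Only the variant that regularizes the intercept adds it to the penalty.
    if regularize_intercept:
        penalty += intercept * intercept
    return loss + 0.5 * reg_param * penalty


rows = [[1.0, 2.0], [0.5, -1.0]]
labels = [1, 0]
without = regularized_log_loss([0.3, -0.2], 2.0, rows, labels, 0.1)
with_intercept = regularized_log_loss([0.3, -0.2], 2.0, rows, labels, 0.1,
                                      regularize_intercept=True)
print(with_intercept - without)
```

The two objectives differ by `0.5 * reg_param * intercept ** 2` (up to floating-point rounding), so a model trained with the intercept left out of the penalty can end up with a different intercept than one trained with it included.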


[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    @yanboliang I'm happy with that - we need to merge this one first so I can slot my changes in format-wise.
    
    Could you update for the new deprecations in the JIRA (https://issues.apache.org/jira/browse/SPARK-15643?focusedCommentId=15343059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15343059)? Also the vector conversion (https://issues.apache.org/jira/browse/SPARK-15643?focusedCommentId=15334729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15334729)



[GitHub] spark pull request #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spar...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r68640123
  
    --- Diff: docs/mllib-guide.md ---
    @@ -121,6 +121,9 @@ Deprecations:
      We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
     * [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
      In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
    +* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
    + In `spark.ml.util.BaseReadWrite`, the `context` method has been deprecated in favor of `session`.
    --- End diff --
    
    Could you please list this as MLReader and MLWriter instead of BaseReadWrite?  Those are the public APIs.
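
On the SPARK-14900 entry quoted in the hunk above: a small sketch of why a single `accuracy` value can stand in for the overall `precision`, `recall` and `fMeasure`, assuming those deprecated members are the overall (micro-averaged) values for single-label predictions. This is pure Python, not the `MulticlassMetrics` implementation; `micro_metrics` is a hypothetical helper.

```python
from collections import Counter


def micro_metrics(labels, predictions):
    """Return (precision, recall, f1, accuracy), micro-averaged over all classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, p in zip(labels, predictions):
        if y == p:
            tp[y] += 1
        else:
            # One wrong prediction is a false positive for the predicted class
            # and a false negative for the true class.
            fp[p] += 1
            fn[y] += 1

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = total_tp / len(labels)
    return precision, recall, f1, accuracy


print(micro_metrics([0, 1, 2, 2, 1, 0], [0, 2, 2, 2, 1, 1]))
```

Because every misclassified row counts once as a false positive and once as a false negative, the two totals are equal, and all four values coincide up to floating-point rounding.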



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Separating the work SGTM too.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #61305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61305/consoleFull)** for PR 13378 at commit [`d2666ac`](https://github.com/apache/spark/commit/d2666acc4fcd3813400589fc10f74743c8e0b38f).



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222444185
  
    **[Test build #59611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59611/consoleFull)** for PR 13378 at commit [`260f3a3`](https://github.com/apache/spark/commit/260f3a35063e4dbf5775aef1cd2878731c0e1147).



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r65004116
  
    --- Diff: docs/mllib-guide.md ---
    @@ -102,32 +102,54 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.5 to 1.6
    +## From 1.6 to 2.0
     
     There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
     deprecations and changes of behavior.
     
     Deprecations:
     
    -* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
    - In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
    -* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
    - In `spark.ml.classification.LogisticRegressionModel` and
    - `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
    - the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
    - algorithms.
    +* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
    --- End diff --
    
    Though I'm happy to just do that in a follow up PR once I've made a final pass through for MiMa changes.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    I'm happy to do the breaking changes in a separate PR (I still need to do a final pass through of those to confirm I've caught them all).



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222445539
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59611/
    Test PASSed.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    LGTM
    Merging with master
    Thank you!



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #61305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61305/consoleFull)** for PR 13378 at commit [`d2666ac`](https://github.com/apache/spark/commit/d2666acc4fcd3813400589fc10f74743c8e0b38f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61364/
    Test PASSed.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59809/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222309200
  
    **[Test build #59561 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59561/consoleFull)** for PR 13378 at commit [`fb610d2`](https://github.com/apache/spark/commit/fb610d25e58cff765d481f1f15728d42806aa8de).



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r65288651
  
    --- Diff: docs/mllib-guide.md ---
    @@ -102,32 +102,54 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.5 to 1.6
    +## From 1.6 to 2.0
     
     There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
    --- End diff --
    
    Not the case for this release



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222309499
  
    **[Test build #59561 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59561/consoleFull)** for PR 13378 at commit [`fb610d2`](https://github.com/apache/spark/commit/fb610d25e58cff765d481f1f15728d42806aa8de).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spar...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r65480031
  
    --- Diff: docs/mllib-guide.md ---
    @@ -102,32 +102,53 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.5 to 1.6
    +## From 1.6 to 2.0
     
    -There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
    -deprecations and changes of behavior.
    +The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include:
     
     Deprecations:
     
    -* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
    - In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
    -* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
    - In `spark.ml.classification.LogisticRegressionModel` and
    - `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
    - the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
    - algorithms.
    +* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
    + In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
    +* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
    + In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
    + the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
    +* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
    + In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
    + We move all functionality in overridden methods to the corresponding `transformSchema`.
    +* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
    + In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
    + We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
    +* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
    + In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
     
     Changes of behavior:
     
    -* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
    - `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
    - Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
    - `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
    - previous error); for small errors (`< 0.01`), it uses absolute error.
    -* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
    - `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
    - tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
    - behavior of the simpler `Tokenizer` transformer.
    +* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
    + `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
    + This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
    +    * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
    +    * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
    +* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
    + In order to provide better and consistent result with `spark.ml.classification.LogisticRegression`,
    + the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
    +* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
    + Fix a bug of `PowerIterationClustering` which will likely change its result.
    +* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
    + `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
    +* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
    + `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
    +* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
    + `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
    +* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
    + The `expectedType` argument for PySpark `Param` was removed.
    +* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
    + Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
    +* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
    + `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
    + The output buckets will differ for same input data and params.
    +* [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814):
    --- End diff --
    
    I removed it in this PR. @MLnick Please add it in your follow up PR. Thanks!
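
On the SPARK-13429 entry in the hunk above (default `convergenceTol` changed from 1E-4 to 1E-6): a generic sketch of how a tighter tolerance trades more iterations for a solution closer to the optimum. It uses plain gradient descent on a toy quadratic, swapped in for L-BFGS, and the stopping rule is an assumption made for illustration rather than the rule Spark uses; `minimize` is a hypothetical helper.

```python
def minimize(tol, step=0.1, start=0.0, max_iter=10_000):
    """Gradient descent on f(x) = (x - 3)^2; returns (x, iterations used)."""
    def f(x):
        return (x - 3.0) ** 2

    x = start
    prev = f(x)
    for i in range(1, max_iter + 1):
        x -= step * 2.0 * (x - 3.0)  # the gradient of f is 2 * (x - 3)
        cur = f(x)
        # Assumed stopping rule: change in objective, scaled by max(|f|, 1).
        if abs(prev - cur) <= tol * max(abs(cur), 1.0):
            return x, i
        prev = cur
    return x, max_iter


x_loose, n_loose = minimize(1e-4)
x_tight, n_tight = minimize(1e-6)
print(n_loose, abs(x_loose - 3.0))
print(n_tight, abs(x_tight - 3.0))
```

With the tighter tolerance the loop runs longer and stops nearer to the minimum at `x = 3`, which is the kind of difference users might notice in fitted coefficients and training time after the default changes.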



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    @yanboliang I opened #13924 with my changes. If you prefer, I can incorporate the part about vector conversions into my section on the new linalg classes (since it perhaps fits best there?).



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #59809 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59809/consoleFull)** for PR 13378 at commit [`2339200`](https://github.com/apache/spark/commit/23392006c3023e3c95f8bd434e5fe3090b5724b0).



[GitHub] spark pull request #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spar...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13378#discussion_r65409516
  
    --- Diff: docs/mllib-guide.md ---
    @@ -102,32 +102,53 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.5 to 1.6
    +## From 1.6 to 2.0
     
    -There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
    -deprecations and changes of behavior.
    +The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include:
     
     Deprecations:
     
    -* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
    - In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
    -* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
    - In `spark.ml.classification.LogisticRegressionModel` and
    - `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
    - the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
    - algorithms.
    +* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
    + In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
    +* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
    + In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
    + the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
    +* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
    + In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
    + We move all functionality in overridden methods to the corresponding `transformSchema`.
    +* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
    + In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
    + We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
    +* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
    + In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
     
     Changes of behavior:
     
    -* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
    - `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
    - Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
    - `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
    - previous error); for small errors (`< 0.01`), it uses absolute error.
    -* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
    - `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
    - tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
    - behavior of the simpler `Tokenizer` transformer.
    +* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
    + `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
    + This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
    +    * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
    +    * If users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate.
    +* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
    + In order to provide better and more consistent results with `spark.ml.classification.LogisticRegression`,
    + the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
    +* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
    + Fix a bug of `PowerIterationClustering` which will likely change its result.
    +* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
    + `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
    +* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
    + `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
    +* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
    + `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
    +* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
    + The `expectedType` argument for PySpark `Param` was removed.
    +* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
    + Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
    +* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
    + `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
    + The output buckets will differ for same input data and params.
    +* [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814):
    --- End diff --
    
    Just noticed that this is a breaking API change, not a change of behavior.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    **[Test build #61364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61364/consoleFull)** for PR 13378 at commit [`5472fb9`](https://github.com/apache/spark/commit/5472fb9e4d1158644c0c4fc22cc02083acc4576f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/13378
  
    @MLnick I have added the new deprecations listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-15643?focusedCommentId=15343059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15343059) to this PR. As for the vector conversions issue, I think it fits better in your section. Thanks!



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222309652
  
    cc @jkbradley @mengxr 



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13378
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15643] [Doc] [ML] Update spark.ml and s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13378#issuecomment-222445436
  
    **[Test build #59611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59611/consoleFull)** for PR 13378 at commit [`260f3a3`](https://github.com/apache/spark/commit/260f3a35063e4dbf5775aef1cd2878731c0e1147).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

