Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2016/07/27 09:52:50 UTC

[GitHub] spark pull request #14378: [SPARK-16750] [ML] Fix GaussianMixture training f...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/14378

    [SPARK-16750] [ML] Fix GaussianMixture training failed due to feature column type mistake

    ## What changes were proposed in this pull request?
    ML ```GaussianMixture``` training fails because of a mistake in the feature column type: the type should be ```ml.linalg.VectorUDT```, but ```mllib.linalg.VectorUDT``` was used instead.
    See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
    Why did the unit tests not catch this error? Because some estimators/transformers did not call ```transformSchema(dataset.schema)``` first during fit or transform. This PR also adds that call to all estimators/transformers that were missing it.
    
    
    ## How was this patch tested?
    No new tests; the change should pass the existing ones.
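
As an illustration of the masking effect described above, here is a minimal sketch that does not depend on Spark. Everything in it is hypothetical: `Schema`, `BuggyEstimator`, `fitUnchecked`, and `fitChecked` are invented names, a plain `Map` of type-name strings stands in for a real schema, and the strings `ml.VectorUDT` and `mllib.VectorUDT` are only labels for the two vector types. It assumes the faulty check lives in `transformSchema`, which is one reading of the description rather than a copy of the actual code.

```scala
object SchemaCheckSketch {
  // Column name -> type name; a toy replacement for a real schema type.
  type Schema = Map[String, String]

  class BuggyEstimator {
    // The check mistakenly expects the old vector type label.
    def transformSchema(schema: Schema): Schema = {
      val actual = schema.get("features")
      require(actual.contains("mllib.VectorUDT"),
        s"Column features must be mllib.VectorUDT but was $actual")
      schema + ("prediction" -> "int")
    }

    // Skips schema validation, so the wrong check is never exercised.
    def fitUnchecked(schema: Schema): String = "model"

    // Validates first, so the wrong check fails immediately.
    def fitChecked(schema: Schema): String = {
      transformSchema(schema)
      "model"
    }
  }
}
```

In this sketch, a test that only calls `fitUnchecked` on correctly typed input succeeds and never reveals the wrong check, while `fitChecked` throws an `IllegalArgumentException` on the same input.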
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-16750

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14378.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14378
    
----
commit a0a32efa47be7dc0a51b71790bbee07620bb7d28
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-07-27T09:49:52Z

    Fix GaussianMixture training failed due to feature column type mistake

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14378: [SPARK-16750] [ML] Fix GaussianMixture training f...

Posted by lins05 <gi...@git.apache.org>.
Github user lins05 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14378#discussion_r72568090
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
    @@ -111,7 +111,7 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
    -    transformSchema(dataset.schema, logging = true)
    +    transformSchema(dataset.schema)
    --- End diff --
    
    It seems that the `transformSchema(schema: StructType, logging: Boolean)` method of the base class `PipelineStage` calls the overloaded `transformSchema` method without the `logging` param:
    
    https://github.com/apache/spark/blob/v2.0.0/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L70
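
For readers following the thread, the dispatch described here can be sketched without Spark. This is a simplified stand-in, not Spark's source: `Schema`, `Stage`, `Scaler`, `fitWithLogging`, `fitWithoutLogging`, and the in-memory `log` buffer are hypothetical names invented for illustration, and a plain `Map` stands in for `StructType`.

```scala
object TransformSchemaSketch {
  // Column name -> type name; a toy replacement for StructType.
  type Schema = Map[String, String]

  abstract class Stage {
    // Collects debug messages in memory instead of using a real logger.
    val log = scala.collection.mutable.ArrayBuffer.empty[String]

    // Overload that concrete stages implement.
    def transformSchema(schema: Schema): Schema

    // Overload with the logging flag: it delegates to the one above.
    protected def transformSchema(schema: Schema, logging: Boolean): Schema = {
      if (logging) log += s"Input schema: $schema"
      val output = transformSchema(schema)
      if (logging) log += s"Expected output schema: $output"
      output
    }
  }

  class Scaler extends Stage {
    var validations = 0

    override def transformSchema(schema: Schema): Schema = {
      validations += 1
      require(schema.get("features").contains("vector"),
        "features column must be a vector")
      schema + ("scaled" -> "vector")
    }

    // Both entry points end up in the override above.
    def fitWithLogging(schema: Schema): Schema =
      transformSchema(schema, logging = true)

    def fitWithoutLogging(schema: Schema): Schema =
      transformSchema(schema)
  }
}
```

Under these assumptions both call styles run the subclass's validation exactly once per call; the `logging` flag only decides whether the two debug messages are recorded.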




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    I just had a minor question, but LGTM




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    **[Test build #62917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62917/consoleFull)** for PR 14378 at commit [`a0a32ef`](https://github.com/apache/spark/commit/a0a32efa47be7dc0a51b71790bbee07620bb7d28).




[GitHub] spark pull request #14378: [SPARK-16750] [ML] Fix GaussianMixture training f...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14378#discussion_r72623737
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
    @@ -111,7 +111,7 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
    -    transformSchema(dataset.schema, logging = true)
    +    transformSchema(dataset.schema)
    --- End diff --
    
    Thanks for the reminder; I have updated the PR.




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    **[Test build #62970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62970/consoleFull)** for PR 14378 at commit [`0663ad9`](https://github.com/apache/spark/commit/0663ad9042fc8f348f174dd0f9a02c6e721e8b16).




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    **[Test build #62917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62917/consoleFull)** for PR 14378 at commit [`a0a32ef`](https://github.com/apache/spark/commit/a0a32efa47be7dc0a51b71790bbee07620bb7d28).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62970/




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    **[Test build #62970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62970/consoleFull)** for PR 14378 at commit [`0663ad9`](https://github.com/apache/spark/commit/0663ad9042fc8f348f174dd0f9a02c6e721e8b16).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    Merged to master/2.0




[GitHub] spark pull request #14378: [SPARK-16750] [ML] Fix GaussianMixture training f...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14378#discussion_r72485919
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
    @@ -111,7 +111,7 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
    -    transformSchema(dataset.schema, logging = true)
    +    transformSchema(dataset.schema)
    --- End diff --
    
    Just wondering why you removed the `logging` flag here. I know it just adds some debug logging, but other similar calls still have it set to true; should those be removed as well?




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    Seems reasonable to me.




[GitHub] spark pull request #14378: [SPARK-16750] [ML] Fix GaussianMixture training f...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14378#discussion_r72560646
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
    @@ -111,7 +111,7 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
    -    transformSchema(dataset.schema, logging = true)
    +    transformSchema(dataset.schema)
    --- End diff --
    
    It's a good question. Since ```MinMaxScaler``` overrides the ```transformSchema``` overload that has no ```logging``` argument, we should call that one rather than the function in the base class.




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    cc @srowen 




[GitHub] spark issue #14378: [SPARK-16750] [ML] Fix GaussianMixture training failed d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62917/




[GitHub] spark pull request #14378: [SPARK-16750] [ML] Fix GaussianMixture training f...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14378

