You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/08/31 10:56:17 UTC

[GitHub] spark pull request #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelecto...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22303

    [SPARK-25289][ML] Avoid exception in ChiSqSelector with FDR when no feature is selected

    ## What changes were proposed in this pull request?
    
    Currently, when FDR is used for `ChiSqSelector` and no feature is selected an exception is thrown because the max operation fails.
    
    The PR fixes the problem by handling this case and returning an empty array in that case, as sklearn (which was the reference for the initial implementation of FDR) does.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-25289

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22303.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22303
    
----
commit f5b0c4e05bc495e2cc0fc22a944d431e262342ca
Author: Marco Gaido <ma...@...>
Date:   2018-08-31T10:42:34Z

    [SPARK-25289][ML] Avoid exception in ChiSqSelector with FDR when no feature is selected

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    **[Test build #95549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95549/testReport)** for PR 22303 at commit [`d8a94f2`](https://github.com/apache/spark/commit/d8a94f26ba5cf6e65ab96301f853d223c878d1e3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by mabeaulieu1991 <gi...@git.apache.org>.
Github user mabeaulieu1991 commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Why is the unit test added for the ml package and not the mllib package?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelecto...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22303#discussion_r214364472
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -272,12 +272,15 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
             // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
             val tempRes = chiSqTestResult
               .sortBy { case (res, _) => res.pValue }
    -        val maxIndex = tempRes
    +        val selected = tempRes
               .zipWithIndex
               .filter { case ((res, _), index) =>
                 res.pValue <= fdr * (index + 1) / chiSqTestResult.length }
    -          .map { case (_, index) => index }
    -          .max
    +        val maxIndex = if (selected.isEmpty) {
    --- End diff --
    
    How about replacing this new code with
    
    ```
    if (selected.isEmpty) {
      Seq.empty
    } else {
      val maxIndex = select.map { case (_, index) => index }.max
      tempRes.take(maxIndex + 1)
    }
    ```
    
    Might just be more straightforward and less change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    **[Test build #95537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95537/testReport)** for PR 22303 at commit [`f5b0c4e`](https://github.com/apache/spark/commit/f5b0c4e05bc495e2cc0fc22a944d431e262342ca).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    The .ml implementation wraps the .mllib implementation. The .mllib package isn't really being developed anymore, even though parts of it are still 'active' insofar as they underpin the .ml implementation, as here. The tests however make sense against the active .ml user-facing API.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2743/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95537/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    **[Test build #95546 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95546/testReport)** for PR 22303 at commit [`03f68aa`](https://github.com/apache/spark/commit/03f68aa8cabe1af00f532cfdc1e10647b333246d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelecto...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22303#discussion_r214365450
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -272,12 +272,15 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
             // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
             val tempRes = chiSqTestResult
               .sortBy { case (res, _) => res.pValue }
    -        val maxIndex = tempRes
    +        val selected = tempRes
               .zipWithIndex
               .filter { case ((res, _), index) =>
                 res.pValue <= fdr * (index + 1) / chiSqTestResult.length }
    -          .map { case (_, index) => index }
    -          .max
    +        val maxIndex = if (selected.isEmpty) {
    --- End diff --
    
    sounds reasonable, I was already doubting whether doing what you proposed and what I did, so I am fine going that way. I am updating, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95546/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2735/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    **[Test build #95537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95537/testReport)** for PR 22303 at commit [`f5b0c4e`](https://github.com/apache/spark/commit/f5b0c4e05bc495e2cc0fc22a944d431e262342ca).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95549/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    **[Test build #95546 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95546/testReport)** for PR 22303 at commit [`03f68aa`](https://github.com/apache/spark/commit/03f68aa8cabe1af00f532cfdc1e10647b333246d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    cc @jkbradley @srowen @yanboliang 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2739/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelecto...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22303


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22303: [SPARK-25289][ML] Avoid exception in ChiSqSelector with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22303
  
    **[Test build #95549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95549/testReport)** for PR 22303 at commit [`d8a94f2`](https://github.com/apache/spark/commit/d8a94f26ba5cf6e65ab96301f853d223c878d1e3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org