You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mpjlu <gi...@git.apache.org> on 2017/10/17 13:15:07 UTC

[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

GitHub user mpjlu opened a pull request:

    https://github.com/apache/spark/pull/19516

    [SPARK-22277][ML]fix the bug of ChiSqSelector on preparing the output column

    ## What changes were proposed in this pull request?
    
    To prepare the output columns when use ChiSqSelector,  the master method adds some additional feature attribute, this is not necessary, and sometimes cause error. 
    
    `    val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
          origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
        } else {
          Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
        }
        val newAttributeGroup = new AttributeGroup($(outputCol), featureAttributes)` 
    
    ## How was this patch tested?
    The existing UT.  


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mpjlu/spark testDFdirect

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19516.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19516
    
----
commit 3128133d76348666df82bf43aa42cd9ebae70faf
Author: Peng Meng <pe...@intel.com>
Date:   2017-10-17T13:04:08Z

    fix the bug of ChiSqSelector on preparing the output column

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146528222
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    For this problem, either here or the following code should be changed, right? 
     " case nomAttr: NominalAttribute =>
                  nomAttr.getNumValues match {
                    case Some(numValues: Int) => Iterator(idx -> numValues)
                    case None => throw new IllegalArgumentException(s"Feature $idx is marked as" +
                      " Nominal (categorical), but it does not have the number of values specified.")
                  }" 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19516
  
    **[Test build #82840 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82840/testReport)** for PR 19516 at commit [`3128133`](https://github.com/apache/spark/commit/3128133d76348666df82bf43aa42cd9ebae70faf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19516
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19516
  
    **[Test build #82840 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82840/testReport)** for PR 19516 at commit [`3128133`](https://github.com/apache/spark/commit/3128133d76348666df82bf43aa42cd9ebae70faf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/19516
  
    Because I don't have the environment to continue this work, I will close it. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146741139
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    Change the error description is ok, but ChiSqSelector maybe not the only case causes this error. StringIndexer, QuantileDiscretizer may also causes this error. 
    Maybe we need a solution for this problem? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19516
  
    I thought about this, because `ChiSqSelector` only work for categorical features, after processing it marked features without attributes as `NominalAttribute` is reasonable, the problem is it do not fill the `values` or `numValues` members (which are both optional, not required). So I think it isn't a bug. But maybe you can improve this, to fill `values` or `numValues` members, by using stats info from `ChiSqSelector`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu closed the pull request at:

    https://github.com/apache/spark/pull/19516


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146537275
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    The problem is should we allow add Nominal and not set value or numvalues. If we allow this,  should we change the code of getCategoricalFeatures?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146531755
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    Yes I admit it is hard to get the `values` and/or `numValues` here.
    Current spark code will throw exception when we pipeline ChiSqSelector + DecisionTreeClassifier (on features without attributes),
    But if we remove the code adding `Nominal` here (in ChiSqSelector), although pipeline ChiSqSelector + DecisionTreeClassifier(on features without attributes) will run successfully, it will get wrong result, because DecisionTreeClassifier treat them as continuous features.
    Comparing explicitly throwing exception and running successfully on the surface(but internal is wrong), I tend to keep current choice.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146527616
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    This function is in ChiSqSelectorModel, in the model you only have "selectedFeatures: Array[Int]", seems you cannot fill values and or numValues only with the model information. 
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19516
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82840/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146546628
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    So I suggest keep current code, but you can improve the error description here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/MetadataUtils.scala#L66:
    ```
     throw new IllegalArgumentException(s"Feature $idx is marked as" +
    " Nominal (categorical), but it does not have the number of values specified.")
    ```
    Add tips here, tell user it can use `VectorIndexer` to process categorical features first, to solve this issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146513000
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    Instead of removing this, I think the better way is fill the `values` or `numValues` member here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org