You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by AiHe <gi...@git.apache.org> on 2015/04/30 19:35:34 UTC

[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

GitHub user AiHe opened a pull request:

    https://github.com/apache/spark/pull/5810

    [MLLIB][tree] Verify size of input rdd > 0 when building meta data

    Require non empty input rdd such that we can take the first labeledpoint and get the feature size

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AiHe/spark decisiontree-issue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5810.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5810
    
----
commit b448f47b02b2eba71f90c83f6654cb998d65f141
Author: Alain <ai...@usc.edu>
Date:   2015-04-30T17:31:23Z

    [MLLIB][tree] Verify size of input rdd > 0 when building meta data
    
    Require non empty input rdd such that we can take the first
    labeledpoint and get the feature size

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

Posted by AiHe <gi...@git.apache.org>.
Github user AiHe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5810#discussion_r29469236
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
           numTrees: Int,
           featureSubsetStrategy: String): DecisionTreeMetadata = {
     
    -    val numFeatures = input.take(1)(0).features.size
         val numExamples = input.count()
    +    require(numExamples > 0, s"DecisionTree requires size of input RDD > 0, " +
    --- End diff --
    
    Yes. After some filter work on dataset, the result rdd turns out to be empty which is unexpected but happens. This exception actually comes from the nature of the dataset but has not been captured by the Decision Tree algorithm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5810#discussion_r29480524
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
           numTrees: Int,
           featureSubsetStrategy: String): DecisionTreeMetadata = {
     
    -    val numFeatures = input.take(1)(0).features.size
    +    require(!input.isEmpty, s"DecisionTree requires size of input RDD > 0, " +
    --- End diff --
    
    This will trigger `take(1)` twice. Should be
    
    ~~~
    val numFeatures = input.map(_.features.size).take(1).headOption.getOrElse {
      throw new IllegalArgumentException("...")
    }
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5810#discussion_r29467818
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
           numTrees: Int,
           featureSubsetStrategy: String): DecisionTreeMetadata = {
     
    -    val numFeatures = input.take(1)(0).features.size
         val numExamples = input.count()
    +    require(numExamples > 0, s"DecisionTree requires size of input RDD > 0, " +
    --- End diff --
    
    You should use `isEmpty` rather than count the whole data set. Does this help much? You get an exception either way. Although this makes the message nicer. At the cost of non-trivial extra work.
    
    At this stage wouldn't the size have already had to be positive? have you encountered this in real life?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

Posted by AiHe <gi...@git.apache.org>.
Github user AiHe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5810#discussion_r29475404
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
           numTrees: Int,
           featureSubsetStrategy: String): DecisionTreeMetadata = {
     
    -    val numFeatures = input.take(1)(0).features.size
         val numExamples = input.count()
    +    require(numExamples > 0, s"DecisionTree requires size of input RDD > 0, " +
    --- End diff --
    
    Changed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5810#issuecomment-97892164
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB][tree] Verify size of input rdd > 0 whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5810#discussion_r29471393
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
           numTrees: Int,
           featureSubsetStrategy: String): DecisionTreeMetadata = {
     
    -    val numFeatures = input.take(1)(0).features.size
         val numExamples = input.count()
    +    require(numExamples > 0, s"DecisionTree requires size of input RDD > 0, " +
    --- End diff --
    
    OK. I'm fine with this as long as it only uses `isEmpty`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org