Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2014/08/17 07:37:30 UTC

[GitHub] spark pull request: [SPARK-3807][MLLIB] fix col indexing bug and a...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/1997

    [SPARK-3807][MLLIB] fix col indexing bug and add a check for number of distinct values

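    There is a bug in determining the column index: features are sliced before `zipWithIndex` is applied, so column indices restart at 0 in each batch.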

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark chisq-index

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1997.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1997
    
----
commit 8fc2ab2a5ed6d7320542b8810c706c588f77bf5c
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-17T05:34:22Z

    fix col indexing bug and add a check for number of distinct values

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3807][MLLIB] fix col indexing bug and a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1997#discussion_r16328293
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala ---
    @@ -75,21 +77,42 @@ private[stat] object ChiSqTest extends Logging {
        */
       def chiSquaredFeatures(data: RDD[LabeledPoint],
           methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val maxCategories = 10000
         val numCols = data.first().features.size
         val results = new Array[ChiSqTestResult](numCols)
         var labels: Map[Double, Int] = null
    -    // At most 100 columns at a time
    -    val batchSize = 100
    +    // at most 1000 columns at a time
    +    val batchSize = 1000
         var batch = 0
         while (batch * batchSize < numCols) {
           // The following block of code can be cleaned up and made public as
           // chiSquared(data: RDD[(V1, V2)])
           val startCol = batch * batchSize
           val endCol = startCol + math.min(batchSize, numCols - startCol)
    -      val pairCounts = data.flatMap { p =>
    -        // assume dense vectors
    -        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    --- End diff --
    
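    This should call `zipWithIndex` before slicing, so that each feature keeps its original column index rather than its offset within the batch.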
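
    To illustrate the ordering issue in the removed lines, here is a minimal sketch on a plain Scala array. It is not the code from the patch: the object name `ColIndexSketch`, the sample feature values, and the batch window (`startCol = 2`, `endCol = 4`) are assumptions chosen for illustration. Only `slice` and `zipWithIndex` from the standard collections are used.

```scala
object ColIndexSketch {
  // Hypothetical feature vector and batch window, for illustration only.
  val features: Array[Double] = Array(10.0, 11.0, 12.0, 13.0, 14.0)
  val startCol = 2
  val endCol = 4

  // Slice first, then index: indices restart at 0 inside the batch,
  // so they no longer identify the original feature column.
  val sliceThenIndex: Array[(Double, Int)] =
    features.slice(startCol, endCol).zipWithIndex

  // Index first, then slice: each value keeps its original column index.
  val indexThenSlice: Array[(Double, Int)] =
    features.zipWithIndex.slice(startCol, endCol)

  def main(args: Array[String]): Unit = {
    println(sliceThenIndex.mkString(", ")) // (12.0,0), (13.0,1)
    println(indexThenSlice.mkString(", ")) // (12.0,2), (13.0,3)
  }
}
```

    With slicing first, every batch after the first would report columns starting from 0 again, which is consistent with the column indexing bug described in this pull request.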




[GitHub] spark pull request: [SPARK-3087][MLLIB] fix col indexing bug in ch...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1997#issuecomment-52448544
  
    @mengxr  LGTM!  I didn't find any bugs or style issues.




[GitHub] spark pull request: [SPARK-3087][MLLIB] fix col indexing bug in ch...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1997




[GitHub] spark pull request: [SPARK-3087][MLLIB] fix col indexing bug in ch...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1997#issuecomment-52414858
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18695/consoleFull) for   PR 1997 at commit [`8fc2ab2`](https://github.com/apache/spark/commit/8fc2ab2a5ed6d7320542b8810c706c588f77bf5c).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3807][MLLIB] fix col indexing bug and a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1997#issuecomment-52414240
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18695/consoleFull) for   PR 1997 at commit [`8fc2ab2`](https://github.com/apache/spark/commit/8fc2ab2a5ed6d7320542b8810c706c588f77bf5c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3087][MLLIB] fix col indexing bug in ch...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1997#issuecomment-52448894
  
    Thanks! I've merged this into master and branch-1.1.

