Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/10/29 20:11:21 UTC

[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/3000

    [SPARK-4081] [mllib] DatasetIndexer

    This introduces a DatasetIndexer class which does the following:
    * fit(): collect statistics on how many distinct values each feature in a dataset (RDD[Vector]) takes
    * getCategoricalFeatureIndexes(): use the statistics to choose (a) which features should be treated as categorical vs. continuous and (b) 0-based indices for categorical feature values
    * transform(): use the result from getCategoricalFeatureIndexes() to re-index categorical feature values
    
    Currently, this kind of functionality is done on an ad-hoc basis (e.g., for labels in DecisionTreeRunner).  This attempts to standardize it.
    
    The basic usage pattern is:
    ```
    val myData1: RDD[Vector] = ...
    val myData2: RDD[Vector] = ...
    val datasetIndexer = new DatasetIndexer(maxCategories)
    datasetIndexer.fit(myData1)
    val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
    datasetIndexer.fit(myData2)
    val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
    val categoricalFeaturesInfo: Map[Int, Map[Double, Int]] = datasetIndexer.getCategoricalFeatureIndexes()
    ```
    
    Design notes:
    * This maintains sparsity in vectors by ensuring that categorical feature value 0.0 always gets index 0 (see the sketch after these notes).
    * This does not yet support transforming data with new (unknown) categorical feature values.  That can be added later.
    * This does not take advantage of sparsity in the input during fit(); it could be more efficient when given SparseVectors.
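
    A minimal sketch of the sparsity-preserving indexing from the first design note (`buildCategoryMap` is a hypothetical helper, not part of this PR):
    ```
    // Assign indices so that value 0.0, if present, always gets index 0,
    // keeping zeros as zeros in SparseVectors.
    def buildCategoryMap(distinctValues: Set[Double]): Map[Double, Int] = {
      val nonZero = distinctValues.filter(_ != 0.0).toArray.sorted
      val ordered = if (distinctValues.contains(0.0)) 0.0 +: nonZero else nonZero
      ordered.zipWithIndex.toMap
    }

    buildCategoryMap(Set(0.0, 3.0, 5.0))  // Map(0.0 -> 0, 3.0 -> 1, 5.0 -> 2)
    ```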
    
    CC: @mengxr  @manishamde  @codedeft  This should be helpful for DecisionTree and RandomForest.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark indexer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3000
    
----
commit 827518d072dc03d621c4915873468248d2925cc2
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-23T17:35:42Z

    working on DatasetIndexer

commit faa0ea71f5a44b9dc8fd4a6c7dc1f7674ca32772
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-27T18:08:16Z

    partly done with DatasetIndexerSuite

commit 15cc344bc6b7bef36fb81fb542ffb15d914cf7fe
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-27T23:07:55Z

    Merge remote-tracking branch 'upstream/master' into indexer

commit a2957b536ea25150a74507ebc6fda69230762a35
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-27T23:08:14Z

    DatasetIndexer now passes tests

commit 228fac6aec115beda8af15526b79f77f2a74023a
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-28T17:27:49Z

    Added another test for DatasetIndexer

commit a27e3b55629f0c8cee50cc6ddb2fde609fc0330c
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-29T02:47:33Z

    Merge remote-tracking branch 'upstream/master' into indexer

commit b9c43feb374584ebeee37f678b895844dc388e0d
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-29T18:44:19Z

    DatasetIndexer now maintains sparsity in SparseVector

commit fc781bdd5325e2a746b99d50d669de07351954fe
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2014-10-29T19:02:52Z

    Merge remote-tracking branch 'upstream/master' into indexer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91442017
  
    @jkbradley I have a question about the expected behavior. Say I have a vector column containing 2 features. One is categorical, taking values 0, 1, 2, 3, 4, 5, and the other is continuous but only takes the values 1.0, 2.0, 4.0. Then if I set `maxCategories` to 10, both will be recognized as categorical, and the mapping for the second feature may become something like 1.0 -> 0, 2.0 -> 1, 4.0 -> 2. Is that what we expect?


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19645862
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vectors, DenseVector, SparseVector, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +/**
    + * :: Experimental ::
    + * Class for indexing columns in a dataset.
    + *
    + * This helps process a dataset of unknown vectors into a dataset with some continuous features
    + * and some categorical features. The choice between continuous and categorical is based upon
    + * a maxCategories parameter.
    + *
    + * This can also map categorical feature values to 0-based indices.
    + *
    + * Usage:
    + *   val myData1: RDD[Vector] = ...
    + *   val myData2: RDD[Vector] = ...
    + *   val datasetIndexer = new DatasetIndexer(maxCategories)
    + *   datasetIndexer.fit(myData1)
    + *   val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
    + *   datasetIndexer.fit(myData2)
    + *   val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
    + *   val categoricalFeaturesInfo: Map[Int, Int] = datasetIndexer.getCategoricalFeaturesInfo()
    + *
    + * TODO: Add option for transform: defaultForUnknownValue (default index for unknown category).
    + *
    + * TODO: Add warning if a categorical feature has only 1 category.
    + */
    +@Experimental
    +class DatasetIndexer(
    +    val maxCategories: Int,
    +    val ignoreUnrecognizedCategories: Boolean = true)
    +  extends Logging with Serializable {
    +
    +  require(maxCategories > 1,
    +    s"DatasetIndexer given maxCategories = $maxCategories, but requires maxCategories > 1.")
    +
    +  private class FeatureValueStats(val numFeatures: Int, val maxCategories: Int)
    +    extends Serializable {
    +
    +    val featureValueSets = Array.fill[OpenHashSet[Double]](numFeatures)(new OpenHashSet[Double]())
    +
    +    /**
    +     * Merge other [[FeatureValueStats]] into this instance, modifying this instance.
    +     * @param other  Other instance.  Not modified.
    +     * @return This instance
    +     */
    +    def merge(other: FeatureValueStats): FeatureValueStats = {
    +      featureValueSets.zip(other.featureValueSets).foreach { case (fvs1, fvs2) =>
    +        fvs2.iterator.foreach { val2 =>
    +          if (fvs1.size <= maxCategories) fvs1.add(val2)
    +        }
    +      }
    +      this
    +    }
    +
    +    def addDenseVector(dv: DenseVector): Unit = {
    +      var i = 0
    +      while (i < dv.size) {
    +        if (featureValueSets(i).size <= maxCategories) {
    +          featureValueSets(i).add(dv(i))
    +        }
    +        i += 1
    +      }
    +    }
    +
    +    def addSparseVector(sv: SparseVector): Unit = {
    +      // TODO: This could be made more efficient.
    +      var vecIndex = 0 // index into vector
    +      var nzIndex = 0 // index into non-zero elements
    +      while (vecIndex < sv.size) {
    +        val featureValue = if (nzIndex < sv.indices.size && vecIndex == sv.indices(nzIndex)) {
    +          nzIndex += 1
    +          sv.values(nzIndex - 1)
    +        } else {
    +          0.0
    +        }
    +        if (featureValueSets(vecIndex).size <= maxCategories) {
    +          featureValueSets(vecIndex).add(featureValue)
    +        }
    +        vecIndex += 1
    +      }
    +    }
    +
    +  }
    +
    +  /**
    +   * Array (over features) of sets of distinct feature values (up to maxCategories values).
    +   * Null values in array indicate feature has been determined to be continuous.
    +   *
    +   * Once the number of elements in a feature's set reaches maxCategories + 1,
    +   * then it is declared continuous, and we stop adding elements.
    +   */
    +  private var featureValueStats: Option[FeatureValueStats] = None
    +
    +  /**
    +   * Scans a dataset once and updates statistics about each column.
    +   * The statistics are used to choose categorical features and re-index them.
    +   *
    +   * Warning: Calling this on a new dataset changes the feature statistics and thus
    +   *          can change the behavior of [[transform]] and [[getCategoricalFeatureIndexes]].
    +   *          It is best to [[fit]] on all datasets before calling [[transform]] on any.
    +   *
    +   * @param data  Dataset with equal-length vectors.
    +   *              NOTE: A single instance of [[DatasetIndexer]] must always be given vectors of
    +   *              the same length.  If given non-matching vectors, this method will throw an error.
    +   */
    +  def fit(data: RDD[Vector]): Unit = {
    +    // For each partition, get (featureValueStats, newNumFeatures).
    +    //  If all vectors have the same length, then newNumFeatures = -1.
    +    //  If a vector with a new length is found, then newNumFeatures is set to that length.
    +    val partitionFeatureValueSets: RDD[(Option[FeatureValueStats], Int)] =
    +      data.mapPartitions { iter =>
    +        // Make local copy of featureValueStats.
    +        //  This will be None initially if this is the first dataset to be fitted.
    +        var localFeatureValueStats: Option[FeatureValueStats] = featureValueStats
    +        var localNumFeatures: Int = -1
    --- End diff --
    
    -1 is used in several places to signify an uninitialized value. A private named val might be better, even if its value is -1.
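
    For illustration, a minimal sketch of this suggestion (the constant name `UnknownNumFeatures` is hypothetical):
    ```
    // Hypothetical named sentinel replacing the bare -1.
    private val UnknownNumFeatures: Int = -1

    var localNumFeatures: Int = UnknownNumFeatures
    // ...later checks then read as intent:
    // if (localNumFeatures == UnknownNumFeatures) ...
    ```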


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61725203
  
      [Test build #22901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22901/consoleFull) for   PR 3000 at commit [`aed6bb3`](https://github.com/apache/spark/commit/aed6bb34979fa72911eae49e82210afd36ebb199).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19634820
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vectors, DenseVector, SparseVector, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +/**
    + * :: Experimental ::
    + * Class for indexing columns in a dataset.
    + *
    + * This helps process a dataset of unknown vectors into a dataset with some continuous features
    + * and some categorical features. The choice between continuous and categorical is based upon
    + * a maxCategories parameter.
    --- End diff --
    
    I agree about adding more criteria and options later on.
    For a default value, does 16 seem reasonable?


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19633564
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vectors, DenseVector, SparseVector, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +/**
    + * :: Experimental ::
    + * Class for indexing columns in a dataset.
    + *
    + * This helps process a dataset of unknown vectors into a dataset with some continuous features
    + * and some categorical features. The choice between continuous and categorical is based upon
    + * a maxCategories parameter.
    --- End diff --
    
    maxCategories as a threshold is a good default. In the future, we may want to add different criteria for some features. Thoughts? Moreover, should we have a reasonable default value?


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61712479
  
      [Test build #22890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22890/consoleFull) for   PR 3000 at commit [`0d947cb`](https://github.com/apache/spark/commit/0d947cbaac82bf58c8609f8a3454916ca558024a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable `
      * `class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable `



[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91884972
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30082/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-74609759
  
    @sryza  Thanks for offering!  That would be great if you have the bandwidth to work on this.  I'd be happy to help review.
    
    One comment: It would be nice to be able to take advantage of FeatureAttributes in the spark.ml package, but that's a WIP right now: [https://github.com/apache/spark/pull/4460]


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91424352
  
      [Test build #30003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30003/consoleFull) for   PR 3000 at commit [`643b444`](https://github.com/apache/spark/commit/643b4449e1b1af025cf2dcd00c2fd90f9cbe4c29).


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91871404
  
      [Test build #30082 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30082/consoleFull) for   PR 3000 at commit [`5956d91`](https://github.com/apache/spark/commit/5956d9197de833bfee870dadd51bbed7ec136ea1).


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61002439
  
      [Test build #22463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22463/consoleFull) for   PR 3000 at commit [`fc781bd`](https://github.com/apache/spark/commit/fc781bdd5325e2a746b99d50d669de07351954fe).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DatasetIndexer(`



[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19635024
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vectors, DenseVector, SparseVector, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +/**
    + * :: Experimental ::
    + * Class for indexing columns in a dataset.
    + *
    + * This helps process a dataset of unknown vectors into a dataset with some continuous features
    + * and some categorical features. The choice between continuous and categorical is based upon
    + * a maxCategories parameter.
    --- End diff --
    
    Sure. I was going to suggest 32, but 16 is reasonable as well. I think R only supports up to 32 levels for categorical features, so that was my motivation.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61334876
  
    @manishamde  Thanks for the feedback!  I realized I can't really include a fit(RDD[Double]) method since it conflicts with fit(RDD[Vector]): type erasure strips away the Double/Vector type parameter, so both overloads erase to fit(RDD[_]).  I instead included a note about mapping to Vector.
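
    A minimal sketch of the erasure clash and the mapping workaround (class and method names here are illustrative only):
    ```
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    class Indexer {
      def fit(data: RDD[Vector]): Unit = ()
      // def fit(data: RDD[Double]): Unit = ()  // won't compile: after erasure,
      //                                        // both overloads are fit(RDD)
    }

    // The workaround noted above: wrap each Double in a 1-element Vector.
    def toVectors(labels: RDD[Double]): RDD[Vector] =
      labels.map(x => Vectors.dense(x))
    ```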
    
    I believe the PR is ready.  I removed the non-implemented parameter for unrecognized categories, to be added later.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-60986763
  
      [Test build #22463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22463/consoleFull) for   PR 3000 at commit [`fc781bd`](https://github.com/apache/spark/commit/fc781bdd5325e2a746b99d50d669de07351954fe).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61002448
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22463/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-92219015
  
    LGTM. Merged into master. Thanks, and sorry for the long delay!


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91871058
  
    I did some minor cleanups, but I don't see any great places to remove code.  I added a Java test suite.


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91878787
  
      [Test build #30078 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30078/consoleFull) for   PR 3000 at commit [`f5c57a8`](https://github.com/apache/spark/commit/f5c57a80bb4a81466220974e2b2a6676d5e85459).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams `
    
     * This patch does not change any dependencies.


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91425878
  
    I feel like the code could be a bit shorter; I'll think about that more tomorrow and whether we can make working with DataFrames and metadata easier in general.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-89421959
  
      [Test build #29694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29694/consoleFull) for   PR 3000 at commit [`286d221`](https://github.com/apache/spark/commit/286d22104e19585368dd83749a15a1409a9a53cf).


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-74034874
  
    @jkbradley sorry for the delay in responding here.  Your breakdown of operations makes sense to me.
    
    A stats collector seems like a good idea.  I also wonder if there's some way to hook it in with Hive table statistics so we can avoid a pass over the data, but maybe that should be saved for the future.  If you aren't planning to get to this in the near future but think you'll have bandwidth to review, I'd be happy to work on it.  Otherwise, I'm happy to look over whatever you put up.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19635743
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vectors, DenseVector, SparseVector, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +/**
    + * :: Experimental ::
    + * Class for indexing columns in a dataset.
    + *
    + * This helps process a dataset of unknown vectors into a dataset with some continuous features
    + * and some categorical features. The choice between continuous and categorical is based upon
    + * a maxCategories parameter.
    --- End diff --
    
    OK, 32 sounds good


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91435489
  
      [Test build #30002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30002/consoleFull) for   PR 3000 at commit [`02236c3`](https://github.com/apache/spark/commit/02236c35623a2a8b95497757feb638e18671d961).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams `
    
     * This patch does not change any dependencies.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61435216
  
      [Test build #22787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22787/consoleFull) for   PR 3000 at commit [`ee495e4`](https://github.com/apache/spark/commit/ee495e4022263904fa234f6eeb4cd53459ebc5e2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61435221
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22787/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-62588894
  
    Just noticed this. I'd been working on something similar a little while ago on SPARK-1216 / #304. One difference is that I had aimed to accept categorical features that are strings, as input data commonly comes this way.  Do you think that functionality should come here or in a separate PR?
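
    For context, one way string categories could be pre-indexed to doubles before reaching this indexer; this is a hypothetical sketch, not code from SPARK-1216 / #304:
    ```
    import org.apache.spark.rdd.RDD

    // Build a string -> index map on the driver, then map the column to
    // doubles that the numeric indexer can consume.
    def indexStrings(col: RDD[String]): RDD[Double] = {
      val valueToIndex: Map[String, Int] =
        col.distinct().collect().sorted.zipWithIndex.toMap
      col.map(s => valueToIndex(s).toDouble)
    }
    ```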


[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61168626
  
    Good point; I intended this to be used for labels too, so I'll add fit() and transform() methods which take RDD[Double].  Perhaps I should relabel "features" to "columns."  I'd imagine someone either using 2 indexers (1 for labels and 1 for features), or zipping the labels and features into 1 vector and then using 1 indexer.  We could also add other fit() and transform() methods later on to prevent users from having to do the zipping manually.
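
    A minimal sketch of the zipping approach (hypothetical helper; note that it densifies the features, so it ignores sparsity):
    ```
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Prepend the label so that a single indexer sees it as column 0.
    def zipLabelAndFeatures(data: RDD[LabeledPoint]): RDD[Vector] =
      data.map(lp => Vectors.dense(lp.label +: lp.features.toArray))
    ```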


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91423455
  
    I think it's ready now.  I'll add a quick Java unit test soon to make sure getters/setters work correctly.


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91692851
  
    @mengxr  Yes, that's what we'd expect.  Eventually, we'd want to be able to specify which features to index, either (a) via a parameter listing the specific features to index or (b) via metadata, where we would not index features which already have metadata.


[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r28195820
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
    @@ -0,0 +1,394 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.ml.attribute.{BinaryAttribute, NumericAttribute, NominalAttribute,
    +  Attribute, AttributeGroup}
    +import org.apache.spark.ml.param.{HasInputCol, HasOutputCol, IntParam, ParamMap, Params}
    +import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, VectorUDT}
    +import org.apache.spark.sql.{Row, DataFrame}
    +import org.apache.spark.sql.functions.callUDF
    +import org.apache.spark.sql.types.{StructField, StructType}
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +
    +/** Private trait for params for VectorIndexer and VectorIndexerModel */
    +private[ml] trait VectorIndexerParams extends Params with HasInputCol with HasOutputCol {
    +
    +  /**
    +   * Threshold for the number of values a categorical feature can take.
    +   * If a feature is found to have > maxCategories values, then it is declared continuous.
    +   *
    +   * (default = 20)
    +   */
    +  val maxCategories = new IntParam(this, "maxCategories",
    +    "Threshold for the number of values a categorical feature can take." +
    +      " If a feature is found to have > maxCategories values, then it is declared continuous.",
    +    Some(20))
    +
    +  /** @group getParam */
    +  def getMaxCategories: Int = get(maxCategories)
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + *
    + * Class for indexing categorical feature columns in a dataset of [[Vector]].
    + *
    + * This has 2 usage modes:
    + *  - Automatically identify categorical features (default behavior)
    + *     - This helps process a dataset of unknown vectors into a dataset with some continuous
    + *       features and some categorical features. The choice between continuous and categorical
    + *       is based upon a maxCategories parameter.
    + *     - Set maxCategories to the maximum number of categories any categorical feature should have.
    + *     - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}.
    + *       If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1},
    + *       and feature 1 will be declared continuous.
    + *  - Index all features, if all features are categorical
    + *     - If maxCategories is set to be very large, then this will build an index of unique
    + *       values for all features.
    + *     - Warning: This can cause problems if features are continuous since this will collect ALL
    + *       unique values to the driver.
    + *     - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}.
    + *       If maxCategories >= 3, then both features will be declared categorical.
    + *
    + * This returns a model which can transform categorical features to use 0-based indices.
    + *
    + * Index stability:
    + *  - This is not guaranteed to choose the same category index across multiple runs.
    + *  - If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0.
    + *    This maintains vector sparsity.
    + *  - More stability may be added in the future.
    + *
    + * TODO: Future extensions: The following functionality is planned for the future:
    + *  - Preserve metadata in transform; if a feature's metadata is already present, do not recompute.
    + *  - Specify certain features to not index, either via a parameter or via existing metadata.
    + *  - Add warning if a categorical feature has only 1 category.
    + *  - Add option for allowing unknown categories.
    + */
    +@AlphaComponent
    +class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams {
    +
    +  /** @group setParam */
    +  def setMaxCategories(value: Int): this.type = {
    +    require(value > 1,
    +      s"VectorIndexer given maxCategories = $value, but requires maxCategories > 1.")
    +    set(maxCategories, value)
    +  }
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): VectorIndexerModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = this.paramMap ++ paramMap
    +    val firstRow = dataset.select(map(inputCol)).take(1)
    +    require(firstRow.length == 1, s"VectorIndexer cannot be fit on an empty dataset.")
    +    val numFeatures = firstRow(0).getAs[Vector](0).size
    +    val vectorDataset = dataset.select(map(inputCol)).map { case Row(v: Vector) => v }
    +    val maxCats = map(maxCategories)
    +    val categoryStats: VectorIndexer.CategoryStats = vectorDataset.mapPartitions { iter =>
    +      val localCatStats = new VectorIndexer.CategoryStats(numFeatures, maxCats)
    +      iter.foreach(localCatStats.addVector)
    +      Iterator(localCatStats)
    +    }.reduce((stats1, stats2) => stats1.merge(stats2))
    +    val model = new VectorIndexerModel(this, map, numFeatures, categoryStats.getCategoryMaps)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    // We do not transfer feature metadata since we do not know what types of features we will
    +    // produce in transform().
    +    val map = this.paramMap ++ paramMap
    +    val dataType = new VectorUDT
    +    require(map.contains(inputCol), s"VectorIndexer requires input column parameter: $inputCol")
    +    require(map.contains(outputCol), s"VectorIndexer requires output column parameter: $outputCol")
    +    checkInputColumn(schema, map(inputCol), dataType)
    +    addOutputColumn(schema, map(outputCol), dataType)
    +  }
    +}
    +
    +private object VectorIndexer {
    +
    +  /**
    +   * Helper class for tracking unique values for each feature.
    +   *
    +   * TODO: Track which features are known to be continuous already; do not update counts for them.
    +   *
    +   * @param numFeatures  This class fails if it encounters a Vector whose length is not numFeatures.
    +   * @param maxCategories  This class caps the number of unique values collected at maxCategories.
    +   */
    +  class CategoryStats(private val numFeatures: Int, private val maxCategories: Int)
    +    extends Serializable {
    +
    +    /** featureValueSets[feature index] = set of unique values */
    +    private val featureValueSets =
    +      Array.fill[OpenHashSet[Double]](numFeatures)(new OpenHashSet[Double]())
    +
    +    /**
    +     * Merge with another instance, modifying this instance.
    +     * @param other  Other instance, not modified
    +     * @return This instance, modified
    +     */
    +    def merge(other: CategoryStats): CategoryStats = {
    +      featureValueSets.zip(other.featureValueSets).foreach { case (thisValSet, otherValSet) =>
    +        otherValSet.iterator.foreach { x =>
    +          // Once we have found > maxCategories values, we know the feature is continuous
    +          // and do not need to collect more values for it.
    +          if (thisValSet.size <= maxCategories) thisValSet.add(x)
    +        }
    +      }
    +      this
    +    }
    +
    +    /** Add a new vector to this index, updating sets of unique feature values */
    +    def addVector(v: Vector): Unit = {
    +      require(v.size == numFeatures, s"VectorIndexer expected $numFeatures features but" +
    +        s" found vector of size ${v.size}.")
    +      v match {
    +        case dv: DenseVector => addDenseVector(dv)
    +        case sv: SparseVector => addSparseVector(sv)
    +      }
    +    }
    +
    +    /**
    +     * Based on stats collected, decide which features are categorical,
    +     * and choose indices for categories.
    +     *
    +     * Sparsity: This tries to maintain sparsity by treating value 0.0 specially.
    +     *           If a categorical feature takes value 0.0, then value 0.0 is given index 0.
    +     *
    +     * @return  Feature value index.  Keys are categorical feature indices (column indices).
    +   *          Values are mappings from original feature values to 0-based category indices.
    +     */
    +    def getCategoryMaps: Map[Int, Map[Double, Int]] = {
    +      // Filter out features which are declared continuous.
    +      featureValueSets.zipWithIndex.filter(_._1.size <= maxCategories).map {
    +        case (featureValues: OpenHashSet[Double], featureIndex: Int) =>
    +          // Get feature values, but remove 0 to treat separately.
    +          // If value 0 exists, give it index 0 to maintain sparsity if possible.
    +          var sortedFeatureValues = featureValues.iterator.filter(_ != 0.0).toArray.sorted
    +          val zeroExists = sortedFeatureValues.length + 1 == featureValues.size
    +          if (zeroExists) {
    +            sortedFeatureValues = 0.0 +: sortedFeatureValues
    +          }
    +          val categoryMap: Map[Double, Int] = sortedFeatureValues.zipWithIndex.toMap
    +          (featureIndex, categoryMap)
    +      }.toMap
    +    }
    +
    +    private def addDenseVector(dv: DenseVector): Unit = {
    +      var i = 0
    +      while (i < dv.size) {
    +        if (featureValueSets(i).size <= maxCategories) {
    +          featureValueSets(i).add(dv(i))
    +        }
    +        i += 1
    +      }
    +    }
    +
    +    private def addSparseVector(sv: SparseVector): Unit = {
    +      // TODO: This might be able to handle 0's more efficiently.
    +      var vecIndex = 0 // index into vector
    +      var k = 0 // index into non-zero elements
    +      while (vecIndex < sv.size) {
    +        val featureValue = if (k < sv.indices.length && vecIndex == sv.indices(k)) {
    +          k += 1
    +          sv.values(k - 1)
    +        } else {
    +          0.0
    +        }
    +        if (featureValueSets(vecIndex).size <= maxCategories) {
    +          featureValueSets(vecIndex).add(featureValue)
    +        }
    +        vecIndex += 1
    +      }
    +    }
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + *
    + * Transform categorical features to use 0-based indices instead of their original values.
    + *  - Categorical features are mapped to their feature value indices.
    + *  - Continuous features (columns) are left unchanged.
    + *
    + * This maintains vector sparsity.
    + *
    + * Note: If this model was created for vectors of length numFeatures,
    + *       this model's transform method must be given vectors of length numFeatures.
    + *
    + * @param numFeatures  Number of features, i.e., length of Vectors which this transforms
    + * @param categoryMaps  Feature value index.  Keys are categorical feature indices (column indices).
    + *                      Values are maps from original feature values to 0-based category indices.
    + */
    +@AlphaComponent
    +class VectorIndexerModel private[ml] (
    +    override val parent: VectorIndexer,
    +    override val fittingParamMap: ParamMap,
    +    val numFeatures: Int,
    +    val categoryMaps: Map[Int, Map[Double, Int]])
    +  extends Model[VectorIndexerModel] with VectorIndexerParams {
    +
    +  /**
    +   * Pre-computed feature attributes, with some missing info.
    +   * In transform(), set attribute name and other info, if available.
    +   */
    +  private val partialFeatureAttributes: Array[Attribute] = {
    +    val attrs = new Array[Attribute](numFeatures)
    +    var categoricalFeatureCount = 0 // validity check for numFeatures, categoryMaps
    +    var featureIndex = 0
    +    while (featureIndex < numFeatures) {
    +      if (categoryMaps.contains(featureIndex)) {
    +        // categorical feature
    +        val featureValues = categoryMaps(featureIndex).toArray.sortBy(_._1).map(_._1)
    +        if (featureValues.length == 2) {
    +          attrs(featureIndex) = new BinaryAttribute(index = Some(featureIndex),
    +            values = Some(featureValues.map(_.toString)))
    +        } else {
    +          attrs(featureIndex) = new NominalAttribute(index = Some(featureIndex),
    +            isOrdinal = Some(false), values = Some(featureValues.map(_.toString)))
    +        }
    +        categoricalFeatureCount += 1
    +      } else {
    +        // continuous feature
    +        attrs(featureIndex) = new NumericAttribute(index = Some(featureIndex))
    +      }
    +      featureIndex += 1
    +    }
    +    require(categoricalFeatureCount == categoryMaps.size, "VectorIndexerModel given categoryMaps" +
    +      s" with keys outside expected range [0,...,numFeatures), where numFeatures=$numFeatures")
    +    attrs
    +  }
    +
    +  // TODO: Check more carefully whether this whole class will be included in a closure.
    +
    +  private val transformFunc: Vector => Vector = {
    +    val sortedCategoricalFeatureIndices = categoryMaps.keys.toArray.sorted
    +    val localVectorMap = categoryMaps
    +    val f: Vector => Vector = {
    +      case dv: DenseVector =>
    +        val tmpv = dv.copy
    +        localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
    +          tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
    +        }
    +        tmpv
    +      case sv: SparseVector =>
    +        // We use the fact that categorical value 0 is always mapped to index 0.
    +        val tmpv = sv.copy
    +        var catFeatureIdx = 0 // index into sortedCategoricalFeatureIndices
    +        var k = 0 // index into non-zero elements of sparse vector
    +        while (catFeatureIdx < sortedCategoricalFeatureIndices.length && k < tmpv.indices.length) {
    +          val featureIndex = sortedCategoricalFeatureIndices(catFeatureIdx)
    +          if (featureIndex < tmpv.indices(k)) {
    +            catFeatureIdx += 1
    +          } else if (featureIndex > tmpv.indices(k)) {
    +            k += 1
    +          } else {
    +            tmpv.values(k) = localVectorMap(featureIndex)(tmpv.values(k))
    +            catFeatureIdx += 1
    +            k += 1
    +          }
    +        }
    +        tmpv
    +    }
    +    f
    +  }
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = this.paramMap ++ paramMap
    +    val newField = prepOutputField(dataset.schema, map)
    +    val newCol = callUDF(transformFunc, new VectorUDT, dataset(map(inputCol)))
    +    // For now, just check the first row of inputCol for vector length.
    +    val firstRow = dataset.select(map(inputCol)).take(1)
    +    if (firstRow.length != 0) {
    +      val actualNumFeatures = firstRow(0).getAs[Vector](0).size
    +      require(numFeatures == actualNumFeatures, "VectorIndexerModel expected vector of length" +
    +        s" $numFeatures but found length $actualNumFeatures")
    +    }
    +    dataset.withColumn(map(outputCol), newCol.as(map(outputCol), newField.metadata))
    --- End diff --
    
    It'd be nice for withColumn to take metadata.  I'll make a JIRA for that.
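
    For anyone following along, here is a minimal usage sketch of the Estimator/Model pair in this diff. The `setMaxCategories` setter and the DataFrame construction are assumptions on my part (the diff only shows the `maxCategories` param being read and the model's `setInputCol`/`setOutputCol`), so treat this as illustrative rather than canonical:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(new SparkConf().setAppName("VectorIndexerExample").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Features 0 and 1 each take 2 distinct values; feature 2 looks continuous.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 1.0, 12.5),
      Vectors.dense(1.0, 0.0, -2.0),
      Vectors.dense(0.0, 1.0, 33.1)
    )).map(Tuple1.apply).toDF("features")

    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(2)  // assumed setter for the maxCategories param

    val model = indexer.fit(data)       // one pass: collects per-feature distinct-value sets
    val indexed = model.transform(data) // categorical values -> 0-based indices; 0.0 keeps index 0
    println(model.categoryMaps)         // e.g., Map(0 -> Map(0.0 -> 0, 1.0 -> 1), 1 -> ...)
    ```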




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91884957
  
      [Test build #30082 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30082/consoleFull) for   PR 3000 at commit [`5956d91`](https://github.com/apache/spark/commit/5956d9197de833bfee870dadd51bbed7ec136ea1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams `
    
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61432188
  
      [Test build #22787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22787/consoleFull) for   PR 3000 at commit [`ee495e4`](https://github.com/apache/spark/commit/ee495e4022263904fa234f6eeb4cd53459ebc5e2).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61335029
  
    Cool. I will make another pass shortly. 




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61351601
  
    LGTM. 




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61335324
  
      [Test build #22642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22642/consoleFull) for   PR 3000 at commit [`831aa92`](https://github.com/apache/spark/commit/831aa926ca3ce480c9b73c6ec08e2b138c2586b2).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-62630207
  
    @sryza  Hi, yes, I didn't realize that they shared some functionality.  It would be great to coordinate.  I think these 2 types of feature transformations are pretty different, but there is some shared underlying functionality.
    Feature operations:
    * Decide which features should be categorical (this PR)
    * Relabel categorical feature values based on an index (this PR)
    * Create new features by expanding a categorical feature (your PR)
    * Count statistics about dataset columns (both PRs)

    The first 3 operations seem fairly distinct to me.  But the last one (which does not really need to be exposed to users) could definitely be shared.
    
    We both need to know how many distinct values there are in a column, with some extra options.  (You need to specify a subset of columns, and I need to limit the number of distinct values at some point.)  Perhaps we could combine these into some sort of stats collector (maybe private[mllib] for now?) which we can both use.  I'd be happy to do that, or let me know if you'd like to.
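
    To make the idea concrete, here is a rough sketch of what that shared collector could look like (all names here are hypothetical, and it uses a plain mutable.HashSet instead of OpenHashSet so it stays self-contained):

    ```scala
    import scala.collection.mutable

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical shared helper: per-column distinct-value counts, restricted to
    // a subset of columns, with each set capped at maxDistinct + 1 values.
    object ColumnStats {
      def distinctValueCounts(
          data: RDD[Vector],
          columns: Seq[Int],
          maxDistinct: Int): Map[Int, Int] = {
        val merged = data.mapPartitions { iter =>
          val sets = columns.map(i => i -> mutable.HashSet.empty[Double]).toMap
          iter.foreach { v =>
            columns.foreach { i =>
              if (sets(i).size <= maxDistinct) sets(i).add(v(i))
            }
          }
          Iterator(sets)
        }.reduce { (a, b) =>
          columns.foreach { i =>
            b(i).foreach(x => if (a(i).size <= maxDistinct) a(i).add(x))
          }
          a
        }
        // A count of maxDistinct + 1 means "more than maxDistinct distinct values".
        merged.map { case (i, s) => i -> s.size }
      }
    }
    ```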




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61431701
  
      [Test build #507 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/507/consoleFull) for   PR 3000 at commit [`831aa92`](https://github.com/apache/spark/commit/831aa926ca3ce480c9b73c6ec08e2b138c2586b2).
     * This patch **does not merge cleanly**.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61736010
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22901/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19636313
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vectors, DenseVector, SparseVector, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.collection.OpenHashSet
    +
    +/**
    + * :: Experimental ::
    + * Class for indexing columns in a dataset.
    + *
    + * This helps process a dataset of unknown vectors into a dataset with some continuous features
    + * and some categorical features. The choice between continuous and categorical is based upon
    + * a maxCategories parameter.
    + *
    + * This can also map categorical feature values to 0-based indices.
    + *
    + * Usage:
    + *   val myData1: RDD[Vector] = ...
    + *   val myData2: RDD[Vector] = ...
    + *   val datasetIndexer = new DatasetIndexer(maxCategories)
    + *   datasetIndexer.fit(myData1)
    + *   val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
    + *   datasetIndexer.fit(myData2)
    + *   val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
    + *   val categoricalFeaturesInfo: Map[Int, Int] = datasetIndexer.getCategoricalFeaturesInfo()
    + *
    + * TODO: Add option for transform: defaultForUnknownValue (default index for unknown category).
    + *
    + * TODO: Add warning if a categorical feature has only 1 category.
    + */
    +@Experimental
    +class DatasetIndexer(
    +    val maxCategories: Int,
    +    val ignoreUnrecognizedCategories: Boolean = true)
    +  extends Logging with Serializable {
    +
    +  require(maxCategories > 1,
    +    s"DatasetIndexer given maxCategories = $maxCategories, but requires maxCategories > 1.")
    +
    +  private class FeatureValueStats(val numFeatures: Int, val maxCategories: Int)
    +    extends Serializable {
    +
    +    val featureValueSets = Array.fill[OpenHashSet[Double]](numFeatures)(new OpenHashSet[Double]())
    +
    +    /**
    +     * Merge other [[FeatureValueStats]] into this instance, modifying this instance.
    +     * @param other  Other instance.  Not modified.
    +     * @return This instance
    +     */
    +    def merge(other: FeatureValueStats): FeatureValueStats = {
    +      featureValueSets.zip(other.featureValueSets).foreach { case (fvs1, fvs2) =>
    +        fvs2.iterator.foreach { val2 =>
    +          if (fvs1.size <= maxCategories) fvs1.add(val2)
    +        }
    +      }
    +      this
    +    }
    +
    +    def addDenseVector(dv: DenseVector): Unit = {
    +      var i = 0
    +      while (i < dv.size) {
    +        if (featureValueSets(i).size <= maxCategories) {
    +          featureValueSets(i).add(dv(i))
    +        }
    +        i += 1
    +      }
    +    }
    +
    +    def addSparseVector(sv: SparseVector): Unit = {
    +      // TODO: This could be made more efficient.
    +      var vecIndex = 0 // index into vector
    +      var nzIndex = 0 // index into non-zero elements
    +      while (vecIndex < sv.size) {
    +        val featureValue = if (nzIndex < sv.indices.size && vecIndex == sv.indices(nzIndex)) {
    +          nzIndex += 1
    +          sv.values(nzIndex - 1)
    +        } else {
    +          0.0
    +        }
    +        if (featureValueSets(vecIndex).size <= maxCategories) {
    +          featureValueSets(vecIndex).add(featureValue)
    +        }
    +        vecIndex += 1
    +      }
    +    }
    +
    +  }
    +
    +  /**
    +   * Array (over features) of sets of distinct feature values (up to maxCategories values).
    +   * Null values in array indicate feature has been determined to be continuous.
    +   *
    +   * Once the number of elements in a feature's set reaches maxCategories + 1,
    +   * then it is declared continuous, and we stop adding elements.
    +   */
    +  private var featureValueStats: Option[FeatureValueStats] = None
    +
    +  /**
    +   * Scans a dataset once and updates statistics about each column.
    +   * The statistics are used to choose categorical features and re-index them.
    +   *
    +   * Warning: Calling this on a new dataset changes the feature statistics and thus
    +   *          can change the behavior of [[transform]] and [[getCategoricalFeatureIndexes]].
    +   *          It is best to [[fit]] on all datasets before calling [[transform]] on any.
    +   *
    +   * @param data  Dataset with equal-length vectors.
    +   *              NOTE: A single instance of [[DatasetIndexer]] must always be given vectors of
    +   *              the same length.  If given non-matching vectors, this method will throw an error.
    --- End diff --
    
    Minor: extra space.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61168991
  
    Agree.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19634868
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +  private class FeatureValueStats(val numFeatures: Int, val maxCategories: Int)
    +    extends Serializable {
    +
    +    val featureValueSets = Array.fill[OpenHashSet[Double]](numFeatures)(new OpenHashSet[Double]())
    +
    +    /**
    +     * Merge other [[FeatureValueStats]] into this instance, modifying this instance.
    +     * @param other  Other instance.  Not modified.
    +     * @return This instance
    +     */
    +    def merge(other: FeatureValueStats): FeatureValueStats = {
    +      featureValueSets.zip(other.featureValueSets).foreach { case (fvs1, fvs2) =>
    +        fvs2.iterator.foreach { val2 =>
    +          if (fvs1.size <= maxCategories) fvs1.add(val2)
    --- End diff --
    
    We will, but all we need to know is that the merged set holds more than maxCategories distinct values.  Once we know that, the feature will definitely be considered continuous, so we don't need to collect more statistics on it.
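
    Concretely, the guard caps each set at maxCategories + 1 elements, and overflowing that cap is the proof of continuity. A tiny self-contained illustration of the invariant (plain scala.collection.mutable.HashSet standing in for OpenHashSet):

    ```scala
    import scala.collection.mutable

    val maxCategories = 2
    val values = mutable.HashSet.empty[Double]

    // Same guard as in merge()/addDenseVector(): stop adding once the cap is hit.
    Seq(1.0, 2.0, 3.0, 4.0, 5.0).foreach { x =>
      if (values.size <= maxCategories) values.add(x)
    }

    assert(values.size == maxCategories + 1) // capped, yet proves > maxCategories distinct values
    val isContinuous = values.size > maxCategories // true
    ```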




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91878807
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30078/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61712491
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22890/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61340126
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22642/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3000




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91437227
  
      [Test build #30003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30003/consoleFull) for   PR 3000 at commit [`643b444`](https://github.com/apache/spark/commit/643b4449e1b1af025cf2dcd00c2fd90f9cbe4c29).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams `
    
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91437229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30003/
    Test PASSed.




[GitHub] spark pull request: [WIP] [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-89439346
  
      [Test build #29694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29694/consoleFull) for   PR 3000 at commit [`286d221`](https://github.com/apache/spark/commit/286d22104e19585368dd83749a15a1409a9a53cf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams `
    
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91865824
  
      [Test build #30078 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30078/consoleFull) for   PR 3000 at commit [`f5c57a8`](https://github.com/apache/spark/commit/f5c57a80bb4a81466220974e2b2a6676d5e85459).




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61736005
  
      [Test build #22901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22901/consoleFull) for   PR 3000 at commit [`aed6bb3`](https://github.com/apache/spark/commit/aed6bb34979fa72911eae49e82210afd36ebb199).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable `





[GitHub] spark pull request: [WIP] [SPARK-4081] [mllib] VectorIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-89439352
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29694/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19634947
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +  private class FeatureValueStats(val numFeatures: Int, val maxCategories: Int)
    +    extends Serializable {
    +
    +    val featureValueSets = Array.fill[OpenHashSet[Double]](numFeatures)(new OpenHashSet[Double]())
    +
    +    /**
    +     * Merge other [[FeatureValueStats]] into this instance, modifying this instance.
    +     * @param other  Other instance.  Not modified.
    +     * @return This instance
    +     */
    +    def merge(other: FeatureValueStats): FeatureValueStats = {
    +      featureValueSets.zip(other.featureValueSets).foreach { case (fvs1, fvs2) =>
    +        fvs2.iterator.foreach { val2 =>
    +          if (fvs1.size <= maxCategories) fvs1.add(val2)
    --- End diff --
    
    Ignore my statement above. I noticed below that you are using size == (maxCategories + 1) as a signal for a continuous feature. Maybe a small comment here would be helpful.




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91423699
  
      [Test build #30002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30002/consoleFull) for   PR 3000 at commit [`02236c3`](https://github.com/apache/spark/commit/02236c35623a2a8b95497757feb638e18671d961).




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3000#discussion_r19634083
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/DatasetIndexer.scala ---
    @@ -0,0 +1,280 @@
    +  private class FeatureValueStats(val numFeatures: Int, val maxCategories: Int)
    +    extends Serializable {
    +
    +    val featureValueSets = Array.fill[OpenHashSet[Double]](numFeatures)(new OpenHashSet[Double]())
    +
    +    /**
    +     * Merge other [[FeatureValueStats]] into this instance, modifying this instance.
    +     * @param other  Other instance.  Not modified.
    +     * @return This instance
    +     */
    +    def merge(other: FeatureValueStats): FeatureValueStats = {
    +      featureValueSets.zip(other.featureValueSets).foreach { case (fvs1, fvs2) =>
    +        fvs2.iterator.foreach { val2 =>
    +          if (fvs1.size <= maxCategories) fvs1.add(val2)
    --- End diff --
    
    Are we ignoring some hash key-value pairs of fs2 when fs1.size + fs2.size > maxCategories?




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61434869
  
      [Test build #507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/507/consoleFull) for   PR 3000 at commit [`831aa92`](https://github.com/apache/spark/commit/831aa926ca3ce480c9b73c6ec08e2b138c2586b2).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable `





[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61340118
  
      [Test build #22642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22642/consoleFull) for   PR 3000 at commit [`831aa92`](https://github.com/apache/spark/commit/831aa926ca3ce480c9b73c6ec08e2b138c2586b2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable `





[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61699375
  
      [Test build #22890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22890/consoleFull) for   PR 3000 at commit [`0d947cb`](https://github.com/apache/spark/commit/0d947cbaac82bf58c8609f8a3454916ca558024a).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4081] [mllib] VectorIndexer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-91435508
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30002/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/3000#issuecomment-61166762
  
    How about a transformation for labels? This would help with classification, especially converting +1/-1 labels to 0/1 for binary classification.
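
    Something along these lines (the helper name is made up, just to illustrate what I mean):

    ```scala
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Made-up helper: re-index arbitrary label values (e.g., -1.0/+1.0) as 0-based indices.
    def indexLabels(data: RDD[LabeledPoint]): (RDD[LabeledPoint], Map[Double, Int]) = {
      val labelToIndex = data.map(_.label).distinct().collect().sorted.zipWithIndex.toMap
      val indexed = data.map(lp => LabeledPoint(labelToIndex(lp.label).toDouble, lp.features))
      (indexed, labelToIndex)
    }
    // With labels {-1.0, +1.0}, labelToIndex is Map(-1.0 -> 0, 1.0 -> 1).
    ```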

