Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2015/06/01 23:41:27 UTC

[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/6561

    [SPARK-7582] [MLLIB] user guide for StringIndexer

    This PR adds a Java unit test and user guide for `StringIndexer`. I put it before `OneHotEncoder` because they are closely related. @jkbradley 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-7582

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6561.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6561
    
----
commit 136cb93dbf7f89cdc270646243dfee37b42792de
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-06-01T21:10:12Z

    add a Java unit test for StringIndexer

commit 7fa18d18494f9d0d9fa991d02ca9441c51a5a20e
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-06-01T21:39:23Z

    add user guide for StringIndexer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6561#discussion_r31477790
  
    --- Diff: docs/ml-features.md ---
    @@ -456,6 +456,122 @@ for expanded in polyDF.select("polyFeatures").take(3):
     </div>
     </div>
     
    +## StringIndexer
    +
    +`StringIndexer` encodes a string column of labels to a column of label indices.
    +The indices are in `[0, numLabels)`, ordered by label frequencies.
    +So the most frequent label gets index `0`.
    +If the input column is numeric, we cast it to string and index the string values.
    +
    +**Examples**
    +
    +Assume that we have the following DataFrame with columns `id` and `category`:
    +
    +~~~~
    + id | category
    +----|----------
    + 0  | a
    + 1  | b
    + 2  | c
    + 3  | a
    + 4  | a
    + 5  | c
    +~~~~
    +
    +`category` is a string column with three labels: "a", "b", and "c".
    +Applying `StringIndexer` with `category` as the input column and `categoryIndex` as the output
    +column, we should get the following:
    +
    +~~~~
    + id | category | categoryIndex
    +----|----------|---------------
    + 0  | a        | 0.0
    + 1  | b        | 2.0
    + 2  | c        | 1.0
    + 3  | a        | 0.0
    + 4  | a        | 0.0
    + 5  | c        | 1.0
    +~~~~
    +
    +"a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
    +index `2`.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`StringIndexer`](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer) takes an input
    +column name and an output column name.
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.StringIndexer
    +
    +val df = sqlContext.createDataFrame(
    +  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
    +).toDF("id", "category")
    +val indexer = new StringIndexer()
    +  .setInputCol("category")
    +  .setOutputCol("categoryIndex")
    +val indexed = indexer.fit(df).transform(df)
    +indexed.show()
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`StringIndexer`](api/java/org/apache/spark/ml/feature/StringIndexer.html) takes an input column
    +name and an output column name.
    +
    +{% highlight java %}
    +import java.util.Arrays;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.ml.feature.StringIndexer;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +import static org.apache.spark.sql.types.DataTypes.*;
    +
    +JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
    +  RowFactory.create(0, "a"),
    +  RowFactory.create(1, "b"),
    +  RowFactory.create(2, "c"),
    +  RowFactory.create(3, "a"),
    +  RowFactory.create(4, "a"),
    +  RowFactory.create(5, "c")
    +));
    +StructType schema = new StructType(new StructField[] {
    +  createStructField("id", IntegerType, false),
    +  createStructField("category", StringType, false)
    +});
    +DataFrame df = sqlContext.createDataFrame(jrdd, schema);
    +StringIndexer indexer = new StringIndexer()
    +  .setInputCol("category")
    +  .setOutputCol("categoryIndex");
    +DataFrame indexed = indexer.fit(df).transform(df);
    +indexed.show();
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +[`StringIndexer`](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) takes an input
    +column name and an output column name.
    +
    +{% highlight python %}
    +from pyspark.ml.feature import StringIndexer
    +
    +df = sqlContext.createDataFrame(
    +    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")]
    --- End diff --
    
    Thanks for catching it! I tested `VectorAssembler` example code but not this one ... this is why everything needs a test.
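
For reference, the frequency-ordered indexing described in the diff above can be sketched in plain Python without Spark. This is only an illustrative sketch under stated assumptions, not Spark's implementation: the helper name `index_labels` is hypothetical, and the alphabetical tie-break between equally frequent labels is an assumption that the guide text does not specify.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first.

    Hypothetical helper for illustration; values are cast to strings first,
    mirroring the guide's note about numeric input columns.
    """
    counts = Counter(str(label) for label in labels)
    # Sort by descending frequency. Ties are broken alphabetically, which is
    # an assumption made here for determinism only.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Same "category" column as the example table in the diff.
categories = ["a", "b", "c", "a", "a", "c"]
mapping = index_labels(categories)
indexed = [mapping[c] for c in categories]
print(mapping)
print(indexed)
```

With the six rows from the example table, "a" (three occurrences) maps to 0.0, "c" (two) to 1.0, and "b" (one) to 2.0, which agrees with the `categoryIndex` column shown in the guide.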




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107770768
  
    test this please




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107721751
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107738026
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107748824
  
      [Test build #33928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33928/consoleFull) for   PR 6561 at commit [`4bba4f1`](https://github.com/apache/spark/commit/4bba4f15336e5f05fa42d9c8cbcea5550ac9e4e1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaStringIndexerSuite `





[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107723094
  
      [Test build #33923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33923/consoleFull) for   PR 6561 at commit [`ba1cd1b`](https://github.com/apache/spark/commit/ba1cd1b837f5a70340ab75a6a5acc9c3e18ec39e).




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107721777
  
    Merged build started.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107751021
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107738048
  
    Merged build started.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107771063
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107771160
  
      [Test build #33944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33944/consoleFull) for   PR 6561 at commit [`4bba4f1`](https://github.com/apache/spark/commit/4bba4f15336e5f05fa42d9c8cbcea5550ac9e4e1).




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107765446
  
      [Test build #33933 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33933/consoleFull) for   PR 6561 at commit [`4bba4f1`](https://github.com/apache/spark/commit/4bba4f15336e5f05fa42d9c8cbcea5550ac9e4e1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaStringIndexerSuite `





[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107748831
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107735478
  
    Nice description!  LGTM, except for the missing comma




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107810052
  
    Merged into master and branch-1.4.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107751120
  
      [Test build #33933 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33933/consoleFull) for   PR 6561 at commit [`4bba4f1`](https://github.com/apache/spark/commit/4bba4f15336e5f05fa42d9c8cbcea5550ac9e4e1).




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107771082
  
    Merged build started.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107722948
  
    Merged build started.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107765452
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107738477
  
      [Test build #33928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33928/consoleFull) for   PR 6561 at commit [`4bba4f1`](https://github.com/apache/spark/commit/4bba4f15336e5f05fa42d9c8cbcea5550ac9e4e1).




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107789843
  
      [Test build #33944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33944/consoleFull) for   PR 6561 at commit [`4bba4f1`](https://github.com/apache/spark/commit/4bba4f15336e5f05fa42d9c8cbcea5550ac9e4e1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107751034
  
    Merged build started.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107750770
  
    test this please




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107725178
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107747333
  
      [Test build #33923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33923/consoleFull) for   PR 6561 at commit [`ba1cd1b`](https://github.com/apache/spark/commit/ba1cd1b837f5a70340ab75a6a5acc9c3e18ec39e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaStringIndexerSuite `





[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107789867
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107722922
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/6561




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6561#issuecomment-107747342
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-7582] [MLLIB] user guide for StringInde...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6561#discussion_r31477341
  
    --- Diff: docs/ml-features.md ---
    @@ -456,6 +456,122 @@ for expanded in polyDF.select("polyFeatures").take(3):
     </div>
     </div>
     
    +## StringIndexer
    +
    +`StringIndexer` encodes a string column of labels to a column of label indices.
    +The indices are in `[0, numLabels)`, ordered by label frequencies.
    +So the most frequent label gets index `0`.
    +If the input column is numeric, we cast it to string and index the string values.
    +
    +**Examples**
    +
    +Assume that we have the following DataFrame with columns `id` and `category`:
    +
    +~~~~
    + id | category
    +----|----------
    + 0  | a
    + 1  | b
    + 2  | c
    + 3  | a
    + 4  | a
    + 5  | c
    +~~~~
    +
    +`category` is a string column with three labels: "a", "b", and "c".
    +Applying `StringIndexer` with `category` as the input column and `categoryIndex` as the output
    +column, we should get the following:
    +
    +~~~~
    + id | category | categoryIndex
    +----|----------|---------------
    + 0  | a        | 0.0
    + 1  | b        | 2.0
    + 2  | c        | 1.0
    + 3  | a        | 0.0
    + 4  | a        | 0.0
    + 5  | c        | 1.0
    +~~~~
    +
    +"a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
    +index `2`.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`StringIndexer`](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer) takes an input
    +column name and an output column name.
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.StringIndexer
    +
    +val df = sqlContext.createDataFrame(
    +  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
    +).toDF("id", "category")
    +val indexer = new StringIndexer()
    +  .setInputCol("category")
    +  .setOutputCol("categoryIndex")
    +val indexed = indexer.fit(df).transform(df)
    +indexed.show()
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`StringIndexer`](api/java/org/apache/spark/ml/feature/StringIndexer.html) takes an input column
    +name and an output column name.
    +
    +{% highlight java %}
    +import java.util.Arrays;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.ml.feature.StringIndexer;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +import static org.apache.spark.sql.types.DataTypes.*;
    +
    +JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
    +  RowFactory.create(0, "a"),
    +  RowFactory.create(1, "b"),
    +  RowFactory.create(2, "c"),
    +  RowFactory.create(3, "a"),
    +  RowFactory.create(4, "a"),
    +  RowFactory.create(5, "c")
    +));
    +StructType schema = new StructType(new StructField[] {
    +  createStructField("id", IntegerType, false),
    +  createStructField("category", StringType, false)
    +});
    +DataFrame df = sqlContext.createDataFrame(jrdd, schema);
    +StringIndexer indexer = new StringIndexer()
    +  .setInputCol("category")
    +  .setOutputCol("categoryIndex");
    +DataFrame indexed = indexer.fit(df).transform(df);
    +indexed.show();
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +[`StringIndexer`](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) takes an input
    +column name and an output column name.
    +
    +{% highlight python %}
    +from pyspark.ml.feature import StringIndexer
    +
    +df = sqlContext.createDataFrame(
    +    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")]
    --- End diff --
    
    Missing comma at the end of this line.
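
To illustrate why the comma matters, here is a minimal plain-Python sketch. It assumes that the next line of the diff (not shown in this excerpt) passes the column names as a second list argument; `rows`, `COLUMNS`, and the two helper functions are hypothetical names used only for this illustration, and no Spark installation is needed to run it.

```python
# Illustrative data mirroring the rows in the diff; all names are hypothetical.
rows = [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")]
COLUMNS = ["id", "category"]  # assumed second argument, not shown in the excerpt


def args_without_comma():
    # With no comma between the two bracketed expressions, Python reads the
    # second pair of brackets as a subscript on the first list. Indexing a
    # list with the tuple ("id", "category") raises TypeError.
    return (rows["id", "category"],)


def args_with_comma():
    # With the comma, the two lists are separate positional arguments.
    return (rows, COLUMNS)


try:
    args_without_comma()
    failed = False
except TypeError:
    failed = True

print("missing comma raises TypeError:", failed)
print("arguments passed with comma:", len(args_with_comma()))
```

If that assumption holds, ending the line with `,` lets `createDataFrame` receive the data and the column names as two separate arguments.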

