Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2015/09/14 08:07:06 UTC

[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/8740

    [SPARK-10077][DOCS] Add package info for java of ml/feature

    Should be the same as SPARK-7808 but use Java for the code example.
    It would be great to add package doc for `spark.ml.feature`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8740.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8740
    
----
commit ff9485664ab6c10f915d7a94392e6d2dc71593ed
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-09-14T05:50:31Z

    Add package info for java of ml/feature

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140858533
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-139972269
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140853490
  
      [Test build #42540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42540/consoleFull) for   PR 8740 at commit [`15cdcd2`](https://github.com/apache/spark/commit/15cdcd2409496c78e230cbf6961bdaeda9e9f06c).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39618779
  
    --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPackage.java ---
    @@ -0,0 +1,120 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature;
    --- End diff --
    
    86




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140655219
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42526/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140595613
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140176618
  
    @holdenk Please generate the JavaDoc and check the output. This is what I got:
    
    ![screen shot 2015-09-14 at 11 59 59 am](https://cloud.githubusercontent.com/assets/829644/9858898/3522c7b2-5ad8-11e5-83ef-34842c1520f4.png)





[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39433039
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <code>
    + *   import org.apache.spark.ml.feature.*
    + *   import org.apache.spark.ml.Pipeline
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    + *     RowFactory.create(0, "Hi I heard about Spark", 3.0),
    + *     RowFactory.create(1, "I wish Java could use case classes", 4.0),
    + *     RowFactory.create(2, "Logistic regression models are neat", 4.0)
    + *   )).toDF("id", "text", "rating")
    + *
    + *   // define feature transformers
    + *   RegexTokenizer tok = new RegexTokenizer()
    + *     .setInputCol("text")
    + *     .setOutputCol("words")
    --- End diff --
    
    `;`




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140296010
  
      [Test build #42472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42472/consoleFull) for   PR 8740 at commit [`0926c28`](https://github.com/apache/spark/commit/0926c28a38df70e89ddcfd9a9bf61c08eebb0c1c).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39530506
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,92 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *   import org.apache.spark.ml.PipelineModel;
    + *   import org.apache.spark.ml.PipelineStage;
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    + *     RowFactory.create(0, "Hi I heard about Spark", 3.0),
    + *     RowFactory.create(1, "I wish Java could use case classes", 4.0),
    + *     RowFactory.create(2, "Logistic regression models are neat", 4.0)
    + *   )).toDF("id", "text", "rating");
    + *
    + *   // define feature transformers
    + *   RegexTokenizer tok = new RegexTokenizer()
    + *     .setInputCol("text")
    + *     .setOutputCol("words");
    + *   StopWordsRemover sw = new StopWordsRemover()
    + *     .setInputCol("words")
    + *     .setOutputCol("filtered_words");
    + *   HashingTF tf = new HashingTF()
    + *     .setInputCol("filtered_words")
    + *     .setOutputCol("tf")
    + *     .setNumFeatures(10000);
    + *   IDF idf = new IDF()
    + *     .setInputCol("tf")
    + *     .setOutputCol("tf_idf");
    + *   VectorAssembler assembler = new VectorAssembler()
    + *     .setInputCols(new String[] {"tf_idf", "rating"})
    + *     .setOutputCol("features");
    + *
    + *   // assemble and fit the feature transformation pipeline
    + *   Pipeline pipeline = new Pipeline()
    + *     .setStages(new PipelineStage[] {tok, sw, tf, idf, assembler});
    + *   PipelineModel model = pipeline.fit(df);
    + *
    + *   // save transformed features with raw data
    + *   model.transform(df)
    + *     .select("id", "text", "rating", "features")
    + *     .write().format("parquet").save("/output/path");
    + * </code>
    + * </pre>
    + *
    + * Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn.
    + * The major difference is that most scikit-learn feature transformers operate eagerly on the entire
    + * input dataset, while MLlib's feature transformers operate lazily on individual columns,
    + * which is more efficient and flexible to handle large and complex datasets.
    + *
    + * @see <a href="http://scikit-learn.org/stable/modules/preprocessing.html">scikit-learn.preprocessing</a>
    --- End diff --
    
    `target="_blank"` and wrap this line




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140606717
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42520/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-139972862
  
      [Test build #42407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42407/consoleFull) for   PR 8740 at commit [`ff94856`](https://github.com/apache/spark/commit/ff9485664ab6c10f915d7a94392e6d2dc71593ed).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39596991
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *   import java.util.List;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   // Import factory methods provided by DataTypes.
    + *   import org.apache.spark.sql.types.DataTypes;
    --- End diff --
    
    `import static ...DataTypes.*;`
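
The comment above appears to suggest a static import, so that the `DataTypes` factory methods can be called without the class-name prefix. As a minimal JDK-only sketch of that language feature (`java.lang.Math` is a stand-in for `DataTypes` here, and `StaticImportSketch` and `hypotenuse` are hypothetical names that do not appear in the PR):

```java
// A static import brings a class's static members into scope so they can be
// used without the class-name prefix. java.lang.Math stands in for DataTypes.
import static java.lang.Math.sqrt;

class StaticImportSketch {
    // Without the static import above, this would have to read Math.sqrt(...).
    static double hypotenuse(double a, double b) {
        return sqrt(a * a + b * b);
    }

    public static void main(String[] args) {
        System.out.println(hypotenuse(3.0, 4.0));
    }
}
```

Applied to the snippet under review, the same mechanism would presumably let `DataTypes.createStructField(...)` be written as `createStructField(...)`, at the cost of making it less obvious where the method comes from.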




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140560938
  
    Sure, I'll try to compile the Java test code tonight.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39597002
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *   import java.util.List;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   // Import factory methods provided by DataTypes.
    + *   import org.apache.spark.sql.types.DataTypes;
    + *   // Import StructType and StructField
    + *   import org.apache.spark.sql.types.StructType;
    + *   import org.apache.spark.sql.types.StructField;
    + *   import org.apache.spark.sql.DataFrame;
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.sql.Row;
    + *
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *   import org.apache.spark.ml.PipelineStage;
    + *   import org.apache.spark.ml.PipelineModel;
    + *
    + *  // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *  List<StructField> fields = Arrays.asList(
    + *      DataTypes.createStructField("id", DataTypes.IntegerType, false),
    + *      DataTypes.createStructField("text", DataTypes.StringType, false),
    + *      DataTypes.createStructField("rating", DataTypes.DoubleType, false));
    + *  StructType schema = DataTypes.createStructType(fields);
    + *  JavaRDD<Row> rowRDD = jsc.parallelize(
    --- End diff --
    
    Not in this PR, but we could add `createDataFrame(List, StructType)` to `SQLContext`: https://issues.apache.org/jira/browse/SPARK-10630




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140588289
  
      [Test build #42516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42516/consoleFull) for   PR 8740 at commit [`84f0ad6`](https://github.com/apache/spark/commit/84f0ad633868869aeaaaee58cf3a63aad221e2aa).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140448103
  
    LGTM except for minor inline comments. It would be nice if you could compile and test the Java code.




[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-139979808
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140858536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42539/




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39596988
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *   import java.util.List;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   // Import factory methods provided by DataTypes.
    --- End diff --
    
    the comment is not necessary




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140646082
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140595598
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140587947
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140840264
  
      [Test build #42539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42539/consoleFull) for PR 8740 at commit [`512ca8a`](https://github.com/apache/spark/commit/512ca8af85687cc1cd3aad5990ace8e7c169094f).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140606065
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140596833
  
      [Test build #42520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42520/consoleFull) for PR 8740 at commit [`f844d55`](https://github.com/apache/spark/commit/f844d55dfc307d9b7ec9a6a7a064928f252827c8).




[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39433034
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <code>
    --- End diff --
    
    I think you need `<pre>` with `@code`.
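
    As a rough illustration of the `<pre>` plus `@code` suggestion above, here is a minimal sketch of one common Javadoc idiom. The class `JavadocPreCodeSketch` and its `join` method are hypothetical and are not part of this PR; they exist only to carry the comment. `<pre>` preserves line breaks, while `{@code ...}` keeps characters such as `<` and `>` from being interpreted as HTML.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class used only to demonstrate Javadoc formatting; not from the PR.
final class JavadocPreCodeSketch {

  private JavadocPreCodeSketch() {}

  /**
   * Joins words with a single space.
   *
   * <p>Example usage:
   * <pre>{@code
   * List<String> words = Arrays.asList("spark", "ml");
   * String joined = JavadocPreCodeSketch.join(words);
   * }</pre>
   *
   * @param words the words to join
   * @return the words separated by single spaces
   */
  static String join(List<String> words) {
    return String.join(" ", words);
  }
}
```

    Without `{@code}`, the generic type `List<String>` in the example would need manual escaping (`&lt;` and `&gt;`) to render correctly in the generated HTML.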




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140592701
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39597009
  
    --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPackage.java ---
    @@ -0,0 +1,120 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature;
    --- End diff --
    
    Thanks for testing the code! I think it might be overkill to include a unit test for the package doc. The problem would be keeping the content in sync in the future. https://issues.apache.org/jira/browse/SPARK-10383 is for this.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140594442
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-139979745
  
      [Test build #42407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42407/console) for PR 8740 at commit [`ff94856`](https://github.com/apache/spark/commit/ff9485664ab6c10f915d7a94392e6d2dc71593ed).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39596994
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *   import java.util.List;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   // Import factory methods provided by DataTypes.
    + *   import org.apache.spark.sql.types.DataTypes;
    + *   // Import StructType and StructField
    --- End diff --
    
    ditto




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39618817
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   import static org.apache.spark.sql.types.DataTypes.*;
    + *   import org.apache.spark.sql.types.StructType;
    + *   import org.apache.spark.sql.types.StructField;
    + *   import org.apache.spark.sql.DataFrame;
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.sql.Row;
    + *
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *   import org.apache.spark.ml.PipelineStage;
    + *   import org.apache.spark.ml.PipelineModel;
    + *
    + *  // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *  StructType schema = createStructType(
    + *      Arrays.asList(
    --- End diff --
    
    fix indentation (2-space)




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140587966
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39596999
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transform one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.ml.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *   import java.util.List;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   // Import factory methods provided by DataTypes.
    + *   import org.apache.spark.sql.types.DataTypes;
    + *   // Import StructType and StructField
    + *   import org.apache.spark.sql.types.StructType;
    + *   import org.apache.spark.sql.types.StructField;
    + *   import org.apache.spark.sql.DataFrame;
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.sql.Row;
    + *
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *   import org.apache.spark.ml.PipelineStage;
    + *   import org.apache.spark.ml.PipelineModel;
    + *
    + *  // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *  List<StructField> fields = Arrays.asList(
    --- End diff --
    
    We can avoid importing `List` by constructing `schema` directly. With `import static ...DataTypes.*;`, the code could be simpler:
    
    ~~~java
    StructType schema = createStructType(Arrays.asList(
      createStructField("id", IntegerType, false),
      ...
    ));
    ~~~





[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140850268
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140592702
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42516/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39618846
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   import static org.apache.spark.sql.types.DataTypes.*;
    + *   import org.apache.spark.sql.types.StructType;
    + *   import org.apache.spark.sql.types.StructField;
    + *   import org.apache.spark.sql.DataFrame;
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.sql.Row;
    + *
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *   import org.apache.spark.ml.PipelineStage;
    + *   import org.apache.spark.ml.PipelineModel;
    + *
    + *  // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *  StructType schema = createStructType(
    + *      Arrays.asList(
    + *        createStructField("id", IntegerType, false),
    + *        createStructField("text", StringType, false),
    + *        createStructField("rating", DoubleType, false)));
    + *  JavaRDD<Row> rowRDD = jsc.parallelize(
    + *      Arrays.asList(
    --- End diff --
    
    Fix the indentation here (use 2 spaces).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140606637
  
      [Test build #42520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42520/console) for   PR 8740 at commit [`f844d55`](https://github.com/apache/spark/commit/f844d55dfc307d9b7ec9a6a7a064928f252827c8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class TaskCommitDenied(`
      * `  final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")`
      * `abstract class LocalNode(conf: SQLConf) extends QueryPlan[LocalNode] with Logging `





[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140838393
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140868253
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42540/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140838321
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39530502
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,92 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    + *     RowFactory.create(0, "Hi I heard about Spark", 3.0),
    + *     RowFactory.create(1, "I wish Java could use case classes", 4.0),
    + *     RowFactory.create(2, "Logistic regression models are neat", 4.0)
    + *   )).toDF("id", "text", "rating");
    + *
    + *   // define feature transformers
    + *   RegexTokenizer tok = new RegexTokenizer()
    + *     .setInputCol("text")
    + *     .setOutputCol("words");
    + *   StopWordsRemover sw = new StopWordsRemover()
    + *     .setInputCol("words")
    + *     .setOutputCol("filtered_words");
    + *   HashingTF tf = new HashingTF()
    + *     .setInputCol("filtered_words")
    + *     .setOutputCol("tf")
    + *     .setNumFeatures(10000);
    + *   IDF idf = new IDF()
    + *     .setInputCol("tf")
    + *     .setOutputCol("tf_idf");
    + *   VectorAssembler assembler = new VectorAssembler()
    + *     .setInputCols(new String[] {"tf_idf", "rating"})
    + *     .setOutputCol("features");
    + *
    + *   // assemble and fit the feature transformation pipeline
    + *   Pipeline pipeline = new Pipeline()
    + *     .setStages(new PipelineStage[] {tok, sw, tf, idf, assembler});
    + *   val model = pipeline.fit(df);
    --- End diff --
    
    `val` is Scala, not Java: change it to `PipelineModel`, and remember to import it.
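
The comment above turns on the fact that `fit` returns a fitted model, and Java has to name that type explicitly. The following is a minimal, Spark-free sketch of the estimator/model split that the quoted Javadoc describes. Every name in it (`ToyTransformer`, `ToyEstimator`, `ToyIdf`, `ToyIdfModel`, `termCounts`) is a hypothetical stand-in rather than Spark's API, and the smoothed IDF formula is just one common variant chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FitSketch {
  /** Hypothetical stand-in for a transformer: maps one dataset to another. */
  interface ToyTransformer {
    List<Map<String, Double>> transform(List<Map<String, Double>> docs);
  }

  /** Hypothetical stand-in for an estimator: fit() must see the data first. */
  interface ToyEstimator<M extends ToyTransformer> {
    M fit(List<Map<String, Double>> docs);
  }

  /** Fitted model holding the per-term weights computed by fit(). */
  static final class ToyIdfModel implements ToyTransformer {
    final Map<String, Double> idf;

    ToyIdfModel(Map<String, Double> idf) {
      this.idf = idf;
    }

    @Override
    public List<Map<String, Double>> transform(List<Map<String, Double>> docs) {
      List<Map<String, Double>> out = new ArrayList<>();
      for (Map<String, Double> tf : docs) {
        Map<String, Double> scaled = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
          // Terms never seen during fit() get weight 0 in this sketch.
          scaled.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        out.add(scaled);
      }
      return out;
    }
  }

  /** Needs an aggregate (document frequencies), so it is an estimator. */
  static final class ToyIdf implements ToyEstimator<ToyIdfModel> {
    @Override
    public ToyIdfModel fit(List<Map<String, Double>> docs) {
      Map<String, Integer> docFreq = new HashMap<>();
      for (Map<String, Double> tf : docs) {
        for (String term : tf.keySet()) {
          docFreq.merge(term, 1, Integer::sum);
        }
      }
      Map<String, Double> idf = new HashMap<>();
      for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
        // One smoothed IDF variant: log((n + 1) / (df + 1)).
        idf.put(e.getKey(), Math.log((docs.size() + 1.0) / (e.getValue() + 1.0)));
      }
      return new ToyIdfModel(idf);
    }
  }

  /** Builds a term-count map for one document. */
  static Map<String, Double> termCounts(String... terms) {
    Map<String, Double> tf = new HashMap<>();
    for (String term : terms) {
      tf.merge(term, 1.0, Double::sum);
    }
    return tf;
  }

  public static void main(String[] args) {
    List<Map<String, Double>> docs =
        Arrays.asList(termCounts("spark", "java"), termCounts("spark"));
    // The declared type is spelled out; a Scala-style `val` would not compile.
    ToyIdfModel model = new ToyIdf().fit(docs);
    System.out.println(model.transform(docs));
  }
}
```

In this sketch a term that occurs in every document ends up with weight zero, which is the kind of dataset-wide information a stateless transformer could not compute on its own.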




[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-139979812
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42407/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140295823
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140858371
  
      [Test build #42539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42539/console) for   PR 8740 at commit [`512ca8a`](https://github.com/apache/spark/commit/512ca8af85687cc1cd3aad5990ace8e7c169094f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140303373
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42472/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140295813
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140646066
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39433030
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <code>
    + *   import org.apache.spark.ml.feature.*
    + *   import org.apache.spark.ml.Pipeline
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    + *     RowFactor.create(0, "Hi I heard about Spark", 3.0),
    --- End diff --
    
    * `RowFactor` -> `RowFactory` (this is why we need SPARK-10382)
    * import `RowFactory`




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140868115
  
      [Test build #42540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42540/console) for   PR 8740 at commit [`15cdcd2`](https://github.com/apache/spark/commit/15cdcd2409496c78e230cbf6961bdaeda9e9f06c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140868252
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140594433
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39433031
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <code>
    + *   import org.apache.spark.ml.feature.*
    --- End diff --
    
    missing `;`




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-141137424
  
    LGTM. Merged into master. Thanks!




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140605943
  
      [Test build #42519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42519/console) for   PR 8740 at commit [`0e1a49e`](https://github.com/apache/spark/commit/0e1a49ec80ea3c4e75dbd5bf17eba996fa4ffadd).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class TaskCommitDenied(`
      * `  final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")`
      * `abstract class LocalNode(conf: SQLConf) extends QueryPlan[LocalNode] with Logging `





[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140303259
  
      [Test build #42472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42472/console) for PR 8740 at commit [`0926c28`](https://github.com/apache/spark/commit/0926c28a38df70e89ddcfd9a9bf61c08eebb0c1c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39618772
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import java.util.Arrays;
    + *
    + *   import org.apache.spark.api.java.JavaRDD;
    + *   import static org.apache.spark.sql.types.DataTypes.*;
    + *   import org.apache.spark.sql.types.StructType;
    + *   import org.apache.spark.sql.types.StructField;
    --- End diff --
    
    no longer needed




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39597206
  
    --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPackage.java ---
    @@ -0,0 +1,120 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature;
    --- End diff --
    
    Makes sense. For now, do you think we should keep the test, or 86 it until we get the syncing solution in place?




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140655218
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS] Add package info for java ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-139972262
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39433068
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <code>
    + *   import org.apache.spark.ml.feature.*
    + *   import org.apache.spark.ml.Pipeline
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    + *     RowFactor.create(0, "Hi I heard about Spark", 3.0),
    + *     RowFactor.create(1, "I wish Java could use case classes", 4.0),
    + *     RowFactor.create(2, "Logistic regression models are neat", 4.0)
    + *   )).toDF("id", "text", "rating")
    + *
    + *   // define feature transformers
    + *   RegexTokenizer tok = new RegexTokenizer()
    + *     .setInputCol("text")
    + *     .setOutputCol("words")
    + *   StopWordsRemover sw = new StopWordsRemover()
    + *     .setInputCol("words")
    + *     .setOutputCol("filtered_words")
    + *   HashingTF tf = new HashingTF()
    + *     .setInputCol("filtered_words")
    + *     .setOutputCol("tf")
    + *     .setNumFeatures(10000)
    + *   IDF idf = new IDF()
    + *     .setInputCol("tf")
    + *     .setOutputCol("tf_idf")
    + *   VectorAssembler assembler = new VectorAssembler()
    + *     .setInputCols(Array("tf_idf", "rating"))
    --- End diff --
    
    `Array` is not Java code
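
For context, `Array("tf_idf", "rating")` is Scala's array constructor. In Java, a setter that takes a `String[]` would normally be passed an array initializer such as `new String[] {"tf_idf", "rating"}`. The sketch below illustrates only that syntax and does not depend on Spark: `FakeAssembler` is a hypothetical stand-in for a builder-style class, not the real `VectorAssembler`.

```java
import java.util.Arrays;

// Hypothetical stand-in for a builder-style transformer; this is NOT Spark's VectorAssembler.
class FakeAssembler {
  private String[] inputCols = new String[0];

  // Takes a Java String[] and returns this, so calls can be chained.
  FakeAssembler setInputCols(String[] cols) {
    this.inputCols = Arrays.copyOf(cols, cols.length);
    return this;
  }

  // Returns a defensive copy of the configured column names.
  String[] getInputCols() {
    return Arrays.copyOf(inputCols, inputCols.length);
  }
}

class ArrayLiteralExample {
  public static void main(String[] args) {
    // Java array initializer; Scala's Array("tf_idf", "rating") would not compile here.
    FakeAssembler assembler = new FakeAssembler()
      .setInputCols(new String[] {"tf_idf", "rating"});
    System.out.println(Arrays.toString(assembler.getInputCols()));
  }
}
```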




[GitHub] spark pull request: [SPARK-10077][DOCS][STREAMING] Add package inf...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39433091
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <code>
    + *   import org.apache.spark.ml.feature.*
    + *   import org.apache.spark.ml.Pipeline
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    + *     RowFactor.create(0, "Hi I heard about Spark", 3.0),
    + *     RowFactor.create(1, "I wish Java could use case classes", 4.0),
    + *     RowFactor.create(2, "Logistic regression models are neat", 4.0)
    + *   )).toDF("id", "text", "rating")
    + *
    + *   // define feature transformers
    + *   RegexTokenizer tok = new RegexTokenizer()
    + *     .setInputCol("text")
    + *     .setOutputCol("words")
    + *   StopWordsRemover sw = new StopWordsRemover()
    + *     .setInputCol("words")
    + *     .setOutputCol("filtered_words")
    + *   HashingTF tf = new HashingTF()
    + *     .setInputCol("filtered_words")
    + *     .setOutputCol("tf")
    + *     .setNumFeatures(10000)
    + *   IDF idf = new IDF()
    + *     .setInputCol("tf")
    + *     .setOutputCol("tf_idf")
    + *   VectorAssembler assembler = new VectorAssembler()
    + *     .setInputCols(Array("tf_idf", "rating"))
    + *     .setOutputCol("features")
    + *
    + *   // assemble and fit the feature transformation pipeline
    + *   Pipeline pipeline = new Pipeline()
    + *     .setStages(new PipelineStage[] {tok, sw, tf, idf, assembler})
    + *   val model = pipeline.fit(df)
    + *
    + *   // save transformed features with raw data
    + *   model.transform(df)
    + *     .select("id", "text", "rating", "features")
    + *     .write.format("parquet").save("/output/path")
    + * </code>
    + *
    + * Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn.
    + * The major difference is that most scikit-learn feature transformers operate eagerly on the entire
    + * input dataset, while MLlib's feature transformers operate lazily on individual columns,
    + * which is more efficient and flexible to handle large and complex datasets.
    + *
    + * @see [[http://scikit-learn.org/stable/modules/preprocessing.html scikit-learn.preprocessing]]
    --- End diff --
    
    Use `<a>` in JavaDoc's `@see`.
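
For context, `[[url label]]` is Scaladoc wiki-link syntax, which Javadoc does not interpret. Javadoc's `@see` tag accepts an HTML anchor instead. A minimal sketch of that form follows, reusing the scikit-learn URL from the diff; the class `SeeTagExample` and its method are hypothetical and exist only to carry the comment.

```java
/**
 * Hypothetical placeholder class, used only to carry the Javadoc comment below.
 *
 * @see <a href="http://scikit-learn.org/stable/modules/preprocessing.html">
 *      scikit-learn.preprocessing</a>
 */
class SeeTagExample {
  // Returns the same URL that the @see tag above links to.
  static String referenceUrl() {
    return "http://scikit-learn.org/stable/modules/preprocessing.html";
  }

  public static void main(String[] args) {
    System.out.println(referenceUrl());
  }
}
```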




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140303368
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8740




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140592621
  
      [Test build #42516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42516/console) for PR 8740 at commit [`84f0ad6`](https://github.com/apache/spark/commit/84f0ad633868869aeaaaee58cf3a63aad221e2aa).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140647785
  
      [Test build #42526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42526/consoleFull) for PR 8740 at commit [`392baf0`](https://github.com/apache/spark/commit/392baf044d4f29bbdf2e40d76fd5e53baa8a862c).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140606716
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8740#discussion_r39530504
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java ---
    @@ -0,0 +1,92 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Feature transformers
    + *
    + * The `ml.feature` package provides common feature transformers that help convert raw data or
    + * features into more suitable forms for model fitting.
    + * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which
    + * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g.,
    + * {@link org.apache.spark.feature.HashingTF}.
    + * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the
    + * transformation requires some aggregated information of the dataset, e.g., document
    + * frequencies in {@link org.apache.spark.ml.feature.IDF}.
    + * For those feature transformers, calling {@link org.apache.spark.ml.Estimator#fit} is required to
    + * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply
    + * transformation.
    + * The transformation is usually done by appending new columns to the input
    + * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over.
    + *
    + * We try to make each transformer minimal, so it becomes flexible to assemble feature
    + * transformation pipelines.
    + * {@link org.apache.spark.ml.Pipeline} can be used to chain feature transformers, and
    + * {@link org.apache.spark.ml.feature.VectorAssembler} can be used to combine multiple feature
    + * transformations, for example:
    + *
    + * <pre>
    + * <code>
    + *   import org.apache.spark.sql.RowFactory;
    + *   import org.apache.spark.ml.feature.*;
    + *   import org.apache.spark.ml.Pipeline;
    + *
    + *   // a DataFrame with three columns: id (integer), text (string), and rating (double).
    + *   DataFrame df = sqlContext.createDataFrame(Arrays.asList(
    --- End diff --
    
    import `Arrays` and remember to organize imports
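
For context, `Arrays.asList` is only available once `java.util.Arrays` is imported. One common convention places `java.*` imports in their own group ahead of third-party imports. The sketch below is Spark-free and shows just the import and the `Arrays.asList` call shape; `ImportArraysExample` and its plain-list rows are hypothetical stand-ins for the `RowFactory.create(...)` rows in the diff.

```java
// java.* imports form the first group, sorted alphabetically.
import java.util.Arrays;
import java.util.List;

class ImportArraysExample {
  // Builds three rows of (id, text, rating) as plain lists instead of Spark Rows.
  static List<List<Object>> rows() {
    return Arrays.asList(
      Arrays.<Object>asList(0, "Hi I heard about Spark", 3.0),
      Arrays.<Object>asList(1, "I wish Java could use case classes", 4.0),
      Arrays.<Object>asList(2, "Logistic regression models are neat", 4.0)
    );
  }

  public static void main(String[] args) {
    System.out.println(rows().size());
  }
}
```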




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140606066
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42519/




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140655104
  
      [Test build #42526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42526/console) for PR 8740 at commit [`392baf0`](https://github.com/apache/spark/commit/392baf044d4f29bbdf2e40d76fd5e53baa8a862c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140595362
  
      [Test build #42519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42519/consoleFull) for PR 8740 at commit [`0e1a49e`](https://github.com/apache/spark/commit/0e1a49ec80ea3c4e75dbd5bf17eba996fa4ffadd).




[GitHub] spark pull request: [SPARK-10077][DOCS][ML] Add package info for j...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8740#issuecomment-140850115
  
     Merged build triggered.

