You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2018/01/31 01:45:28 UTC

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/20446

    [SPARK-23254][ML] Add user guide entry for DataFrame multivariate summary

    ## What changes were proposed in this pull request?
    
    Add user guide and scala/java examples for `ml.stat.Summarizer`
    
    ## How was this patch tested?
    
    Doc generated snapshot:
    
    ![image](https://user-images.githubusercontent.com/19235986/35600897-2bb9c102-05e5-11e8-849f-e327f5555125.png)
    ![image](https://user-images.githubusercontent.com/19235986/35600910-3b022f28-05e5-11e8-823e-ae61009317a0.png)
    ![image](https://user-images.githubusercontent.com/19235986/35600918-43c24f3a-05e5-11e8-847d-446452838e05.png)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark summ_guide

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20446.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20446
    
----
commit 307f75f4990049f78978364af4541cd20e4d5bd7
Author: WeichenXu <we...@...>
Date:   2018-01-31T01:41:48Z

    init pr

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #89570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89570/testReport)** for PR 20446 at commit [`ee9d368`](https://github.com/apache/spark/commit/ee9d3686f3e48650668bf26a7003b0bde912b6a0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2487/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165568368
  
    --- Diff: docs/ml-statistics.md ---
    @@ -89,4 +89,26 @@ Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat
     {% include_example python/ml/chi_square_test_example.py %}
     </div>
     
    +</div>
    +
    +## Summarizer
    +
    +We provide vector column summary statistics for `Dataframe` through `Summarizer`.
    +Available metrics are the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +The following example demonstrates using [`Summarizer`](api/scala/index.html#org.apache.spark.ml.stat.Summarizer$)
    +to compute the mean and variance for the input dataframe, with and without a weight column.
    --- End diff --
    
    sorry, one more comment here
    
    I think perhaps "... to compute the mean and variance for a vector column of the input dataframe ..." 
    
    (and same below)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2486/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165568014
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaSummarizerExample.java ---
    @@ -0,0 +1,71 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.*;
    +
    +// $example on$
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import org.apache.spark.ml.linalg.Vector;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.stat.Summarizer;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.Metadata;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +// $example off$
    +
    +public class JavaSummarizerExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaSummarizerExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    List<Row> data = Arrays.asList(
    +      RowFactory.create(Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      RowFactory.create(Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    );
    +
    +    StructType schema = new StructType(new StructField[]{
    +      new StructField("features", new VectorUDT(), false, Metadata.empty()),
    +      new StructField("weight", DataTypes.DoubleType, false, Metadata.empty())
    +    });
    +
    +    Dataset<Row> df = spark.createDataFrame(data, schema);
    +
    +    Row result1 = df.select(Summarizer.metrics("mean", "variance")
    +        .summary(new Column("features"), new Column("weight")))
    +        .first().getStruct(0);
    +    System.out.println("with weight: mean = " + result1.<Vector>getAs(0).toString() +
    --- End diff --
    
    ok fair enough


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/507/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86975/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #92539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92539/testReport)** for PR 20446 at commit [`ee9d368`](https://github.com/apache/spark/commit/ee9d3686f3e48650668bf26a7003b0bde912b6a0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165360692
  
    --- Diff: docs/ml-statistics.md ---
    @@ -89,4 +89,26 @@ Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat
     {% include_example python/ml/chi_square_test_example.py %}
     </div>
     
    +</div>
    +
    +## Summarizer
    +
    +We provide vector column summary statistics for `Dataframe` through `Summarizer`.
    +Available metrics contain the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
    --- End diff --
    
    Perhaps "contain" -> "are" or "include"?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    @srowen The reason I do not use `.show` I have already reply here https://github.com/apache/spark/pull/20446#discussion_r165565121
    thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86856/testReport)** for PR 20446 at commit [`307f75f`](https://github.com/apache/spark/commit/307f75f4990049f78978364af4541cd20e4d5bd7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaSummarizerExample `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86980/testReport)** for PR 20446 at commit [`f02172f`](https://github.com/apache/spark/commit/f02172f4637a8332c93fec1c0b44e2bb3a65f5a5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165362148
  
    --- Diff: docs/ml-statistics.md ---
    @@ -89,4 +89,26 @@ Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat
     {% include_example python/ml/chi_square_test_example.py %}
     </div>
     
    +</div>
    +
    +## Summarizer
    +
    +We provide vector column summary statistics for `Dataframe` through `Summarizer`.
    +Available metrics contain the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Summarizer`](api/scala/index.html#org.apache.spark.ml.stat.Summarizer$)
    --- End diff --
    
    Perhaps "The following example demonstrates using `Summarizer`(...) to compute the mean and variance for the input dataframe, with and without a weight column"?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/520/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165362533
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.stat.Summarizer
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object SummarizerExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("SummarizerExample")
    +      .getOrCreate()
    +
    +    import spark.implicits._
    +    import Summarizer._
    +
    +    // $example on$
    +    val data = Seq(
    +      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    )
    +
    +    val df = data.toDF("features", "weight")
    +
    +    val Tuple1((meanVal, varianceVal)) = df.select(metrics("mean", "variance")
    +      .summary($"features", $"weight"))
    +      .as[Tuple1[(Vector, Vector)]].first()
    +
    +    println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")
    --- End diff --
    
    Same applies here, why not just `df.select(...).show()`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86975/testReport)** for PR 20446 at commit [`fc9622b`](https://github.com/apache/spark/commit/fc9622bab185167da30ec47ff09e5a7641419865).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/412/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165362568
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.stat.Summarizer
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object SummarizerExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("SummarizerExample")
    +      .getOrCreate()
    +
    +    import spark.implicits._
    +    import Summarizer._
    +
    +    // $example on$
    +    val data = Seq(
    +      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    )
    +
    +    val df = data.toDF("features", "weight")
    +
    +    val Tuple1((meanVal, varianceVal)) = df.select(metrics("mean", "variance")
    +      .summary($"features", $"weight"))
    +      .as[Tuple1[(Vector, Vector)]].first()
    +
    +    println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")
    +
    +    val (meanVal2, varianceVal2) = df.select(mean($"features"), variance($"features"))
    +      .as[(Vector, Vector)].first()
    +
    +    println(s"without weight: mean = ${meanVal2}, sum = ${varianceVal2}")
    --- End diff --
    
    Same applies here, why not just `df.select(...).show()`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165573866
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.stat.Summarizer
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object SummarizerExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("SummarizerExample")
    +      .getOrCreate()
    +
    +    import spark.implicits._
    +    import Summarizer._
    +
    +    // $example on$
    +    val data = Seq(
    +      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    )
    +
    +    val df = data.toDF("features", "weight")
    +
    +    val Tuple1((meanVal, varianceVal)) = df.select(metrics("mean", "variance")
    +      .summary($"features", $"weight"))
    +      .as[Tuple1[(Vector, Vector)]].first()
    --- End diff --
    
    Do you mean us `.as[((Vector, Vector))]` ? It compile fails..
    or Do you mean change to
    ```
    val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
          .summary($"features", $"weight"))
          .as[(Vector, Vector)].first()
    ```
    ? Seems also do not work because it is a "struct type" value in the returned row. So the first row format should match Row(Row(mean, variance))



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86968/testReport)** for PR 20446 at commit [`2592bb9`](https://github.com/apache/spark/commit/2592bb9a2c38e544d87bafa9561560d6c32778f1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/515/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #89570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89570/testReport)** for PR 20446 at commit [`ee9d368`](https://github.com/apache/spark/commit/ee9d3686f3e48650668bf26a7003b0bde912b6a0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86856/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165578020
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.stat.Summarizer
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object SummarizerExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("SummarizerExample")
    +      .getOrCreate()
    +
    +    import spark.implicits._
    +    import Summarizer._
    +
    +    // $example on$
    +    val data = Seq(
    +      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    )
    +
    +    val df = data.toDF("features", "weight")
    +
    +    val Tuple1((meanVal, varianceVal)) = df.select(metrics("mean", "variance")
    +      .summary($"features", $"weight"))
    +      .as[Tuple1[(Vector, Vector)]].first()
    --- End diff --
    
    Good idea. This way make code easier to read.
    Done.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165362440
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaSummarizerExample.java ---
    @@ -0,0 +1,71 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.*;
    +
    +// $example on$
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import org.apache.spark.ml.linalg.Vector;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.stat.Summarizer;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.Metadata;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +// $example off$
    +
    +public class JavaSummarizerExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaSummarizerExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    List<Row> data = Arrays.asList(
    +      RowFactory.create(Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      RowFactory.create(Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    );
    +
    +    StructType schema = new StructType(new StructField[]{
    +      new StructField("features", new VectorUDT(), false, Metadata.empty()),
    +      new StructField("weight", DataTypes.DoubleType, false, Metadata.empty())
    +    });
    +
    +    Dataset<Row> df = spark.createDataFrame(data, schema);
    +
    +    Row result1 = df.select(Summarizer.metrics("mean", "variance")
    +        .summary(new Column("features"), new Column("weight")))
    +        .first().getStruct(0);
    +    System.out.println("with weight: mean = " + result1.<Vector>getAs(0).toString() +
    +      ", variance = " + result1.<Vector>getAs(1).toString());
    +
    +    Row result2 = df.select(
    +      Summarizer.mean(new Column("features")),
    +      Summarizer.variance(new Column("features"))
    +    ).first();
    +    System.out.println("without weight: mean = " + result2.<Vector>getAs(0).toString() +
    --- End diff --
    
    Why not just `df.select(...).show()`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86856/testReport)** for PR 20446 at commit [`307f75f`](https://github.com/apache/spark/commit/307f75f4990049f78978364af4541cd20e4d5bd7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #92539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92539/testReport)** for PR 20446 at commit [`ee9d368`](https://github.com/apache/spark/commit/ee9d3686f3e48650668bf26a7003b0bde912b6a0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86980/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    @MLnick @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86968/testReport)** for PR 20446 at commit [`2592bb9`](https://github.com/apache/spark/commit/2592bb9a2c38e544d87bafa9561560d6c32778f1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    @WeichenXu123 looks like there was one more outstanding comment, about using `.show()`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/508/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165575680
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.stat.Summarizer
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object SummarizerExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("SummarizerExample")
    +      .getOrCreate()
    +
    +    import spark.implicits._
    +    import Summarizer._
    +
    +    // $example on$
    +    val data = Seq(
    +      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    )
    +
    +    val df = data.toDF("features", "weight")
    +
    +    val Tuple1((meanVal, varianceVal)) = df.select(metrics("mean", "variance")
    +      .summary($"features", $"weight"))
    +      .as[Tuple1[(Vector, Vector)]].first()
    --- End diff --
    
    oh ok - perhaps `select("summary.mean", "summary.variance")` would work to extract into two columns?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92539/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89570/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry and exampl...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20446


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165567614
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.stat.Summarizer
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object SummarizerExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("SummarizerExample")
    +      .getOrCreate()
    +
    +    import spark.implicits._
    +    import Summarizer._
    +
    +    // $example on$
    +    val data = Seq(
    +      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    )
    +
    +    val df = data.toDF("features", "weight")
    +
    +    val Tuple1((meanVal, varianceVal)) = df.select(metrics("mean", "variance")
    +      .summary($"features", $"weight"))
    +      .as[Tuple1[(Vector, Vector)]].first()
    --- End diff --
    
    nit, but `Tuple1` not required here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86968/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165362364
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaSummarizerExample.java ---
    @@ -0,0 +1,71 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.*;
    +
    +// $example on$
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import org.apache.spark.ml.linalg.Vector;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.stat.Summarizer;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.Metadata;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +// $example off$
    +
    +public class JavaSummarizerExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaSummarizerExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    List<Row> data = Arrays.asList(
    +      RowFactory.create(Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      RowFactory.create(Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    );
    +
    +    StructType schema = new StructType(new StructField[]{
    +      new StructField("features", new VectorUDT(), false, Metadata.empty()),
    +      new StructField("weight", DataTypes.DoubleType, false, Metadata.empty())
    +    });
    +
    +    Dataset<Row> df = spark.createDataFrame(data, schema);
    +
    +    Row result1 = df.select(Summarizer.metrics("mean", "variance")
    +        .summary(new Column("features"), new Column("weight")))
    +        .first().getStruct(0);
    +    System.out.println("with weight: mean = " + result1.<Vector>getAs(0).toString() +
    --- End diff --
    
    Why not just `df.select(...).show()`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20446: [SPARK-23254][ML] Add user guide entry for DataFr...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20446#discussion_r165565121
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaSummarizerExample.java ---
    @@ -0,0 +1,71 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.*;
    +
    +// $example on$
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import org.apache.spark.ml.linalg.Vector;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.stat.Summarizer;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.Metadata;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +// $example off$
    +
    +public class JavaSummarizerExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaSummarizerExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    List<Row> data = Arrays.asList(
    +      RowFactory.create(Vectors.dense(2.0, 3.0, 5.0), 1.0),
    +      RowFactory.create(Vectors.dense(4.0, 6.0, 7.0), 2.0)
    +    );
    +
    +    StructType schema = new StructType(new StructField[]{
    +      new StructField("features", new VectorUDT(), false, Metadata.empty()),
    +      new StructField("weight", DataTypes.DoubleType, false, Metadata.empty())
    +    });
    +
    +    Dataset<Row> df = spark.createDataFrame(data, schema);
    +
    +    Row result1 = df.select(Summarizer.metrics("mean", "variance")
    +        .summary(new Column("features"), new Column("weight")))
    +        .first().getStruct(0);
    +    System.out.println("with weight: mean = " + result1.<Vector>getAs(0).toString() +
    --- End diff --
    
    Because spark user will usually want to get the summary result (multiple vectors), I want to show the simple way to extract these results from the returned dataframe which contains only one row. I think some user is possible to get stuck here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89569/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #89569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89569/testReport)** for PR 20446 at commit [`f9eb02a`](https://github.com/apache/spark/commit/f9eb02a1a82d411cdc5ddba562ab982db4b583df).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #89569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89569/testReport)** for PR 20446 at commit [`f9eb02a`](https://github.com/apache/spark/commit/f9eb02a1a82d411cdc5ddba562ab982db4b583df).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    @MLnick @MrBago Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86975/testReport)** for PR 20446 at commit [`fc9622b`](https://github.com/apache/spark/commit/fc9622bab185167da30ec47ff09e5a7641419865).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry for DataFrame mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20446
  
    **[Test build #86980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86980/testReport)** for PR 20446 at commit [`f02172f`](https://github.com/apache/spark/commit/f02172f4637a8332c93fec1c0b44e2bb3a65f5a5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org