Posted to reviews@spark.apache.org by MrBago <gi...@git.apache.org> on 2018/01/17 01:58:04 UTC

[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

GitHub user MrBago opened a pull request:

    https://github.com/apache/spark/pull/20285

    [SPARK-22735][ML][DOC] Added VectorSizeHint docs and examples.

    ## What changes were proposed in this pull request?
    
    Added documentation for new transformer.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark sizeHintDocs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20285
    
----
commit 85d0db07641d8d39a87129995367efad44dba56f
Author: Bago Amirbekian <ba...@...>
Date:   2018-01-17T00:24:01Z

    "Added VectorSizeHint docs and examples."

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86302 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86302/testReport)** for PR 20285 at commit [`0cdfc1b`](https://github.com/apache/spark/commit/0cdfc1b0c49b5a8278723cd7b48eca1d2c45f944).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    LGTM. Merged into master and branch-2.3. Thanks!


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86541/testReport)** for PR 20285 at commit [`4de3f81`](https://github.com/apache/spark/commit/4de3f8153747b3c7dc1fe36abf0d1391ffd2b80c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163336725
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
    @@ -0,0 +1,78 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.SparkSession;
    +
    +// $example on$
    +import java.util.Arrays;
    +
    +import org.apache.spark.ml.feature.VectorAssembler;
    +import org.apache.spark.ml.feature.VectorSizeHint;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.sql.Dataset;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +import static org.apache.spark.sql.types.DataTypes.*;
    +// $example off$
    +
    +public class JavaVectorSizeHintExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaVectorSizeHintExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    StructType schema = createStructType(new StructField[]{
    +      createStructField("id", IntegerType, false),
    +      createStructField("hour", IntegerType, false),
    +      createStructField("mobile", DoubleType, false),
    +      createStructField("userFeatures", new VectorUDT(), false),
    +      createStructField("clicked", DoubleType, false)
    +    });
    +    Row row = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
    --- End diff --
    
    Hi, @MrBago . It seems that we need to add one more row here.
    ```java
    RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0);
    ```


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86221/testReport)** for PR 20285 at commit [`85d0db0`](https://github.com/apache/spark/commit/85d0db07641d8d39a87129995367efad44dba56f).


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162144090
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,48 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors an a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    +transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
    +the vector size. Downstream operations on the resulting dataframe can get this size using the
    +meatadata.
    +
    +`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
    +behaviour when the vector column contains nulls for vectors of the wrong size. By default
    --- End diff --
    
    typo: 'nulls for vectors..' -> 'nulls or vectors'


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162519014
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    --- End diff --
    
    I don't know if the Spark style guide covers this, but I believe "a user" is generally the preferred form, https://english.stackexchange.com/a/105117.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86366/testReport)** for PR 20285 at commit [`6228902`](https://github.com/apache/spark/commit/622890287aecf515da147b039c253657161e89f3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163341372
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.SparkSession;
    +
    +// $example on$
    +import java.util.Arrays;
    +
    +import org.apache.spark.ml.feature.VectorAssembler;
    +import org.apache.spark.ml.feature.VectorSizeHint;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.sql.Dataset;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +import static org.apache.spark.sql.types.DataTypes.*;
    +// $example off$
    +
    +public class JavaVectorSizeHintExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaVectorSizeHintExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    StructType schema = createStructType(new StructField[]{
    +      createStructField("id", IntegerType, false),
    +      createStructField("hour", IntegerType, false),
    +      createStructField("mobile", DoubleType, false),
    +      createStructField("userFeatures", new VectorUDT(), false),
    +      createStructField("clicked", DoubleType, false)
    +    });
    +    Row row0 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
    +    Row row1 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
    --- End diff --
    
    Sorry, yes should be fixed now.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86543/
    Test PASSed.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162144930
  
    --- Diff: examples/src/main/python/ml/vector_size_hint_example.py ---
    @@ -0,0 +1,57 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from __future__ import print_function
    +
    +# $example on$
    +from pyspark.ml.linalg import Vectors
    +from pyspark.ml.feature import (VectorSizeHint, VectorAssembler)
    +# $example off$
    +from pyspark.sql import SparkSession
    +
    +if __name__ == "__main__":
    +    spark = SparkSession\
    +        .builder\
    +        .appName("VectorAssemblerExample")\
    --- End diff --
    
    should be "VectorSizeHintExample" - same with other apis


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162224903
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    --- End diff --
    
    nit: a user -> an user


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86302/
    Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86541/
    Test PASSed.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163389908
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors for a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    +transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
    +the vector size. Downstream operations on the resulting dataframe can get this size using the
    +meatadata.
    +
    +`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
    +behaviour when the vector column contains nulls or vectors of the wrong size. By default
    +`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
    +also be set to "skip", indicating that rows containing invalid values should be filtered out from
    +the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
    +`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
    --- End diff --
    
    I've updated it, let me know if you think we can still make it more clear.
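
The `handleInvalid` behaviour described in the quoted documentation can be sketched in plain Python. This is only an illustrative model under stated assumptions, not Spark code: `apply_size_hint` and its arguments are hypothetical names, vectors are plain lists, and `None` stands in for a null vector.

```python
from typing import List, Optional, Sequence

# A "vector" here is a plain sequence of floats; None models a null entry.
Vector = Optional[Sequence[float]]


def apply_size_hint(rows: List[Vector], size: int,
                    handle_invalid: str = "error") -> List[Vector]:
    """Hypothetical model of the three handleInvalid modes from the docs."""
    if handle_invalid not in ("error", "skip", "optimistic"):
        raise ValueError("unknown handle_invalid mode: %r" % handle_invalid)

    # "optimistic": keep every row and trust the caller about the size.
    if handle_invalid == "optimistic":
        return list(rows)

    kept = []
    for vec in rows:
        if vec is not None and len(vec) == size:
            kept.append(vec)
        elif handle_invalid == "error":
            # "error" (the default): fail on the first null or wrong-size vector.
            raise ValueError("expected a vector of size %d, got %r" % (size, vec))
        # "skip": silently drop the invalid row.
    return kept


rows = [[0.0, 10.0, 0.5], [0.0, 10.0], None]
print(apply_size_hint(rows, 3, "skip"))  # prints [[0.0, 10.0, 0.5]]
```

With `"skip"` only the three-element vector survives, which is why the examples in this pull request include one row of the wrong size. With `"optimistic"` all three rows would be returned unchanged, and with `"error"` the call would raise on the second row.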


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Thanks for the review @BryanCutler, I've added a Java example & uploaded 2 screenshots.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162223639
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.ml.feature.VectorAssembler;
    +import org.apache.spark.ml.feature.VectorSizeHint;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.sql.Dataset;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.SparkSession;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +
    +import java.util.Arrays;
    +
    +import static org.apache.spark.sql.types.DataTypes.*;
    +
    +// $example on$
    +// $example off$
    --- End diff --
    
    Do we need the above two lines?


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86221/
    Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    <img width="1144" alt="screen shot 2018-01-17 at 4 19 40 pm" src="https://user-images.githubusercontent.com/223219/35074406-90b074ea-fba2-11e7-853b-45e3c447fe68.png">



---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    I'd like to try and get this patched into 2.3 to make sure our documentation is complete for the 2.3 release. @viirya and @WeichenXu123 would you mind having another look at it?


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86302/testReport)** for PR 20285 at commit [`0cdfc1b`](https://github.com/apache/spark/commit/0cdfc1b0c49b5a8278723cd7b48eca1d2c45f944).


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    <img width="1073" alt="screen shot 2018-01-17 at 4 20 02 pm" src="https://user-images.githubusercontent.com/223219/35074422-9c192bf6-fba2-11e7-8be1-35279db1df49.png">



---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162223973
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala ---
    @@ -0,0 +1,63 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +// $example on$
    +import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
    +import org.apache.spark.ml.linalg.Vectors
    +// $example off$
    +import org.apache.spark.sql.SparkSession
    +
    +object VectorSizeHintExample {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession
    +      .builder
    +      .appName("VectorSizeHintExample")
    +      .getOrCreate()
    +
    +    // $example on$
    +    val dataset = spark.createDataFrame(
    +      Seq(
    +        (0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0),
    +        (0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0))
    +    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")
    +
    +    val sizeHint = new VectorSizeHint()
    +      .setInputCol("userFeatures")
    +      .setHandleInvalid("skip")
    +      .setSize(3)
    +
    +    val datasetWithSize = sizeHint.transform(dataset)
    +    println("Rows where 'userFeatures' is not the right size are filtered out")
    +    datasetWithSize.show(false)
    +
    +    val assembler = new VectorAssembler()
    +      .setInputCols(Array("hour", "mobile", "userFeatures"))
    +      .setOutputCol("features")
    +
    +    // This dataframe can be used by used by downstream transformers as before
    --- End diff --
    
    ditto.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86548/testReport)** for PR 20285 at commit [`3055eec`](https://github.com/apache/spark/commit/3055eec72bb71e7fe7d586903fbf8ea57a70fa82).


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/20/
    Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86541/testReport)** for PR 20285 at commit [`4de3f81`](https://github.com/apache/spark/commit/4de3f8153747b3c7dc1fe36abf0d1391ffd2b80c).


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86223/
    Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86543/testReport)** for PR 20285 at commit [`6055a8c`](https://github.com/apache/spark/commit/6055a8c058cad380233e8bab73e6d9abb2674bd1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163340508
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.sql.SparkSession;
    +
    +// $example on$
    +import java.util.Arrays;
    +
    +import org.apache.spark.ml.feature.VectorAssembler;
    +import org.apache.spark.ml.feature.VectorSizeHint;
    +import org.apache.spark.ml.linalg.VectorUDT;
    +import org.apache.spark.ml.linalg.Vectors;
    +import org.apache.spark.sql.Dataset;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +import static org.apache.spark.sql.types.DataTypes.*;
    +// $example off$
    +
    +public class JavaVectorSizeHintExample {
    +  public static void main(String[] args) {
    +    SparkSession spark = SparkSession
    +      .builder()
    +      .appName("JavaVectorSizeHintExample")
    +      .getOrCreate();
    +
    +    // $example on$
    +    StructType schema = createStructType(new StructField[]{
    +      createStructField("id", IntegerType, false),
    +      createStructField("hour", IntegerType, false),
    +      createStructField("mobile", DoubleType, false),
    +      createStructField("userFeatures", new VectorUDT(), false),
    +      createStructField("clicked", DoubleType, false)
    +    });
    +    Row row0 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
    +    Row row1 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
    --- End diff --
    
    Can we use the same dataset as the other examples?


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86221/testReport)** for PR 20285 at commit [`85d0db0`](https://github.com/apache/spark/commit/85d0db07641d8d39a87129995367efad44dba56f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86223/testReport)** for PR 20285 at commit [`b4f2c71`](https://github.com/apache/spark/commit/b4f2c71950f4a5d9127942c5b3de266b92dfaac6).


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86287/testReport)** for PR 20285 at commit [`c0a53de`](https://github.com/apache/spark/commit/c0a53de5ae114f0a00d4cd1c341da35b8fe00263).


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162224172
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors a column of
    --- End diff --
    
    a column of `Vector` -> for a column of `Vector`?


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163338180
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors for a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    +transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
    +the vector size. Downstream operations on the resulting dataframe can get this size using the
    +metadata.
    +
    +`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
    +behaviour when the vector column contains nulls or vectors of the wrong size. By default
    +`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
    +also be set to "skip", indicating that rows containing invalid values should be filtered out from
    +the resulting dataframe, or `optimistic` indicating that all rows should be kept. When
    +`handleInvalid` is set to `optimistic` the user takes responsibility for ensuring that the column
    --- End diff --
    
    `optimistic` --> "optimistic"
    Backquotes should only be used for code variables.
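
As an illustration of the size bookkeeping described in the quoted paragraph, here is a plain-Python sketch (no Spark required). It models the streaming case, where the column contents cannot be inspected: the assembled vector's size is the sum of the input sizes, with each numeric column counting as 1, so every vector input needs a declared size. The function `assembled_size` and the dictionaries of declared sizes are hypothetical names for this sketch, not Spark APIs.

```python
from typing import Dict, List, Optional


def assembled_size(input_cols: List[str],
                   declared_sizes: Dict[str, Optional[int]]) -> int:
    """Return the size of the assembled vector.

    declared_sizes maps each column to its known size: 1 for a numeric
    column, n for a vector column whose metadata records size n, and None
    for a vector column whose size is unknown (no metadata available).
    """
    total = 0
    for col in input_cols:
        size = declared_sizes.get(col)
        if size is None:
            raise ValueError(
                f"size of column '{col}' is unknown; a size hint is needed")
        total += size
    return total


cols = ["hour", "mobile", "userFeatures"]

# Without a hint, the vector column has no recorded size (assumption of this model).
sizes_without_hint = {"hour": 1, "mobile": 1, "userFeatures": None}

# With a size hint of 3, matching the example data in this pull request.
sizes_with_hint = {"hour": 1, "mobile": 1, "userFeatures": 3}

print(assembled_size(cols, sizes_with_hint))
```

With the hint in place the sketch reports a combined size of 5 (1 + 1 + 3); with `sizes_without_hint` it raises, which is the situation the size hint is meant to avoid.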


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162145565
  
    --- Diff: examples/src/main/python/ml/vector_size_hint_example.py ---
    @@ -0,0 +1,57 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from __future__ import print_function
    +
    +# $example on$
    +from pyspark.ml.linalg import Vectors
    +from pyspark.ml.feature import (VectorSizeHint, VectorAssembler)
    +# $example off$
    +from pyspark.sql import SparkSession
    +
    +if __name__ == "__main__":
    +    spark = SparkSession\
    +        .builder\
    +        .appName("VectorAssemblerExample")\
    +        .getOrCreate()
    +
    +    # $example on$
    +    dataset = spark.createDataFrame(
    +        [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0),
    +         (0, 18, 1.0, Vectors.dense([0.0, 10.0]), 0.0)],
    +        ["id", "hour", "mobile", "userFeatures", "clicked"])
    +
    +    sizeHint = VectorSizeHint(
    +        inputCol="userFeatures",
    +        handleInvalid="sip",
    --- End diff --
    
    typo "sip" -> "skip"


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86548/
    Test PASSed.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20285


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163377934
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors for a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    +transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
    +the vector size. Downstream operations on the resulting dataframe can get this size using the
    +metadata.
    +
    +`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
    +behaviour when the vector column contains nulls or vectors of the wrong size. By default
    +`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
    +also be set to "skip", indicating that rows containing invalid values should be filtered out from
    +the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
    +`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
    --- End diff --
    
    It is not clear to me what the expected behavior of `optimistic` is. How is it different from `error`?
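
One way to read the difference, based only on the doc paragraph quoted above, is the following plain-Python model of the three `handleInvalid` modes applied to a single vector column. It is a sketch of the described semantics, not Spark's implementation; `apply_size_hint` and the `Vector` alias are hypothetical names, and `None` stands in for a null entry.

```python
from typing import List, Optional, Sequence

Vector = Optional[Sequence[float]]  # None models a null entry


def apply_size_hint(column: List[Vector], size: int,
                    handle_invalid: str = "error") -> List[Vector]:
    """Model the three handleInvalid modes on a single vector column."""
    if handle_invalid not in ("error", "skip", "optimistic"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")

    def is_valid(v: Vector) -> bool:
        return v is not None and len(v) == size

    if handle_invalid == "optimistic":
        # No checking at all: every row is kept, valid or not.
        return list(column)
    if handle_invalid == "skip":
        # Invalid rows are filtered out of the result.
        return [v for v in column if is_valid(v)]
    # "error": the first invalid row raises an exception.
    for v in column:
        if not is_valid(v):
            raise ValueError(f"expected a vector of size {size}, got {v!r}")
    return list(column)


# Same two 'userFeatures' values as the Scala and Python examples.
user_features = [[0.0, 10.0, 0.5], [0.0, 10.0]]

print(apply_size_hint(user_features, 3, "skip"))
print(apply_size_hint(user_features, 3, "optimistic"))
```

In this model "error" validates every row and fails on the size-2 vector, "skip" drops it, and "optimistic" performs no validation, so the mismatched row passes through and any problem surfaces only downstream.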


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86287/
    Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86223 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86223/testReport)** for PR 20285 at commit [`b4f2c71`](https://github.com/apache/spark/commit/b4f2c71950f4a5d9127942c5b3de266b92dfaac6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86287/testReport)** for PR 20285 at commit [`c0a53de`](https://github.com/apache/spark/commit/c0a53de5ae114f0a00d4cd1c341da35b8fe00263).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162223829
  
    --- Diff: examples/src/main/python/ml/vector_size_hint_example.py ---
    @@ -0,0 +1,57 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from __future__ import print_function
    +
    +# $example on$
    +from pyspark.ml.linalg import Vectors
    +from pyspark.ml.feature import (VectorSizeHint, VectorAssembler)
    +# $example off$
    +from pyspark.sql import SparkSession
    +
    +if __name__ == "__main__":
    +    spark = SparkSession\
    +        .builder\
    +        .appName("VectorSizeHintExample")\
    +        .getOrCreate()
    +
    +    # $example on$
    +    dataset = spark.createDataFrame(
    +        [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0),
    +         (0, 18, 1.0, Vectors.dense([0.0, 10.0]), 0.0)],
    +        ["id", "hour", "mobile", "userFeatures", "clicked"])
    +
    +    sizeHint = VectorSizeHint(
    +        inputCol="userFeatures",
    +        handleInvalid="skip",
    +        size=3)
    +
    +    datasetWithSize = sizeHint.transform(dataset)
    +    print("Rows where 'userFeatures' is not the right size are filtered out")
    +    datasetWithSize.show(truncate=False)
    +
    +    assembler = VectorAssembler(
    +        inputCols=["hour", "mobile", "userFeatures"],
    +        outputCol="features")
    +
    +    # This dataframe can be used by used by downstream transformers as before
    --- End diff --
    
    I think there are some typos here.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Hi, @mengxr .
    Could you resolve the JIRA, too?
    - https://issues.apache.org/jira/browse/SPARK-22735
    
    Thanks!


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86366/
    Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/151/
    Test PASSed.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163390328
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors for a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    +transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
    +the vector size. Downstream operations on the resulting dataframe can get this size using the
    +metadata.
    +
    +`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
    +behaviour when the vector column contains nulls or vectors of the wrong size. By default
    +`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
    +also be set to "skip", indicating that rows containing invalid values should be filtered out from
    +the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
    +`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
    +does not have invalid values, values that don't match the column's metadata, or dealing with those
    +invalid values downstream.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
    --- End diff --
    
    [Screenshot, 2018-01-23: https://user-images.githubusercontent.com/223219/35302985-523fcccc-0045-11e8-9a21-c4ed795b6e6a.png]
    
    I don't think so :), but I think we should leave it to be consistent with other examples.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86548/testReport)** for PR 20285 at commit [`3055eec`](https://github.com/apache/spark/commit/3055eec72bb71e7fe7d586903fbf8ea57a70fa82).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/150/
    Test PASSed.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163378065
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors for a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    +transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
    +the vector size. Downstream operations on the resulting dataframe can get this size using the
    +metadata.
    +
    +`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
    +behaviour when the vector column contains nulls or vectors of the wrong size. By default
    +`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
    +also be set to "skip", indicating that rows containing invalid values should be filtered out from
    +the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
    +`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
    +does not have invalid values, values that don't match the column's metadata, or dealing with those
    +invalid values downstream.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
    --- End diff --
    
    Do we need to mention `Scala` explicitly here?
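
For readers following the thread, the usage described in the quoted documentation can be sketched roughly as below. This is an illustrative sketch, not the example file added by the PR: the object name `VectorSizeHintSketch`, the helper `withSizeHint`, and the toy columns `id` and `userFeatures` are assumptions made up for this illustration, and it presumes a Spark 2.3+ build with `spark-mllib` on the classpath. It sets `inputCol`, `size`, and `handleInvalid` as the text describes, then reads the declared size back from the column metadata.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorSizeHint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object and helper names, used only for this sketch.
object VectorSizeHintSketch {

  // Builds a toy dataframe and applies a size hint of 3 to "userFeatures".
  def withSizeHint(spark: SparkSession): DataFrame = {
    val dataset = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 10.0, 0.5)), // size 3: matches the hint
      (1, Vectors.dense(0.0, 10.0))       // size 2: invalid for this hint
    )).toDF("id", "userFeatures")

    val sizeHint = new VectorSizeHint()
      .setInputCol("userFeatures")
      .setSize(3)
      // "skip" filters out rows with nulls or wrong-size vectors;
      // the default "error" would throw instead, and "optimistic"
      // would keep every row without checking.
      .setHandleInvalid("skip")

    sizeHint.transform(dataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("VectorSizeHintSketch")
      .master("local[*]")
      .getOrCreate()

    val hinted = withSizeHint(spark)

    // Downstream code can read the declared size from the column metadata
    // without inspecting any rows.
    val declaredSize =
      AttributeGroup.fromStructField(hinted.schema("userFeatures")).size
    println(s"Declared vector size: $declaredSize")

    hinted.show(false)
    spark.stop()
  }
}
```

With `handleInvalid` set to "skip", the row holding the two-element vector is dropped from the result; switching to "optimistic" would keep it, leaving any mismatch for downstream stages to handle.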


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86366/testReport)** for PR 20285 at commit [`6228902`](https://github.com/apache/spark/commit/622890287aecf515da147b039c253657161e89f3).


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/157/


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r163377373
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    --- End diff --
    
    `a user` is correct because the pronunciation of `user` starts with a `y` sound.


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162142830
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,48 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors an a column of
    --- End diff --
    
    typo 'an a column' -> 'in a column'


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    **[Test build #86543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86543/testReport)** for PR 20285 at commit [`6055a8c`](https://github.com/apache/spark/commit/6055a8c058cad380233e8bab73e6d9abb2674bd1).


---



[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20285#discussion_r162224951
  
    --- Diff: docs/ml-features.md ---
    @@ -1283,6 +1283,56 @@ for more details on the API.
     </div>
     </div>
     
    +## VectorSizeHint
    +
    +It can sometimes be useful to explicitly specify the size of the vectors a column of
    +`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
    +produce size information and metadata for its output column. While in some cases this information
    +can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
    +not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
    +vector size for a column so that `VectorAssembler`, or other transformers that might
    +need to know vector size, can use that column as an input.
    +
    +To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
    --- End diff --
    
    a user -> an user


---



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20285
  
    Merged build finished. Test PASSed.


---
