Posted to reviews@spark.apache.org by MrBago <gi...@git.apache.org> on 2018/01/17 01:58:04 UTC
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
GitHub user MrBago opened a pull request:
https://github.com/apache/spark/pull/20285
[SPARK-22735][ML][DOC] Added VectorSizeHint docs and examples.
## What changes were proposed in this pull request?
Added documentation for new transformer.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MrBago/spark sizeHintDocs
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20285.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20285
----
commit 85d0db07641d8d39a87129995367efad44dba56f
Author: Bago Amirbekian <ba...@...>
Date: 2018-01-17T00:24:01Z
"Added VectorSizeHint docs and examples."
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86302 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86302/testReport)** for PR 20285 at commit [`0cdfc1b`](https://github.com/apache/spark/commit/0cdfc1b0c49b5a8278723cd7b48eca1d2c45f944).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:
https://github.com/apache/spark/pull/20285
LGTM. Merged into master and branch-2.3. Thanks!
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86541/testReport)** for PR 20285 at commit [`4de3f81`](https://github.com/apache/spark/commit/4de3f8153747b3c7dc1fe36abf0d1391ffd2b80c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163336725
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+
+import org.apache.spark.ml.feature.VectorAssembler;
+import org.apache.spark.ml.feature.VectorSizeHint;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import static org.apache.spark.sql.types.DataTypes.*;
+// $example off$
+
+public class JavaVectorSizeHintExample {
+ public static void main(String[] args) {
+ SparkSession spark = SparkSession
+ .builder()
+ .appName("JavaVectorSizeHintExample")
+ .getOrCreate();
+
+ // $example on$
+ StructType schema = createStructType(new StructField[]{
+ createStructField("id", IntegerType, false),
+ createStructField("hour", IntegerType, false),
+ createStructField("mobile", DoubleType, false),
+ createStructField("userFeatures", new VectorUDT(), false),
+ createStructField("clicked", DoubleType, false)
+ });
+ Row row = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
--- End diff --
Hi, @MrBago. It seems that we need to add one more row here.
```java
RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0);
```
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86221/testReport)** for PR 20285 at commit [`85d0db0`](https://github.com/apache/spark/commit/85d0db07641d8d39a87129995367efad44dba56f).
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162144090
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,48 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors an a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get this size using the
+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls for vectors of the wrong size. By default
--- End diff --
typo: 'nulls for vectors..' -> 'nulls or vectors'
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162519014
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
--- End diff --
I don't know if the Spark style guide covers this, but I believe "a user" is generally the preferred form: https://english.stackexchange.com/a/105117.
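
The quoted paragraph says that `VectorAssembler` derives the size of its output column from the sizes of its input columns, and that this is not possible when a vector column's size is unknown (as in a streaming dataframe before the stream starts). A minimal pure-Python sketch of that bookkeeping follows. It is a model for illustration only, not Spark code: the function name `assembled_size` and the convention of using `None` for "no size metadata" are assumptions.

```python
from typing import Optional, Sequence


def assembled_size(input_sizes: Sequence[Optional[int]]) -> Optional[int]:
    """Return the size of the assembled vector, or None if it cannot be known up front.

    Each entry describes one input column: 1 for a numeric column, n for a
    vector column whose size is declared, None for a vector column that
    carries no size information (hypothetical convention for this sketch).
    """
    total = 0
    for size in input_sizes:
        if size is None:
            # One unknown input size makes the output size unknown.
            return None
        total += size
    return total


# hour (numeric), mobile (numeric), userFeatures (vector of declared size 3)
print(assembled_size([1, 1, 3]))     # 5
# Same columns, but userFeatures has no declared size
print(assembled_size([1, 1, None]))  # None
```

In this model, declaring the size of `userFeatures` (which is what `VectorSizeHint` does through column metadata) is what turns the second case into the first.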
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86366/testReport)** for PR 20285 at commit [`6228902`](https://github.com/apache/spark/commit/622890287aecf515da147b039c253657161e89f3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163341372
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+
+import org.apache.spark.ml.feature.VectorAssembler;
+import org.apache.spark.ml.feature.VectorSizeHint;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import static org.apache.spark.sql.types.DataTypes.*;
+// $example off$
+
+public class JavaVectorSizeHintExample {
+ public static void main(String[] args) {
+ SparkSession spark = SparkSession
+ .builder()
+ .appName("JavaVectorSizeHintExample")
+ .getOrCreate();
+
+ // $example on$
+ StructType schema = createStructType(new StructField[]{
+ createStructField("id", IntegerType, false),
+ createStructField("hour", IntegerType, false),
+ createStructField("mobile", DoubleType, false),
+ createStructField("userFeatures", new VectorUDT(), false),
+ createStructField("clicked", DoubleType, false)
+ });
+ Row row0 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
+ Row row1 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
--- End diff --
Sorry, yes, this should be fixed now.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86543/
Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162144930
--- Diff: examples/src/main/python/ml/vector_size_hint_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.linalg import Vectors
+from pyspark.ml.feature import (VectorSizeHint, VectorAssembler)
+# $example off$
+from pyspark.sql import SparkSession
+
+if __name__ == "__main__":
+ spark = SparkSession\
+ .builder\
+ .appName("VectorAssemblerExample")\
--- End diff --
should be "VectorSizeHintExample" - same with the other APIs
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162224903
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
--- End diff --
nit: a user -> an user
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86302/
Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86541/
Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163389908
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get this size using the
+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls or vectors of the wrong size. By default
+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
--- End diff --
I've updated it; let me know if you think we can still make it clearer.
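
To make the three `handleInvalid` modes in the quoted documentation concrete, here is a small pure-Python model of how each mode treats a column containing nulls or vectors of the wrong size. It is only a sketch of the documented semantics, not Spark's implementation: the function `apply_size_hint`, the use of plain lists for vectors, and the `ValueError` raised in "error" mode are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def apply_size_hint(
    rows: List[Optional[Vector]],
    size: int,
    handle_invalid: str = "error",
) -> List[Optional[Vector]]:
    """Model the documented modes on a single vector column (hypothetical helper)."""
    if handle_invalid not in ("error", "skip", "optimistic"):
        raise ValueError(f"unknown handle_invalid mode: {handle_invalid!r}")

    if handle_invalid == "optimistic":
        # No checks at all: the caller vouches that every vector has `size` entries.
        return list(rows)

    def is_valid(vector: Optional[Vector]) -> bool:
        return vector is not None and len(vector) == size

    if handle_invalid == "skip":
        # Rows with a null or wrongly sized vector are filtered out.
        return [vector for vector in rows if is_valid(vector)]

    # "error": the first invalid row aborts processing.
    for vector in rows:
        if not is_valid(vector):
            raise ValueError(f"expected a vector of size {size}, got {vector!r}")
    return list(rows)


rows = [[0.0, 10.0, 0.5], [0.0, 10.0], None]
print(apply_size_hint(rows, 3, "skip"))             # [[0.0, 10.0, 0.5]]
print(len(apply_size_hint(rows, 3, "optimistic")))  # 3
```

The sample rows mirror the review suggestion of including a vector of the wrong size, so that "skip" visibly drops something.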
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:
https://github.com/apache/spark/pull/20285
Thanks for the review, @BryanCutler. I've added a Java example and uploaded 2 screenshots.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162223639
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.ml.feature.VectorAssembler;
+import org.apache.spark.ml.feature.VectorSizeHint;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Arrays;
+
+import static org.apache.spark.sql.types.DataTypes.*;
+
+// $example on$
+// $example off$
--- End diff --
Do we need the above two lines?
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86221/
Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:
https://github.com/apache/spark/pull/20285
<img width="1144" alt="screen shot 2018-01-17 at 4 19 40 pm" src="https://user-images.githubusercontent.com/223219/35074406-90b074ea-fba2-11e7-853b-45e3c447fe68.png">
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:
https://github.com/apache/spark/pull/20285
I'd like to try and get this patched into 2.3 to make sure our documentation is complete for the 2.3 release. @viirya and @WeichenXu123 would you mind having another look at it?
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86302/testReport)** for PR 20285 at commit [`0cdfc1b`](https://github.com/apache/spark/commit/0cdfc1b0c49b5a8278723cd7b48eca1d2c45f944).
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:
https://github.com/apache/spark/pull/20285
<img width="1073" alt="screen shot 2018-01-17 at 4 20 02 pm" src="https://user-images.githubusercontent.com/223219/35074422-9c192bf6-fba2-11e7-8be1-35279db1df49.png">
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162223973
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object VectorSizeHintExample {
+ def main(args: Array[String]): Unit = {
+ val spark = SparkSession
+ .builder
+ .appName("VectorSizeHintExample")
+ .getOrCreate()
+
+ // $example on$
+ val dataset = spark.createDataFrame(
+ Seq(
+ (0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0),
+ (0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0))
+ ).toDF("id", "hour", "mobile", "userFeatures", "clicked")
+
+ val sizeHint = new VectorSizeHint()
+ .setInputCol("userFeatures")
+ .setHandleInvalid("skip")
+ .setSize(3)
+
+ val datasetWithSize = sizeHint.transform(dataset)
+ println("Rows where 'userFeatures' is not the right size are filtered out")
+ datasetWithSize.show(false)
+
+ val assembler = new VectorAssembler()
+ .setInputCols(Array("hour", "mobile", "userFeatures"))
+ .setOutputCol("features")
+
+ // This dataframe can be used by used by downstream transformers as before
--- End diff --
ditto.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86548/testReport)** for PR 20285 at commit [`3055eec`](https://github.com/apache/spark/commit/3055eec72bb71e7fe7d586903fbf8ea57a70fa82).
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/20/
Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86541/testReport)** for PR 20285 at commit [`4de3f81`](https://github.com/apache/spark/commit/4de3f8153747b3c7dc1fe36abf0d1391ffd2b80c).
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86223/
Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86543/testReport)** for PR 20285 at commit [`6055a8c`](https://github.com/apache/spark/commit/6055a8c058cad380233e8bab73e6d9abb2674bd1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163340508
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+
+import org.apache.spark.ml.feature.VectorAssembler;
+import org.apache.spark.ml.feature.VectorSizeHint;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import static org.apache.spark.sql.types.DataTypes.*;
+// $example off$
+
+public class JavaVectorSizeHintExample {
+ public static void main(String[] args) {
+ SparkSession spark = SparkSession
+ .builder()
+ .appName("JavaVectorSizeHintExample")
+ .getOrCreate();
+
+ // $example on$
+ StructType schema = createStructType(new StructField[]{
+ createStructField("id", IntegerType, false),
+ createStructField("hour", IntegerType, false),
+ createStructField("mobile", DoubleType, false),
+ createStructField("userFeatures", new VectorUDT(), false),
+ createStructField("clicked", DoubleType, false)
+ });
+ Row row0 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
+ Row row1 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
--- End diff --
Can we use the same data set as the other code examples?
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86221/testReport)** for PR 20285 at commit [`85d0db0`](https://github.com/apache/spark/commit/85d0db07641d8d39a87129995367efad44dba56f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86223/testReport)** for PR 20285 at commit [`b4f2c71`](https://github.com/apache/spark/commit/b4f2c71950f4a5d9127942c5b3de266b92dfaac6).
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86287/testReport)** for PR 20285 at commit [`c0a53de`](https://github.com/apache/spark/commit/c0a53de5ae114f0a00d4cd1c341da35b8fe00263).
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162224172
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors a column of
--- End diff --
a column of `Vector` -> for a column of `Vector`?
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163338180
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get this size using the
+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls or vectors of the wrong size. By default
+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or `optimistic` indicating that all rows should be kept. When
+`handleInvalid` is set to `optimistic` the user takes responsibility for ensuring that the column
--- End diff --
`optimistic` --> "optimistic"
Backquotes are only used for code variables.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162145565
--- Diff: examples/src/main/python/ml/vector_size_hint_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.linalg import Vectors
+from pyspark.ml.feature import (VectorSizeHint, VectorAssembler)
+# $example off$
+from pyspark.sql import SparkSession
+
+if __name__ == "__main__":
+ spark = SparkSession\
+ .builder\
+ .appName("VectorAssemblerExample")\
+ .getOrCreate()
+
+ # $example on$
+ dataset = spark.createDataFrame(
+ [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0),
+ (0, 18, 1.0, Vectors.dense([0.0, 10.0]), 0.0)],
+ ["id", "hour", "mobile", "userFeatures", "clicked"])
+
+ sizeHint = VectorSizeHint(
+ inputCol="userFeatures",
+ handleInvalid="sip",
--- End diff --
typo "sip" -> "skip"
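To illustrate why the typo matters: the docs under review name three values for `handleInvalid` ("error", "skip", "optimistic"), and "sip" is none of them. The plain-Python sketch below checks a value against that list. `ALLOWED_MODES` and `check_handle_invalid` are hypothetical names, and this is not Spark's own parameter validation, whose exact behavior is not shown in this thread.

```python
# Hypothetical helper, not part of PySpark: the three modes are taken
# from the VectorSizeHint docs quoted in this pull request.
ALLOWED_MODES = ("error", "skip", "optimistic")


def check_handle_invalid(value):
    """Return the value unchanged if it is a documented mode, else raise."""
    if value not in ALLOWED_MODES:
        raise ValueError(
            "handleInvalid must be one of %s, got %r" % (ALLOWED_MODES, value))
    return value


# The corrected spelling is accepted.
mode = check_handle_invalid("skip")

# The misspelled value from the diff is rejected.
try:
    check_handle_invalid("sip")
    typo_rejected = False
except ValueError:
    typo_rejected = True
```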
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86548/
Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20285
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163377934
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get this size using the
+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls or vectors of the wrong size. By default
+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
--- End diff --
It is not clear to me what the expected behavior of `optimistic` is. How is it different from `error`?
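One way to read the three modes in the quoted paragraph is the plain-Python model below. It is only a sketch of the documented behavior, not Spark's implementation: `apply_size_hint` is a hypothetical name, rows are modeled as lists of floats (or `None`), and the exception type is an assumption.

```python
def apply_size_hint(rows, size, handle_invalid="error"):
    """Model the three handleInvalid modes on a list of vectors.

    Each element of `rows` is a list of floats, or None for a null vector.
    """
    def is_valid(vec):
        return vec is not None and len(vec) == size

    if handle_invalid == "optimistic":
        # In this model every row is kept without checking; the caller
        # is responsible for invalid values, now or downstream.
        return list(rows)
    if handle_invalid == "skip":
        # Rows with nulls or wrong-sized vectors are filtered out.
        return [vec for vec in rows if is_valid(vec)]
    if handle_invalid == "error":
        # The first invalid row raises instead of being dropped or kept.
        for vec in rows:
            if not is_valid(vec):
                raise ValueError(
                    "expected a vector of size %d, got %r" % (size, vec))
        return list(rows)
    raise ValueError("unknown handle_invalid value: %r" % (handle_invalid,))


# Same shape as the data in the Python example under review:
# one vector of size 3 and one of size 2.
rows = [[0.0, 10.0, 0.5], [0.0, 10.0]]
skipped = apply_size_hint(rows, size=3, handle_invalid="skip")
kept = apply_size_hint(rows, size=3, handle_invalid="optimistic")
```

Under this reading, "error" and "optimistic" differ only when invalid rows exist: "error" fails on them, while "optimistic" passes them through unchecked.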
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86287/
Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86223 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86223/testReport)** for PR 20285 at commit [`b4f2c71`](https://github.com/apache/spark/commit/b4f2c71950f4a5d9127942c5b3de266b92dfaac6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86287/testReport)** for PR 20285 at commit [`c0a53de`](https://github.com/apache/spark/commit/c0a53de5ae114f0a00d4cd1c341da35b8fe00263).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162223829
--- Diff: examples/src/main/python/ml/vector_size_hint_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.linalg import Vectors
+from pyspark.ml.feature import (VectorSizeHint, VectorAssembler)
+# $example off$
+from pyspark.sql import SparkSession
+
+if __name__ == "__main__":
+ spark = SparkSession\
+ .builder\
+ .appName("VectorSizeHintExample")\
+ .getOrCreate()
+
+ # $example on$
+ dataset = spark.createDataFrame(
+ [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0),
+ (0, 18, 1.0, Vectors.dense([0.0, 10.0]), 0.0)],
+ ["id", "hour", "mobile", "userFeatures", "clicked"])
+
+ sizeHint = VectorSizeHint(
+ inputCol="userFeatures",
+ handleInvalid="skip",
+ size=3)
+
+ datasetWithSize = sizeHint.transform(dataset)
+ print("Rows where 'userFeatures' is not the right size are filtered out")
+ datasetWithSize.show(truncate=False)
+
+ assembler = VectorAssembler(
+ inputCols=["hour", "mobile", "userFeatures"],
+ outputCol="features")
+
+ # This dataframe can be used by used by downstream transformers as before
--- End diff --
I think there are some typos here.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/20285
Hi, @mengxr .
Could you resolve the JIRA, too?
- https://issues.apache.org/jira/browse/SPARK-22735
Thanks!
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86366/
Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/151/
Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163390328
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get this size using the
+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls or vectors of the wrong size. By default
+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
+does not have invalid values, values that don't match the column's metadata, or dealing with those
+invalid values downstream.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
--- End diff --
[Screenshot: https://user-images.githubusercontent.com/223219/35302985-523fcccc-0045-11e8-9a21-c4ed795b6e6a.png]
I don't think so :), but I think we should leave it to be consistent with other examples.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86548/testReport)** for PR 20285 at commit [`3055eec`](https://github.com/apache/spark/commit/3055eec72bb71e7fe7d586903fbf8ea57a70fa82).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/150/
Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163378065
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get this size using the
+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls or vectors of the wrong size. By default
+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column
+does not have invalid values, values that don't match the column's metadata, or dealing with those
+invalid values downstream.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
--- End diff --
Do we need to mention `Scala` explicitly here?
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86366/testReport)** for PR 20285 at commit [`6228902`](https://github.com/apache/spark/commit/622890287aecf515da147b039c253657161e89f3).
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/157/
Test PASSed.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r163377373
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
--- End diff --
`a user` is correct because the pronunciation of `user` starts with a `y` sound.
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162142830
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,48 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors an a column of
--- End diff --
typo 'an a column' -> 'in a column'
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20285
**[Test build #86543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86543/testReport)** for PR 20285 at commit [`6055a8c`](https://github.com/apache/spark/commit/6055a8c058cad380233e8bab73e6d9abb2674bd1).
---
[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20285#discussion_r162224951
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
</div>
</div>
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors a column of
+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
--- End diff --
a user -> an user
---
[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20285
Merged build finished. Test PASSed.
---