Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2016/07/21 21:41:40 UTC
[GitHub] spark pull request #14308: [SPARK-16260][EXAMPLES][ML] Improve ML Example Ou...
GitHub user BryanCutler opened a pull request:
https://github.com/apache/spark/pull/14308
[SPARK-16260][EXAMPLES][ML] Improve ML Example Outputs
## What changes were proposed in this pull request?
Improve example outputs to better reflect the functionality being presented. This mostly consisted of modifying what is printed at the end of each example, such as calling show() with truncate=False, but sometimes required minor tweaks to the example data to get relevant output. Parameters are now set explicitly when they are part of what the example demonstrates. Fixed Java examples that failed to run because they used old-style MLlib Vectors or had schema problems. Synced examples across the different language APIs.
## How was this patch tested?
Ran each example for Scala, Python, and Java and made sure output was legible on a terminal of width 100.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/BryanCutler/spark ml-examples-improve-output-SPARK-16260
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14308.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14308
----
commit 7b4496b16517b01c01abd6aebe84b53876265b82
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-19T00:22:12Z
finished going through about a third of examples
commit 6e4ed29e704e4805fff312b561f8e41919e014eb
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-20T00:23:25Z
Fixed more examples, about half done now
commit 26718e9da96de142e4bb3078ffdacbf94e4c3d47
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-20T18:19:45Z
more progress up to NaiveBayes example
commit ff066ce1ad3391c707cd21b4802c5843a70a2da9
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T00:26:59Z
further progress up to PCA example
commit 53a29411c5969d1bc25ace3817cc927213fcb0b7
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T04:28:12Z
continued throught examples up to Tf Idf
commit 38c319945e854939f86b8e3f67ebcb04d0be532f
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T20:22:31Z
finished remaining ml examples
commit a8093bec8fc4090711e6d7b56001a288db03235d
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T20:37:22Z
fixed style checks
commit afe2b2ad3069363de62a6f25cd1e4ac706b9e6b8
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T20:57:36Z
fixed Java import ordering
commit b7384cef97f89730f4f400873c8369775bbe994e
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T21:09:41Z
minor cleanup
commit ae2249a3396f6585c504986234d664dd23f9c401
Author: Bryan Cutler <cu...@gmail.com>
Date: 2016-07-21T21:33:35Z
made accurracy reporting consistent
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72836296
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaPolynomialExpansionExample.java ---
@@ -48,23 +48,19 @@ public static void main(String[] args) {
.setDegree(3);
List<Row> data = Arrays.asList(
- RowFactory.create(Vectors.dense(-2.0, 2.3)),
+ RowFactory.create(Vectors.dense(2.0, 1.0)),
--- End diff --
The fractional part makes the output a little ugly, whereas using whole numbers is more readable and still shows the transform.
before
```
[[-2.0,4.0,-8.0,2.3,-4.6,9.2,5.289999999999999,-10.579999999999998,12.166999999999996]]
[[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]
[[0.6,0.36,0.216,-1.1,-0.66,-0.396,1.2100000000000002,0.7260000000000001,-1.3310000000000004]]
```
after
```
+----------+------------------------------------------+
|features |polyFeatures |
+----------+------------------------------------------+
|[2.0,1.0] |[2.0,4.0,8.0,1.0,2.0,4.0,1.0,2.0,1.0] |
|[0.0,0.0] |[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0] |
|[3.0,-1.0]|[3.0,9.0,27.0,-1.0,-3.0,-9.0,1.0,3.0,-1.0]|
+----------+------------------------------------------+
```
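The ordering of terms in the `polyFeatures` column can be reproduced in plain Java. This is a minimal sketch, not Spark's implementation: the class `PolyExpansionSketch` and the method `expand` are hypothetical names, and only the two-feature, degree-3 case from this example is handled.

```java
import java.util.Arrays;

class PolyExpansionSketch {
  // Expands (x, y) into all monomials x^i * y^j with 1 <= i + j <= 3,
  // ordered as in the table: x, x^2, x^3, y, xy, x^2*y, y^2, x*y^2, y^3.
  static double[] expand(double x, double y) {
    double[] out = new double[9];
    int k = 0;
    for (int j = 0; j <= 3; j++) {          // power of y
      for (int i = 0; i + j <= 3; i++) {    // power of x
        if (i + j == 0) {
          continue;                         // skip the constant term
        }
        out[k++] = Math.pow(x, i) * Math.pow(y, j);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [2.0, 4.0, 8.0, 1.0, 2.0, 4.0, 1.0, 2.0, 1.0]
    System.out.println(Arrays.toString(expand(2.0, 1.0)));
    // Fractional inputs may show long rounding tails.
    System.out.println(Arrays.toString(expand(-2.0, 2.3)));
  }
}
```

With whole-number inputs every product is exactly representable as a double, which is why the rows stay short; fractional inputs such as 2.3 can pick up binary floating-point rounding tails like those in the "before" output.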
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778529
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaPolynomialExpansionExample.java ---
@@ -48,23 +48,19 @@ public static void main(String[] args) {
.setDegree(3);
List<Row> data = Arrays.asList(
- RowFactory.create(Vectors.dense(-2.0, 2.3)),
+ RowFactory.create(Vectors.dense(2.0, 1.0)),
--- End diff --
Why this change?
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14308
@BryanCutler yeah if there are some changes that are more bug-fixes to make the examples work, let's separate those out into a new JIRA & PR. That should be a little higher priority for `2.0.1`
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #63089 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63089/consoleFull)** for PR 14308 at commit [`a556742`](https://github.com/apache/spark/commit/a556742dd38b2722ee7d497e355bc1b9ed974cf4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72830947
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaIsotonicRegressionExample.java ---
@@ -50,8 +50,8 @@ public static void main(String[] args) {
IsotonicRegression ir = new IsotonicRegression();
IsotonicRegressionModel model = ir.fit(dataset);
- System.out.println("Boundaries in increasing order: " + model.boundaries());
- System.out.println("Predictions associated with the boundaries: " + model.predictions());
+ System.out.println("Boundaries in increasing order: " + model.boundaries() + "\n");
--- End diff --
The two arrays that are printed are large, and all the output gets clumped together into what looks like one huge block of text, so adding some separation makes it a bit more readable.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14308
I think it's fine to remove files that aren't referenced here too.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #63087 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63087/consoleFull)** for PR 14308 at commit [`479819d`](https://github.com/apache/spark/commit/479819dbddbe02d099f3b6359b99718e7a71a2df).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
Ok, I removed these data files
```
sample_tree_data.csv
lr_data.txt
random.data
```
and added example usage to reference `pagerank_data.txt`
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778384
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaMaxAbsScalerExample.java ---
@@ -34,10 +44,17 @@ public static void main(String[] args) {
.getOrCreate();
// $example on$
- Dataset<Row> dataFrame = spark
- .read()
- .format("libsvm")
- .load("data/mllib/sample_libsvm_data.txt");
+ List<Row> data = Arrays.asList(
--- End diff --
Does the data change here? Why change from reading the file?
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72838595
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaWord2VecExample.java ---
@@ -55,10 +56,14 @@ public static void main(String[] args) {
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0);
+
Word2VecModel model = word2Vec.fit(documentDF);
Dataset<Row> result = model.transform(documentDF);
- for (Row r : result.select("result").takeAsList(3)) {
- System.out.println(r);
+
+ for (Row row : result.collectAsList()) {
+ java.util.List text = row.getList(0);
--- End diff --
List was already imported, but this should be `List text = ...`
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63089/
Test PASSed.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
Thanks for the review @srowen! I added some before/after outputs, so hopefully some of the changes make more sense. I'll fix up the rest after I make another JIRA for the Java errors.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #62974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62974/consoleFull)** for PR 14308 at commit [`bb2fcee`](https://github.com/apache/spark/commit/bb2fceea1c696b04f2113be8c9c5a9ce638493b9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72857036
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaWord2VecExample.java ---
@@ -55,10 +56,14 @@ public static void main(String[] args) {
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0);
+
Word2VecModel model = word2Vec.fit(documentDF);
Dataset<Row> result = model.transform(documentDF);
- for (Row r : result.select("result").takeAsList(3)) {
- System.out.println(r);
+
+ for (Row row : result.collectAsList()) {
+ java.util.List text = row.getList(0);
--- End diff --
Yeah, just saying it's also fully qualified here. It could have a generic bound too.
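To illustrate the generic-bound suggestion: a raw `List` yields `Object` elements, while `List<String>` lets the compiler check element types without a cast or a fully qualified name. The sketch below is an assumption-laden stand-in so that it runs without Spark on the classpath: `GenericListSketch` and its `getList` helper are hypothetical and only mimic the shape of a typed list accessor.

```java
import java.util.Arrays;
import java.util.List;

class GenericListSketch {
  // Hypothetical stand-in for an accessor that returns a typed list.
  @SuppressWarnings("unchecked")
  static <T> List<T> getList(Object cell) {
    return (List<T>) cell;
  }

  static String joinWords(Object cell) {
    // The imported List with a generic bound: elements are known to be Strings.
    List<String> text = getList(cell);
    return String.join(" ", text);
  }

  public static void main(String[] args) {
    Object cell = Arrays.asList("Hi", "I", "heard", "about", "Spark");
    System.out.println(joinWords(cell)); // prints "Hi I heard about Spark"
  }
}
```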
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63087/
Test PASSed.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #63274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63274/consoleFull)** for PR 14308 at commit [`b634f9b`](https://github.com/apache/spark/commit/b634f9b8a7fd7f118605800f19266611d8951b33).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #63274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63274/consoleFull)** for PR 14308 at commit [`b634f9b`](https://github.com/apache/spark/commit/b634f9b8a7fd7f118605800f19266611d8951b33).
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14308
It's probably OK on the whole, improving or standardizing examples slightly. I left a number of small questions. Some of the changes didn't feel quite worth making, but maybe I'm missing the logic.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
ping @mengxr @jkbradley @MLnick, would any of you mind taking a look at this? There were a few Java examples I fixed up that wouldn't run because they used mllib.linalg.Vectors. If it would be easier, I could separate those into another PR to get that in asap. Thanks!
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62974/
Test PASSed.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14308
There's a lot of change here; I skimmed it and it all looks generally positive, adding some consistency or clarification, or a fix in some cases. Is sample_libsvm_data.txt used anymore then? It's low risk to merge because they're example changes. I'm OK with it.
[GitHub] spark issue #14308: [SPARK-16260][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #62693 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62693/consoleFull)** for PR 14308 at commit [`ae2249a`](https://github.com/apache/spark/commit/ae2249a3396f6585c504986234d664dd23f9c401).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #63087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63087/consoleFull)** for PR 14308 at commit [`479819d`](https://github.com/apache/spark/commit/479819dbddbe02d099f3b6359b99718e7a71a2df).
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Merged build finished. Test PASSed.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63274/
Test PASSed.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
Thanks for taking another look @srowen. `sample_libsvm_data.txt` is still used, but it looks like these are never referenced:
```
sample_tree_data.csv
pagerank_data.txt
lr_data.txt
random.data
```
I can't place where `sample_tree_data.csv` might have belonged, `pagerank_data.txt` is obvious (just missing reference in usage), and `lr_data.txt`/`random.data` look like labeled points probably from some older MLlib examples.
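A check for unreferenced data files can be sketched as a plain substring scan over the example sources. This is only an illustration under stated assumptions: `DataFileAuditSketch` and `unreferenced` are hypothetical names, the sources are passed in as in-memory strings rather than read from the repository, and the snippets in `main` are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DataFileAuditSketch {
  // Returns the data file names that none of the source texts mention.
  static List<String> unreferenced(List<String> dataFiles, Map<String, String> sources) {
    List<String> result = new ArrayList<>();
    for (String name : dataFiles) {
      boolean found = false;
      for (String text : sources.values()) {
        if (text.contains(name)) {
          found = true;
          break;
        }
      }
      if (!found) {
        result.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up source snippets, keyed by a made-up file name.
    Map<String, String> sources = new HashMap<>();
    sources.put("ExampleA.java", "load(\"data/mllib/sample_libsvm_data.txt\")");
    List<String> dataFiles = Arrays.asList("sample_libsvm_data.txt", "random.data");
    System.out.println(unreferenced(dataFiles, sources)); // prints [random.data]
  }
}
```

A substring match can miss names that are assembled at runtime, so results like these are a starting point for manual review rather than proof that a file is unused.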
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
> @BryanCutler yeah if there are some changes that are more bug-fixes to make the examples work, let's separate those out into a new JIRA & PR. That should be a little higher priority for 2.0.1
Sure @MLnick , I realized I should probably do that about half-way into this. I'll make another JIRA and fix the Java errors there.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Merged build finished. Test PASSed.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
Attaching a quick audit of the example data files and which examples reference them, taken from this branch:
[spark_example_data_audit.txt](https://github.com/apache/spark/files/402881/spark_example_data_audit.txt)
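For readers without access to the attachment, an audit of this kind can be sketched roughly as below. This is only an illustration, not the script that produced the attached file: the class name `DataFileAuditSketch`, the method `audit`, and the source names `ExampleA.java` and `ExampleB.java` are hypothetical, and the check is a plain substring match on file names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DataFileAuditSketch {

    // For each data file name, list the example sources whose text mentions it.
    // `sources` maps a source file name to its full text (hypothetical inputs).
    static Map<String, List<String>> audit(List<String> dataFiles, Map<String, String> sources) {
        Map<String, List<String>> references = new LinkedHashMap<>();
        for (String dataFile : dataFiles) {
            List<String> users = new ArrayList<>();
            for (Map.Entry<String, String> source : sources.entrySet()) {
                if (source.getValue().contains(dataFile)) {
                    users.add(source.getKey());
                }
            }
            references.put(dataFile, users);
        }
        return references;
    }

    public static void main(String[] args) {
        // Stand-in source texts; a real audit would read the files under examples/.
        Map<String, String> sources = new LinkedHashMap<>();
        sources.put("ExampleA.java", "load(\"data/mllib/sample_libsvm_data.txt\")");
        sources.put("ExampleB.java", "load(\"data/mllib/sample_libsvm_data.txt\")");

        List<String> dataFiles = Arrays.asList("sample_libsvm_data.txt", "lr_data.txt");
        audit(dataFiles, sources).forEach((file, users) ->
            System.out.println(file + " -> " + (users.isEmpty() ? "never referenced" : users)));
    }
}
```

A data file that maps to an empty list is a candidate for removal, which is how files such as `lr_data.txt` would surface in such a listing.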
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Merged build finished. Test PASSed.
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72832453
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java ---
@@ -54,6 +54,7 @@ public static void main(String[] args) {
// Output the parameters of the mixture model
for (int i = 0; i < model.getK(); i++) {
+ System.out.println("Gaussian " + i);
--- End diff --
Yeah the 2 print statements could be combined. I was probably just trying not to cram too much together, but I think it would be fine.
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72835137
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaOneVsRestExample.java ---
@@ -75,7 +75,7 @@ public static void main(String[] args) {
// compute the classification error on test data.
double accuracy = evaluator.evaluate(predictions);
- System.out.println("Test Error : " + (1 - accuracy));
+ System.out.println("Test Error = " + (1 - accuracy));
--- End diff --
Yeah, I was just trying to make things like this consistent with other similar examples. I think I just saw "=" used more often, but it really doesn't make a difference to me.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14308
Merged to master
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #63089 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63089/consoleFull)** for PR 14308 at commit [`a556742`](https://github.com/apache/spark/commit/a556742dd38b2722ee7d497e355bc1b9ed974cf4).
[GitHub] spark issue #14308: [SPARK-16260][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Merged build finished. Test PASSed.
[GitHub] spark issue #14308: [SPARK-16260][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #62693 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62693/consoleFull)** for PR 14308 at commit [`ae2249a`](https://github.com/apache/spark/commit/ae2249a3396f6585c504986234d664dd23f9c401).
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Merged build finished. Test PASSed.
[GitHub] spark issue #14308: [SPARK-16260][EXAMPLES][ML] Improve ML Example Outputs
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14308
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62693/
Test PASSed.
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72834486
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaNGramExample.java ---
@@ -55,16 +55,12 @@ public static void main(String[] args) {
Dataset<Row> wordDataFrame = spark.createDataFrame(data, schema);
- NGram ngramTransformer = new NGram().setInputCol("words").setOutputCol("ngrams");
+ NGram ngramTransformer = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams");
--- End diff --
I really only think that the param `N` should be set explicitly. Looking back, changing the column name was not necessary, let me change that back.
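To make the effect of setting `N` to 2 concrete, here is a rough plain-Java sketch of what an n-gram transformation produces: each run of `n` consecutive words joined by a single space. It does not use Spark; the class `NGramSketch`, the method `ngrams`, and the sample sentence are assumptions made for illustration, and `n` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NGramSketch {

    // Join every run of n consecutive words with a single space.
    // Returns an empty list when there are fewer than n words.
    static List<String> ngrams(List<String> words, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hi", "I", "heard", "about", "Spark");
        // prints [Hi I, I heard, heard about, about Spark]
        System.out.println(ngrams(words, 2));
    }
}
```

With `n` spelled out in the call, a reader can tell from the code alone why the output contains word pairs, which is the point of setting the parameter explicitly in the example.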
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/14308
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778518
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaOneVsRestExample.java ---
@@ -75,7 +75,7 @@ public static void main(String[] args) {
// compute the classification error on test data.
double accuracy = evaluator.evaluate(predictions);
- System.out.println("Test Error : " + (1 - accuracy));
+ System.out.println("Test Error = " + (1 - accuracy));
--- End diff --
Some of these changes feel kind of trivial, but I guess this is for consistency. But other new System.out.println statements use : not =
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778582
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java ---
@@ -23,17 +23,23 @@
import java.util.Arrays;
import java.util.List;
+import scala.collection.mutable.WrappedArray;
+
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.ml.feature.Tokenizer;
+import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
-import org.apache.spark.sql.types.DataTypes;
-import org.apache.spark.sql.types.Metadata;
-import org.apache.spark.sql.types.StructField;
-import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.types.*;
--- End diff --
Here imports are collapsed to *; elsewhere a * import is expanded. I might generally not touch these, but, the standard is usually to avoid wildcard imports by default
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72837199
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java ---
@@ -23,17 +23,23 @@
import java.util.Arrays;
import java.util.List;
+import scala.collection.mutable.WrappedArray;
+
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.ml.feature.Tokenizer;
+import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
-import org.apache.spark.sql.types.DataTypes;
-import org.apache.spark.sql.types.Metadata;
-import org.apache.spark.sql.types.StructField;
-import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.types.*;
--- End diff --
I agree that wildcards should be avoided, not sure what happened here. It might have been an automatic thing from the IDE, I'll revert this.
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
This has been updated since fixing the errors in Java @srowen @MLnick . I know most of these changes are trivial, but will hopefully make some of the examples easier to follow. Thanks!
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778614
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaWord2VecExample.java ---
@@ -55,10 +56,14 @@ public static void main(String[] args) {
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0);
+
Word2VecModel model = word2Vec.fit(documentDF);
Dataset<Row> result = model.transform(documentDF);
- for (Row r : result.select("result").takeAsList(3)) {
- System.out.println(r);
+
+ for (Row row : result.collectAsList()) {
+ java.util.List text = row.getList(0);
--- End diff --
Import List
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72833265
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaMaxAbsScalerExample.java ---
@@ -34,10 +44,17 @@ public static void main(String[] args) {
.getOrCreate();
// $example on$
- Dataset<Row> dataFrame = spark
- .read()
- .format("libsvm")
- .load("data/mllib/sample_libsvm_data.txt");
+ List<Row> data = Arrays.asList(
--- End diff --
The data in the file is fine, but it uses sparse vectors, so the printed result doesn't really show anything. With just a small sample dataset, you can see what the scaler is doing from the output.
before
```
+-----+--------------------+--------------------+
|label| features| scaledFeatures|
+-----+--------------------+--------------------+
| 0.0|(692,[127,128,129...|(692,[127,128,129...|
| 1.0|(692,[158,159,160...|(692,[158,159,160...|
| 1.0|(692,[124,125,126...|(692,[124,125,126...|
```
after
```
+--------------+----------------+
| features| scaledFeatures|
+--------------+----------------+
|[1.0,0.1,-8.0]|[0.25,0.01,-1.0]|
|[2.0,1.0,-4.0]| [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]| [1.0,1.0,1.0]|
```
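The arithmetic behind the "after" table can be checked with a small plain-Java sketch: max-abs scaling divides each feature column by the largest absolute value in that column. This is not the Spark example itself; the class `MaxAbsScalingSketch` and the method `scale` are hypothetical names, and the input rows are the three feature vectors shown in the table.

```java
import java.util.Arrays;

final class MaxAbsScalingSketch {

    // Divide each column by its largest absolute value so every entry lands in [-1, 1].
    static double[][] scale(double[][] rows) {
        int numCols = rows[0].length;
        double[] maxAbs = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                maxAbs[j] = Math.max(maxAbs[j], Math.abs(row[j]));
            }
        }
        double[][] scaled = new double[rows.length][numCols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < numCols; j++) {
                // An all-zero column is left as zeros to avoid dividing by zero.
                scaled[i][j] = maxAbs[j] == 0.0 ? 0.0 : rows[i][j] / maxAbs[j];
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] features = {
            {1.0, 0.1, -8.0},
            {2.0, 1.0, -4.0},
            {4.0, 10.0, 8.0}
        };
        for (double[] row : scale(features)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The column maxima here are 4.0, 10.0, and 8.0, so the first row becomes roughly 0.25, 0.01, and -1.0, in line with the `scaledFeatures` column above (up to floating-point rounding).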
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/14308
Thanks @srowen!
[GitHub] spark issue #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14308
**[Test build #62974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62974/consoleFull)** for PR 14308 at commit [`bb2fcee`](https://github.com/apache/spark/commit/bb2fceea1c696b04f2113be8c9c5a9ce638493b9).
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778479
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaNGramExample.java ---
@@ -55,16 +55,12 @@ public static void main(String[] args) {
Dataset<Row> wordDataFrame = spark.createDataFrame(data, schema);
- NGram ngramTransformer = new NGram().setInputCol("words").setOutputCol("ngrams");
+ NGram ngramTransformer = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams");
--- End diff --
I suppose this doesn't hurt, but ngrams was still fairly OK
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778075
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaIsotonicRegressionExample.java ---
@@ -50,8 +50,8 @@ public static void main(String[] args) {
IsotonicRegression ir = new IsotonicRegression();
IsotonicRegressionModel model = ir.fit(dataset);
- System.out.println("Boundaries in increasing order: " + model.boundaries());
- System.out.println("Predictions associated with the boundaries: " + model.predictions());
+ System.out.println("Boundaries in increasing order: " + model.boundaries() + "\n");
--- End diff --
No big deal, but why the extra line break?
[GitHub] spark pull request #14308: [SPARK-16421][EXAMPLES][ML] Improve ML Example Ou...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/14308#discussion_r72778289
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java ---
@@ -54,6 +54,7 @@ public static void main(String[] args) {
// Output the parameters of the mixture model
for (int i = 0; i < model.getK(); i++) {
+ System.out.println("Gaussian " + i);
--- End diff --
Why split over two statements?