Posted to reviews@spark.apache.org by oliverpierson <gi...@git.apache.org> on 2016/05/31 12:44:48 UTC

[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Github user oliverpierson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r65173690

--- Diff: docs/ml-features.md ---
@@ -145,9 +148,11 @@ for more details on the API.
passed to other algorithms like LDA.

During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
- term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
+ term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
- included in the vocabulary.
+ included in the vocabulary. Another optional binary toggle parameter controls the output vector.
--- End diff --

The difference in results is a bit puzzling. I'm getting the same thing as @MLnick. Could you both look at the output of `df.stat.approxQuantile("hour", Array(1.0/3, 2.0/3), relativeError=0.001)`? I get the following on the DataFrame above:

```
scala> df.stat.approxQuantile("hour", Array(1.0/3, 2.0/3), relativeError=0.001)
res8: Array[Double] = Array(2.2, 5.0)
```

