Posted to reviews@spark.apache.org by GayathriMurali <gi...@git.apache.org> on 2016/05/18 19:00:37 UTC

[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

GitHub user GayathriMurali opened a pull request:

    https://github.com/apache/spark/pull/13176

    [SPARK-15100][DOC] Modified user guide and examples for CountVectoriz…

    ## What changes were proposed in this pull request?
    
    These are partial documentation changes to ml.feature, covering CountVectorizer, HashingTF and QuantileDiscretizer.
    
    
    ## How was this patch tested?
    
    Unit tests and manual testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GayathriMurali/spark SPARK-15100

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13176
    
----
commit 46408bbdb13da94ecd40ba380ee8fc219232d481
Author: GayathriMurali <ga...@intel.com>
Date:   2016-05-18T18:58:27Z

    [SPARK-15100][DOC] Modified user guide and examples for CountVectorizer, HashingTF and QuantileDiscretizer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64079147
  
    --- Diff: docs/ml-features.md ---
    @@ -114,7 +116,10 @@ for more details on the API.
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
      term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary.Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    I said "Another" because the previous sentence starts with "An optional parameter". It just sounded right.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @MLnick Please let me know if there is anything else I can do to help get this merged. Thanks!




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221410132
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59222/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220698824
  
    Something messed up the `git push`. I will send another commit.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220723548
  
    @MLnick The latest commit includes just the ml-features.md changes. I removed all the other example files and feature.py.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @GayathriMurali Thanks for the update.  For future cases like this, I support creating a new JIRA specific to the task at hand, but agree with @MLnick about keeping the same PR and just changing the title to link it to the new JIRA.  No big deal though.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59915/consoleFull)** for PR 13176 at commit [`5a53051`](https://github.com/apache/spark/commit/5a530519d77f036a90b71a654d377e333e99a7f9).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221145486
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59171/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64080140
  
    --- Diff: docs/ml-features.md ---
    @@ -114,7 +116,10 @@ for more details on the API.
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
      term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary.Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    It's ok I guess, but you are missing a space.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64073253
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java ---
    @@ -54,6 +54,7 @@ public static void main(String[] args) {
           .setOutputCol("feature")
           .setVocabSize(3)
           .setMinDF(2)
    +      .setBinary(true)
    --- End diff --
    
    @MLnick Since we introduce the binary toggle in the doc, I thought it would make sense to show how to set it. Do you want me to remove it from all examples?
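
To make the three parameters discussed in this thread concrete, here is a minimal plain-Python sketch of how `vocabSize`, `minDF` and the binary toggle interact. It does not call Spark. The function names `fit_vocabulary` and `transform` and the toy corpus are hypothetical, and the alphabetical tie-breaking between equally frequent terms is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Pick the top `vocab_size` terms by corpus-wide term frequency,
    keeping only terms that appear in at least `min_df` documents.
    A `min_df` below 1.0 is read as a fraction of the documents."""
    term_freq = Counter(t for doc in docs for t in doc)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    threshold = min_df if min_df >= 1.0 else min_df * len(docs)
    eligible = [t for t in term_freq if doc_freq[t] >= threshold]
    # Descending term frequency; ties broken alphabetically (assumption).
    eligible.sort(key=lambda t: (-term_freq[t], t))
    return eligible[:vocab_size]


def transform(doc, vocabulary, binary=False):
    """Count vocabulary terms in one document; with binary=True every
    nonzero count is clamped to 1."""
    counts = Counter(doc)
    vector = [counts[t] for t in vocabulary]
    if binary:
        vector = [1 if c > 0 else 0 for c in vector]
    return vector


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]  # toy corpus
vocab = fit_vocabulary(docs, vocab_size=3, min_df=2)
print(vocab)                                   # ['a', 'b', 'c']
print(transform(docs[1], vocab))               # [2, 2, 1]
print(transform(docs[1], vocab, binary=True))  # [1, 1, 1]
```

With the toggle on, the second document's vector records only which vocabulary terms are present, not how often they occur.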




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220513197
  
    @hhbyyh Can you please help review this? I will resolve the branch conflict along with the review comments.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59964/consoleFull)** for PR 13176 at commit [`881a2fc`](https://github.com/apache/spark/commit/881a2fc98217eefc454e2ad310ef28f0c677d14f).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220703852
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59014/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221145485
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59964/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64683245
  
    --- Diff: docs/ml-features.md ---
    @@ -53,7 +53,10 @@ collisions, where different raw features may become the same term after hashing.
     chance of collision, we can increase the target feature dimension, i.e. the number of buckets 
     of the hash table. Since a simple modulo is used to transform the hash function to a column index, 
    --- End diff --
    
    @yanboliang I am neutral about adding this.
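
The collision trade-off in the quoted text can be illustrated without Spark. The sketch below maps terms to column indices with a hash followed by a modulo, then counts how many distinct columns are used for two different dimensions. `zlib.crc32` stands in for the hash function purely so the result is deterministic (an assumption, not the hash Spark uses), and `column_index` and the generated term list are hypothetical.

```python
import zlib


def column_index(term, num_features):
    # Hash the term, then reduce it to a column with a simple modulo.
    return zlib.crc32(term.encode("utf-8")) % num_features


terms = ["term%d" % i for i in range(100)]  # hypothetical raw features

for num_features in (16, 1024):
    used = {column_index(t, num_features) for t in terms}
    # Fewer distinct columns than terms means some terms collided.
    print(num_features, "buckets:", len(used), "distinct columns for", len(terms), "terms")
```

With 16 buckets and 100 terms, collisions are unavoidable; raising the dimension can only keep or increase the number of distinct columns.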




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60322/
    Test PASSed.




[GitHub] spark issue #13176: [SPARK-15997][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @GayathriMurali why did you close this? I'm sorry it is taking a while, but everyone has been pretty swamped over the past few weeks in the lead up to Spark Summit and trying to get 2.0 ready for an RC.
    
    This would have been simpler if not for the partition issue. I think we can figure out an approach easily to ensure 1 partition in the examples (e.g. use `repartition` or similar).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64079535
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    +a relativeError parameter.Default value is 0.001. This parameter is not available in Python yet.
    --- End diff --
    
    Nit: space
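
As a rough guide to what the `relativeError` value means: as I understand the `approxQuantile` API documentation, for N rows, a requested probability p and error err, the returned value's exact rank lies between floor((p - err) * N) and ceil((p + err) * N). The sketch below just evaluates that bound; `rank_bounds` is a hypothetical helper and the row counts are made up.

```python
import math
from fractions import Fraction


def rank_bounds(n, p, err):
    # Exact rational arithmetic avoids floating-point surprises at the edges.
    p, err = Fraction(p), Fraction(err)
    return math.floor((p - err) * n), math.ceil((p + err) * n)


# Median of one million rows with the default error of 0.001 from the diff.
print(rank_bounds(1_000_000, "0.5", "0.001"))  # (499000, 501000)
```

A smaller error narrows the rank window, which in practice means more work for the approximation.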




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64100861
  
    --- Diff: docs/ml-features.md ---
    @@ -1093,13 +1111,10 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +The bin ranges are chosen using the `approxQuantile` method based on the Greenwald-Khanna algorithm.
    +The number of bins found is equal to `numBuckets` parameter value. `relativeError` sets the target relative precision
    --- End diff --
    
    Feel free to borrow from my proposed doc here too. Note the upper bound is `1`.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64001752
  
    --- Diff: docs/ml-features.md ---
    @@ -114,7 +116,10 @@ for more details on the API.
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
      term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
    --- End diff --
    
    While we're here, make it `minDF` (with backticks rather than quotation marks).




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65173690
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    The difference in results is a bit puzzling.  I'm getting the same thing as @MLnick.  Could you both look at the output of `df.stat.approxQuantile("hour", Array(1.0/3, 2.0/3), relativeError=0.001)`?  I get the following on the DataFrame above:
    
    ```
    scala> df.stat.approxQuantile("hour", Array(1.0/3, 2.0/3), relativeError=0.001)
    res8: Array[Double] = Array(2.2, 5.0)
    ```
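
For context, the two values returned above would act as the interior split points for three buckets. Here is a plain-Python sketch of that mapping, using the `2.2` and `5.0` from the output and the infinite outer bounds mentioned in the quoted doc text. The sample values and the `bucket_of` helper are hypothetical, and treating each bucket as closed on the left is an assumption of this sketch.

```python
import bisect
import math

# Interior splits taken from the approxQuantile output above; the outer
# bounds are infinite so that every finite value falls into some bucket.
splits = [-math.inf, 2.2, 5.0, math.inf]


def bucket_of(value, splits):
    # Bucket i covers splits[i] <= value < splits[i + 1] (assumed convention).
    return bisect.bisect_right(splits, value) - 1


for hour in [0.5, 2.2, 4.9, 5.0, 18.0]:  # hypothetical sample values
    print(hour, "->", bucket_of(hour, splits))
```

If two runs produce different split points, values near a boundary land in different buckets, which is the kind of discrepancy being compared here.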





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64077367
  
    --- Diff: docs/ml-features.md ---
    @@ -26,7 +26,9 @@ This section covers algorithms for working with features, roughly divided into t
     
     `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
     fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
    -The algorithm combines Term Frequency (TF) counts with the 
    +A binary toggle parameter controls term frequency. When set to true all nonzero frequencies are
    +set to 1. This is especially useful for discrete probabilistic models that model binary counts
    +rather than integer. The algorithm combines Term Frequency (TF) counts with the
     [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
    --- End diff --
    
    It might be better to switch the order of these sentences, so you describe the algorithm first, then the optional binary parameter.
    ```
    The algorithm combines...
    A binary toggle parameter...
    ```
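
The suggested order (algorithm first, then the toggle) also matches how the behaviour is easiest to demonstrate. Below is a minimal plain-Python sketch of hashed term frequencies with an optional binary mode. It is not Spark's implementation: `zlib.crc32` is used as a stand-in hash for determinism (an assumption), and `hashing_tf` and the sample sentence are hypothetical.

```python
import zlib


def hashing_tf(terms, num_features, binary=False):
    """Map each term to a column by hash modulo `num_features` and
    accumulate counts; with binary=True a column is set to 1 once hit."""
    vector = [0] * num_features
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] = 1 if binary else vector[index] + 1
    return vector


words = "hi i heard about spark spark spark".split()
counts = hashing_tf(words, num_features=32)
flags = hashing_tf(words, num_features=32, binary=True)
print(counts)  # the column for "spark" holds at least 3
print(flags)   # same nonzero columns, every value clamped to 1
```

The binary vector has the same nonzero positions as the count vector, which is the property models over binary features rely on.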





[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r67237879
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1095,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
    --- End diff --
    
    Bad link: remove ".scala"




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64523585
  
    --- Diff: docs/ml-features.md ---
    @@ -151,7 +151,7 @@ for more details on the API.
      term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
      included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    - If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete
    + If set to true all nonzero counts are set to 1. This is especially useful for discrete
    --- End diff --
    
    @GayathriMurali you haven't addressed this comment as far as I can see.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @MLnick  +1 for making the change in the example as well. Calling out the difference in results due to parallelism might be a little confusing in this document.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220566098
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58967/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64535542
  
    --- Diff: docs/ml-features.md ---
    @@ -53,7 +53,10 @@ collisions, where different raw features may become the same term after hashing.
     chance of collision, we can increase the target feature dimension, i.e. the number of buckets 
     of the hash table. Since a simple modulo is used to transform the hash function to a column index, 
    --- End diff --
    
    Should we also mention that we use ``` Austin Appleby's MurmurHash 3 algorithm``` to calculate the hash code value? How to set the feature dimension is related to the hash algorithm.
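
    For readers following the discussion, the hash-then-modulo mapping can be sketched roughly as below. This is only an illustration under stated assumptions: it uses Java's built-in `String.hashCode` as a stand-in (it is not MurmurHash 3 and not Spark's implementation), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class HashIndexSketch {
    // Map a term to a column index by taking a non-negative modulo of its hash code.
    // ASSUMPTION: String.hashCode() is a stand-in hash, NOT MurmurHash 3.
    public static int columnIndex(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Accumulate term counts per column; terms that collide share a column.
    public static double[] termFrequencies(List<String> terms, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String term : terms) {
            tf[columnIndex(term, numFeatures)] += 1.0;
        }
        return tf;
    }

    public static void main(String[] args) {
        // A power-of-two dimension, as the guide text recommends.
        double[] tf = termFrequencies(Arrays.asList("a", "b", "a"), 1 << 4);
        System.out.println(Arrays.toString(tf));
    }
}
```

    A larger feature dimension lowers the chance that two distinct terms land in the same column, which is the trade-off the guide text describes.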




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221962590
  
    **[Test build #59399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59399/consoleFull)** for PR 13176 at commit [`39d3dfb`](https://github.com/apache/spark/commit/39d3dfb97f0334d0921942356ca03d71bfd636b5).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64355101
  
    --- Diff: docs/ml-features.md ---
    @@ -1098,9 +1098,9 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins is set by the `numBuckets` parameter.
    -The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
     for a detailed description). The precision of the approximation can be controlled with the
    -`relativeError` parameter. When set to zero, exact quantiles are calculated.
    +`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation.
    --- End diff --
    
    I'd prefer `When set to zero, exact quantiles are calculated (**Note** that computing exact ...).`




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    Can you check with `sysctl -n hw.ncpu`?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220703848
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #60322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60322/consoleFull)** for PR 13176 at commit [`87844ef`](https://github.com/apache/spark/commit/87844efaeee06a9d4cd7c30cbd6d4b947978112c).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221145417
  
    **[Test build #59171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59171/consoleFull)** for PR 13176 at commit [`27acda3`](https://github.com/apache/spark/commit/27acda3745d84f5abdd93d1e2621aa1d211c4a95).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220727900
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220566097
  
    Build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64079509
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    The approximate quantile algorithm is deterministic, by the way.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @jkbradley @MLnick I have created SPARK-15997 to track the changes addressed in this PR. 




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64077742
  
    --- Diff: docs/ml-features.md ---
    @@ -114,7 +116,10 @@ for more details on the API.
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
      term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary.Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    Don't say another, just say "An optional binary..."
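
    To make the `vocabSize` and `minDF` wording in the quoted paragraph concrete, here is a minimal sketch of one way such a vocabulary could be selected. It is an assumption-laden illustration, not Spark's code: the class and method names are hypothetical, and the alphabetical tie-breaking is added only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class MinDfSketch {
    // Select up to vocabSize terms, ordered by corpus-wide term frequency,
    // keeping only terms that appear in at least minDF documents
    // (minDF is read as a fraction of the corpus when it is < 1.0).
    public static List<String> vocabulary(List<List<String>> docs, double minDF, int vocabSize) {
        Map<String, Integer> termFreq = new HashMap<>(); // total occurrences
        Map<String, Integer> docFreq = new HashMap<>();  // documents containing the term
        for (List<String> doc : docs) {
            for (String term : doc) {
                termFreq.merge(term, 1, Integer::sum);
            }
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        double threshold = minDF >= 1.0 ? minDF : minDF * docs.size();
        List<String> vocab = new ArrayList<>();
        for (String term : termFreq.keySet()) {
            if (docFreq.get(term) >= threshold) {
                vocab.add(term);
            }
        }
        // Highest term frequency first; ties broken alphabetically (an assumption).
        vocab.sort((a, b) -> {
            int byCount = Integer.compare(termFreq.get(b), termFreq.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(vocab.subList(0, Math.min(vocabSize, vocab.size())));
    }
}
```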




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @MLnick I am using local. I haven't explicitly set up a thread count.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59694/consoleFull)** for PR 13176 at commit [`4b1a1fa`](https://github.com/apache/spark/commit/4b1a1fa2d7c9b19b18bf223439abe1318b9795da).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65774912
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java ---
    @@ -58,7 +58,11 @@ public static void main(String[] args) {
         QuantileDiscretizer discretizer = new QuantileDiscretizer()
           .setInputCol("hour")
           .setOutputCol("result")
    -      .setNumBuckets(3);
    +      .setNumBuckets(3)
    +      .setRelativeError(0);
    +      // Note that we compute exact quantiles here by setting `relativeError` to 0 for
    --- End diff --
    
    I actually think it will be better to put the comment above the code block since it spans two lines (for each example)




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-222409058
  
    @MLnick Please let me know if there is anything else I can help with on this PR.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @MLnick I opened PR #13745 to track this, as @jkbradley suggested. This JIRA covers only a partial list of the ml.feature audit. Please help review SPARK-15597.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64110989
  
    --- Diff: docs/ml-features.md ---
    @@ -53,7 +53,10 @@ collisions, where different raw features may become the same term after hashing.
     chance of collision, we can increase the target feature dimension, i.e. the number of buckets 
     of the hash table. Since a simple modulo is used to transform the hash function to a column index, 
     it is advisable to use a power of two as the feature dimension, otherwise the features will 
    -not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. 
    +not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
    +An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are
    +set to 1. This is especially useful for discrete probabilistic models that model binary counts
    +rather than integer.
    --- End diff --
    
    This sentence is not right.  The api doc reads "This is useful for discrete probabilistic models that model binary events rather than integer counts."
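
    For context, the behaviour of the binary toggle described in the quoted text (all nonzero counts set to 1) can be sketched as follows. This is a hypothetical helper operating on a plain array of counts, written only to illustrate the wording; it is not the Spark API.

```java
public class BinaryToggleSketch {
    // When binary is true, every nonzero count becomes 1.0, so the vector
    // records binary events (term present or absent) rather than integer counts.
    public static double[] applyBinary(double[] counts, boolean binary) {
        double[] out = counts.clone();
        if (binary) {
            for (int i = 0; i < out.length; i++) {
                if (out[i] != 0.0) {
                    out[i] = 1.0;
                }
            }
        }
        return out;
    }
}
```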




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65790125
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1095,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
    +for a detailed description). The precision of the approximation can be controlled with the
    +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01.
    --- End diff --
    
    @MLnick I specified the default value because in the example we say "however in most cases the default parameter value should suffice", and not mentioning the default value wouldn't make much sense.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64079652
  
    --- Diff: docs/ml-features.md ---
    @@ -26,7 +26,9 @@ This section covers algorithms for working with features, roughly divided into t
     
     `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
     fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
    -The algorithm combines Term Frequency (TF) counts with the 
    +A binary toggle parameter controls term frequency. When set to true all nonzero frequencies are
    --- End diff --
    
    I was looking at the wrong one, but you need to say "term frequency counts", the counts are the actual output




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64001064
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java ---
    @@ -58,7 +58,8 @@ public static void main(String[] args) {
         QuantileDiscretizer discretizer = new QuantileDiscretizer()
           .setInputCol("hour")
           .setOutputCol("result")
    -      .setNumBuckets(3);
    +      .setNumBuckets(3)
    --- End diff --
    
    this is ok to keep since it matches the Scala example




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221761350
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59328/
    Test PASSed.




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r66011880
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1095,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
    +for a detailed description). The precision of the approximation can be controlled with the
    +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01.
    --- End diff --
    
    @MLnick What do you think?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220727811
  
    **[Test build #59036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59036/consoleFull)** for PR 13176 at commit [`901fb6d`](https://github.com/apache/spark/commit/901fb6df17667440339120fde3e36ae6be1ae2df).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221409973
  
    **[Test build #59222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59222/consoleFull)** for PR 13176 at commit [`1028995`](https://github.com/apache/spark/commit/1028995e6456448be8fcd2c8478a5f8218e4892c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220202679
  
    (@GayathriMurali The PR title looks incomplete: it ends with "...". It would be nicer to complete the title and rebase to resolve the conflict.)




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64001345
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    I believe this doc is not accurate any longer with the move to `approxQuantile`. cc @oliverpierson for any suggestions.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64001546
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    We should clarify that the quantile computation is approximate, perhaps borrowing from the `approxQuantile` docs, and provide a link to the API doc for the function.
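
A minimal sketch of what "approximate" means here, based on the rank guarantee described in the `approxQuantile` API docs: for `N` elements, probability `p`, and error `err`, the returned value `x` should satisfy `floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)`. The object name, the helper names, and the definition of rank as "number of elements less than or equal to `x`" are assumptions made for illustration; this is plain Scala and does not call Spark.

```scala
// Hypothetical helper, not part of Spark: checks the documented rank bound
// for a candidate quantile value on a small in-memory data set.
object RankBoundSketch {
  /** Rank of x, assumed here to be the number of elements <= x. */
  def rank(data: Seq[Double], x: Double): Int = data.count(_ <= x)

  /** True if x is an acceptable answer for quantile p with relative error err. */
  def withinBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
    val n = data.size
    val lo = math.floor((p - err) * n) // lowest acceptable rank
    val hi = math.ceil((p + err) * n)  // highest acceptable rank
    val r = rank(data, x)
    lo <= r && r <= hi
  }
}
```

With `err = 0` the acceptable rank window collapses to a single rank, which is why a zero `relativeError` corresponds to exact quantiles; a larger `err` widens the window, so more than one value becomes an acceptable answer.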




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220725078
  
    **[Test build #59032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59032/consoleFull)** for PR 13176 at commit [`490a8e8`](https://github.com/apache/spark/commit/490a8e8b038868c56f8393fee180255041f19b7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65036342
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    Hmm, it's strange as I still get different results (clean build off latest master @ `1360a6d636dd812a27955fc85df8e0255db60dfa`):
    
    <img width="871" alt="screen shot 2016-05-30 at 8 45 01 am" src="https://cloud.githubusercontent.com/assets/1036807/15643010/d318b6bc-2642-11e6-9ca8-d2825bd1dcce.png">
    
    For now we can leave the example as is, but it's a little worrying to be getting this difference. Will need to dig further - can you confirm which commit hash you're building off?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221020100
  
    @MLnick I fixed all review comments. Can you please let me know if there is anything else to be done to help get this merged? 




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    Yeah I get the following
    ```
    scala> df.stat.approxQuantile("hour", Array(1.0/3, 2.0/3), relativeError=0.001)
    res1: Array[Double] = Array(2.2, 5.0)
    ```
    env:
    on Mac, `Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)`




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-222454756
  
    LGTM - my internet connection is a bit patchy as I'm traveling. Will merge later today.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    That would require setting `relativeError` to `0` in the examples however. Open to other suggestions.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220538455
  
    **[Test build #58967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58967/consoleFull)** for PR 13176 at commit [`46408bb`](https://github.com/apache/spark/commit/46408bbdb13da94ecd40ba380ee8fc219232d481).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64051081
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    @oliverpierson Actually, while looking at adding this for PySpark, I also just came across this doc issue on both the Scala and Python sides. I will submit a PR soon and ping you on it.




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r67237898
  
    --- Diff: docs/ml-features.md ---
    @@ -46,14 +46,16 @@ In MLlib, we separate TF and IDF to make them flexible.
     `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
     fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
     `HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
    -A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies 
    +A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash).Then term frequencies
    --- End diff --
    
    Put space between sentences




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220125864
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220692810
  
    **[Test build #59014 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59014/consoleFull)** for PR 13176 at commit [`e0b1c38`](https://github.com/apache/spark/commit/e0b1c3835a7432bd410cf2887c010ab1998a0ec4).




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Can the issue with different results be fixed by making sure the DataFrame has a single partition?  It'd be great not to have error = 0 in the examples.




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65788351
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1095,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
    +for a detailed description). The precision of the approximation can be controlled with the
    +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01.
    --- End diff --
    
    Sorry, I missed this: the default for `relativeError` is actually `0.001`. In any case, I don't think it's necessary to specify it here in the guide, so you can remove that sentence.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64077625
  
    --- Diff: docs/ml-features.md ---
    @@ -26,7 +26,9 @@ This section covers algorithms for working with features, roughly divided into t
     
     `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
     fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
    -The algorithm combines Term Frequency (TF) counts with the 
    +A binary toggle parameter controls term frequency. When set to true all nonzero frequencies are
    --- End diff --
    
    I don't think this is quite right, binary does not control the term frequency.  I think it's better to say "... controls the output vector values" as it says in the docstring.
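
A minimal sketch of the distinction drawn above, assuming a simplified hashing-trick transform: terms are hashed into `numFeatures` buckets and counted, and a `binary` flag only changes the values written to the output vector (every nonzero count becomes 1.0). The object and method names are hypothetical, and `scala.util.hashing.MurmurHash3` is a stand-in, so the bucket indices are not expected to match those produced by Spark's `HashingTF`.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for the hashing trick; not Spark's implementation.
object HashingTfSketch {
  /** Returns a sparse vector as a map from bucket index to value. */
  def transform(terms: Seq[String], numFeatures: Int, binary: Boolean): Map[Int, Double] = {
    // Map a term to a bucket in [0, numFeatures), keeping the index non-negative.
    def index(term: String): Int = {
      val h = MurmurHash3.stringHash(term) % numFeatures
      if (h < 0) h + numFeatures else h
    }
    // Term frequencies per bucket are computed the same way in both modes.
    val counts = terms.groupBy(index).map { case (i, ts) => i -> ts.size.toDouble }
    // The binary flag only rewrites the output values, not the counting step.
    if (binary) counts.map { case (i, _) => i -> 1.0 } else counts
  }
}
```

Both modes produce the same set of nonzero indices; only the stored values differ, which is consistent with describing the parameter as controlling the output vector values rather than the term frequency itself.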




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64001816
  
    --- Diff: docs/ml-features.md ---
    @@ -114,7 +116,10 @@ for more details on the API.
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
      term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
    --- End diff --
    
    "affect" -> "affects"




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    I just did. It is `local[4]`.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @GayathriMurali what environment are you using?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64101972
  
    --- Diff: docs/ml-features.md ---
    @@ -1093,13 +1111,10 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +The bin ranges are chosen using the `approxQuantile` method based on the Greenwald-Khanna algorithm.
    +The number of bins found is equal to `numBuckets` parameter value. `relativeError` sets the target relative precision
    --- End diff --
    
    Sure. I was not able to find the API doc for `approxQuantile`.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    We'll need to update the Java and Python examples accordingly too.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @jkbradley The different results were due to a difference in the underlying core count (thread count). @MLnick and I were able to get the same results with `local[4]`. We could explicitly specify this in the example and get rid of the error = 0.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    I'm getting the same results as @MLnick and @oliverpierson, including `Array(2.2, 5.0)` from the stat call. My env is:
    
    master (updated this morning) on d67c82e4b647dacd0b789d72c9eaf4dc7d329dbd
    RHEL 7
    OpenJDK 64-Bit Server VM, Java 1.8.0_77
    Scala version 2.11.8




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r66690814
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1095,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
    +for a detailed description). The precision of the approximation can be controlled with the
    +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01.
    --- End diff --
    
    I'd prefer to remove it in case we change it (so that there are fewer places to update).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64078816
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java ---
    @@ -54,6 +54,7 @@ public static void main(String[] args) {
           .setOutputCol("feature")
           .setVocabSize(3)
           .setMinDF(2)
    +      .setBinary(true)
    --- End diff --
    
    I agree with @MLnick that we should not set the binary param in the examples; it completely changes the output, and it's easy enough for users to figure out how to set it themselves.
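
To make the effect of the `binary` param concrete, here is a minimal plain-Java sketch. It is not Spark code: the class `BinaryTermFrequencySketch` and its method `termFrequencies` are hypothetical names, and the only assumption carried over from the thread is that, when the flag is true, every nonzero term count is set to 1.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this does not use or reproduce the Spark API.
final class BinaryTermFrequencySketch {
    // Counts how often each term occurs in a document.
    // When binary is true, every nonzero count is clamped to 1.
    static Map<String, Integer> termFrequencies(List<String> terms, boolean binary) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : terms) {
            if (binary) {
                counts.put(term, 1);
            } else {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("a", "b", "b", "c", "a");
        System.out.println(termFrequencies(doc, false)); // {a=2, b=2, c=1}
        System.out.println(termFrequencies(doc, true));  // {a=1, b=1, c=1}
    }
}
```

The two printed maps differ for every repeated term, which is why turning the flag on in a short example changes the displayed output so visibly.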




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59913/consoleFull)** for PR 13176 at commit [`5df77c3`](https://github.com/apache/spark/commit/5df77c39ec98a14530d27366cb32ff236a29917c).




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59915/
    Test PASSed.




[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65634025
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala ---
    @@ -38,6 +38,7 @@ object QuantileDiscretizerExample {
           .setInputCol("hour")
           .setOutputCol("result")
           .setNumBuckets(3)
    +      .setRelativeError(0)
    --- End diff --
    
    If we do this here, then I think we should be explicit about the fact that we're computing exact quantiles for illustrative purposes.
    
    Something like 
    ```
    .setRelativeError(0) // note that we compute exact quantiles here for illustrative purposes, however in most cases the default parameter value should suffice 
    ```




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64000156
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    +a relativeError parameter.Default value is 0.001. This parameter is not available in Python yet.
    --- End diff --
    
    I filed https://issues.apache.org/jira/browse/SPARK-15442 for the missing parameter - we should add that




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64362121
  
    --- Diff: docs/ml-features.md ---
    @@ -151,7 +151,7 @@ for more details on the API.
      term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
      included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    - If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete
    + If set to true all nonzero counts are set to 1. This is especially useful for discrete
    --- End diff --
    
    Let's make this consistent with the doc for `HashingTF` above. 
    
    I'd prefer both to read:
    
    "... optional parameter `binary` controls the output term frequencies. When set to true, all nonzero term frequencies are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts."




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221966136
  
    **[Test build #59399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59399/consoleFull)** for PR 13176 at commit [`39d3dfb`](https://github.com/apache/spark/commit/39d3dfb97f0334d0921942356ca03d71bfd636b5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221966410
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59399/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    `Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)` on my machine.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220725160
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @MLnick @oliverpierson I checked again with a clean build off master. Here is the hash: 2bfc4f15214a870b3e067f06f37eb506b0070a1f. Here is what I see:
    
    <img width="983" alt="screen shot 2016-05-31 at 10 26 18 am" src="https://cloud.githubusercontent.com/assets/7002441/15684116/738724e4-271a-11e6-9e42-a80fdbc11bc1.png">





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221966408
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64050222
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    Actually, I think the [description](https://github.com/oliverpierson/spark/blob/a5ccc0ecbd6f3960351027c90e0c9221aa15f2db/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L71) in `QuantileDiscretizer.scala` is misleading, and that's my fault. Any sampling that is performed is done by `approxQuantile` in the DataFrame stat functions. So it would probably be best to just say something like "the bin ranges are chosen using `DataFrame.stat.approxQuantile`...". Alternatively, we could say "The bin ranges are chosen using the Greenwald-Khanna algorithm...", since that is what `approxQuantile` uses.
    
    Also, in this latest implementation of `QuantileDiscretizer`, the number of buckets found will always equal `numBuckets`. I should probably submit a new PR with updated documentation/description in `QuantileDiscretizer.scala`.
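
The idea of choosing bin ranges from quantiles can be sketched in plain Java. This is an assumption-laden illustration, not the `QuantileDiscretizer` or Greenwald-Khanna implementation: the class `QuantileBucketSketch`, its methods, and the exact quantile rule used (the sorted value at 1-based rank `ceil(k * n / numBuckets)`) are all hypothetical choices made for the example. It does keep the outer bounds at `-Infinity` and `+Infinity`, as the older documentation text described.

```java
import java.util.Arrays;

// Hypothetical illustration only; Spark computes approximate quantiles differently.
final class QuantileBucketSketch {
    // Returns numBuckets + 1 split points: -Infinity, the exact quantile
    // values for k = 1..numBuckets-1, and +Infinity.
    static double[] splits(double[] values, int numBuckets) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] result = new double[numBuckets + 1];
        result[0] = Double.NEGATIVE_INFINITY;
        result[numBuckets] = Double.POSITIVE_INFINITY;
        for (int k = 1; k < numBuckets; k++) {
            // 1-based rank of the k-th quantile under the rule assumed above.
            int rank = (int) Math.ceil((double) k * sorted.length / numBuckets);
            result[k] = sorted[Math.max(rank, 1) - 1];
        }
        return result;
    }

    // Bucket i covers the half-open interval [splits[i], splits[i + 1]).
    static int bucket(double[] splits, double x) {
        int i = 0;
        while (i < splits.length - 2 && x >= splits[i + 1]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        double[] data = {9, 1, 8, 2, 7, 3, 6, 4, 5};
        double[] s = splits(data, 3);
        System.out.println(Arrays.toString(s));
        for (double x : data) {
            System.out.println(x + " -> bucket " + bucket(s, x));
        }
    }
}
```

Sorting the whole column is what makes an exact computation costly on large data; an approximate algorithm trades a bounded rank error for much less work.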




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64075252
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    @MLnick @oliverpierson As part of this JIRA, I can fix the `approxQuantile` documentation in DataFrameStat on both the Scala and Python sides so that it is more consistent with `QuantileDiscretizer`. Please let me know if that makes sense.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64076959
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java ---
    @@ -54,6 +54,7 @@ public static void main(String[] args) {
           .setOutputCol("feature")
           .setVocabSize(3)
           .setMinDF(2)
    +      .setBinary(true)
    --- End diff --
    
    I'm neutral on it - my point is that the examples generally show "normal" (or "default") usage, and are intended to be short, succinct illustrations of usage. It's not necessary to show every possible param that can be set in each example.
    
    In this particular case, I would actually say that `binary=true` is a slightly "unusual" use case (not completely expert but certainly not the "normal" use case), so let's remove it from this PR.
    
    For `relativeError`, it's not "expert" (though there's an argument that it could be made an `expertParam`), but in almost all cases it is not necessary to deviate from the default, so again this can be removed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64100666
  
    --- Diff: docs/ml-features.md ---
    @@ -1093,13 +1111,10 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +The bin ranges are chosen using the `approxQuantile` method based on the Greenwald-Khanna algorithm.
    --- End diff --
    
    See []() - I think we can say something like
    
    ```
    The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile for a detailed description).
    ```
    
    We could link `approxQuantile` to the relevant API doc.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221761348
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64273246
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1097,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
    +for a detailed description). The precision of the approximation can be controlled with the
    +`relativeError` parameter. When set to zero, exact quantiles are calculated.
    --- End diff --
    
    We should add a note that computing exact quantiles can be very expensive.
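
What `relativeError` controls can be illustrated with a small plain-Java check. As I understand the `approxQuantile` API documentation, for N elements, a requested probability `p`, and an error `err`, the returned value `x` should satisfy `floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)`. The class `RelativeErrorSketch`, the method `withinRankBound`, and the choice to define rank as the number of elements less than or equal to the candidate are assumptions made for this sketch.

```java
import java.util.Arrays;

// Hypothetical illustration only; this checks the rank bound, it does not compute quantiles.
final class RelativeErrorSketch {
    // True if the candidate's rank lies in [floor((p - err) * n), ceil((p + err) * n)].
    static boolean withinRankBound(double[] data, double candidate, double p, double err) {
        int n = data.length;
        // Rank is taken here as the count of elements <= candidate (an assumption).
        long rank = Arrays.stream(data).filter(v -> v <= candidate).count();
        double lower = Math.floor((p - err) * n);
        double upper = Math.ceil((p + err) * n);
        return rank >= lower && rank <= upper;
    }

    public static void main(String[] args) {
        double[] data = new double[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1; // values 1..100
        }
        System.out.println(withinRankBound(data, 50, 0.5, 0.01));
        System.out.println(withinRankBound(data, 60, 0.5, 0.01));
        System.out.println(withinRankBound(data, 51, 0.5, 0.0));
    }
}
```

With `err` set to zero the allowed rank window collapses to a single rank, which matches the statement that a zero `relativeError` yields exact quantiles and explains why that setting is the expensive one.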




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221410130
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #60322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60322/consoleFull)** for PR 13176 at commit [`87844ef`](https://github.com/apache/spark/commit/87844efaeee06a9d4cd7c30cbd6d4b947978112c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220723607
  
    **[Test build #59032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59032/consoleFull)** for PR 13176 at commit [`490a8e8`](https://github.com/apache/spark/commit/490a8e8b038868c56f8393fee180255041f19b7f).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    I get this: `Array[Double] = Array(5.0, 8.0)`




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    On Mac. Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_73). I checked again and I consistently get the same output on master. @MLnick Please let me know how you would like to proceed. Should I go ahead and change the example in the doc and investigate further on my end?




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @jkbradley I just tried this. 
    <img width="533" alt="screen shot 2016-06-16 at 11 21 32 am" src="https://cloud.githubusercontent.com/assets/7002441/16128207/94f835ea-33b4-11e6-9866-369672b7bdae.png">
    and getting this output which is the same as the one in the example
    <img width="364" alt="screen shot 2016-06-16 at 11 21 52 am" src="https://cloud.githubusercontent.com/assets/7002441/16128258/cf80114c-33b4-11e6-9c8e-34d553cf5c39.png">
    
    I will create a new JIRA and link this PR to that. Thanks!





[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r66691455
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1095,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
    +for a detailed description). The precision of the approximation can be controlled with the
    +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01.
    --- End diff --
    
    Sorry for the delay. I agree we should remove it. Users who really care can check the API docs, the code, or `explainParam`.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220726383
  
    **[Test build #59036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59036/consoleFull)** for PR 13176 at commit [`901fb6d`](https://github.com/apache/spark/commit/901fb6df17667440339120fde3e36ae6be1ae2df).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220537993
  
    ok to test




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Yes, you can do it in this one.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    That's wild.  I'm getting `Array[Double] = Array(2.2, 5.0)` and I'm guessing @MLnick is also.  `approxQuantile` is deterministic so I'm not really sure why we're getting different results.  Perhaps @thunterdb might have an idea?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64111073
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    + If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete
    + probabilistic models that model binary events rather than integer counts
    --- End diff --
    
    This sounds right, but it is missing a period.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220704753
  
    @oliverpierson @GayathriMurali I opened #13228 to cover the `relativeError` param and to clean up the doc for `QuantileDiscretizer`.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59915/consoleFull)** for PR 13176 at commit [`5a53051`](https://github.com/apache/spark/commit/5a530519d77f036a90b71a654d377e333e99a7f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221760381
  
    **[Test build #59328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59328/consoleFull)** for PR 13176 at commit [`ba832aa`](https://github.com/apache/spark/commit/ba832aa3489005ddfde7eea63388bb4635569aa0).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64273057
  
    --- Diff: docs/ml-features.md ---
    @@ -1092,14 +1097,11 @@ for more details on the API.
     ## QuantileDiscretizer
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    -categorical features.
    -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
    -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
    -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    -
    -Note that the result may be different every time you run it, since the sample strategy behind it is
    -non-deterministic.
    +categorical features. The number of bins is set by the `numBuckets` parameter.
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
    --- End diff --
    
    The link here should actually link to the generated HTML Scala API doc (since I don't think there is a section for this function in the user guide). Here's an [example](https://github.com/apache/spark/pull/13176/files#diff-56b7fe109d2c152329452f7da634fb3fR1074) from the example code links above.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59913/consoleFull)** for PR 13176 at commit [`5df77c3`](https://github.com/apache/spark/commit/5df77c39ec98a14530d27366cb32ff236a29917c).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220727902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59036/




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64698136
  
    --- Diff: docs/ml-features.md ---
    @@ -53,7 +53,10 @@ collisions, where different raw features may become the same term after hashing.
     chance of collision, we can increase the target feature dimension, i.e. the number of buckets 
     of the hash table. Since a simple modulo is used to transform the hash function to a column index, 
    --- End diff --
    
    I think we can add it - but we can simply say "The hash function used is MurmurHash 3"
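
The hashing step described in this diff (hash each term, then take a simple modulo to get a column index) can be sketched in a few lines. This is only an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for MurmurHash 3, which is not in the Python standard library, and the names `hashing_tf` and `num_features` are hypothetical.

```python
import zlib


def hashing_tf(terms, num_features):
    """Map a bag of terms to a fixed-length term-frequency vector."""
    vector = [0.0] * num_features
    for term in terms:
        # Stand-in hash (CRC32), assumed here purely for illustration.
        hashed = zlib.crc32(term.encode("utf-8"))
        # A simple modulo turns the hash value into a column index.
        index = hashed % num_features
        vector[index] += 1.0
    return vector


if __name__ == "__main__":
    print(hashing_tf(["spark", "hash", "spark"], num_features=16))
```

Different terms can land on the same index, which is the collision the guide text mentions; a larger `num_features` lowers the chance of that happening.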




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64078981
  
    --- Diff: docs/ml-features.md ---
    @@ -26,7 +26,9 @@ This section covers algorithms for working with features, roughly divided into t
     
     `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
     fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
    -The algorithm combines Term Frequency (TF) counts with the 
    +A binary toggle parameter controls term frequency. When set to true all nonzero frequencies are
    --- End diff --
    
    It controls the output vector values in `CountVectorizer` and the term frequency in `HashingTF`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @GayathriMurali Looks like it could be an issue with bucketing, but I'm not sure how.  What does `df.stat.approxQuantile("hour", Array(1.0/3, 2.0/3), relativeError=0.001)` return?
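
For context on what `relativeError` allows: the `approxQuantile` API documentation describes the guarantee as a bound on the rank of the returned value, `floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)`. The sketch below checks that bound on a plain Python list. It assumes `rank(x)` means the number of values less than or equal to `x`, and `within_rank_bounds` is a hypothetical helper, not a Spark function.

```python
import bisect
import math


def within_rank_bounds(values, candidate, p, err):
    """Check whether candidate is an acceptable answer for quantile p."""
    ordered = sorted(values)
    n = len(ordered)
    lowest_rank = math.floor((p - err) * n)
    highest_rank = math.ceil((p + err) * n)
    # Assumed rank definition: how many values are <= candidate.
    rank = bisect.bisect_right(ordered, candidate)
    return lowest_rank <= rank <= highest_rank


if __name__ == "__main__":
    data = [float(i) for i in range(1, 101)]
    print(within_rank_bounds(data, 50.0, p=0.5, err=0.01))
```

Under this reading, more than one value can satisfy the bound when `err` is greater than zero, so two different answers can both fall within the allowed error.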




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64087912
  
    --- Diff: docs/ml-features.md ---
    @@ -26,7 +26,9 @@ This section covers algorithms for working with features, roughly divided into t
     
     `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
     fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
    -The algorithm combines Term Frequency (TF) counts with the 
    +A binary toggle parameter controls term frequency. When set to true all nonzero frequencies are
    --- End diff --
    
    Yup. That makes sense. Will change it, thanks!




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @jkbradley @MLnick  My bad. Sorry about that!




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64358698
  
    --- Diff: docs/ml-features.md ---
    @@ -1098,9 +1098,9 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins is set by the `numBuckets` parameter.
    -The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
     for a detailed description). The precision of the approximation can be controlled with the
    -`relativeError` parameter. When set to zero, exact quantiles are calculated.
    +`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values.
     
     **Examples**
    --- End diff --
    
    I believe the examples outlined below are no longer accurate with the change to using `approxQuantile` - unless we set `relativeError` to 0. Could we check that they still make sense? We could add something in this section saying, "if we compute exact quantiles, we should get the following DataFrame:"
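
The exact-quantile case raised here can be illustrated without Spark. The sketch below is a simplified model, not `QuantileDiscretizer` itself: it assumes a non-empty input, takes the k-th split to be the smallest value with at least `k / num_buckets` of the data at or below it, pads the splits with `-Infinity` and `+Infinity`, and treats each bucket as lower-inclusive. The names `exact_splits` and `bucket_index` are hypothetical.

```python
import bisect
import math


def exact_splits(values, num_buckets):
    """Return bin boundaries from exact quantiles, padded with +/- infinity."""
    ordered = sorted(values)
    n = len(ordered)
    inner = set()
    for k in range(1, num_buckets):
        # ceil(k * n / num_buckets), computed in integer arithmetic.
        rank = (k * n + num_buckets - 1) // num_buckets
        inner.add(ordered[rank - 1])
    # Duplicate boundaries collapse, so fewer bins than requested are possible.
    return [-math.inf] + sorted(inner) + [math.inf]


def bucket_index(value, splits):
    """Bucket i covers the half-open range [splits[i], splits[i + 1])."""
    index = bisect.bisect_right(splits, value) - 1
    # Keep +infinity inside the last bucket.
    return min(index, len(splits) - 2)


if __name__ == "__main__":
    hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical data
    splits = exact_splits(hours, num_buckets=3)
    print(splits, [bucket_index(h, splits) for h in hours])
```

Because the outer bounds are infinite, every real input value falls into some bucket, which matches the "covering all real values" wording in the guide text.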




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221761281
  
    **[Test build #59328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59328/consoleFull)** for PR 13176 at commit [`ba832aa`](https://github.com/apache/spark/commit/ba832aa3489005ddfde7eea63388bb4635569aa0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221144374
  
    **[Test build #59171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59171/consoleFull)** for PR 13176 at commit [`27acda3`](https://github.com/apache/spark/commit/27acda3745d84f5abdd93d1e2621aa1d211c4a95).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    I just tried with `--master local[8]` and I get the same results as you do. Should I call this out in the example? 




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59694/




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    I will try as well this afternoon.




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59913/




[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @MLnick I agree. Should I make those changes in this same PR?




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64799207
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    @MLnick I am sorry. I did see the email alert, but I was not able to find the comment here. I am addressing it now.
    
    I am assuming you mean "This is especially useful for discrete probabilistic models that model binary, rather than integer, counts." to be consistent in both `HashingTF` and `CountVectorizer`. Other details, such as the term frequencies, are different for `CountVectorizer` (output vector).




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64476690
  
    --- Diff: docs/ml-features.md ---
    @@ -1098,9 +1098,9 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins is set by the `numBuckets` parameter.
    -The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
     for a detailed description). The precision of the approximation can be controlled with the
    -`relativeError` parameter. When set to zero, exact quantiles are calculated.
    +`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values.
     
     **Examples**
    --- End diff --
    
    @MLnick The example is still valid for the default value of the `relativeError` param (0.001). I will leave it as is.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64698276
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    You haven't addressed my previous comment for this part both here and in `HashingTF`:
    
    Let's make this consistent with the doc for HashingTF above.
    
    I'd prefer both to read:
    
    "... optional parameter binary controls the output term frequencies. When set to true, all nonzero term frequencies are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts."

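
As a minimal plain-Python sketch of the wording proposed above (not Spark's implementation; the helper name `term_counts` and the toy vocabulary are assumptions made for illustration), the `binary` toggle can be thought of as clipping every nonzero term count to 1:

```python
from collections import Counter


def term_counts(tokens, vocabulary, binary=False):
    """Count how often each vocabulary term occurs in one document.

    Hypothetical helper for illustration only. When `binary` is true,
    every nonzero count is set to 1 and zero counts stay 0.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    vector = [counts[term] for term in vocabulary]
    if binary:
        vector = [1 if c > 0 else 0 for c in vector]
    return vector


# Toy data: "d" is outside the vocabulary and "c" never occurs.
vocabulary = ["a", "b", "c"]
document = ["a", "b", "b", "a", "d"]

integer_counts = term_counts(document, vocabulary)
binary_counts = term_counts(document, vocabulary, binary=True)
```

With integer counts the vector records how many times each term occurs; with the binary toggle it only records whether the term occurs at all, which is the form that models of binary events expect.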

[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    Ok, at least we know the issue now.
    
    I'd say we can leave the example as is, but let's add something like:
    ```
    Given `numBuckets = 3`, and computing exact quantiles (by setting `relativeError = 0`), we should get the following DataFrame:
    ```
    
    This makes it clear (and also independent of the hardware setup people may have).


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221222998
  
    @GayathriMurali thanks. Made another small comment to make the description of the binary parameter consistent. Also please check the `QuantileDiscretizer` example in the guide (not the example code but the little example section in the guide). It either needs to be updated to reflect the output of `QuantileDiscretizer` with default error param, or we need to add a sentence about "computing exact quantiles" to make it match up.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @BryanCutler @oliverpierson Looks like something is wrong on my side. I just checked again on a fresh build and got the same results. Will dig deeper.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64523407
  
    --- Diff: docs/ml-features.md ---
    @@ -1100,7 +1100,7 @@ for more details on the API.
     categorical features. The number of bins is set by the `numBuckets` parameter.
     The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
     for a detailed description). The precision of the approximation can be controlled with the
    -`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation.
    +`relativeError` parameter. When set to zero, exact quantiles are calculated(**Note:** Computing exact quantiles is an expensive operation).
    --- End diff --
    
    need a space between `calculated` and the `(`


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64001008
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java ---
    @@ -54,6 +54,7 @@ public static void main(String[] args) {
           .setOutputCol("feature")
           .setVocabSize(3)
           .setMinDF(2)
    +      .setBinary(true)
    --- End diff --
    
    I don't think it's really necessary to set each and every possible parameter in every example. I think you can remove the `setBinary` and `setRelativeError` calls from these different examples.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220703632
  
    **[Test build #59014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59014/consoleFull)** for PR 13176 at commit [`e0b1c38`](https://github.com/apache/spark/commit/e0b1c3835a7432bd410cf2887c010ab1998a0ec4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    It would have been easier to keep this one open I think as it has all the comment history. Anyway, I commented on #13745.


[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    @MLnick Please let me know if there is anything else I can help with on this PR.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220565892
  
    **[Test build #58967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58967/consoleFull)** for PR 13176 at commit [`46408bb`](https://github.com/apache/spark/commit/46408bbdb13da94ecd40ba380ee8fc219232d481).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59964/consoleFull)** for PR 13176 at commit [`881a2fc`](https://github.com/apache/spark/commit/881a2fc98217eefc454e2ad310ef28f0c677d14f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64299667
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    + If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete
    --- End diff --
    
    After reading this again, I think you say model(ing) too many times here.  You can change ".. useful for modelling discrete .."  -> ".. useful for discrete.."


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r65223909
  
    --- Diff: docs/ml-features.md ---
    @@ -145,9 +148,11 @@ for more details on the API.
      passed to other algorithms like LDA.
     
      During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
    - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
    + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
      by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
    - included in the vocabulary.
    + included in the vocabulary. Another optional binary toggle parameter controls the output vector.
    --- End diff --
    
    2bfc4f15214a870b3e067f06f37eb506b0070a1f - Commit off master


[GitHub] spark pull request #13176: [SPARK-15997][DOC] Modified user guide and exampl...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali closed the pull request at:

    https://github.com/apache/spark/pull/13176


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64523300
  
    --- Diff: docs/ml-features.md ---
    @@ -1098,9 +1098,9 @@ for more details on the API.
     
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins is set by the `numBuckets` parameter.
    -The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
    +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
     for a detailed description). The precision of the approximation can be controlled with the
    -`relativeError` parameter. When set to zero, exact quantiles are calculated.
    +`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values.
     
     **Examples**
    --- End diff --
    
    @GayathriMurali are you sure about that? Because I get this:
    
    ```
    scala> import org.apache.spark.ml.feature.QuantileDiscretizer
    import org.apache.spark.ml.feature.QuantileDiscretizer
    
    scala> val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
    data: Array[(Int, Double)] = Array((0,18.0), (1,19.0), (2,8.0), (3,5.0), (4,2.2))
    
    scala> val df = spark.createDataFrame(data).toDF("id", "hour")
    df: org.apache.spark.sql.DataFrame = [id: int, hour: double]
    
    scala> val discretizer = new QuantileDiscretizer().setInputCol("hour").setOutputCol("result").setNumBuckets(3)
    discretizer: org.apache.spark.ml.feature.QuantileDiscretizer = quantileDiscretizer_c6622394ff70
    
    scala> discretizer.fit(df).transform(df).show
    +---+----+------+
    | id|hour|result|
    +---+----+------+
    |  0|18.0|   2.0|
    |  1|19.0|   2.0|
    |  2| 8.0|   2.0|
    |  3| 5.0|   2.0|
    |  4| 2.2|   1.0|
    +---+----+------+
    
    
    scala> discretizer.setRelativeError(0).fit(df).transform(df).show
    +---+----+------+
    | id|hour|result|
    +---+----+------+
    |  0|18.0|   2.0|
    |  1|19.0|   2.0|
    |  2| 8.0|   2.0|
    |  3| 5.0|   1.0|
    |  4| 2.2|   0.0|
    +---+----+------+
    ```
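
The two result columns above can be reproduced with a small plain-Python sketch of the bucket assignment step. This is not Spark's code. It assumes each bucket covers the half-open range `[lower, upper)` with outer bounds of `-Infinity` and `+Infinity`, and the split points are hypothetical values chosen only because they are consistent with the printed output; the thread does not show the splits Spark actually computed.

```python
from bisect import bisect_right

INF = float("inf")


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) holding value.

    `splits` must be sorted, start at -inf and end at +inf. The index is
    clamped so that +inf itself still falls into the last bucket.
    """
    index = bisect_right(splits, value) - 1
    return min(index, len(splits) - 2)


hours = [18.0, 19.0, 8.0, 5.0, 2.2]

# Hypothetical splits consistent with the relativeError = 0 output above.
exact_splits = [-INF, 5.0, 8.0, INF]
# Hypothetical splits consistent with the default-relativeError output above.
default_splits = [-INF, 2.2, 5.0, INF]

exact_buckets = [bucketize(h, exact_splits) for h in hours]
default_buckets = [bucketize(h, default_splits) for h in hours]
```

Under these assumptions the only difference between the two runs is where the inner split points fall, which is why a change in the approximation error can move rows such as `5.0` and `2.2` into different bins while the data stays the same.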


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13176#discussion_r64053917
  
    --- Diff: docs/ml-features.md ---
    @@ -1064,7 +1069,8 @@ categorical features.
     The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
     The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
     This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
    -find fewer depending on the data sample values.
    +find fewer depending on the data sample values. Relative precision of the approxQuantile is set using
    --- End diff --
    
    @MLnick I just looked at the PySpark code and noticed the documentation was never updated there either. That's my bad as well. Are you working on SPARK-15442? If you need any help in all of this let me know.


[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    @GayathriMurali what master are you using for spark-shell? With `local[4]` I get the same result as you (the default for me is 8 threads), so the discrepancy is probably due to the difference in parallelism (merging the approximations).


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-221408041
  
    **[Test build #59222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59222/consoleFull)** for PR 13176 at commit [`1028995`](https://github.com/apache/spark/commit/1028995e6456448be8fcd2c8478a5f8218e4892c).


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13176#issuecomment-220725163
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59032/
    Test PASSed.


[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13176
  
    Did you try setting this up so the data has 1 partition only?  That would likely fix the issue with varying # cores affecting results.
    
    Also, can you please create a new JIRA for this PR to make it clear in JIRA what it is addressing?  Please link it to SPARK-15100.  Thanks!


[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13176
  
    **[Test build #59694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59694/consoleFull)** for PR 13176 at commit [`4b1a1fa`](https://github.com/apache/spark/commit/4b1a1fa2d7c9b19b18bf223439abe1318b9795da).

