Posted to reviews@spark.apache.org by lw-lin <gi...@git.apache.org> on 2016/07/17 12:55:18 UTC

[GitHub] spark pull request #14237: [WIP][SPARK-16283][SQL] Implement `percentile_app...

GitHub user lw-lin opened a pull request:

    https://github.com/apache/spark/pull/14237

    [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQL function

    ## What changes were proposed in this pull request?
    
    WIP
    
    ## How was this patch tested?
    
    WIP
    
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark percentile_approx

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14237
    
----
commit 479bf7387f0dcba41ce6ab25b7008c7fd6dd7b07
Author: Liwei Lin <lw...@gmail.com>
Date:   2016-07-17T12:53:15Z

    Implement function `percentile_approx`

----



[GitHub] spark issue #14237: [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14237
  
    **[Test build #62431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62431/consoleFull)** for PR 14237 at commit [`479bf73`](https://github.com/apache/spark/commit/479bf7387f0dcba41ce6ab25b7008c7fd6dd7b07).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class PercentileApprox(`
      * `class QuantileSummaries(`
      * `  case class Stats(value: Double, g: Int, delta: Int)`
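    For context, the `percentile_approx` function that eventually shipped follows Hive's `percentile_approx(col, percentage [, accuracy])` signature. A minimal usage sketch — hedged, since this WIP was closed before merging, so the details below reflect the Hive-compatible form rather than this exact patch:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("percentile-approx-demo").getOrCreate()
    spark.range(0, 1000).createOrReplaceTempView("nums")

    // Approximate median; the optional third argument trades accuracy for memory.
    spark.sql("SELECT percentile_approx(id, 0.5, 10000) FROM nums").show()
    ```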



[GitHub] spark issue #14237: [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14237
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #14237: [WIP][SPARK-16283][SQL] Implement `percentile_app...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin closed the pull request at:

    https://github.com/apache/spark/pull/14237



[GitHub] spark pull request #14237: [WIP][SPARK-16283][SQL] Implement `percentile_app...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin closed the pull request at:

    https://github.com/apache/spark/pull/14237



[GitHub] spark issue #14237: [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14237
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #14237: [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62431/
    Test FAILed.



[GitHub] spark issue #14237: [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14237
  
    **[Test build #62431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62431/consoleFull)** for PR 14237 at commit [`479bf73`](https://github.com/apache/spark/commit/479bf7387f0dcba41ce6ab25b7008c7fd6dd7b07).



[GitHub] spark issue #14237: [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62669/
    Test FAILed.



[GitHub] spark pull request #14237: [WIP][SPARK-16283][SQL] Implement `percentile_app...

Posted by lw-lin <gi...@git.apache.org>.
GitHub user lw-lin reopened a pull request:

    https://github.com/apache/spark/pull/14237

    [WIP][SPARK-16283][SQL] Implement `percentile_approx` SQL function

    I'll reopen once it's ready for review, thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark percentile_approx

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14237
    
----
commit 479bf7387f0dcba41ce6ab25b7008c7fd6dd7b07
Author: Liwei Lin <lw...@gmail.com>
Date:   2016-07-17T12:53:15Z

    Implement function `percentile_approx`

commit 9d75c0a9fae00c40ef931a7c643a45161990cda4
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-07-17T06:42:28Z

    [SPARK-16584][SQL] Move regexp unit tests to RegexpExpressionsSuite
    
    ## What changes were proposed in this pull request?
    This patch moves regexp-related unit tests from StringExpressionsSuite to RegexpExpressionsSuite to match the file name for regexp expressions.
    
    ## How was this patch tested?
    This is a test only change.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #14230 from rxin/SPARK-16584.

commit f7ec0233471c9a4acd4cfe7df28ca96f0fda0c61
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-07-18T02:02:21Z

    [SPARK-16027][SPARKR] Fix R tests SparkSession init/stop
    
    ## What changes were proposed in this pull request?
    
    Fix R SparkSession init/stop, and warnings of reusing existing Spark Context
    
    ## How was this patch tested?
    
    unit tests
    
    shivaram
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #14177 from felixcheung/rsessiontest.

commit 7fcb4231dd940fba91047ea192d569a4763b7631
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-07-18T05:48:00Z

    [SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java
    
    This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python.
    
    This patch was originally written by HyukjinKwon. Closes #14236.
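    A minimal sketch of the rename (assuming an active `SparkSession` named `spark`):
    
    ```scala
    // Deprecated camelCase form: org.apache.spark.sql.functions.monotonicallyIncreasingId
    import org.apache.spark.sql.functions.monotonically_increasing_id
    
    val withIds = spark.range(5).withColumn("row_id", monotonically_increasing_id())
    withIds.show()
    ```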

commit ceed2f29c9c34cd0663bef1fb984066b5a687805
Author: WeichenXu <we...@outlook.com>
Date:   2016-07-18T08:11:53Z

    [MINOR][TYPO] fix fininsh typo
    
    ## What changes were proposed in this pull request?
    
    fininsh => finish
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <We...@outlook.com>
    
    Closes #14238 from WeichenXu123/fix_fininsh_typo.

commit d635cc21baea6e28313c6deea41e5e45353a9014
Author: krishnakalyan3 <kr...@gmail.com>
Date:   2016-07-18T16:46:23Z

    [SPARK-16055][SPARKR] warning added while using sparkPackages with spark-submit
    
    ## What changes were proposed in this pull request?
    https://issues.apache.org/jira/browse/SPARK-16055
    When the sparkPackages argument is passed and we detect that we are in R script mode, we should print a warning that the --packages flag should be used with spark-submit.
    
    ## How was this patch tested?
    In my system locally
    
    Author: krishnakalyan3 <kr...@gmail.com>
    
    Closes #14179 from krishnakalyan3/spark-pkg.

commit e01f19582cc724028b60bcf1ee1f8b4d33d91efd
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-07-18T16:49:14Z

    [SPARK-16351][SQL] Avoid per-record type dispatch in JSON when writing
    
    ## What changes were proposed in this pull request?
    
    Currently, `JacksonGenerator.apply` is doing type-based dispatch for each row to write appropriate values.
    It might not have to be done like this because the schema is already kept.
    
    So, appropriate writers can be created once from the schema up front and then applied to each row. This approach is similar to `CatalystWriteSupport`.
    
    This PR corrects `JacksonGenerator` so that it creates all writers for the schema once and then applies them to each row rather than type dispatching for every row.
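    A minimal sketch of the idea (simplified pseudocode, not Spark's actual `JacksonGenerator` code): resolve one writer per field from the schema up front, then reuse those writers for every row.
    
    ```scala
    type ValueWriter = Any => String
    
    // Built once per schema, not once per row.
    def makeWriter(dataType: String): ValueWriter = dataType match {
      case "int"    => v => v.toString
      case "string" => v => "\"" + v + "\""
      case _        => v => "" + v
    }
    
    val writers: Array[ValueWriter] = Array(makeWriter("int"), makeWriter("string"))
    
    // Applied per row: no type dispatch left on the hot path.
    def writeRow(row: Array[Any]): String =
      row.zip(writers).map { case (v, w) => w(v) }.mkString("{", ",", "}")
    ```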
    
    The benchmark was run with the code below:
    
    ```scala
    test("Benchmark for JSON writer") {
      val N = 500 << 8
      val row =
        """{"struct":{"field1": true, "field2": 92233720368547758070},
          "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
          "arrayOfString":["str1", "str2"],
          "arrayOfInteger":[1, 2147483647, -2147483648],
          "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
          "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
          "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
          "arrayOfBoolean":[true, false, true],
          "arrayOfNull":[null, null, null, null],
          "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
          "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
          "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
         }"""
      val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
      val benchmark = new Benchmark("JSON writer", N)
      benchmark.addCase("writing JSON file", 10) { _ =>
        withTempPath { path =>
          df.write.format("json").save(path.getCanonicalPath)
        }
      }
      benchmark.run()
    }
    ```
    
    This produced the results below
    
    - **Before**
    
    ```
    JSON writer:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    writing JSON file                             1675 / 1767          0.1       13087.5       1.0X
    ```
    
    - **After**
    
    ```
    JSON writer:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    writing JSON file                             1597 / 1686          0.1       12477.1       1.0X
    ```
    
    In addition, I ran this benchmark 10 times for each and calculated the average elapsed time as below:
    
    | **Before** | **After**|
    |---------------|------------|
    |17478ms  |16669ms |
    
    This is roughly a ~5% improvement.
    
    ## How was this patch tested?
    
    Existing tests should cover this.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #14028 from HyukjinKwon/SPARK-16351.

commit fd062fd577dfc3ddc50b371c67016730c09f7b20
Author: Daoyuan Wang <da...@intel.com>
Date:   2016-07-18T20:58:12Z

    [SPARK-16515][SQL] set default record reader and writer for script transformation
    
    ## What changes were proposed in this pull request?
    In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 has deleted those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerDe, and the data written cannot be read back by the script. The test case added here worked fine with previous versions of Spark, but would fail now.
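    For reference, a hedged sketch of the affected syntax (requires Hive support; `src` is a placeholder table): with no explicit RECORDREADER/RECORDWRITER clause, the defaults this patch sets are what make the round trip work.
    
    ```scala
    spark.sql(
      """SELECT TRANSFORM (key, value)
        |USING 'cat' AS (tKey, tValue)
        |FROM src
      """.stripMargin).show()
    ```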
    
    ## How was this patch tested?
    added a test case in SQLQuerySuite.
    
    Closes #14169
    
    Author: Daoyuan Wang <da...@intel.com>
    Author: Yin Huai <yh...@databricks.com>
    
    Closes #14249 from yhuai/scriptTransformation.

commit 20e20d600c8eb80e7399902e9887268d27508eac
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-07-18T23:01:57Z

    [SPARKR][DOCS] minor code sample update in R programming guide
    
    ## What changes were proposed in this pull request?
    
    Fix code style from ad hoc review of RC4 doc
    
    ## How was this patch tested?
    
    manual
    
    shivaram
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #14250 from felixcheung/rdocs2rc4.

commit 59a4af762c8d3c5185ec0f4fdccdf36e694f2438
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-07-19T00:17:37Z

    [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly
    
    ## What changes were proposed in this pull request?
    
    This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` has relied on `checkHiveQl` to ensure **successful SQL generation** and **answer equality**. However, that does not guarantee the generated SQL stays the same or will not change unnoticed.
    
    ## How was this patch tested?
    
    Passes Jenkins. This is only a test-suite change.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #14235 from dongjoon-hyun/SPARK-16590.

commit b1abe29160f857fcc6d3b14b14f3cf2c019bf8a4
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-07-19T00:56:36Z

    [HOTFIX] Fix Scala 2.10 compilation

commit 847111bd44ed2a5f1ad5b6fde91a884c38b3e6a0
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-07-19T01:03:35Z

    [SPARK-16615][SQL] Expose sqlContext in SparkSession
    
    ## What changes were proposed in this pull request?
    This patch removes the private[spark] qualifier for SparkSession.sqlContext, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/Re-transtition-SQLContext-to-SparkSession-td18342.html
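    A minimal sketch of what this makes possible (the field was previously private[spark]):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().getOrCreate()
    val sqlContext = spark.sqlContext   // now public: handy for pre-2.0 code paths
    sqlContext.sql("SELECT 1").show()
    ```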
    
    ## How was this patch tested?
    N/A - this is a visibility change.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #14252 from rxin/SPARK-16615.

commit 6a001a95c893507a7715bb93045d1e2083b9cc74
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-07-19T05:57:13Z

    [MINOR] Remove unused arg in als.py
    
    ## What changes were proposed in this pull request?
    The second argument of the `update()` method is never used, so I deleted it.
    
    ## How was this patch tested?
    local run with `./bin/spark-submit examples/src/main/python/als.py`
    
    Author: Zheng RuiFeng <ru...@foxmail.com>
    
    Closes #14247 from zhengruifeng/als_refine.

commit 04c4d6dfc89a25b503d8567878771cb8d246034a
Author: Cheng Lian <li...@databricks.com>
Date:   2016-07-19T06:07:59Z

    [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update
    
    ## What changes were proposed in this pull request?
    
    This PR moves the one remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that every "Sql" in the file names becomes "SQL".
    
    ## How was this patch tested?
    
    Manually verified the generated HTML page.
    
    Author: Cheng Lian <li...@databricks.com>
    
    Closes #14245 from liancheng/minor-scala-example-update.

commit ab95df633e2b8e076f9fbe2839317a216fd10418
Author: Mortada Mehyar <mo...@gmail.com>
Date:   2016-07-19T06:49:47Z

    [DOC] improve python doc for rdd.histogram and dataframe.join
    
    ## What changes were proposed in this pull request?
    
    doc change only
    
    ## How was this patch tested?
    
    doc change only
    
    Author: Mortada Mehyar <mo...@gmail.com>
    
    Closes #14253 from mortada/histogram_typos.

commit d9006d08c01a48ae3812f1c42b6ae46bd211571e
Author: Liwei Lin <lw...@gmail.com>
Date:   2016-07-19T17:24:48Z

    [SPARK-16620][CORE] Add back the tokenization process in `RDD.pipe(command: String)`
    
    ## What changes were proposed in this pull request?
    
    Currently `RDD.pipe(command: String)`:
    - works only when the command is specified without any options, such as `RDD.pipe("wc")`
    - does NOT work when the command is specified with some options, such as `RDD.pipe("wc -l")`
    
    This is a regression from Spark 1.6.
    
    This patch adds back the tokenization process in `RDD.pipe(command: String)` to fix this regression.
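    The regression in two lines (assuming an active `SparkContext` named `sc`):
    
    ```scala
    val rdd = sc.parallelize(Seq("hello world", "foo bar"), 2)
    rdd.pipe("wc").collect()      // worked with and without this patch
    rdd.pipe("wc -l").collect()   // failed on master before this patch; works again now
    ```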
    
    ## How was this patch tested?
    Added a test which:
    - would pass in `1.6`
    - _[prior to this patch]_ would fail in `master`
    - _[after this patch]_ would pass in `master`
    
    Author: Liwei Lin <lw...@gmail.com>
    
    Closes #14256 from lw-lin/rdd-pipe.

commit 626f91e593a4764dade173c27d0c20feee4843a2
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-07-19T17:28:17Z

    [SPARK-16602][SQL] `Nvl` function should support numeric-string cases
    
    ## What changes were proposed in this pull request?
    
    The `Nvl` function should support numeric-string cases, as in Hive and Spark 1.6. Currently, `Nvl` finds the tightest common type among numeric types. This PR extends that to consider the `String` type, too.
    
    ```scala
    - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype =>
    + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype =>
    ```
    
    **Before**
    ```scala
    scala> sql("select nvl('0', 1)").collect()
    org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch:
    input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7
    ```
    
    **After**
    ```scala
    scala> sql("select nvl('0', 1)").collect()
    res0: Array[org.apache.spark.sql.Row] = Array([0])
    ```
    
    ## How was this patch tested?
    
    Passes the Jenkins tests.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #14251 from dongjoon-hyun/SPARK-16602.

commit 8989d276c589197af87892f76b45cc0ac0557c2a
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-07-19T10:51:43Z

    [MINOR][BUILD] Fix Java Linter `LineLength` errors
    
    ## What changes were proposed in this pull request?
    
    This PR fixes four Java linter `LineLength` errors. They are all minor, but we had better remove all Java linter errors before the release.
    
    ## How was this patch tested?
    
    After passing Jenkins, ran `./dev/lint-java`.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #14255 from dongjoon-hyun/minor_java_linter.

commit 89eea7fdd17809c6802b07291a69fcc6403c2386
Author: Xin Ren <ia...@126.com>
Date:   2016-07-19T10:59:46Z

    [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
    
    https://issues.apache.org/jira/browse/SPARK-16535
    
    ## What changes were proposed in this pull request?
    
    While scanning through the pom.xml files of the sub-projects, I found the warning below (screenshot attached):
    ```
    Definition of groupId is redundant, because it's inherited from the parent
    ```
    ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)
    
    I tried removing some of the lines with the groupId definition, and the build on my local machine is still OK.
    ```
    <groupId>org.apache.spark</groupId>
    ```
    I just found that `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven 3 supports versionless parent elements: since Maven 3.1, there is no need to specify the parent version in sub-modules, which is great.
    
    ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762
    
    ## How was this patch tested?
    
    I've tested by re-building the project, and build succeeded.
    
    Author: Xin Ren <ia...@126.com>
    
    Closes #14189 from keypointt/SPARK-16535.

commit b14885be0489d8e080ddcfb367e98e51885b8036
Author: Ahmed Mahran <ah...@mashin.io>
Date:   2016-07-19T11:01:54Z

    [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar
    
    ## What changes were proposed in this pull request?
    
    Minor fixes correcting some typos, punctuations, grammar.
    Adding more anchors for easy navigation.
    Fixing minor issues with code snippets.
    
    ## How was this patch tested?
    
    `jekyll serve`
    
    Author: Ahmed Mahran <ah...@mashin.io>
    
    Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

commit 20ac6debfd3424d3f018c8b7f45fd71d7b87074f
Author: WeichenXu <we...@outlook.com>
Date:   2016-07-19T11:07:40Z

    [SPARK-16600][MLLIB] fix some latex formula syntax error
    
    ## What changes were proposed in this pull request?
    
    `\partial\x` ==> `\partial x`
    `har{x_i}` ==> `hat{x_i}`
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <We...@outlook.com>
    
    Closes #14246 from WeichenXu123/fix_formular_err.

commit 4063eb39aa3d18ea7292826ac65fd483eecc6fb9
Author: Sean Owen <so...@cloudera.com>
Date:   2016-07-19T11:10:24Z

    [SPARK-16395][STREAMING] Fail if too many CheckpointWriteHandlers are queued up in the fixed thread pool
    
    ## What changes were proposed in this pull request?
    
    Begin failing if queued checkpoint writes are outpacing storage's ability to write them, to fail fast instead of slowly filling memory.
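    A minimal sketch of the fail-fast idea (illustrative only, not the actual Spark code; the pool size and queue bound are made-up numbers):
    
    ```scala
    import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}
    
    val pool = new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.SECONDS,
      new ArrayBlockingQueue[Runnable](1000),   // cap the pending checkpoint writes
      new ThreadPoolExecutor.AbortPolicy())     // reject instead of queueing unboundedly
    
    // pool.execute(handler) now throws RejectedExecutionException once the
    // queue is full, instead of slowly filling memory with queued writes.
    ```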
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #14152 from srowen/SPARK-16395.

commit 17315f019917e6074298a5ea070d51315e4eb728
Author: Michał Wesołowski <mi...@bzwbk.pl>
Date:   2016-07-19T11:18:42Z

    [SPARK-16478] graphX (added graph caching in strongly connected components)
    
    ## What changes were proposed in this pull request?
    
    I added caching in every iteration for the sccGraph that is returned by strongly connected components. Without this cache, strongly connected components returned a graph that had to be recomputed from scratch once some intermediate caches no longer existed.
    
    ## How was this patch tested?
    I tested it by running code similar to the one [on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html). Basically, I generated a large graph and computed strongly connected components with the changed code, then simply ran count on the vertices and edges. After this update the count takes a few seconds instead of 20 minutes.
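    A minimal sketch of the caching pattern (assumed shape, not the patch itself; `graph`, `step` and `numIters` are placeholders, with `step` standing in for one SCC iteration):
    
    ```scala
    var g = graph.cache()
    for (_ <- 1 to numIters) {
      val prev = g
      g = step(g).cache()
      g.vertices.count()                 // materialize before dropping the old graph
      prev.unpersist(blocking = false)   // release the previous iteration's cache
    }
    ```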
    
    # statement
    contribution is my original work and I license the work to the project under the project's open source license.
    
    Author: Michał Wesołowski <mi...@bzwbk.pl>
    
    Closes #14137 from wesolowskim/SPARK-16478.

commit 18d8dabad12ab4b1737cfee5c1129b2e1c4d99fb
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-07-19T11:31:04Z

    [SPARK-16494][ML] Upgrade breeze version to 0.12
    
    ## What changes were proposed in this pull request?
    breeze 0.12 has been out for more than half a year, and it brings lots of new features, performance improvements and bug fixes.
    One of the biggest features is ```LBFGS-B```, an implementation of ```LBFGS``` with box constraints that is much faster for some special cases.
    We would like to implement the Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)), and it requires ```LBFGS-B``` as the optimization solver, so we should bump the breeze dependency to 0.12.
    For more features, improvements and bug fixes in breeze 0.12, refer to the following link:
    https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
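    For illustration, the coordinates of the bump in sbt form (Spark itself manages this through its Maven poms):
    
    ```scala
    libraryDependencies += "org.scalanlp" %% "breeze" % "0.12"
    ```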
    
    ## How was this patch tested?
    No new tests, should pass the existing ones.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #14150 from yanboliang/spark-16494.

commit fc7f406aff09cb391e2cee492299e317f778bc0d
Author: Yin Huai <yh...@databricks.com>
Date:   2016-07-19T19:58:08Z

    [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false.
    
    ## What changes were proposed in this pull request?
    In 2.0, we added new logic to convert HiveTableScan on ORC tables to Spark's native code path. However, during this conversion, we drop the original metastore schema (https://issues.apache.org/jira/browse/SPARK-15705). Because of this regression, I am changing the default value of `spark.sql.hive.convertMetastoreOrc` to false.
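    Users who still want the native ORC path can opt back in explicitly, e.g.:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .config("spark.sql.hive.convertMetastoreOrc", "true")   // default is now false
      .enableHiveSupport()
      .getOrCreate()
    ```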
    
    Author: Yin Huai <yh...@databricks.com>
    
    Closes #14267 from yhuai/SPARK-15705-changeDefaultValue.

commit 3e69692a1cd537de455c77d98c0205605b1fc10b
Author: Andrew Duffy <ro...@aduffy.org>
Date:   2016-07-20T00:08:38Z

    [SPARK-14702] Make environment of SparkLauncher launched process more configurable
    
    ## What changes were proposed in this pull request?
    
    Adds a few public methods to `SparkLauncher` to allow configuring extra features of the underlying `ProcessBuilder`, including the working directory and output/error stream redirection.
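    A hedged usage sketch, assuming the method names this patch adds (`directory`, `redirectOutput`, `redirectError`); paths and class names are placeholders:
    
    ```scala
    import java.io.File
    import java.lang.ProcessBuilder.Redirect
    import org.apache.spark.launcher.SparkLauncher
    
    val process = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.Main")
      .directory(new File("/tmp/work"))                        // working directory
      .redirectOutput(Redirect.to(new File("/tmp/out.log")))   // stdout
      .redirectError(Redirect.to(new File("/tmp/err.log")))    // stderr
      .launch()
    ```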
    
    ## How was this patch tested?
    
    Unit testing + simple Spark driver programs
    
    Author: Andrew Duffy <ro...@aduffy.org>
    
    Closes #14201 from andreweduffy/feature/launcher.

commit e2da70e770e59970dca8817c476c4b754d4e22ff
Author: WeichenXu <we...@outlook.com>
Date:   2016-07-20T01:48:41Z

    [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code
    
    ## What changes were proposed in this pull request?
    
    Update the `refreshTable` API in the Python code of the SQL programming guide.
    
    This API was added in SPARK-15820.
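    The Scala counterpart of the documented Python call, for reference (`my_table` is a placeholder):
    
    ```scala
    spark.catalog.refreshTable("my_table")
    ```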
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <We...@outlook.com>
    
    Closes #14220 from WeichenXu123/update_sql_doc_catalog.

commit 89eb4d07d1d687603b0a7b577ddccb2a3d32d4d4
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2016-07-20T02:28:08Z

    [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite
    
    ## What changes were proposed in this pull request?
    
    This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
    
    ## How was this patch tested?
    SparkR unit tests, SparkSubmitSuite, check-cran.sh
    
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    
    Closes #14243 from shivaram/sparkr-jar-move.

commit d701b8ef35ca814d8f2e69dd3d511468173c5ad8
Author: Anthony Truchet <a....@criteo.com>
Date:   2016-07-20T09:39:59Z

    [SPARK-16440][MLLIB] Destroy broadcasted variables even on driver
    
    ## What changes were proposed in this pull request?
    Forgotten broadcast variables were unpersisted in a previous PR (#14153). This PR turns those `unpersist()` calls into `destroy()` so that memory is freed even on the driver.
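    The difference in a minimal sketch (assuming an active `SparkContext` named `sc`):
    
    ```scala
    val bc = sc.broadcast(Array.fill(1 << 20)(0.0))
    // ... use bc in jobs ...
    bc.unpersist()   // frees executor copies only; the driver copy remains
    bc.destroy()     // frees the driver copy too; bc is unusable afterwards
    ```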
    
    ## How was this patch tested?
    Unit Tests in Word2VecSuite were run locally.
    
    This contribution is done on behalf of Criteo, according to the
    terms of the Apache license 2.0.
    
    Author: Anthony Truchet <a....@criteo.com>
    
    Closes #14268 from AnthonyTruchet/SPARK-16440.

commit 07e1b447567755a5cc60941b945cdf1b4db36b78
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2016-07-20T05:00:22Z

    [SPARK-16632][SQL] Respect Hive schema when merging parquet schema.
    
    When Hive (or at least certain versions of Hive) creates parquet files
    containing tinyint or smallint columns, it stores them as int32, but
    doesn't annotate the parquet field as containing the corresponding
    int8 / int16 data. When Spark reads those files using the vectorized
    reader, it follows the parquet schema for these fields, but when
    actually reading the data it tries to use the type fetched from
    the metastore, and then fails because data has been loaded into the
    wrong fields in OnHeapColumnVector.
    
    So instead of blindly trusting the parquet schema, check whether the
    Catalyst-provided schema disagrees with it, and adjust the types so
    that the necessary metadata is present when loading the data into
    the ColumnVector instance.
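A hedged repro sketch: the table must be created and populated by Hive itself
(e.g. from the Hive CLI), since it is Hive that writes tinyint/smallint as
unannotated int32:

```scala
// In Hive:
//   CREATE TABLE small_ints (b TINYINT, s SMALLINT) STORED AS PARQUET;
//   INSERT INTO small_ints VALUES (1, 2);

// In Spark (with Hive support), before this patch the vectorized reader
// loaded the data into the wrong OnHeapColumnVector fields:
spark.sql("SELECT * FROM small_ints").show()
```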
    
    Tested with unit tests and with tests that create byte / short columns
    in Hive and try to read them from Spark.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #14272 from vanzin/SPARK-16632.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org