Posted to reviews@spark.apache.org by lnmohankumar <gi...@git.apache.org> on 2017/03/28 11:39:20 UTC

[GitHub] spark pull request #17456: Branch 2.1

GitHub user lnmohankumar opened a pull request:

    https://github.com/apache/spark/pull/17456

    Branch 2.1

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17456
    
----
commit 5693ac8e5bd5df8aca1b0d6df0be072a45abcfbd
Author: Xiangrui Meng <me...@databricks.com>
Date:   2016-12-14T00:59:09Z

    [SPARK-18793][SPARK-18794][R] add spark.randomForest/spark.gbt to vignettes
    
    ## What changes were proposed in this pull request?
    
    Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc.
    
    cc: jkbradley
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #16264 from mengxr/SPARK-18793.
    
    (cherry picked from commit 594b14f1ebd0b3db9f630e504be92228f11b4d9f)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 019d1fa3d421b5750170429fc07b204692b7b58e
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-14T02:36:36Z

    [SPARK-18588][TESTS] Ignore KafkaSourceStressForDontFailOnDataLossSuite
    
    ## What changes were proposed in this pull request?
    
    Disable KafkaSourceStressForDontFailOnDataLossSuite for now.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16275 from zsxwing/ignore-flaky-test.
    
    (cherry picked from commit e104e55c16e229e521c517393b8163cbc3bbf85a)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 8ef005931a242d087f4879805571be0660aefaf9
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2016-12-14T02:52:05Z

    [MINOR][SPARKR] fix kstest example error and add unit test
    
    ## What changes were proposed in this pull request?
    
    While adding vignettes for kstest, I found some errors in the example:
    1. There is a typo in the kstest example;
    2. print.summary.KStest doesn't work with the example.
    
    This patch fixes the example errors and adds a new unit test for print.summary.KStest.
    
    ## How was this patch tested?
    Manual test;
    Add new unit test;
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #16259 from wangmiao1981/ks.
    
    (cherry picked from commit f2ddabfa09fda26ff0391d026dd67545dab33e01)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit f999312e72940b559738048646013eec9e68d657
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-12-14T10:09:31Z

    [SPARK-18814][SQL] CheckAnalysis rejects TPCDS query 32
    
    ## What changes were proposed in this pull request?
    Move the checking of GROUP BY column in correlated scalar subquery from CheckAnalysis
    to Analysis to fix a regression caused by SPARK-18504.
    
    This problem can be reproduced with a simple script now.
    
    Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
    Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
    sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
    
    The requirements are:
    1. We need to reference the same table twice in both the parent and the subquery. Here is the table c.
    2. We need to have a correlated predicate but to a different table. Here is from c (as c1) in the subquery to p in the parent.
    3. We then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. When we then compare `ck#<n1>#<n2>` with the original GROUP BY column `ck#<n1>` by their canonicalized forms, `#<n2> != #<n1>`, which triggers the exception added in SPARK-18504.
    
    ## How was this patch tested?
    
    SubquerySuite and a simplified version of TPCDS-Q32
    
    Author: Nattavut Sutyanyong <ns...@gmail.com>
    
    Closes #16246 from nsyca/18814.
    
    (cherry picked from commit cccd64393ea633e29d4a505fb0a7c01b51a79af8)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 16d4bd4a25e70e9396b3451a53157f7cc41c1359
Author: Cheng Lian <li...@databricks.com>
Date:   2016-12-14T18:57:03Z

    [SPARK-18730] Post Jenkins test report page instead of the full console output page to GitHub
    
    ## What changes were proposed in this pull request?
    
    Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while.
    
    This PR makes the build script post the test report page link to GitHub instead. The test report page is far more concise and is usually the first page I'd like to check when investigating a Jenkins build failure.
    
    Note that for builds where a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page.
    
    ## How was this patch tested?
    
    N/A.
    
    Author: Cheng Lian <li...@databricks.com>
    
    Closes #16163 from liancheng/jenkins-test-report.
    
    (cherry picked from commit ba4aab9b85688141d3d0c185165ec7a402c9fbba)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit af12a21ca7145751acdec400134b1bd5c8168f74
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-12-14T19:29:11Z

    [SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side post-filter for FileFormat datasources
    
    ## What changes were proposed in this pull request?
    
    Currently, `FileSourceStrategy` does not handle the case when the pushed-down filter is `Literal(null)`, and removes it from the Spark-side post-filter.
    
    For example, the codes below:
    
    ```scala
    val df = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDF()
    df.filter($"_1" === "true").explain(true)
    ```
    
    shows it keeps `null` properly.
    
    ```
    == Parsed Logical Plan ==
    'Filter ('_1 = true)
    +- LocalRelation [_1#17]
    
    == Analyzed Logical Plan ==
    _1: boolean
    Filter (cast(_1#17 as double) = cast(true as double))
    +- LocalRelation [_1#17]
    
    == Optimized Logical Plan ==
    Filter (isnotnull(_1#17) && null)
    +- LocalRelation [_1#17]
    
    == Physical Plan ==
    *Filter (isnotnull(_1#17) && null)       << Here `null` is there
    +- LocalTableScan [_1#17]
    ```
    
    However, when we read it back from Parquet,
    
    ```scala
    val path = "/tmp/testfile"
    df.write.parquet(path)
    spark.read.parquet(path).filter($"_1" === "true").explain(true)
    ```
    
    `null` is removed at the post-filter.
    
    ```
    == Parsed Logical Plan ==
    'Filter ('_1 = true)
    +- Relation[_1#11] parquet
    
    == Analyzed Logical Plan ==
    _1: boolean
    Filter (cast(_1#11 as double) = cast(true as double))
    +- Relation[_1#11] parquet
    
    == Optimized Logical Plan ==
    Filter (isnotnull(_1#11) && null)
    +- Relation[_1#11] parquet
    
    == Physical Plan ==
    *Project [_1#11]
    +- *Filter isnotnull(_1#11)       << Here `null` is missing
       +- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
    ```
    
    This PR fixes it so the filter is kept properly. In more detail,
    
    ```scala
    val partitionKeyFilters =
      ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
    ```
    
    This keeps the `null` filter in `partitionKeyFilters`, because a `Literal` never has `children`, so its `references` set is empty, and the empty set is always a subset of `partitionSet`.
    
    And then in
    
    ```scala
    val afterScanFilters = filterSet -- partitionKeyFilters
    ```
    
    the `null` filter is always removed from the post-filter. So, if a filter's referenced fields are empty, it should be applied to the data columns as well.
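    
    To make the "empty references" point concrete, here is a minimal sketch (an illustration only, assuming Spark's Catalyst classes are on the classpath; it is not code from this PR):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.types.BooleanType
    
    // A null literal has no children, so its reference set is empty, and the empty
    // set is a subset of any partition column set. That is why it slipped into
    // `partitionKeyFilters` and was then subtracted away from the post-filter.
    val nullFilter = Literal(null, BooleanType)
    assert(nullFilter.children.isEmpty)
    assert(nullFilter.references.isEmpty)
    ```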
    
    After this PR, it becomes as below:
    
    ```
    == Parsed Logical Plan ==
    'Filter ('_1 = true)
    +- Relation[_1#276] parquet
    
    == Analyzed Logical Plan ==
    _1: boolean
    Filter (cast(_1#276 as double) = cast(true as double))
    +- Relation[_1#276] parquet
    
    == Optimized Logical Plan ==
    Filter (isnotnull(_1#276) && null)
    +- Relation[_1#276] parquet
    
    == Physical Plan ==
    *Project [_1#276]
    +- *Filter (isnotnull(_1#276) && null)
       +- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b..., PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
    ```
    
    ## How was this patch tested?
    
    Unit test in `FileSourceStrategySuite`
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #16184 from HyukjinKwon/SPARK-18753.
    
    (cherry picked from commit 89ae26dcdb73266fbc3a8b6da9f5dff30dc4ec95)
    Signed-off-by: Cheng Lian <li...@databricks.com>

commit e8866f9fc62095b78421d461549f7eaf8e9070b3
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-12-14T20:22:49Z

    [SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics
    
    ## What changes were proposed in this pull request?
    This patch reduces the default element count estimate for arrays and maps from 100 to 1. The issue with 100 is that when nested (e.g. an array of maps), 100 * 100 would be used as the default size. That sounds like a mere overestimation, which doesn't seem that bad (since it is usually better to overestimate than underestimate). However, due to the way the output size of a Project is estimated (new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume 1 default element.
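    
    As a rough, hypothetical illustration of how an upstream overestimate turns into an underestimate at the Project (the numbers below are invented for this sketch and are not taken from the patch):
    
    ```scala
    // Suppose a row has an INT column (4 bytes) plus an array-of-map column whose
    // real size is about 40 bytes but whose default estimate was 100 * 100 elements,
    // say roughly 80,000 bytes.
    val realChildRowSize      = 4 + 40
    val estimatedChildRowSize = 4 + 80000
    
    // Project scales the child's size estimate by outputRowSize / childRowSize.
    // If the Project keeps only the INT column:
    val estimatedScaling = 4.0 / estimatedChildRowSize // ~0.00005
    val realScaling      = 4.0 / realChildRowSize      // ~0.09
    
    // The inflated denominator makes the Project output look ~1800x smaller than it
    // really is, so the earlier overestimate has become an underestimate here.
    println(f"estimated=$estimatedScaling%.6f real=$realScaling%.4f")
    ```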
    
    ## How was this patch tested?
    This should be covered by existing tests.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #16274 from rxin/SPARK-18853.
    
    (cherry picked from commit 5d799473696a15fddd54ec71a93b6f8cb169810c)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit c4de90fc76d5aa5d2c8fee4ed692d4ab922cbab0
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-14T21:36:41Z

    [SPARK-18852][SS] StreamingQuery.lastProgress should be null when recentProgress is empty
    
    ## What changes were proposed in this pull request?
    
    Right now `StreamingQuery.lastProgress` throws NoSuchElementException, which makes it hard to use from Python since the Python user will just see a Py4JError.
    
    This PR just makes it return null instead.
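    
    A minimal sketch of the intended behavior (it uses the internal `MemoryStream` test source purely for illustration; the names and setup below are assumptions, not code from this PR):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.streaming.MemoryStream
    
    val spark = SparkSession.builder().master("local[2]").appName("lastProgress-demo").getOrCreate()
    import spark.implicits._
    implicit val sqlContext = spark.sqlContext
    
    val input = MemoryStream[Int]
    val query = input.toDF().writeStream.format("memory").queryName("demo").start()
    
    // With this change, lastProgress is null (instead of throwing
    // NoSuchElementException) as long as recentProgress is still empty.
    if (query.recentProgress.isEmpty) {
      assert(query.lastProgress == null)
    }
    
    query.stop()
    spark.stop()
    ```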
    
    ## How was this patch tested?
    
    `test("lastProgress should be null when recentProgress is empty")`
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16273 from zsxwing/SPARK-18852.
    
    (cherry picked from commit 1ac6567bdb03d7cc5c5f3473827a102280cb1030)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit d0d9c5725774897703f2611484838ec7ed09e84f
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-12-14T22:10:40Z

    [SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes
    
    ## What changes were proposed in this pull request?
    
    Added short section for KSTest.
    Also added logreg model to list of ML models in vignette.  (This will be reorganized under SPARK-18849)
    
    ![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)
    
    ## How was this patch tested?
    
    Manually tested example locally.
    Built vignettes locally.
    
    Author: Joseph K. Bradley <jo...@databricks.com>
    
    Closes #16283 from jkbradley/ksTest-vignette.
    
    (cherry picked from commit 78627425708a0afbe113efdf449e8622b43b652d)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 280c35af97a20b15578c14b20aa8c19d8fe75456
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-12-15T00:12:14Z

    [SPARK-18854][SQL] numberedTreeString and apply(i) inconsistent for subqueries
    
    ## What changes were proposed in this pull request?
    This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries.
    
    This patch fixes the bug.
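    
    A small sketch of how the two methods are expected to line up after the fix (the local session and toy plan below are illustrative assumptions):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[1]").appName("treeNode-numbering").getOrCreate()
    
    // Build a tiny analyzed plan and compare the two APIs: the node printed with
    // number i by numberedTreeString should be the node returned by apply(i).
    val plan = spark.range(10).filter("id > 5").queryExecution.analyzed
    println(plan.numberedTreeString) // each node prefixed with its depth-first number
    println(plan(1))                 // should be the same node labeled 1 above
    
    spark.stop()
    ```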
    
    ## How was this patch tested?
    Added a test case in SubquerySuite.scala to test both the depth-first traversal of numbering as well as making sure the two methods are consistent.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #16277 from rxin/SPARK-18854.
    
    (cherry picked from commit ffdd1fcd1e8f4f6453d5b0517c0ce82766b8e75f)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 0d94201e0102fd5890ba07da6dd518cec7334b2b
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2016-12-15T01:07:27Z

    [SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates
    
    ## What changes were proposed in this pull request?
    
    While doing the QA work, I found the following issues:
    
    1). `spark.mlp` doesn't include an example;
    2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
    3). The `spark.lda` documentation is missing default values for some parameters.
    
    I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
    
    ## How was this patch tested?
    
    Manual test
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #16284 from wangmiao1981/ks.
    
    (cherry picked from commit 324388531648de20ee61bd42518a068d4789925c)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit cb2c8428df0607cfbb17a2c874f8228561a2e8ef
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-12-15T05:03:56Z

    [SPARK-18856][SQL] non-empty partitioned table should not report zero size
    
    ## What changes were proposed in this pull request?
    
    In `DataSource`, if the table is not analyzed, we will use 0 as the default value for the table size. This is dangerous: we may broadcast a large table and cause an OOM. We should use `defaultSizeInBytes` instead.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #16280 from cloud-fan/bug.
    
    (cherry picked from commit d6f11a12a146a863553c5a5e2023d79d4375ef3f)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit b14fc391893468e25de1e24d982d6f260cac59ad
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-12-15T05:08:45Z

    [SPARK-18869][SQL] Add TreeNode.p that returns BaseType
    
    ## What changes were proposed in this pull request?
    After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce a function that returns the BaseType.
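    
    For example, in spark-shell one might use it like this (a sketch only; `spark` is the shell's session and the plan is a toy example):
    
    ```scala
    // apply(i) now returns TreeNode[_], so plan-specific members need a cast;
    // p(i) returns the BaseType (here LogicalPlan), so they are available directly.
    val plan = spark.range(10).filter("id > 5").queryExecution.analyzed
    println(plan.p(1).output)   // `output` is a LogicalPlan member, no cast needed
    ```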
    
    ## How was this patch tested?
    N/A - this is a developer only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #16288 from rxin/SPARK-18869.
    
    (cherry picked from commit 5d510c693aca8c3fd3364b4453160bc8585ffc8e)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d399a297d1ec9e0a3c57658cba0320b4d7fe88c5
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-12-15T05:29:20Z

    [SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding `DESCRIPTION` file
    
    ## What changes were proposed in this pull request?
    
    Since Apache Spark 1.4.0, the R API documentation page has had a broken link to the `DESCRIPTION` file because the Jekyll plugin script doesn't copy the file. This PR aims to fix that.
    
    - Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
    - Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
    
    ## How was this patch tested?
    
    Manual.
    
    ```bash
    cd docs
    SKIP_SCALADOC=1 jekyll build
    ```
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #16292 from dongjoon-hyun/SPARK-18875.
    
    (cherry picked from commit ec0eae486331c3977505d261676b77a33c334216)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 2a8de2e11ebab0cb9056444053127619d8a47d8a
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-12-15T05:51:52Z

    [SPARK-18849][ML][SPARKR][DOC] vignettes final check update
    
    ## What changes were proposed in this pull request?
    
    doc cleanup
    
    ## How was this patch tested?
    
    ~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~
    Output html here: https://felixcheung.github.io/sparkr-vignettes.html
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16286 from felixcheung/rvignettespass.
    
    (cherry picked from commit 7d858bc5ce870a28a559f4e81dcfc54cbd128cb7)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit e430915fad7ffb9397a96f0ef16e741c6b4f158b
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-12-15T19:54:35Z

    [SPARK-18870] Disallowed Distinct Aggregations on Streaming Datasets
    
    ## What changes were proposed in this pull request?
    
    Check whether Aggregation operators on a streaming subplan have aggregate expressions with isDistinct = true.
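    
    As a hedged sketch of what this check is meant to reject (the `MemoryStream` source and names below are illustrative assumptions, not code from this PR):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.streaming.MemoryStream
    import org.apache.spark.sql.functions.countDistinct
    
    val spark = SparkSession.builder().master("local[2]").appName("distinct-agg-check").getOrCreate()
    import spark.implicits._
    implicit val sqlContext = spark.sqlContext
    
    val input = MemoryStream[(String, Int)]
    val distinctAgg = input.toDS().toDF("k", "v").groupBy("k").agg(countDistinct("v"))
    
    // Starting this streaming query should now fail analysis with an
    // AnalysisException, because countDistinct has isDistinct = true.
    val attempt = scala.util.Try {
      distinctAgg.writeStream.outputMode("complete").format("console").start()
    }
    println(attempt.isFailure)
    
    spark.stop()
    ```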
    
    ## How was this patch tested?
    
    Added unit test
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #16289 from tdas/SPARK-18870.
    
    (cherry picked from commit 4f7292c87512a7da3542998d0e5aa21c27a511e9)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 900ce558a238fb9d8220527d8313646fe6830695
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-15T21:17:51Z

    [SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource
    
    ## What changes were proposed in this pull request?
    
    When starting a stream with a lot of backfill and maxFilesPerTrigger, the user often wants to start with the most recent files first. This lets you keep latency low for recent data while slowly backfilling historical data.
    
    This PR adds a new option `latestFirst` to control this behavior. When it's true, `FileStreamSource` will sort the files by the modified time from latest to oldest, and take the first `maxFilesPerTrigger` files as a new batch.
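    
    A sketch of how the new option might be used (the directory path below is hypothetical):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[2]").appName("latestFirst-demo").getOrCreate()
    
    // Take the newest files first, a few per trigger, so recent data stays low
    // latency while the historical backlog is drained gradually.
    val lines = spark.readStream
      .option("maxFilesPerTrigger", "10")
      .option("latestFirst", "true")   // the new option added by this PR
      .text("/path/to/backfill")       // hypothetical directory
    
    lines.writeStream.format("console").start()
    ```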
    
    ## How was this patch tested?
    
    The added test.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16251 from zsxwing/newest-first.
    
    (cherry picked from commit 68a6dc974b25e6eddef109f6fd23ae4e9775ceca)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit b6a81f4720752efe459860d28d7f8f738b2944c3
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-12-15T22:26:54Z

    [SPARK-18888] partitionBy in DataStreamWriter in Python throws _to_seq not defined
    
    ## What changes were proposed in this pull request?
    
    `_to_seq` wasn't imported.
    
    ## How was this patch tested?
    
    Added partitionBy to existing write path unit test
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #16297 from brkyvz/SPARK-18888.

commit ef2ccf94224f00154cab7ab173d65442ecd389d7
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-12-15T22:46:00Z

    Preparing Spark release v2.1.0-rc3

commit a7364a82eb0d18f92f1d8e46c1160a55bc250032
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-12-15T22:46:09Z

    Preparing development version 2.1.1-SNAPSHOT

commit 08e4272872fc17c43f0dc79d329b946e8e85694d
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-12-15T23:46:03Z

    [SPARK-18868][FLAKY-TEST] Deflake StreamingQueryListenerSuite: single listener, check trigger...
    
    ## What changes were proposed in this pull request?
    
    Use `recentProgress` instead of `lastProgress` and filter out last non-zero value. Also add eventually to the latest assertQuery similar to first `assertQuery`
    
    ## How was this patch tested?
    
    Ran test 1000 times
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #16287 from brkyvz/SPARK-18868.
    
    (cherry picked from commit 9c7f83b0289ba4550b156e6af31cf7c44580eb12)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit ae853e8f3bdbd16427e6f1ffade4f63abaf74abb
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2016-12-16T00:15:51Z

    [MINOR] Only rename SparkR tar.gz if names mismatch
    
    ## What changes were proposed in this pull request?
    
    For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g., 2.1.0). Thus `cp` throws an error which causes the build to fail.
    
    ## How was this patch tested?
    
    Manually by executing the following script
    ```
    set -o pipefail
    set -e
    set -x
    
    touch a
    
    R_PACKAGE_VERSION=2.1.0
    VERSION=2.1.0
    
    if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
      cp a a
    fi
    ```
    
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    
    Closes #16299 from shivaram/sparkr-cp-fix.
    
    (cherry picked from commit 9634018c4d6d5a4f2c909f7227d91e637107b7f4)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit ec31726581a43624fd47ce48f4e33d2a8e96c15c
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-12-16T00:18:20Z

    Preparing Spark release v2.1.0-rc4

commit 62a6577bfa3a83783c813e74286e62b668e9af83
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-12-16T00:18:29Z

    Preparing development version 2.1.1-SNAPSHOT

commit b23220fa67dd279d0b8005cb66d0875adbd3c8cb
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2016-12-16T01:13:35Z

    [MINOR] Handle fact that mv is different on linux, mac
    
    Follow up to https://github.com/apache/spark/commit/ae853e8f3bdbd16427e6f1ffade4f63abaf74abb, as `mv` throws an error on the Jenkins machines if the source and destination are the same.
    
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    
    Closes #16302 from shivaram/sparkr-no-mv-fix.
    
    (cherry picked from commit 5a44f18a2a114bdd37b6714d81f88cb68148f0c9)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit cd0a08361e2526519e7c131c42116bf56fa62c76
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-12-16T01:57:04Z

    Preparing Spark release v2.1.0-rc5

commit 483624c2e13c8f239ee750bc149941b79800d0b0
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-12-16T01:57:11Z

    Preparing development version 2.1.1-SNAPSHOT

commit d8548c8a7541bfa37761382edbb1892a145b2b71
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-12-16T05:58:27Z

    [SPARK-18892][SQL] Alias percentile_approx approx_percentile
    
    ## What changes were proposed in this pull request?
    percentile_approx is the name used in Hive, and approx_percentile is the name used in Presto. approx_percentile is actually more consistent with our approx_count_distinct. Given that the cost of aliasing SQL functions is low (a one-liner), it'd be better to just alias them so they are easier to use.
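    
    A quick sketch showing the two spellings side by side (toy data; not part of this patch):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[1]").appName("percentile-alias").getOrCreate()
    spark.range(100).createOrReplaceTempView("t")
    
    // After this change, both names resolve to the same approximate-percentile aggregate.
    spark.sql(
      "SELECT percentile_approx(id, 0.5) AS hive_name, approx_percentile(id, 0.5) AS presto_name FROM t"
    ).show()
    
    spark.stop()
    ```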
    
    ## How was this patch tested?
    Technically I could add an end-to-end test to verify this one-line change, but it seemed too trivial to me.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #16300 from rxin/SPARK-18892.
    
    (cherry picked from commit 172a52f5d31337d90155feb7072381e8d5712288)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit a73201dafcf22756b8074a73e1b5da41cdf8b9a4
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-16T08:42:39Z

    [SPARK-18850][SS] Make StreamExecution and progress classes serializable
    
    ## What changes were proposed in this pull request?
    
    This PR adds StreamingQueryWrapper to make StreamExecution and the progress classes serializable, because it is too easy for them to get captured in closures during normal usage. If a StreamingQueryWrapper gets captured in a closure but its methods are never called, it should not fail the Spark tasks. However, if its methods are called, this PR makes it throw a better error message.
    
    ## How was this patch tested?
    
    `test("StreamingQuery should be Serializable but cannot be used in executors")`
    `test("progress classes should be Serializable")`
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16272 from zsxwing/SPARK-18850.
    
    (cherry picked from commit d7f3058e17571d76a8b4c8932de6de81ce8d2e78)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit d8ef0be83d8d032ddab79b465226ed3ff3d1eff7
Author: Takeshi YAMAMURO <li...@gmail.com>
Date:   2016-12-16T14:44:42Z

    [SPARK-18108][SQL] Fix a schema inconsistent bug that makes a parquet reader fail to read data
    
    ## What changes were proposed in this pull request?
    The vectorized parquet reader fails to read column data if the data schema and partition schema overlap with each other and the inferred types in the partition schema differ from the ones in the data schema. Example code to reproduce this bug is as follows:
    
    ```
    scala> case class A(a: Long, b: Int)
    scala> val as = Seq(A(1, 2))
    scala> spark.createDataFrame(as).write.parquet("/data/a=1/")
    scala> val df = spark.read.parquet("/data/")
    scala> df.printSchema
    root
     |-- a: long (nullable = true)
     |-- b: integer (nullable = true)
    scala> df.collect
    java.lang.NullPointerException
            at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:283)
            at org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.getLong(ColumnarBatch.java:191)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
            at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
            at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    ```
    The root cause is that the logical layer (`HadoopFsRelation`) and the physical layer (`VectorizedParquetRecordReader`) make different assumptions about the partition schema: the logical layer trusts the data schema to infer the types of the overlapped partition columns, while the physical layer trusts the partition schema, which is inferred from the path string. To fix this bug, this PR simply updates `HadoopFsRelation.schema` to respect the partition columns' positions in the data schema and their types in the partition schema.
    
    ## How was this patch tested?
    Add tests in `ParquetPartitionDiscoverySuite`
    
    Author: Takeshi YAMAMURO <li...@gmail.com>
    
    Closes #16030 from maropu/SPARK-18108.
    
    (cherry picked from commit dc2a4d4ad478fdb0486cc0515d4fe8b402d24db4)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17456: Branch 2.1

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17456
  
    @lnmohankumar please close this




[GitHub] spark issue #17456: Branch 2.1

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17456
  
    Please close this issue




[GitHub] spark issue #17456: Branch 2.1

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17456
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #17456: Branch 2.1

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17456

