Posted to reviews@spark.apache.org by elviento <gi...@git.apache.org> on 2017/03/11 17:07:37 UTC

[GitHub] spark pull request #17259: Branch 2.0

GitHub user elviento opened a pull request:

    https://github.com/apache/spark/pull/17259

    Branch 2.0

    ## What changes were proposed in this pull request?
    
    A missing closing brace '}' at line 1704 was found during the mvn build; the diff below adds it.
    
    
    diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
    index 6a9279f..3967d07 100644
    --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
    +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
    @@ -1701,4 +1701,5 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           assert(e3.message.contains(
             "Cannot have map type columns in DataFrame which calls set operations"))
         }
    +  }
     }
    
    ## How was this patch tested?
    
    Cloned the branch, applied the fix above, then successfully compiled using ./dev/make-distribution.sh
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17259
    
----
commit 8a58f2e8ec413591ec00da1e37b91b1bf49e4d1d
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-09-26T20:21:08Z

    [SPARK-17652] Fix confusing exception message while reserving capacity
    
    ## What changes were proposed in this pull request?
    
    This minor patch fixes a confusing exception message while reserving additional capacity in the vectorized parquet reader.
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Sameer Agarwal <sa...@cs.berkeley.edu>
    
    Closes #15225 from sameeragarwal/error-msg.
    
    (cherry picked from commit 7c7586aef9243081d02ea5065435234b5950ab66)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit f4594900d86bb39358ff19047dfa8c1e4b78aa6b
Author: Andrew Mills <am...@users.noreply.github.com>
Date:   2016-09-26T20:41:10Z

    [Docs] Update spark-standalone.md to fix link
    
    Corrected a link to the configuration.html page; it was pointing to a page that does not exist (configurations.html).
    
    Documentation change, verified in preview.
    
    Author: Andrew Mills <am...@users.noreply.github.com>
    
    Closes #15244 from ammills01/master.
    
    (cherry picked from commit 00be16df642317137f17d2d7d2887c41edac3680)
    Signed-off-by: Andrew Or <an...@gmail.com>

commit 98bbc4410181741d903a703eac289408cb5b2c5e
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-27T21:14:27Z

    [SPARK-17618] Guard against invalid comparisons between UnsafeRow and other formats
    
    This patch ports changes from #15185 to Spark 2.x. That patch fixed a correctness bug in Spark 1.6.x caused by an invalid `equals()` comparison between an `UnsafeRow` and another row of a different format. Spark 2.x is not affected by that specific correctness bug, but it can still reap the error-prevention benefits of that patch's changes, which modify `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.
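
    A minimal, hypothetical Scala sketch of the guard described above (the class name and byte-array field are illustrative, not the actual `UnsafeRow` code):

    ```scala
    // Equality is only defined against the same binary format; any other object
    // is rejected loudly instead of silently comparing as unequal.
    class UnsafeRowLike(val bytes: Array[Byte]) {
      override def equals(other: Any): Boolean = other match {
        case o: UnsafeRowLike => java.util.Arrays.equals(bytes, o.bytes)
        case _ =>
          throw new IllegalArgumentException(
            "Cannot compare UnsafeRowLike to an object of a different format")
      }
      override def hashCode(): Int = java.util.Arrays.hashCode(bytes)
    }
    ```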
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15265 from JoshRosen/SPARK-17618-master.
    
    (cherry picked from commit 2f84a686604b298537bfd4d087b41594d2aa7ec6)
    Signed-off-by: Josh Rosen <jo...@databricks.com>

commit 2cd327ef5e4c3f6b8468ebb2352479a1686b7888
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-09-27T23:00:39Z

    [SPARK-17056][CORE] Fix a wrong assert regarding unroll memory in MemoryStore
    
    ## What changes were proposed in this pull request?
    
    There is an assert in MemoryStore's putIteratorAsValues method that checks that unroll memory is not released too much. This assert looks wrong.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <si...@tw.ibm.com>
    
    Closes #14642 from viirya/fix-unroll-memory.
    
    (cherry picked from commit e7bce9e1876de6ee975ccc89351db58119674aef)
    Signed-off-by: Josh Rosen <jo...@databricks.com>

commit 1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-28T07:59:00Z

    [SPARK-17666] Ensure that RecordReaders are closed by data source file scans (backport)
    
    This is a branch-2.0 backport of #15245.
    
    ## What changes were proposed in this pull request?
    
    This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully-consume their input may cause file handles / network connections (e.g. S3 connections) to be leaked. Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans will only close record readers once their iterators are fully-consumed.
    
    This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.
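
    A hedged sketch of the callback registration described above (simplified, not the exact `FileFormat.buildReader()` code):

    ```scala
    import java.io.Closeable
    import org.apache.spark.TaskContext

    // Register a task-completion listener so the reader is closed even when the
    // task never fully consumes its input iterator.
    def closeOnTaskCompletion(reader: Closeable): Unit = {
      Option(TaskContext.get()).foreach { ctx =>
        ctx.addTaskCompletionListener { _ => reader.close() }
      }
    }
    ```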
    
    ## How was this patch tested?
    
    Tested manually for now.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15271 from JoshRosen/SPARK-17666-backport.

commit 4d73d5cd82ebc980f996c78f9afb8a97418ab7ab
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-09-28T10:19:04Z

    [MINOR][PYSPARK][DOCS] Fix examples in PySpark documentation
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to fix wrongly indented examples in PySpark documentation
    
    ```
    -        >>> json_sdf = spark.readStream.format("json")\
    -                                       .schema(sdf_schema)\
    -                                       .load(tempfile.mkdtemp())
    +        >>> json_sdf = spark.readStream.format("json") \\
    +        ...     .schema(sdf_schema) \\
    +        ...     .load(tempfile.mkdtemp())
    ```
    
    ```
    -        people.filter(people.age > 30).join(department, people.deptId == department.id)\
    +        people.filter(people.age > 30).join(department, people.deptId == department.id) \\
    ```
    
    ```
    -        >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), \
    -                        LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
    +        >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])),
    +        ...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
    ```
    
    ```
    -        >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])), \
    -                        LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
    +        >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])),
    +        ...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
    ```
    
    ```
    -        ...      for x in iterator:
    -        ...           print(x)
    +        ...     for x in iterator:
    +        ...          print(x)
    ```
    
    ## How was this patch tested?
    
    Manually tested.
    
    **Before**
    
    ![2016-09-26 8 36 02](https://cloud.githubusercontent.com/assets/6477701/18834471/05c7a478-8431-11e6-94bb-09aa37b12ddb.png)
    
    ![2016-09-26 9 22 16](https://cloud.githubusercontent.com/assets/6477701/18834472/06c8735c-8431-11e6-8775-78631eab0411.png)
    
    <img width="601" alt="2016-09-27 2 29 27" src="https://cloud.githubusercontent.com/assets/6477701/18861294/29c0d5b4-84bf-11e6-99c5-3c9d913c125d.png">
    
    <img width="1056" alt="2016-09-27 2 29 58" src="https://cloud.githubusercontent.com/assets/6477701/18861298/31694cd8-84bf-11e6-9e61-9888cb8c2089.png">
    
    <img width="1079" alt="2016-09-27 2 30 05" src="https://cloud.githubusercontent.com/assets/6477701/18861301/359722da-84bf-11e6-97f9-5f5365582d14.png">
    
    **After**
    
    ![2016-09-26 9 29 47](https://cloud.githubusercontent.com/assets/6477701/18834467/0367f9da-8431-11e6-86d9-a490d3297339.png)
    
    ![2016-09-26 9 30 24](https://cloud.githubusercontent.com/assets/6477701/18834463/f870fae0-8430-11e6-9482-01fc47898492.png)
    
    <img width="515" alt="2016-09-27 2 28 19" src="https://cloud.githubusercontent.com/assets/6477701/18861305/3ff88b88-84bf-11e6-902c-9f725e8a8b10.png">
    
    <img width="652" alt="2016-09-27 3 50 59" src="https://cloud.githubusercontent.com/assets/6477701/18863053/592fbc74-84ca-11e6-8dbf-99cf57947de8.png">
    
    <img width="709" alt="2016-09-27 3 51 03" src="https://cloud.githubusercontent.com/assets/6477701/18863060/601607be-84ca-11e6-80aa-a401df41c321.png">
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #15242 from HyukjinKwon/minor-example-pyspark.
    
    (cherry picked from commit 2190037757a81d3172f75227f7891d968e1f0d90)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 4c694e452278e46231720e778a80c586b9e565f1
Author: w00228970 <wa...@huawei.com>
Date:   2016-09-28T19:02:59Z

    [SPARK-17644][CORE] Do not add failedStages when abortStage for fetch failure
    
    | Time        |Thread 1 ,  Job1          | Thread 2 ,  Job2  |
    |:-------------:|:-------------:|:-----:|
    | 1 | abort stage due to FetchFailed |  |
    | 2 | failedStages += failedStage |    |
    | 3 |      |  task failed due to  FetchFailed |
    | 4 |      |  can not post ResubmitFailedStages because failedStages is not empty |
    
    Then job2 of thread2 never resubmits the failed stage and hangs.

    We should not add to failedStages when abortStage is called for a fetch failure.
    
    added unit test
    
    Author: w00228970 <wa...@huawei.com>
    Author: wangfei <wa...@126.com>
    
    Closes #15213 from scwf/dag-resubmit.
    
    (cherry picked from commit 46d1203bf2d01b219c4efc7e0e77a844c0c664da)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit d358298f1082edd31489a1b08f428c8e60278d69
Author: Eric Liang <ek...@databricks.com>
Date:   2016-09-28T23:19:06Z

    [SPARK-17673][SQL] Incorrect exchange reuse with RowDataSourceScan (backport)
    
    This backports https://github.com/apache/spark/pull/15273 to branch-2.0
    
    Also verified the test passes after the patch was applied. rxin
    
    Author: Eric Liang <ek...@databricks.com>
    
    Closes #15282 from ericl/spark-17673-2.

commit 0a69477a10adb3969a20ae870436299ef5152788
Author: Herman van Hovell <hv...@databricks.com>
Date:   2016-09-28T23:25:10Z

    [SPARK-17641][SQL] Collect_list/Collect_set should not collect null values.
    
    ## What changes were proposed in this pull request?
    We added native versions of `collect_set` and `collect_list` in Spark 2.0. These currently also (try to) collect null values, which differs from the original Hive implementation. This PR fixes this by adding a null check to the `Collect.update` method.
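
    A minimal, hypothetical sketch of the null check described above (not the actual Catalyst `Collect` implementation):

    ```scala
    import scala.collection.mutable

    // Simplified accumulator mirroring the idea behind Collect.update:
    // null inputs are skipped instead of being collected into the buffer.
    class CollectListSketch[T] {
      private val buffer = mutable.ArrayBuffer.empty[T]

      def update(value: T): Unit = {
        if (value != null) buffer += value
      }

      def result: Seq[T] = buffer.toSeq
    }
    ```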
    
    ## How was this patch tested?
    Added a regression test to `DataFrameAggregateSuite`.
    
    Author: Herman van Hovell <hv...@databricks.com>
    
    Closes #15208 from hvanhovell/SPARK-17641.
    
    (cherry picked from commit 7d09232028967978d9db314ec041a762599f636b)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-09-28T23:27:45Z

    Preparing Spark release v2.0.1-rc4

commit 7d612a7d5277183d3bee3882a687c76dc8ea0e9a
Author: Patrick Wendell <pw...@gmail.com>
Date:   2016-09-28T23:27:54Z

    Preparing development version 2.0.2-SNAPSHOT

commit ca8130050964fac8baa568918f0b67c44a7a2518
Author: Takeshi YAMAMURO <li...@gmail.com>
Date:   2016-09-29T12:26:03Z

    [MINOR][DOCS] Fix the doc of spark-streaming with kinesis
    
    ## What changes were proposed in this pull request?
    This PR just fixes the documentation of `spark-kinesis-integration`.
    Since `SPARK-17418` prevented all the Kinesis artifacts (including the Kinesis example code)
    from being published, `bin/run-example streaming.KinesisWordCountASL` and `bin/run-example streaming.JavaKinesisWordCountASL` do not work.
    Instead, the Kinesis jar is fetched from Spark Packages.
    
    Author: Takeshi YAMAMURO <li...@gmail.com>
    
    Closes #15260 from maropu/DocFixKinesis.
    
    (cherry picked from commit b2e9731ca494c0c60d571499f68bb8306a3c9fe5)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 7ffafa3bfecb8bc92b79eddea1ca18166efd3385
    Author: 蒋星博 <ji...@meituan.com>
Date:   2016-07-13T16:21:27Z

    [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition.
    
    ## What changes were proposed in this pull request?
    
    Currently our Optimizer may reorder predicates to run them more efficiently, but under non-deterministic conditions, changing the order between deterministic and non-deterministic parts may change the number of input rows. For example:
    ```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
    and
    ```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
    may call rand() a different number of times and therefore produce different output rows.

    This PR improves this by checking whether the predicate is placed before any non-deterministic predicates.
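
    A minimal, hypothetical sketch of the ordering check described above (simplified shapes, not the actual Catalyst rule):

    ```scala
    // Only predicates appearing before the first non-deterministic predicate are
    // safe to push down; everything from that point on must stay in place so
    // rand() is evaluated the same number of times as before.
    case class Pred(sql: String, deterministic: Boolean)

    def pushablePredicates(predicates: Seq[Pred]): Seq[Pred] =
      predicates.takeWhile(_.deterministic)
    ```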
    
    ## How was this patch tested?
    
    Expanded related testcases in FilterPushdownSuite.
    
    Author: 蒋星博 <ji...@meituan.com>
    
    Closes #14012 from jiangxb1987/ppd.
    
    (cherry picked from commit f376c37268848dbb4b2fb57677e22ef2bf207b49)
    Signed-off-by: Josh Rosen <jo...@databricks.com>

commit f7839e47c3bda86d61c3b2be72c168aab4a5674f
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-29T02:03:05Z

    [SPARK-17712][SQL] Fix invalid pushdown of data-independent filters beneath aggregates
    
    ## What changes were proposed in this pull request?
    
    This patch fixes a minor correctness issue impacting the pushdown of filters beneath aggregates. Specifically, if a filter condition references no grouping or aggregate columns (e.g. `WHERE false`) then it would be incorrectly pushed beneath an aggregate.
    
    Intuitively, the only case where you can push a filter beneath an aggregate is when that filter is deterministic and is defined over the grouping columns / expressions, since in that case the filter is acting to exclude entire groups from the query (like a `HAVING` clause). The existing code would only push deterministic filters beneath aggregates when all of the filter's references were grouping columns, but this logic missed the case where a filter has no references. For example, `WHERE false` is deterministic but is independent of the actual data.
    
    This patch fixes this minor bug by adding a new check to ensure that we don't push filters beneath aggregates when those filters don't reference any columns.
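
    A minimal, hypothetical sketch of the pushdown condition described above (simplified shapes, not the actual rule code):

    ```scala
    // A filter may be pushed beneath an aggregate only if it is deterministic,
    // references at least one column, and every referenced column is a grouping
    // column; a data-independent filter such as `WHERE false` fails the
    // non-empty-references check and stays above the aggregate.
    case class FilterCond(references: Set[String], deterministic: Boolean)

    def canPushBeneathAggregate(f: FilterCond, groupingCols: Set[String]): Boolean =
      f.deterministic && f.references.nonEmpty && f.references.subsetOf(groupingCols)
    ```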
    
    ## How was this patch tested?
    
    New regression test in FilterPushdownSuite.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15289 from JoshRosen/SPARK-17712.
    
    (cherry picked from commit 37eb9184f1e9f1c07142c66936671f4711ef407d)
    Signed-off-by: Josh Rosen <jo...@databricks.com>

commit 7c9450b007205958984f39a881415cdbe75e0c34
Author: Gang Wu <wg...@uber.com>
Date:   2016-09-29T19:51:05Z

    [SPARK-17672] Spark 2.0 history server web UI takes too long for a single application

    Added a new API, getApplicationInfo(appId: String), to the ApplicationHistoryProvider and SparkUI classes to get app info. With this change, FsHistoryProvider can fetch a single application's info in O(1) time, compared to the O(n) Iterator.find() scan used before.

    Both the ApplicationCache and OneApplicationResource classes adopt this new API.
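
    A hedged sketch of the lookup change described above (assumed shapes, not the actual ApplicationHistoryProvider API):

    ```scala
    case class AppInfo(id: String, name: String)

    class HistoryProviderSketch(apps: Map[String, AppInfo]) {
      // Before: an O(n) scan such as apps.values.find(_.id == appId).
      // After: an O(1) keyed lookup.
      def getApplicationInfo(appId: String): Option[AppInfo] = apps.get(appId)
    }
    ```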
    
     manual tests
    
    Author: Gang Wu <wg...@uber.com>
    
    Closes #15247 from wgtmac/SPARK-17671.
    
    (cherry picked from commit cb87b3ced9453b5717fa8e8637b97a2f3f25fdd7)
    Signed-off-by: Andrew Or <an...@gmail.com>

commit 0cdd7370a61618d042417ee387a3c32ee5c924e6
Author: Bjarne Fruergaard <bw...@gmail.com>
Date:   2016-09-29T22:39:57Z

    [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with SparseVector
    
    ## What changes were proposed in this pull request?
    
    * changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
    * adds a test that was failing before this change, but succeeds with these changes.
    
    The problem in the previous implementation was that it only increments `i` (which enumerates the columns of a row in the SparseMatrix) when the row-index of the vector matches the column-index of the SparseMatrix. When a row of the SparseMatrix has non-zero values at column-indices lower than the corresponding non-zero row-indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column-index at index `i`, and the remaining column-indices `i+1, ..., indEnd-1` are never attempted. The test cases in this PR illustrate this issue.
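
    A minimal, hypothetical sketch of the corrected scan (not the actual `gemv` code): advance whichever index is behind instead of advancing only on exact matches, so trailing column-indices are never skipped.

    ```scala
    // Dot product of a sparse matrix row (aIdx/aVal) with a sparse vector
    // (bIdx/bVal); both index arrays are assumed to be sorted ascending.
    def sparseDot(aIdx: Array[Int], aVal: Array[Double],
                  bIdx: Array[Int], bVal: Array[Double]): Double = {
      var i = 0
      var j = 0
      var sum = 0.0
      while (i < aIdx.length && j < bIdx.length) {
        if (aIdx(i) == bIdx(j)) { sum += aVal(i) * bVal(j); i += 1; j += 1 }
        else if (aIdx(i) < bIdx(j)) i += 1
        else j += 1
      }
      sum
    }
    ```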
    
    ## How was this patch tested?
    
    I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
    
    ## ___
    As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
    
    Mentioning dbtsai, viirya and brkyvz, who I can see have worked on / authored these parts before.
    
    Author: Bjarne Fruergaard <bw...@gmail.com>
    
    Closes #15296 from bwahlgreen/bugfix-spark-17721.
    
    (cherry picked from commit 29396e7d1483d027960b9a1bed47008775c4253e)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit a99ea4c9e0e2f91e4b524987788f0acee88e564d
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-09-29T23:31:30Z

    Updated the following PR with minor changes to allow cherry-pick to branch-2.0
    
    [SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting
    
    When calling LogisticRegression.evaluate or GeneralizedLinearRegression.evaluate on a Dataset whose label is not of double type, the summary calculations pattern match against a double and throw a MatchError. This fix casts the label column to DoubleType to ensure there is no MatchError.
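
    A hedged sketch of the cast described above (the column name `label` and the helper are assumptions, not the actual patch):

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DoubleType

    // Cast the label column to DoubleType so downstream pattern matches on Double
    // do not hit a MatchError for integer or float labels.
    def withDoubleLabel(df: DataFrame, labelCol: String = "label"): DataFrame =
      df.withColumn(labelCol, col(labelCol).cast(DoubleType))
    ```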
    
    Added unit tests to call evaluate with a dataset that has Label as other numeric types.
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.
    
    (cherry picked from commit 2f739567080d804a942cfcca0e22f91ab7cbea36)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 744aac8e6ff04d7a3f1e8ccad335605ac8fe2f29
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-10-01T05:05:59Z

    [MINOR][DOC] Add an up-to-date description for default serialization during shuffling
    
    ## What changes were proposed in this pull request?
    
    This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark chooses Kryo as the default serialization library when shuffling simple types, arrays of simple types, or strings.
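
    For reference, a hedged example of configuring the serializer explicitly; per the doc update above, Spark already uses Kryo when shuffling simple types, arrays of simple types, or strings even without this setting:

    ```scala
    import org.apache.spark.SparkConf

    // Explicitly select Kryo as the serializer (optional for the shuffle cases above).
    val conf = new SparkConf()
      .setAppName("serializer-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    ```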
    
    ## How was this patch tested?
    
    This is a documentation update.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
    
    (cherry picked from commit 15e9bbb49e00b3982c428d39776725d0dea2cdfa)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit b57e2acb134d94dafc81686da875c5dd3ea35c74
Author: Jagadeesan <as...@us.ibm.com>
Date:   2016-10-03T09:46:38Z

    [SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown, …
    
    ## What changes were proposed in this pull request?
    
    To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~
    
    … pandoc]
    
    Author: Jagadeesan <as...@us.ibm.com>
    
    Closes #15309 from jagadeesanas2/SPARK-17736.
    
    (cherry picked from commit a27033c0bbaae8f31db9b91693947ed71738ed11)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 613863b116b6cbc9ac83845c68a2d11b3b02f7cb
Author: zero323 <ze...@users.noreply.github.com>
Date:   2016-10-04T00:57:54Z

    [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract
    
    ## What changes were proposed in this pull request?
    
    Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.

    Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`.
    
    ## How was this patch tested?
    
    PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved.
    
    Author: zero323 <ze...@users.noreply.github.com>
    
    Closes #15144 from zero323/SPARK-17587.
    
    (cherry picked from commit d8399b600cef706c22d381b01fab19c610db439a)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 5843932021cc8bbe0277943c6c480cfeae1b29e2
Author: Herman van Hovell <hv...@databricks.com>
Date:   2016-10-04T02:32:59Z

    [SPARK-17753][SQL] Allow a complex expression as the input of a value-based case statement
    
    ## What changes were proposed in this pull request?
    We currently only allow relatively simple expressions as the input for a value-based case statement. Expressions like `case (a > 1) or (b = 2) when true then 1 when false then 0 end` currently fail. This PR adds support for such expressions.
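
    A hedged usage sketch of the now-supported form (the local session and table `t` below are illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("case-example").master("local[*]").getOrCreate()
    spark.range(5).selectExpr("id AS a", "id AS b").createOrReplaceTempView("t")

    // This complex, value-based CASE input previously failed to parse.
    spark.sql(
      "SELECT CASE (a > 1) OR (b = 2) WHEN true THEN 1 WHEN false THEN 0 END AS flag FROM t").show()
    ```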
    
    ## How was this patch tested?
    Added a test to the ExpressionParserSuite.
    
    Author: Herman van Hovell <hv...@databricks.com>
    
    Closes #15322 from hvanhovell/SPARK-17753.
    
    (cherry picked from commit 2bbecdec2023143fd144e4242ff70822e0823986)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 7429199e5b34d5594e3fcedb57eda789d16e26f3
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-10-04T04:28:16Z

    [SPARK-17112][SQL] "select null" via JDBC triggers IllegalArgumentException in Thriftserver
    
    ## What changes were proposed in this pull request?
    
    Currently, Spark Thrift Server raises `IllegalArgumentException` for queries whose column types are `NullType`, e.g., `SELECT null` or `SELECT if(true,null,null)`. This PR fixes that by returning `void` like Hive 1.2.
    
    **Before**
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
    Closing: 0: jdbc:hive2://localhost:10000
    
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
    Closing: 0: jdbc:hive2://localhost:10000
    ```
    
    **After**
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    +-------+--+
    | NULL  |
    +-------+--+
    | NULL  |
    +-------+--+
    1 row selected (3.242 seconds)
    Beeline version 1.2.1.spark2 by Apache Hive
    Closing: 0: jdbc:hive2://localhost:10000
    
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    +-------------------------+--+
    | (IF(true, NULL, NULL))  |
    +-------------------------+--+
    | NULL                    |
    +-------------------------+--+
    1 row selected (0.201 seconds)
    Beeline version 1.2.1.spark2 by Apache Hive
    Closing: 0: jdbc:hive2://localhost:10000
    ```
    
    ## How was this patch tested?
    
    * Pass the Jenkins test with a new testsuite.
    * Also, manually, after starting the Spark Thrift Server, run the following commands.
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
    ```
    
    **Hive 1.2**
    ```sql
    hive> create table null_table as select null;
    hive> desc null_table;
    OK
    _c0                     void
    ```
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #15325 from dongjoon-hyun/SPARK-17112.
    
    (cherry picked from commit c571cfb2d0e1e224107fc3f0c672730cae9804cb)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 3dbe8097facb854195729da7bd577f6c14eb2b2a
Author: ding <di...@localhost.localdomain>
Date:   2016-10-04T07:00:10Z

    [SPARK-17559][MLLIB] persist edges if their storage level is none in PeriodicGraphCheckpointer
    
    ## What changes were proposed in this pull request?
    When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges aren't persisted. Currently the graph is persisted only when the vertices' storage level is none, but there is a chance the vertices' storage level is not none while the edges' is. E.g. for a graph created by an outerJoinVertices operation, the vertices are automatically cached while the edges are not, so the edges will not be persisted when PeriodicGraphCheckpointer does the persist. We need to check the edges' storage level separately and persist the edges if it is none.
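
    A minimal sketch of the separate check described above (not the actual PeriodicGraphCheckpointer code):

    ```scala
    import org.apache.spark.graphx.Graph
    import org.apache.spark.storage.StorageLevel

    // Persist vertices and edges independently: either side may already be cached
    // (e.g. vertices after an outerJoinVertices) while the other is not.
    def persistIfUnpersisted[VD, ED](g: Graph[VD, ED]): Unit = {
      if (g.vertices.getStorageLevel == StorageLevel.NONE) g.vertices.persist()
      if (g.edges.getStorageLevel == StorageLevel.NONE) g.edges.persist()
    }
    ```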
    
    ## How was this patch tested?
     manual tests
    
    Author: ding <di...@localhost.localdomain>
    
    Closes #15124 from dding3/spark-persisitEdge.
    
    (cherry picked from commit 126baa8d32bc0e7bf8b43f9efa84f2728f02347d)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 50f6be7598547fed5190a920fd3cebb4bc908524
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-10-04T16:22:26Z

    [SPARKR][DOC] minor formatting and output cleanup for R vignettes
    
    Clean up output, format table, truncate long example output, hide warnings
    
    (new - Left; existing - Right)
    ![image](https://cloud.githubusercontent.com/assets/8969467/19064018/5dcde4d0-89bc-11e6-857b-052df3f52a4e.png)
    
    ![image](https://cloud.githubusercontent.com/assets/8969467/19064034/6db09956-89bc-11e6-8e43-232d5c3fe5e6.png)
    
    ![image](https://cloud.githubusercontent.com/assets/8969467/19064058/88f09590-89bc-11e6-9993-61639e29dfdd.png)
    
    ![image](https://cloud.githubusercontent.com/assets/8969467/19064066/95ccbf64-89bc-11e6-877f-45af03ddcadc.png)
    
    ![image](https://cloud.githubusercontent.com/assets/8969467/19064082/a8445404-89bc-11e6-8532-26d8bc9b206f.png)
    
    Run create-doc.sh manually
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #15340 from felixcheung/vignettes.
    
    (cherry picked from commit 068c198e956346b90968a4d74edb7bc820c4be28)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit a9165bb1b704483ad16331945b0968cbb1a97139
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2016-10-04T16:38:44Z

    [SPARK-17549][SQL] Only collect table size stat in driver for cached relation.
    
    This reverts commit 9ac68dbc5720026ea92acc61d295ca64d0d3d132. Turns out
    the original fix was correct.
    
    Original change description:
    The existing code caches all stats for all columns for each partition
    in the driver; for a large relation, this causes extreme memory usage,
    which leads to gc hell and application failures.
    
    It seems that only the size in bytes of the data is actually used in the
    driver, so instead just collect that. In executors, the full stats are
    still kept, but that's not a big problem; we expect the data to be distributed
    and thus not incur too much memory pressure in each individual
    executor.
    
    There are also potential improvements on the executor side, since the data
    being stored currently is very wasteful (e.g. storing boxed types vs.
    primitive types for stats). But that's a separate issue.
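
    A minimal, hypothetical sketch of the driver-side change described above (assumed shapes, not the actual cached-relation code):

    ```scala
    // The driver keeps only a size in bytes per cached batch; the full per-column
    // statistics stay on the executors.
    case class BatchStats(sizeInBytes: Long)

    def totalSizeInBytes(batches: Seq[BatchStats]): Long =
      batches.map(_.sizeInBytes).sum
    ```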
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #15304 from vanzin/SPARK-17549.2.
    
    (cherry picked from commit 8d969a2125d915da1506c17833aa98da614a257f)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit a4f7df423e1e0aa512dfc496bc9de13831eae3f3
Author: Ergin Seyfe <es...@fb.com>
Date:   2016-10-04T19:39:01Z

    [SPARK-17773][BRANCH-2.0][Input/Output] Add VoidObjectInspector
    
    This is the branch-2.0 PR for https://github.com/apache/spark/pull/15337
    
    Added VoidObjectInspector to the list of PrimitiveObjectInspectors
    
    Executing the following query was failing:
    select SOME_UDAF*(a.arr)
    from (
    select Array(null) as arr from dim_one_row
    ) a
    
    After the fix, I am getting the correct output:
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    
    Author: Ergin Seyfe <es...@fb.com>
    
    Closes #15337 from seyfe/add_void_object_inspector.
    
    Author: Ergin Seyfe <es...@fb.com>
    
    Closes #15345 from seyfe/add_void_object_inspector_2.0.

commit b8df2e53c38a30f51c710543c81279a59a9ab4fc
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-10-05T21:54:55Z

    [SPARK-17778][TESTS] Mock SparkContext to reduce memory usage of BlockManagerSuite
    
    ## What changes were proposed in this pull request?
    
    Mock SparkContext to reduce memory usage of BlockManagerSuite
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15350 from zsxwing/SPARK-17778.
    
    (cherry picked from commit 221b418b1c9db7b04c600b6300d18b034a4f444e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 3b6463a794a754d630d69398f009c055664dd905
Author: Herman van Hovell <hv...@databricks.com>
Date:   2016-10-05T23:05:30Z

    [SPARK-17758][SQL] Last returns wrong result in case of empty partition
    
    ## What changes were proposed in this pull request?
    The result of the `Last` function can be wrong when the last partition processed is empty. It can return `null` instead of the expected value. For example, this can happen when we process partitions in the following order:
    ```
    - Partition 1 [Row1, Row2]
    - Partition 2 [Row3]
    - Partition 3 []
    ```
    In this case the `Last` function will currently return null instead of the value of `Row3`.
    
    This PR fixes this by adding a `valueSet` flag to the `Last` function.
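
    A minimal, hypothetical sketch of the `valueSet` idea described above (not the actual Catalyst `Last` implementation):

    ```scala
    // An empty final partition must not overwrite a value that was already seen:
    // merging a buffer whose valueSet flag is false keeps the current value.
    class LastSketch[T] {
      private var valueSet = false
      private var last: T = _

      def update(value: T): Unit = { last = value; valueSet = true }

      def merge(other: LastSketch[T]): Unit = {
        if (other.valueSet) { last = other.last; valueSet = true }
      }

      def result: Option[T] = if (valueSet) Some(last) else None
    }
    ```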
    
    ## How was this patch tested?
    We only used end-to-end tests for `DeclarativeAggregateFunction`s. I have added an evaluator for these functions so we can test them in Catalyst. I have added a `LastTestSuite` to test the `Last` aggregate function.
    
    Author: Herman van Hovell <hv...@databricks.com>
    
    Closes #15348 from hvanhovell/SPARK-17758.
    
    (cherry picked from commit 5fd54b994e2078dbf0794932b4e0ffa9a9eda0c3)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 1c2dff1eeeb045f3f5c3c1423ba07371b03965d7
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-10-05T23:48:43Z

    [SPARK-17643] Remove comparable requirement from Offset (backport for branch-2.0)
    
    ## What changes were proposed in this pull request?
    
    Backport https://github.com/apache/spark/commit/988c71457354b0a443471f501cef544a85b1a76a to branch-2.0
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #15362 from zsxwing/SPARK-17643-2.0.

commit 225372adfb843afcbf9928db3989f2f8393ae6d8
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-10-06T17:33:45Z

    [SPARK-17798][SQL] Remove redundant Experimental annotations in sql.streaming
    
    ## What changes were proposed in this pull request?
    I was looking through API annotations to catch mislabeled APIs and realized that the DataStreamReader and DataStreamWriter classes are already annotated as Experimental, so there is no need to annotate each method within them.
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #15373 from rxin/SPARK-17798.
    
    (cherry picked from commit 79accf45ace5549caa0cbab02f94fc87bedb5587)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17259: Branch 2.0

Posted by elviento <gi...@git.apache.org>.
Github user elviento closed the pull request at:

    https://github.com/apache/spark/pull/17259

