Posted to reviews@spark.apache.org by wind-org <gi...@git.apache.org> on 2017/08/28 13:42:19 UTC

[GitHub] spark pull request #19070: Branch 2.2

GitHub user wind-org opened a pull request:

    https://github.com/apache/spark/pull/19070

    Branch 2.2

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19070
    
----
commit 0bd918f67630f83cdc2922a2f48bd28b023ef821
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-05-15T16:22:06Z

    [SPARK-12837][SPARK-20666][CORE][FOLLOWUP] getting name should not fail if accumulator is garbage collected
    
    ## What changes were proposed in this pull request?
    
    After https://github.com/apache/spark/pull/17596, we no longer send the internal accumulator name to the executor side and always look up the accumulator name in `AccumulatorContext`.
    
    This causes a regression if the accumulator has already been garbage collected. This PR fixes it by still sending the accumulator name for `SQLMetrics`.
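    
    As a rough illustration of the defensive pattern this relies on (the class and method names below are hypothetical, not Spark's actual `AccumulatorContext` API), a name lookup against a weak-reference registry should fall back gracefully instead of failing once the accumulator has been collected:
    
    ```scala
    import java.lang.ref.WeakReference
    import java.util.concurrent.ConcurrentHashMap
    
    class Acc(val name: String) // stand-in for a named driver-side accumulator
    
    // Hypothetical registry: the driver only holds weak references to accumulators.
    object AccumulatorRegistry {
      private val originals = new ConcurrentHashMap[Long, WeakReference[Acc]]()
    
      def register(id: Long, acc: Acc): Unit = originals.put(id, new WeakReference(acc))
    
      // Returns None instead of throwing once the accumulator has been collected.
      def lookupName(id: Long): Option[String] =
        Option(originals.get(id)).flatMap(ref => Option(ref.get)).map(_.name)
    }
    
    // A metric that carries its own name keeps working after the driver-side accumulator
    // is gone, which is the idea behind still sending the name for SQLMetrics.
    case class MetricInfo(id: Long, explicitName: Option[String]) {
      def displayName: String =
        explicitName.orElse(AccumulatorRegistry.lookupName(id)).getOrElse(s"accumulator $id")
    }
    ```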
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #17931 from cloud-fan/bug.
    
    (cherry picked from commit e1aaab1e277b1b07c26acea75ade78e39bdac209)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 82ae1f0aca9c00fddba130c144adfe0777172cc8
Author: Tathagata Das <ta...@gmail.com>
Date:   2017-05-15T17:46:38Z

    [SPARK-20716][SS] StateStore.abort() should not throw exceptions
    
    ## What changes were proposed in this pull request?
    
    StateStore.abort() should make a best-effort attempt to clean up temporary resources. It should not throw errors, especially since it is called in a TaskCompletionListener, where an exception could hide earlier, real errors in the task.
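    
    A minimal sketch of the "best effort, never throw" shape intended here (`TempStateFile` is a made-up stand-in, not Spark's StateStore implementation):
    
    ```scala
    import java.nio.file.{Files, Paths}
    import scala.util.control.NonFatal
    
    // Hypothetical temporary resource held by a state store update attempt.
    class TempStateFile(path: String) {
      def abort(): Unit = {
        try {
          // Best-effort cleanup of the temporary file.
          Files.deleteIfExists(Paths.get(path))
        } catch {
          // Never rethrow: abort() runs in a TaskCompletionListener, and an exception
          // here would mask the real error that failed the task. Log and move on.
          case NonFatal(e) => System.err.println(s"Failed to clean up $path: $e")
        }
      }
    }
    ```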
    
    ## How was this patch tested?
    No unit test.
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #17958 from tdas/SPARK-20716.
    
    (cherry picked from commit 271175e2bd0f7887a068db92de73eff60f5ef2b2)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit a79a120a8fc595045b32f16663286b32dadc53ed
Author: Tathagata Das <ta...@gmail.com>
Date:   2017-05-15T17:48:10Z

    [SPARK-20717][SS] Minor tweaks to the MapGroupsWithState behavior
    
    ## What changes were proposed in this pull request?
    
    Timeout and state data are two independent entities and should be settable independently. Therefore, in the same call of the user-defined function, one should be able to set the timeout before initializing the state and also after removing the state. Whether timeouts can be set should not depend on the current state, and vice versa.
    
    However, a limitation of the current implementation is that state cannot be null while a timeout is set. This is checked lazily after the function call has completed.
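    
    A short sketch of a user function under these semantics, assuming Spark's `GroupState` API with a processing-time timeout enabled on the query; the `SessionState` type and values are illustrative:
    
    ```scala
    import org.apache.spark.sql.streaming.GroupState
    
    case class SessionState(count: Long)
    
    def updateSession(key: String, events: Iterator[String],
                      state: GroupState[SessionState]): Long = {
      if (state.hasTimedOut) {
        state.remove()            // timed-out group: drop its state
        0L
      } else {
        // The timeout can be set before the state is initialized; the "state must not be
        // null while a timeout is set" constraint is only checked after the call returns.
        state.setTimeoutDuration("10 minutes")
        val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.length)
        state.update(updated)
        updated.count
      }
    }
    ```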
    
    ## How was this patch tested?
    - Updated existing unit tests that test the behavior of GroupState.setTimeout*** with respect to the current state
    - Added new tests that verify the disallowed cases where state is undefined but timeout is set.
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #17957 from tdas/SPARK-20717.
    
    (cherry picked from commit 499ba2cb47efd6a860e74e6995412408efc5238d)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit e84e9dd54cf67369c75fc38dc60d758ee8930240
Author: Dongjoon Hyun <do...@apache.org>
Date:   2017-05-15T18:24:30Z

    [SPARK-20735][SQL][TEST] Enable cross join in TPCDSQueryBenchmark
    
    ## What changes were proposed in this pull request?
    
    Since [SPARK-17298](https://issues.apache.org/jira/browse/SPARK-17298), some queries (q28, q61, q77, q88, q90) in the test suites fail with a message "_Use the CROSS JOIN syntax to allow cartesian products between these relations_".
    
    This benchmark is used as a reference model for Spark TPC-DS, so this PR aims to enable the correct configuration in `TPCDSQueryBenchmark.scala`.
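    
    For reference, a minimal sketch of the configuration being enabled, assuming the standard `spark.sql.crossJoin.enabled` flag added by SPARK-17298:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("TPCDSQueryBenchmark")
      // Allow cartesian products so q28, q61, q77, q88 and q90 can be planned.
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreate()
    ```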
    
    ## How was this patch tested?
    
    Manual. (Run TPCDSQueryBenchmark)
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #17977 from dongjoon-hyun/SPARK-20735.
    
    (cherry picked from commit bbd163d589e7503c5cb150d934e7565b18a908f2)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 10e599f69c5c1b1b17d9181b2e93b7e315759b9d
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-05-15T23:52:22Z

    [SPARK-20588][SQL] Cache TimeZone instances.
    
    ## What changes were proposed in this pull request?
    
    Because the method `TimeZone.getTimeZone(String ID)` is synchronized on the TimeZone class, concurrent calls to this method can become a bottleneck.
    This happens especially when casting a string value containing timezone info to a timestamp value, which uses `DateTimeUtils.stringToTimestamp()` and gets a TimeZone instance at the call site.
    
    This PR caches the generated TimeZone instances to avoid the synchronization.
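    
    An illustrative sketch of the caching idea (not the exact Spark code): memoize instances per ID so only the first lookup per ID hits the synchronized JDK method.
    
    ```scala
    import java.util.TimeZone
    import java.util.concurrent.ConcurrentHashMap
    import java.util.function.{Function => JFunction}
    
    object TimeZoneCache {
      private val cache = new ConcurrentHashMap[String, TimeZone]()
      private val compute = new JFunction[String, TimeZone] {
        override def apply(id: String): TimeZone = TimeZone.getTimeZone(id)
      }
    
      // Only the first lookup per ID pays for the synchronized TimeZone.getTimeZone call.
      def getTimeZone(id: String): TimeZone = cache.computeIfAbsent(id, compute)
    }
    ```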
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ue...@databricks.com>
    
    Closes #17933 from ueshin/issues/SPARK-20588.
    
    (cherry picked from commit c8c878a4166415728f6e940504766a099a2f6744)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit a869e8bfdc23b9e3796a7c4d51f91902b5a067d2
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-05-16T02:08:23Z

    [SPARK-20707][ML] ML deprecated APIs should be removed in major release.
    
    ## What changes were proposed in this pull request?
    Before 2.2, MLlib removed APIs that had been deprecated in the previous feature/minor release. But from Spark 2.2, we decided to remove deprecated APIs only in a major release, so we need to change the corresponding annotations to tell users those will be removed in 3.0.
    Meanwhile, this fixes bugs in the ML docs: the original docs could not show deprecation annotations on the ```MLWriter```- and ```MLReader```-related classes; we correct that in this PR.
    
    Before:
    ![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png)
    
    After:
    ![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png)
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #17946 from yanboliang/spark-20707.
    
    (cherry picked from commit d4022d49514cc1f8ffc5bfe243186ec3748df475)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit 57c87cf2da8063f2757389bd37f2847d397e16ee
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-05-16T04:21:54Z

    [SPARK-20501][ML] ML 2.2 QA: New Scala APIs, docs
    
    ## What changes were proposed in this pull request?
    Review new Scala APIs introduced in 2.2.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #17934 from yanboliang/spark-20501.
    
    (cherry picked from commit dbe81633a766c4dc68a0a27063e5dfde0f5690af)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit b8d37ac37bcd1ecf8b5f17233bce6b5b39ed2fd0
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-16T08:54:42Z

    [SPARK-20553][ML][PYSPARK] Update ALS examples with recommend-all methods
    
    Update ALS examples illustrating use of "recommendForAllX" methods.
    
    ## How was this patch tested?
    Built and ran examples locally
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #17950 from MLnick/SPARK-20553-update-als-examples.
    
    (cherry picked from commit 6af7b43b34942c662122e3905b0724b2dd40a63f)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit ee0d2af950ea82f539fb08e66d7cf14045912ebe
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-16T08:59:34Z

    [SPARK-20677][MLLIB][ML] Follow-up to ALS recommend-all performance PRs
    
    Small clean ups from #17742 and #17845.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #17919 from MLnick/SPARK-20677-als-perf-followup.
    
    (cherry picked from commit 25b4f41d239ac67402566c0254a893e2e58ae7d8)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 75e5ea294c15ecfb7366ae15dce196aa92c87ca4
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-05-16T17:35:51Z

    [SPARK-20529][CORE] Allow worker and master work with a proxy server
    
    ## What changes were proposed in this pull request?
    
    In the current code, when a worker connects to a master, the master sends its address to the worker. The worker then saves this address and uses it to reconnect in case of failure. However, this address is sometimes not correct: if there is a proxy between master and worker, the address the master sends is not the address of the proxy.
    
    In this PR, the master address used by the worker is sent to the master, and the master simply replies with this address; the worker uses it to reconnect in case of failure. In other words, the worker uses the master address configured on the worker side if possible, rather than the master address set on the master side.
    
    There is still one potential issue, though. When a master is restarted or takes over leadership, the worker will use the address sent from the master to connect. If there is still a proxy between master and worker, that address may be wrong. However, there is no way to figure this out from the worker alone.
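    
    A toy sketch of the handshake idea (the message types are hypothetical, not Spark's actual RPC messages): the worker sends along the master address it was configured with, and the master echoes it back so reconnects go through the same proxy.
    
    ```scala
    // Worker -> Master: include the master address as seen from the worker side.
    case class RegisterWorker(workerId: String, masterAddressSeenByWorker: String)
    // Master -> Worker: reply with that same address for use on reconnect.
    case class RegisteredWorker(masterAddressForReconnect: String)
    
    def handleRegister(msg: RegisterWorker): RegisteredWorker =
      // Echo the worker-side address back instead of the master's own view of its address.
      RegisteredWorker(msg.masterAddressSeenByWorker)
    ```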
    
    ## How was this patch tested?
    
    The new added unit test.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #17821 from zsxwing/SPARK-20529.
    
    (cherry picked from commit 9150bca47e4b8782e20441386d3d225eb5f2f404)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 7076ab40f86fe606cd9b813dad506e921501383e
Author: Yash Sharma <ys...@atlassian.com>
Date:   2017-05-16T22:08:05Z

    [SPARK-20140][DSTREAM] Remove hardcoded kinesis retry wait and max retries
    
    ## What changes were proposed in this pull request?
    
    This pull request proposes to remove the hardcoded values for Amazon Kinesis - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
    
    This change is critical for Kinesis checkpoint recovery when the Kinesis-backed RDD is huge.
    The following happens in a typical Kinesis recovery:
    - Kinesis throttles a large number of requests while recovering
    - retries in case of throttling are not able to recover due to the small wait period
    - Kinesis throttles per second, so the wait period should be configurable for recovery
    
    The patch reads the Spark Kinesis configs (see the sketch below):
    - spark.streaming.kinesis.retry.wait.time
    - spark.streaming.kinesis.retry.max.attempts
    
    Jira : https://issues.apache.org/jira/browse/SPARK-20140
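    
    A sketch of supplying the new settings via SparkConf; the keys are the ones listed above, and the values are arbitrary examples:
    
    ```scala
    import org.apache.spark.SparkConf
    
    val conf = new SparkConf()
      .setAppName("kinesis-recovery")
      // Wait longer between throttled retries (example value).
      .set("spark.streaming.kinesis.retry.wait.time", "2000ms")
      // Allow more retry attempts before giving up (example value).
      .set("spark.streaming.kinesis.retry.max.attempts", "10")
    ```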
    
    ## How was this patch tested?
    
    Modified the KinesisBackedBlockRDDSuite.scala to run kinesis tests with the modified configurations. Wasn't able to test the patch with actual throttling.
    
    Author: Yash Sharma <ys...@atlassian.com>
    
    Closes #17467 from yssharma/ysharma/spark-kinesis-retries.
    
    (cherry picked from commit 38f4e8692ce3b6cbcfe0c1aff9b5e662f7a308b7)
    Signed-off-by: Burak Yavuz <br...@gmail.com>

commit d42c67a1f9724c68b15b7ffafa0c7256b7d86fb2
Author: Josh Rosen <jo...@databricks.com>
Date:   2017-05-17T05:04:21Z

    [SPARK-20776] Fix perf. problems in JobProgressListener caused by TaskMetrics construction
    
    ## What changes were proposed in this pull request?
    
    In
    
    ```
    ./bin/spark-shell --master=local[64]
    ```
    
    I ran
    
    ```
    sc.parallelize(1 to 100000, 100000).count()
    ```
    and profiled the time spent in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent in `TaskMetrics.empty` calls in `JobProgressListener.onTaskStart`. It turns out that we can slightly refactor to remove the need to construct one empty instance per call, greatly improving the performance of this code.
    
    The performance gains here help to avoid an issue where listener events would be dropped because the JobProgressListener couldn't keep up with the throughput.
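    
    A minimal sketch of the allocation-avoidance idea with made-up names (not the actual JobProgressListener code): absence of an entry stands in for "empty metrics", and a single shared empty instance is returned on lookup.
    
    ```scala
    final case class Metrics(runTimeMs: Long, gcTimeMs: Long)
    
    object Metrics {
      val Empty = Metrics(0L, 0L)   // one shared instance instead of one per event
    }
    
    class TaskTracker {
      private val metricsByTask = scala.collection.mutable.HashMap.empty[Long, Metrics]
    
      def onTaskStart(taskId: Long): Unit = {
        // Before: metricsByTask(taskId) = <fresh "empty" metrics>  -- one allocation per event.
        // After: record nothing; a missing entry means "no metrics yet".
      }
    
      def onMetricsUpdate(taskId: Long, m: Metrics): Unit =
        metricsByTask(taskId) = m
    
      def metricsFor(taskId: Long): Metrics =
        metricsByTask.getOrElse(taskId, Metrics.Empty)
    }
    ```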
    
    **Before:**
    
    ![image](https://cloud.githubusercontent.com/assets/50748/26133095/95bcd42a-3a59-11e7-8051-a50550e447b8.png)
    
    **After:**
    
    ![image](https://cloud.githubusercontent.com/assets/50748/26133070/7935e148-3a59-11e7-8c2d-73d5aa5a2397.png)
    
    ## How was this patch tested?
    
    Benchmarks described above.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #18008 from JoshRosen/nametoaccums-improvements.
    
    (cherry picked from commit 30e0557dbc134898ee65fe67d31054dcc8728576)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit dac0b50b68d18c95a9968bc90a013396a42cc526
Author: Andrew Ray <ra...@gmail.com>
Date:   2017-05-17T09:06:01Z

    [SPARK-20769][DOC] Incorrect documentation for using Jupyter notebook
    
    ## What changes were proposed in this pull request?
    
    SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS=notebook from the documentation for using pyspark with a Jupyter notebook. This patch corrects the documentation error.
    
    ## How was this patch tested?
    
    Tested invocation locally with
    ```bash
    PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark
    ```
    
    Author: Andrew Ray <ra...@gmail.com>
    
    Closes #18001 from aray/patch-1.
    
    (cherry picked from commit 1995417696a028f8a4fa7f706a77537c7182528d)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 2db6101001512fe80998d99cfb972ee51614dfcc
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-05-17T21:13:49Z

    [SPARK-20788][CORE] Fix the Executor task reaper's false alarm warning logs
    
    ## What changes were proposed in this pull request?
    
    The executor task reaper may fail to detect whether a task is finished when the task is finishing but being killed at the same time.
    
    The fix is pretty easy, just flip the "finished" flag when a task is successful.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #18021 from zsxwing/SPARK-20788.
    
    (cherry picked from commit f8e0f0f47c15ddd646b9f295b91d6748583fe011)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit b8fa79cec7a15e748bf9916e8a3c6476e0d350a3
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-05-18T00:21:46Z

    [SPARK-13747][CORE] Add ThreadUtils.awaitReady and disallow Await.ready
    
    ## What changes were proposed in this pull request?
    
    Add `ThreadUtils.awaitReady` similar to `ThreadUtils.awaitResult` and disallow `Await.ready`.
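    
    A minimal sketch of what an awaitReady-style helper can look like (an illustration, not Spark's exact code): wrap `Await.ready` so failures surface with the caller's stack trace.
    
    ```scala
    import scala.concurrent.{Await, Awaitable}
    import scala.concurrent.duration.Duration
    import scala.util.control.NonFatal
    
    object ThreadUtilsSketch {
      def awaitReady[T](awaitable: Awaitable[T], atMost: Duration): awaitable.type = {
        try {
          Await.ready(awaitable, atMost)
        } catch {
          // Re-wrap so the exception carries the caller's stack trace rather than
          // the internal awaiting thread's.
          case NonFatal(t) => throw new RuntimeException("Exception thrown while waiting", t)
        }
      }
    }
    ```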
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #17763 from zsxwing/awaitready.
    
    (cherry picked from commit 324a904d8e80089d8865e4c7edaedb92ab2ec1b2)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit ba0117c2716a6a3b9810bc17b67f9f502c49fa9b
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-05-18T03:54:09Z

    [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest.
    
    ## What changes were proposed in this pull request?
    Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```.
    
    ## How was this patch tested?
    Generate docs and run examples manually, successfully.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #17994 from yanboliang/spark-20505.
    
    (cherry picked from commit 697a5e5517e32c5ef44c273e3b26662d0eb70f24)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit c708b14803ae461b5c721b2aebb10b5bbd2e1d26
Author: Xingbo Jiang <xi...@databricks.com>
Date:   2017-05-18T06:32:31Z

    [SPARK-20700][SQL] InferFiltersFromConstraints stackoverflows for query (v2)
    
    ## What changes were proposed in this pull request?
    
    In the previous approach we used `aliasMap` to link an `Attribute` to an expression of potentially the form `f(a, b)`, but we only searched `expressions` and `children.expressions` for this, which is not enough when an `Alias` lies deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus fail to prevent the recursive deductions.
    
    We fix this problem by collecting all `Alias`es from the logical plan.
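    
    A toy sketch of the traversal idea with a made-up mini-plan (not Catalyst code): gather aliases from every node in the tree, not only from the root's expressions and its direct children.
    
    ```scala
    // Hypothetical mini logical plan: each node defines some aliases and has children.
    case class PlanNode(aliases: Map[String, String], children: Seq[PlanNode] = Nil)
    
    def collectAllAliases(plan: PlanNode): Map[String, String] =
      plan.aliases ++ plan.children.flatMap(collectAllAliases)
    
    // An alias defined two levels down is still visible at the top of the plan.
    val deep = PlanNode(Map("c" -> "f(a, b)"))
    val plan = PlanNode(Map.empty, Seq(PlanNode(Map.empty, Seq(deep))))
    assert(collectAllAliases(plan) == Map("c" -> "f(a, b)"))
    ```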
    
    ## How was this patch tested?
    
    No additional test case is added, but one existing test case is modified to cover this situation.
    
    Author: Xingbo Jiang <xi...@databricks.com>
    
    Closes #18020 from jiangxb1987/inferConstrants.
    
    (cherry picked from commit b7aac15d566b048c20c2491fbf376b727f2eeb68)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit db821fe55c99e29dc246c2c3156a1fff3a7ec2a5
Author: liuzhaokun <li...@zte.com.cn>
Date:   2017-05-18T16:44:40Z

    [SPARK-20796] the location of start-master.sh in spark-standalone.md is wrong
    
    [https://issues.apache.org/jira/browse/SPARK-20796](https://issues.apache.org/jira/browse/SPARK-20796)
    the location of start-master.sh in spark-standalone.md should be "sbin/start-master.sh" rather than "bin/start-master.sh".
    
    Author: liuzhaokun <li...@zte.com.cn>
    
    Closes #18027 from liu-zhaokun/sbin.
    
    (cherry picked from commit 99452df44fb98c2721d427da4c97f549793615fe)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 8b0cb3a7be138d0f2059731ed4bbd8d01f599497
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-05-18T17:52:23Z

    [SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names
    
    ## What changes were proposed in this pull request?
    
    This is an alternative workaround that simply avoids predicate pushdown for columns having dots in their names. It is a different approach from https://github.com/apache/spark/pull/17680.
    
    The downside of this PR is that it literally does not push down filters on columns having dots in Parquet files at all (neither at the record level nor at the row-group level), whereas the downside of the approach in that PR is that it does not use Parquet's API properly, but in a hacky way, to support this case.
    
    I assume we prefer a safe way here by using the Parquet API properly, but this does close that PR, as we are basically just avoiding the problem here.
    
    This looks like a simple workaround and is probably fine given the problem is arguably a rather corner case (although it might end up reading whole row groups under the hood, so neither option looks ideal).
    
    Currently, if there are dots in the column name, predicate pushdown seems to fail in Parquet.
    
    **With dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    +--------+
    ```
    
    **Without dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("coldots").write.parquet(path)
    spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
    ```
    
    ```
    +-------+
    |coldots|
    +-------+
    |      1|
    +-------+
    ```
    
    **After**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    |       1|
    +--------+
    ```
    
    ## How was this patch tested?
    
    Unit tests added in `ParquetFilterSuite`.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #18000 from HyukjinKwon/SPARK-20364-workaround.
    
    (cherry picked from commit 8fb3d5c6da30922458091837eec17ccca502098a)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 556ad019fa49deb40ba8da3aa6067484ab3d6331
Author: Yash Sharma <ys...@atlassian.com>
Date:   2017-05-18T18:24:33Z

    [DSTREAM][DOC] Add documentation for kinesis retry configurations
    
    ## What changes were proposed in this pull request?
    
    The changes were merged as part of - https://github.com/apache/spark/pull/17467.
    The documentation was missed somewhere in the review iterations. Adding the documentation where it belongs.
    
    ## How was this patch tested?
    Docs. Not tested.
    
    cc budde , brkyvz
    
    Author: Yash Sharma <ys...@atlassian.com>
    
    Closes #18028 from yssharma/ysharma/kinesis_retry_docs.
    
    (cherry picked from commit 92580bd0eae5dbf739573093cca1b12fd0c14049)
    Signed-off-by: Burak Yavuz <br...@gmail.com>

commit 2eed4c96a5c3c9a7f318a96368493bb6fad2945d
Author: Ala Luszczak <al...@databricks.com>
Date:   2017-05-19T11:18:48Z

    [SPARK-20798] GenerateUnsafeProjection should check if a value is null before calling the getter
    
    ## What changes were proposed in this pull request?
    
    GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior.
    
    Example of code generated before:
    ```scala
    /* 059 */         final UTF8String fieldName = value.getUTF8String(0);
    /* 060 */         if (value.isNullAt(0)) {
    /* 061 */           rowWriter1.setNullAt(0);
    /* 062 */         } else {
    /* 063 */           rowWriter1.write(0, fieldName);
    /* 064 */         }
    ```
    
    Example of code generated now:
    ```scala
    /* 060 */         boolean isNull1 = value.isNullAt(0);
    /* 061 */         UTF8String value1 = isNull1 ? null : value.getUTF8String(0);
    /* 062 */         if (isNull1) {
    /* 063 */           rowWriter1.setNullAt(0);
    /* 064 */         } else {
    /* 065 */           rowWriter1.write(0, value1);
    /* 066 */         }
    ```
    
    ## How was this patch tested?
    
    Adds GenerateUnsafeProjectionSuite.
    
    Author: Ala Luszczak <al...@databricks.com>
    
    Closes #18030 from ala/fix-generate-unsafe-projection.
    
    (cherry picked from commit ce8edb8bf4db5f82bcfeb11efbdf5229b0d25dfa)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 939b9536fa8f547b7df59c3c22caee9fd0f58688
Author: tpoterba <tp...@broadinstitute.org>
Date:   2017-05-19T12:17:12Z

    [SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields
    
    Fix quadratic List indexing in ParquetWriteSupport.
    
    I noticed this function while profiling some code today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate any other function call.
    
    ## What changes were proposed in this pull request?
    
    The writeFields method iterates from 0 until the number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so indexing is a linear operation. The complexity of the writeFields method is thus quadratic in the number of fields.
    
    Solution: explicitly convert rootFieldWriters to Array (implicitly converted to WrappedArray) for constant-time indexing.
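    
    A small sketch of the complexity difference (illustrative names, not the actual ParquetWriteSupport code): `List(i)` is O(i), so indexing inside the field loop is quadratic overall, while a one-time conversion to Array makes each lookup O(1).
    
    ```scala
    type FieldWriter = Int => Unit
    
    def writeFieldsQuadratic(writers: List[FieldWriter], numFields: Int): Unit = {
      var i = 0
      while (i < numFields) {
        writers(i)(i)              // linear-time List indexing on every iteration
        i += 1
      }
    }
    
    def writeFieldsLinear(writers: List[FieldWriter], numFields: Int): Unit = {
      val arr = writers.toArray    // convert once, then index in constant time
      var i = 0
      while (i < numFields) {
        arr(i)(i)
        i += 1
      }
    }
    ```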
    
    ## How was this patch tested?
    
    This is a one-line change for performance reasons.
    
    Author: tpoterba <tp...@broadinstitute.org>
    Author: Tim Poterba <tp...@gmail.com>
    
    Closes #18005 from tpoterba/tpoterba-patch-1.
    
    (cherry picked from commit 3f2cd51ee06f2c6d735754e5440bc4b74f8dcbc8)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 001b82c18cd6518e9e6ae2e6f6d0de3dbc639943
Author: liuzhaokun <li...@zte.com.cn>
Date:   2017-05-19T14:26:39Z

    [SPARK-20759] SCALA_VERSION in _config.yml should be consistent with pom.xml
    
    [https://issues.apache.org/jira/browse/SPARK-20759](https://issues.apache.org/jira/browse/SPARK-20759)
    SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with pom.xml.
    
    Author: liuzhaokun <li...@zte.com.cn>
    
    Closes #17992 from liu-zhaokun/new.
    
    (cherry picked from commit dba2ca2c129b6d2597f1707e0315d4e238c40ed6)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 43f9fb7984c898dd7bab380ba8f6ca72b4e7d7e3
Author: liuxian <li...@zte.com.cn>
Date:   2017-05-19T17:25:21Z

    [SPARK-20763][SQL] The function of `month` and `day` return the value which is not we expected.
    
    ## What changes were proposed in this pull request?
    spark-sql>select month("1582-09-28");
    spark-sql>10
    For this case, the expected result is 9, but it is 10.
    
    spark-sql>select day("1582-04-18");
    spark-sql>28
    For this case, the expected result is 18, but it is 28.
    
    For dates before "1582-10-04", the `month` and `day` functions return values that are not what we expect.
    
    ## How was this patch tested?
    unit tests
    
    Author: liuxian <li...@zte.com.cn>
    
    Closes #17997 from 10110346/wip_lx_0516.
    
    (cherry picked from commit ea3b1e352a605cd35cdee987d0e5eb8528ef1b45)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 4fcd52b48825400acf54f4b021f365ad6414c57a
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-19T18:51:56Z

    [SPARK-20506][DOCS] 2.2 migration guide
    
    Update ML guide for migration `2.1` -> `2.2` and the previous version migration guide section.
    
    ## How was this patch tested?
    
    Build doc locally.
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #17996 from MLnick/SPARK-20506-2.2-migration-guide.
    
    (cherry picked from commit b5d8d9ba17d62167cfbacd5f6188a8b4a5b8a2be)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 3aad5982a80c300a6c86b876340da85c64cd6ac6
Author: liuzhaokun <li...@zte.com.cn>
Date:   2017-05-19T19:47:30Z

    [SPARK-20781] the location of Dockerfile in docker.properties.templat is wrong
    
    [https://issues.apache.org/jira/browse/SPARK-20781](https://issues.apache.org/jira/browse/SPARK-20781)
    the location of Dockerfile in docker.properties.template should be "../external/docker/spark-mesos/Dockerfile"
    
    Author: liuzhaokun <li...@zte.com.cn>
    
    Closes #18013 from liu-zhaokun/dockerfile_location.
    
    (cherry picked from commit 749418d285461958a0f22ed355edafd87f1ee913)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit cfd1bf0bef766a9b13fe16bcca172d4108eb4e56
Author: Tathagata Das <ta...@gmail.com>
Date:   2017-05-21T20:07:25Z

    [SPARK-20792][SS] Support same timeout operations in mapGroupsWithState function in batch queries as in streaming queries
    
    ## What changes were proposed in this pull request?
    
    Currently, in batch queries, the timeout is disabled (i.e. GroupStateTimeout.NoTimeout), which means any GroupState.setTimeout*** operation throws UnsupportedOperationException. This makes it awkward to convert a streaming query into a batch query by changing the input DF from a streaming to a batch DF: if the timeout was enabled and used, the batch query starts throwing UnsupportedOperationException.
    
    This PR creates the dummy state in batch queries with the provided timeoutConf so that it behaves in the same way. The code has been refactored to make it obvious when the state is being created for a batch query or a streaming query.
    
    ## How was this patch tested?
    Additional tests
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #18024 from tdas/SPARK-20792.
    
    (cherry picked from commit 9d6661c829a4a82aae64ed0522c44e4c3d8f4f0b)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 41d8d21655dc8462238a6252923d23a54f64067f
Author: Michal Senkyr <mi...@gmail.com>
Date:   2017-05-22T08:49:19Z

    [SPARK-19089][SQL] Add support for nested sequences
    
    ## What changes were proposed in this pull request?
    
    Replaced specific sequence encoders with generic sequence encoder to enable nesting of sequences.
    
    Does not add support for nested arrays as that cannot be solved in this way.
    
    ## How was this patch tested?
    
    ```bash
    build/mvn -DskipTests clean package && dev/run-tests
    ```
    
    Additionally in Spark shell:
    
    ```
    scala> Seq(Seq(Seq(1))).toDS.collect()
    res0: Array[Seq[Seq[Int]]] = Array(List(List(1)))
    ```
    
    Author: Michal Senkyr <mi...@gmail.com>
    
    Closes #18011 from michalsenkyr/dataset-seq-nested.
    
    (cherry picked from commit a2b3b67624ce7bbb29ddade03c1791d95e51869b)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit af1ff8b00ac7271ddf4cef87013e591e46de79e9
Author: Ignacio Bermudez <ig...@gmail.com>
Date:   2017-05-22T09:27:28Z

    [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix
    
    ## What changes were proposed in this pull request?
    
    When two Breeze SparseMatrices are combined in an operation, the result matrix may contain extra provisional 0 values in its rowIndices and data arrays. This causes an incoherence with the colPtrs data, but Breeze gets away with it by keeping a counter of the valid data.
    
    In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, but these might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using Breeze's counter of active data.
    
    This method is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.
    
    See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
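    
    A sketch of the slicing idea, assuming Breeze's `CSCMatrix` exposes `activeSize`, `rowIndices`, `data`, and `colPtrs` as described above (illustrative, not the exact Spark patch):
    
    ```scala
    import breeze.linalg.CSCMatrix
    
    def toSparseArrays(sm: CSCMatrix[Double]): (Array[Int], Array[Int], Array[Double]) = {
      // Keep only the first activeSize entries; anything beyond them is provisional
      // padding left over from Breeze's internal bookkeeping after matrix operations.
      val rowIndices = sm.rowIndices.slice(0, sm.activeSize)
      val data = sm.data.slice(0, sm.activeSize)
      (sm.colPtrs, rowIndices, data)
    }
    ```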
    
    ## How was this patch tested?
    
    Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
    
    Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
    
    Author: Ignacio Bermudez <ig...@gmail.com>
    Author: Ignacio Bermudez Corrales <ic...@splunk.com>
    
    Closes #17940 from ghoto/bug-fix/SPARK-20687.
    
    (cherry picked from commit 06dda1d58f8670e996921e935d5f5402d664699e)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 50dba3053bc352858b77f1c9558a2a37e982d386
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-22T10:29:29Z

    [SPARK-20506][DOCS] Add HTML links to highlight list in MLlib guide for 2.2
    
    Quick follow-up to #17996 - I forgot to add the HTML links to the relevant sections of the guide in the highlights list.
    
    ## How was this patch tested?
    
    Built docs locally and tested links.
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #18043 from MLnick/SPARK-20506-2.2-migration-guide-2.
    
    (cherry picked from commit be846db48b226de2b0dfb5f87d059eda15ecf7cd)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19070: Branch 2.2

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19070
  
    @wind-org close this




[GitHub] spark issue #19070: Branch 2.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19070
  
    Can one of the admins verify this patch?




[GitHub] spark issue #19070: Branch 2.2

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19070
  
    @wind-org Can you close this please?




[GitHub] spark pull request #19070: Branch 2.2

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19070

