Posted to reviews@spark.apache.org by jimmy144 <gi...@git.apache.org> on 2018/01/08 07:26:04 UTC

[GitHub] spark pull request #20185: Branch 2.3

GitHub user jimmy144 opened a pull request:

    https://github.com/apache/spark/pull/20185

    Branch 2.3

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20185.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20185
    
----
commit 5244aafc2d7945c11c96398b8d5b752b45fd148c
Author: Xianjin YE <ad...@...>
Date:   2018-01-02T15:30:38Z

    [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
    
    ## What changes were proposed in this pull request?
    Added stageAttemptId to TaskContext, with the corresponding constructor changes.
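    
    For illustration, a minimal sketch of reading the new field from task code, assuming a live SparkSession; the accessor name `stageAttemptNumber` reflects the API as merged and should be treated as an assumption here:
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.TaskContext
    
    val spark = SparkSession.builder().appName("stage-attempt-demo").getOrCreate()
    spark.sparkContext.parallelize(1 to 4, 2).foreachPartition { _ =>
      val ctx = TaskContext.get()
      // stageAttemptNumber is 0 for the first run of a stage and increments
      // when the stage is resubmitted after a failure (assumed accessor name).
      println(s"stage=${ctx.stageId()} attempt=${ctx.stageAttemptNumber()} partition=${ctx.partitionId()}")
    }
    ```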
    
    ## How was this patch tested?
    Added a new test in TaskContextSuite covering two cases:
    1. Normal case without failure
    2. Exception case with resubmitted stages
    
    Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897)
    
    Author: Xianjin YE <ad...@gmail.com>
    
    Closes #20082 from advancedxy/SPARK-22897.
    
    (cherry picked from commit a6fc300e91273230e7134ac6db95ccb4436c6f8f)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit b96a2132413937c013e1099be3ec4bc420c947fd
Author: Juliusz Sompolski <ju...@...>
Date:   2018-01-03T13:40:51Z

    [SPARK-22938] Assert that SQLConf.get is accessed only on the driver.
    
    ## What changes were proposed in this pull request?
    
    Assert if code tries to access SQLConf.get on an executor.
    This can lead to hard-to-detect bugs, where the executor reads fallbackConf and falls back to default config values, ignoring potentially changed non-default configs.
    If a config is to be passed to executor code, it needs to be read on the driver and passed explicitly, as in the sketch below.
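    
    A minimal sketch of the intended pattern, assuming a live SparkSession: read the config once on the driver and let the task closure capture the value.
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().getOrCreate()
    // Read the (possibly non-default) value on the driver...
    val caseSensitive = spark.conf.get("spark.sql.caseSensitive")
    // ...and capture it in the closure, rather than calling SQLConf.get on an
    // executor, which after this change fails an assertion instead of silently
    // returning fallback defaults.
    spark.sparkContext.parallelize(1 to 2).foreach { _ =>
      println(s"caseSensitive seen by this task: $caseSensitive")
    }
    ```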
    
    ## How was this patch tested?
    
    Checked with existing tests.
    
    Author: Juliusz Sompolski <ju...@databricks.com>
    
    Closes #20136 from juliuszsompolski/SPARK-22938.
    
    (cherry picked from commit 247a08939d58405aef39b2a4e7773aa45474ad12)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit a05e85ecb76091567a26a3a14ad0879b4728addc
Author: gatorsmile <ga...@...>
Date:   2018-01-03T14:09:30Z

    [SPARK-22934][SQL] Make optional clauses order insensitive for CREATE TABLE SQL statement
    
    ## What changes were proposed in this pull request?
    Currently, our CREATE TABLE syntax requires the EXACT order of clauses, which is hard to remember. Thus, this PR makes the optional clauses order insensitive for the `CREATE TABLE` SQL statement (see the example after the clause lists below).
    
    ```
    CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
        [(col_name1 col_type1 [COMMENT col_comment1], ...)]
        USING datasource
        [OPTIONS (key1=val1, key2=val2, ...)]
        [PARTITIONED BY (col_name1, col_name2, ...)]
        [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
        [LOCATION path]
        [COMMENT table_comment]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
        [AS select_statement]
    ```
    
    The proposal is to make the following clauses order insensitive.
    ```
        [OPTIONS (key1=val1, key2=val2, ...)]
        [PARTITIONED BY (col_name1, col_name2, ...)]
        [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
        [LOCATION path]
        [COMMENT table_comment]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
    ```
    
    The same idea also applies to creating Hive tables.
    ```
    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
        [(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
        [COMMENT table_comment]
        [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
        [ROW FORMAT row_format]
        [STORED AS file_format]
        [LOCATION path]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
        [AS select_statement]
    ```
    
    The proposal is to make the following clauses order insensitive.
    ```
        [COMMENT table_comment]
        [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
        [ROW FORMAT row_format]
        [STORED AS file_format]
        [LOCATION path]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
    ```
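    
    As an illustration of the data source syntax (hypothetical tables, assuming a SparkSession `spark`), both statements below should now parse identically despite the different clause order:
    ```scala
    spark.sql("""
      CREATE TABLE t1 (a INT, b STRING) USING parquet
      COMMENT 'demo' TBLPROPERTIES ('k' = 'v')
    """)
    spark.sql("""
      CREATE TABLE t2 (a INT, b STRING) USING parquet
      TBLPROPERTIES ('k' = 'v') COMMENT 'demo'
    """)
    ```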
    
    ## How was this patch tested?
    Added test cases
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20133 from gatorsmile/createDataSourceTableDDL.
    
    (cherry picked from commit 1a87a1609c4d2c9027a2cf669ea3337b89f61fb6)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit b96248862589bae1ddcdb14ce4c802789a001306
Author: Wenchen Fan <we...@...>
Date:   2018-01-03T14:18:13Z

    [SPARK-20236][SQL] dynamic partition overwrite
    
    ## What changes were proposed in this pull request?
    
    When overwriting a partitioned table with dynamic partition columns, the behavior is different between data source and Hive tables.
    
    data source table: deletes all partition directories that match the static partition values provided in the insert statement.
    
    hive table: only deletes partition directories that have data written into them.
    
    This PR adds a new config that lets users opt into Hive's behavior; a usage sketch follows.
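    
    A minimal usage sketch, assuming a SparkSession `spark`, a DataFrame `df` matching the table's schema, and a hypothetical partitioned table; the config key is the one added as merged:
    ```scala
    // "dynamic" opts into the Hive-like behavior: only partitions that actually
    // receive data are replaced, instead of every matching partition directory.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df.write.mode("overwrite").insertInto("partitioned_table")  // hypothetical table
    ```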
    
    ## How was this patch tested?
    
    new tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #18714 from cloud-fan/overwrite-partition.
    
    (cherry picked from commit a66fe36cee9363b01ee70e469f1c968f633c5713)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 27c949d673e45fdbbae0f2c08969b9d51222dd8d
Author: gatorsmile <ga...@...>
Date:   2018-01-02T01:19:18Z

    [SPARK-22932][SQL] Refactor AnalysisContext
    
    ## What changes were proposed in this pull request?
    Add a `reset` function to ensure the state in `AnalysisContext` is per-query.
    
    ## How was this patch tested?
    The existing test cases
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20127 from gatorsmile/refactorAnalysisContext.

commit 79f7263daa5f83e2026fda9a8bbb1090a1333f80
Author: chetkhatri <ck...@...>
Date:   2018-01-03T17:31:32Z

    [SPARK-22896] Improvement in String interpolation
    
    ## What changes were proposed in this pull request?
    
    * String interpolation in the ML pipeline example has been corrected per the Scala standard.
    
    ## How was this patch tested?
    * Manually tested.
    
    Author: chetkhatri <ck...@gmail.com>
    
    Closes #20070 from chetkhatri/mllib-chetan-contrib.
    
    (cherry picked from commit 9a2b65a3c0c36316aae0a53aa0f61c5044c2ceff)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit a51212b642f05f28447b80aa29f5482de2c27f58
Author: Wenchen Fan <we...@...>
Date:   2018-01-03T23:28:53Z

    [SPARK-20960][SQL] make ColumnVector public
    
    ## What changes were proposed in this pull request?
    
    Move `ColumnVector` and related classes to `org.apache.spark.sql.vectorized`, and improve the documentation.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #20116 from cloud-fan/column-vector.
    
    (cherry picked from commit b297029130735316e1ac1144dee44761a12bfba7)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit f51c8fde8bf08705bacf8a93b5dba685ebbcec17
Author: Wenchen Fan <we...@...>
Date:   2018-01-04T05:14:52Z

    [SPARK-22944][SQL] improve FoldablePropagation
    
    ## What changes were proposed in this pull request?
    
    `FoldablePropagation` is a little tricky, as it needs to handle attributes that are mis-derived from children, e.g. outer join outputs. This rule previously did a kind of stoppable tree transform, skipping itself upon hitting a node that may have mis-derived attributes.
    
    Logically we should be able to apply this rule above the unsupported nodes by just treating them as leaf nodes. This PR improves the rule so that it no longer stops the tree transformation, but instead reduces the set of foldable expressions to propagate.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #20139 from cloud-fan/foldable.
    
    (cherry picked from commit 7d045c5f00e2c7c67011830e2169a4e130c3ace8)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 1860a43e9affb7619be0a5a1c786e264d09bc446
Author: Felix Cheung <fe...@...>
Date:   2018-01-04T05:43:14Z

    [SPARK-22933][SPARKR] R Structured Streaming API for withWatermark, trigger, partitionBy
    
    ## What changes were proposed in this pull request?
    
    R Structured Streaming API for withWatermark, trigger, partitionBy
    
    ## How was this patch tested?
    
    manual, unit tests
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #20129 from felixcheung/rwater.
    
    (cherry picked from commit df95a908baf78800556636a76d58bba9b3dd943f)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit a7cfd6beaf35f79a744047a4a09714ef1da60293
Author: Kent Yao <ya...@...>
Date:   2018-01-04T11:10:10Z

    [SPARK-22950][SQL] Handle ChildFirstURLClassLoader's parent
    
    ## What changes were proposed in this pull request?
    
    ChildFirstURLClassLoader's parent is set to null, so we can't get jars from its parent. This causes a ClassNotFoundException during HiveClient initialization with built-in Hive jars; in that case we should use the Spark context class loader instead.
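    
    A rough sketch of the failure mode (not the actual fix): collecting jar URLs by walking `getParent()` stops at a child-first loader, because its JDK-level parent is deliberately null, which is why falling back to the Spark context class loader is needed.
    ```scala
    import java.net.{URL, URLClassLoader}
    
    // Walking the parent chain yields nothing past a loader whose parent is null,
    // even though jars are reachable elsewhere (e.g. via the context class loader).
    def jarUrls(loader: ClassLoader): Seq[URL] = loader match {
      case null => Seq.empty
      case u: URLClassLoader => u.getURLs.toSeq ++ jarUrls(u.getParent)
      case other => jarUrls(other.getParent)
    }
    ```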
    
    ## How was this patch tested?
    
    Added a new unit test.
    cc cloud-fan gatorsmile
    
    Author: Kent Yao <ya...@hotmail.com>
    
    Closes #20145 from yaooqinn/SPARK-22950.
    
    (cherry picked from commit 9fa703e89318922393bae03c0db4575f4f4b4c56)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit eb99b8adecc050240ce9d5e0b92a20f018df465e
Author: Wenchen Fan <we...@...>
Date:   2018-01-04T11:17:22Z

    [SPARK-22945][SQL] add java UDF APIs in the functions object
    
    ## What changes were proposed in this pull request?
    
    Currently, Scala users can use a UDF like this:
    ```
    val foo = udf((i: Int) => Math.random() + i).asNondeterministic
    df.select(foo('a))
    ```
    Python users can do the same with similar APIs. However, Java users can't, so we should add Java UDF APIs to the functions object.
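    
    A sketch of the added Java-friendly overload, written in Scala against a hypothetical DataFrame `df` with an integer column "a":
    ```scala
    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.DoubleType
    
    // udf(f: UDF1[_, _], returnType: DataType) is the kind of overload this change
    // adds to the functions object, so Java code can build column-based UDFs too.
    val foo = udf(new UDF1[Int, Double] {
      override def call(i: Int): Double = Math.random() + i
    }, DoubleType).asNondeterministic()
    df.select(foo(df("a")))
    ```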
    
    ## How was this patch tested?
    
    new tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #20141 from cloud-fan/udf.
    
    (cherry picked from commit d5861aba9d80ca15ad3f22793b79822e470d6913)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 1f5e3540c7535ceaea66ebd5ee2f598e8b3ba1a5
Author: gatorsmile <ga...@...>
Date:   2018-01-04T13:07:31Z

    [SPARK-22939][PYSPARK] Support Spark UDF in registerFunction
    
    ## What changes were proposed in this pull request?
    ```Python
    import random
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType
    random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
    spark.catalog.registerFunction("random_udf", random_udf, StringType())
    spark.sql("SELECT random_udf()").collect()
    ```
    
    We will get the following error.
    ```
    Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
    py4j.Py4JException: Method __getnewargs__([]) does not exist
    	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    	at py4j.Gateway.invoke(Gateway.java:274)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:214)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    This PR adds support for it.
    
    ## How was this patch tested?
    WIP
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20137 from gatorsmile/registerFunction.
    
    (cherry picked from commit 5aadbc929cb194e06dbd3bab054a161569289af5)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit bcfeef5a944d56af1a5106f5c07296ea2c262991
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-04T13:15:10Z

    [SPARK-22771][SQL] Add a missing return statement in Concat.checkInputDataTypes
    
    ## What changes were proposed in this pull request?
    This PR is a follow-up that fixes a bug left in #19977.
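    
    A simplified illustration of this bug class (not Concat's actual code): in Scala, an early-exit check needs an explicit `return`, otherwise the branch's value is discarded and execution falls through.
    ```scala
    def check(inputs: Seq[Int]): String = {
      if (inputs.isEmpty) {
        "failure: no inputs"  // BUG: without `return`, this value is discarded
      }
      "success"               // reached even for empty input
    }
    // The fix is `return "failure: no inputs"` inside the if-branch.
    ```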
    
    ## How was this patch tested?
    Added tests in `StringExpressionsSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20149 from maropu/SPARK-22771-FOLLOWUP.
    
    (cherry picked from commit 6f68316e98fad72b171df422566e1fc9a7bbfcde)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit cd92913f345c8d932d3c651626c7f803e6abdcdb
Author: jerryshao <ss...@...>
Date:   2018-01-04T19:39:42Z

    [SPARK-21475][CORE][2ND ATTEMPT] Change to use NIO's Files API for external shuffle service
    
    ## What changes were proposed in this pull request?
    
    This PR is the second attempt of #18684. NIO's Files API doesn't override the `skip` method for `InputStream`, so it brings in a performance issue (mentioned in #20119). But using `FileInputStream`/`FileOutputStream` also brings in a memory issue (https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful), which is severe for a long-running external shuffle service. So this proposal only fixes the external shuffle service related code.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #20144 from jerryshao/SPARK-21475-v2.
    
    (cherry picked from commit 93f92c0ed7442a4382e97254307309977ff676f8)
    Signed-off-by: Shixiong Zhu <zs...@gmail.com>

commit bc4bef472de0e99f74a80954d694c3d1744afe3a
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-04T22:19:00Z

    [SPARK-22850][CORE] Ensure queued events are delivered to all event queues.
    
    The code in LiveListenerBus was queueing events before start in the
    queues themselves; so in situations like the following:
    
       bus.post(someEvent)
       bus.addToEventLogQueue(listener)
       bus.start()
    
    "someEvent" would not be delivered to "listener" if that was the first
    listener in the queue, because the queue wouldn't exist when the
    event was posted.
    
This change buffers the events in the bus itself before the bus is started,
so that they can be delivered to all registered queues when the bus is
started.
    
    Also tweaked the unit tests to cover the behavior above.
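
A toy sketch of the buffering idea (not Spark's actual classes): events posted
before start() are held in the bus itself and replayed into every queue that
exists at start time, so queues added after the post still see the event.

```scala
import scala.collection.mutable.ArrayBuffer

class ToyBus[E] {
  private val queues = ArrayBuffer.empty[ArrayBuffer[E]]
  private val buffered = ArrayBuffer.empty[E]
  private var started = false

  def addQueue(queue: ArrayBuffer[E]): Unit = queues += queue

  def post(event: E): Unit =
    if (started) queues.foreach(_ += event) else buffered += event

  def start(): Unit = {
    started = true
    for (event <- buffered; queue <- queues) queue += event  // replay everywhere
    buffered.clear()
  }
}
```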
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20039 from vanzin/SPARK-22850.
    
    (cherry picked from commit d2cddc88eac32f26b18ec26bb59e85c6f09a8c88)
    Signed-off-by: Imran Rashid <ir...@cloudera.com>

commit 2ab4012adda941ebd637bd248f65cefdf4aaf110
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-04T23:00:09Z

    [SPARK-22948][K8S] Move SparkPodInitContainer to correct package.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20156 from vanzin/SPARK-22948.
    
    (cherry picked from commit 95f9659abe8845f9f3f42fd7ababd79e55c52489)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 84707f0c6afa9c5417e271657ff930930f82213c
Author: Yinan Li <li...@...>
Date:   2018-01-04T23:35:20Z

    [SPARK-22953][K8S] Avoids adding duplicated secret volumes when init-container is used
    
    ## What changes were proposed in this pull request?
    
    User-specified secrets are mounted into both the main container and init-container (when it is used) in a Spark driver/executor pod, using the `MountSecretsBootstrap`. Because `MountSecretsBootstrap` always adds new secret volumes for the secrets to the pod, the same secret volumes get added twice, one when mounting the secrets to the main container, and the other when mounting the secrets to the init-container. This PR fixes the issue by separating `MountSecretsBootstrap.mountSecrets` out into two methods: `addSecretVolumes` for adding secret volumes to a pod and `mountSecrets` for mounting secret volumes to a container, respectively. `addSecretVolumes` is only called once for each pod, whereas `mountSecrets` is called individually for the main container and the init-container (if it is used).
    
    Ref: https://github.com/apache-spark-on-k8s/spark/issues/594.
    
    ## How was this patch tested?
    Unit tested and manually tested.
    
    vanzin This replaces https://github.com/apache/spark/pull/20148.
    hex108 foxish kimoonkim
    
    Author: Yinan Li <li...@gmail.com>
    
    Closes #20159 from liyinan926/master.
    
    (cherry picked from commit e288fc87a027ec1e1a21401d1f151df20dbfecf3)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit ea9da6152af9223787cffd83d489741b4cc5aa34
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-05T00:34:56Z

    [SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly.
    
    - Make it possible to build images from a git clone.
    - Make it easy to use minikube to test things.
    
    Also fixed what seemed like a bug: the base image wasn't getting the tag
    provided in the command line. Adding the tag allows users to use multiple
    Spark builds in the same kubernetes cluster.
    
    Tested by deploying images on minikube and running spark-submit from a dev
    environment; also by building the images with different tags and verifying
    "docker images" in minikube.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20154 from vanzin/SPARK-22960.
    
    (cherry picked from commit 0428368c2c5e135f99f62be20877bbbda43be310)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 158f7e6a93b5acf4ce05c97b575124fd599cf927
Author: Juliusz Sompolski <ju...@...>
Date:   2018-01-05T02:16:34Z

    [SPARK-22957] ApproxQuantile breaks if the number of rows exceeds MaxInt
    
    ## What changes were proposed in this pull request?
    
    A 32-bit Int was used for the row rank, which overflowed in a DataFrame with more than 2B rows.
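    
    A minimal illustration of the overflow:
    ```scala
    // Int arithmetic wraps silently past 2^31 - 1, so a 32-bit rank goes negative.
    val rank: Int = Int.MaxValue          // 2147483647
    println(rank + 1)                     // -2147483648: wrapped negative rank
    println(Int.MaxValue.toLong + 1)      // 2147483648: a 64-bit rank keeps counting
    ```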
    
    ## How was this patch tested?
    
    Added a test, but left it ignored, as it takes 4 minutes.
    
    Author: Juliusz Sompolski <ju...@databricks.com>
    
    Closes #20152 from juliuszsompolski/SPARK-22957.
    
    (cherry picked from commit df7fc3ef3899cadd252d2837092bebe3442d6523)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 145820bda140d1385c4dd802fa79a871e6bf98be
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-05T06:02:21Z

    [SPARK-22825][SQL] Fix incorrect results of Casting Array to String
    
    ## What changes were proposed in this pull request?
    This PR fixes incorrect results when casting arrays to strings;
    ```
    scala> val df = spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))
    scala> df.write.saveAsTable("t")
    scala> sql("SELECT cast(ids as String) FROM t").show(false)
    +------------------------------------------------------------------+
    |ids                                                               |
    +------------------------------------------------------------------+
    |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df|
    +------------------------------------------------------------------+
    ```
    
    This PR changes the result to:
    ```
    +------------------------------+
    |ids                           |
    +------------------------------+
    |[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]|
    +------------------------------+
    ```
    
    ## How was this patch tested?
    Added tests in `CastSuite` and `SQLQuerySuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20024 from maropu/SPARK-22825.
    
    (cherry picked from commit 52fc5c17d9d784b846149771b398e741621c0b5c)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 5b524cc0cd5a82e4fb0681363b6641e40b37075d
Author: Bago Amirbekian <ba...@...>
Date:   2018-01-05T06:45:15Z

    [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit
    
    ## What changes were proposed in this pull request?
    
    Avoid holding all models in memory for `TrainValidationSplit`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Bago Amirbekian <ba...@databricks.com>
    
    Closes #20143 from MrBago/trainValidMemoryFix.
    
    (cherry picked from commit cf0aa65576acbe0209c67f04c029058fd73555c1)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit f9dcdbcefb545ced3f5b457e1e88c88a8e180f9f
Author: Yinan Li <li...@...>
Date:   2018-01-05T07:23:41Z

    [SPARK-22757][K8S] Enable spark.jars and spark.files in KUBERNETES mode
    
    ## What changes were proposed in this pull request?
    
    We missed enabling `spark.files` and `spark.jars` in https://github.com/apache/spark/pull/19954. The result is that remote dependencies specified through `spark.files` or `spark.jars` are not included in the list of remote dependencies to be downloaded by the init-container. This PR fixes it.
    
    ## How was this patch tested?
    
    Manual tests.
    
    vanzin This replaces https://github.com/apache/spark/pull/20157.
    
    foxish
    
    Author: Yinan Li <li...@gmail.com>
    
    Closes #20160 from liyinan926/SPARK-22757.
    
    (cherry picked from commit 6cff7d19f6a905fe425bd6892fe7ca014c0e696b)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit fd4e30476894b7c37cc2ae6243a941f0bc90388d
Author: Adrian Ionescu <ad...@...>
Date:   2018-01-05T13:32:39Z

    [SPARK-22961][REGRESSION] Constant columns should generate QueryPlanConstraints
    
    ## What changes were proposed in this pull request?
    
    #19201 introduced the following regression: given something like `df.withColumn("c", lit(2))`, we're no longer picking up `c === 2` as a constraint or inferring filters from it when joins are involved, which may lead to noticeable performance degradation.
    
    This patch re-enables this optimization by picking up Aliases of Literals in Projection lists as constraints and making sure they're not treated as aliased columns.
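    
    A rough sketch, assuming a SparkSession `spark`, of where the constraint matters:
    ```scala
    import org.apache.spark.sql.functions.lit
    
    // The alias of lit(2) contributes the constraint c = 2, so the optimizer can
    // infer a Filter on the join's other side; inspect the optimized plan for it.
    val left = spark.range(5).withColumn("c", lit(2))
    val right = spark.range(5).toDF("c")
    left.join(right, "c").explain(true)
    ```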
    
    ## How was this patch tested?
    
    Unit test was added.
    
    Author: Adrian Ionescu <ad...@databricks.com>
    
    Closes #20155 from adrian-ionescu/constant_constraints.
    
    (cherry picked from commit 51c33bd0d402af9e0284c6cbc0111f926446bfba)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 0a30e93507ba784729a498943e7eeda1d6f19fbf
Author: Bruce Robbins <be...@...>
Date:   2018-01-05T17:58:28Z

    [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite should succeed on platforms that don't have wget
    
    ## What changes were proposed in this pull request?
    
    Modified HiveExternalCatalogVersionsSuite.scala to use Utils.doFetchFile to download different versions of Spark binaries rather than launching wget as an external process, since on platforms that don't have wget installed this suite fails with an error.
    
    cloud-fan: would you like to check this change?
    
    ## How was this patch tested?
    
    1) test-only of HiveExternalCatalogVersionsSuite on several platforms. Tested bad mirror, read timeout, and redirects.
    2) ./dev/run-tests
    
    Author: Bruce Robbins <be...@gmail.com>
    
    Closes #20147 from bersprockets/SPARK-22940-alt.
    
    (cherry picked from commit c0b7424ecacb56d3e7a18acc11ba3d5e7be57c43)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit d1f422c1c12c8095e8522d1051a6e0e406748a3a
Author: Joseph K. Bradley <jo...@...>
Date:   2018-01-05T19:51:25Z

    [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator
    
    ## What changes were proposed in this pull request?
    
    Follow-up cleanups for the OneHotEncoderEstimator PR.  See some discussion in the original PR: https://github.com/apache/spark/pull/19527 or read below for what this PR includes:
    * configedCategorySize: I reverted this to return an Array.  I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
    * encoder: I reorganized the logic to show what I meant in the comment in the previous PR.  I think it's simpler but am open to suggestions.
    
    I also made some small style cleanups based on IntelliJ warnings.
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Joseph K. Bradley <jo...@databricks.com>
    
    Closes #20132 from jkbradley/viirya-SPARK-13030.
    
    (cherry picked from commit 930b90a84871e2504b57ed50efa7b8bb52d3ba44)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 55afac4e7b4f655aa05c5bcaf7851bb1e7699dba
Author: Gera Shegalov <ge...@...>
Date:   2018-01-06T01:25:28Z

    [SPARK-22914][DEPLOY] Register history.ui.port
    
    ## What changes were proposed in this pull request?
    
    Register spark.history.ui.port as a known Spark conf so it can be used in substitution expressions even if it's not set explicitly.
    
    ## How was this patch tested?
    
    Added unit test to demonstrate the issue
    
    Author: Gera Shegalov <ge...@apache.org>
    Author: Gera Shegalov <gs...@salesforce.com>
    
    Closes #20098 from gerashegalov/gera/register-SHS-port-conf.
    
    (cherry picked from commit ea956833017fcbd8ed2288368bfa2e417a2251c5)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit bf853018cabcd3b3abf84bfe534d2981020b4a71
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-06T01:26:03Z

    [SPARK-22937][SQL] SQL elt output binary for binary inputs
    
    ## What changes were proposed in this pull request?
    This PR modifies `elt` to output binary for binary inputs.
    `elt` in the current master always outputs data as a string. But in some databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary (so the current behavior might be a small surprise).
    This PR is related to #19977.
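    
    A hypothetical illustration, assuming a SparkSession `spark`:
    ```scala
    // With all-binary inputs, elt now yields a binary column (as in e.g. MySQL);
    // with string inputs it still yields a string.
    spark.sql("SELECT elt(1, cast('ab' AS binary), cast('cd' AS binary))").printSchema()
    // after this change: the output column's type is binary, not string
    ```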
    
    ## How was this patch tested?
    Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20135 from maropu/SPARK-22937.
    
    (cherry picked from commit e8af7e8aeca15a6107248f358d9514521ffdc6d3)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 3e3e9386ed95435a2d1817653d1402c102e380dc
Author: Yinan Li <li...@...>
Date:   2018-01-06T01:29:27Z

    [SPARK-22960][K8S] Revert use of ARG base_image in images
    
    ## What changes were proposed in this pull request?
    
    This PR reverts the `ARG base_image` before `FROM` in the images of the driver, executor, and init-container, introduced in https://github.com/apache/spark/pull/20154. The reason is that Docker versions before 17.06 do not support this usage (`ARG` before `FROM`).
    
    ## How was this patch tested?
    
    Tested manually.
    
    vanzin foxish kimoonkim
    
    Author: Yinan Li <li...@gmail.com>
    
    Closes #20170 from liyinan926/master.
    
    (cherry picked from commit bf65cd3cda46d5480bfcd13110975c46ca631972)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 7236914e5e7aeb4eb919530b6edbad70256cca52
Author: Li Jin <ic...@...>
Date:   2018-01-06T08:11:20Z

    [SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases
    
    ## What changes were proposed in this pull request?
    
    Add tests for using non-deterministic UDFs in aggregates.
    
    Update the pandas_udf docstring w.r.t. determinism.
    
    ## How was this patch tested?
    test_nondeterministic_udf_in_aggregate
    
    Author: Li Jin <ic...@gmail.com>
    
    Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.
    
    (cherry picked from commit f2dd8b923759e8771b0e5f59bfa7ae4ad7e6a339)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit e6449e8167776e3921c286d75e8cdd30ee33d77a
Author: zuotingbing <zu...@...>
Date:   2018-01-06T10:07:45Z

    [SPARK-22793][SQL] Memory leak in Spark Thrift Server
    
    ## What changes were proposed in this pull request?
    1. Start HiveThriftServer2.
    2. Connect to thriftserver through beeline.
    3. Close the beeline.
    4. repeat step2 and step 3 for many times.
    We found many directories under the paths `hive.exec.local.scratchdir` and `hive.exec.scratchdir` that are never dropped, even though the scratchdir is added to deleteOnExit when it is created. This means the cache of FileSystem `deleteOnExit` entries keeps growing until the JVM terminates.
    
    In addition, using `jmap -histo:live [PID]`
    to print the objects in the HiveThriftServer2 process, we can see that instances of `org.apache.spark.sql.hive.client.HiveClientImpl` and `org.apache.hadoop.hive.ql.session.SessionState` keep increasing even after all the beeline connections are closed, which may cause a memory leak.
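    
    A sketch of the leak mechanism using Hadoop's FileSystem API (hypothetical path):
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    // deleteOnExit only records the path in an in-memory set held by the cached,
    // shared FileSystem instance; the set is processed at JVM shutdown. In a
    // long-running HiveThriftServer2, one entry per session scratch dir therefore
    // accumulates for the lifetime of the process unless removed eagerly.
    val fs = FileSystem.get(new Configuration())
    fs.deleteOnExit(new Path("/tmp/hive/scratch-session-1"))
    // Closing the beeline session does not remove the entry; deleting the dir
    // (and cancelling the registration) when the session closes is the remedy.
    ```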
    
    ## How was this patch tested?
    manual tests
    
    This PR follows up on https://github.com/apache/spark/pull/19989.
    
    Author: zuotingbing <zu...@zte.com.cn>
    
    Closes #20029 from zuotingbing/SPARK-22793.
    
    (cherry picked from commit be9a804f2ef77a5044d3da7d9374976daf59fc16)
    Signed-off-by: gatorsmile <ga...@gmail.com>

----


---



[GitHub] spark issue #20185: Branch 2.3

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20185
  
    Can one of the admins verify this patch?


---





[GitHub] spark issue #20185: Branch 2.3

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20185
  
    Seems mistakenly opened. Could you close this, @jimmy144?


---



[GitHub] spark pull request #20185: Branch 2.3

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20185


---




---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org