Posted to reviews@spark.apache.org by zhangzg187 <gi...@git.apache.org> on 2018/02/08 05:12:18 UTC

[GitHub] spark pull request #20540: Branch 2.3

GitHub user zhangzg187 opened a pull request:

    https://github.com/apache/spark/pull/20540

    Branch 2.3

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20540.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20540
    
----
commit cd92913f345c8d932d3c651626c7f803e6abdcdb
Author: jerryshao <ss...@...>
Date:   2018-01-04T19:39:42Z

    [SPARK-21475][CORE][2ND ATTEMPT] Change to use NIO's Files API for external shuffle service
    
    ## What changes were proposed in this pull request?
    
    This PR is the second attempt at #18684. NIO's Files API doesn't override the `skip` method of `InputStream`, so it introduces a performance issue (mentioned in #20119). But using `FileInputStream`/`FileOutputStream` also brings a memory issue (https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful), which is severe for a long-running external shuffle service. So this proposal only fixes the external-shuffle-service-related code.
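
    For illustration only, a minimal Scala sketch of the kind of swap this makes in the shuffle service's file I/O path (names here are hypothetical, not the actual Spark code):

    ```scala
    import java.io.{File, FileInputStream, InputStream}
    import java.nio.file.Files

    object ShuffleFileIoSketch {
      // FileInputStream relies on finalization for cleanup, which can pile up
      // in a long-running external shuffle service (the memory issue above).
      def openWithFileStream(file: File): InputStream = new FileInputStream(file)

      // The NIO-based stream avoids finalization, but the InputStream it returns
      // does not override skip(), which was the performance concern in #20119.
      def openWithNio(file: File): InputStream = Files.newInputStream(file.toPath)
    }
    ```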
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #20144 from jerryshao/SPARK-21475-v2.
    
    (cherry picked from commit 93f92c0ed7442a4382e97254307309977ff676f8)
    Signed-off-by: Shixiong Zhu <zs...@gmail.com>

commit bc4bef472de0e99f74a80954d694c3d1744afe3a
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-04T22:19:00Z

    [SPARK-22850][CORE] Ensure queued events are delivered to all event queues.
    
    The code in LiveListenerBus was queueing events before start in the
    queues themselves; so in situations like the following:
    
       bus.post(someEvent)
       bus.addToEventLogQueue(listener)
       bus.start()
    
    "someEvent" would not be delivered to "listener" if that was the first
    listener in the queue, because the queue wouldn't exist when the
    event was posted.
    
    This change buffers the events before starting the bus in the bus itself,
    so that they can be delivered to all registered queues when the bus is
    started.
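
    As a rough illustration of the buffering idea, a toy sketch with hypothetical names (not the actual `LiveListenerBus` code):

    ```scala
    import scala.collection.mutable

    class ToyBus {
      private val queuedEvents = mutable.Buffer[String]()           // pre-start buffer held in the bus itself
      private val queues = mutable.Buffer[mutable.Buffer[String]]()
      private var started = false

      def addQueue(queue: mutable.Buffer[String]): Unit = queues += queue

      def post(event: String): Unit =
        if (!started) queuedEvents += event                          // not yet delivered anywhere
        else queues.foreach(_ += event)

      def start(): Unit = {
        started = true
        queuedEvents.foreach(e => queues.foreach(_ += e))            // replay to every registered queue
        queuedEvents.clear()
      }
    }
    ```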
    
    Also tweaked the unit tests to cover the behavior above.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20039 from vanzin/SPARK-22850.
    
    (cherry picked from commit d2cddc88eac32f26b18ec26bb59e85c6f09a8c88)
    Signed-off-by: Imran Rashid <ir...@cloudera.com>

commit 2ab4012adda941ebd637bd248f65cefdf4aaf110
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-04T23:00:09Z

    [SPARK-22948][K8S] Move SparkPodInitContainer to correct package.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20156 from vanzin/SPARK-22948.
    
    (cherry picked from commit 95f9659abe8845f9f3f42fd7ababd79e55c52489)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 84707f0c6afa9c5417e271657ff930930f82213c
Author: Yinan Li <li...@...>
Date:   2018-01-04T23:35:20Z

    [SPARK-22953][K8S] Avoids adding duplicated secret volumes when init-container is used
    
    ## What changes were proposed in this pull request?
    
    User-specified secrets are mounted into both the main container and the init-container (when it is used) in a Spark driver/executor pod, using the `MountSecretsBootstrap`. Because `MountSecretsBootstrap` always adds new secret volumes for the secrets to the pod, the same secret volumes get added twice: once when mounting the secrets to the main container, and again when mounting the secrets to the init-container. This PR fixes the issue by separating `MountSecretsBootstrap.mountSecrets` into two methods: `addSecretVolumes` for adding secret volumes to a pod and `mountSecrets` for mounting secret volumes to a container. `addSecretVolumes` is only called once for each pod, whereas `mountSecrets` is called individually for the main container and the init-container (if it is used).
    
    Ref: https://github.com/apache-spark-on-k8s/spark/issues/594.
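
    A simplified sketch of that split (the types and names below are illustrative stand-ins, not the real fabric8-based `MountSecretsBootstrap` API):

    ```scala
    // Toy stand-ins for the Kubernetes model objects.
    case class Volume(name: String)
    case class VolumeMount(name: String, path: String)
    case class Container(mounts: Seq[VolumeMount] = Nil)
    case class Pod(volumes: Seq[Volume] = Nil)

    class MountSecretsBootstrapSketch(secretNamesToMountPaths: Map[String, String]) {
      // Called once per pod: declare one volume per secret.
      def addSecretVolumes(pod: Pod): Pod =
        pod.copy(volumes = pod.volumes ++ secretNamesToMountPaths.keys.map(Volume(_)))

      // Called per container (main container and init-container alike): only mount
      // the already-declared volumes, never re-add them to the pod.
      def mountSecrets(container: Container): Container =
        container.copy(mounts = container.mounts ++
          secretNamesToMountPaths.toSeq.map { case (name, path) => VolumeMount(name, path) })
    }
    ```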
    
    ## How was this patch tested?
    Unit tested and manually tested.
    
    vanzin This replaces https://github.com/apache/spark/pull/20148.
    hex108 foxish kimoonkim
    
    Author: Yinan Li <li...@gmail.com>
    
    Closes #20159 from liyinan926/master.
    
    (cherry picked from commit e288fc87a027ec1e1a21401d1f151df20dbfecf3)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit ea9da6152af9223787cffd83d489741b4cc5aa34
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-05T00:34:56Z

    [SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly.
    
    - Make it possible to build images from a git clone.
    - Make it easy to use minikube to test things.
    
    Also fixed what seemed like a bug: the base image wasn't getting the tag
    provided in the command line. Adding the tag allows users to use multiple
    Spark builds in the same kubernetes cluster.
    
    Tested by deploying images on minikube and running spark-submit from a dev
    environment; also by building the images with different tags and verifying
    "docker images" in minikube.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20154 from vanzin/SPARK-22960.
    
    (cherry picked from commit 0428368c2c5e135f99f62be20877bbbda43be310)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 158f7e6a93b5acf4ce05c97b575124fd599cf927
Author: Juliusz Sompolski <ju...@...>
Date:   2018-01-05T02:16:34Z

    [SPARK-22957] ApproxQuantile breaks if the number of rows exceeds MaxInt
    
    ## What changes were proposed in this pull request?
    
    32bit Int was used for row rank.
    That overflowed in a dataframe with more than 2B rows.
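
    A tiny, self-contained illustration of why a 32-bit rank breaks once the row count passes `Int.MaxValue` (the numbers are hypothetical):

    ```scala
    object RankOverflowSketch extends App {
      val rowCount = 3000000000L             // 3 billion rows > Int.MaxValue (2147483647)
      val rankAsInt: Int = rowCount.toInt    // wraps around to a negative value
      val rankAsLong: Long = rowCount        // a 64-bit rank stays correct
      println(s"Int rank: $rankAsInt, Long rank: $rankAsLong")
    }
    ```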
    
    ## How was this patch tested?
    
    Added test, but ignored, as it takes 4 minutes.
    
    Author: Juliusz Sompolski <ju...@databricks.com>
    
    Closes #20152 from juliuszsompolski/SPARK-22957.
    
    (cherry picked from commit df7fc3ef3899cadd252d2837092bebe3442d6523)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 145820bda140d1385c4dd802fa79a871e6bf98be
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-05T06:02:21Z

    [SPARK-22825][SQL] Fix incorrect results of Casting Array to String
    
    ## What changes were proposed in this pull request?
    This PR fixes the incorrect results when casting arrays to strings:
    ```
    scala> val df = spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))
    scala> df.write.saveAsTable("t")
    scala> sql("SELECT cast(ids as String) FROM t").show(false)
    +------------------------------------------------------------------+
    |ids                                                               |
    +------------------------------------------------------------------+
    |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df|
    +------------------------------------------------------------------+
    ```
    
    This PR changes the result to:
    ```
    +------------------------------+
    |ids                           |
    +------------------------------+
    |[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]|
    +------------------------------+
    ```
    
    ## How was this patch tested?
    Added tests in `CastSuite` and `SQLQuerySuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20024 from maropu/SPARK-22825.
    
    (cherry picked from commit 52fc5c17d9d784b846149771b398e741621c0b5c)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 5b524cc0cd5a82e4fb0681363b6641e40b37075d
Author: Bago Amirbekian <ba...@...>
Date:   2018-01-05T06:45:15Z

    [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit
    
    ## What changes were proposed in this pull request?
    
    Avoid holding all models in memory for `TrainValidationSplit`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Bago Amirbekian <ba...@databricks.com>
    
    Closes #20143 from MrBago/trainValidMemoryFix.
    
    (cherry picked from commit cf0aa65576acbe0209c67f04c029058fd73555c1)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit f9dcdbcefb545ced3f5b457e1e88c88a8e180f9f
Author: Yinan Li <li...@...>
Date:   2018-01-05T07:23:41Z

    [SPARK-22757][K8S] Enable spark.jars and spark.files in KUBERNETES mode
    
    ## What changes were proposed in this pull request?
    
    We missed enabling `spark.files` and `spark.jars` in https://github.com/apache/spark/pull/19954. The result is that remote dependencies specified through `spark.files` or `spark.jars` are not included in the list of remote dependencies to be downloaded by the init-container. This PR fixes it.
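
    For context, a hedged example of the kind of configuration this re-enables in Kubernetes mode (in practice these confs are usually passed via spark-submit; the URLs are placeholders):

    ```scala
    import org.apache.spark.SparkConf

    // Remote dependencies named via spark.jars / spark.files should now be
    // included in the list the init-container downloads.
    val conf = new SparkConf()
      .set("spark.jars", "https://example.com/deps/extra-lib.jar")
      .set("spark.files", "https://example.com/conf/extra.conf")
    ```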
    
    ## How was this patch tested?
    
    Manual tests.
    
    vanzin This replaces https://github.com/apache/spark/pull/20157.
    
    foxish
    
    Author: Yinan Li <li...@gmail.com>
    
    Closes #20160 from liyinan926/SPARK-22757.
    
    (cherry picked from commit 6cff7d19f6a905fe425bd6892fe7ca014c0e696b)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit fd4e30476894b7c37cc2ae6243a941f0bc90388d
Author: Adrian Ionescu <ad...@...>
Date:   2018-01-05T13:32:39Z

    [SPARK-22961][REGRESSION] Constant columns should generate QueryPlanConstraints
    
    ## What changes were proposed in this pull request?
    
    #19201 introduced the following regression: given something like `df.withColumn("c", lit(2))`, we're no longer picking up `c === 2` as a constraint and inferring filters from it when joins are involved, which may lead to noticeable performance degradation.
    
    This patch re-enables this optimization by picking up Aliases of Literals in Projection lists as constraints and making sure they're not treated as aliased columns.
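
    A small, illustrative example of the pattern this optimization targets (assuming an active `SparkSession` named `spark`):

    ```scala
    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    // "c" is a constant column; with this fix the optimizer can again pick up
    // c === 2 as a constraint and infer filters from it across the join.
    val left  = Seq(1, 2, 3).toDF("id").withColumn("c", lit(2))
    val right = Seq((2, "two"), (5, "five")).toDF("c", "name")
    left.join(right, "c").explain(true)   // inspect the optimized plan for inferred filters
    ```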
    
    ## How was this patch tested?
    
    Unit test was added.
    
    Author: Adrian Ionescu <ad...@databricks.com>
    
    Closes #20155 from adrian-ionescu/constant_constraints.
    
    (cherry picked from commit 51c33bd0d402af9e0284c6cbc0111f926446bfba)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 0a30e93507ba784729a498943e7eeda1d6f19fbf
Author: Bruce Robbins <be...@...>
Date:   2018-01-05T17:58:28Z

    [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite should succeed on platforms that don't have wget
    
    ## What changes were proposed in this pull request?
    
    Modified HiveExternalCatalogVersionsSuite.scala to use Utils.doFetchFile to download different versions of Spark binaries rather than launching wget as an external process.
    
    On platforms that don't have wget installed, this suite fails with an error.
    
    cloud-fan : would you like to check this change?
    
    ## How was this patch tested?
    
    1) test-only of HiveExternalCatalogVersionsSuite on several platforms. Tested bad mirror, read timeout, and redirects.
    2) ./dev/run-tests
    
    Author: Bruce Robbins <be...@gmail.com>
    
    Closes #20147 from bersprockets/SPARK-22940-alt.
    
    (cherry picked from commit c0b7424ecacb56d3e7a18acc11ba3d5e7be57c43)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit d1f422c1c12c8095e8522d1051a6e0e406748a3a
Author: Joseph K. Bradley <jo...@...>
Date:   2018-01-05T19:51:25Z

    [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator
    
    ## What changes were proposed in this pull request?
    
    Follow-up cleanups for the OneHotEncoderEstimator PR.  See some discussion in the original PR: https://github.com/apache/spark/pull/19527 or read below for what this PR includes:
    * configedCategorySize: I reverted this to return an Array.  I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
    * encoder: I reorganized the logic to show what I meant in the comment in the previous PR.  I think it's simpler but am open to suggestions.
    
    I also made some small style cleanups based on IntelliJ warnings.
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Joseph K. Bradley <jo...@databricks.com>
    
    Closes #20132 from jkbradley/viirya-SPARK-13030.
    
    (cherry picked from commit 930b90a84871e2504b57ed50efa7b8bb52d3ba44)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 55afac4e7b4f655aa05c5bcaf7851bb1e7699dba
Author: Gera Shegalov <ge...@...>
Date:   2018-01-06T01:25:28Z

    [SPARK-22914][DEPLOY] Register history.ui.port
    
    ## What changes were proposed in this pull request?
    
    Register spark.history.ui.port as a known spark conf to be used in substitution expressions even if it's not set explicitly.
    
    ## How was this patch tested?
    
    Added unit test to demonstrate the issue
    
    Author: Gera Shegalov <ge...@apache.org>
    Author: Gera Shegalov <gs...@salesforce.com>
    
    Closes #20098 from gerashegalov/gera/register-SHS-port-conf.
    
    (cherry picked from commit ea956833017fcbd8ed2288368bfa2e417a2251c5)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit bf853018cabcd3b3abf84bfe534d2981020b4a71
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-06T01:26:03Z

    [SPARK-22937][SQL] SQL elt output binary for binary inputs
    
    ## What changes were proposed in this pull request?
    This PR modifies `elt` to output binary for binary inputs.
    `elt` in the current master always outputs data as a string. But in some databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary, so the current behavior may be a small surprise.
    This PR is related to #19977.
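
    An illustrative query shape for the new behavior (assuming an active `SparkSession` named `spark`):

    ```scala
    // With this change, when all inputs after the index are binary,
    // elt should produce binary output rather than string.
    spark.sql("SELECT elt(1, cast('spark' AS BINARY), cast('sql' AS BINARY)) AS col").printSchema()
    ```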
    
    ## How was this patch tested?
    Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20135 from maropu/SPARK-22937.
    
    (cherry picked from commit e8af7e8aeca15a6107248f358d9514521ffdc6d3)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 3e3e9386ed95435a2d1817653d1402c102e380dc
Author: Yinan Li <li...@...>
Date:   2018-01-06T01:29:27Z

    [SPARK-22960][K8S] Revert use of ARG base_image in images
    
    ## What changes were proposed in this pull request?
    
    This PR reverts the `ARG base_image` before `FROM` in the images of driver, executor, and init-container, introduced in https://github.com/apache/spark/pull/20154. The reason is Docker versions before 17.06 do not support this use (`ARG` before `FROM`).
    
    ## How was this patch tested?
    
    Tested manually.
    
    vanzin foxish kimoonkim
    
    Author: Yinan Li <li...@gmail.com>
    
    Closes #20170 from liyinan926/master.
    
    (cherry picked from commit bf65cd3cda46d5480bfcd13110975c46ca631972)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 7236914e5e7aeb4eb919530b6edbad70256cca52
Author: Li Jin <ic...@...>
Date:   2018-01-06T08:11:20Z

    [SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases
    
    ## What changes were proposed in this pull request?
    
    Add tests for using non-deterministic UDFs in aggregate.
    
    Update the pandas_udf docstring w.r.t. determinism.
    
    ## How was this patch tested?
    test_nondeterministic_udf_in_aggregate
    
    Author: Li Jin <ic...@gmail.com>
    
    Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.
    
    (cherry picked from commit f2dd8b923759e8771b0e5f59bfa7ae4ad7e6a339)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit e6449e8167776e3921c286d75e8cdd30ee33d77a
Author: zuotingbing <zu...@...>
Date:   2018-01-06T10:07:45Z

    [SPARK-22793][SQL] Memory leak in Spark Thrift Server
    
    ## What changes were proposed in this pull request?
    1. Start HiveThriftServer2.
    2. Connect to thriftserver through beeline.
    3. Close the beeline.
    4. repeat step2 and step 3 for many times.
    we found there are many directories under `hive.exec.local.scratchdir` and `hive.exec.scratchdir` that are never dropped. As we know, the scratchdir is added to `deleteOnExit` when it is created, so the size of the FileSystem `deleteOnExit` cache keeps increasing until the JVM terminates.
    
    In addition, we used `jmap -histo:live [PID]`
    to print out the sizes of objects in the HiveThriftServer2 process, and found that the `org.apache.spark.sql.hive.client.HiveClientImpl` and `org.apache.hadoop.hive.ql.session.SessionState` objects keep increasing even though we closed all the beeline connections, which may cause a memory leak.
    
    ## How was this patch tested?
    Manual tests.
    
    This PR follows up on https://github.com/apache/spark/pull/19989.
    
    Author: zuotingbing <zu...@zte.com.cn>
    
    Closes #20029 from zuotingbing/SPARK-22793.
    
    (cherry picked from commit be9a804f2ef77a5044d3da7d9374976daf59fc16)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 0377755985897c850dcf3a938520e5c10de4040a
Author: fjh100456 <fu...@...>
Date:   2018-01-06T10:19:57Z

    [SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.
    
    ## What changes were proposed in this pull request?
    Since Hive 1.1, Hive allows users to set the parquet compression codec via the table-level property `parquet.compression`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We already support `orc.compression` for ORC, so for external users it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
    On the Spark side, our table-level compression conf `compression` was added by #11464 in Spark 2.0.
    We need to support both table-level confs. Users might also use the session-level conf `spark.sql.parquet.compression.codec`. When more than one compression codec configuration is found through Hive or Parquet, the precedence is `compression`, then `parquet.compression`, then `spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, snappy, gzip, lzo.
    After this change, the rule for Parquet is consistent with that for ORC.
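
    A hedged sketch of how the three levels could be exercised (assuming an active Hive-enabled `SparkSession` named `spark`; the table name and codecs are just examples):

    ```scala
    // Session-level default: lowest precedence of the three.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    // Hive-style table property (HIVE-7858): considered before the session conf.
    spark.sql("""
      CREATE TABLE t_parquet (id INT) STORED AS PARQUET
      TBLPROPERTIES ('parquet.compression' = 'GZIP')
    """)

    // An explicit `compression` setting, where given, takes precedence over both;
    // with this change, "none" is also accepted.
    spark.range(10).write.option("compression", "none").parquet("/tmp/t_none")
    ```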
    
    Changes:
    1. Also acquire `compressionCodecClassName` from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
    
    2. Change `spark.sql.parquet.compression.codec` to support "none". In `ParquetOptions` we already treat "none" as equivalent to "uncompressed", but it was not allowed as a configured value.
    
    3. Rename `compressionCode` to `compressionCodecClassName`.
    
    ## How was this patch tested?
    Add test.
    
    Author: fjh100456 <fu...@zte.com.cn>
    
    Closes #20076 from fjh100456/ParquetOptionIssue.
    
    (cherry picked from commit 7b78041423b6ee330def2336dfd1ff9ae8469c59)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit b66700a5ed9a6e31433bec7961361c382a8b4162
Author: hyukjinkwon <gu...@...>
Date:   2018-01-06T15:08:26Z

    [SPARK-22901][PYTHON][FOLLOWUP] Adds the doc for asNondeterministic for wrapped UDF function
    
    ## What changes were proposed in this pull request?
    
    This PR wraps the `asNondeterministic` attribute in the wrapped UDF function to set the docstring properly.
    
    ```python
    from pyspark.sql.functions import udf
    help(udf(lambda x: x).asNondeterministic)
    ```
    
    Before:
    
    ```
    Help on function <lambda> in module pyspark.sql.udf:
    
    <lambda> lambda
    (END
    ```
    
    After:
    
    ```
    Help on function asNondeterministic in module pyspark.sql.udf:
    
    asNondeterministic()
        Updates UserDefinedFunction to nondeterministic.
    
        .. versionadded:: 2.3
    (END)
    ```
    
    ## How was this patch tested?
    
    Manually tested and a simple test was added.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #20173 from HyukjinKwon/SPARK-22901-followup.

commit f9e7b0c8aa9334fa2e467b26516ac8e54a51dc63
Author: gatorsmile <ga...@...>
Date:   2018-01-06T16:19:21Z

    [HOTFIX] Fix style checking failure
    
    ## What changes were proposed in this pull request?
    This PR is to fix the  style checking failure.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20175 from gatorsmile/stylefix.
    
    (cherry picked from commit 9a7048b2889bd0fd66e68a0ce3e07e466315a051)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 285d342c406cf304931844665e56725ef1a848e0
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-07T05:42:01Z

    [SPARK-22973][SQL] Fix incorrect results of Casting Map to String
    
    ## What changes were proposed in this pull request?
    This PR fixes the incorrect results when casting maps to strings:
    ```
    scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t")
    scala> sql("SELECT cast(a as String) FROM t").show(false)
    +----------------------------------------------------------------+
    |a                                                               |
    +----------------------------------------------------------------+
    |org.apache.spark.sql.catalyst.expressions.UnsafeMapData@38bdd75d|
    +----------------------------------------------------------------+
    ```
    This PR changes the result to:
    ```
    +----------------+
    |a               |
    +----------------+
    |[1 -> a, 2 -> b]|
    +----------------+
    ```
    
    ## How was this patch tested?
    Added tests in `CastSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20166 from maropu/SPARK-22973.
    
    (cherry picked from commit 18e94149992618a2b4e6f0fd3b3f4594e1745224)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 7673e9c569821e20c56c2a9a875bcc025d36315f
Author: Josh Rosen <jo...@...>
Date:   2018-01-08T03:39:45Z

    [SPARK-22985] Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
    
    ## What changes were proposed in this pull request?
    
    This patch adds additional escaping in `from_utc_timestamp` / `to_utc_timestamp` expression codegen in order to fix a bug where invalid timezones containing special characters could cause the generated code to fail to compile.
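
    For illustration, the kind of call that exercises the escaping path (assuming an active `SparkSession` named `spark`; the timezone string is deliberately bogus):

    ```scala
    import org.apache.spark.sql.functions.{current_timestamp, from_utc_timestamp}

    // A timezone containing a quote must be escaped when it is embedded into the
    // generated Java source; previously such input could make codegen fail to compile.
    val tz = "Quoted\"Zone"
    val df = spark.range(1).select(from_utc_timestamp(current_timestamp(), tz).as("ts"))
    ```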
    
    ## How was this patch tested?
    
    New regression tests in `DateExpressionsSuite`.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #20182 from JoshRosen/SPARK-22985-fix-utc-timezone-function-escaping-bugs.
    
    (cherry picked from commit 71d65a32158a55285be197bec4e41fedc9225b94)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit a1d3352d279d9eec6de23df83398b12454c6564e
Author: Guilherme Berger <gb...@...>
Date:   2018-01-08T05:32:05Z

    [SPARK-22566][PYTHON] Better error message for `_merge_type` in Pandas to Spark DF conversion
    
    ## What changes were proposed in this pull request?
    
    It provides a better error message when doing `spark_session.createDataFrame(pandas_df)` with no schema and an error occurs in the schema inference due to incompatible types.
    
    The Pandas column names are propagated down and the error message mentions which column had the merging error.
    
    https://issues.apache.org/jira/browse/SPARK-22566
    
    ## How was this patch tested?
    
    Manually in the `./bin/pyspark` console, and with new tests: `./python/run-tests`
    
    <img width="873" alt="screen shot 2017-11-21 at 13 29 49" src="https://user-images.githubusercontent.com/3977115/33080121-382274e0-cecf-11e7-808f-057a65bb7b00.png">
    
    I state that the contribution is my original work and that I license the work to the Apache Spark project under the project’s open source license.
    
    Author: Guilherme Berger <gb...@palantir.com>
    
    Closes #19792 from gberger/master.
    
    (cherry picked from commit 3e40eb3f1ffac3d2f49459a801e3ce171ed34091)
    Signed-off-by: Takuya UESHIN <ue...@databricks.com>

commit 8bf24e9fea9e9e23f03caf8b32acb4a64f5b00e3
Author: hyukjinkwon <gu...@...>
Date:   2018-01-08T05:59:08Z

    [SPARK-22979][PYTHON][SQL] Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
    
    ## What changes were proposed in this pull request?
    
    It seems we can avoid type dispatch for each value when converting Java objects (from Pyrolite) to Spark's internal data format, because we know the schema ahead of time.
    
    I manually performed the benchmark as below:
    
    ```scala
      test("EvaluatePython.fromJava / EvaluatePython.makeFromJava") {
        val numRows = 1000 * 1000
        val numFields = 30
    
        val random = new Random(System.nanoTime())
        val types = Array(
          BooleanType, ByteType, FloatType, DoubleType, IntegerType, LongType, ShortType,
          DecimalType.ShortDecimal, DecimalType.IntDecimal, DecimalType.ByteDecimal,
          DecimalType.FloatDecimal, DecimalType.LongDecimal, new DecimalType(5, 2),
          new DecimalType(12, 2), new DecimalType(30, 10), CalendarIntervalType)
        val schema = RandomDataGenerator.randomSchema(random, numFields, types)
        val rows = mutable.ArrayBuffer.empty[Array[Any]]
        var i = 0
        while (i < numRows) {
          val row = RandomDataGenerator.randomRow(random, schema)
          rows += row.toSeq.toArray
          i += 1
        }
    
        val benchmark = new Benchmark("EvaluatePython.fromJava / EvaluatePython.makeFromJava", numRows)
        benchmark.addCase("Before - EvaluatePython.fromJava", 3) { _ =>
          var i = 0
          while (i < numRows) {
            EvaluatePython.fromJava(rows(i), schema)
            i += 1
          }
        }
    
        benchmark.addCase("After - EvaluatePython.makeFromJava", 3) { _ =>
          val fromJava = EvaluatePython.makeFromJava(schema)
          var i = 0
          while (i < numRows) {
            fromJava(rows(i))
            i += 1
          }
        }
    
        benchmark.run()
      }
    ```
    
    ```
    EvaluatePython.fromJava / EvaluatePython.makeFromJava: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Before - EvaluatePython.fromJava              1265 / 1346          0.8        1264.8       1.0X
    After - EvaluatePython.makeFromJava            571 /  649          1.8         570.8       2.2X
    ```
    
    If the structure is nested, I think the advantage should be larger than this.
    
    ## How was this patch tested?
    
    Existing tests should cover this. Also, I manually checked if the values from before / after are actually same via `assert` when performing the benchmarks.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #20172 from HyukjinKwon/type-dispatch-python-eval.
    
    (cherry picked from commit 8fdeb4b9946bd9be045abb919da2e531708b3bd4)
    
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 6964dfe47b2090e542b26cd64e27420ec3eb1a3d
Author: Josh Rosen <jo...@...>
Date:   2018-01-08T08:04:03Z

    [SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions
    
    ## What changes were proposed in this pull request?
    
    The following SQL query should return zero rows, but in Spark it actually returns one row:
    
    ```
    SELECT 1 from (
      SELECT 1 AS z,
      MIN(a.x)
      FROM (select 1 as x) a
      WHERE false
    ) b
    where b.z != b.z
    ```
    
    The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.
    
    In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks like I must have either misunderstood the comment or forgotten to follow up on the additional points raised there.
    
    This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.
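
    A deliberately simplified sketch of the guard this adds (a toy model, not the actual Catalyst `PushDownPredicate` rule, which works on `LogicalPlan` nodes):

    ```scala
    // Only consider pushing a filter below an aggregate when there are grouping
    // expressions and the predicate only references them; with an empty GROUP BY
    // the filter behaves like HAVING over a single global row and must stay on top.
    case class ToyAggregate(groupingExprs: Set[String])

    def canPushFilterBelow(agg: ToyAggregate, predicateRefs: Set[String]): Boolean =
      agg.groupingExprs.nonEmpty && predicateRefs.subsetOf(agg.groupingExprs)
    ```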
    
    ## How was this patch tested?
    
    New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.
    
    (cherry picked from commit 2c73d2a948bdde798aaf0f87c18846281deb05fd)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 4a45f0a532216736f8874417c5cbd7912ca13db5
Author: Wenchen Fan <we...@...>
Date:   2018-01-08T11:41:41Z

    [SPARK-21865][SQL] simplify the distribution semantic of Spark SQL
    
    ## What changes were proposed in this pull request?
    
    **The current shuffle planning logic**
    
    1. Each operator specifies the distribution requirements for its children, via the `Distribution` interface.
    2. Each operator specifies its output partitioning, via the `Partitioning` interface.
    3. `Partitioning.satisfy` determines whether a `Partitioning` can satisfy a `Distribution`.
    4. For each operator, check each child of it, add a shuffle node above the child if the child partitioning can not satisfy the required distribution.
    5. For each operator, check if its children's output partitionings are compatible with each other, via the `Partitioning.compatibleWith`.
    6. If the check in 5 failed, add a shuffle above each child.
    7. try to eliminate the shuffles added in 6, via `Partitioning.guarantees`.
    
    This design has a major problem with the definition of "compatible".
    
    `Partitioning.compatibleWith` is not well defined; ideally a `Partitioning` can't know if it's compatible with another `Partitioning` without more information from the operator. For example, for `t1 join t2 on t1.a = t2.b`, `HashPartitioning(a, 10)` should be compatible with `HashPartitioning(b, 10)` in this case, but the partitioning itself doesn't know it.
    
    As a result, `Partitioning.compatibleWith` currently always returns false except for literals, which makes it almost useless. This also means that, if an operator has distribution requirements for multiple children, Spark always adds shuffle nodes to all the children (although some of them can be eliminated). However, there is no guarantee that the children's output partitionings are compatible with each other after adding these shuffles; we just assume that the operator will only specify `ClusteredDistribution` for multiple children.
    
    I think it's very hard to guarantee that children are co-partitioned for all kinds of operators, and we cannot even give a clear definition of co-partition between distributions like `ClusteredDistribution(a,b)` and `ClusteredDistribution(c)`.
    
    I think we should drop the "compatible" concept in the distribution model, and let the operator achieve the co-partition requirement by special distribution requirements.
    
    **Proposed shuffle planning logic after this PR**
    (The first 4 are same as before)
    1. Each operator specifies the distribution requirements for its children, via the `Distribution` interface.
    2. Each operator specifies its output partitioning, via the `Partitioning` interface.
    3. `Partitioning.satisfy` determines whether a `Partitioning` can satisfy a `Distribution`.
    4. For each operator, check each child of it, add a shuffle node above the child if the child partitioning can not satisfy the required distribution.
    5. For each operator, check if its children's output partitionings have the same number of partitions.
    6. If the check in 5 failed, pick the max number of partitions from the children's output partitionings, and add a shuffle to each child whose number of partitions doesn't equal the max.
    
    The new distribution model is very simple: we only have one kind of relationship, `Partitioning.satisfy`. For multiple children, Spark only guarantees they have the same number of partitions, and it's the operator's responsibility to leverage this guarantee to achieve more complicated requirements. For example, non-broadcast joins can use the newly added `HashPartitionedDistribution` to achieve co-partitioning.
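
    To make the simplified model concrete, a toy rendering of the two remaining rules (not Spark's actual `Distribution`/`Partitioning` classes):

    ```scala
    sealed trait ToyDistribution
    case object UnspecifiedDist extends ToyDistribution
    case class ClusteredDist(keys: Seq[String]) extends ToyDistribution

    case class ToyPartitioning(keys: Seq[String], numPartitions: Int) {
      // Rule 1: "satisfies" is the only relationship between the two concepts.
      def satisfies(required: ToyDistribution): Boolean = required match {
        case UnspecifiedDist   => true
        case ClusteredDist(ks) => keys.nonEmpty && keys.forall(ks.contains)
      }
    }

    // Rule 2: for multi-child operators, only equal partition counts are
    // guaranteed; shuffle every child whose count differs from the max.
    def childrenNeedingShuffle(children: Seq[ToyPartitioning]): Seq[Int] = {
      val maxParts = children.map(_.numPartitions).max
      children.zipWithIndex.collect { case (p, i) if p.numPartitions != maxParts => i }
    }
    ```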
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #19080 from cloud-fan/exchange.
    
    (cherry picked from commit eb45b52e826ea9cea48629760db35ef87f91fea0)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 06fd842e3a120fde1c137e4945bcb747fc71a322
Author: Xianjin YE <ad...@...>
Date:   2018-01-08T15:49:07Z

    [SPARK-22952][CORE] Deprecate stageAttemptId in favour of stageAttemptNumber
    
    ## What changes were proposed in this pull request?
    1.  Deprecate attemptId in StageInfo and add `def attemptNumber() = attemptId`
    2. Replace usage of stageAttemptId with stageAttemptNumber
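
    Roughly the shape of the change (a sketch, not the real `StageInfo` diff):

    ```scala
    class StageInfoSketch(private val _attemptId: Int) {
      @deprecated("Use attemptNumber instead", "2.3.0")
      def attemptId: Int = _attemptId

      def attemptNumber(): Int = _attemptId
    }

    // Call sites then move from info.attemptId to info.attemptNumber().
    ```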
    
    ## How was this patch tested?
    I manually checked the compiler warning info
    
    Author: Xianjin YE <ad...@gmail.com>
    
    Closes #20178 from advancedxy/SPARK-22952.
    
    (cherry picked from commit 40b983c3b44b6771f07302ce87987fa4716b5ebf)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit eecd83cb2d24907aba303095b052997471247500
Author: foxish <ra...@...>
Date:   2018-01-08T21:01:45Z

    [SPARK-22992][K8S] Remove assumption of the DNS domain
    
    ## What changes were proposed in this pull request?
    
    Remove the use of the FQDN to access the driver, because it assumes the cluster is set up in the `cluster.local` DNS zone, which is common but not ubiquitous.
    Note that we already access the in-cluster API server through `kubernetes.default.svc`, so, by extension, this should work as well.
    The alternative is to introduce DNS zones for both of those addresses.
    
    ## How was this patch tested?
    Unit tests
    
    cc vanzin liyinan926 mridulm mccheah
    
    Author: foxish <ra...@google.com>
    
    Closes #20187 from foxish/cluster.local.
    
    (cherry picked from commit eed82a0b211352215316ec70dc48aefc013ad0b2)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 8032cf852fccd0ab8754f633affdc9ba8fc99e58
Author: xubo245 <60...@...>
Date:   2018-01-09T02:15:01Z

    [SPARK-22972] Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc
    
    ## What changes were proposed in this pull request?
    Fix the warning: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.
    
    ## How was this patch tested?
     test("SPARK-22972: hive orc source")
        assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
          .equals(HiveSerDe.sourceToSerDe("orc")))
    
    Author: xubo245 <60...@qq.com>
    
    Closes #20165 from xubo245/HiveSerDe.
    
    (cherry picked from commit 68ce792b5857f0291154f524ac651036db868bb9)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 850b9f39186665fd727737a98b29abe5236830db
Author: Wang Gengliang <lt...@...>
Date:   2018-01-09T02:44:21Z

    [SPARK-22990][CORE] Fix method isFairScheduler in JobsTab and StagesTab
    
    ## What changes were proposed in this pull request?
    
    In the current implementation, the function `isFairScheduler` is always false, since it compares a String with a `SchedulingMode` value.
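
    The gist of the bug, as a self-contained sketch (a hypothetical enum mirroring Spark's `SchedulingMode`):

    ```scala
    object SchedulingMode extends Enumeration {
      val FAIR, FIFO, NONE = Value
    }

    object IsFairSketch extends App {
      val mode: SchedulingMode.Value = SchedulingMode.FAIR

      println(mode == "FAIR")               // buggy comparison: always false
      println(mode == SchedulingMode.FAIR)  // fixed comparison: true when FAIR
    }
    ```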
    
    Author: Wang Gengliang <lt...@gmail.com>
    
    Closes #20186 from gengliangwang/isFairScheduler.
    
    (cherry picked from commit 849043ce1d28a976659278d29368da0799329db8)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20540: Branch 2.3

Posted by zhangzg187 <gi...@git.apache.org>.
Github user zhangzg187 closed the pull request at:

    https://github.com/apache/spark/pull/20540


---
