Posted to reviews@spark.apache.org by edurekagithub <gi...@git.apache.org> on 2018/12/02 04:41:27 UTC

[GitHub] spark pull request #23198: Branch 2.4

GitHub user edurekagithub opened a pull request:

    https://github.com/apache/spark/pull/23198

    Branch 2.4

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/edurekagithub/spark branch-2.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23198.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23198
    
----
commit b7efca7ece484ee85091b1b50bbc84ad779f9bfe
Author: Mario Molina <mm...@...>
Date:   2018-09-11T12:47:14Z

    [SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as null when nullValue is set.
    
    ## What changes were proposed in this pull request?
    
    In the PR, I propose a new CSV option `emptyValue` and an update to the SQL Migration Guide describing how to revert to the previous behavior, in which empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.
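
    For illustration, a minimal sketch of the behavior change (data and paths are made up; a running `SparkSession` named `spark` is assumed):

    ```scala
    import spark.implicits._

    val df = Seq(("a", ""), ("b", null)).toDF("key", "value")

    // Since 2.4, the empty string in row "a" is written as "" and should be read
    // back as an empty string rather than null, while row "b" still round-trips as null.
    df.write.mode("overwrite").csv("/tmp/csv-empty-value")
    spark.read.schema(df.schema).csv("/tmp/csv-empty-value").show()

    // The new option controls the string used to represent empty values on write:
    df.write.mode("overwrite").option("emptyValue", "EMPTY").csv("/tmp/csv-empty-marker")
    ```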
    
    Closes #22234
    Closes #22367
    
    ## How was this patch tested?
    
    It was tested by `CSVSuite` and the new tests added in PR #22234.
    
    Closes #22389 from MaxGekk/csv-empty-value-master.
    
    Lead-authored-by: Mario Molina <mm...@gmail.com>
    Co-authored-by: Maxim Gekk <ma...@databricks.com>
    Signed-off-by: hyukjinkwon <gu...@apache.org>
    (cherry picked from commit c9cb393dc414ae98093c1541d09fa3c8663ce276)
    Signed-off-by: hyukjinkwon <gu...@apache.org>

commit 0b8bfbe12b8a368836d7ddc8445de18b7ee42cde
Author: Dongjoon Hyun <do...@...>
Date:   2018-09-11T15:57:42Z

    [SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
    
    ## What changes were proposed in this pull request?
    
    Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields because Spark cannot read those files back.
    
    **INSERT OVERWRITE DIRECTORY USING**
    ```scala
    scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
    ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
    org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
    ```
    
    **INSERT OVERWRITE DIRECTORY STORED AS**
    ```scala
    scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")
    // It generates corrupted files
    scala> spark.read.parquet("/tmp/parquet").show
    18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins with newly added test cases.
    
    Closes #22378 from dongjoon-hyun/SPARK-25389.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
    (cherry picked from commit 77579aa8c35b0d98bbeac3c828bf68a1d190d13e)
    Signed-off-by: Dongjoon Hyun <do...@apache.org>

commit 4414e026097c74aadd252b541c9d3009cd7e9d09
Author: Gera Shegalov <ge...@...>
Date:   2018-09-11T16:28:32Z

    [SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values
    
    ## What changes were proposed in this pull request?
    
    Stop trimming the values of properties loaded from a file.
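
    A hypothetical illustration of why this matters (the key name below is made up):

    ```scala
    // Suppose conf/spark-defaults.conf sets a value whose trailing space is significant,
    // e.g. a field separator written as "| " (pipe followed by a space). Before this
    // change, values loaded from the file were trimmed, so the separator silently
    // became "|" and diverged from what --conf on the command line produces; with
    // this change the trailing whitespace survives in both paths.
    val sep = spark.sparkContext.getConf.get("spark.myapp.field.separator")
    assert(sep == "| ", s"expected the trailing space to survive, got <$sep>")
    ```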
    
    ## How was this patch tested?
    
    Added unit test demonstrating the issue hit in production.
    
    Closes #22213 from gerashegalov/gera/SPARK-25221.
    
    Authored-by: Gera Shegalov <ge...@apache.org>
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>
    (cherry picked from commit bcb9a8c83f4e6835af5dc51f1be7f964b8fa49a3)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 16127e844f8334e1152b2e3ed3d878ec8de13dfa
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-09-11T17:31:06Z

    [SPARK-24889][CORE] Update block info when unpersist rdds
    
    ## What changes were proposed in this pull request?
    
    We update block info coming from executors at certain points, such as when caching an RDD. However, when removing RDDs by unpersisting them, we don't ask for the block info to be updated, so it becomes stale.
    
    We can fix this in a few ways:
    
    1. Ask to update block info when unpersisting
    
    This is simplest but changes driver-executor communication a bit.
    
    2. Update block info when processing the event for unpersisting an RDD
    
    We send a `SparkListenerUnpersistRDD` event when unpersisting an RDD. When processing this event, we can update the block info for that RDD. This only changes event-processing code, so the risk seems lower.
    
    Currently this patch takes option 2 for the lower risk. If we agree the first option carries no risk, we can switch to it.
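
    For reference, a sketch of observing the same event from a user-side listener (the class below is illustrative; the actual fix updates the driver's status tracking when this event is processed):

    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

    class UnpersistTracker extends SparkListener {
      override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit = {
        // event.rddId identifies the RDD whose cached blocks are being dropped;
        // storage/block info for it should be refreshed at this point.
        println(s"RDD ${event.rddId} unpersisted; refreshing its block info")
      }
    }

    // spark.sparkContext.addSparkListener(new UnpersistTracker())
    ```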
    
    ## How was this patch tested?
    
    Unit tests.
    
    Closes #22341 from viirya/SPARK-24889.
    
    Authored-by: Liang-Chi Hsieh <vi...@gmail.com>
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>
    (cherry picked from commit 14f3ad20932535fe952428bf255e7eddd8fa1b58)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 99b37a91871f8bf070d43080f1c58475548c99fd
Author: Sean Owen <se...@...>
Date:   2018-09-11T19:46:03Z

    [SPARK-25398] Minor bugs from comparing unrelated types
    
    ## What changes were proposed in this pull request?
    
    Correct some comparisons between unrelated types to what they seem to have been trying to do.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Closes #22384 from srowen/SPARK-25398.
    
    Authored-by: Sean Owen <se...@databricks.com>
    Signed-off-by: Sean Owen <se...@databricks.com>
    (cherry picked from commit cfbdd6a1f5906b848c520d3365cc4034992215d9)
    Signed-off-by: Sean Owen <se...@databricks.com>

commit 3a6ef8b7e2d17fe22458bfd249f45b5a5ce269ec
Author: Sean Owen <se...@...>
Date:   2018-09-11T19:52:58Z

    Revert "[SPARK-23820][CORE] Enable use of long form of callsite in logs"
    
    This reverts commit e58dadb77ed6cac3e1b2a037a6449e5a6e7f2cec.

commit 0dbf1450f7965c27ce9329c7dad351ff8b8072dc
Author: Mukul Murthy <mu...@...>
Date:   2018-09-11T22:53:15Z

    [SPARK-25399][SS] Continuous processing state should not affect microbatch execution jobs
    
    ## What changes were proposed in this pull request?
    
    The leftover state from running a continuous processing streaming job should not affect later microbatch execution jobs. If a continuous processing job runs and the same thread gets reused for a microbatch execution job in the same environment, the microbatch job could get wrong answers because it can attempt to load the wrong version of the state.
    
    ## How was this patch tested?
    
    New and existing unit tests
    
    Closes #22386 from mukulmurthy/25399-streamthread.
    
    Authored-by: Mukul Murthy <mu...@gmail.com>
    Signed-off-by: Tathagata Das <ta...@gmail.com>
    (cherry picked from commit 9f5c5b4cca7d4eaa30a3f8adb4cb1eebe3f77c7a)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 40e4db0eb72be7640bd8b5b319ad4ba99c9dc846
Author: gatorsmile <ga...@...>
Date:   2018-09-12T13:11:22Z

    [SPARK-25402][SQL] Null handling in BooleanSimplification
    
    ## What changes were proposed in this pull request?
    This PR fixes the null handling in BooleanSimplification. The rule has two cases that do not handle null values properly; the optimization is incorrect if either side is null. Both cases are fixed here.
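
    For illustration, a hypothetical example of why null handling matters here (three-valued logic; not necessarily one of the two cases fixed in this PR):

    ```scala
    // (b OR NOT b) evaluates to NULL, not TRUE, when b is NULL, so rewriting
    // `a AND (b OR NOT b)` to just `a` would be wrong when a is TRUE and b is NULL:
    spark.sql("SELECT true AND (CAST(NULL AS BOOLEAN) OR NOT CAST(NULL AS BOOLEAN))").show()
    // the correct result is NULL; the over-simplified form `true` would wrongly return true
    ```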
    
    ## How was this patch tested?
    Added test cases
    
    Closes #22390 from gatorsmile/fixBooleanSimplification.
    
    Authored-by: gatorsmile <ga...@gmail.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit 79cc59718fdf7785bdc37a26bb8df4c6151114a6)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 071babbab5a49b7106d61b0c9a18672bd67e1786
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-09-12T14:54:05Z

    [SPARK-25352][SQL] Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
    
    ## What changes were proposed in this pull request?
    
    We have an optimization on global limit that evenly distributes the limit rows across all partitions. This optimization doesn't work for ordered results.
    
    A query ending with sort + limit is, in most cases, performed by `TakeOrderedAndProjectExec`.
    
    But if the limit is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, a global limit is used instead. In that case, we need to perform an ordered global limit.
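
    A sketch of the two code paths (the threshold value is illustrative; the conf key is the one backing `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`):

    ```scala
    spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "1000")

    val df = spark.range(0, 100000).toDF("id")
    df.orderBy("id").limit(100).explain()    // limit <= threshold: TakeOrderedAndProjectExec
    df.orderBy("id").limit(50000).explain()  // limit > threshold: falls back to sort + global
                                             // limit, which must now preserve the ordering
    ```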
    
    ## How was this patch tested?
    
    Unit tests.
    
    Closes #22344 from viirya/SPARK-25352.
    
    Authored-by: Liang-Chi Hsieh <vi...@gmail.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit 2f422398b524eacc89ab58e423bb134ae3ca3941)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 4c1428fa2b29c371458977427561d2b4bb9daa5b
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-09-12T17:43:40Z

    [SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields
    
    ## What changes were proposed in this pull request?
    
    Schema pruning doesn't work if a nested column is used in the where clause.
    
    For example,
    ```
    sql("select name.first from contacts where name.first = 'David'")
    
    == Physical Plan ==
    *(1) Project [name#19.first AS first#40]
    +- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
       +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [],
        PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
    ```
    
    In the above query plan, the scan node reads the entire schema of the `name` column.
    
    This issue is reported by:
    https://github.com/apache/spark/pull/21320#issuecomment-419290197
    
    The cause is that we infer a root field from the expression `IsNotNull(name)`. However, for such an expression we don't actually use the nested fields of this root field, so we can ignore the unnecessary nested fields.
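
    A sketch of the expected effect (assuming the nested schema pruning flag available in 2.4 and the `contacts` table from the example above):

    ```scala
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
    spark.sql("select name.first from contacts where name.first = 'David'").explain()
    // With this change, ReadSchema should list only name.first instead of the whole
    // name struct, since IsNotNull(name) no longer pulls in every nested field.
    ```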
    
    ## How was this patch tested?
    
    Unit tests.
    
    Closes #22357 from viirya/SPARK-25363.
    
    Authored-by: Liang-Chi Hsieh <vi...@gmail.com>
    Signed-off-by: DB Tsai <d_...@apple.com>
    (cherry picked from commit 3030b82c89d3e45a2e361c469fbc667a1e43b854)
    Signed-off-by: DB Tsai <d_...@apple.com>

commit 15d2e9d7d2f0d5ecefd69bdc3f8a149670b05e79
Author: Wenchen Fan <we...@...>
Date:   2018-09-12T18:25:24Z

    [SPARK-24882][SQL] Revert [] improve data source v2 API from branch 2.4
    
    ## What changes were proposed in this pull request?
    
    As discussed on the dev list, we don't want to include https://github.com/apache/spark/pull/22009 in Spark 2.4, as it requires data source v2 users to change their implementations extensively, only to change them again in the next release.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #22388 from cloud-fan/revert.

commit 71f70130f1b2b4ec70595627f0a02a88e2c0e27d
Author: Michael Mior <mm...@...>
Date:   2018-09-13T01:45:25Z

    [SPARK-23820][CORE] Enable use of long form of callsite in logs
    
    This is a rework of #21433 to address some concerns there.
    
    Closes #22398 from michaelmior/long-callsite2.
    
    Authored-by: Michael Mior <mm...@uwaterloo.ca>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit ab25c967905ca0973fc2f30b8523246bb9244206)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 776dc42c1326764233a4466172330b74b98df7aa
Author: Maxim Gekk <ma...@...>
Date:   2018-09-13T01:51:49Z

    [SPARK-25387][SQL] Fix for NPE caused by bad CSV input
    
    ## What changes were proposed in this pull request?
    
    The PR fixes an NPE in `UnivocityParser` caused by malformed CSV input. In some cases, the `uniVocity` parser can return `null` for bad input. In the PR, I propose to check the result of parsing and not propagate the NPE to upper layers.
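
    A rough sketch of the kind of input involved (made-up data, not the exact reproducer from the JIRA):

    ```scala
    import spark.implicits._

    // A malformed line for which the underlying parser may return null; with this fix
    // the row is handled by the configured parse mode instead of surfacing an NPE.
    val input = Seq("\u0000\u0001,2", "good,3").toDS()
    spark.read
      .schema("a STRING, b INT")
      .option("mode", "PERMISSIVE")
      .csv(input)
      .show()
    ```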
    
    ## How was this patch tested?
    
    I added a test which reproduces the issue, and tested with `CSVSuite`.
    
    Closes #22374 from MaxGekk/npe-on-bad-csv.
    
    Lead-authored-by: Maxim Gekk <ma...@gmail.com>
    Co-authored-by: Maxim Gekk <ma...@databricks.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit 083c9447671719e0bd67312e3d572f6160c06a4a)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 6f4d647e07ef527ef93c4fc849a478008a52bc80
Author: LantaoJin <ji...@...>
Date:   2018-09-13T01:57:34Z

    [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more information like file path to event log
    
    ## What changes were proposed in this pull request?
    
    The metadata field was removed from SparkPlanInfo in #18600. Correspondingly, a lot of metadata was also removed from the SparkListenerSQLExecutionStart event in the Spark event log. If we want to analyze the event log to get all input paths, we can no longer obtain them; the simpleString in the SparkPlanInfo JSON only displays 100 characters, which doesn't help.
    
    Before 2.3, the SparkListenerSQLExecutionStart fragment in the event log looked like the snippet below (it contains the metadata field with the intact information):
    >{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., "metadata": {"Location": "InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"}
    
    After #18600, the metadata field is gone:
    >{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,
    
    So I add this field back to the SparkPlanInfo class, so that the metadata is logged to the event log again. Intact information in the event log is very useful for offline job analysis.
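
    A rough sketch of consuming the restored field offline (the path is hypothetical, and the field names are assumptions about how the event is serialized in the JSON-lines event log):

    ```scala
    import org.apache.spark.sql.functions.col

    val events = spark.read.json("hdfs:///spark-history/application_1234_0001")
    events
      .where(col("Event") === "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart")
      .select("sparkPlanInfo")
      .show(false)
    // Each plan node should now carry its metadata map (e.g. Location, ReadSchema),
    // so input paths can be extracted without relying on the truncated simpleString.
    ```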
    
    ## How was this patch tested?
    Unit test
    
    Closes #22353 from LantaoJin/SPARK-25357.
    
    Authored-by: LantaoJin <ji...@gmail.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit 6dc5921e66d56885b95c07e56e687f9f6c1eaca7)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit ae5c7bb204c52dd18cfb63e5c621537023e36539
Author: Sean Owen <se...@...>
Date:   2018-09-13T03:19:43Z

    [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
    
    (This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)
    
    ## What changes were proposed in this pull request?
    
    Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying on Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.
    
    ## How was this patch tested?
    
    Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.
    
    Closes #22400 from srowen/SPARK-25238.2.
    
    Authored-by: Sean Owen <se...@databricks.com>
    Signed-off-by: hyukjinkwon <gu...@apache.org>
    (cherry picked from commit 08c76b5d39127ae207d9d1fff99c2551e6ce2581)
    Signed-off-by: hyukjinkwon <gu...@apache.org>

commit abb5196c7ef685e1027eb1b0b09f4559d3eba015
Author: Stavros Kontopoulos <st...@...>
Date:   2018-09-13T05:02:59Z

    [SPARK-25295][K8S] Fix executor names collision
    
    ## What changes were proposed in this pull request?
    Fixes the collision issue with Spark executor names in client mode; see SPARK-25295 for the details.
    It follows the cluster naming convention: the app name is used as the prefix, and if that is not defined we use "spark" as the default prefix. E.g. `spark-pi-1536781360723-exec-1`, where spark-pi is the app name passed in the config, or a transformed version of it if it contains illegal characters.
    
    Also fixes the issue with the Spark app name having spaces in cluster mode.
    If you run the Spark Pi test in client mode, it passes.
    The tricky part is that the user may set the app name:
    https://github.com/apache/spark/blob/3030b82c89d3e45a2e361c469fbc667a1e43b854/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala#L30
    If I do:
    
    ```
    ./bin/spark-submit
    ...
     --deploy-mode cluster --name "spark pi"
    ...
    ```
    it will fail, as the app name is used as the prefix of the driver's pod name and cannot contain spaces (according to k8s conventions).
    
    ## How was this patch tested?
    
    Manually, by running a Spark job in client mode.
    To reproduce do:
    ```
    kubectl create -f service.yaml
    kubectl create -f pod.yaml
    ```
     service.yaml :
    ```
    kind: Service
    apiVersion: v1
    metadata:
      name: spark-test-app-1-svc
    spec:
      clusterIP: None
      selector:
        spark-app-selector: spark-test-app-1
      ports:
      - protocol: TCP
        name: driver-port
        port: 7077
        targetPort: 7077
      - protocol: TCP
        name: block-manager
        port: 10000
        targetPort: 10000
    ```
    pod.yaml:
    
    ```
    apiVersion: v1
    kind: Pod
    metadata:
      name: spark-test-app-1
      labels:
        spark-app-selector: spark-test-app-1
    spec:
      containers:
      - name: spark-test
        image: skonto/spark:k8s-client-fix
        imagePullPolicy: Always
        command:
          - 'sh'
          - '-c'
          -  "/opt/spark/bin/spark-submit
                  --verbose
                  --master k8s://https://kubernetes.default.svc
                  --deploy-mode client
                  --class org.apache.spark.examples.SparkPi
                  --conf spark.app.name=spark
                  --conf spark.executor.instances=1
                  --conf spark.kubernetes.container.image=skonto/spark:k8s-client-fix
                  --conf spark.kubernetes.container.image.pullPolicy=Always
                  --conf spark.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
                  --conf spark.kubernetes.authenticate.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  --conf spark.executor.memory=500m
                  --conf spark.executor.cores=1
                  --conf spark.executor.instances=1
                  --conf spark.driver.host=spark-test-app-1-svc.default.svc
                  --conf spark.driver.port=7077
                  --conf spark.driver.blockManager.port=10000
                  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar 1000000"
    ```
    
    Closes #22405 from skonto/fix-k8s-client-mode-executor-names.
    
    Authored-by: Stavros Kontopoulos <st...@lightbend.com>
    Signed-off-by: Yinan Li <yn...@google.com>
    (cherry picked from commit 3e75a9fa24f8629d068b5fbbc7356ce2603fa58d)
    Signed-off-by: Yinan Li <yn...@google.com>

commit e7f511ad0803f4a25c657ea25a63a70c6f33367a
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-09-13T12:21:00Z

    [SPARK-25352][SQL][FOLLOWUP] Add helper method and address style issue
    
    ## What changes were proposed in this pull request?
    
    This follow-up patch addresses [the review comment](https://github.com/apache/spark/pull/22344/files#r217070658) by adding a helper method to simplify code and fixing style issue.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Author: Liang-Chi Hsieh <vi...@gmail.com>
    
    Closes #22409 from viirya/SPARK-25352-followup.
    
    (cherry picked from commit 5b761c537a600115450b53817bee0679d5c2bb97)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit cc19f424bc7d405acdec024a983345ec986b25fc
Author: LucaCanali <lu...@...>
Date:   2018-09-13T15:19:21Z

    [SPARK-25170][DOC] Add list and short description of Spark Executor Task Metrics to the documentation.
    
    ## What changes were proposed in this pull request?
    
    Add description of Executor Task Metrics to the documentation.
    
    Closes #22397 from LucaCanali/docMonitoringTaskMetrics.
    
    Authored-by: LucaCanali <lu...@cern.ch>
    Signed-off-by: Sean Owen <se...@databricks.com>
    (cherry picked from commit 45c4ebc8171d75fc0d169bb8071a4c43263d283e)
    Signed-off-by: Sean Owen <se...@databricks.com>

commit 35a84baa5b0e7ee68b1a810c77d92b7b39e47a02
Author: Michael Allman <ms...@...>
Date:   2018-09-13T17:08:45Z

    [SPARK-25406][SQL] For ParquetSchemaPruningSuite.scala, move calls to `withSQLConf` inside calls to `test`
    
    (Link to Jira: https://issues.apache.org/jira/browse/SPARK-25406)
    
    ## What changes were proposed in this pull request?
    
    The current use of `withSQLConf` in `ParquetSchemaPruningSuite.scala` is incorrect. The desired configuration settings are not being set when running the test cases.
    
    This PR fixes that defective usage and addresses the test failures that were previously masked by that defect.
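
    A self-contained analogue of the defect (the suite and helper below are illustrative, not the real `ParquetSchemaPruningSuite`): a setting applied only around test *registration* is gone by the time the test body runs.

    ```scala
    import org.scalatest.FunSuite

    class ConfOrderingDemo extends FunSuite {
      var flag = false
      def withFlag(body: => Unit): Unit = { flag = true; try body finally flag = false }

      // Broken ordering: registration happens inside withFlag, but execution does not,
      // so the test body observes the default value.
      withFlag { test("default setting leaks in") { assert(!flag) } }

      // Fixed ordering: the body itself runs inside withFlag.
      test("overridden setting is in effect") { withFlag { assert(flag) } }
    }
    ```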
    
    ## How was this patch tested?
    
    I added code to relevant test cases to print the expected SQL configuration settings and found that the settings were not being set as expected. When I changed the order of calls to `test` and `withSQLConf` I found that the configuration settings were being set as expected.
    
    Closes #22394 from mallman/spark-25406-fix_broken_schema_pruning_tests.
    
    Authored-by: Michael Allman <ms...@allman.ms>
    Signed-off-by: DB Tsai <d_...@apple.com>
    (cherry picked from commit a7e5aa6cd430d0a49bb6dac92c007fab189db3a3)
    Signed-off-by: DB Tsai <d_...@apple.com>

commit 9273be09f64f23d70c13fd80479cc41ebd514313
Author: Imran Rashid <ir...@...>
Date:   2018-09-13T19:11:55Z

    [SPARK-25400][CORE][TEST] Increase test timeouts
    
    We've seen some flakiness in jenkins in SchedulerIntegrationSuite which looks like it just needs a
    longer timeout.
    
    Closes #22385 from squito/SPARK-25400.
    
    Authored-by: Imran Rashid <ir...@cloudera.com>
    Signed-off-by: Sean Owen <se...@databricks.com>
    (cherry picked from commit 9deddbb13edebfefb3fd03f063679ed12e73c575)
    Signed-off-by: Sean Owen <se...@databricks.com>

commit 1220ab8a0738b5f67dc522df5e3e77ffc83d207a
Author: Wenchen <we...@...>
Date:   2018-09-14T12:39:46Z

    Preparing Spark release v2.4.0-rc1

commit 8cdf7f4c9345f8a58adffcf048fb84cc618cffcf
Author: Wenchen <we...@...>
Date:   2018-09-14T12:41:28Z

    Preparing development version 2.4.1-SNAPSHOT

commit 59054fa89b1f39f0d5d83cfe0b531ec39517f8fe
Author: Takuya UESHIN <ue...@...>
Date:   2018-09-14T16:25:27Z

    [SPARK-25431][SQL][EXAMPLES] Fix function examples and unify the format of the example results.
    
    ## What changes were proposed in this pull request?
    
    There are some mistakes in the examples of newly added functions. Also, the format of the example results is not unified. We should fix and unify them.
    
    ## How was this patch tested?
    
    Manually executed the examples.
    
    Closes #22421 from ueshin/issues/SPARK-25431/fix_examples.
    
    Authored-by: Takuya UESHIN <ue...@databricks.com>
    Signed-off-by: gatorsmile <ga...@gmail.com>
    (cherry picked from commit 9c25d7f735ed8c49c795babea3fda3cab226e7cb)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit d3f5475a1a9efc1dcffbf5a4697f8431b0588e9e
Author: cclauss <cc...@...>
Date:   2018-09-15T01:13:07Z

    [SPARK-25238][PYTHON] lint-python: Upgrade pycodestyle to v2.4.0
    
    See https://pycodestyle.readthedocs.io/en/latest/developer.html#changes for changes made in this release.
    
    ## What changes were proposed in this pull request?
    
    Upgrade pycodestyle to v2.4.0
    
    ## How was this patch tested?
    
    __pycodestyle__
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Closes #22231 from cclauss/patch-1.
    
    Authored-by: cclauss <cc...@bluewin.ch>
    Signed-off-by: Sean Owen <se...@databricks.com>
    (cherry picked from commit 9bb798f2e6eefd9edb7b6d9980a894557c107bd3)
    Signed-off-by: Sean Owen <se...@databricks.com>

commit ae2ca0e5ddc477dea3fdffafe5b69f548b502692
Author: Takuya UESHIN <ue...@...>
Date:   2018-09-15T03:51:46Z

    Revert "[SPARK-25431][SQL][EXAMPLES] Fix function examples and unify the format of the example results."
    
    This reverts commit 59054fa89b1f39f0d5d83cfe0b531ec39517f8fe.

commit b40e5feec2660891590e21807133a508cbd004d3
Author: Dongjoon Hyun <do...@...>
Date:   2018-09-16T00:48:39Z

    [SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption
    
    ## What changes were proposed in this pull request?
    
    This PR aims to fix three things in `FilterPushdownBenchmark`.
    
    **1. Use the same memory assumption.**
    The following configurations are used in ORC and Parquet.
    
    - Memory buffer for writing
      - parquet.block.size (default: 128MB)
      - orc.stripe.size (default: 64MB)
    
    - Compression chunk size
      - parquet.page.size (default: 1MB)
      - orc.compress.size (default: 256KB)
    
    SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But it missed matching `orc.compress.size`. So the current benchmark compares ORC using 256KB of memory for compression against Parquet using 1MB. To compare correctly, we need to be consistent (a sketch of one way to align these settings follows below).
    
    **2. Dictionary encoding should not be enforced for all cases.**
    SPARK-24206 enforced dictionary encoding for all test cases. This PR recovers the default behavior in general and enforces dictionary encoding only in the case of `prepareStringDictTable`.
    
    **3. Generate test result on AWS r3.xlarge**
    SPARK-24206 generated the result on AWS in order to reproduce and compare easily. This PR also updates the result on the same kind of machine, for the same reason. Specifically, AWS r3.xlarge with Instance Store is used.
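
    A sketch of one way to align the settings quoted above (everything is simply set to 1MB for illustration; this is not necessarily how the benchmark itself wires them):

    ```scala
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.setInt("parquet.block.size", 1 * 1024 * 1024)  // write buffer, matches 1MB
    hadoopConf.setInt("orc.stripe.size",    1 * 1024 * 1024)  // write buffer, matches 1MB
    hadoopConf.setInt("parquet.page.size",  1 * 1024 * 1024)  // compression chunk
    hadoopConf.setInt("orc.compress.size",  1 * 1024 * 1024)  // previously left at its 256KB default
    ```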
    
    ## How was this patch tested?
    
    Manual. Enable the test cases and run `FilterPushdownBenchmark` on `AWS r3.xlarge`. It takes about 4 hours 15 minutes.
    
    Closes #22427 from dongjoon-hyun/SPARK-25438.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
    (cherry picked from commit fefaa3c30df2c56046370081cb51bfe68d26976b)
    Signed-off-by: Dongjoon Hyun <do...@apache.org>

commit b839721f3cea2b9d9af73ab4fd9dad225025ec86
Author: npoggi <np...@...>
Date:   2018-09-16T03:06:08Z

    [SPARK-25439][TESTS][SQL] Fixes TPCHQuerySuite datatype of customer.c_nationkey to BIGINT according to spec
    
    ## What changes were proposed in this pull request?
    Fixes the TPCH DDL datatype of `customer.c_nationkey` in `TPCHQuerySuite.scala` from `STRING` to `BIGINT`, matching the spec and `nation.nationkey`. The rest of the keys are OK.
    Note that this makes **previous results non-comparable** with new runs involving the customer table.
    
    ## How was this patch tested?
    Manual tests
    
    Author: npoggi <np...@gmail.com>
    
    Closes #22430 from npoggi/SPARK-25439_Fix-TPCH-customer-c_nationkey.
    
    (cherry picked from commit 02c2963f895b9d78d7f6d9972cacec4ef55fa278)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 60af706b4c49fa1be1b2b1223490c98868c801c3
Author: Dongjoon Hyun <do...@...>
Date:   2018-09-16T04:14:19Z

    [SPARK-24418][FOLLOWUP][DOC] Update docs to show Scala 2.11.12
    
    ## What changes were proposed in this pull request?
    
    SPARK-24418 upgrades Scala to 2.11.12. This PR updates the Scala version in the docs.
    
    - https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications (screenshot)
    ![screen1](https://user-images.githubusercontent.com/9700541/45590509-9c5f0400-b8ee-11e8-9293-e48d297db894.png)
    
    - https://spark.apache.org/docs/latest/rdd-programming-guide.html#working-with-key-value-pairs (Scala, Java)
    (These are hyperlink updates)
    
    - https://spark.apache.org/docs/latest/streaming-flume-integration.html#configuring-flume-1 (screenshot)
    ![screen2](https://user-images.githubusercontent.com/9700541/45590511-a123b800-b8ee-11e8-97a5-b7f2288229c2.png)
    
    ## How was this patch tested?
    
    Manual.
    ```bash
    $ cd docs
    $ SKIP_API=1 jekyll build
    ```
    
    Closes #22431 from dongjoon-hyun/SPARK-24418.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: DB Tsai <d_...@apple.com>
    (cherry picked from commit bfcf7426057a964b3cee90089aab6c003addc4fb)
    Signed-off-by: DB Tsai <d_...@apple.com>

commit 1cb1e43012e57e649d77524f8ff2de231f52c66a
Author: Michael Chirico <mi...@...>
Date:   2018-09-16T19:57:44Z

    [MINOR][DOCS] Axe deprecated doc refs
    
    Continuation of #22370. Summary of discussion there:
    
    There is some inconsistency in the R manual w.r.t. superseding functions linking back to deprecated functions.
    
     - `createOrReplaceTempView` and `createTable` both link back to functions which are deprecated (`registerTempTable` and `createExternalTable`, respectively)
     - `sparkR.session` and `dropTempView` do _not_ link back to deprecated functions
    
    This PR takes the view that it is preferable _not_ to link back to deprecated functions, and removes these references from `?createOrReplaceTempView` and `?createTable`.
    
    As `registerTempTable` was included in the `SparkDataFrame functions` `family` of functions, other documentation pages which included a link to `?registerTempTable` will similarly be altered.
    
    Author: Michael Chirico <mi...@grabtaxi.com>
    Author: Michael Chirico <mi...@gmail.com>
    
    Closes #22393 from MichaelChirico/axe_deprecated_doc_refs.
    
    (cherry picked from commit a1dd78255a3ae023820b2f245cd39f0c57a32fb1)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit fb1539ad876d0878dde56258af53399dfdf706eb
Author: Dongjoon Hyun <do...@...>
Date:   2018-09-17T03:07:51Z

    [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky ExternalAppendOnlyMapSuite due to timeout
    
    ## What changes were proposed in this pull request?
    
    SPARK-22713 uses [`eventually` with the default timeout `150ms`](https://github.com/apache/spark/pull/21369/files#diff-5bbb6a931b7e4d6a31e4938f51935682R462). That causes flakiness, because the block effectively gets only one attempt when GC is slow.
    
    ```scala
    eventually {
      System.gc()
      ...
    }
    ```
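
    A sketch of the direction of the fix (the patience values are illustrative, not the ones chosen in this patch; the weak reference stands in for the suite's check): give `eventually` an explicit, longer timeout so a slow GC gets several attempts.

    ```scala
    import java.lang.ref.WeakReference
    import org.scalatest.concurrent.Eventually._
    import org.scalatest.time.SpanSugar._

    var payload: AnyRef = new Array[Byte](1 << 20)
    val ref = new WeakReference[AnyRef](payload)
    payload = null                                  // drop the strong reference

    eventually(timeout(60.seconds), interval(1.second)) {
      System.gc()
      assert(ref.get() == null, "weakly referenced object not collected yet")
    }
    ```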
    
    **Failures**
    ```scala
    org.scalatest.exceptions.TestFailedDueToTimeoutException:
    The code passed to eventually never returned normally.
    Attempted 1 times over 501.22261 milliseconds.
    Last failure message: tmpIsNull was false.
    ```
    - master-test-sbt-hadoop-2.7
      [4916](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4916)
      [4907](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4907)
      [4906](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4906)
    
    - spark-master-test-sbt-hadoop-2.6
      [4979](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4979)
      [4974](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4974)
      [4967](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4967)
      [4966](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4966)
    
    ## How was this patch tested?
    
    Pass the Jenkins.
    
    Closes #22432 from dongjoon-hyun/SPARK-22713.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit 538e0478783160d8fab2dc76fd8fc7b469cb4e19)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23198: Branch 2.4

Posted by edurekagithub <gi...@git.apache.org>.
Github user edurekagithub closed the pull request at:

    https://github.com/apache/spark/pull/23198


---



[GitHub] spark issue #23198: Branch 2.4

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23198
  
    Can one of the admins verify this patch?


---
