You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by engineeyao <gi...@git.apache.org> on 2017/10/04 02:50:12 UTC

[GitHub] spark pull request #19423: Branch 2.2

GitHub user engineeyao opened a pull request:

    https://github.com/apache/spark/pull/19423

    Branch 2.2

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19423.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19423
    
----
commit 9cbf39f1c74f16483865cd93d6ffc3c521e878a7
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-05-25T12:15:15Z

    [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
    
    ## What changes were proposed in this pull request?
    Follow-up for #17218, some minor fix for PySpark ```FPGrowth```.
    
    ## How was this patch tested?
    Existing UT.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #18089 from yanboliang/spark-19281.
    
    (cherry picked from commit 913a6bfe4b0eb6b80a03b858ab4b2767194103de)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit e01f1f222bcb7c469b1e1595e9338ed478d99894
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-05-25T13:40:39Z

    [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth.
    
    ## What changes were proposed in this pull request?
    
    Expose numPartitions (expert) param of PySpark FPGrowth.
    
    ## How was this patch tested?
    
    + [x] Pass all unit tests.
    
    Author: Yan Facai (颜发才) <fa...@gmail.com>
    
    Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition.
    
    (cherry picked from commit 139da116f130ed21481d3e9bdee5df4b8d7760ac)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit 022a4957d8dc8d6049e0a8c9191fcfd1bd95a4a4
Author: Lior Regev <li...@gmail.com>
Date:   2017-05-25T16:08:19Z

    [SPARK-20741][SPARK SUBMIT] Added cleanup of JARs archive generated by SparkSubmit
    
    ## What changes were proposed in this pull request?
    
    Deleted generated JARs archive after distribution to HDFS
    
    ## How was this patch tested?
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Lior Regev <li...@gmail.com>
    
    Closes #17986 from liorregev/master.
    
    (cherry picked from commit 7306d556903c832984c7f34f1e8fe738a4b2343c)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 5ae1c652147aba9c5087335b0c6916a1035090b2
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-05-25T16:10:30Z

    [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid path check for sc.addJar on Windows
    
    ## What changes were proposed in this pull request?
    
    This PR proposes two things:
    
    - A follow up for SPARK-19707 (Improving the invalid path check for sc.addJar on Windows as well).
    
    ```
    org.apache.spark.SparkContextSuite:
     - add jar with invalid path *** FAILED *** (32 milliseconds)
       2 was not equal to 1 (SparkContextSuite.scala:309)
       ...
    ```
    
    - Fix path vs URI related test failures on Windows.
    
    ```
    org.apache.spark.storage.LocalDirsSuite:
     - SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0 milliseconds)
       new java.io.File("/NONEXISTENT_PATH").exists() was true (LocalDirsSuite.scala:50)
       ...
    
     - Utils.getLocalDir() throws an exception if any temporary directory cannot be retrieved *** FAILED *** (15 milliseconds)
       Expected exception java.io.IOException to be thrown, but no exception was thrown. (LocalDirsSuite.scala:64)
       ...
    ```
    
    ```
    org.apache.spark.sql.hive.HiveSchemaInferenceSuite:
     - orc: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
       java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254
       ...
    
     - parquet: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
       java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939
       ...
    
     - orc: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (141 milliseconds)
       java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fb464e59-b049-481b-9c75-f53295c9fc2c
       ...
    
     - parquet: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (125 milliseconds)
       java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-9487568e-80a4-42b3-b0a5-d95314c4ccbc
       ...
    
     - orc: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (156 milliseconds)
       java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-0d2dfa45-1b0f-4958-a8be-1074ed0135a
       ...
    
     - parquet: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (547 milliseconds)
       java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-6d95d64e-613e-4a59-a0f6-d198c5aa51ee
       ...
    ```
    
    ```
    org.apache.spark.sql.execution.command.DDLSuite:
     - create temporary view using *** FAILED *** (15 milliseconds)
       org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-3881d9ca-561b-488d-90b9-97587472b853	mp;
       ...
    
     - insert data to a data source table which has a non-existing location should succeed *** FAILED *** (109 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-4cad3d19-6085-4b75-b407-fe5e9d21df54 did not equal file:///C:/projects/spark/target/tmp/spark-4cad3d19-6085-4b75-b407-fe5e9d21df54 (DDLSuite.scala:1869)
       ...
    
     - insert into a data source table with a non-existing partition location should succeed *** FAILED *** (94 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d did not equal file:///C:/projects/spark/target/tmp/spark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d (DDLSuite.scala:1910)
       ...
    
     - read data from a data source table which has a non-existing location should succeed *** FAILED *** (93 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-f8c281e2-08c2-4f73-abbf-f3865b702c34 did not equal file:///C:/projects/spark/target/tmp/spark-f8c281e2-08c2-4f73-abbf-f3865b702c34 (DDLSuite.scala:1937)
       ...
    
     - read data from a data source table with non-existing partition location should succeed *** FAILED *** (110 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - create datasource table with a non-existing location *** FAILED *** (94 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-387316ae-070c-4e78-9b78-19ebf7b29ec8 did not equal file:///C:/projects/spark/target/tmp/spark-387316ae-070c-4e78-9b78-19ebf7b29ec8 (DDLSuite.scala:1982)
       ...
    
     - CTAS for external data source table with a non-existing location *** FAILED *** (16 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - CTAS for external data source table with a existed location *** FAILED *** (15 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - data source table:partition column name containing a b *** FAILED *** (125 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - data source table:partition column name containing a:b *** FAILED *** (143 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - data source table:partition column name containing a%b *** FAILED *** (109 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - data source table:partition column name containing a,b *** FAILED *** (109 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - location uri contains a b for datasource table *** FAILED *** (94 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-5739cda9-b702-4e14-932c-42e8c4174480a%20b did not equal file:///C:/projects/spark/target/tmp/spark-5739cda9-b702-4e14-932c-42e8c4174480/a%20b (DDLSuite.scala:2084)
       ...
    
     - location uri contains a:b for datasource table *** FAILED *** (78 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-9bdd227c-840f-4f08-b7c5-4036638f098da:b did not equal file:///C:/projects/spark/target/tmp/spark-9bdd227c-840f-4f08-b7c5-4036638f098d/a:b (DDLSuite.scala:2084)
       ...
    
     - location uri contains a%b for datasource table *** FAILED *** (78 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-62bb5f1d-fa20-460a-b534-cb2e172a3640a%25b did not equal file:///C:/projects/spark/target/tmp/spark-62bb5f1d-fa20-460a-b534-cb2e172a3640/a%25b (DDLSuite.scala:2084)
       ...
    
     - location uri contains a b for database *** FAILED *** (16 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - location uri contains a:b for database *** FAILED *** (15 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - location uri contains a%b for database *** FAILED *** (0 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    ```
    
    ```
    org.apache.spark.sql.hive.execution.HiveDDLSuite:
     - create hive table with a non-existing location *** FAILED *** (16 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - CTAS for external hive table with a non-existing location *** FAILED *** (16 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - CTAS for external hive table with a existed location *** FAILED *** (16 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - partition column name of parquet table containing a b *** FAILED *** (156 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - partition column name of parquet table containing a:b *** FAILED *** (94 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - partition column name of parquet table containing a%b *** FAILED *** (125 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - partition column name of parquet table containing a,b *** FAILED *** (110 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    
     - partition column name of hive table containing a b *** FAILED *** (15 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - partition column name of hive table containing a:b *** FAILED *** (16 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - partition column name of hive table containing a%b *** FAILED *** (16 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - partition column name of hive table containing a,b *** FAILED *** (0 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - hive table: location uri contains a b *** FAILED *** (0 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - hive table: location uri contains a:b *** FAILED *** (0 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    
     - hive table: location uri contains a%b *** FAILED *** (0 milliseconds)
       org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
       ...
    ```
    
    ```
    org.apache.spark.sql.sources.PathOptionSuite:
     - path option also exist for write path *** FAILED *** (94 milliseconds)
       file:/C:projectsspark%09arget%09mpspark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc did not equal file:///C:/projects/spark/target/tmp/spark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc (PathOptionSuite.scala:98)
       ...
    ```
    
    ```
    org.apache.spark.sql.CachedTableSuite:
     - SPARK-19765: UNCACHE TABLE should un-cache all cached plans that refer to this table *** FAILED *** (110 milliseconds)
       java.lang.IllegalArgumentException: Can not create a Path from an empty string
       ...
    ```
    
    ```
    org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite:
     - treeString is redacted *** FAILED *** (250 milliseconds)
       "file:/C:/projects/spark/target/tmp/spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" did not contain "C:\projects\spark\target\tmp\spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" (DataSourceScanExecRedactionSuite.scala:46)
       ...
    ```
    
    ## How was this patch tested?
    
    Tested via AppVeyor for each and checked it passed once each. These should be retested via AppVeyor in this PR.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #17987 from HyukjinKwon/windows-20170515.
    
    (cherry picked from commit e9f983df275c138626af35fd263a7abedf69297f)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 7a21de9e2bb0d9344a371a8570b2fffa68c3236e
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-05-25T17:49:14Z

    [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples project
    
    ## What changes were proposed in this pull request?
    
    Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`.
    
    ## How was this patch tested?
    
    manually tested it.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #18101 from zsxwing/add-missing-example-dep.
    
    (cherry picked from commit 98c3852986a2cb5f2d249d6c8ef602be283bd90e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 289dd170cb3e0b9eca9af5841a0155ceaffee447
Author: Michael Allman <mi...@videoamp.com>
Date:   2017-05-26T01:25:43Z

    [SPARK-20888][SQL][DOCS] Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode
    
    (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20888)
    
    ## What changes were proposed in this pull request?
    
    Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode configuration key from NEVER_INFO to INFER_AND_SAVE in the Spark SQL 2.1 to 2.2 migration notes.
    
    Author: Michael Allman <mi...@videoamp.com>
    
    Closes #18112 from mallman/spark-20888-document_infer_and_save.
    
    (cherry picked from commit c1e7989c4ffd83c51f5c97998b4ff6fe8dd83cf4)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit fafe283277b50974c26684b06449086acd0cf05a
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-05-26T07:01:28Z

    [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo
    
    ## What changes were proposed in this pull request?
    
    Long time ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in shuffle writer about `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the writing, try to discover the bug earlier.
    
     However this checking is missing in the new `UnsafeShuffleWriter`, this PR adds it.
    
    https://issues.apache.org/jira/browse/SPARK-18105 maybe related to that `FileChannel.transferTo` bug, hopefully we can find out the root cause after adding this position check.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #18091 from cloud-fan/shuffle.
    
    (cherry picked from commit d9ad78908f6189719cec69d34557f1a750d2e6af)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit f99456b5f6225a534ce52cf2b817285eb8853926
Author: NICHOLAS T. MARION <nm...@us.ibm.com>
Date:   2017-05-10T09:59:57Z

    [SPARK-20393][WEBU UI] Strengthen Spark to prevent XSS vulnerabilities
    
    ## What changes were proposed in this pull request?
    
    Add stripXSS and stripXSSMap to Spark Core's UIUtils. Calling these functions at any point that getParameter is called against a HttpServletRequest.
    
    ## How was this patch tested?
    
    Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.
    
    Author: NICHOLAS T. MARION <nm...@us.ibm.com>
    
    Closes #17686 from n-marion/xss-fix.
    
    (cherry picked from commit b512233a457092b0e2a39d0b42cb021abc69d375)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 92837aeb47fc3427166e4b6e62f6130f7480d7fa
Author: Kazuaki Ishizaki <is...@jp.ibm.com>
Date:   2017-05-16T21:47:21Z

    [SPARK-19372][SQL] Fix throwing a Java exception at df.fliter() due to 64KB bytecode size limit
    
    ## What changes were proposed in this pull request?
    
    When an expression for `df.filter()` has many nodes (e.g. 400), the size of Java bytecode for the generated Java code is more than 64KB. It produces an Java exception. As a result, the execution fails.
    This PR continues to execute by calling `Expression.eval()` disabling code generation if an exception has been caught.
    
    ## How was this patch tested?
    
    Add a test suite into `DataFrameSuite`
    
    Author: Kazuaki Ishizaki <is...@jp.ibm.com>
    
    Closes #17087 from kiszk/SPARK-19372.

commit 2b59ed4f1d4e859d5987b6eaaee074260b2a12f8
Author: Michael Armbrust <mi...@databricks.com>
Date:   2017-05-26T20:33:23Z

    [SPARK-20844] Remove experimental from Structured Streaming APIs
    
    Now that Structured Streaming has been out for several Spark release and has large production use cases, the `Experimental` label is no longer appropriate.  I've left `InterfaceStability.Evolving` however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #18065 from marmbrus/streamingGA.

commit 30922dec8a8cc598b6715f85281591208a91df00
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-05-26T22:01:01Z

    [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
    
    ## What changes were proposed in this pull request?
    
    - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
    - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
    - Remove bucketing from Unsupported Hive Functionalities.
    
    ## How was this patch tested?
    
    Manual tests, docs build.
    
    Author: zero323 <ze...@users.noreply.github.com>
    
    Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
    
    (cherry picked from commit ae33abf71b353c638487948b775e966c7127cd46)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit fc799d730304c6a176636b414fc15184e89367d7
Author: Yu Peng <lo...@gmail.com>
Date:   2017-05-26T23:28:36Z

    [SPARK-10643][CORE] Make spark-submit download remote files to local in client mode
    
    ## What changes were proposed in this pull request?
    
    This PR makes spark-submit script download remote files to local file system for local/standalone client mode.
    
    ## How was this patch tested?
    
    - Unit tests
    - Manual tests by adding s3a jar and testing against file on s3.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Yu Peng <lo...@gmail.com>
    
    Closes #18078 from loneknightpy/download-jar-in-spark-submit.
    
    (cherry picked from commit 4af37812915763ac3bfd91a600a7f00a4b84d29a)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 39f76657ef2967f4c87230e06cbbb1611c276375
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-05-27T02:57:43Z

    [SPARK-19659][CORE][FOLLOW-UP] Fetch big blocks to disk when shuffle-read
    
    ## What changes were proposed in this pull request?
    
    This PR includes some minor improvement for the comments and tests in https://github.com/apache/spark/pull/16989
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #18117 from cloud-fan/follow.
    
    (cherry picked from commit 1d62f8aca82601506c44b6fd852f4faf3602d7e2)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit f2408bdd7a0950385ee1364e006d55bfa6e5a200
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-05-27T05:25:38Z

    [SPARK-20843][CORE] Add a config to set driver terminate timeout
    
    ## What changes were proposed in this pull request?
    
    Add a `worker` configuration to set how long to wait before forcibly killing driver.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #18126 from zsxwing/SPARK-20843.
    
    (cherry picked from commit 6c1dbd6fc8d49acf7c1c902d2ebf89ed5e788a4e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 25e87d80c483785dc4a79fb283bc80f68197bf4a
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-05-27T23:16:51Z

    [SPARK-20897][SQL] cached self-join should not fail
    
    ## What changes were proposed in this pull request?
    
    The failed test case is, we have a `SortMergeJoinExec` for a self-join, which means we have a `ReusedExchange` node in the query plan. It works fine without caching, but throws an exception in `SortMergeJoinExec.outputPartitioning` if we cache it.
    
    The root cause is, `ReusedExchange` doesn't propagate the output partitioning from its child, so in `SortMergeJoinExec.outputPartitioning` we create `PartitioningCollection` with a hash partitioning and an unknown partitioning, and fail.
    
    This bug is mostly fine, because inserting the `ReusedExchange` is the last step to prepare the physical plan, we won't call `SortMergeJoinExec.outputPartitioning` anymore after this.
    
    However, if the dataframe is cached, the physical plan of it becomes `InMemoryTableScanExec`, which contains another physical plan representing the cached query, and it has gone through the entire planning phase and may have `ReusedExchange`. Then the planner call `InMemoryTableScanExec.outputPartitioning`, which then calls `SortMergeJoinExec.outputPartitioning` and trigger this bug.
    
    ## How was this patch tested?
    
    a new regression test
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #18121 from cloud-fan/bug.
    
    (cherry picked from commit 08ede46b897b7e52cfe8231ffc21d9515122cf49)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit dc51be1e79b89d143da5df16b893df86a306f059
Author: Xiao Li <ga...@gmail.com>
Date:   2017-05-28T04:32:18Z

    [SPARK-20908][SQL] Cache Manager: Hint should be ignored in plan matching
    
    ### What changes were proposed in this pull request?
    
    In Cache manager, the plan matching should ignore Hint.
    ```Scala
          val df1 = spark.range(10).join(broadcast(spark.range(10)))
          df1.cache()
          spark.range(10).join(spark.range(10)).explain()
    ```
    The output plan of the above query shows that the second query is  not using the cached data of the first query.
    ```
    BroadcastNestedLoopJoin BuildRight, Inner
    :- *Range (0, 10, step=1, splits=2)
    +- BroadcastExchange IdentityBroadcastMode
       +- *Range (0, 10, step=1, splits=2)
    ```
    
    After the fix, the plan becomes
    ```
    InMemoryTableScan [id#20L, id#23L]
       +- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
             +- BroadcastNestedLoopJoin BuildRight, Inner
                :- *Range (0, 10, step=1, splits=2)
                +- BroadcastExchange IdentityBroadcastMode
                   +- *Range (0, 10, step=1, splits=2)
    ```
    
    ### How was this patch tested?
    Added a test.
    
    Author: Xiao Li <ga...@gmail.com>
    
    Closes #18131 from gatorsmile/HintCache.
    
    (cherry picked from commit 06c155c90dc784b07002f33d98dcfe9be1e38002)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 26640a26984bac4fc1037714e60bd3607929b377
Author: Kazuaki Ishizaki <is...@jp.ibm.com>
Date:   2017-05-29T19:17:14Z

    [SPARK-20907][TEST] Use testQuietly for test suites that generate long log output
    
    ## What changes were proposed in this pull request?
    
    Supress console output by using `testQuietly` in test suites
    
    ## How was this patch tested?
    
    Tested by `"SPARK-19372: Filter can be executed w/o generated code due to JVM code size limit"` in `DataFrameSuite`
    
    Author: Kazuaki Ishizaki <is...@jp.ibm.com>
    
    Closes #18135 from kiszk/SPARK-20907.
    
    (cherry picked from commit c9749068ecf8e0acabdfeeceeedff0f1f73293b7)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 3b79e4cda74e0bf82ec55e673beb8f84e7cfaca4
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-05-29T23:10:22Z

    [SPARK-8184][SQL] Add additional function description for weekofyear
    
    ## What changes were proposed in this pull request?
    
    Add additional function description for weekofyear.
    
    ## How was this patch tested?
    
     manual tests
    
    ![weekofyear](https://cloud.githubusercontent.com/assets/5399861/26525752/08a1c278-4394-11e7-8988-7cbf82c3a999.gif)
    
    Author: Yuming Wang <wg...@gmail.com>
    
    Closes #18132 from wangyum/SPARK-8184.
    
    (cherry picked from commit 1c7db00c74ec6a91c7eefbdba85cbf41fbe8634a)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit f6730a70cb47ebb3df7f42209df7b076aece1093
Author: Prashant Sharma <pr...@in.ibm.com>
Date:   2017-05-30T01:12:01Z

    [SPARK-19968][SS] Use a cached instance of `KafkaProducer` instead of creating one every batch.
    
    ## What changes were proposed in this pull request?
    
    In summary, cost of recreating a KafkaProducer for writing every batch is high as it starts a lot threads and make connections and then closes them. A KafkaProducer instance is promised to be thread safe in Kafka docs. Reuse of KafkaProducer instance while writing via multiple threads is encouraged.
    
    Furthermore, I have performance improvement of 10x in latency, with this patch.
    
    ### These are times that addBatch took in ms. Without applying this patch
    ![with-out_patch](https://cloud.githubusercontent.com/assets/992952/23994612/a9de4a42-0a6b-11e7-9d5b-7ae18775bee4.png)
    ### These are times that addBatch took in ms. After applying this patch
    ![with_patch](https://cloud.githubusercontent.com/assets/992952/23994616/ad8c11ec-0a6b-11e7-8634-2266ebb5033f.png)
    
    ## How was this patch tested?
    Running distributed benchmarks comparing runs with this patch and without it.
    Added relevant unit tests.
    
    Author: Prashant Sharma <pr...@in.ibm.com>
    
    Closes #17308 from ScrapCodes/cached-kafka-producer.
    
    (cherry picked from commit 96a4d1d0827fc3fba83f174510b061684f0d00f7)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 5fdc7d80f46d51d4a8e49d9390b191fff42ec222
Author: Xiao Li <ga...@gmail.com>
Date:   2017-05-30T21:06:19Z

    [SPARK-20924][SQL] Unable to call the function registered in the not-current database
    
    ### What changes were proposed in this pull request?
    We are unable to call the function registered in the not-current database.
    ```Scala
    sql("CREATE DATABASE dAtABaSe1")
    sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS '${classOf[GenericUDAFAverage].getName}'")
    sql("SELECT dAtABaSe1.test_avg(1)")
    ```
    The above code returns an error:
    ```
    Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
    ```
    
    This PR is to fix the above issue.
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <ga...@gmail.com>
    
    Closes #18146 from gatorsmile/qualifiedFunction.
    
    (cherry picked from commit 4bb6a53ebd06de3de97139a2dbc7c85fc3aa3e66)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 287440df6816b5c9f2be2aee949a4c20ab165180
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-05-31T03:24:43Z

    [SPARK-20275][UI] Do not display "Completed" column for in-progress applications
    
    ## What changes were proposed in this pull request?
    
    Current HistoryServer will display completed date of in-progress application as `1969-12-31 23:59:59`, which is not so meaningful. Instead of unnecessarily showing this incorrect completed date, here propose to make this column invisible for in-progress applications.
    
    The purpose of only making this column invisible rather than deleting this field is that: this data is fetched through REST API, and in the REST API  the format is like below shows, in which `endTime` matches `endTimeEpoch`. So instead of changing REST API to break backward compatibility, here choosing a simple solution to only make this column invisible.
    
    ```
    [ {
      "id" : "local-1491805439678",
      "name" : "Spark shell",
      "attempts" : [ {
        "startTime" : "2017-04-10T06:23:57.574GMT",
        "endTime" : "1969-12-31T23:59:59.999GMT",
        "lastUpdated" : "2017-04-10T06:23:57.574GMT",
        "duration" : 0,
        "sparkUser" : "",
        "completed" : false,
        "startTimeEpoch" : 1491805437574,
        "endTimeEpoch" : -1,
        "lastUpdatedEpoch" : 1491805437574
      } ]
    } ]%
    ```
    
    Here is UI before changed:
    
    <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">
    
    And after:
    
    <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png">
    
    ## How was this patch tested?
    
    Manual verification.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #17588 from jerryshao/SPARK-20275.
    
    (cherry picked from commit 52ed9b289d169219f7257795cbedc56565a39c71)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 3cad66e5e06a4020a16fa757fbf67f666b319bab
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-05-31T05:33:29Z

    [SPARK-20877][SPARKR][WIP] add timestamps to test runs
    
    to investigate how long they run
    
    Jenkins, AppVeyor
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #18104 from felixcheung/rtimetest.
    
    (cherry picked from commit 382fefd1879e4670f3e9e8841ec243e3eb11c578)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 3686c2e965758f471f9784b3e06223ce143b6aca
Author: David Eis <de...@bloomberg.net>
Date:   2017-05-31T12:52:55Z

    [SPARK-20790][MLLIB] Correctly handle negative values for implicit feedback in ALS
    
    ## What changes were proposed in this pull request?
    
    Revert the handling of negative values in ALS with implicit feedback, so that the confidence is the absolute value of the rating and the preference is 0 for negative ratings. This was the original behavior.
    
    ## How was this patch tested?
    
    This patch was tested with the existing unit tests and an added unit test to ensure that negative ratings are not ignored.
    
    mengxr
    
    Author: David Eis <de...@bloomberg.net>
    
    Closes #18022 from davideis/bugfix/negative-rating.
    
    (cherry picked from commit d52f636228e833db89045bc7a0c17b72da13f138)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit f59f9a380351726de20453ab101f46e199a7079c
Author: liuxian <li...@zte.com.cn>
Date:   2017-05-31T18:43:36Z

    [SPARK-20876][SQL][BACKPORT-2.2] If the input parameter is float type for ceil or floor,the result is not we expected
    
    ## What changes were proposed in this pull request?
    
    This PR is to backport #18103 to Spark 2.2
    
    ## How was this patch tested?
    unit test
    
    Author: liuxian <li...@zte.com.cn>
    
    Closes #18155 from 10110346/wip-lx-0531.

commit a607a26b344470bbff1247908d49b848bb7918a0
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-06-01T00:26:18Z

    [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException
    
    ## What changes were proposed in this pull request?
    
    `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #18168 from zsxwing/SPARK-20940.
    
    (cherry picked from commit 24db35826a81960f08e3eb68556b0f51781144e1)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 14fda6f313c63c9d5a86595c12acfb1e36df43ad
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-06-01T05:34:53Z

    [SPARK-20244][CORE] Handle incorrect bytesRead metrics when using PySpark
    
    ## What changes were proposed in this pull request?
    
    Hadoop FileSystem's statistics in based on thread local variables, this is ok if the RDD computation chain is running in the same thread. But if child RDD creates another thread to consume the iterator got from Hadoop RDDs, the bytesRead computation will be error, because now the iterator's `next()` and `close()` may run in different threads. This could be happened when using PySpark with PythonRDD.
    
    So here building a map to track the `bytesRead` for different thread and add them together. This method will be used in three RDDs, `HadoopRDD`, `NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called directly, so I only fixed `HadoopRDD` and `NewHadoopRDD`.
    
    ## How was this patch tested?
    
    Unit test and local cluster verification.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #17617 from jerryshao/SPARK-20244.
    
    (cherry picked from commit 5854f77ce1d3b9491e2a6bd1f352459da294e369)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 4ab7b820bfa90bd76670715bedfd155df7d8f0fd
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-06-01T06:17:15Z

    [MINOR][SQL] Fix a few function description error.
    
    ## What changes were proposed in this pull request?
    
    Fix a few function description error.
    
    ## How was this patch tested?
    
    manual tests
    
    ![descissues](https://cloud.githubusercontent.com/assets/5399861/26619392/d547736c-4610-11e7-85d7-aeeb09c02cc8.gif)
    
    Author: Yuming Wang <wg...@gmail.com>
    
    Closes #18157 from wangyum/DescIssues.
    
    (cherry picked from commit c8045f8b482e347eccf2583e0952e1d8bcb6cb96)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 6a4e023b250a86887475958093f1d3bdcbb49a03
Author: Xiao Li <ga...@gmail.com>
Date:   2017-06-01T16:52:18Z

    [SPARK-20941][SQL] Fix SubqueryExec Reuse
    
    Before this PR, Subquery reuse does not work. Below are three issues:
    - Subquery reuse does not work.
    - It is sharing the same `SQLConf` (`spark.sql.exchange.reuse`) with the one for Exchange Reuse.
    - No test case covers the rule Subquery reuse.
    
    This PR is to fix the above three issues.
    - Ignored the physical operator `SubqueryExec` when comparing two plans.
    - Added a dedicated conf `spark.sql.subqueries.reuse` for controlling Subquery Reuse
    - Added a test case for verifying the behavior
    
    N/A
    
    Author: Xiao Li <ga...@gmail.com>
    
    Closes #18169 from gatorsmile/subqueryReuse.
    
    (cherry picked from commit f7cf2096fdecb8edab61c8973c07c6fc877ee32d)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit b81a702f44c3c0fe851a602d5c4682a82149a1f4
Author: Li Yichao <ly...@zhihu.com>
Date:   2017-06-01T21:39:57Z

    [SPARK-20365][YARN] Remove local scheme when add path to ClassPath.
    
    In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars started with "local" scheme), we will get inaccurate classpath for AM and containers. This is because we don't remove "local" scheme when concatenating classpath. It is OK to run because classpath is separated with ":" and java treat "local" as a separate jar. But we could improve it to remove the scheme.
    
    Updated `ClientSuite` to check "local" is not in the classpath.
    
    cc jerryshao
    
    Author: Li Yichao <ly...@zhihu.com>
    Author: Li Yichao <li...@gmail.com>
    
    Closes #18129 from liyichao/SPARK-20365.
    
    (cherry picked from commit 640afa49aa349c7ebe35d365eec3ef9bb7710b1d)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 4cba3b5a350f4d477466fc73b32cbd653eee8405
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2017-06-01T21:44:34Z

    [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher.
    
    Blindly deserializing classes using Java serialization opens the code up to
    issues in other libraries, since just deserializing data from a stream may
    end up execution code (think readObject()).
    
    Since the launcher protocol is pretty self-contained, there's just a handful
    of classes it legitimately needs to deserialize, and they're in just two
    packages, so add a filter that throws errors if classes from any other
    package show up in the stream.
    
    This also maintains backwards compatibility (the updated launcher code can
    still communicate with the backend code in older Spark releases).
    
    Tested with new and existing unit tests.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #18166 from vanzin/SPARK-20922.
    
    (cherry picked from commit 8efc6e986554ae66eab93cd64a9035d716adbab0)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19423: Branch 2.2

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19423
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19423: Branch 2.2

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19423
  
    @engineeyao, close this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19423: Branch 2.2

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19423


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org