Posted to reviews@spark.apache.org by sharkdtu <gi...@git.apache.org> on 2017/02/13 11:56:19 UTC

[GitHub] spark pull request #16911: [SPARK-19576] [Core] Task attempt paths exist in ...

GitHub user sharkdtu opened a pull request:

    https://github.com/apache/spark/pull/16911

    [SPARK-19576] [Core] Task attempt paths exist in output path after saveAsNewAPIHadoopFile completes with speculation enabled

    `writeShard` in `saveAsNewAPIHadoopDataset` always commits its task without question. The problem is that when speculation is enabled, this can sometimes result in multiple task attempts committing their output to the same path, which may leave task attempt temporary paths in the output path after `saveAsNewAPIHadoopFile` completes.
    
    ```
    -rw-r--r--    3   user group       0   2017-02-11 19:36 hdfs://.../output/_SUCCESS
    drwxr-xr-x    -   user group       0   2017-02-11 19:36 hdfs://.../output/attempt_201702111936_32487_r_000044_0
    -rw-r--r--    3   user group    8952   2017-02-11 19:36 hdfs://.../output/part-r-00000
    -rw-r--r--    3   user group    7878   2017-02-11 19:36 hdfs://.../output/part-r-00001
    ```
    Assume two task attempts commit at the same time. Both attempts may rename their task attempt paths to the task committed path concurrently. When one attempt's `rename` completes, the other attempt's `rename` ends up placing its task attempt path underneath the task committed path.
    
    In any case, `writeShard` in `saveAsNewAPIHadoopDataset` should not commit its tasks unconditionally. A similar issue in SPARK-4879, triggered by calling saveAsHadoopFile, has already been solved, and the latest master no longer has this problem. This PR only fixes branch-2.1.
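
    A minimal sketch of the safer commit behaviour (illustrative names, not the exact patch in this PR): before renaming anything into the output directory, the task should ask its Hadoop `OutputCommitter` whether its output is actually needed, so a losing speculative attempt never performs the conflicting `rename`.

    ```scala
    import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

    // Sketch only: `commitIfNeeded` is a hypothetical helper, not a method in Spark.
    def commitIfNeeded(committer: OutputCommitter, context: TaskAttemptContext): Unit = {
      if (committer.needsTaskCommit(context)) {
        // Only the winning attempt renames its attempt path to the committed path.
        committer.commitTask(context)
      }
      // A losing speculative attempt skips the rename, so no attempt_* directory
      // is left under the job's output path.
    }
    ```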

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sharkdtu/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16911
    
----
commit a7f8ebb8629706c54c286b7aca658838e718e804
Author: Cheng Lian <li...@databricks.com>
Date:   2016-12-02T06:02:45Z

    [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686
    
    This PR targets to both master and branch-2.1.
    
    ## What changes were proposed in this pull request?
    
    Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`.
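
    A hedged sketch of the shape of the workaround (helper name assumed, not the actual `ParquetFilters` code): predicates on string and binary columns simply produce no Parquet filter, and Spark still evaluates them on the rows returned by the scan.

    ```scala
    import org.apache.spark.sql.types.{BinaryType, DataType, StringType}

    // Hypothetical helper for illustration only.
    def canPushDown(dataType: DataType): Boolean = dataType match {
      // PARQUET-686: parquet-mr handles string/binary statistics incorrectly,
      // so skip push-down for these types.
      case StringType | BinaryType => false
      case _ => true
    }
    ```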
    
    ## How was this patch tested?
    
    New test case added in `ParquetFilterSuite`.
    
    Author: Cheng Lian <li...@databricks.com>
    
    Closes #16106 from liancheng/spark-17213-bad-string-ppd.
    
    (cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 65e896a6e9a5378f2d3a02c0c2a57fdb8d8f1d9d
Author: Eric Liang <ek...@databricks.com>
Date:   2016-12-02T12:59:39Z

    [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables
    
    ## What changes were proposed in this pull request?
    
    In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.
    
    This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).
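
    A rough sketch of the listing strategy described above (plain local-filesystem code with an assumed threshold, not the `InMemoryFileIndex` implementation):

    ```scala
    import java.io.File

    // Decide per directory whether to fan out in parallel; once a subtree goes
    // down the parallel path it is listed serially, mirroring the executor-side
    // behaviour described above.
    def listRecursively(dir: File, parallelThreshold: Int = 32): Seq[File] = {
      val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Nil)
      val (dirs, files) = children.partition(_.isDirectory)
      val nested =
        if (dirs.size >= parallelThreshold) dirs.par.flatMap(listSerially).seq
        else dirs.flatMap(d => listRecursively(d, parallelThreshold))
      files ++ nested
    }

    def listSerially(dir: File): Seq[File] = {
      val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Nil)
      val (dirs, files) = children.partition(_.isDirectory)
      files ++ dirs.flatMap(listSerially)
    }
    ```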
    
    cc mallman  cloud-fan
    
    ## How was this patch tested?
    
    Checked metrics in unit tests.
    
    Author: Eric Liang <ek...@databricks.com>
    
    Closes #16112 from ericl/spark-18679.
    
    (cherry picked from commit 294163ee9319e4f7f6da1259839eb3c80bba25c2)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 415730e19cea3a0e7ea5491bf801a22859bbab66
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-12-02T13:48:22Z

    [SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options
    
    ## What changes were proposed in this pull request?
    
    Currently, `JDBCRelation.insert` removes Spark options too early by mistakenly using `asConnectionProperties`. Spark options like `numPartitions` should be passed into `DataFrameWriter.jdbc` correctly. This bug has been **hidden** because `JDBCOptions.asConnectionProperties` fails to filter out mixed-case options. This PR aims to fix both issues.
    
    **JDBCRelation.insert**
    ```scala
    override def insert(data: DataFrame, overwrite: Boolean): Unit = {
      val url = jdbcOptions.url
      val table = jdbcOptions.table
    - val properties = jdbcOptions.asConnectionProperties
    + val properties = jdbcOptions.asProperties
      data.write
        .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
        .jdbc(url, table, properties)
    ```
    
    **JDBCOptions.asConnectionProperties**
    ```scala
    scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
    scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
    scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
    res0: java.util.Properties = {numpartitions=10}
    scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
    res1: java.util.Properties = {numpartitions=10}
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins with a new testcase.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #15863 from dongjoon-hyun/SPARK-18419.
    
    (cherry picked from commit 55d528f2ba0ba689dbb881616d9436dc7958e943)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit e374b2426114d841e1935719f6e21919475f6804
Author: Eric Liang <ek...@databricks.com>
Date:   2016-12-02T13:59:02Z

    [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables
    
    ## What changes were proposed in this pull request?
    
    Two bugs are addressed here
    1. INSERT OVERWRITE TABLE sometime crashed when catalog partition management was enabled. This was because when dropping partitions after an overwrite operation, the Hive client will attempt to delete the partition files. If the entire partition directory was dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
    2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names (an illustrative example follows).
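
    An illustrative reproduction of the second problem (table and column names made up): the partition column is declared as lower-case `p`, but the static spec uses `P`.

    ```scala
    spark.sql("CREATE TABLE pt (id INT, p STRING) USING parquet PARTITIONED BY (p)")
    // Mis-capitalised static partition spec: before the fix this was not resolved
    // back to `p`, and could overwrite the whole table rather than partition p='a'.
    spark.sql("INSERT OVERWRITE TABLE pt PARTITION (P = 'a') SELECT 1")
    ```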
    
    cc yhuai cloud-fan
    
    ## How was this patch tested?
    
    Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
    
    Author: Eric Liang <ek...@databricks.com>
    
    Closes #16088 from ericl/spark-18659.
    
    (cherry picked from commit 7935c8470c5c162ef7213e394fe8588e5dd42ca2)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 32c85383bfd6210e96b4bbcdedbe27a88935e4c7
Author: gatorsmile <ga...@gmail.com>
Date:   2016-12-02T14:12:19Z

    [SPARK-18674][SQL][FOLLOW-UP] improve the error message of using join
    
    ### What changes were proposed in this pull request?
    Added a test case for using joins with nested fields.
    
    ### How was this patch tested?
    N/A
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #16110 from gatorsmile/followup-18674.
    
    (cherry picked from commit 2f8776ccad532fbed17381ff97d302007918b8d8)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit c69825a98989ee975dc8b87979e29e0fff15a3f7
Author: Ryan Blue <bl...@apache.org>
Date:   2016-12-02T16:41:40Z

    [SPARK-18677] Fix parsing ['key'] in JSON path expressions.
    
    ## What changes were proposed in this pull request?
    
    This fixes the parser rule that matches named expressions, which didn't work for two reasons (a usage example follows the list):
    1. The name match is not coerced to a regular expression (missing .r)
    2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary
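
    For reference, a minimal use of the bracket syntax this affects (assumed column and field names; requires an existing `SparkSession` named `spark`):

    ```scala
    import org.apache.spark.sql.functions.get_json_object
    import spark.implicits._

    val df = Seq("""{"user name": "alice"}""").toDF("payload")
    // A quoted key containing a space, i.e. the ['key'] form this change fixes.
    df.select(get_json_object($"payload", "$['user name']")).show()
    ```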
    
    ## How was this patch tested?
    
    This adds test cases for named expressions using the bracket syntax, including one with quoted spaces.
    
    Author: Ryan Blue <bl...@apache.org>
    
    Closes #16107 from rdblue/SPARK-18677-fix-json-path.
    
    (cherry picked from commit 48778976e0566d9c93a8c900825def82c6b81fd6)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit f915f8128bd47b9d668065f848d5d437365e564a
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-12-02T20:16:57Z

    [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial."
    
    ## What changes were proposed in this pull request?
    It would be better to fix this issue by providing a ```type``` option that lets users change the ```predict``` output schema, so they could get probabilities, log-space predictions, or the original labels. To avoid a breaking API change in 2.1, revert this change first and add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.
    
    ## How was this patch tested?
    Existing unit tests.
    
    This reverts commit daa975f4bfa4f904697bf3365a4be9987032e490.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #16118 from yanboliang/spark-18291-revert.
    
    (cherry picked from commit a985dd8e99d2663a3cb4745c675fa2057aa67155)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit f53763275ae1b74925e4123dd87f567798f16ba1
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-02T20:42:47Z

    [SPARK-18670][SS] Limit the number of StreamingQueryListener.StreamProgressEvent when there is no data
    
    ## What changes were proposed in this pull request?
    
    This PR adds a SQL conf `spark.sql.streaming.noDataReportInterval` to control how long to wait before outputting the next StreamProgressEvent when there is no data.
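
    Illustrative usage (the value format is assumed to be a duration string):

    ```scala
    // Wait at least 10 seconds before emitting the next "no data" progress event.
    spark.conf.set("spark.sql.streaming.noDataReportInterval", "10s")
    ```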
    
    ## How was this patch tested?
    
    The added unit test.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16108 from zsxwing/SPARK-18670.
    
    (cherry picked from commit 56a503df5ccbb233ad6569e22002cc989e676337)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 839d4e9ca94b132732225632e8c50364e53579a0
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-12-03T00:28:01Z

    [SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1 release
    
    ## What changes were proposed in this pull request?
    Update ML programming and migration guide for 2.1 release.
    
    ## How was this patch tested?
    Doc change, no test.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #16076 from yanboliang/spark-18324.
    
    (cherry picked from commit 2dc0d7efe3380a5763cb69ef346674a46f8e3d57)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit cf3dbec68d379763ee541bf3b7a4809e1f2d0cb7
Author: zero323 <ze...@users.noreply.github.com>
Date:   2016-12-03T01:39:28Z

    [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames
    
    ## What changes were proposed in this pull request?
    
    Makes `Window.unboundedPreceding` and `Window.unboundedFollowing` backward compatible.
    
    ## How was this patch tested?
    
    Pyspark SQL unittests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: zero323 <ze...@users.noreply.github.com>
    
    Closes #16123 from zero323/SPARK-17845-follow-up.
    
    (cherry picked from commit a9cbfc4f6a8db936215fcf64697d5b65f13f666e)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 28ea432a26953866eaf95b2fd32a251ecf0c8094
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-12-03T10:12:28Z

    [SPARK-18685][TESTS] Fix URI and release resources after opening in tests at ExecutorClassLoaderSuite
    
    ## What changes were proposed in this pull request?
    
    This PR fixes two problems as below:
    
    - Close `BufferedSource` after `Source.fromInputStream(...)` to release resource and make the tests pass on Windows in `ExecutorClassLoaderSuite`
    
      ```
      [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 333 milliseconds)
      [info]   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
      [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
      [info]   at org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
      [info]   at org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
      ...
      ```
    
    - Fix the URI so that the related tests pass on Windows.
    
      ```
      [info] - child first *** FAILED *** (78 milliseconds)
      [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
      [info]   at java.net.URI$Parser.fail(URI.java:2848)
      [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
      ...
      [info] - parent first *** FAILED *** (15 milliseconds)
      [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
      [info]   at java.net.URI$Parser.fail(URI.java:2848)
      [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
      ...
      [info] - child first can fall back *** FAILED *** (0 milliseconds)
      [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
      [info]   at java.net.URI$Parser.fail(URI.java:2848)
      [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
      ...
      [info] - child first can fail *** FAILED *** (0 milliseconds)
      [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
      [info]   at java.net.URI$Parser.fail(URI.java:2848)
      [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
      ...
      [info] - resource from parent *** FAILED *** (0 milliseconds)
      [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
      [info]   at java.net.URI$Parser.fail(URI.java:2848)
      [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
      ...
      [info] - resources from parent *** FAILED *** (0 milliseconds)
      [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
      [info]   at java.net.URI$Parser.fail(URI.java:2848)
      [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
      ```
    
    ## How was this patch tested?
    
    Manually tested via AppVeyor.
    
    **Before**
    https://ci.appveyor.com/project/spark-test/spark/build/102-rpel-ExecutorClassLoaderSuite
    
    **After**
    https://ci.appveyor.com/project/spark-test/spark/build/108-rpel-ExecutorClassLoaderSuite
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #16116 from HyukjinKwon/close-after-open.
    
    (cherry picked from commit d1312fb7edffd6e10c86f69ddfff05f8915856ac)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit b098b4845c557a3139c76caa0377c3049b6fe8aa
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-12-03T19:36:26Z

    [SPARK-18582][SQL] Whitelist LogicalPlan operators allowed in correlated subqueries
    
    ## What changes were proposed in this pull request?
    
    This fix adds an explicit whitelist of operators that Spark supports in correlated subqueries.
    
    ## How was this patch tested?
    
    Run sql/test, catalyst/test and add a new test case on Generate.
    
    Author: Nattavut Sutyanyong <ns...@gmail.com>
    
    Closes #16046 from nsyca/spark18455.0.
    
    (cherry picked from commit 4a3c09601ba69f7d49d1946bb6f20f5cfe453031)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 28f698b4845e6497d060270ba790cc60dc7e1a6e
Author: Yunni <eu...@gmail.com>
Date:   2016-12-04T00:58:15Z

    [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH)
    
    ## What changes were proposed in this pull request?
    The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.
    
    ## How was this patch tested?
    Doc has been generated through Jekyll, and checked through manual inspection.
    
    Author: Yunni <Eu...@gmail.com>
    Author: Yun Ni <yu...@uber.com>
    Author: Joseph K. Bradley <jo...@databricks.com>
    Author: Yun Ni <Eu...@gmail.com>
    
    Closes #15795 from Yunni/SPARK-18081-lsh-guide.
    
    (cherry picked from commit 34777184cd8cab61e1dd25d0a4d5e738880a57b2)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 8145c82bc8e4c44e7b74695e2307bb837cde1207
Author: Kapil Singh <ka...@adobe.com>
Date:   2016-12-04T09:16:40Z

    [SPARK-18091][SQL] Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit
    
    ## What changes were proposed in this pull request?
    
    Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit.
    
    This PR changes the code generation of if expressions to place the generated code for the predicate, true-value and false-value expressions in separate methods in the codegen context, so the combined generated code never gets too long.
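
    For context, the kind of expression that used to trigger the problem (illustrative; `df` is an assumed DataFrame with an integer column `x`):

    ```scala
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, lit, when}

    // A deeply nested if/else chain: before this fix, the generated
    // SpecificUnsafeProjection for such an expression could exceed the JVM's
    // 64KB-per-method bytecode limit.
    val deepIf: Column = (1 to 500).foldLeft(lit("other"): Column) { (acc, i) =>
      when(col("x") === i, s"bucket-$i").otherwise(acc)
    }
    val bucketed = df.select(deepIf.as("bucket"))
    ```
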
    ## How was this patch tested?
    
    Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place.
    
    Author: Kapil Singh <ka...@adobe.com>
    
    Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
    
    (cherry picked from commit e463678b194e08be4a8bc9d1d45461d6c77a15ee)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 41d698ecead46979e9a77b21e6a9c8f27cff63ac
Author: Eric Liang <ek...@databricks.com>
Date:   2016-12-04T12:44:04Z

    [SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table
    
    ## What changes were proposed in this pull request?
    
    Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason.
    
    We should avoid doing this when the user specifies a schema.
    
    ## How was this patch tested?
    
    Perf stat tests.
    
    Author: Eric Liang <ek...@databricks.com>
    
    Closes #16090 from ericl/spark-18661.
    
    (cherry picked from commit d9eb4c7215f26dd05527c0b9980af35087ab9d64)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit c13c2939fb19901d86ee013aa7bb5e200d79be85
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-12-05T04:25:11Z

    [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark
    
    ## What changes were proposed in this pull request?
    
    If SparkR is running as a package and has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. With this bug, the automatic Spark install only works in the first session.
    
    This seems to be a regression on the earlier behavior.
    
    The fix is to always try to install, or check for, the cached Spark when running in an interactive session.
    As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc)
    
    ## How was this patch tested?
    
    Manually
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16077 from felixcheung/rsessioninteractive.
    
    (cherry picked from commit b019b3a8ac49336e657f5e093fa2fba77f8d12d2)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 88e07efe86512142eeada6a6f1f7fe858204c59b
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-12-05T08:32:58Z

    [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol
    
    ## What changes were proposed in this pull request?
    add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`
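
    A minimal usage sketch (assumes a training DataFrame `training` with `label` and `features` columns):

    ```scala
    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    val ovr = new OneVsRest().setClassifier(new LogisticRegression())
    val model = ovr.fit(training)
    // The new setters let the fitted model read and write differently named columns.
    model.setFeaturesCol("scaledFeatures").setPredictionCol("ovrPrediction")
    ```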
    
    ## How was this patch tested?
    added tests
    
    Author: Zheng RuiFeng <ru...@foxmail.com>
    
    Closes #16059 from zhengruifeng/ovrm_setCol.
    
    (cherry picked from commit bdfe7f67468ecfd9927a1fec60d6605dd05ebe3f)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit 1821cbead1875fbe1c16d7c50563aa0839e1f70f
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-12-05T08:39:44Z

    [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide.
    
    ## What changes were proposed in this pull request?
    Add R examples to ML programming guide for the following algorithms as POC:
    * spark.glm
    * spark.survreg
    * spark.naiveBayes
    * spark.kmeans
    
    The four algorithms were added to SparkR since 2.0.0, more docs for algorithms added during 2.1 release cycle will be addressed in a separate follow-up PR.
    
    ## How was this patch tested?
    This is the screenshots of generated ML programming guide for ```GeneralizedLinearRegression```:
    ![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png)
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #16136 from yanboliang/spark-18279.
    
    (cherry picked from commit eb8dd68132998aa00902dfeb935db1358781e1c1)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit afd2321b689fb29d18fee1840f5a0058cefd6d60
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-12-05T18:36:13Z

    [MINOR][DOC] Use SparkR `TRUE` value and add default values for `StructField` in SQL Guide.
    
    ## What changes were proposed in this pull request?
    
    In `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds default values of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In Java API, `nullable` is not optional.
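
    For example, in Scala the `nullable` argument can simply be omitted:

    ```scala
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType),                  // nullable defaults to true
      StructField("id", IntegerType, nullable = false)  // must be set explicitly
    ))
    ```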
    
    **BEFORE**
    * SPARK 2.1.0 RC1
    http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types
    
    **AFTER**
    
    * R
    <img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png">
    
    * Scala
    <img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png">
    
    * Python
    <img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png">
    
    ## How was this patch tested?
    
    Manual.
    
    ```
    cd docs
    SKIP_API=1 jekyll build
    open _site/index.html
    ```
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.
    
    (cherry picked from commit 410b7898661f77e748564aaee6a5ab7747ce34ad)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 30c074308f723f95823b43fbc54e2e4742d52840
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-12-05T18:49:22Z

    Revert "[SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise"
    
    This reverts commit fce1be6cc81b1fe3991a4df91128f4fcd14ff615 from branch-2.1.

commit e23c8cfc8e59508743fc69c82028831f95bc25d7
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-12-05T19:37:13Z

    [SPARK-18711][SQL] should disable subexpression elimination for LambdaVariable
    
    ## What changes were proposed in this pull request?
    
    This is kind of a long-standing bug, it's hidden until https://github.com/apache/spark/pull/15780 , which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination.
    
    However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents a loop variable, which can't be evaluated ahead of the loop.
    
    This PR skips expressions containing `LambdaVariable` when doing subexpression elimination.
    
    ## How was this patch tested?
    
    updated test in `DatasetAggregatorSuite`
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #16143 from cloud-fan/aggregator.
    
    (cherry picked from commit 01a7d33d0851d82fd1bb477a58d9925fe8d727d8)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 39759ff00ba4313a82834387eea53b1af7b7daaf
Author: Nicholas Chammas <ni...@gmail.com>
Date:   2016-12-05T20:57:41Z

    [DOCS][MINOR] Update location of Spark YARN shuffle jar
    
    Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.
    
    This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one.
    
    Author: Nicholas Chammas <ni...@gmail.com>
    
    Closes #16130 from nchammas/yarn-doc-fix.
    
    (cherry picked from commit 5a92dc76ab431d73275a2bdfbc2c0a8ceb0d75d1)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit c6a4e3d96997bf166360524a95510b3490b68b49
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-05T22:59:42Z

    [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)
    
    ## What changes were proposed in this pull request?
    
    Backport #16125 to branch 2.1.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16153 from zsxwing/SPARK-18694-2.1.

commit fecd23d2cebe691e4dee43ef26ef0090ead2c0d0
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2016-12-06T01:50:43Z

    [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs
    
    ## What changes were proposed in this pull request?
    
    As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
    
    The following test code can reproduce it. Note: the Jira reports that this code returns wrong results; however, when I tested it on the master branch, it caused an exception and so could not return any result.
    
        >>> from pyspark.sql.functions import *
        >>> from pyspark.sql.types import *
        >>>
        >>> df = spark.range(10)
        >>>
        >>> def return_range(value):
        ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
        ...
        >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
        ...                                                     StructField("string_val", StringType())])))
        >>>
        >>> df.select("id", explode(range_udf(df.id))).show()
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
            print(self._jdf.showString(n, 20))
          File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
          File "/spark/python/pyspark/sql/utils.py", line 63, in deco
            return f(*a, **kw)
          File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
            at scala.Predef$.assert(Predef.scala:156)
            at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
            at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
    
    The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`.
    
    Because the output of `GenerateExec` is fixed after the analysis phase, in the above case it is the combination of `id`, i.e., the output of `Range`, and `col`. But in the planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec`, which has additional output attributes.
    
    This causes no problem without whole-stage codegen, because during evaluation the additional attributes are projected out of the final output of `GenerateExec`.
    
    However, as `GenerateExec` now supports whole-stage codegen, the framework feeds all of the child plan's outputs into `GenerateExec`. Then, when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes differs from the number of output variables in whole-stage codegen.
    
    To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct.
    
    ## How was this patch tested?
    
    Added test cases to PySpark.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <vi...@gmail.com>
    
    Closes #16120 from viirya/fix-py-udf-with-generator.
    
    (cherry picked from commit 3ba69b64852ccbf6d4ec05a021bc20616a09f574)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 6c4c3368473f7f2c8fe810b895b9148e72370ba6
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-06T02:15:55Z

    [SPARK-18729][SS] Move DataFrame.collect out of synchronized block in MemorySink
    
    ## What changes were proposed in this pull request?
    
    Move `DataFrame.collect` out of the synchronized block so that we can query the content of MemorySink while `DataFrame.collect` is running.
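
    A rough sketch of the pattern (simplified stand-in types, not the actual `MemorySink` code): do the expensive collect outside the lock and only append the result under synchronization, so concurrent reads of the sink's content are not blocked.

    ```scala
    import scala.collection.mutable.ArrayBuffer

    class SimpleMemoryStore[T] {
      private val batches = new ArrayBuffer[Seq[T]]()

      def addBatch(collect: () => Seq[T]): Unit = {
        val rows = collect()              // potentially slow; runs outside the lock
        synchronized { batches += rows }  // only the append is synchronized
      }

      def allRows: Seq[T] = synchronized { batches.toList.flatten }
    }
    ```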
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16162 from zsxwing/SPARK-18729.
    
    (cherry picked from commit 1b2785c3d0a40da2fca923af78066060dbfbcf0a)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 1946854abd4e4dc4bf0bba30ca521170b966d467
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-12-06T02:17:38Z

    [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name
    
    Here are the major changes in this PR (a short usage sketch follows the list).
    - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
    - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
    - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to provide an identifier that persists across restarts, but since id is precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic and is null by default.
    - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
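
    A short usage sketch of the new surface (illustrative only; assumes an existing streaming DataFrame `streamingDf` and a checkpoint path):

    ```scala
    val query = streamingDf.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ckpt")  // assumed path
      .start()

    query.id     // stable across restarts from the same checkpoint location
    query.runId  // new for every start or restart
    query.name   // null unless .queryName(...) was set explicitly
    ```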
    
    Implementation details
    - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
    - Added the `id` as the new `StreamMetadata`.
    - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
    - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
    
    TODO
    - [x] Test handling of name=null in json generation of StreamingQueryProgress
    - [x] Test handling of name=null in json generation of StreamingQueryListener events
    - [x] Test python API of runId
    
    Updated unit tests and new unit tests
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #16113 from tdas/SPARK-18657.
    
    (cherry picked from commit bb57bfe97d9fb077885065b8e804b85d4c493faf)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit d4588165ed0c68c2712304a6814eda4fbb470ea2
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-06T02:51:07Z

    [SPARK-18722][SS] Move no data rate limit from StreamExecution to ProgressReporter
    
    ## What changes were proposed in this pull request?
    
    Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16155 from zsxwing/SPARK-18722.
    
    (cherry picked from commit 4af142f55771affa5fc7f2abbbf5e47766194e6e)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 8ca6a82c1d04b0986d3063e3ee321698fc278992
Author: Michael Allman <mi...@videoamp.com>
Date:   2016-12-06T03:33:35Z

    [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog`
    
    (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
    
    ## What changes were proposed in this pull request?
    
    Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
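
    The user-facing command this speeds up (table name taken from the timings below):

    ```scala
    // Now answered via the Hive client's getPartitionNames instead of fetching
    // full partition metadata for every partition.
    spark.sql("SHOW PARTITIONS table1").show(false)
    ```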
    
    To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
    
    | Spark build | Command | Trial times (seconds) |
    | ----------- | ------- | --------------------- |
    | bdc8153 | `SHOW PARTITIONS table1` | 7.901, 3.983, 4.018, 4.331, 4.261 |
    | bdc8153 | `SHOW PARTITIONS table2` | Timed out after 10 minutes with a `SocketTimeoutException` |
    | this PR | `SHOW PARTITIONS table1` | 3.801, 0.449, 0.395, 0.348, 0.336 |
    | this PR | `SHOW PARTITIONS table2` | 5.184, 1.63, 1.474, 1.519, 1.41 |
    
    Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
    
    This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
    
    ## How was this patch tested?
    
    I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
    
    Author: Michael Allman <mi...@videoamp.com>
    
    Closes #15998 from mallman/spark-18572-list_partition_names.
    
    (cherry picked from commit 772ddbeaa6fe5abf189d01246f57d295f9346fa3)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 655297b35651fc68632ebe92ea97ed560548c68e
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-06T04:35:24Z

    [SPARK-18721][SS] Fix ForeachSink with watermark + append
    
    ## What changes were proposed in this pull request?
    
    Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieve metrics and the watermark.
    
    This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan.
    
    ## How was this patch tested?
    
    `test("foreach with watermark: append")`.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16160 from zsxwing/SPARK-18721.
    
    (cherry picked from commit 7863c623791d088684107f833fdecb4b5fdab4ec)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit e362d998d045f9c6b22f34cba0ad1e77a505883b
Author: Herman van Hovell <hv...@databricks.com>
Date:   2016-12-06T13:51:39Z

    [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
    
    ## What changes were proposed in this pull request?
    I jumped the gun on merging https://github.com/apache/spark/pull/16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Herman van Hovell <hv...@databricks.com>
    
    Closes #16170 from hvanhovell/SPARK-18634.
    
    (cherry picked from commit 381ef4ea76b0920e05c81adb44b1fef88bee5d25)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16911: [SPARK-19576] [Core] Task attempt paths exist in ...

Posted by sharkdtu <gi...@git.apache.org>.
Github user sharkdtu closed the pull request at:

    https://github.com/apache/spark/pull/16911


