Posted to reviews@spark.apache.org by zhangchj1990 <gi...@git.apache.org> on 2018/06/30 09:40:07 UTC

[GitHub] spark pull request #21681: Pin tag 210

GitHub user zhangchj1990 opened a pull request:

    https://github.com/apache/spark/pull/21681

    Pin tag 210

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhangchj1990/spark pin-tag-210

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21681.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21681
    
----
commit 6b6eb4e520d07a27aa68d3450f3c7613b233d928
Author: Zheng RuiFeng <ru...@...>
Date:   2016-11-16T10:46:27Z

    [SPARK-18434][ML] Add missing ParamValidations for ML algos
    
    ## What changes were proposed in this pull request?
    Add missing ParamValidations for ML algos
    ## How was this patch tested?
    existing tests
    
    Author: Zheng RuiFeng <ru...@foxmail.com>
    
    Closes #15881 from zhengruifeng/arg_checking.
    
    (cherry picked from commit c68f1a38af67957ee28889667193da8f64bb4342)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>
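
    For illustration, a minimal sketch of the kind of check this adds, using the public `ParamValidators` helpers (the parameter below is hypothetical, not taken from this commit):

    ```scala
    import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators}

    trait HasMyThreshold extends Params {
      // Hypothetical parameter: ParamValidators supplies the range check, so an
      // invalid value is rejected at set() time instead of failing later in fit().
      final val myThreshold: DoubleParam = new DoubleParam(
        this, "myThreshold", "threshold in [0, 1]", ParamValidators.inRange(0.0, 1.0))

      final def getMyThreshold: Double = $(myThreshold)
    }
    ```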

commit 416bc3dd3db7f7ae2cc7b3ffe395decd0c5b73f9
Author: Zheng RuiFeng <ru...@...>
Date:   2016-11-16T10:53:23Z

    [SPARK-18446][ML][DOCS] Add links to API docs for ML algos
    
    ## What changes were proposed in this pull request?
    Add links to API docs for ML algos
    ## How was this patch tested?
    Manual checking for the API links
    
    Author: Zheng RuiFeng <ru...@foxmail.com>
    
    Closes #15890 from zhengruifeng/algo_link.
    
    (cherry picked from commit a75e3fe923372c56bc1b2f4baeaaf5868ad28341)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit b0ae8712358fc8c07aa5efe4d0bd337e7e452078
Author: Xianyang Liu <xy...@...>
Date:   2016-11-16T11:59:00Z

    [SPARK-18420][BUILD] Fix the errors caused by lint check in Java
    
    Small fix: fix the errors reported by the Java lint check.
    
    - Clear unused objects and `UnusedImports`.
    - Add comments around the `finalize` method of `NioBufferedFileInputStream` to turn off checkstyle.
    - Split lines longer than 100 characters into two lines.
    
    Travis CI.
    ```
    $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
    $ dev/lint-java
    ```
    Before:
    ```
    Checkstyle checks failed at following occurrences:
    [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
    [ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
    [ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
    [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
    [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
    [ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
    [ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
    [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
    [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
    ```
    
    After:
    ```
    $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
    $ dev/lint-java
    Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
    Checkstyle checks passed.
    ```
    
    Author: Xianyang Liu <xy...@icloud.com>
    
    Closes #15865 from ConeyLiu/master.
    
    (cherry picked from commit 7569cf6cb85bda7d0e76d3e75e286d4796e77e08)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit c0dbe08d604dea543eb17ccb802a8a20d6c21a69
Author: gatorsmile <ga...@...>
Date:   2016-11-16T16:25:15Z

    [SPARK-18415][SQL] Weird Plan Output when CTE used in RunnableCommand
    
    ### What changes were proposed in this pull request?
    Currently, when CTE is used in RunnableCommand, the Analyzer does not replace the logical node `With`. The child plan of RunnableCommand is not resolved. Thus, the output of the `With` plan node looks very confusing.
    For example,
    ```
    sql(
      """
        |CREATE VIEW cte_view AS
        |WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
        |SELECT n FROM w
      """.stripMargin).explain()
    ```
    The output is like
    ```
    ExecutedCommand
       +- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
    SELECT n FROM w, false, false, PersistedView
             +- 'With [(w,SubqueryAlias w
    +- Project [1 AS n#16]
       +- OneRowRelation$
    ), (cte1,'SubqueryAlias cte1
    +- 'Project [unresolvedalias(2, None)]
       +- OneRowRelation$
    ), (cte2,'SubqueryAlias cte2
    +- 'Project [unresolvedalias(3, None)]
       +- OneRowRelation$
    )]
                +- 'Project ['n]
                   +- 'UnresolvedRelation `w`
    ```
    After the fix, the output is as shown below.
    ```
    ExecutedCommand
       +- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
    SELECT n FROM w, false, false, PersistedView
             +- CTE [w, cte1, cte2]
                :  :- SubqueryAlias w
                :  :  +- Project [1 AS n#16]
                :  :     +- OneRowRelation$
                :  :- 'SubqueryAlias cte1
                :  :  +- 'Project [unresolvedalias(2, None)]
                :  :     +- OneRowRelation$
                :  +- 'SubqueryAlias cte2
                :     +- 'Project [unresolvedalias(3, None)]
                :        +- OneRowRelation$
                +- 'Project ['n]
                   +- 'UnresolvedRelation `w`
    ```
    
    BTW, this PR also fixes the output of the view type.
    
    ### How was this patch tested?
    Manual
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15854 from gatorsmile/cteName.
    
    (cherry picked from commit 608ecc512b759514c75a1b475582f237ed569f10)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit b86e962c90c4322cd98b5bf3b19e251da2d32442
Author: Tathagata Das <ta...@...>
Date:   2016-11-16T18:00:59Z

    [SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and add triggerDetails to json in StreamingQueryStatus
    
    ## What changes were proposed in this pull request?
    
    SPARK-18459: triggerId seems like a number that should increase with each trigger, whether or not there is data in it. In practice, however, triggerId increases only when there is a batch of data in a trigger, so it is better to rename it to batchId.
    
    SPARK-18460: triggerDetails was missing from the JSON representation. Fixed it.
    
    ## How was this patch tested?
    Updated existing unit tests.
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15895 from tdas/SPARK-18459.
    
    (cherry picked from commit 0048ce7ce64b02cbb6a1c4a2963a0b1b9541047e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 3d4756d56b852dcf4e1bebe621d4a30570873c3c
Author: Tathagata Das <ta...@...>
Date:   2016-11-16T19:03:10Z

    [SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries
    
    ## What changes were proposed in this pull request?
    <img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
    <img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15897 from tdas/SPARK-18461.
    
    (cherry picked from commit bb6cdfd9a6a6b6c91aada7c3174436146045ed1e)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 523abfe19caa11747133877b0c8319c68ac66e56
Author: Artur Sukhenko <ar...@...>
Date:   2016-11-16T23:08:01Z

    [YARN][DOC] Increasing NodeManager's heap size with External Shuffle Service
    
    ## What changes were proposed in this pull request?
    
    Suggest that users increase the `NodeManager`'s heap size if the `External Shuffle Service` is enabled, as the `NM` can spend a lot of time doing GC, which makes shuffle operations a bottleneck because `Shuffle Read blocked time` goes up.
    Because of GC, the `NodeManager` can also use an enormous amount of CPU, and cluster performance will suffer.
    I have seen a NodeManager using 5-13 GB of RAM and up to 2700% CPU with the `spark_shuffle` service on.
    
    ## How was this patch tested?
    
    #### Added step 5:
    ![shuffle_service](https://cloud.githubusercontent.com/assets/15244468/20355499/2fec0fde-ac2a-11e6-8f8b-1c80daf71be1.png)
    
    Author: Artur Sukhenko <ar...@gmail.com>
    
    Closes #15906 from Devian-ua/nmHeapSize.
    
    (cherry picked from commit 55589987be89ff78dadf44498352fbbd811a206e)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 9515793820c7954d82116238a67e632ea3e783b5
Author: Takuya UESHIN <ue...@...>
Date:   2016-11-17T03:21:08Z

    [SPARK-18442][SQL] Fix nullability of WrapOption.
    
    ## What changes were proposed in this pull request?
    
    The nullability of `WrapOption` should be `false`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ue...@happy-camper.st>
    
    Closes #15887 from ueshin/issues/SPARK-18442.
    
    (cherry picked from commit 170eeb345f951de89a39fe565697b3e913011768)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 6a3cbbc037fe631e1b89c46000373dc2ba86a5eb
Author: Holden Karau <ho...@...>
Date:   2016-11-16T22:22:15Z

    [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed
    
    ## What changes were proposed in this pull request?
    
    This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129).
    
    Done:
    - pip installable on conda [manual tested]
    - setup.py installed on a non-pip managed system (RHEL) with YARN [manual tested]
    - Automated testing of this (virtualenv)
    - packaging and signing with release-build*
    
    Possible follow up work:
    - release-build update to publish to PyPI (SPARK-18128)
    - figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with, like ApachePySpark?)
    - Windows support and/or testing (SPARK-18136)
    - investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
    - consider how we want to number our dev/snapshot versions
    
    Explicitly out of scope:
    - Using pip installed PySpark to start a standalone cluster
    - Using pip installed PySpark for non-Python Spark programs
    
    *I've done some work to test release-build locally but as a non-committer I've just done local testing.
    ## How was this patch tested?
    
    Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration.
    
    release-build changes tested locally as a non-committer (no testing of upload artifacts to Apache staging websites)
    
    Author: Holden Karau <ho...@us.ibm.com>
    Author: Juliet Hougland <ju...@cloudera.com>
    Author: Juliet Hougland <no...@myemail.com>
    
    Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.

commit 014fceee04c69d7944c74b3794e821e4d1003dd0
Author: Wenchen Fan <we...@...>
Date:   2016-11-17T08:00:38Z

    [SPARK-18464][SQL] support old table which doesn't store schema in metastore
    
    ## What changes were proposed in this pull request?
    
    Before Spark 2.1, users can create an external data source table without schema, and we will infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table was created, so that we don't need to infer it again and again at runtime.
    
    This is a good improvement, but we should still respect and support old tables which don't store the table schema in the metastore.
    
    ## How was this patch tested?
    
    regression test.
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #15900 from cloud-fan/hive-catalog.
    
    (cherry picked from commit 07b3f045cd6f79b92bc86b3b1b51d3d5e6bd37ce)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 2ee4fc8891be53b2fae43faa5cd09ade32173bba
Author: Weiqing Yang <ya...@...>
Date:   2016-11-17T11:13:22Z

    [YARN][DOC] Remove non-Yarn specific configurations from running-on-yarn.md
    
    ## What changes were proposed in this pull request?
    
    Remove `spark.driver.memory`, `spark.executor.memory`, `spark.driver.cores`, and `spark.executor.cores` from `running-on-yarn.md` as they are not Yarn-specific, and they are also defined in `configuration.md`.
    
    ## How was this patch tested?
    Build passed & manual check.
    
    Author: Weiqing Yang <ya...@gmail.com>
    
    Closes #15869 from weiqingy/yarnDoc.
    
    (cherry picked from commit a3cac7bd86a6fe8e9b42da1bf580aaeb59378304)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 4fcecb4cf081fba0345f1939420ca1d9f6de720c
Author: anabranch <wa...@...>
Date:   2016-11-17T11:34:55Z

    [SPARK-18365][DOCS] Improve Sample Method Documentation
    
    ## What changes were proposed in this pull request?
    
    I found the documentation for the sample method confusing; this adds more clarification across all languages.
    
    - [x] Scala
    - [x] Python
    - [x] R
    - [x] RDD Scala
    - [ ] RDD Python with SEED
    - [X] RDD Java
    - [x] RDD Java with SEED
    - [x] RDD Python
    
    ## How was this patch tested?
    
    NA
    
    Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
    
    Author: anabranch <wa...@gmail.com>
    Author: Bill Chambers <bi...@databricks.com>
    
    Closes #15815 from anabranch/SPARK-18365.
    
    (cherry picked from commit 49b6f456aca350e9e2c170782aa5cc75e7822680)
    Signed-off-by: Sean Owen <so...@cloudera.com>
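
    For reference, a minimal sketch of the API whose documentation is being clarified; the key point is that `fraction` is an expected proportion, not a guarantee of an exact row count (the DataFrame below is made up for the example):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sample-demo").master("local[*]").getOrCreate()
    val df = spark.range(0, 1000).toDF("id")

    // Roughly 10% of the rows, without replacement; the seed makes the sample reproducible.
    // The result is approximately 100 rows, not exactly 100.
    val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
    println(sampled.count())
    ```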

commit 42777b1b3c10d3945494e27f1dedd43f2f836361
Author: VinceShieh <vi...@...>
Date:   2016-11-17T13:37:42Z

    [SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings
    
    ## What changes were proposed in this pull request?
    
    Several places in MLlib use custom regexes or other approaches to parse Spark versions.
    Those should be fixed to use the VersionUtils. This PR replaces custom regexes with
    VersionUtils to get Spark version numbers.
    ## How was this patch tested?
    
    Existing tests.
    
    Signed-off-by: VinceShieh vincent.xieintel.com
    
    Author: VinceShieh <vi...@intel.com>
    
    Closes #15055 from VinceShieh/SPARK-17462.
    
    (cherry picked from commit de77c67750dc868d75d6af173c3820b75a9fe4b7)
    Signed-off-by: Sean Owen <so...@cloudera.com>
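
    For context, a sketch of the utility this commit standardizes on. Since `VersionUtils` is an internal Spark helper, the sketch assumes it is called from code inside the `org.apache.spark` package (as MLlib is); the enclosing object and package are hypothetical:

    ```scala
    package org.apache.spark.ml.sketch   // hypothetical internal package

    import org.apache.spark.util.VersionUtils

    object VersionCheckSketch {
      // Parse major/minor numbers instead of hand-rolling a regex at each call site.
      val (major, minor) = VersionUtils.majorMinorVersion("2.1.0")   // (2, 1)
      val majorOnly = VersionUtils.majorVersion("2.1.0")             // 2
    }
    ```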

commit 536a2159393c82d414cc46797c8bfd958f453d33
Author: Zheng RuiFeng <ru...@...>
Date:   2016-11-17T13:40:16Z

    [SPARK-18480][DOCS] Fix wrong links for ML guide docs
    
    ## What changes were proposed in this pull request?
    1. There are two `[Graph.partitionBy]` links in `graphx-programming-guide.md`; the first one had no effect.
    2. `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
    3. `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because class `PythonMLLibAPI` is private.
    4. Other link updates.
    ## How was this patch tested?
     manual tests
    
    Author: Zheng RuiFeng <ru...@foxmail.com>
    
    Closes #15912 from zhengruifeng/md_fix.
    
    (cherry picked from commit cdaf4ce9fe58c4606be8aa2a5c3756d30545c850)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 978798880c0b1e6a15e8a342847e1ff4d83a5ac0
Author: root <ro...@...>
Date:   2016-11-17T17:04:19Z

    [SPARK-18490][SQL] duplication nodename extrainfo for ShuffleExchange
    
    ## What changes were proposed in this pull request?
    
    In ShuffleExchange, the nodename's extraInfo is the same whether exchangeCoordinator.isEstimated is true or false.
    
    This PR merges the two situations.
    
    Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
    
    Closes #15920 from windpiger/DupNodeNameShuffleExchange.
    
    (cherry picked from commit b0aa1aa1af6c513a6a881eaea96abdd2b480ef98)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit fc466be4fd8def06880f59d50e5567c22cc53d6a
Author: Wenchen Fan <we...@...>
Date:   2016-11-18T01:31:12Z

    [SPARK-18360][SQL] default table path of tables in default database should depend on the location of default database
    
    ## What changes were proposed in this pull request?
    
    The current semantic of the warehouse config:
    
    1. it's a static config, which means you can't change it once your spark application is launched.
    2. Once a database is created, its location won't change even the warehouse path config is changed.
    3. The default database is a special case: although its location is fixed, the locations of tables created in it are not. If a Spark app starts with warehouse path B (while the location of the default database is A) and users create a table `tbl` in the default database, its location will be `B/tbl` instead of `A/tbl`. If users change the warehouse path config to C and create another table `tbl2`, its location will still be `B/tbl2` instead of `C/tbl2`.
    
    Rule 3 doesn't make sense and I think we made it by mistake, not intentionally. Data source tables don't follow rule 3 and treat the default database like normal ones.
    
    This PR fixes Hive serde tables to make them consistent with data source tables.
    
    ## How was this patch tested?
    
    HiveSparkSubmitSuite
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #15812 from cloud-fan/default-db.
    
    (cherry picked from commit ce13c2672318242748f7520ed4ce6bcfad4fb428)
    Signed-off-by: Yin Huai <yh...@databricks.com>
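
    A hedged sketch of the behaviour described above (paths and the table name are made up):

    ```scala
    import org.apache.spark.sql.SparkSession

    // spark.sql.warehouse.dir is a static config: it can only be set before the
    // session starts, and the default database location is derived from it.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "/data/warehouse-B")   // hypothetical path
      .enableHiveSupport()
      .getOrCreate()

    // With this fix, a Hive serde table created in the default database takes its
    // path from the default database's location (as data source tables already do),
    // rather than from whatever warehouse path the current session happens to use.
    spark.sql("CREATE TABLE tbl (i INT) STORED AS PARQUET")
    spark.sql("DESC FORMATTED tbl").show(100, false)   // inspect the reported Location
    ```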

commit e8b1955e20a966da9a95f75320680cbab1096540
Author: Josh Rosen <jo...@...>
Date:   2016-11-18T02:45:15Z

    [SPARK-18462] Fix ClassCastException in SparkListenerDriverAccumUpdates event
    
    ## What changes were proposed in this pull request?
    
    This patch fixes a `ClassCastException: java.lang.Integer cannot be cast to java.lang.Long` error which could occur in the HistoryServer while trying to process a deserialized `SparkListenerDriverAccumUpdates` event.
    
    The problem stems from how `jackson-module-scala` handles primitive type parameters (see https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges for more details). This was causing a problem where our code expected a field to be deserialized as a `(Long, Long)` tuple but we got an `(Int, Int)` tuple instead.
    
    This patch hacks around this issue by registering a custom `Converter` with Jackson in order to deserialize the tuples as `(Object, Object)` and perform the appropriate casting.
    
    ## How was this patch tested?
    
    New regression tests in `SQLListenerSuite`.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15922 from JoshRosen/SPARK-18462.
    
    (cherry picked from commit d9dd979d170f44383a9a87f892f2486ddb3cca7d)
    Signed-off-by: Reynold Xin <rx...@databricks.com>
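
    A minimal illustration, independent of Jackson, of why an `(Int, Int)` tuple showing up where `(Long, Long)` is expected only fails at use time with exactly this exception (this is not code from the patch):

    ```scala
    // Generic type arguments are erased at runtime, so the bad cast "succeeds"...
    val deserialized: (Any, Any) = (1, 2)                 // what the deserializer effectively produced
    val pair = deserialized.asInstanceOf[(Long, Long)]    // no error here

    // ...and only blows up when an element is unboxed at the call site:
    // java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    val first: Long = pair._1
    ```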

commit 5912c19e76719a1c388a7a151af03ebf71b8f0db
Author: Tyson Condie <tc...@...>
Date:   2016-11-18T19:11:24Z

    [SPARK-18187][SQL] CompactibleFileStreamLog should not use "compactInterval" directly with user setting.
    
    ## What changes were proposed in this pull request?
    CompactibleFileStreamLog relies on "compactInterval" to detect a compaction batch. If "compactInterval" is reset by the user, CompactibleFileStreamLog will return a wrong answer, resulting in data loss. This PR provides a way to check the validity of 'compactInterval' and to calculate an appropriate value.
    
    ## How was this patch tested?
    When restarting a stream, we change 'spark.sql.streaming.fileSource.log.compactInterval' to a value different from the former one.
    
    The primary solution to this issue was given by uncleGen.
    The added extensions include an additional metadata field in the OffsetSeq and CompactibleFileStreamLog APIs. cc zsxwing
    
    Author: Tyson Condie <tc...@gmail.com>
    Author: genmao.ygm <ge...@genmaoygmdeMacBook-Air.local>
    
    Closes #15852 from tcondie/spark-18187.
    
    (cherry picked from commit 51baca2219fda8692b88fc8552548544aec73a1e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>
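
    A sketch of the scenario being guarded against; the config key is the one named above, the values are made up:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // First run of the stream: compaction every 5 batches (hypothetical value).
    spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "5")
    // ... run a file-source streaming query, then stop the application.

    // Restart with a different interval: before this fix, CompactibleFileStreamLog could
    // mis-detect which batches are compaction batches and return a wrong answer; the fix
    // validates the setting and derives an appropriate interval from the existing log.
    spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "10")
    ```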

commit ec622eb7e1ffd0775c9ca4683d1032ca8d41654a
Author: Andrew Ray <ra...@...>
Date:   2016-11-18T19:19:49Z

    [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count
    
    ## What changes were proposed in this pull request?
    
    When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
    
    ## How was this patch tested?
    
    Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
    
    Reduction in data read can be verified in the UI when built with a recent version of Hadoop say:
    ```
    build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
    ```
    However, the default Hadoop 2.2 that is used for unit tests does not report actual bytes read and instead reports just full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
    
    I tested with the following setup using above build options
    ```
    case class OrcData(intField: Long, stringField: String)
    spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
    
    sql(
          s"""CREATE EXTERNAL TABLE orc_test(
             |  intField LONG,
             |  stringField STRING
             |)
             |STORED AS ORC
             |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
           """.stripMargin)
    ```
    
    ## Results
    
    query | Spark 2.0.2 | this PR
    ---|---|---
    `sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
    `sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
    `sql("select * from orc_test").collect`|4.4 MB|4.4 MB
    
    Author: Andrew Ray <ra...@gmail.com>
    
    Closes #15898 from aray/sql-orc-no-col.
    
    (cherry picked from commit 795e9fc9213cb9941ae131aadcafddb94bde5f74)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 6717981e4d76f0794a75c60586de4677c49659ad
Author: hyukjinkwon <gu...@...>
Date:   2016-11-18T21:45:18Z

    [SPARK-18422][CORE] Fix wholeTextFiles test to pass on Windows in JavaAPISuite
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the test `wholeTextFiles` in `JavaAPISuite.java`. It failed due to the different path format on Windows.
    
    For example, the path in `container` was
    
    ```
    C:\projects\spark\target\tmp\1478967560189-0/part-00000
    ```
    
    whereas `new URI(res._1()).getPath()` was as below:
    
    ```
    /C:/projects/spark/target/tmp/1478967560189-0/part-00000
    ```
    
    ## How was this patch tested?
    
    Tests in `JavaAPISuite.java`.
    
    Tested via AppVeyor.
    
    **Before**
    Build: https://ci.appveyor.com/project/spark-test/spark/build/63-JavaAPISuite-1
    Diff: https://github.com/apache/spark/compare/master...spark-test:JavaAPISuite-1
    
    ```
    [info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
    [error] Test org.apache.spark.JavaAPISuite.wholeTextFiles failed: java.lang.AssertionError: expected:<spark is easy to use.
    [error] > but was:<null>, took 0.578 sec
    [error]     at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
    ...
    ```
    
    **After**
    Build started: [CORE] `org.apache.spark.JavaAPISuite` [![PR-15866](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=198DDA52-F201-4D2B-BE2F-244E0C1725B2&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/198DDA52-F201-4D2B-BE2F-244E0C1725B2)
    Diff: https://github.com/apache/spark/compare/master...spark-test:198DDA52-F201-4D2B-BE2F-244E0C1725B2
    
    ```
    [info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
    ...
    ```
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #15866 from HyukjinKwon/SPARK-18422.
    
    (cherry picked from commit 40d59ff5eaac6df237fe3d50186695c3806b268c)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 136f687c6282c328c2ae121fc3d45207550d184b
Author: Shixiong Zhu <sh...@...>
Date:   2016-11-19T00:13:02Z

    [SPARK-18477][SS] Enable interrupts for HDFS in HDFSMetadataLog
    
    ## What changes were proposed in this pull request?
    
    HDFS `write` may just hang until timeout if some network error happens. It's better to enable interrupts to allow stopping the query fast on HDFS.
    
    This PR just changes the logic to only disable interrupts for local file system, as HADOOP-10622 only happens for local file system.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15911 from zsxwing/interrupt-on-dfs.
    
    (cherry picked from commit e5f5c29e021d504284fe5ad1a77dcd5a992ac10a)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 4b1df0e89badd9bb175673aefc96d3f9358e976d
Author: Reynold Xin <rx...@...>
Date:   2016-11-19T00:34:11Z

    [SPARK-18505][SQL] Simplify AnalyzeColumnCommand
    
    ## What changes were proposed in this pull request?
    I'm spending more time at the design & code level for the cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I would like to address.
    
    This is a small pull request to clean up AnalyzeColumnCommand:
    
    1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
    2. Removed the nested updateStats function, by just inlining the function.
    3. Renamed a few functions to better reflect what they do.
    4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use an apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
    5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
    6. Added more documentation explaining some of the non-obvious return types and code blocks.
    
    In follow-up pull requests, I'd like to address the following:
    
    1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
    2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
    3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
    4. Clearly document the data representation stored in the catalog for statistics.
    
    ## How was this patch tested?
    Affected test cases have been updated.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #15933 from rxin/SPARK-18505.
    
    (cherry picked from commit 6f7ff75091154fed7649ea6d79e887aad9fbde6a)
    Signed-off-by: Reynold Xin <rx...@databricks.com>
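
    For context, a sketch of the statement that `AnalyzeColumnCommand` implements (table and column names are made up):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT) USING parquet")

    // AnalyzeColumnCommand runs behind this statement: it computes per-column
    // statistics for the cost-based optimizer and stores them in the catalog.
    spark.sql("ANALYZE TABLE people COMPUTE STATISTICS FOR COLUMNS name, age")
    ```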

commit b4bad04c5e20b06992100c1d44ece9d3a5b4f817
Author: Shixiong Zhu <sh...@...>
Date:   2016-11-19T00:34:38Z

    [SPARK-18497][SS] Make ForeachSink support watermark
    
    ## What changes were proposed in this pull request?
    
    The issue in ForeachSink is that the newly created Dataset still uses the old QueryExecution. When `foreachPartition` is called, `QueryExecution.toString` will be called and then fail because it doesn't know how to plan EventTimeWatermark.
    
    This PR just replaces the QueryExecution with IncrementalExecution to fix the issue.
    
    ## How was this patch tested?
    
    `test("foreach with watermark")`.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15934 from zsxwing/SPARK-18497.
    
    (cherry picked from commit 2a40de408b5eb47edba92f9fe92a42ed1e78bf98)
    Signed-off-by: Tathagata Das <ta...@gmail.com>
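
    A minimal sketch of the combination this fixes, a `foreach` sink on a watermarked streaming Dataset (the source and the writer are placeholders, not the test from the patch):

    ```scala
    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val events = spark.readStream
      .format("rate")                          // toy source; any source with an event-time column works
      .load()                                  // columns: timestamp, value
      .withWatermark("timestamp", "10 seconds")

    val query = events.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, version: Long): Boolean = true
        def process(record: Row): Unit = println(record)
        def close(errorOrNull: Throwable): Unit = ()
      })
      .start()
    ```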

commit 693401be24bfefe5305038b87888cdeb641d7642
Author: Sean Owen <so...@...>
Date:   2016-11-19T09:00:11Z

    [SPARK-18448][CORE] SparkSession should implement java.lang.AutoCloseable like JavaSparkContext
    
    ## What changes were proposed in this pull request?
    
    Just adds `close()` + `Closeable` as a synonym for `stop()`. This makes it usable in Java in try-with-resources, as suggested by ash211  (`Closeable` extends `AutoCloseable` BTW)
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15932 from srowen/SPARK-18448.
    
    (cherry picked from commit db9fb9baacbf8640dd37a507b7450db727c7e6ea)
    Signed-off-by: Sean Owen <so...@cloudera.com>
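
    A quick sketch of the new synonym; in Scala the analogue of Java's try-with-resources is just try/finally:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("close-demo").master("local[*]").getOrCreate()
    try {
      spark.range(10).show()
    } finally {
      spark.close()   // same as spark.stop(); also usable from Java try-with-resources
    }
    ```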

commit 4b396a6545ec0f1e31b0e211228f04bdc5660300
Author: hyukjinkwon <gu...@...>
Date:   2016-11-19T11:24:15Z

    [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
    
    It seems that in Scala/Java the following note styles are used inconsistently:
    
    - `Note:`
    - `NOTE:`
    - `Note that`
    - `'''Note:'''`
    - `note`
    
    This PR proposes to fix those to `note` to be consistent.
    
    **Before**
    
    - Scala
      ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
    
    - Java
      ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
    
    **After**
    
    - Scala
      ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
    
    - Java
      ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
    
    The notes were found via
    
    ```bash
    grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
    grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
    grep -E '.scala|.java' | \ # java/scala files
    grep -v Suite | \ # exclude tests
    grep -v Test | \ # exclude tests
    grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
    -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
    -e 'org.apache.spark.api.r' \
    ...
    ```
    
    ```bash
    grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
    grep -v "// Note that " | \  # starting with // does not appear in API documentation.
    grep -E '.scala|.java' | \ # java/scala files
    grep -v Suite | \ # exclude tests
    grep -v Test | \ # exclude tests
    grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
    -e 'org.apache.spark.api.java.function' \
    -e 'org.apache.spark.api.r' \
    ...
    ```
    
    ```bash
    grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
    grep -v "// Note: " | \  # starting with // does not appear in API documentation.
    grep -E '.scala|.java' | \ # java/scala files
    grep -v Suite | \ # exclude tests
    grep -v Test | \ # exclude tests
    grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
    -e 'org.apache.spark.api.java.function' \
    -e 'org.apache.spark.api.r' \
    ...
    ```
    
    ```bash
    grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
    grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
    grep -E '.scala|.java' | \ # java/scala files
    grep -v Suite | \ # exclude tests
    grep -v Test | \ # exclude tests
    grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
    -e 'org.apache.spark.api.java.function' \
    -e 'org.apache.spark.api.r' \
    ...
    ```
    
    And then fixed one by one comparing with API documentation/access modifiers.
    
    After that, manually tested via `jekyll build`.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #15889 from HyukjinKwon/SPARK-18437.
    
    (cherry picked from commit d5b1d5fc80153571c308130833d0c0774de62c92)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 30a6fbbb0fb47f5b74ceba3384f28a61bf4e4740
Author: Sean Owen <so...@...>
Date:   2016-11-19T11:28:25Z

    [SPARK-18353][CORE] spark.rpc.askTimeout default value is not 120s
    
    ## What changes were proposed in this pull request?
    
    Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15833 from srowen/SPARK-18353.
    
    (cherry picked from commit 8b1e1088eb274fb15260cd5d6d9508d42837a4d6)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 15ad3a319b91a8b495da9a0e6f5386417991d30d
Author: Sean Owen <so...@...>
Date:   2016-11-19T13:48:56Z

    [SPARK-18448][CORE] Fix @since 2.1.0 on new SparkSession.close() method
    
    ## What changes were proposed in this pull request?
    
    Fix since 2.1.0 on new SparkSession.close() method. I goofed in https://github.com/apache/spark/pull/15932 because it was back-ported to 2.1 instead of just master as originally planned.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15938 from srowen/SPARK-18448.2.
    
    (cherry picked from commit ded5fefb6f5c0a97bf3d7fa1c0494dc434b6ee40)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 15eb86c29c02178f4413df63c39b8df3cda30ca8
Author: sethah <se...@...>
Date:   2016-11-20T01:42:37Z

    [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training
    
    ## What changes were proposed in this pull request?
    
    This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.
    
    Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
    
    ## How was this patch tested?
    
    This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.
    
    Author: sethah <se...@gmail.com>
    
    Closes #15893 from sethah/logreg_refactor.
    
    (cherry picked from commit 856e0042007c789dda4539fb19a5d4580999fbf4)
    Signed-off-by: DB Tsai <db...@dbtsai.com>

commit b0b2f10817f38d9cebd2e436a07d4dd3e41e9328
Author: Kazuaki Ishizaki <is...@...>
Date:   2016-11-20T05:50:20Z

    [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java
    
    ## What changes were proposed in this pull request?
    
    This PR prevents the result of an expression from becoming negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). It casts each operand to `long` before executing the calculation; since the arithmetic is then done in long, the result of the expression stays positive.
    
    ## How was this patch tested?
    
    Manually executed query82 of TPC-DS with 100TB
    
    Author: Kazuaki Ishizaki <is...@jp.ibm.com>
    
    Closes #15907 from kiszk/SPARK-18458.
    
    (cherry picked from commit d93b6552473468df297a08c0bef9ea0bf0f5c13a)
    Signed-off-by: Reynold Xin <rx...@databricks.com>
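
    A minimal illustration of the overflow pattern being fixed (not the actual RadixSort code):

    ```scala
    val index: Int = 0x10000000            // 268,435,456

    // 32-bit arithmetic overflows: 0x10000000 * 8 == 0x80000000 == Int.MinValue.
    val overflowed: Int = index * 8        // -2147483648

    // Casting an operand to Long first keeps the whole calculation in 64 bits.
    val correct: Long = index.toLong * 8   // 2147483648
    ```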

commit 94a9eed11a11510a91dc4c8adb793dc3cbdef8f5
Author: Reynold Xin <rx...@...>
Date:   2016-11-20T05:57:09Z

    [SPARK-18508][SQL] Fix documentation error for DateDiff
    
    ## What changes were proposed in this pull request?
    The previous documentation and example for DateDiff were wrong.
    
    ## How was this patch tested?
    Doc only change.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #15937 from rxin/datediff-doc.
    
    (cherry picked from commit bce9a03677f931d52491e7768aba9e4a19a7e696)
    Signed-off-by: Reynold Xin <rx...@databricks.com>
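
    For reference, a sketch of the corrected semantics, `datediff(endDate, startDate)` = number of days from `startDate` to `endDate`:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    spark.sql("SELECT datediff('2009-07-31', '2009-07-30') AS diff").show()
    // +----+
    // |diff|
    // +----+
    // |   1|
    // +----+
    ```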

----


---



[GitHub] spark issue #21681: Pin tag 210

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21681
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21681: Pin tag 210

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21681
  
    Close this @zhangchj1990 


---



[GitHub] spark issue #21681: Pin tag 210

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21681
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21681: Pin tag 210

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21681
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21681: Pin tag 210

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21681
  
     @zhangchj1990 Looks like this was opened by mistake. Mind closing it, please?


---



[GitHub] spark pull request #21681: Pin tag 210

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21681


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org