Posted to reviews@spark.apache.org by bkpathak <gi...@git.apache.org> on 2016/04/26 09:13:12 UTC

[GitHub] spark pull request: [Spark-14761][SQL][WIP] Reject invalid join me...

GitHub user bkpathak opened a pull request:

    https://github.com/apache/spark/pull/12691

    [Spark-14761][SQL][WIP] Reject invalid join methods even when join columns are not specified in PySpark DataFrame join.

    ## What changes were proposed in this pull request?
    
    In PySpark, an invalid join type does not raise an error for the following join:
    ```df1.join(df2, how='not-a-valid-join-type')```

    The signature of `join` is:
    ```def join(self, other, on=None, how=None):```
    The existing code completely ignores the `how` parameter when `on` is `None`. This patch processes the arguments passed to `join` and forwards them to the JVM Spark SQL Analyzer, which validates the join type.
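
    As a minimal sketch of the reported behavior (the DataFrames and column names below are made up for illustration):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-14761-repro").getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"])
    df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "v2"])

    # `on` is None, so `how` is silently dropped today; no error is raised even
    # though the join type is invalid.
    joined = df1.join(df2, how="not-a-valid-join-type")

    # With `on` supplied, the join type is passed through to the JVM and validated.
    valid = df1.join(df2, on="id", how="left_outer")
    ```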
    
    ## How was this patch tested?
    Tested manually and with the existing test suites.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bkpathak/spark spark-14761

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12691.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12691
    
----
commit ab6518468d67063e92667f3a8f6b563fea5b00f8
Author: Bijay Pathak <bk...@mtu.edu>
Date:   2016-04-26T05:07:30Z

    refactored join so it always passes complete arguments to the JVM API

commit 66746964427cd8028250963cc26b0397e8141bc4
Author: Bijay Pathak <bk...@mtu.edu>
Date:   2016-04-26T06:00:07Z

    updated to handle condition when on is None

commit db36befc3dd969d5b5ade5398c6d3aa0a93c7fbd
Author: Bijay Pathak <bk...@mtu.edu>
Date:   2016-04-26T06:38:18Z

    fixed the style error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak closed the pull request at:

    https://github.com/apache/spark/pull/12691




[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

Posted by bkpathak <gi...@git.apache.org>.
GitHub user bkpathak reopened a pull request:

    https://github.com/apache/spark/pull/12691

    [Spark-14761][SQL][WIP] Reject invalid join methods when join columns are not specified in PySpark DataFrame join.

    ## What changes were proposed in this pull request?
    
    In PySpark, an invalid join type does not raise an error for the following join:
    ```df1.join(df2, how='not-a-valid-join-type')```

    The signature of `join` is:
    ```def join(self, other, on=None, how=None):```
    The existing code completely ignores the `how` parameter when `on` is `None`. This patch processes the arguments passed to `join` and forwards them to the JVM Spark SQL Analyzer, which validates the join type.
    
    ## How was this patch tested?
    Tested manually and with the existing test suites.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bkpathak/spark spark-14761

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12691.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12691
    
----
commit c76baff0cc4775c2191d075cc9a8176e4915fec8
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-09-11T09:19:39Z

    [SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from spark-config.sh
    
    ## What changes were proposed in this pull request?
    During startup of Spark standalone, the script file spark-config.sh appends to PYTHONPATH and can be sourced many times, causing duplicate entries in the path. This change adds an env flag that is set when PYTHONPATH is appended, so the append happens only once.
    
    ## How was this patch tested?
    Manually started standalone master/worker and verified PYTHONPATH has no duplicate entries.
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.

commit 883c7631847a95684534222c1b6cfed8e62710c8
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-09-11T12:47:13Z

    [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2.
    
    ## What changes were proposed in this pull request?
    #14956 reduced the default k-means|| init steps from 5 to 2 for the spark.mllib package only; we should make the same change for spark.ml and PySpark.
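
    A hedged PySpark sketch of how a caller can pin the value explicitly if they rely on the old default (the data here is illustrative only):

    ```python
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame(
        [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
        ["features"])

    # The default initSteps becomes 2 after this change; set it explicitly to keep 5.
    kmeans = KMeans(k=2, initMode="k-means||", initSteps=5, seed=1)
    model = kmeans.fit(data)
    ```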
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #15050 from yanboliang/spark-17389.

commit 767d48076971f6f1e2c93ee540a9b2e5e465631b
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-09-11T15:35:27Z

    [SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs
    
    ## What changes were proposed in this pull request?
    
    This is a trivial patch that catches all `OutOfMemoryError` while building the broadcast hash relation and rethrows it by wrapping it in a nice error message.
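
    For context, a hedged sketch of the kind of query that exercises this code path (sizes are illustrative):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()
    large = spark.range(1000000)
    small = spark.range(100)

    # The broadcast hint makes Spark build the hash relation for `small` on the driver;
    # that build step is where an OutOfMemoryError is now rethrown with a clearer message.
    large.join(broadcast(small), "id").count()
    ```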
    
    ## How was this patch tested?
    
    Existing Tests
    
    Author: Sameer Agarwal <sa...@cs.berkeley.edu>
    
    Closes #14979 from sameeragarwal/broadcast-join-error.

commit 72eec70bdbf6fb67c977463db5d8d95dd3040ae8
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T04:51:22Z

    [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
    
    The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.

commit cc87280fcd065b01667ca7a59a1a32c7ab757355
Author: cenyuhai <ce...@didichuxing.com>
Date:   2016-09-12T10:52:56Z

    [SPARK-17171][WEB UI] DAG will list all partitions in the graph
    
    ## What changes were proposed in this pull request?
    The DAG view lists all partitions in the graph, which is slow and makes the whole graph hard to read.
    Usually we don't want to see every partition; we just want to see the relations in the DAG graph.
    So this change shows only 2 root nodes for RDDs.
    
    Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png)
    
    Author: cenyuhai <ce...@didichuxing.com>
    Author: 岑玉海 <26...@qq.com>
    
    Closes #14737 from cenyuhai/SPARK-17171.

commit 4efcdb7feae24e41d8120b59430f8b77cc2106a6
Author: codlife <10...@qq.com>
Date:   2016-09-12T11:10:46Z

    [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner without sortBy
    
    ## What changes were proposed in this pull request?
    
    When there are many RDDs, the sort in Partitioner.defaultPartitioner can hurt performance severely. We don't actually need to sort the RDDs; a single scan over them achieves the same goal.
    
    ## How was this patch tested?
    
    manual tests
    
    Author: codlife <10...@qq.com>
    
    Closes #15039 from codlife/master.

commit b3c22912284c2a010a4af3c43dc5e6fd53c68f8c
Author: Gaetan Semet <ga...@xeberon.net>
Date:   2016-09-12T11:21:33Z

    [SPARK-16992][PYSPARK] use map comprehension in doc
    
    The code is equivalent, but a comprehension is usually faster than a call to map().
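
    An illustrative sketch of the kind of substitution described (not the actual doc change):

    ```python
    # Using map() plus a lambda:
    squares_map = list(map(lambda x: x * x, range(10)))

    # Equivalent comprehension, usually faster and easier to read:
    squares_comp = [x * x for x in range(10)]

    assert squares_map == squares_comp
    ```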
    
    Author: Gaetan Semet <ga...@xeberon.net>
    
    Closes #14863 from Stibbons/map_comprehension.

commit 8087ecf8daad1587d0ce9040991b14320628a65e
Author: WeichenXu <we...@outlook.com>
Date:   2016-09-12T11:23:16Z

    [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" error message in PairRDDfunctions
    
    ## What changes were proposed in this pull request?
    
    To avoid confusing users, the error message
    `Default partitioner cannot partition array keys.`
    in `PairRDDFunctions` is updated:
    the one in `partitionBy` is replaced with
    `Specified partitioner cannot partition array keys.`
    and the others are replaced with
    `Specified or default partitioner cannot partition array keys.`
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <We...@outlook.com>
    
    Closes #15045 from WeichenXu123/fix_partitionBy_error_message.

commit 1742c3ab86d75ce3d352f7cddff65e62fb7c8dd4
Author: Sean Zhong <se...@databricks.com>
Date:   2016-09-12T18:30:06Z

    [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the whole RDD in memory
    
    ## What changes were proposed in this pull request?
    
       MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory.
       ```
       scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()
    
       java.lang.OutOfMemoryError: Java heap space
    	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
    	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    	at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
    	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
    	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
    	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    	at org.apache.spark.scheduler.Task.run(Task.scala:86)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
       ```
    
    Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer to store all input values that it has read so far before transferring the values to storage memory cache. The problem is that when the input RDD is too big for caching in memory, the temporary unrolling memory SizeTrackingVector is not garbage collected in time. As SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly.
    
    More info can be found at https://issues.apache.org/jira/browse/SPARK-17503
    
    ## How was this patch tested?
    
    Unit test and manual test.
    
    ### Before change
    
    Heap memory consumption
    <img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">
    
    Heap dump
    <img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">
    
    ### After change
    
    Heap memory consumption
    <img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">
    
    Author: Sean Zhong <se...@databricks.com>
    
    Closes #15056 from clockfly/memory_store_leak.

commit 3d40896f410590c0be044b3fa7e5d32115fac05e
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T20:09:33Z

    [SPARK-17483] Refactoring in BlockManager status reporting and block removal
    
    This patch makes three minor refactorings to the BlockManager:
    
    - Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine).
    - Split  `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch.
    - Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block.
    
    These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only.

commit 7c51b99a428a965ff7d136e1cdda20305d260453
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T22:24:33Z

    [SPARK-14818] Post-2.0 MiMa exclusion and build changes
    
    This patch makes a handful of post-Spark-2.0 MiMa exclusion and build updates. It should be merged to master and a subset of it should be picked into branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.
    
    - Remove the `sketch`, `mllibLocal`, and `streamingKafka010` from the list of excluded subprojects so that MiMa checks them.
    - Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in `mimaSettings`.
    - Move the exclusion added in SPARK-14743 from `v20excludes` to `v21excludes`, since that patch was only merged into master and not branch-2.0.
    - Add exclusions for an API change introduced by SPARK-17096 / #14675.
    - Add missing exclusions for the `o.a.spark.internal` and `o.a.spark.sql.internal` packages.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15061 from JoshRosen/post-2.0-mima-changes.

commit f9c580f11098d95f098936a0b90fa21d71021205
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T22:43:57Z

    [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job
    
    ## What changes were proposed in this pull request?
    
    In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).
    
    In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.
    
    Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.
    
    ## How was this patch tested?
    
    Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.
    
    I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15037 from JoshRosen/SPARK-17485.

commit a91ab705e8c124aa116c3e5b1f3ba88ce832dcde
Author: Davies Liu <da...@databricks.com>
Date:   2016-09-12T23:35:42Z

    [SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec
    
    ## What changes were proposed in this pull request?
    
    When there is a Python UDF in the Project between Sort and Limit, it is collected into TakeOrderedAndProjectExec, but ExtractPythonUDFs fails to pull the Python UDF out because QueryPlan.expressions does not include the expressions inside Option[Seq[Expression]].

    Ideally, we should fix `QueryPlan.expressions`, but I tried that with no luck (it always runs into an infinite loop). In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] as a workaround. cc JoshRosen
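
    A hedged sketch of the query shape described above (not the actual regression test; names are illustrative):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    plus_one = udf(lambda x: x + 1, LongType())  # a Python UDF

    # Sort, then project the Python UDF, then limit: this shape is planned with
    # TakeOrderedAndProjectExec, the operator this patch changes.
    df = spark.range(100)
    df.orderBy(col("id").desc()).select(plus_one(col("id")).alias("id_plus_one")).limit(3).show()
    ```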
    
    ## How was this patch tested?
    
    Added regression test.
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #15030 from davies/all_expr.

commit 46f5c201e70053635bdeab4984ba1b649478bd12
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-09-13T09:42:51Z

    [BUILD] Closing some stale PRs and ones suggested to be closed by committer(s)
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to close some stale PRs and ones suggested to be closed by committer(s)
    
    Closes #10052
    Closes #11079
    Closes #12661
    Closes #12772
    Closes #12958
    Closes #12990
    Closes #13409
    Closes #13779
    Closes #13811
    Closes #14577
    Closes #14714
    Closes #14875
    Closes #15020
    
    ## How was this patch tested?
    
    N/A
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #15057 from HyukjinKwon/closing-stale-pr.

commit 3f6a2bb3f7beac4ce928eb660ee36258b5b9e8c8
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-13T10:54:03Z

    [SPARK-17515] CollectLimit.execute() should perform per-partition limits
    
    ## What changes were proposed in this pull request?
    
    CollectLimit.execute() incorrectly omits per-partition limits, leading to performance regressions when this code path is hit (which should not happen in normal operation, but can occur in some cases; see #15068 for one example).
    
    ## How was this patch tested?
    
    Regression test in SQLQuerySuite that asserts the number of records scanned from the input RDD.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15070 from JoshRosen/SPARK-17515.

commit 4ba63b193c1ac292493e06343d9d618c12c5ef3f
Author: jiangxingbo <ji...@gmail.com>
Date:   2016-09-13T15:04:51Z

    [SPARK-17142][SQL] Complex query triggers binding error in HashAggregateExec
    
    ## What changes were proposed in this pull request?
    
    In the `ReorderAssociativeOperator` rule, we extract foldable expressions from Add/Multiply arithmetic and replace them with evaluated literals. For example, `(a + 1) + (b + 2)` is optimized to `(a + b + 3)` by this rule.
    For an aggregate operator, output expressions should be derivable from groupingExpressions; the current implementation of the `ReorderAssociativeOperator` rule may break this promise. An instance:
    ```
    SELECT
      ((t1.a + 1) + (t2.a + 2)) AS out_col
    FROM
      testdata2 AS t1
    INNER JOIN
      testdata2 AS t2
    ON
      (t1.a = t2.a)
    GROUP BY (t1.a + 1), (t2.a + 2)
    ```
    `((t1.a + 1) + (t2.a + 2))` is optimized to `(t1.a + t2.a + 3)`, which cannot be derived from `ExpressionSet((t1.a + 1), (t2.a + 2))`.
    Maybe we should improve the `ReorderAssociativeOperator` rule by adding a GroupingExpressionSet to keep Aggregate.groupingExpressions and respect these expressions during the optimization stage.
    
    ## How was this patch tested?
    
    Add new test case in `ReorderAssociativeOperatorSuite`.
    
    Author: jiangxingbo <ji...@gmail.com>
    
    Closes #14917 from jiangxb1987/rao.

commit 72edc7e958271cedb01932880550cfc2c0631204
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-09-13T22:11:55Z

    [SPARK-17531] Don't initialize Hive Listeners for the Execution Client
    
    ## What changes were proposed in this pull request?
    
    If a user provides listeners inside the Hive Conf, the configuration for these listeners is passed to the Hive Execution Client as well. This may cause issues for two reasons:
    1. The Execution Client will actually generate garbage
    2. The listener class needs to be both in the Spark Classpath and Hive Classpath
    
    This PR empties the listener configurations in `HiveUtils.newTemporaryConfiguration` so that the execution client will not contain the listener confs, but the metadata client will.
    
    ## How was this patch tested?
    
    Unit tests
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #15086 from brkyvz/null-listeners.

commit 37b93f54e89332b6b77bb02c1c2299614338fd7c
Author: gatorsmile <ga...@gmail.com>
Date:   2016-09-13T22:37:42Z

    [SPARK-17530][SQL] Add Statistics into DESCRIBE FORMATTED
    
    ### What changes were proposed in this pull request?
    Statistics is missing in the output of `DESCRIBE FORMATTED`. This PR is to add it. After the PR, the output will be like:
    ```
    +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
    |col_name                    |data_type                                                                                                             |comment|
    +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
    |key                         |string                                                                                                                |null   |
    |value                       |string                                                                                                                |null   |
    |                            |                                                                                                                      |       |
    |# Detailed Table Information|                                                                                                                      |       |
    |Database:                   |default                                                                                                               |       |
    |Owner:                      |xiaoli                                                                                                                |       |
    |Create Time:                |Tue Sep 13 14:36:57 PDT 2016                                                                                          |       |
    |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                                                                                          |       |
    |Location:                   |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-9982e1db-df17-4376-a140-dbbee0203d83/texttable|       |
    |Table Type:                 |MANAGED                                                                                                               |       |
    |Statistics:                 |sizeInBytes=5812, rowCount=500, isBroadcastable=false                                                                 |       |
    |Table Parameters:           |                                                                                                                      |       |
    |  rawDataSize               |-1                                                                                                                    |       |
    |  numFiles                  |1                                                                                                                     |       |
    |  transient_lastDdlTime     |1473802620                                                                                                            |       |
    |  totalSize                 |5812                                                                                                                  |       |
    |  COLUMN_STATS_ACCURATE     |false                                                                                                                 |       |
    |  numRows                   |-1                                                                                                                    |       |
    |                            |                                                                                                                      |       |
    |# Storage Information       |                                                                                                                      |       |
    |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                    |       |
    |InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                                                                              |       |
    |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                            |       |
    |Compressed:                 |No                                                                                                                    |       |
    |Storage Desc Parameters:    |                                                                                                                      |       |
    |  serialization.format      |1                                                                                                                     |       |
    +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
    ```
    
    Also improve the output of statistics in `DESCRIBE EXTENDED` by removing duplicate `Statistics`. Below is the example after the PR:
    
    ```
    +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
    |col_name                    |data_type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |comment|
    +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
    |key                         |string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |null   |
    |value                       |string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |null   |
    |                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |       |
    |# Detailed Table Information|CatalogTable(
    	Table: `default`.`texttable`
    	Owner: xiaoli
    	Created: Tue Sep 13 14:38:43 PDT 2016
    	Last Access: Wed Dec 31 16:00:00 PST 1969
    	Type: MANAGED
    	Schema: [StructField(key,StringType,true), StructField(value,StringType,true)]
    	Provider: hive
    	Properties: [rawDataSize=-1, numFiles=1, transient_lastDdlTime=1473802726, totalSize=5812, COLUMN_STATS_ACCURATE=false, numRows=-1]
    	Statistics: sizeInBytes=5812, rowCount=500, isBroadcastable=false
    	Storage(Location: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-8ea5c5a0-5680-4778-91cb-c6334cf8a708/texttable, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1]))|       |
    +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
    ```
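
    A hedged sketch of how the outputs above can be reproduced for the example Hive table `texttable`:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    spark.sql("DESCRIBE FORMATTED texttable").show(truncate=False)
    spark.sql("DESCRIBE EXTENDED texttable").show(truncate=False)
    ```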
    
    ### How was this patch tested?
    Manually tested.
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15083 from gatorsmile/descFormattedStats.

commit a454a4d86bbed1b6988da0a0e23b3e87a1a16340
Author: junyangq <qi...@gmail.com>
Date:   2016-09-14T04:01:03Z

    [SPARK-17317][SPARKR] Add SparkR vignette
    
    ## What changes were proposed in this pull request?
    
    This PR tries to add a SparkR vignette, which serves as a friendly guide to the functionality provided by SparkR.
    
    ## How was this patch tested?
    
    Manual test.
    
    Author: junyangq <qi...@gmail.com>
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    Author: Junyang Qian <ju...@databricks.com>
    
    Closes #14980 from junyangq/SPARKR-vignette.

commit def7c265f539f3e119f068b6e9050300d05b14a4
Author: Jagadeesan <as...@us.ibm.com>
Date:   2016-09-14T08:03:16Z

    [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…
    
    ## What changes were proposed in this pull request?
    
    The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document.
    
    … network timeout]
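
    A hedged illustration of the documented relationship (these are real Spark config keys; the values shown are just the defaults): `spark.executor.heartbeatInterval` should be significantly less than `spark.network.timeout`.

    ```python
    from pyspark.sql import SparkSession

    # Example only: keep the heartbeat interval well below the network timeout,
    # otherwise executors may be marked as lost before a heartbeat is even due.
    spark = (SparkSession.builder
             .appName("timeout-config-example")
             .config("spark.network.timeout", "120s")
             .config("spark.executor.heartbeatInterval", "10s")
             .getOrCreate())
    ```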
    
    Author: Jagadeesan <as...@us.ibm.com>
    
    Closes #15042 from jagadeesanas2/SPARK-17449.

commit b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3
Author: Sami Jaktholm <sj...@outlook.com>
Date:   2016-09-14T08:38:30Z

    [SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as it was removed from the Scala API prior to Spark 2.0.0
    
    ## What changes were proposed in this pull request?
    
    This pull request removes the SparkContext.clearFiles() method from the PySpark API as the method was removed from the Scala API in 8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to an exception as PySpark tries to call the non-existent method on the JVM side.
    
    ## How was this patch tested?
    
    Existing tests (though none of them tested this particular method).
    
    Author: Sami Jaktholm <sj...@outlook.com>
    
    Closes #15081 from sjakthol/pyspark-sc-clearfiles.

commit 18b4f035f40359b3164456d0dab52dbc762ea3b4
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2016-09-14T08:49:15Z

    [CORE][DOC] remove redundant comment
    
    ## What changes were proposed in this pull request?
    In the comment, there is a redundant `the estimated`.

    This PR simply removes the redundant text and adjusts the formatting.
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #15091 from wangmiao1981/comment.

commit 4cea9da2ae88b40a5503111f8f37051e2372163e
Author: Ergin Seyfe <es...@fb.com>
Date:   2016-09-14T08:51:14Z

    [SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n)
    
    ## What changes were proposed in this pull request?
    Scala's List.length method is O(N), which makes the gatherCompressibilityStats function O(N^2). Eliminate the List.length calls by writing the code in a more idiomatic Scala way.
    
    https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36
    
    As suggested, the fix was extended to the HiveInspectors and AggregationIterator classes as well.
    
    ## How was this patch tested?
    Profiled a Spark job and found that CompressibleColumnBuilder is using 39% of the CPU. Out of this 39%, CompressibleColumnBuilder->gatherCompressibilityStats is using 23%. 6.24% of the CPU is spent on List.length, which is called inside gatherCompressibilityStats.
    
    After this change we started to save 6.24% of the CPU.
    
    Author: Ergin Seyfe <es...@fb.com>
    
    Closes #15032 from seyfe/gatherCompressibilityStats.

commit dc0a4c916151c795dc41b5714e9d23b4937f4636
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-14T09:10:16Z

    [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages
    
    ## What changes were proposed in this pull request?
    
    Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
    
    This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15075 from srowen/SPARK-17445.

commit 52738d4e099a19466ef909b77c24cab109548706
Author: gatorsmile <ga...@gmail.com>
Date:   2016-09-14T15:10:20Z

    [SPARK-17409][SQL] Do Not Optimize Query in CTAS More Than Once
    
    ### What changes were proposed in this pull request?
    As explained in https://github.com/apache/spark/pull/14797:
    >Some analyzer rules have assumptions about logical plans, and the optimizer may break these assumptions; we should not pass an optimized query plan into QueryExecution (it will be analyzed again), otherwise we may hit some weird bugs.
    For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that this rule should not apply twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule then applies twice and causes a wrong result.
    
    We should not optimize the query in CTAS more than once. For example,
    ```Scala
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    ```
    Before this PR, the results do not match
    ```
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    ```
    After this PR, the results match.
    ```
    +---+----------------------+
    |id |num                   |
    +---+----------------------+
    |99 |99.000000000000000000 |
    |100|100.000000000000000000|
    +---+----------------------+
    ```
    
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`.
    
    ### How was this patch tested?
    Added a test
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15048 from gatorsmile/ctasOptimized.

commit 6d06ff6f7e2dd72ba8fe96cd875e83eda6ebb2a9
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-14T17:10:01Z

    [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python
    
    ## What changes were proposed in this pull request?
    
    In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.
    
    The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).
    
    In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations.
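
    A sketch of the two equivalent-looking queries being compared (illustrative data; after this patch both should be planned with the same `CollectLimit` optimizations):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10000000, numPartitions=100)

    first_take = df.take(1)              # single-stage job touching as few partitions as possible
    first_limit = df.limit(1).collect()  # previously a two-stage job scanning all partitions

    assert len(first_take) == 1 and len(first_limit) == 1
    ```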
    
    ## How was this patch tested?
    
    Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15068 from JoshRosen/pyspark-collect-limit.

commit a79838bdeeb12cec4d50da3948bd8a33777e53a6
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-09-14T17:33:56Z

    [MINOR][SQL] Add missing functions for some options in SQLConf and use them where applicable
    
    ## What changes were proposed in this pull request?
    
    I first thought these were left out intentionally because they are kind of hidden options, but it seems they are simply missing.

    For example, `spark.sql.parquet.mergeSchema` is documented in [sql-programming-guide.md](https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md) but its function is missing, whereas many options such as `spark.sql.join.preferSortMergeJoin` are not documented but do have their own functions.

    So, this PR suggests making them consistent by adding the missing functions for some options in `SQLConf` and using them where applicable, in order to make the code more readable.
    
    ## How was this patch tested?
    
    Existing tests should cover this.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #14678 from HyukjinKwon/sqlconf-cleanup.

commit 040e46979d5f90edc7f9be3cbedd87e8986e8053
Author: Xin Wu <xi...@us.ibm.com>
Date:   2016-09-14T19:14:29Z

    [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BY
    
    ## What changes were proposed in this pull request?
    Currently, the ORDER BY clause returns null values according to the sorting order (ASC|DESC), treating a null value as always smaller than non-null values.
    However, the SQL:2003 standard supports NULLS FIRST and NULLS LAST to let users specify whether null values should be returned first or last, regardless of the sorting order (ASC|DESC).
    
    This PR is to support this new feature.
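
    A hedged usage sketch of the new syntax (the table and column names are illustrative):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.createDataFrame([(1,), (None,), (3,)], ["x"]).createOrReplaceTempView("t")

    # Nulls last despite ascending order, and nulls first despite descending order.
    spark.sql("SELECT x FROM t ORDER BY x ASC NULLS LAST").show()
    spark.sql("SELECT x FROM t ORDER BY x DESC NULLS FIRST").show()
    ```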
    
    ## How was this patch tested?
    New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries.
    
    
    Author: Xin Wu <xi...@us.ibm.com>
    
    Closes #14842 from xwu0226/SPARK-10747.

commit ff6e4cbdc80e2ad84c5d70ee07f323fad9374e3e
Author: Kishor Patil <kp...@yahoo-inc.com>
Date:   2016-09-14T19:19:35Z

    [SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as Failed
    
    ## What changes were proposed in this pull request?
    
    Due to race conditions, the `assert(numExecutorsRunning <= targetNumExecutors)` can fail, causing an `AssertionError`. So the assertion has been removed and the conditional check is instead performed before launching a new container:
    ```
    java.lang.AssertionError: assertion failed
            at scala.Predef$.assert(Predef.scala:156)
            at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
            at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    ```
    ## How was this patch tested?
    This was manually tested using a large ForkAndJoin job with Dynamic Allocation enabled, to validate that the previously failing job succeeds without any such exception.
    
    Author: Kishor Patil <kp...@yahoo-inc.com>
    
    Closes #15069 from kishorvpatil/SPARK-17511.

commit e33bfaed3b160fbc617c878067af17477a0044f5
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-09-14T20:33:51Z

    [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value can be read thread-safely
    
    ## What changes were proposed in this pull request?
    
    Make the values of CollectionAccumulator and SetAccumulator readable in a thread-safe way, to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463).
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15063 from zsxwing/SPARK-17463.

----




[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

Posted by bkpathak <gi...@git.apache.org>.
GitHub user bkpathak reopened a pull request:

    https://github.com/apache/spark/pull/12691

    [Spark-14761][SQL][WIP] Reject invalid join methods when join columns are not specified in PySpark DataFrame join.

    ## What changes were proposed in this pull request?
    
    In PySpark, the invalid join type will not throw error for the following join:
    ```df1.join(df2, how='not-a-valid-join-type')```
    
    The signature of the join is:
    ```def join(self, other, on=None, how=None):```
    The existing code completely ignores the `how` parameter when `on` is `None`. This patch will process the arguments passed to join and pass in to JVM Spark SQL Analyzer, which will validate the join type passed.
    
    ## How was this patch tested?
    Used manual and existing test suites.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bkpathak/spark spark-14761

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12691.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12691
    
----
commit c76baff0cc4775c2191d075cc9a8176e4915fec8
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-09-11T09:19:39Z

    [SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from spark-config.sh
    
    ## What changes were proposed in this pull request?
    During startup of Spark standalone, the script file spark-config.sh appends to the PYTHONPATH and can be sourced many times, causing duplicates in the path.  This change adds a env flag that is set when the PYTHONPATH is appended so it will happen only one time.
    
    ## How was this patch tested?
    Manually started standalone master/worker and verified PYTHONPATH has no duplicate entries.
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.

commit 883c7631847a95684534222c1b6cfed8e62710c8
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-09-11T12:47:13Z

    [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2.
    
    ## What changes were proposed in this pull request?
    #14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #15050 from yanboliang/spark-17389.

commit 767d48076971f6f1e2c93ee540a9b2e5e465631b
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-09-11T15:35:27Z

    [SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs
    
    ## What changes were proposed in this pull request?
    
    This is a trivial patch that catches all `OutOfMemoryError` while building the broadcast hash relation and rethrows it by wrapping it in a nice error message.
    
    ## How was this patch tested?
    
    Existing Tests
    
    Author: Sameer Agarwal <sa...@cs.berkeley.edu>
    
    Closes #14979 from sameeragarwal/broadcast-join-error.

commit 72eec70bdbf6fb67c977463db5d8d95dd3040ae8
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T04:51:22Z

    [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
    
    The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.

commit cc87280fcd065b01667ca7a59a1a32c7ab757355
Author: cenyuhai <ce...@didichuxing.com>
Date:   2016-09-12T10:52:56Z

    [SPARK-17171][WEB UI] DAG will list all partitions in the graph
    
    ## What changes were proposed in this pull request?
    DAG will list all partitions in the graph, it is too slow and hard to see all graph.
    Always we don't want to see all partitions\uff0cwe just want to see the relations of DAG graph.
    So I just show 2 root nodes for Rdds.
    
    Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png)
    
    Author: cenyuhai <ce...@didichuxing.com>
    Author: \u5c91\u7389\u6d77 <26...@qq.com>
    
    Closes #14737 from cenyuhai/SPARK-17171.

commit 4efcdb7feae24e41d8120b59430f8b77cc2106a6
Author: codlife <10...@qq.com>
Date:   2016-09-12T11:10:46Z

    [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner without sortBy
    
    ## What changes were proposed in this pull request?
    
    if there are many rdds in some situations,the sort will loss he performance servely,actually we needn't sort the rdds , we can just scan the rdds one time to gain the same goal.
    
    ## How was this patch tested?
    
    manual tests
    
    Author: codlife <10...@qq.com>
    
    Closes #15039 from codlife/master.

commit b3c22912284c2a010a4af3c43dc5e6fd53c68f8c
Author: Gaetan Semet <ga...@xeberon.net>
Date:   2016-09-12T11:21:33Z

    [SPARK-16992][PYSPARK] use map comprehension in doc
    
    Code is equivalent, but map comprehency is most of the time faster than a map.
    
    Author: Gaetan Semet <ga...@xeberon.net>
    
    Closes #14863 from Stibbons/map_comprehension.

commit 8087ecf8daad1587d0ce9040991b14320628a65e
Author: WeichenXu <we...@outlook.com>
Date:   2016-09-12T11:23:16Z

    [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" error message in PairRDDfunctions
    
    ## What changes were proposed in this pull request?
    
    In order to avoid confusing user,
    error message in `PairRDDfunctions`
    `Default partitioner cannot partition array keys.`
    is updated,
    the one in `partitionBy` is replaced with
    `Specified partitioner cannot partition array keys.`
    other is replaced with
    `Specified or default partitioner cannot partition array keys.`
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <We...@outlook.com>
    
    Closes #15045 from WeichenXu123/fix_partitionBy_error_message.

commit 1742c3ab86d75ce3d352f7cddff65e62fb7c8dd4
Author: Sean Zhong <se...@databricks.com>
Date:   2016-09-12T18:30:06Z

    [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the whole RDD in memory
    
    ## What changes were proposed in this pull request?
    
       MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory.
       ```
       scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()
    
       java.lang.OutOfMemoryError: Java heap space
    	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
    	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    	at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
    	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
    	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
    	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    	at org.apache.spark.scheduler.Task.run(Task.scala:86)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
       ```
    
    Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer to store all input values that it has read so far before transferring the values to storage memory cache. The problem is that when the input RDD is too big for caching in memory, the temporary unrolling memory SizeTrackingVector is not garbage collected in time. As SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly.
    
    More info can be found at https://issues.apache.org/jira/browse/SPARK-17503
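    
    A conceptual Python sketch of the fix's idea (the real change is in Spark's Scala MemoryStore; the class and field names below are illustrative): once the buffered values have been handed back to the caller, the reference to the unrolling buffer is dropped so it can be garbage collected.
    
    ```python
    class PartiallyUnrolledIteratorSketch:
        """Yield the values buffered during unrolling first, then the rest of the
        input, releasing the buffer as soon as it is drained so it can be GC'd."""
    
        def __init__(self, unrolled_buffer, rest):
            self._buffer = unrolled_buffer
            self._rest = rest
    
        def __iter__(self):
            for value in self._buffer:
                yield value
            self._buffer = None  # drop the reference to the potentially huge buffer
            yield from self._rest
    ```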
    
    ## How was this patch tested?
    
    Unit test and manual test.
    
    ### Before change
    
    Heap memory consumption
    <img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">
    
    Heap dump
    <img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">
    
    ### After change
    
    Heap memory consumption
    <img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">
    
    Author: Sean Zhong <se...@databricks.com>
    
    Closes #15056 from clockfly/memory_store_leak.

commit 3d40896f410590c0be044b3fa7e5d32115fac05e
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T20:09:33Z

    [SPARK-17483] Refactoring in BlockManager status reporting and block removal
    
    This patch makes three minor refactorings to the BlockManager:
    
    - Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine).
    - Split `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch.
    - Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block.
    
    These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only.

commit 7c51b99a428a965ff7d136e1cdda20305d260453
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T22:24:33Z

    [SPARK-14818] Post-2.0 MiMa exclusion and build changes
    
    This patch makes a handful of post-Spark-2.0 MiMa exclusion and build updates. It should be merged to master and a subset of it should be picked into branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.
    
    - Remove `sketch`, `mllibLocal`, and `streamingKafka010` from the list of excluded subprojects so that MiMa checks them.
    - Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in `mimaSettings`.
    - Move the exclusion added in SPARK-14743 from `v20excludes` to `v21excludes`, since that patch was only merged into master and not branch-2.0.
    - Add exclusions for an API change introduced by SPARK-17096 / #14675.
    - Add missing exclusions for the `o.a.spark.internal` and `o.a.spark.sql.internal` packages.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15061 from JoshRosen/post-2.0-mima-changes.

commit f9c580f11098d95f098936a0b90fa21d71021205
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-12T22:43:57Z

    [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job
    
    ## What changes were proposed in this pull request?
    
    In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).
    
    In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.
    
    Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.
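    
    A conceptual sketch of that control flow (the helper names are hypothetical, not Spark's API): a failed remote read surfaces as `None` and is treated as a cache miss rather than a task failure.
    
    ```python
    def get_or_compute(block_id, read_local, read_remote, compute):
        """Prefer a local cached copy, then a remote copy; if neither is available
        (including failed remote fetches, which now return None), recompute."""
        local = read_local(block_id)
        if local is not None:
            return local
        remote = read_remote(block_id)  # returns None instead of raising on fetch errors
        if remote is not None:
            return remote
        return compute()
    ```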
    
    ## How was this patch tested?
    
    Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.
    
    I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15037 from JoshRosen/SPARK-17485.

commit a91ab705e8c124aa116c3e5b1f3ba88ce832dcde
Author: Davies Liu <da...@databricks.com>
Date:   2016-09-12T23:35:42Z

    [SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec
    
    ## What changes were proposed in this pull request?
    
    When there is a Python UDF in the Project between Sort and Limit, it is collected into TakeOrderedAndProjectExec, and ExtractPythonUDFs fails to pull the Python UDFs out because QueryPlan.expressions does not include the expressions inside Option[Seq[Expression]].
    
    Ideally, we should fix `QueryPlan.expressions`, but I tried that with no luck (it always runs into an infinite loop). In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] as a workaround. cc JoshRosen
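    
    A sketch of the kind of query that exercises this code path (assumes an existing SparkSession named `spark`; the UDF itself is arbitrary):
    
    ```python
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType
    
    # A Python UDF projected between a sort and a limit is planned as
    # TakeOrderedAndProjectExec, the operator touched by this patch.
    plus_one = udf(lambda x: x + 1, IntegerType())
    df = spark.range(10)
    df.orderBy(col("id").desc()).select(plus_one(col("id")).alias("id_plus_one")).limit(3).show()
    ```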
    
    ## How was this patch tested?
    
    Added regression test.
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #15030 from davies/all_expr.

commit 46f5c201e70053635bdeab4984ba1b649478bd12
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-09-13T09:42:51Z

    [BUILD] Closing some stale PRs and ones suggested to be closed by committer(s)
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to close some stale PRs and ones suggested to be closed by committer(s)
    
    Closes #10052
    Closes #11079
    Closes #12661
    Closes #12772
    Closes #12958
    Closes #12990
    Closes #13409
    Closes #13779
    Closes #13811
    Closes #14577
    Closes #14714
    Closes #14875
    Closes #15020
    
    ## How was this patch tested?
    
    N/A
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #15057 from HyukjinKwon/closing-stale-pr.

commit 3f6a2bb3f7beac4ce928eb660ee36258b5b9e8c8
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-13T10:54:03Z

    [SPARK-17515] CollectLimit.execute() should perform per-partition limits
    
    ## What changes were proposed in this pull request?
    
    CollectLimit.execute() incorrectly omits per-partition limits, leading to performance regressions when this code path is hit (which should not happen in normal operation, but can occur in some cases; see #15068 for one example).
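    
    The idea, sketched against the RDD API rather than the SQL operator itself (so this is an analogy, not the patched code): cap each partition at the limit before taking the global limit.
    
    ```python
    from itertools import islice
    
    def collect_limit_sketch(rdd, n):
        """Take at most n rows from each partition first, so no partition is fully
        scanned, and only then apply the global limit."""
        return rdd.mapPartitions(lambda it: islice(it, n)).take(n)
    ```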
    
    ## How was this patch tested?
    
    Regression test in SQLQuerySuite that asserts the number of records scanned from the input RDD.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15070 from JoshRosen/SPARK-17515.

commit 4ba63b193c1ac292493e06343d9d618c12c5ef3f
Author: jiangxingbo <ji...@gmail.com>
Date:   2016-09-13T15:04:51Z

    [SPARK-17142][SQL] Complex query triggers binding error in HashAggregateExec
    
    ## What changes were proposed in this pull request?
    
    In the `ReorderAssociativeOperator` rule, we extract foldable expressions from Add/Multiply arithmetic and replace them with the evaluated literal. For example, `(a + 1) + (b + 2)` is optimized to `(a + b + 3)` by this rule.
    For an aggregate operator, output expressions should be derived from groupingExpressions, but the current implementation of the `ReorderAssociativeOperator` rule may break this promise. An instance could be:
    ```
    SELECT
      ((t1.a + 1) + (t2.a + 2)) AS out_col
    FROM
      testdata2 AS t1
    INNER JOIN
      testdata2 AS t2
    ON
      (t1.a = t2.a)
    GROUP BY (t1.a + 1), (t2.a + 2)
    ```
    `((t1.a + 1) + (t2.a + 2))` is optimized to `(t1.a + t2.a + 3)`, which could not be derived from `ExpressionSet((t1.a + 1), (t2.a + 2))`.
    Maybe we should improve the `ReorderAssociativeOperator` rule by adding a GroupingExpressionSet to keep Aggregate.groupingExpressions, and respect these expressions during the optimization stage.
    
    ## How was this patch tested?
    
    Add new test case in `ReorderAssociativeOperatorSuite`.
    
    Author: jiangxingbo <ji...@gmail.com>
    
    Closes #14917 from jiangxb1987/rao.

commit 72edc7e958271cedb01932880550cfc2c0631204
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-09-13T22:11:55Z

    [SPARK-17531] Don't initialize Hive Listeners for the Execution Client
    
    ## What changes were proposed in this pull request?
    
    If a user provides listeners inside the Hive Conf, the configuration for these listeners is passed to the Hive Execution Client as well. This may cause issues for two reasons:
    1. The Execution Client will actually generate garbage
    2. The listener class needs to be both in the Spark Classpath and Hive Classpath
    
    This PR empties the listener configurations in `HiveUtils.newTemporaryConfiguration` so that the execution client will not contain the listener confs, but the metadata client will.
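    
    A rough sketch of the approach in plain Python (the helper is illustrative, and the exact set of Hive listener keys cleared by the patch may differ from the ones shown):
    
    ```python
    # Listener-related Hive keys are overridden with empty values for the
    # execution-only client, so the listener classes need not be on its classpath.
    LISTENER_KEYS = (
        "hive.metastore.pre.event.listeners",
        "hive.metastore.event.listeners",
        "hive.metastore.end.function.listeners",
    )
    
    def execution_client_conf(base_conf):
        conf = dict(base_conf)
        conf.update({key: "" for key in LISTENER_KEYS})
        return conf
    ```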
    
    ## How was this patch tested?
    
    Unit tests
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #15086 from brkyvz/null-listeners.

commit 37b93f54e89332b6b77bb02c1c2299614338fd7c
Author: gatorsmile <ga...@gmail.com>
Date:   2016-09-13T22:37:42Z

    [SPARK-17530][SQL] Add Statistics into DESCRIBE FORMATTED
    
    ### What changes were proposed in this pull request?
    Statistics information is missing from the output of `DESCRIBE FORMATTED`. This PR adds it. After the PR, the output looks like:
    ```
    +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
    |col_name                    |data_type                                                                                                             |comment|
    +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
    |key                         |string                                                                                                                |null   |
    |value                       |string                                                                                                                |null   |
    |                            |                                                                                                                      |       |
    |# Detailed Table Information|                                                                                                                      |       |
    |Database:                   |default                                                                                                               |       |
    |Owner:                      |xiaoli                                                                                                                |       |
    |Create Time:                |Tue Sep 13 14:36:57 PDT 2016                                                                                          |       |
    |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                                                                                          |       |
    |Location:                   |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-9982e1db-df17-4376-a140-dbbee0203d83/texttable|       |
    |Table Type:                 |MANAGED                                                                                                               |       |
    |Statistics:                 |sizeInBytes=5812, rowCount=500, isBroadcastable=false                                                                 |       |
    |Table Parameters:           |                                                                                                                      |       |
    |  rawDataSize               |-1                                                                                                                    |       |
    |  numFiles                  |1                                                                                                                     |       |
    |  transient_lastDdlTime     |1473802620                                                                                                            |       |
    |  totalSize                 |5812                                                                                                                  |       |
    |  COLUMN_STATS_ACCURATE     |false                                                                                                                 |       |
    |  numRows                   |-1                                                                                                                    |       |
    |                            |                                                                                                                      |       |
    |# Storage Information       |                                                                                                                      |       |
    |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                    |       |
    |InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                                                                              |       |
    |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                            |       |
    |Compressed:                 |No                                                                                                                    |       |
    |Storage Desc Parameters:    |                                                                                                                      |       |
    |  serialization.format      |1                                                                                                                     |       |
    +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
    ```
    
    Also improve the output of statistics in `DESCRIBE EXTENDED` by removing duplicate `Statistics`. Below is the example after the PR:
    
    ```
    +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
    |col_name                    |data_type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |comment|
    +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
    |key                         |string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |null   |
    |value                       |string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |null   |
    |                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |       |
    |# Detailed Table Information|CatalogTable(
    	Table: `default`.`texttable`
    	Owner: xiaoli
    	Created: Tue Sep 13 14:38:43 PDT 2016
    	Last Access: Wed Dec 31 16:00:00 PST 1969
    	Type: MANAGED
    	Schema: [StructField(key,StringType,true), StructField(value,StringType,true)]
    	Provider: hive
    	Properties: [rawDataSize=-1, numFiles=1, transient_lastDdlTime=1473802726, totalSize=5812, COLUMN_STATS_ACCURATE=false, numRows=-1]
    	Statistics: sizeInBytes=5812, rowCount=500, isBroadcastable=false
    	Storage(Location: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-8ea5c5a0-5680-4778-91cb-c6334cf8a708/texttable, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1]))|       |
    +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
    ```
    
    ### How was this patch tested?
    Manually tested.
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15083 from gatorsmile/descFormattedStats.

commit a454a4d86bbed1b6988da0a0e23b3e87a1a16340
Author: junyangq <qi...@gmail.com>
Date:   2016-09-14T04:01:03Z

    [SPARK-17317][SPARKR] Add SparkR vignette
    
    ## What changes were proposed in this pull request?
    
    This PR adds a SparkR vignette, which serves as a friendly guide to the functionality provided by SparkR.
    
    ## How was this patch tested?
    
    Manual test.
    
    Author: junyangq <qi...@gmail.com>
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    Author: Junyang Qian <ju...@databricks.com>
    
    Closes #14980 from junyangq/SPARKR-vignette.

commit def7c265f539f3e119f068b6e9050300d05b14a4
Author: Jagadeesan <as...@us.ibm.com>
Date:   2016-09-14T08:03:16Z

    [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…
    
    ## What changes were proposed in this pull request?
    
    The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document.
    
    … network timeout]
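    
    For illustration, a configuration that keeps the heartbeat interval well below the network timeout (the values below are the documented defaults and are shown only to make the relationship concrete):
    
    ```python
    from pyspark import SparkConf
    
    # spark.executor.heartbeatInterval should be significantly smaller than
    # spark.network.timeout.
    conf = (SparkConf()
            .set("spark.network.timeout", "120s")
            .set("spark.executor.heartbeatInterval", "10s"))
    ```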
    
    Author: Jagadeesan <as...@us.ibm.com>
    
    Closes #15042 from jagadeesanas2/SPARK-17449.

commit b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3
Author: Sami Jaktholm <sj...@outlook.com>
Date:   2016-09-14T08:38:30Z

    [SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as it was removed from the Scala API prior to Spark 2.0.0
    
    ## What changes were proposed in this pull request?
    
    This pull request removes the SparkContext.clearFiles() method from the PySpark API as the method was removed from the Scala API in 8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to an exception as PySpark tries to call the non-existent method on the JVM side.
    
    ## How was this patch tested?
    
    Existing tests (though none of them tested this particular method).
    
    Author: Sami Jaktholm <sj...@outlook.com>
    
    Closes #15081 from sjakthol/pyspark-sc-clearfiles.

commit 18b4f035f40359b3164456d0dab52dbc762ea3b4
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2016-09-14T08:49:15Z

    [CORE][DOC] remove redundant comment
    
    ## What changes were proposed in this pull request?
    In the comment, there is a redundant `the estimated`.
    
    This PR simply removes the redundant text and adjusts the formatting.
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #15091 from wangmiao1981/comment.

commit 4cea9da2ae88b40a5503111f8f37051e2372163e
Author: Ergin Seyfe <es...@fb.com>
Date:   2016-09-14T08:51:14Z

    [SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n)
    
    ## What changes were proposed in this pull request?
    Scala's List.length method is O(N), which makes the gatherCompressibilityStats function O(N^2). Eliminate the List.length calls by writing the code in a more idiomatic Scala way.
    
    https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36
    
    As suggested, the fix is extended to the HiveInspectors and AggregationIterator classes as well.
    
    ## How was this patch tested?
    Profiled a Spark job and found that CompressibleColumnBuilder was using 39% of the CPU; of that, CompressibleColumnBuilder->gatherCompressibilityStats accounted for 23%, and 6.24% of the CPU was spent on List.length, which is called inside gatherCompressibilityStats.
    
    After this change we started to save 6.24% of the CPU.
    
    Author: Ergin Seyfe <es...@fb.com>
    
    Closes #15032 from seyfe/gatherCompressibilityStats.

commit dc0a4c916151c795dc41b5714e9d23b4937f4636
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-14T09:10:16Z

    [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages
    
    ## What changes were proposed in this pull request?
    
    Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
    
    This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15075 from srowen/SPARK-17445.

commit 52738d4e099a19466ef909b77c24cab109548706
Author: gatorsmile <ga...@gmail.com>
Date:   2016-09-14T15:10:20Z

    [SPARK-17409][SQL] Do Not Optimize Query in CTAS More Than Once
    
    ### What changes were proposed in this pull request?
    As explained in https://github.com/apache/spark/pull/14797:
    >Some analyzer rules make assumptions about logical plans, and the optimizer may break these assumptions; we should not pass an optimized query plan into QueryExecution (it will be analyzed again), otherwise we may hit some weird bugs.
    For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that the rule should not apply twice. But an Optimizer rule will remove this placeholder, which breaks the assumption; the rule is then applied twice and produces a wrong result.
    
    We should not optimize the query in CTAS more than once. For example,
    ```Scala
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    ```
    Before this PR, the results do not match
    ```
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    ```
    After this PR, the results match.
    ```
    +---+----------------------+
    |id |num                   |
    +---+----------------------+
    |99 |99.000000000000000000 |
    |100|100.000000000000000000|
    +---+----------------------+
    ```
    
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`.
    
    ### How was this patch tested?
    Added a test
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15048 from gatorsmile/ctasOptimized.

commit 6d06ff6f7e2dd72ba8fe96cd875e83eda6ebb2a9
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-09-14T17:10:01Z

    [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python
    
    ## What changes were proposed in this pull request?
    
    In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.
    
    The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).
    
    In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations.
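    
    For reference, the two queries being compared (assumes an existing SparkSession named `spark`):
    
    ```python
    df = spark.range(1 << 20)
    
    first_row_take = df.take(1)              # single-stage job, scans only one partition
    first_row_limit = df.limit(1).collect()  # previously planned as a slower two-stage job in PySpark
    ```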
    
    ## How was this patch tested?
    
    Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #15068 from JoshRosen/pyspark-collect-limit.

commit a79838bdeeb12cec4d50da3948bd8a33777e53a6
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-09-14T17:33:56Z

    [MINOR][SQL] Add missing functions for some options in SQLConf and use them where applicable
    
    ## What changes were proposed in this pull request?
    
    I first thought these were missing because they are somewhat hidden options, but it seems they are simply missing.
    
    For example, `spark.sql.parquet.mergeSchema` is documented in [sql-programming-guide.md](https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md) but its function is missing, whereas many options such as `spark.sql.join.preferSortMergeJoin` are not documented but each have their own function.
    
    So, this PR suggests making them consistent by adding the missing functions for some options in `SQLConf` and using them where applicable, in order to make them more readable.
    
    ## How was this patch tested?
    
    Existing tests should cover this.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #14678 from HyukjinKwon/sqlconf-cleanup.

commit 040e46979d5f90edc7f9be3cbedd87e8986e8053
Author: Xin Wu <xi...@us.ibm.com>
Date:   2016-09-14T19:14:29Z

    [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BY
    
    ## What changes were proposed in this pull request?
    Currently, the ORDER BY clause returns null values according to the sorting order (ASC|DESC), treating a null value as always smaller than non-null values.
    However, the SQL:2003 standard supports NULLS FIRST and NULLS LAST to allow users to specify whether null values should be returned first or last, regardless of the sorting order (ASC|DESC).
    
    This PR adds support for this feature.
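    
    For example, with the new syntax (assumes an existing SparkSession named `spark` and a temporary view `t` with a nullable column `x`):
    
    ```python
    spark.sql("SELECT x FROM t ORDER BY x ASC NULLS LAST").show()    # nulls at the end despite ASC
    spark.sql("SELECT x FROM t ORDER BY x DESC NULLS FIRST").show()  # nulls first despite DESC
    ```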
    
    ## How was this patch tested?
    New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries.
    
    Author: Xin Wu <xi...@us.ibm.com>
    
    Closes #14842 from xwu0226/SPARK-10747.

commit ff6e4cbdc80e2ad84c5d70ee07f323fad9374e3e
Author: Kishor Patil <kp...@yahoo-inc.com>
Date:   2016-09-14T19:19:35Z

    [SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as Failed
    
    ## What changes were proposed in this pull request?
    
    Due to race conditions, the `assert(numExecutorsRunning <= targetNumExecutors)` can fail, causing an `AssertionError`. So the assertion is removed and the conditional check is instead moved before launching a new container:
    ```
    java.lang.AssertionError: assertion failed
            at scala.Predef$.assert(Predef.scala:156)
            at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
            at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    ```
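    
    Conceptually, the change replaces the post-hoc assertion with a pre-launch check, roughly as sketched below (the names are illustrative, not the YarnAllocator API):
    
    ```python
    def maybe_launch(container, num_running, target, launch, release):
        """Only launch when below the target; otherwise hand the container back
        instead of asserting after the fact."""
        if num_running >= target:
            release(container)
            return num_running
        launch(container)
        return num_running + 1
    ```
    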
    ## How was this patch tested?
    This was manually tested using a large ForkAndJoin job with Dynamic Allocation enabled, to validate that the previously failing job succeeds without any such exception.
    
    Author: Kishor Patil <kp...@yahoo-inc.com>
    
    Closes #15069 from kishorvpatil/SPARK-17511.

commit e33bfaed3b160fbc617c878067af17477a0044f5
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-09-14T20:33:51Z

    [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value can be read thread-safely
    
    ## What changes were proposed in this pull request?
    
    Make the values of CollectionAccumulator and SetAccumulator readable in a thread-safe way, to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463).
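    
    A conceptual Python sketch of the idea (not Spark's Scala code): guard adds and reads with the same lock and hand readers a snapshot copy.
    
    ```python
    import threading
    
    class CollectionAccumulatorSketch:
        """Readers get a defensive copy taken under the lock used for adds, so a
        concurrent add can never corrupt an in-progress read of the value."""
    
        def __init__(self):
            self._lock = threading.Lock()
            self._items = []
    
        def add(self, item):
            with self._lock:
                self._items.append(item)
    
        @property
        def value(self):
            with self._lock:
                return list(self._items)
    ```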
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15063 from zsxwing/SPARK-17463.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-14761][SQL][WIP] Reject invalid join me...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak commented on the pull request:

    https://github.com/apache/spark/pull/12691#issuecomment-214931412
  
    Thanks, Josh. I'll add the regression test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12691: [Spark-14761][SQL][WIP] Reject invalid join methods when...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak commented on the issue:

    https://github.com/apache/spark/pull/12691
  
    Hi @holdenk, I am still interested in working on this, but it looks like I pulled and merged the master branch instead of rebasing onto it. Can I close this and open another pull request? Or how should I proceed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak closed the pull request at:

    https://github.com/apache/spark/pull/12691


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-14761][SQL][WIP] Reject invalid join me...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12691#issuecomment-214648455
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-14761][SQL][WIP] Reject invalid join me...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12691#issuecomment-214912721
  
    Per my comments in the JIRA, I'd add a regression test for this since it's really easy to do. See the examples in `python/sql/tests.py`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak closed the pull request at:

    https://github.com/apache/spark/pull/12691


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12691: [Spark-14761][SQL][WIP] Reject invalid join methods when...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak commented on the issue:

    https://github.com/apache/spark/pull/12691
  
    @srowen @holdenk I have created a new branch, applied the patch, and created a new pull request. Closing this pull request.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-14761][SQL][WIP] Reject invalid join me...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak commented on the pull request:

    https://github.com/apache/spark/pull/12691#issuecomment-215205728
  
    @JoshRosen I have added regression test. Please have a look.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-14761][SQL][WIP] Reject invalid join me...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak commented on the pull request:

    https://github.com/apache/spark/pull/12691#issuecomment-214647792
  
    @JoshRosen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12691: [Spark-14761][SQL][WIP] Reject invalid join methods when...

Posted by bkpathak <gi...@git.apache.org>.
Github user bkpathak commented on the issue:

    https://github.com/apache/spark/pull/12691
  
    Hi @JoshRosen, could you please look at the pull request?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12691: [Spark-14761][SQL][WIP] Reject invalid join methods when...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/12691
  
    Thanks for working on this @bkpathak :) Are you still interested in working on this? If so, can you update this to the latest master, and then we can try to find a committer to take a more thorough look. Thanks for taking the time to add the regression test @JoshRosen asked you to add, and sorry this slipped through the cracks. (Also, if you think it's ready to be merged once you update it, you should drop the "[WIP]" part of the PR title so reviewers are more inclined to take a look.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12691: [Spark-14761][SQL][WIP] Reject invalid join methods when...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12691
  
    @bkpathak you can just close this and reopen. Although you can fix this with git surgery and a force-push, sometimes it's not worth it if you can just branch, patch the commit(s) you want to keep, and proceed from there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org