Posted to reviews@spark.apache.org by deeppark <gi...@git.apache.org> on 2017/10/08 10:48:58 UTC

[GitHub] spark pull request #19455: Branch 2.0

GitHub user deeppark opened a pull request:

    https://github.com/apache/spark/pull/19455

    Branch 2.0

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19455.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19455
    
----
commit 5ec3e6680a091883369c002ae599d6b03f38c863
Author: Ergin Seyfe <es...@fb.com>
Date:   2016-10-11T19:51:08Z

    [SPARK-17816][CORE][BRANCH-2.0] Fix ConcurrentModificationException issue in BlockStatusesAccumulator
    
    ## What changes were proposed in this pull request?
    Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is thread safe, along with a few more cleanups.
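
    A minimal sketch (not the patch itself) of the thread-safe accumulator the fix switches to; the local session setup and names here are illustrative assumptions:
    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("collection-acc-demo").getOrCreate()
    val sc = spark.sparkContext

    // CollectionAccumulator is backed by a synchronized list, so concurrent updates
    // from tasks do not hit ConcurrentModificationException.
    val acc = sc.collectionAccumulator[String]("blockStatuses")
    sc.parallelize(1 to 100, 8).foreach(i => acc.add(s"block-$i"))
    println(acc.value.size())  // 100
    ```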
    
    ## How was this patch tested?
    Tested in master branch and cherry-picked.
    
    Author: Ergin Seyfe <es...@fb.com>
    
    Closes #15425 from seyfe/race_cond_jsonprotocal_branch-2.0.

commit e68e95e947045704d3e6a36bb31e104a99d3adcc
Author: Alexander Pivovarov <ap...@gmail.com>
Date:   2016-10-12T05:31:21Z

    Fix hadoop.version in building-spark.md
    
    A couple of the mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number.
    
    Author: Alexander Pivovarov <ap...@gmail.com>
    
    Closes #15440 from apivovarov/patch-1.
    
    (cherry picked from commit 299eb04ba05038c7dbb3ecf74a35d4bbfa456643)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit f3d82b53c42a971deedc04de6950b9228e5262ea
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2016-10-12T05:36:57Z

    [SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is incorrect.
    
    ## What changes were proposed in this pull request?
    
    In `programming-guide.md`, the URL linking to `AccumulatorV2` points to `api/scala/index.html#org.apache.spark.AccumulatorV2`, but the correct target is `api/scala/index.html#org.apache.spark.util.AccumulatorV2`.
    
    ## How was this patch tested?
    manual test.
    
    Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
    
    Closes #15439 from sarutak/SPARK-17880.
    
    (cherry picked from commit b512f04f8e546843d5a3f35dcc6b675b5f4f5bc0)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit f12b74c02eec9e201fec8a16dac1f8e549c1b4f0
Author: cody koeninger <co...@koeninger.org>
Date:   2016-10-12T07:40:47Z

    [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad
    
    ## What changes were proposed in this pull request?
    
    Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer.
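
    A minimal sketch, assuming an existing StreamingContext `ssc` and made-up topic/group names, of the documented Kafka 0-10 direct-stream pattern with a distinct group.id per stream:
    ```scala
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "stream-a",                       // unique to this stream
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val streamA = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))

    // A second stream must NOT reuse "stream-a"; give it its own group.id.
    val streamB = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("topicB"), kafkaParams + ("group.id" -> "stream-b")))
    ```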
    
    ## How was this patch tested?
    
    I built jekyll doc and made sure it looked ok.
    
    Author: cody koeninger <co...@koeninger.org>
    
    Closes #15442 from koeninger/SPARK-17853.
    
    (cherry picked from commit c264ef9b1918256a5018c7a42a1a2b42308ea3f7)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 4dcbde48de6c46e2fd8ccfec732b8ff5c24f97a4
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-10-11T06:29:52Z

    [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13
    
    ## What changes were proposed in this pull request?
    Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL
    
    ## How was this patch tested?
    Added a unit test that fails with a ValueError when using the previous Pyrolite version (4.9) with Python 3.
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
    
    (cherry picked from commit 658c7147f5bf637f36e8c66b9207d94b1e7c74c5)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 5451541d1113aa75bab80914ca51a913f6ba4753
Author: prigarg <pr...@adobe.com>
Date:   2016-10-12T17:14:45Z

    [SPARK-17884][SQL] To resolve Null pointer exception when casting from empty string to interval type.
    
    ## What changes were proposed in this pull request?
    This change adds a check in the castToInterval method of the Cast expression so that, if the converted value is null, the isNull variable is set to true.
    
    Previously, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for this reason.
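
    A minimal repro sketch of the fixed behavior, using the internal Catalyst expressions named above (shown only for illustration, not as a public API):
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
    import org.apache.spark.sql.types.CalendarIntervalType

    // Before the fix this threw a NullPointerException; after the fix it evaluates to null.
    val result = Cast(Literal(""), CalendarIntervalType).eval()
    println(result)  // null
    ```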
    
    ## How was this patch tested?
    Added test case in CastSuite.scala
    
    jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884
    
    Author: prigarg <pr...@adobe.com>
    
    Closes #15449 from priyankagargnitk/SPARK-17884.
    
    (cherry picked from commit d5580ebaa086b9feb72d5428f24c5b60cd7da745)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d55ba3063da1a5d12e3b09e55f089f16ecf327bb
Author: Hossein <ho...@databricks.com>
Date:   2016-10-12T17:32:38Z

    [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
    
    ## What changes were proposed in this pull request?
    If the R data structure that is being parallelized is larger than `INT_MAX`, we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
    
    I tested this on my MacBook. The following code works with this patch:
    ```R
    intMax <- .Machine$integer.max
    largeVec <- 1:intMax
    rdd <- SparkR:::parallelize(sc, largeVec, 2)
    ```
    
    ## How was this patch tested?
    * [x] Unit tests
    
    Author: Hossein <ho...@databricks.com>
    
    Closes #15375 from falaki/SPARK-17790.
    
    (cherry picked from commit 5cc503f4fe9737a4c7947a80eecac053780606df)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 050b8177e27df06d33a6f6f2b3b6a952b0d03ba6
Author: cody koeninger <co...@koeninger.org>
Date:   2016-10-12T22:22:06Z

    [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice
    
    ## What changes were proposed in this pull request?
    
    Alternative approach to https://github.com/apache/spark/pull/15387
    
    Author: cody koeninger <co...@koeninger.org>
    
    Closes #15401 from koeninger/SPARK-17782-alt.
    
    (cherry picked from commit f9a56a153e0579283160519065c7f3620d12da3e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 5903dabc57c07310573babe94e4f205bdea6455f
Author: Brian Cho <bc...@fb.com>
Date:   2016-10-13T03:43:18Z

    [SPARK-16827][BRANCH-2.0] Avoid reporting spill metrics as shuffle metrics
    
    ## What changes were proposed in this pull request?
    
    Fix a bug where spill metrics were being reported as shuffle metrics. Eventually these spill metrics should be reported (SPARK-3577), but separate from shuffle metrics. The fix itself basically reverts the line to what it was in 1.6.
    
    ## How was this patch tested?
    
    Cherry-picked from master (#15347)
    
    Author: Brian Cho <bc...@fb.com>
    
    Closes #15455 from dafrista/shuffle-metrics-2.0.

commit ab00e410c6b1d7dafdfabcea1f249c78459b94f0
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-10-13T04:40:45Z

    [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once
    
    ## What changes were proposed in this pull request?
    
    The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch.
    You may come across stacktraces that look like:
    ```
    java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.lang.StringCoding.encode(StringCoding.java:350)
    at java.lang.String.getBytes(String.java:941)
    at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
    
    ```
    The safer way is to write to an output stream so that we don't have to materialize a huge string.
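
    A generic sketch of the idea (not the actual CompactibleFileStreamLog code): serialize entries directly to an OutputStream instead of building one huge String first.
    ```scala
    import java.io.{BufferedOutputStream, FileOutputStream, OutputStream}
    import java.nio.charset.StandardCharsets

    // Stream each entry to the sink; memory use stays bounded by one entry, not the whole log.
    def writeLog(entries: Iterator[String], out: OutputStream): Unit = {
      entries.foreach { e =>
        out.write(e.getBytes(StandardCharsets.UTF_8))
        out.write('\n')
      }
    }

    val out = new BufferedOutputStream(new FileOutputStream("/tmp/compact-log"))
    try writeLog(Iterator.tabulate(1000000)(i => s"file-$i add"), out) finally out.close()
    ```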
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #15437 from brkyvz/ser-to-stream.
    
    (cherry picked from commit edeb51a39d76d64196d7635f52be1b42c7ec4341)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit d38f38a093b4dff32c686675d93ab03e7a8f4908
Author: buzhihuojie <re...@gmail.com>
Date:   2016-10-13T05:51:54Z

    minor doc fix for Row.scala
    
    ## What changes were proposed in this pull request?
    
    minor doc fix for "getAnyValAs" in class Row
    
    ## How was this patch tested?
    
    None.
    
    
    Author: buzhihuojie <re...@gmail.com>
    
    Closes #15452 from david-weiluo-ren/minorDocFixForRow.
    
    (cherry picked from commit 7222a25a11790fa9d9d1428c84b6f827a785c9e8)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d7fa3e32421c73adfa522adfeeb970edd4c22eb3
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-10-13T20:31:50Z

    [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
    
    ## What changes were proposed in this pull request?
    
    Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets.
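
    For illustration, the underlying Kafka consumer calls (not Spark's KafkaSource code) that fetch earliest offsets without relying on `poll(0)`; broker address, topic, and client version are assumptions:
    ```scala
    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    val partitions = Seq(new TopicPartition("topic", 0), new TopicPartition("topic", 1))
    consumer.assign(partitions.asJava)
    consumer.seekToBeginning(partitions.asJava)   // no poll(0), so no accidental offset updates
    val earliest = partitions.map(tp => tp -> consumer.position(tp)).toMap
    println(earliest)
    ```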
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15397 from zsxwing/SPARK-17834.
    
    (cherry picked from commit 08eac356095c7faa2b19d52f2fb0cbc47eb7d1d1)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit c53b8374911e801ed98c1436c384f0aef076eaab
Author: Davies Liu <da...@databricks.com>
Date:   2016-10-14T21:45:20Z

    [SPARK-17863][SQL] should not add column into Distinct
    
    ## What changes were proposed in this pull request?
    
    When resolving an attribute in a sort, we pull up a column from the grandchild into the child. That is wrong when the child is Distinct, because the added column changes the behavior of Distinct, so we should not do that.
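
    A hedged illustration of the hazard (session and data are made up): adding the sort column underneath Distinct changes which rows Distinct produces.
    ```scala
    import spark.implicits._

    val t = Seq((1, 10), (1, 20), (2, 30)).toDF("a", "b")
    t.select("a").distinct().count()        // 2 rows: a = 1, 2
    t.select("a", "b").distinct().count()   // 3 rows, so pulling b under Distinct is not equivalent
    ```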
    
    ## How was this patch tested?
    
    Added regression test.
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #15489 from davies/order_distinct.
    
    (cherry picked from commit da9aeb0fde589f7c21c2f4a32036a68c0353965d)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 2a1b10b649a8d4c077a0e19df976f1fd36b7e266
Author: Jun Kim <i2...@gmail.com>
Date:   2016-10-15T07:36:55Z

    [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc
    
    ## What changes were proposed in this pull request?
    
    ### Before:
    ```scala
    SparkSession.builder()
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value").
         .getOrCreate()
    ```
    
    ### After:
    ```scala
    SparkSession.builder()
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value")
         .getOrCreate()
    ```
    
    There was one unexpected dot!
    
    Author: Jun Kim <i2...@gmail.com>
    
    Closes #15498 from tae-jun/SPARK-17953.
    
    (cherry picked from commit 36d81c2c68ef4114592b069287743eb5cb078318)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 3cc2fe5b94d3bcdfb4f28bfa6d8e51fe67d6e1b4
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-10-17T05:15:47Z

    [SPARK-17819][SQL][BRANCH-2.0] Support default database in connection URIs for Spark Thrift Server
    
    ## What changes were proposed in this pull request?
    
    Currently, Spark Thrift Server ignores the default database in the connection URI. This PR adds support for it, as shown in the following.
    
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "create database testdb"
    $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "create table t(a int)"
    $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "show tables"
    ...
    +------------+--------------+--+
    | tableName  | isTemporary  |
    +------------+--------------+--+
    | t          | false        |
    +------------+--------------+--+
    1 row selected (0.347 seconds)
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "show tables"
    ...
    +------------+--------------+--+
    | tableName  | isTemporary  |
    +------------+--------------+--+
    +------------+--------------+--+
    No rows selected (0.098 seconds)
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins with a newly added testsuite.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #15507 from dongjoon-hyun/SPARK-17819-BACK.

commit ca66f52ff81c19e17ca3733eac92d66012a3ec6e
Author: Weiqing Yang <ya...@gmail.com>
Date:   2016-10-17T05:38:30Z

    [MINOR][SQL] Add prettyName for current_database function
    
    ## What changes were proposed in this pull request?
    Added a `prettyName` for the current_database function.
    
    ## How was this patch tested?
    Manually.
    
    Before:
    ```
    scala> sql("select current_database()").show
    +-----------------+
    |currentdatabase()|
    +-----------------+
    |          default|
    +-----------------+
    ```
    
    After:
    ```
    scala> sql("select current_database()").show
    +------------------+
    |current_database()|
    +------------------+
    |           default|
    +------------------+
    ```
    
    Author: Weiqing Yang <ya...@gmail.com>
    
    Closes #15506 from weiqingy/prettyName.
    
    (cherry picked from commit 56b0f5f4d1d7826737b81ebc4ec5dad83b6463e3)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d1a02117862b20d0e8e58f4c6da6a97665a02590
Author: gatorsmile <ga...@gmail.com>
Date:   2016-10-17T07:29:53Z

    [SPARK-17892][SQL][2.0] Do Not Optimize Query in CTAS More Than Once #15048
    
    ### What changes were proposed in this pull request?
    This PR is to backport https://github.com/apache/spark/pull/15048 and https://github.com/apache/spark/pull/15459.
    
    However, in 2.0, we do not have a unified logical node `CreateTable` and the analyzer rule `PreWriteCheck` is also different. To minimize the code changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it as a new PR to review. Thanks!
    
    As explained in https://github.com/apache/spark/pull/14797:
    >Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (it will be analyzed again); otherwise we may hit some weird bugs.
    For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that the rule should not apply twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and causes a wrong result.
    
    We should not optimize the query in CTAS more than once. For example,
    ```Scala
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    ```
    Before this PR, the results do not match
    ```
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    ```
    After this PR, the results match.
    ```
    +---+----------------------+
    |id |num                   |
    +---+----------------------+
    |99 |99.000000000000000000 |
    |100|100.000000000000000000|
    +---+----------------------+
    ```
    
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`.
    
    ### How was this patch tested?
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15502 from gatorsmile/ctasOptimize2.0.

commit a0d9015b3f34582c5d43bd31bbf35a0e92b1da29
Author: Maxime Rihouey <ma...@gmail.com>
Date:   2016-10-17T09:56:22Z

    Fix example of tf_idf with minDocFreq
    
    ## What changes were proposed in this pull request?
    
    The Python example for tf_idf with the "minDocFreq" parameter is not set up properly: the same variable is used to transform the documents both with and without "minDocFreq".
    The IDF(minDocFreq=2) model is stored in the variable "idfIgnore", but the original variable "idf" is then used to transform "tf" instead of "idfIgnore".
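
    A Scala sketch of the corrected pattern the doc fix describes (the file path is a placeholder and `sc` is an existing SparkContext): the transform must use the model fit with minDocFreq, not the default one.
    ```scala
    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val documents = sc.textFile("data/kmeans_data.txt").map(_.split(" ").toSeq)
    val tf = new HashingTF().transform(documents).cache()

    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)

    val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
    val tfidfIgnore = idfIgnore.transform(tf)   // the bug was calling idf.transform(tf) here
    ```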
    
    ## How was this patch tested?
    
    Before the fix, the results for "tfidf" and "tfidfIgnore" were the same:
    tfidf:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    tfidfIgnore:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    
    After the fix, they are as they should be:
    tfidf:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    tfidfIgnore:
    (1048576,[1046921],[0.0])
    (1048576,[1046920],[0.0])
    (1048576,[1046923],[0.0])
    (1048576,[892732],[0.0])
    (1048576,[892733],[0.0])
    (1048576,[892734],[0.0])
    
    Author: Maxime Rihouey <ma...@gmail.com>
    
    Closes #15503 from maximerihouey/patch-1.
    
    (cherry picked from commit e3bf37fa3ada43624b2e77bef90ad3d3dbcd8ce1)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 881e0eb05782ea74cf92a62954466b14ea9e05b6
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-10-17T23:56:40Z

    [SPARK-17731][SQL][STREAMING] Metrics for structured streaming for branch-2.0
    
    **This PR adds the same metrics to branch-2.0 that was added to master in #15307.**
    
    The differences compared to the #15307 are
    - The new configuration is added only in the `SQLConf` object (i.e. `SQLConf.STREAMING_METRICS_ENABLED`) and not in the `SQLConf` class (i.e. no `SQLConf.isStreamingMetricsEnabled`). Spark master has all the streaming configurations exposed as actual fields in the SQLConf class (e.g. [streamingPollingDelay](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L642)), but [not in Spark 2.0](https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L608). So I didn't add it in this 2.0 PR.
    
    - In the previous master PR, the above configuration was read in `StreamExecution` as `sparkSession.sessionState.conf.isStreamingMetricsEnabled`. In this 2.0 PR, I am instead reading it as `sparkSession.conf.get(STREAMING_METRICS_ENABLED)` (i.e. no `sessionState`) to keep it consistent with how other confs are read in `StreamExecution` (e.g. [STREAMING_POLLING_DELAY](https://github.com/tdas/spark/blob/ee8e899e4c274c363a8b4d13e8bf57b0b467a50e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L62)).
    
    - Different Mima exclusions
    
    ------
    
    ## What changes were proposed in this pull request?
    
    Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
    https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
    
    Specifically, this PR adds the following public API changes (a short usage sketch follows this list).
    
    - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
    
    - `StreamingQueryStatus` has the following important fields
      - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
      - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources
      - ~~outputRate~~ - *Does not work with wholestage codegen*
      - latency - Current average latency between the data being available in source and the sink writing the corresponding output
      - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
      - sinkStatus: SinkStatus - Current status of the sink
      - triggerStatus - Low-level detailed status of the last completed/currently active trigger
        - latencies - getOffset, getBatch, full trigger, wal writes
        - timestamps - trigger start, finish, after getOffset, after getBatch
        - numRows - input, output, state total/updated rows for aggregations
    
    - `SourceStatus` has the following important fields
      - inputRate - Current rate (rows/sec) at which data is being generated by the source
      - processingRate - Current rate (rows/sec) at which the query is processing data from the source
      - triggerStatus - Low-level detailed status of the last completed/currently active trigger
    
    - Python API for `StreamingQuery.status()`
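
    A minimal usage sketch of the monitoring surface above (field names taken from this description; `query` is assumed to be an already-started StreamingQuery):
    ```scala
    val status = query.status                       // StreamingQueryStatus
    println(s"input rate: ${status.inputRate} rows/sec")
    println(s"processing rate: ${status.processingRate} rows/sec")
    status.sourceStatuses.foreach(s => println(s"source: $s"))
    println(s"sink: ${status.sinkStatus}")
    ```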
    
    ### Breaking changes to existing APIs
    
    **Existing direct public facing APIs**
    - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
      - Branch 2.0 should have it deprecated, master should have it removed.
    
    **Existing advanced listener APIs**
    - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
       - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)
    
    - The field `queryInfo` in the listener events `QueryStarted`, `QueryProgress`, and `QueryTerminated` was renamed to `queryStatus`, and its return type changed to `StreamingQueryStatus`.
    
    - The field `offsetDesc` in `SourceStatus` was `Option[String]`; it was converted to `String`.
    
    - For `SourceStatus` and `SinkStatus`, the constructors were made private instead of private[sql] to make them more Java-safe. Instead, `private[sql]` `SourceStatus.apply()`/`SinkStatus.apply()` methods were added, which are harder to accidentally use from Java.
    
    ## How was this patch tested?
    
    Old and new unit tests.
    - Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
    - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
    - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
    - Source-specific tests for making sure input rows are counted are in the source-specific test suites.
    - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.
    
    Metrics were also manually tested using the Ganglia sink.
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15472 from tdas/SPARK-17731-branch-2.0.

commit 01520de6b999c73b5e302778487d8bd1db8fdd2e
Author: Liwei Lin <lw...@gmail.com>
Date:   2016-10-18T07:49:57Z

    [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite
    
    This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.
    
    ## What changes were proposed in this pull request?
    There were two sources of flakiness in the StreamingQueryListener test.
    
    - When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
    ```
    +-----------------------------------+--------------------------------+
    |      StreamExecution thread       |         testing thread         |
    +-----------------------------------+--------------------------------+
    |  ManualClock.waitTillTime(100) {  |                                |
    |        _isWaiting = true          |                                |
    |            wait(10)               |                                |
    |        still in wait(10)          |  if (_isWaiting) advance(100)  |
    |        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
    |        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
    |      wake up from wait(10)        |                                |
    |       current time is 600         |                                |
    |       _isWaiting = false          |                                |
    |  }                                |                                |
    +-----------------------------------+--------------------------------+
    ```
    
    - The second source of flakiness is that adding data to the memory stream may get processed in any trigger, not just the first one.
    
    My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).
    
    In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.
    
    ## How was this patch tested?
    Ran the existing unit test MANY TIMES in Jenkins.
    
    Author: Tathagata Das <ta...@gmail.com>
    Author: Liwei Lin <lw...@gmail.com>
    
    Closes #15519 from tdas/metrics-flaky-test-fix.
    
    (cherry picked from commit 7d878cf2da04800bc4147b05610170865b148c64)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 9e806f2a4fbc0e7d1441a3eda375cba2fda8ffe5
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-10-18T09:29:55Z

    [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 compatibility
    
    ## What changes were proposed in this pull request?
    
    Scala 2.10 does not have Option.contains, which broke the Scala 2.10 build.
    
    ## How was this patch tested?
    Locally compiled and ran sql/core unit tests in 2.10
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15531 from tdas/metrics-flaky-test-fix-1.

commit 2aa25833c6f40af13a03a813b5f75d515f689577
Author: gatorsmile <ga...@gmail.com>
Date:   2016-10-18T17:58:19Z

    [SPARK-17751][SQL][BACKPORT-2.0] Remove spark.sql.eagerAnalysis and Output the Plan if Existed in AnalysisException
    
    ### What changes were proposed in this pull request?
    This PR is to backport the fix https://github.com/apache/spark/pull/15316 to 2.0.
    
    Dataset always does eager analysis now, so `spark.sql.eagerAnalysis` is no longer used and needs to be removed.
    
    This PR also outputs the plan in the AnalysisException. Without the fix, the analysis error looks like:
    ```
    cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12
    ```
    
    After the fix, the analysis error becomes:
    ```
    org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12;
    'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 44 ELSE 0 END, None), v#6]
    +- SubqueryAlias t
       +- Project [_1#2 AS k#5, _2#3 AS v#6]
          +- LocalRelation [_1#2, _2#3]
    ```
    
    ### How was this patch tested?
    N/A
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15529 from gatorsmile/eagerAnalysis20.

commit 26e978a93f029e1a1b5c7524d0b52c8141b70997
Author: Yu Peng <lo...@gmail.com>
Date:   2016-10-18T20:23:31Z

    [SPARK-17711] Compress rolled executor log
    
    ## What changes were proposed in this pull request?
    
    This PR adds support for executor log compression.
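
    A hedged configuration sketch (flag names per the Spark configuration docs; values are arbitrary) showing how rolled executor logs can be compressed:
    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.logs.rolling.strategy", "size")
      .set("spark.executor.logs.rolling.maxSize", "134217728")        // roll at 128 MB
      .set("spark.executor.logs.rolling.enableCompression", "true")   // compress rolled files
    ```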
    
    ## How was this patch tested?
    
    Unit tests
    
    cc: yhuai tdas mengxr
    
    Author: Yu Peng <lo...@gmail.com>
    
    Closes #15285 from loneknightpy/compress-executor-log.
    
    (cherry picked from commit 231f39e3f6641953a90bc4c40444ede63f363b23)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 6ef9231377c7cce949dc7a988bb9d7a5cb3e458d
Author: Weiqing Yang <ya...@gmail.com>
Date:   2016-10-18T20:38:14Z

    [MINOR][DOC] Add more built-in sources in sql-programming-guide.md
    
    ## What changes were proposed in this pull request?
    Add more built-in sources in sql-programming-guide.md.
    
    ## How was this patch tested?
    Manually.
    
    Author: Weiqing Yang <ya...@gmail.com>
    
    Closes #15522 from weiqingy/dsDoc.
    
    (cherry picked from commit 20dd11096cfda51e47b9dbe3b715a12ccbb4ce1d)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit f6b87939cb90bf4a0996b3728c1bccdf5e24dd4e
Author: cody koeninger <co...@koeninger.org>
Date:   2016-10-18T21:01:49Z

    [SPARK-17841][STREAMING][KAFKA] drain commitQueue
    
    ## What changes were proposed in this pull request?
    
    Actually drain the commit queue rather than just iterating over it.
    iterator() on a ConcurrentLinkedQueue won't remove items from the queue; poll() will.
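
    A generic sketch of the difference (plain java.util.concurrent, not the Spark Kafka code): iterating a ConcurrentLinkedQueue leaves its elements in place, while poll() drains it.
    ```scala
    import java.util.concurrent.ConcurrentLinkedQueue

    val q = new ConcurrentLinkedQueue[String]()
    q.add("offsets-batch-1"); q.add("offsets-batch-2")

    // Iteration alone does not remove anything.
    val it = q.iterator()
    while (it.hasNext) println(s"saw ${it.next()}")
    println(q.size)   // still 2

    // poll() removes elements as they are consumed.
    var next = q.poll()
    while (next != null) { println(s"committing $next"); next = q.poll() }
    println(q.size)   // 0
    ```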
    
    ## How was this patch tested?
    Unit tests
    
    Author: cody koeninger <co...@koeninger.org>
    
    Closes #15407 from koeninger/SPARK-17841.
    
    (cherry picked from commit cd106b050ff789b6de539956a7f01159ab15c820)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 99943bf6905ca82a2c3e16e5d807fb572fa3dd3b
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-10-19T00:31:21Z

    [SPARK-17731][SQL][STREAMING][FOLLOWUP] Refactored StreamingQueryListener APIs for branch-2.0
    
    This is the branch-2.0 PR of #15530 to make the APIs consistent with the master. Since these APIs are experimental and not direct user facing (StreamingQueryListener is an advanced Structured Streaming API), it's okay to change them in branch-2.0.
    
    ## What changes were proposed in this pull request?
    
    As per rxin's request, here are the further API changes:
    - Changed `Stream(Started/Progress/Terminated)` events to `Stream*Event`
    - Changed the fields in `StreamingQueryListener.on***` from `query*` to `event`
    
    ## How was this patch tested?
    Existing unit tests.
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15535 from tdas/SPARK-17731-1-branch-2.0.

commit 3796a98cf3efad1dcbef536b295c7c47bf47d5dd
Author: Yu Peng <lo...@gmail.com>
Date:   2016-10-19T02:43:08Z

    [SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error
    
    ## What changes were proposed in this pull request?
    
    Fix hadoop2.2 compilation error.
    
    ## How was this patch tested?
    
    Existing tests.
    
    cc tdas zsxwing
    
    Author: Yu Peng <lo...@gmail.com>
    
    Closes #15537 from loneknightpy/fix-17711.
    
    (cherry picked from commit 2629cd74602cfe77188b76428fed62a7a7149315)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit cdd2570e6dbfc5af68d0c9a49e4493e4e5e53020
Author: Tommy YU <tu...@163.com>
Date:   2016-10-19T04:15:32Z

    [SPARK-18001][DOCUMENT] fix broke link to SparkDataFrame
    
    ## What changes were proposed in this pull request?
    
    In http://spark.apache.org/docs/latest/sql-programming-guide.html, section "Untyped Dataset Operations (aka DataFrame Operations)", the link to the R DataFrame doesn't work; it returns:
    
    The requested URL /docs/latest/api/R/DataFrame.html was not found on this server.
    
    The correct link is SparkDataFrame.html for Spark 2.0.
    
    ## How was this patch tested?
    
    Manually checked.
    
    Author: Tommy YU <tu...@163.com>
    
    Closes #15543 from Wenpei/spark-18001.
    
    (cherry picked from commit f39852e59883c214b0d007faffb406570ea3084b)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 995f602d27bdcf9e6787d93dbea2357e6dc6ccaa
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-10-20T02:36:21Z

    [SPARK-17989][SQL] Check ascendingOrder type in sort_array function rather than throwing ClassCastException
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to check the type of the second argument, `ascendingOrder`, rather than throwing a `ClassCastException`.
    
    ```sql
    select sort_array(array('b', 'd'), '1');
    ```
    
    **Before**
    
    ```
    16/10/19 13:16:08 ERROR SparkSQLDriver: Failed in [select sort_array(array('b', 'd'), '1')]
    java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Boolean
    	at scala.runtime.BoxesRunTime.unboxToBoolean(BoxesRunTime.java:85)
    	at org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:185)
    	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
    	at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:50)
    	at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:43)
    	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
    	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
    	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
    	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:297)
    ```
    
    **After**
    
    ```
    Error in query: cannot resolve 'sort_array(array('b', 'd'), '1')' due to data type mismatch: Sort order in second argument requires a boolean literal.; line 1 pos 7;
    ```
    
    ## How was this patch tested?
    
    Unit test in `DataFrameFunctionsSuite`.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #15532 from HyukjinKwon/SPARK-17989.
    
    (cherry picked from commit 4b2011ec9da1245923b5cbd883240fef0dbf3ef0)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 4131623a8585fe99f79d82c24ab3b8b506d0d616
Author: WeichenXu <we...@outlook.com>
Date:   2016-10-20T06:41:38Z

    [SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing
    
    ## What changes were proposed in this pull request?
    
    - Fix a bug where RDD `zipWithIndex` generated wrong results when one partition contained more than 2147483647 records.
    
    - Fix a bug where RDD `zipWithUniqueId` generated wrong results when one partition contained more than 2147483647 records (a short API sketch follows).
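
    A short sketch of the affected APIs (the overflow itself needs a partition with more than 2 billion records, so this only shows the calls; `sc` is an existing SparkContext):
    ```scala
    val rdd = sc.parallelize(1 to 1000, numSlices = 1)
    rdd.zipWithIndex().take(3).foreach(println)     // (1,0), (2,1), (3,2) -- the index is a Long
    rdd.zipWithUniqueId().take(3).foreach(println)  // unique Long ids, spaced by the partition count
    ```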
    
    ## How was this patch tested?
    
    Test added.
    
    Author: WeichenXu <We...@outlook.com>
    
    Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.
    
    (cherry picked from commit 39755169fb5bb07332eef263b4c18ede1528812d)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19455: Branch 2.0

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19455


---



[GitHub] spark issue #19455: Branch 2.0

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19455
  
    @deeppark could you please close this if it is a PR that you did not intend to open?


---



[GitHub] spark issue #19455: Branch 2.0

Posted by deeppark <gi...@git.apache.org>.
Github user deeppark commented on the issue:

    https://github.com/apache/spark/pull/19455
  
    Hi All,
    
     Apologies, I did it by mistake. I'll try to close it.
    
    
    Regards,
    Deepak
    
    On 8 Oct 2017 4:23 pm, "UCB AMPLab" <no...@github.com> wrote:
    
    > Can one of the admins verify this patch?
    >



---



[GitHub] spark issue #19455: Branch 2.0

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19455
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org