Posted to reviews@spark.apache.org by Softsapiens <gi...@git.apache.org> on 2017/07/10 10:46:09 UTC

[GitHub] spark pull request #18588: Branch 2.0

GitHub user Softsapiens opened a pull request:

    https://github.com/apache/spark/pull/18588

    Branch 2.0

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18588.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18588
    
----
commit 3487b020354988a91181f23b1c6711bfcdb4c529
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-10-07T07:27:55Z

    [SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths
    
    ## What changes were proposed in this pull request?
    If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks whether the param `paths` is a basestring and, if so, converts it to a list, so that the same variable `paths` can be used in both cases.
    
    ## How was this patch tested?
    Added unit test for reading list of files
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
    
    (cherry picked from commit bcaa799cb01289f73e9f48526e94653a07628983)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 9f2eb27a425385836dba5aad61babfb1db738a73
Author: Sean Owen <so...@cloudera.com>
Date:   2016-10-07T17:31:41Z

    [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished
    
    This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called.
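    
    As a concrete illustration, here is a hedged Scala sketch of the pattern (argument values and the connector name are illustrative, not Spark's actual ones): the key change is passing a `ScheduledExecutorScheduler` whose threads are daemons, so lingering Jetty timer threads cannot keep the JVM alive after the application finishes.
    
    ```scala
    import org.eclipse.jetty.server.{HttpConnectionFactory, Server, ServerConnector}
    import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler
    
    val server = new Server()
    val connector = new ServerConnector(
      server,
      null,                                                 // default Executor
      new ScheduledExecutorScheduler("ui-scheduler", true), // daemon = true is the point of the fix
      null,                                                 // default ByteBufferPool
      -1, -1,                                               // default acceptor/selector counts
      new HttpConnectionFactory())
    server.addConnector(connector)
    ```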
    
    (I'm not sure we should change the Hive Thriftserver impl, but I did anyway.)
    
    This also adds `sc.stop()` to the quick start guide example.
    
    Existing tests; _pending_ at least manual verification of the fix.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15381 from srowen/SPARK-17707.
    
    (cherry picked from commit cff560755244dd4ccb998e0c56e81d2620cd4cff)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit f460a199e8fc78ce879b79844c6c9e340b574439
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-10-07T18:32:39Z

    [SPARK-17346][SQL][TEST-MAVEN] Add Kafka source for Structured Streaming (branch 2.0)
    
    ## What changes were proposed in this pull request?
    
    Backport https://github.com/apache/spark/commit/9293734d35eb3d6e4fd4ebb86f54dd5d3a35e6db and https://github.com/apache/spark/commit/b678e465afa417780b54db0fbbaa311621311f15 into branch 2.0.
    
    The only difference is the Spark version in the pom file.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15367 from zsxwing/kafka-source-branch-2.0.

commit a84d8ef375f853c5841d458a593e41b457b9e6ff
Author: Herman van Hovell <hv...@databricks.com>
Date:   2016-10-07T10:46:39Z

    [SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project to build modules
    
    ## What changes were proposed in this pull request?
    This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure the Kafka 0.10 tests are only triggered when it or one of its dependencies changes.
    
    Author: Herman van Hovell <hv...@databricks.com>
    
    Closes #15355 from hvanhovell/SPARK-17782.

commit 6d056c168c45d2decf5ffbb96d59623d52ed8490
Author: Davies Liu <da...@databricks.com>
Date:   2016-10-07T22:03:47Z

    [SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin
    
    ## What changes were proposed in this pull request?
    
    In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part is not well tested and has a bug that could cause a wrong result when there are at least three integral columns in the join key and the total length of the key exceeds 8 bytes.
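    
    To make the failure mode concrete, a hypothetical sketch of the packing idea (illustrative helper, not the actual `HashJoin` rewrite):
    
    ```scala
    // Two 4-byte int keys fit exactly into one 8-byte Long.
    def packTwoInts(a: Int, b: Int): Long =
      (a.toLong << 32) | (b.toLong & 0xFFFFFFFFL)
    
    // Three int columns would need 12 bytes, which cannot fit into a single
    // 8-byte Long, so the rewrite must fall back to the original keys instead
    // of silently dropping the high-order bits.
    ```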
    
    ## How was this patch tested?
    
    Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug.
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #15390 from davies/rewrite_key.
    
    (cherry picked from commit 94b24b84a666517e31e9c9d693f92d9bbfd7f9ad)
    Signed-off-by: Davies Liu <da...@gmail.com>

commit d27df35795fac0fd167e51d5ba08092a17eedfc2
Author: jiangxingbo <ji...@gmail.com>
Date:   2016-10-10T04:52:46Z

    [SPARK-17832][SQL] TableIdentifier.quotedString creates un-parseable names when name contains a backtick
    
    ## What changes were proposed in this pull request?
    
    The `quotedString` methods in `TableIdentifier` and `FunctionIdentifier` produce an illegal (un-parseable) name when the name contains a backtick. For example:
    ```
    import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
    import org.apache.spark.sql.catalyst.TableIdentifier
    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
    val complexName = TableIdentifier("`weird`table`name", Some("`d`b`1"))
    parseTableIdentifier(complexName.unquotedString) // Does not work
    parseTableIdentifier(complexName.quotedString) // Does not work
    parseExpression(complexName.unquotedString) // Does not work
    parseExpression(complexName.quotedString) // Does not work
    ```
    We should handle the backtick properly to make `quotedString` parseable.
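    
    A minimal sketch of such quoting (hypothetical helper, not the exact Catalyst code): escape an embedded backtick by doubling it before wrapping the identifier in backticks.
    
    ```scala
    def quoteIdentifier(name: String): String =
      "`" + name.replace("`", "``") + "`"
    
    quoteIdentifier("weird`table`name")  // `weird``table``name`
    ```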
    
    ## How was this patch tested?
    Add new testcases in `TableIdentifierParserSuite` and `ExpressionParserSuite`.
    
    Author: jiangxingbo <ji...@gmail.com>
    
    Closes #15403 from jiangxb1987/backtick.
    
    (cherry picked from commit 26fbca480604ba258f97b9590cfd6dda1ecd31db)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit d719e9a080a909a6a56db938750d553668743f8f
Author: Dhruve Ashar <dh...@gmail.com>
Date:   2016-10-10T15:55:57Z

    [SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing
    
    ## What changes were proposed in this pull request?
    Currently the number of partition files is limited to 10000 (%05d format). If there are more than 10000 part files, recreating the RDD breaks because the files are sorted as strings. More details can be found in the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417).
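    
    A standalone illustration of why sorting part-file names as strings goes wrong once an index outgrows the zero padding (toy example, not the checkpoint code):
    
    ```scala
    Seq(9998, 9999, 10000).map(i => f"part-$i%05d")
    // List(part-09998, part-09999, part-10000): fixed width, sorts correctly
    
    Seq("part-99999", "part-100000").sorted
    // List(part-100000, part-99999): widths differ, numeric order is lost
    ```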
    
    ## How was this patch tested?
    I tested this patch by checkpointing an RDD, manually renaming the part files to the old format, and then accessing the RDD; it was successfully recreated from the old format. Also verified loading a sample Parquet file, saving it in multiple formats (CSV, JSON, Text, Parquet, ORC), and reading them back successfully from the saved files. I couldn't launch the unit test from my local box, so I will wait for the Jenkins output.
    
    Author: Dhruve Ashar <dh...@gmail.com>
    
    Closes #15370 from dhruve/bug/SPARK-17417.
    
    (cherry picked from commit 4bafacaa5f50a3e986c14a38bc8df9bae303f3a0)
    Signed-off-by: Tom Graves <tg...@yahoo-inc.com>

commit ff9f5bbf1795d9f5b14838099dcc1bb4ac8a9b5b
Author: Davies Liu <da...@databricks.com>
Date:   2016-10-11T02:14:01Z

    [SPARK-17738][TEST] Fix flaky test in ColumnTypeSuite
    
    ## What changes were proposed in this pull request?
    
    The default buffer size is not big enough for randomly generated MapType.
    
    ## How was this patch tested?
    
    Ran the tests 100 times; they never failed (they failed 8 times before the patch).
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #15395 from davies/flaky_map.
    
    (cherry picked from commit d5ec4a3e014494a3d991a6350caffbc3b17be0fd)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit a6b5e1dccf0be0e709d6d4113cdacb0cecce39fd
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-10-11T17:53:07Z

    [SPARK-17346][SQL][TESTS] Fix the flaky topic deletion in KafkaSourceStressSuite
    
    ## What changes were proposed in this pull request?
    
    A follow-up PR for SPARK-17346 to fix the flaky `org.apache.spark.sql.kafka010.KafkaSourceStressSuite`.
    
    Test log: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1855/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/
    
    Looks like deleting the Kafka internal topic `__consumer_offsets` is flaky. This PR simply ignores internal topics.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15384 from zsxwing/SPARK-17346-flaky-test.
    
    (cherry picked from commit 75b9e351413dca0930e8545e6283874db09d8482)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 5ec3e6680a091883369c002ae599d6b03f38c863
Author: Ergin Seyfe <es...@fb.com>
Date:   2016-10-11T19:51:08Z

    [SPARK-17816][CORE][BRANCH-2.0] Fix ConcurrentModificationException issue in BlockStatusesAccumulator
    
    ## What changes were proposed in this pull request?
    Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is thread-safe, plus a few more cleanups.
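    
    For context, a usage-level sketch of the replacement class (this shows the public `CollectionAccumulator` API; the change itself is internal to how block statuses are accumulated):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("acc-demo"))
    val acc = sc.collectionAccumulator[String]("blockStatuses")
    sc.parallelize(1 to 100, 8).foreach(i => acc.add("item-" + i))  // concurrent adds are safe
    println(acc.value.size())  // backed by a synchronized java.util.List
    ```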
    
    ## How was this patch tested?
    Tested in master branch and cherry-picked.
    
    Author: Ergin Seyfe <es...@fb.com>
    
    Closes #15425 from seyfe/race_cond_jsonprotocal_branch-2.0.

commit e68e95e947045704d3e6a36bb31e104a99d3adcc
Author: Alexander Pivovarov <ap...@gmail.com>
Date:   2016-10-12T05:31:21Z

    Fix hadoop.version in building-spark.md
    
    A couple of mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number.
    
    Author: Alexander Pivovarov <ap...@gmail.com>
    
    Closes #15440 from apivovarov/patch-1.
    
    (cherry picked from commit 299eb04ba05038c7dbb3ecf74a35d4bbfa456643)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit f3d82b53c42a971deedc04de6950b9228e5262ea
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2016-10-12T05:36:57Z

    [SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is incorrect.
    
    ## What changes were proposed in this pull request?
    
    In `programming-guide.md`, the URL that links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2`, but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.
    
    ## How was this patch tested?
    manual test.
    
    Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
    
    Closes #15439 from sarutak/SPARK-17880.
    
    (cherry picked from commit b512f04f8e546843d5a3f35dcc6b675b5f4f5bc0)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit f12b74c02eec9e201fec8a16dac1f8e549c1b4f0
Author: cody koeninger <co...@koeninger.org>
Date:   2016-10-12T07:40:47Z

    [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad
    
    ## What changes were proposed in this pull request?
    
    Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer.
    
    ## How was this patch tested?
    
    I built jekyll doc and made sure it looked ok.
    
    Author: cody koeninger <co...@koeninger.org>
    
    Closes #15442 from koeninger/SPARK-17853.
    
    (cherry picked from commit c264ef9b1918256a5018c7a42a1a2b42308ea3f7)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 4dcbde48de6c46e2fd8ccfec732b8ff5c24f97a4
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-10-11T06:29:52Z

    [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13
    
    ## What changes were proposed in this pull request?
    Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL
    
    ## How was this patch tested?
    Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
    
    (cherry picked from commit 658c7147f5bf637f36e8c66b9207d94b1e7c74c5)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 5451541d1113aa75bab80914ca51a913f6ba4753
Author: prigarg <pr...@adobe.com>
Date:   2016-10-12T17:14:45Z

    [SPARK-17884][SQL] To resolve Null pointer exception when casting from empty string to interval type.
    
    ## What changes were proposed in this pull request?
    This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, the isNull variable is set to true.
    
    Earlier, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for the reason mentioned above.
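    
    A small sketch of the failing case (Literal("") stands in for the empty-string literal named in the title; expected behaviour per this fix, not the exact CastSuite test):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
    import org.apache.spark.sql.types.CalendarIntervalType
    
    val expr = Cast(Literal(""), CalendarIntervalType)
    val result = expr.eval()  // expected: null, rather than a NullPointerException
    ```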
    
    ## How was this patch tested?
    Added test case in CastSuite.scala
    
    jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884
    
    Author: prigarg <pr...@adobe.com>
    
    Closes #15449 from priyankagargnitk/SPARK-17884.
    
    (cherry picked from commit d5580ebaa086b9feb72d5428f24c5b60cd7da745)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d55ba3063da1a5d12e3b09e55f089f16ecf327bb
Author: Hossein <ho...@databricks.com>
Date:   2016-10-12T17:32:38Z

    [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
    
    ## What changes were proposed in this pull request?
    If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
    
    I tested this on my MacBook. Following code works with this patch:
    ```R
    intMax <- .Machine$integer.max
    largeVec <- 1:intMax
    rdd <- SparkR:::parallelize(sc, largeVec, 2)
    ```
    
    ## How was this patch tested?
    * [x] Unit tests
    
    Author: Hossein <ho...@databricks.com>
    
    Closes #15375 from falaki/SPARK-17790.
    
    (cherry picked from commit 5cc503f4fe9737a4c7947a80eecac053780606df)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 050b8177e27df06d33a6f6f2b3b6a952b0d03ba6
Author: cody koeninger <co...@koeninger.org>
Date:   2016-10-12T22:22:06Z

    [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice
    
    ## What changes were proposed in this pull request?
    
    Alternative approach to https://github.com/apache/spark/pull/15387
    
    Author: cody koeninger <co...@koeninger.org>
    
    Closes #15401 from koeninger/SPARK-17782-alt.
    
    (cherry picked from commit f9a56a153e0579283160519065c7f3620d12da3e)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 5903dabc57c07310573babe94e4f205bdea6455f
Author: Brian Cho <bc...@fb.com>
Date:   2016-10-13T03:43:18Z

    [SPARK-16827][BRANCH-2.0] Avoid reporting spill metrics as shuffle metrics
    
    ## What changes were proposed in this pull request?
    
    Fix a bug where spill metrics were being reported as shuffle metrics. Eventually these spill metrics should be reported (SPARK-3577), but separate from shuffle metrics. The fix itself basically reverts the line to what it was in 1.6.
    
    ## How was this patch tested?
    
    Cherry-picked from master (#15347)
    
    Author: Brian Cho <bc...@fb.com>
    
    Closes #15455 from dafrista/shuffle-metrics-2.0.

commit ab00e410c6b1d7dafdfabcea1f249c78459b94f0
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-10-13T04:40:45Z

    [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once
    
    ## What changes were proposed in this pull request?
    
    The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch.
    You may come across stacktraces that look like:
    ```
    java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.lang.StringCoding.encode(StringCoding.java:350)
    at java.lang.String.getBytes(String.java:941)
    at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
    
    ```
    The safer way is to write to an output stream so that we don't have to materialize a huge string.
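    
    A hedged sketch of the idea (hypothetical helper, not the actual CompactibleFileStreamLog API): serialize each metadata entry directly to the stream instead of first building one huge String.
    
    ```scala
    import java.io.{ByteArrayOutputStream, OutputStream}
    import java.nio.charset.StandardCharsets
    
    def writeLog(entries: Iterator[String], out: OutputStream): Unit =
      entries.foreach { entry =>
        out.write(entry.getBytes(StandardCharsets.UTF_8))
        out.write('\n')  // one entry at a time; the whole log is never held in memory
      }
    
    val out = new ByteArrayOutputStream()
    writeLog(Iterator("v1", """{"path":"part-00000","action":"add"}"""), out)
    ```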
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #15437 from brkyvz/ser-to-stream.
    
    (cherry picked from commit edeb51a39d76d64196d7635f52be1b42c7ec4341)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit d38f38a093b4dff32c686675d93ab03e7a8f4908
Author: buzhihuojie <re...@gmail.com>
Date:   2016-10-13T05:51:54Z

    minor doc fix for Row.scala
    
    ## What changes were proposed in this pull request?
    
    minor doc fix for "getAnyValAs" in class Row
    
    ## How was this patch tested?
    
    None.
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Author: buzhihuojie <re...@gmail.com>
    
    Closes #15452 from david-weiluo-ren/minorDocFixForRow.
    
    (cherry picked from commit 7222a25a11790fa9d9d1428c84b6f827a785c9e8)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d7fa3e32421c73adfa522adfeeb970edd4c22eb3
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-10-13T20:31:50Z

    [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
    
    ## What changes were proposed in this pull request?
    
    Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets.
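    
    A hedged sketch of the approach (standalone Kafka consumer code, not the actual KafkaSource internals): ask the consumer for the earliest offsets explicitly instead of relying on `poll(0)` having positioned the partitions.
    
    ```scala
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    
    def earliestOffsets(consumer: KafkaConsumer[Array[Byte], Array[Byte]],
                        partitions: Seq[TopicPartition]): Map[TopicPartition, Long] = {
      consumer.assign(partitions.asJava)
      consumer.seekToBeginning(partitions.asJava)               // jump to the beginning...
      partitions.map(tp => tp -> consumer.position(tp)).toMap   // ...and read the positions
    }
    ```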
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #15397 from zsxwing/SPARK-17834.
    
    (cherry picked from commit 08eac356095c7faa2b19d52f2fb0cbc47eb7d1d1)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit c53b8374911e801ed98c1436c384f0aef076eaab
Author: Davies Liu <da...@databricks.com>
Date:   2016-10-14T21:45:20Z

    [SPARK-17863][SQL] should not add column into Distinct
    
    ## What changes were proposed in this pull request?
    
    When resolving an attribute in Sort, we pull up a column from the grandchild into the child. That is wrong when the child is a Distinct, because the added column changes the behavior of Distinct; we should not do that.
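    
    A standalone illustration of why the pulled-up column is not harmless (hypothetical data, assuming an active SparkSession named `spark`):
    
    ```scala
    import spark.implicits._
    
    val df = Seq((1, "x"), (1, "y")).toDF("a", "b")
    df.select("a").distinct().count()       // 1
    df.select("a", "b").distinct().count()  // 2 -- adding `b` changes what Distinct returns
    ```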
    
    ## How was this patch tested?
    
    Added regression test.
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #15489 from davies/order_distinct.
    
    (cherry picked from commit da9aeb0fde589f7c21c2f4a32036a68c0353965d)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 2a1b10b649a8d4c077a0e19df976f1fd36b7e266
Author: Jun Kim <i2...@gmail.com>
Date:   2016-10-15T07:36:55Z

    [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc
    
    ## What changes were proposed in this pull request?
    
    ### Before:
    ```scala
    SparkSession.builder()
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value").
         .getOrCreate()
    ```
    
    ### After:
    ```scala
    SparkSession.builder()
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value")
         .getOrCreate()
    ```
    
    There was one unexpected dot!
    
    Author: Jun Kim <i2...@gmail.com>
    
    Closes #15498 from tae-jun/SPARK-17953.
    
    (cherry picked from commit 36d81c2c68ef4114592b069287743eb5cb078318)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 3cc2fe5b94d3bcdfb4f28bfa6d8e51fe67d6e1b4
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-10-17T05:15:47Z

    [SPARK-17819][SQL][BRANCH-2.0] Support default database in connection URIs for Spark Thrift Server
    
    ## What changes were proposed in this pull request?
    
    Currently, Spark Thrift Server ignores the default database in the connection URI. This PR adds support for it, as shown below.
    
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "create database testdb"
    $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "create table t(a int)"
    $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "show tables"
    ...
    +------------+--------------+--+
    | tableName  | isTemporary  |
    +------------+--------------+--+
    | t          | false        |
    +------------+--------------+--+
    1 row selected (0.347 seconds)
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "show tables"
    ...
    +------------+--------------+--+
    | tableName  | isTemporary  |
    +------------+--------------+--+
    +------------+--------------+--+
    No rows selected (0.098 seconds)
    ```
    
    ## How was this patch tested?
    
    Passes Jenkins with a newly added test suite.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #15507 from dongjoon-hyun/SPARK-17819-BACK.

commit ca66f52ff81c19e17ca3733eac92d66012a3ec6e
Author: Weiqing Yang <ya...@gmail.com>
Date:   2016-10-17T05:38:30Z

    [MINOR][SQL] Add prettyName for current_database function
    
    ## What changes were proposed in this pull request?
    Added a `prettyName` for the current_database function.
    
    ## How was this patch tested?
    Manually.
    
    Before:
    ```
    scala> sql("select current_database()").show
    +-----------------+
    |currentdatabase()|
    +-----------------+
    |          default|
    +-----------------+
    ```
    
    After:
    ```
    scala> sql("select current_database()").show
    +------------------+
    |current_database()|
    +------------------+
    |           default|
    +------------------+
    ```
    
    Author: Weiqing Yang <ya...@gmail.com>
    
    Closes #15506 from weiqingy/prettyName.
    
    (cherry picked from commit 56b0f5f4d1d7826737b81ebc4ec5dad83b6463e3)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d1a02117862b20d0e8e58f4c6da6a97665a02590
Author: gatorsmile <ga...@gmail.com>
Date:   2016-10-17T07:29:53Z

    [SPARK-17892][SQL][2.0] Do Not Optimize Query in CTAS More Than Once #15048
    
    ### What changes were proposed in this pull request?
    This PR is to backport https://github.com/apache/spark/pull/15048 and https://github.com/apache/spark/pull/15459.
    
    However, in 2.0, we do not have a unified logical node `CreateTable` and the analyzer rule `PreWriteCheck` is also different. To minimize the code changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it as a new PR to review. Thanks!
    
    As explained in https://github.com/apache/spark/pull/14797:
    >Some analyzer rules have assumptions about logical plans, and the optimizer may break these assumptions; we should not pass an optimized query plan into QueryExecution (it will be analyzed again), otherwise we may hit some weird bugs.
    For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and causes a wrong result.
    
    We should not optimize the query in CTAS more than once. For example,
    ```Scala
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    ```
    Before this PR, the results do not match
    ```
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    ```
    After this PR, the results match.
    ```
    +---+----------------------+
    |id |num                   |
    +---+----------------------+
    |99 |99.000000000000000000 |
    |100|100.000000000000000000|
    +---+----------------------+
    ```
    
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it to normalize and verify the CTAS in the Analyzer. We therefore do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`.
    
    ### How was this patch tested?
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #15502 from gatorsmile/ctasOptimize2.0.

commit a0d9015b3f34582c5d43bd31bbf35a0e92b1da29
Author: Maxime Rihouey <ma...@gmail.com>
Date:   2016-10-17T09:56:22Z

    Fix example of tf_idf with minDocFreq
    
    ## What changes were proposed in this pull request?
    
    The Python example for tf_idf with the "minDocFreq" parameter is not set up properly, because the same variable is used to transform the documents both with and without "minDocFreq".
    The IDF(minDocFreq=2) model is stored in the variable "idfIgnore", but the original variable "idf" is then used to transform "tf" instead of "idfIgnore".
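    
    A Scala rendition of the corrected pattern (the documentation fix itself is in the Python example; `documents` is a hypothetical RDD[Seq[String]]):
    
    ```scala
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD
    
    def tfidfWithMinDocFreq(documents: RDD[Seq[String]]) = {
      val tf = new HashingTF().transform(documents)
      tf.cache()
      val idf = new IDF().fit(tf)
      val tfidf = idf.transform(tf)
    
      val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
      val tfidfIgnore = idfIgnore.transform(tf)  // the bug was reusing `idf` here
      (tfidf, tfidfIgnore)
    }
    ```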
    
    ## How was this patch tested?
    
    Before the results for "tfidf" and "tfidfIgnore" were the same:
    tfidf:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    tfidfIgnore:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    
    After the fix those are how they should be:
    tfidf:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    tfidfIgnore:
    (1048576,[1046921],[0.0])
    (1048576,[1046920],[0.0])
    (1048576,[1046923],[0.0])
    (1048576,[892732],[0.0])
    (1048576,[892733],[0.0])
    (1048576,[892734],[0.0])
    
    Author: Maxime Rihouey <ma...@gmail.com>
    
    Closes #15503 from maximerihouey/patch-1.
    
    (cherry picked from commit e3bf37fa3ada43624b2e77bef90ad3d3dbcd8ce1)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 881e0eb05782ea74cf92a62954466b14ea9e05b6
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-10-17T23:56:40Z

    [SPARK-17731][SQL][STREAMING] Metrics for structured streaming for branch-2.0
    
    **This PR adds the same metrics to branch-2.0 that was added to master in #15307.**
    
    The differences compared to the #15307 are
    - The new configuration is added only in the `SQLConf` object (i.e. `SQLConf.STREAMING_METRICS_ENABLED`) and not in the `SQLConf` class (i.e. no `SQLConf.isStreamingMetricsEnabled`). Spark master has all the streaming configurations exposed as actual fields in the SQLConf class (e.g. [streamingPollingDelay](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L642)), but [not in Spark 2.0](https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L608). So I didn't add it in this 2.0 PR.
    
    - In the previous master PR, the above configuration was read in `StreamExecution` as `sparkSession.sessionState.conf.isStreamingMetricsEnabled`. In this 2.0 PR, I instead read it as `sparkSession.conf.get(STREAMING_METRICS_ENABLED)` (i.e. no `sessionState`) to keep it consistent with how other confs are read in `StreamExecution` (e.g. [STREAMING_POLLING_DELAY](https://github.com/tdas/spark/blob/ee8e899e4c274c363a8b4d13e8bf57b0b467a50e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L62)).
    
    - Different Mima exclusions
    
    ------
    
    ## What changes were proposed in this pull request?
    
    Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
    https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
    
    Specifically, this PR adds the following public APIs changes.
    
    - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
    
    - `StreamingQueryStatus` has the following important fields
      - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
      - processingRate - Current rate (rows/sec) at which the query is processing data from
                                      all the sources
      - ~~outputRate~~ - *Does not work with wholestage codegen*
      - latency - Current average latency between the data being available in source and the sink writing the corresponding output
      - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
      - sinkStatus: SinkStatus - Current status of the sink
      - triggerStatus - Low-level detailed status of the last completed/currently active trigger
        - latencies - getOffset, getBatch, full trigger, wal writes
        - timestamps - trigger start, finish, after getOffset, after getBatch
        - numRows - input, output, state total/updated rows for aggregations
    
    - `SourceStatus` has the following important fields
      - inputRate - Current rate (rows/sec) at which data is being generated by the source
      - processingRate - Current rate (rows/sec) at which the query is processing data from the source
      - triggerStatus - Low-level detailed status of the last completed/currently active trigger
    
    - Python API for `StreamingQuery.status()`
    
    ### Breaking changes to existing APIs
    
    **Existing direct public facing APIs**
    - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
      - Branch 2.0 should have it deprecated, master should have it removed.
    
    **Existing advanced listener APIs**
    - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
       - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)
    
    - Field `queryInfo` in the listener events `QueryStarted`, `QueryProgress`, and `QueryTerminated` was renamed to `queryStatus` and now has the return type `StreamingQueryStatus`.
    
    - Field `offsetDesc` in `SourceStatus` was `Option[String]`; it is now a `String`.
    
    - For `SourceStatus` and `SinkStatus`, made the constructor private instead of private[sql] to make them more Java-safe. Instead, added `private[sql]` `SourceStatus.apply()`/`SinkStatus.apply()` methods, which are harder to accidentally use in Java.
    
    ## How was this patch tested?
    
    Old and new unit tests.
    - Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
    - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
    - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
    - Source-specific tests for making sure input rows are counted are in source-specific test suites.
    - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.
    
    Metrics were also manually tested using the Ganglia sink.
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15472 from tdas/SPARK-17731-branch-2.0.

commit 01520de6b999c73b5e302778487d8bd1db8fdd2e
Author: Liwei Lin <lw...@gmail.com>
Date:   2016-10-18T07:49:57Z

    [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite
    
    This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.
    
    ## What changes were proposed in this pull request?
    There were two sources of flakiness in StreamingQueryListener test.
    
    - When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
    ```
    +-----------------------------------+--------------------------------+
    |      StreamExecution thread       |         testing thread         |
    +-----------------------------------+--------------------------------+
    |  ManualClock.waitTillTime(100) {  |                                |
    |        _isWaiting = true          |                                |
    |            wait(10)               |                                |
    |        still in wait(10)          |  if (_isWaiting) advance(100)  |
    |        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
    |        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
    |      wake up from wait(10)        |                                |
    |       current time is 600         |                                |
    |       _isWaiting = false          |                                |
    |  }                                |                                |
    +-----------------------------------+--------------------------------+
    ```
    
    - The second source of flakiness is that data added to the memory stream may get processed in any trigger, not just the first one.
    
    My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for the stream execution thread to complete the wait that started at time 0 and to start a new wait at time 200 (i.e. the timestamp after the previous `advance(100)`).
    
    In addition, since this feature is used solely by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.
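    
    A toy sketch of the synchronization idea (illustrative class, not the actual StreamManualClock): an advance is only applied once a waiter is blocked at the expected start time, so two advances can never be applied against the same wait.
    
    ```scala
    class ToyManualClock {
      private var now = 0L
      private var waitStartTime = -1L   // -1 means no thread is currently waiting
    
      def waitTillTime(target: Long): Unit = synchronized {
        waitStartTime = now
        notifyAll()                     // tell advancers that a wait has started
        while (now < target) wait()
        waitStartTime = -1L
      }
    
      def advance(by: Long, expectedWaitStart: Long): Unit = synchronized {
        while (waitStartTime != expectedWaitStart) wait()  // block until the waiter arrives
        now += by
        notifyAll()                     // wake the waiter at the new time
      }
    }
    ```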
    
    ## How was this patch tested?
    Ran the existing unit test MANY TIMES in Jenkins.
    
    Author: Tathagata Das <ta...@gmail.com>
    Author: Liwei Lin <lw...@gmail.com>
    
    Closes #15519 from tdas/metrics-flaky-test-fix.
    
    (cherry picked from commit 7d878cf2da04800bc4147b05610170865b148c64)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 9e806f2a4fbc0e7d1441a3eda375cba2fda8ffe5
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-10-18T09:29:55Z

    [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 compatibility
    
    ## What changes were proposed in this pull request?
    
    Scala 2.10 does not have Option.contains, which broke the Scala 2.10 build.
    
    ## How was this patch tested?
    Locally compiled and ran sql/core unit tests in 2.10
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #15531 from tdas/metrics-flaky-test-fix-1.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18588: Branch 2.0

Posted by Softsapiens <gi...@git.apache.org>.
Github user Softsapiens closed the pull request at:

    https://github.com/apache/spark/pull/18588

