Posted to reviews@spark.apache.org by ShobhaEndigeri <gi...@git.apache.org> on 2017/05/31 15:33:05 UTC

[GitHub] spark pull request #18163: Branch 2.1

GitHub user ShobhaEndigeri opened a pull request:

    https://github.com/apache/spark/pull/18163

    Branch 2.1

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18163
    
----
commit 0668e061beba683d026a2d48011ff74faf8a38ab
Author: Andrew Ash <an...@andrewash.com>
Date:   2017-01-13T07:14:07Z

    Fix missing close-parens for In filter's toString
    
    Otherwise the opening parenthesis isn't closed in query plan descriptions of batch scans.
    
        PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
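
    As an illustration only (a minimal sketch mirroring the public `org.apache.spark.sql.sources.In` filter, not the patched source itself), the corrected rendering closes both the value list and the filter:

    ```scala
    // Sketch: an In filter should print as In(attr, [v1,v2,...]) with both brackets closed.
    case class In(attribute: String, values: Array[Any]) {
      override def toString: String = s"In($attribute, [${values.mkString(",")}])"
    }

    println(In("COL_A", Array(1, 2, 4)))   // prints: In(COL_A, [1,2,4])
    ```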
    
    Author: Andrew Ash <an...@andrewash.com>
    
    Closes #16558 from ash211/patch-9.
    
    (cherry picked from commit b040cef2ed0ed46c3dfb483a117200c9dac074ca)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit b2c9a2c8c8e8c38baa6d876c81d143af61328aa2
Author: Vinayak <vi...@in.ibm.com>
Date:   2017-01-13T10:35:12Z

    [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
    
    This change makes SQLContext reuse the active SparkSession during construction when the SparkContext supplied is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. See https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
    
    Existing unit tests and a new unit test added to pyspark-sql:
    
    /python/run-tests --python-executables=python --modules=pyspark-sql
    
    Author: Vinayak <vi...@in.ibm.com>
    Author: Vinayak Joshi <vi...@users.noreply.github.com>
    
    Closes #16119 from vijoshi/SPARK-18687_master.
    
    (cherry picked from commit 285a7798e267311730b0163d37d726a81465468a)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 2c2ca8943c4355af491ec19fe6d13949182260ab
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-01-13T06:52:34Z

    [SPARK-19178][SQL] convert string of large numbers to int should return null
    
    ## What changes were proposed in this pull request?
    
    When we convert a string to an integral type, we first convert the string to `decimal(20, 0)`, so that a string in decimal format can be turned into a truncated integral value, e.g. `CAST('1.2' AS int)` returns `1`.

    However, this causes problems when we convert a string holding a large number to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null as expected.

    This is a long-standing bug (it seems to have been there since the first day Spark SQL was created). This PR fixes it by adding native support for converting `UTF8String` to integral types.
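
    For illustration, a minimal spark-shell snippet (assuming a running `SparkSession` named `spark`) showing the post-fix behaviour:

    ```scala
    // An out-of-range string now casts to NULL (matching Hive) instead of a wrapped value.
    spark.sql("SELECT CAST('1234567890123' AS INT)").show()   // null
    // A string in decimal format is still truncated to an integral value.
    spark.sql("SELECT CAST('1.2' AS INT)").show()             // 1
    ```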
    
    ## How was this patch tested?
    
    new regression tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #16550 from cloud-fan/string-to-int.
    
    (cherry picked from commit 6b34e745bb8bdcf5a8bb78359fa39bbe8c6563cc)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit ee3642f5182f199aac15b69d1a6a1167f75e5c65
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-01-13T18:08:14Z

    [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter
    
    ## What changes were proposed in this pull request?
    
    To allow specifying the number of partitions when the DataFrame is created.
    
    ## How was this patch tested?
    
    manual, unit tests
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16512 from felixcheung/rnumpart.
    
    (cherry picked from commit b0e8eb6d3e9e80fa62625a5b9382d93af77250db)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 5e9be1e1f05936da48aa2977f78144f26b2dd266
Author: Yucai Yu <yu...@intel.com>
Date:   2017-01-13T21:40:53Z

    [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn
    
    ## What changes were proposed in this pull request?
    
    The offset used for shorts in OffHeapColumnVector's putShorts is 4, but it should actually be 2.
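
    A minimal sketch of the arithmetic being corrected (illustration only, not the actual `OffHeapColumnVector` source):

    ```scala
    // Hypothetical helper: address of the i-th short element in an off-heap buffer.
    // A short occupies 2 bytes, so the stride must be 2, not 4.
    def shortElementAddress(base: Long, rowId: Int): Long = base + 2L * rowId
    ```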
    
    ## How was this patch tested?
    
    unit test
    
    Author: Yucai Yu <yu...@intel.com>
    
    Closes #16555 from yucai/offheap_short.
    
    (cherry picked from commit ad0dadaa251b031a480fc2080f792a54ed7dfc5f)
    Signed-off-by: Davies Liu <da...@gmail.com>

commit db37049da6d2fb743a16ba0ea3fec5dbce46e30c
Author: gatorsmile <ga...@gmail.com>
Date:   2017-01-15T12:40:44Z

    [SPARK-19120] Refresh Metadata Cache After Loading Hive Tables
    
    ```Scala
            sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
    
            // This table fetch is to fill the cache with zero leaf files
            spark.table("tab").show()
    
            sql(
              s"""
                 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
                 |INTO TABLE tab
               """.stripMargin)
    
            spark.table("tab").show()
    ```
    
    In the above example, the returned result is empty after the table is loaded. The metadata cache can be out of date after loading new data into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, for Hive serde tables, only the `parquet` and `orc` formats face this issue, because Hive serde tables in parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on.
    
    This PR is to refresh the metadata cache after processing the `LOAD DATA` command.
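
    For reference, the same cache invalidation can be triggered manually through the public catalog API (a minimal sketch, assuming a running `SparkSession` named `spark`):

    ```scala
    // Invalidate the cached metadata/file listing for the table; this is roughly
    // what the LOAD DATA path does automatically after this fix.
    spark.catalog.refreshTable("tab")
    spark.table("tab").show()
    ```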
    
    In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writing partitioned parquet/orc tables still uses `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading an out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note that it does not need to refresh the cache for non-partitioned parquet/orc tables, because they do not go through `InsertIntoHiveTable` at all. Based on the review comments, this PR keeps the existing logic unchanged; that is, we always refresh the table no matter whether it is partitioned or not.
    
    Added test cases in parquetSuites.scala
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
    
    (cherry picked from commit de62ddf7ff42bdc383da127e6b1155897565354c)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit bf2f233e49013da54a6accd96c471acafc24df15
Author: gatorsmile <ga...@gmail.com>
Date:   2017-01-16T02:58:10Z

    [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan all the saved files #16481
    
    ### What changes were proposed in this pull request?
    
    #### This PR is to backport https://github.com/apache/spark/pull/16481 to Spark 2.1
    ---
    `DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan of the saved files. The save() API is the most basic/core API in `DataFrameWriter`, so we should avoid this scan.
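
    For context, a minimal usage sketch of the API in question (hypothetical DataFrame `df` and output path):

    ```scala
    // save() should only write the new files; after this change it no longer
    // triggers a full listing of the target path once the write completes.
    df.write.mode("overwrite").format("parquet").save("/tmp/out")
    ```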
    
    ### How was this patch tested?
    Added and modified the test cases
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #16588 from gatorsmile/backport-19092.

commit 4f3ce062ce2e9b403f9d38a44eb7fc76a800ed67
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-01-16T07:26:41Z

    [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet
    
    ## What changes were proposed in this pull request?
    
    We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet:
    
    1. We only ignore corrupt files in `FileScanRDD`. In fact, we begin to read those files as early as inferring the data schema from them. For corrupt files, we can't read the schema and the program fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
    2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, the files may be read before that, in which case the `ignoreCorruptFiles` config doesn't work either.
    
    This patch targets the Parquet data source. If this direction is ok, we can address the same issue for other data sources like ORC.
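
    For reference, a minimal sketch of how the config is enabled (assuming a running `SparkSession` named `spark` and a hypothetical input path):

    ```scala
    // With the flag on, corrupt Parquet files are skipped both during schema
    // inference and while reading, instead of failing the job.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    val df = spark.read.parquet("/data/events")
    df.count()
    ```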
    
    Two main changes in this patch:
    
    1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in multi-threaded manner
    
        We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`.
    
    2. In `FileScanRDD`, we need to ignore corrupt file too when we call `readFunction` to return iterator.
    
    One thing to notice is:
    
    We read the schema from a Parquet file's footer. The method that reads the footer, `ParquetFileReader.readFooter`, throws `RuntimeException` instead of `IOException` if it can't successfully read the footer. Please check out https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470. So this patch catches `RuntimeException`. One concern is that it might also shadow runtime exceptions other than those caused by reading corrupt files.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <vi...@gmail.com>
    
    Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.
    
    (cherry picked from commit 61e48f52d1d8c7431707bd3511b6fe9f0ae996c0)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 97589050714901139b6fda358916ef64c3bbd78c
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-01-16T17:35:52Z

    [SPARK-19232][SPARKR] Update Spark distribution download cache location on Windows
    
    ## What changes were proposed in this pull request?
    
    Windows seems to be the only platform with appauthor in the path, for which we should use "Apache" (case sensitive).
    The current path of `AppData\Local\spark\spark\Cache` is a bit odd.
    
    ## How was this patch tested?
    
    manual.
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16590 from felixcheung/rcachedir.
    
    (cherry picked from commit a115a54399cd4bedb1a5086943a88af6339fbe85)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit f4317be66d0e169693e3407abf3d0bfa4d7e37af
Author: CodingCat <zh...@gmail.com>
Date:   2017-01-17T02:33:20Z

    [SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets
    
    ## What changes were proposed in this pull request?
    
    The current implementation of Spark Streaming considers a batch completed no matter what the results of its jobs are (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
    Let's consider the following case:
    A micro batch contains 2 jobs which read from two different Kafka topics respectively. One of these jobs fails due to a problem in the user-defined logic, after the other one has finished successfully.
    1. The main thread in the Spark Streaming application executes the line mentioned above,
    2. and another thread (the checkpoint writer) makes a checkpoint file immediately after this line is executed.
    3. Then, due to the current error handling mechanism in Spark Streaming, the StreamingContext is closed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).
    4. The user recovers from the checkpoint file, and because the JobSet containing the failed job had been removed (taken as completed) before the checkpoint was constructed, the data being processed by the failed job will never be reprocessed.
    
    This PR fixes it by removing a jobset from JobScheduler.jobSets only when all jobs in the jobset have finished successfully, as sketched below.
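
    A minimal sketch of the intended rule (illustration only, not the actual `JobScheduler` code):

    ```scala
    import scala.collection.mutable

    // Hypothetical simplified model: a JobSet is dropped from the tracking map
    // only once every job in it has finished successfully.
    case class Job(id: Int, succeeded: Boolean)
    case class JobSet(time: Long, jobs: Seq[Job]) {
      def allSucceeded: Boolean = jobs.forall(_.succeeded)
    }

    def onJobCompleted(jobSets: mutable.Map[Long, JobSet], time: Long): Unit =
      if (jobSets.get(time).exists(_.allSucceeded)) jobSets.remove(time)
    ```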
    
    ## How was this patch tested?
    
    existing tests
    
    Author: CodingCat <zh...@gmail.com>
    Author: Nan Zhu <zh...@gmail.com>
    
    Closes #16542 from CodingCat/SPARK-18905.
    
    (cherry picked from commit f8db8945f25cb884278ff8841bac5f6f28f0dec6)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 2ff366912a5a72f090dd8a4e54bd7533ede7be27
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-01-17T17:53:20Z

    [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
    
    ## What changes were proposed in this pull request?
    
    Currently, PySpark does not work with Python 3.6.0.
    
    Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
    
    ```
    Traceback (most recent call last):
      File ".../spark/python/pyspark/shell.py", line 30, in <module>
        import pyspark
      File ".../spark/python/pyspark/__init__.py", line 46, in <module>
        from pyspark.context import SparkContext
      File ".../spark/python/pyspark/context.py", line 36, in <module>
        from pyspark.java_gateway import launch_gateway
      File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
        from py4j.java_gateway import java_import, JavaGateway, GatewayClient
      File "<frozen importlib._bootstrap>", line 961, in _find_and_load
      File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
      File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
      File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
      File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
        import pkgutil
      File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
        ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
      File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
        cls = _old_namedtuple(*args, **kwargs)
    TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
    ```
    
    The root cause seems to be that some arguments of `namedtuple` became keyword-only arguments as of Python 3.6.0 (see https://bugs.python.org/issue25628).

    We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`), and this seems to cause missing values internally in the function (non-bound arguments).

    This PR proposes to work around this by manually setting them via `kwargs`, as `types.FunctionType` does not seem to support setting this.
    
    Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
    
    ## How was this patch tested?
    
    Manually tested with Python 2.7.6 and Python 3.6.0.
    
    ```
    ./bin/pyspark
    ```
    
    manual creation of `namedtuple` both locally and in an RDD with Python 3.6.0,
    
    and Jenkins tests for other Python versions.
    
    Also,
    
    ```
    ./run-tests --python-executables=python3.6
    ```
    
    ```
    Will test against the following Python executables: ['python3.6']
    Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
    Finished test(python3.6): pyspark.sql.tests (192s)
    Finished test(python3.6): pyspark.accumulators (3s)
    Finished test(python3.6): pyspark.mllib.tests (198s)
    Finished test(python3.6): pyspark.broadcast (3s)
    Finished test(python3.6): pyspark.conf (2s)
    Finished test(python3.6): pyspark.context (14s)
    Finished test(python3.6): pyspark.ml.classification (21s)
    Finished test(python3.6): pyspark.ml.evaluation (11s)
    Finished test(python3.6): pyspark.ml.clustering (20s)
    Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
    Finished test(python3.6): pyspark.streaming.tests (240s)
    Finished test(python3.6): pyspark.tests (240s)
    Finished test(python3.6): pyspark.ml.recommendation (19s)
    Finished test(python3.6): pyspark.ml.feature (36s)
    Finished test(python3.6): pyspark.ml.regression (37s)
    Finished test(python3.6): pyspark.ml.tuning (28s)
    Finished test(python3.6): pyspark.mllib.classification (26s)
    Finished test(python3.6): pyspark.mllib.evaluation (18s)
    Finished test(python3.6): pyspark.mllib.clustering (44s)
    Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
    Finished test(python3.6): pyspark.mllib.feature (26s)
    Finished test(python3.6): pyspark.mllib.fpm (23s)
    Finished test(python3.6): pyspark.mllib.random (8s)
    Finished test(python3.6): pyspark.ml.tests (92s)
    Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
    Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
    Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
    Finished test(python3.6): pyspark.mllib.recommendation (24s)
    Finished test(python3.6): pyspark.mllib.regression (26s)
    Finished test(python3.6): pyspark.profiler (9s)
    Finished test(python3.6): pyspark.mllib.tree (16s)
    Finished test(python3.6): pyspark.shuffle (1s)
    Finished test(python3.6): pyspark.mllib.util (18s)
    Finished test(python3.6): pyspark.serializers (11s)
    Finished test(python3.6): pyspark.rdd (20s)
    Finished test(python3.6): pyspark.sql.conf (8s)
    Finished test(python3.6): pyspark.sql.catalog (17s)
    Finished test(python3.6): pyspark.sql.column (18s)
    Finished test(python3.6): pyspark.sql.context (18s)
    Finished test(python3.6): pyspark.sql.group (27s)
    Finished test(python3.6): pyspark.sql.dataframe (33s)
    Finished test(python3.6): pyspark.sql.functions (35s)
    Finished test(python3.6): pyspark.sql.types (6s)
    Finished test(python3.6): pyspark.sql.streaming (13s)
    Finished test(python3.6): pyspark.streaming.util (0s)
    Finished test(python3.6): pyspark.sql.session (16s)
    Finished test(python3.6): pyspark.sql.window (4s)
    Finished test(python3.6): pyspark.sql.readwriter (35s)
    Tests passed in 433 seconds
    ```
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #16429 from HyukjinKwon/SPARK-19019.
    
    (cherry picked from commit 20e6280626fe243b170a2e7c5e018c67f3dac1db)
    Signed-off-by: Davies Liu <da...@gmail.com>

commit 13986a72024aa95f39b1d191f8e2233e995653f3
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-01-17T17:57:12Z

    [SPARK-19065][SQL] Don't inherit expression id in dropDuplicates
    
    ## What changes were proposed in this pull request?
    
    `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.
    
    ## How was this patch tested?
    
    test("SPARK-19065: dropDuplicates should not create expressions using the same id")
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16564 from zsxwing/SPARK-19065.
    
    (cherry picked from commit a83accfcfd6a92afac5040c50577258ab83d10dd)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 3ec3e3f2edf86315d7e32e96899cad279e90f1d1
Author: gatorsmile <ga...@gmail.com>
Date:   2017-01-17T18:01:30Z

    [SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec
    
    Empty partition column values are not valid in a partition specification. Before this PR, we accepted them from users; the Hive metastore does not detect and disallow them either. Thus, users hit strange behavior like the following.
    
    ```Scala
    val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
    df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
    spark.sql("alter table partitionedTable drop partition(partCol1='')")
    spark.table("partitionedTable").show()
    ```
    
    In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with empty values.
    
    When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected, and it does not follow actual Hive behavior. This PR disallows users from specifying such an invalid partition spec in the `SessionCatalog` APIs.
    
    Added test cases
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #16583 from gatorsmile/disallowEmptyPartColValue.
    
    (cherry picked from commit a23debd7bc8f85ea49c54b8cf3cd112cf0a803ff)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 29b954bba1a9fa6e3bd823fa36ea7df4c2461381
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2017-01-18T05:24:33Z

    [SPARK-19066][SPARKR][BACKPORT-2.1] LDA doesn't set optimizer correctly
    
    ## What changes were proposed in this pull request?
    Back port the fix to SPARK-19066 to 2.1 branch.
    
    ## How was this patch tested?
    Unit tests
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #16623 from wangmiao1981/bugport.

commit 77202a6c57e6ac2438cdb6bd232a187b6734fa2b
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-01-18T17:53:14Z

    [SPARK-19231][SPARKR] add error handling for download and untar for Spark release
    
    ## What changes were proposed in this pull request?
    
    When R starts as a package and needs to download the Spark release distribution, we need to handle errors for the download and untar steps and clean up afterwards; otherwise it will get stuck.
    
    ## How was this patch tested?
    
    manually
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16589 from felixcheung/rtarreturncode.
    
    (cherry picked from commit 278fa1eb305220a85c816c948932d6af8fa619aa)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 047506bae4f9a00003505ac886ba04969d8d11f5
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-01-18T18:50:51Z

    [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests
    
    ## What changes were proposed in this pull request?
    
    #16492 missed one race condition: `StreamExecution.awaitInitialization` may throw fatal errors and fail the test. This PR just ignores `StreamingQueryException` thrown from `awaitInitialization` so that we can verify the exception in the `ExpectFailure` action later. It's fine since `StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16567 from zsxwing/SPARK-19113-2.
    
    (cherry picked from commit c050c12274fba2ac4c4938c4724049a47fa59280)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 4cff0b504c367db314f10e730fe39dc083529f16
Author: Liwei Lin <lw...@gmail.com>
Date:   2017-01-18T18:52:47Z

    [SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error
    
    ## What changes were proposed in this pull request?
    
    We should call `StateStore.abort()` when any error occurs before the store is committed.
    
    ## How was this patch tested?
    
    Manually.
    
    Author: Liwei Lin <lw...@gmail.com>
    
    Closes #16547 from lw-lin/append-filter.
    
    (cherry picked from commit 569e50680f97b1ed054337a39fe198769ef52d93)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 7bc3e9ba73869c0c6cb8e754e41dbdd4740cfd07
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-12-20T04:03:33Z

    [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table
    
    ## What changes were proposed in this pull request?
    
    When we append data to an existing table with `DataFrameWriter.saveAsTable`, we will do various checks to make sure the appended data is consistent with the existing data.
    
    However, we get the information about the existing table by matching the table relation instead of looking at the table metadata. This is error-prone; e.g. we only check the number of columns for `HadoopFsRelation` and forget to check bucketing, etc.
    
    This PR refactors the error checking to look at the metadata of the existing table, and fixes several bugs (a small illustrative sketch follows the list):
    * SPARK-18899: We forget to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files.
    * SPARK-18912: We forget to check the number of columns for non-file-based data source tables.
    * SPARK-18913: We don't support appending data to a table with special column names.
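
    A minimal sketch of the kind of mismatch the refactored checks now catch (hypothetical DataFrames `df1`/`df2` and table name, assuming a running `SparkSession`):

    ```scala
    // Create a bucketed table, then try to append with a different bucketing spec;
    // the append is now rejected instead of silently producing inconsistent files.
    df1.write.bucketBy(4, "id").sortBy("id").saveAsTable("t")
    df2.write.mode("append").bucketBy(8, "id").saveAsTable("t")   // fails the check
    ```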
    
    ## How was this patch tested?
    new regression test.
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #16313 from cloud-fan/bug1.
    
    (cherry picked from commit f923c849e5b8f7e7aeafee59db598a9bf4970f50)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 482d361c36b5d0e093f931e27701fb59488ad583
Author: Tathagata Das <ta...@gmail.com>
Date:   2017-01-20T22:04:51Z

    [SPARK-19314][SS][CATALYST] Do not allow sort before aggregation in Structured Streaming plan
    
    ## What changes were proposed in this pull request?
    
    Sort in a streaming plan should be allowed only after an aggregation in Complete output mode. Currently it is incorrectly allowed anywhere in the plan, which gives unpredictable, potentially incorrect results.
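
    For illustration (hypothetical streaming DataFrame `streamingDf`), the distinction this change enforces:

    ```scala
    import org.apache.spark.sql.functions.desc

    // Allowed: sorting the result of a streaming aggregation, in Complete output mode.
    val topCounts = streamingDf.groupBy("key").count().orderBy(desc("count"))

    // Disallowed after this change: sorting a streaming plan with no prior aggregation,
    // e.g. streamingDf.orderBy("ts"), whose result is not well defined.
    ```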
    
    ## How was this patch tested?
    New test
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #16662 from tdas/SPARK-19314.
    
    (cherry picked from commit 552e5f08841828e55f5924f1686825626da8bcd0)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 4d286c903b1f88ef175209156b72ccbc3b9e8ae7
Author: Davies Liu <da...@databricks.com>
Date:   2017-01-21T00:11:40Z

    [SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join
    
    PythonUDF is unevaluable, so it cannot be used inside a join condition. Currently the optimizer will push a PythonUDF that accesses both sides of a join into the join condition, and the query then fails to plan.
    
    This PR fixes the issue by checking whether the expression is evaluable before pushing it into the Join.
    
    Add a regression test.
    
    Author: Davies Liu <da...@databricks.com>
    
    Closes #16581 from davies/pyudf_join.

commit 6f0ad575df219a58ba814fb402fbac653df46399
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-01-21T01:49:26Z

    [SPARK-19267][SS] Fix a race condition when stopping StateStore
    
    ## What changes were proposed in this pull request?
    
    There is a race condition when stopping StateStore which makes `StateStoreSuite.maintenance` flaky. `StateStore.stop` doesn't wait for the running task to finish, and an out-of-date task may fail `doMaintenance` and cancel the new task. Here is a reproducer: https://github.com/zsxwing/spark/commit/dde1b5b106ba034861cf19e16883cfe181faa6f3
    
    This PR adds MaintenanceTask to eliminate the race condition.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #16627 from zsxwing/SPARK-19267.
    
    (cherry picked from commit ea31f92bb8554a901ff5b48986097a2642c64399)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 8daf10e3f499a32493fb8a84369f7c4c74d65ff8
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-01-22T05:15:57Z

    [SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should be case insensitive
    
    ## What changes were proposed in this pull request?
    MLlib ```GeneralizedLinearRegression``` ```family``` and ```link``` should be case insensitive. This is consistent with some other MLlib params such as [```featureSubsetStrategy```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L415).
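
    For illustration, a minimal sketch of the usage this change accepts (parameter values chosen arbitrarily):

    ```scala
    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    // Mixed-case family/link names such as "Binomial" / "Logit" are now accepted.
    val glr = new GeneralizedLinearRegression()
      .setFamily("Binomial")
      .setLink("Logit")
    ```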
    
    ## How was this patch tested?
    Update corresponding tests.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #16516 from yanboliang/spark-19133.
    
    (cherry picked from commit 3dcad9fab17297f9966026f29fefb5c726965a13)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit 1e07a71924ef1420c96a3a0a8cb5be2f3a830037
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-01-23T08:53:44Z

    [SPARK-19155][ML] Make family case insensitive in GLM
    
    ## What changes were proposed in this pull request?
    This is a supplement to PR #16516, which did not make the value from `getFamily` case insensitive. Current tests of Poisson/Binomial GLMs with weights fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of the family retrieved from `getFamily`:
    ```
    model.getFamily == Binomial.name || model.getFamily == Poisson.name
    ```
    
    ## How was this patch tested?
    Update existing tests for 'Poisson' and 'Binomial'.
    
    yanboliang felixcheung imatiach-msft
    
    Author: actuaryzhang <ac...@gmail.com>
    
    Closes #16675 from actuaryzhang/family.
    
    (cherry picked from commit f067acefabebf04939d03a639a2aaa654e1bc8f9)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit ed5d1e7251142e9e3f4e5e2783118bde38ac192c
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-01-23T21:36:41Z

    [SPARK-19306][CORE] Fix inconsistent state in DiskBlockObjectWriter when an exception occurs
    
    ## What changes were proposed in this pull request?
    
    In `DiskBlockObjectWriter`, when an error happens during writing, it calls `revertPartialWritesAndClose`. If this method then fails as well, due to issues such as running out of disk space, it throws an exception without resetting the writer's state, and the revert is skipped. This PR fixes the issue to give the user a chance to recover from such failures.
    
    ## How was this patch tested?
    
    Existing test.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #16657 from jerryshao/SPARK-19306.
    
    (cherry picked from commit e4974721f33e64604501f673f74052e11920d438)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 4a2be09023fb80db44775d05822e202104f29e75
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-01-24T06:20:42Z

    [SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison between ScalaUDF
    
    ## What changes were proposed in this pull request?
    
    Currently, running the following code in Java
    
    ```java
    spark.udf().register("inc", new UDF1<Long, Long>() {
      @Override
      public Long call(Long i) {
        return i + 1;
      }
    }, DataTypes.LongType);
    
    spark.range(10).toDF("x").createOrReplaceTempView("tmp");
    Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
    Assert.assertEquals(7, result.getLong(0));
    ```
    
    fails as below:
    
    ```
    org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
    Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
    +- SubqueryAlias tmp, `tmp`
       +- Project [id#16L AS x#19L]
          +- Range (0, 10, step=1, splits=Some(8))
    
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
    	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
    ```
    
    The root cause is that we were creating the function anew every time it needed to be built, as shown below:
    
    ```scala
    scala> def inc(i: Int) = i + 1
    inc: (i: Int)Int
    
    scala> (inc(_: Int)).hashCode
    res15: Int = 1231799381
    
    scala> (inc(_: Int)).hashCode
    res16: Int = 2109839984
    
    scala> (inc(_: Int)) == (inc(_: Int))
    res17: Boolean = false
    ```
    
    This seems to lead to comparison failures between `ScalaUDF`s created from the Java UDF API, for example, in `Expression.semanticEquals`.
    
    In the case of the Scala API, it already seems fine.
    
    Both can be tested easily as below if any reviewer is more comfortable with Scala:
    
    ```scala
    val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
    val javaUDF = new UDF1[Int, Int]  {
      override def call(i: Int): Int = i + 1
    }
    // spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
    // spark.udf.register("inc", (i: Int) => i + 1)    // Uncomment this for Scala API
    df.createOrReplaceTempView("tmp")
    spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()
    ```
    
    ## How was this patch tested?
    
    Unit test in `JavaUDFSuite.java` and `./dev/lint-java`.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #16553 from HyukjinKwon/SPARK-9435.
    
    (cherry picked from commit e576c1ed793fe8ac6e65381dc0635413cc18470f)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 570e5e11dfd5d9fa3ee6caae3bba85d53ceac4e8
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-01-24T06:30:51Z

    [SPARK-19268][SS] Disallow adaptive query execution for streaming queries
    
    ## What changes were proposed in this pull request?
    
    As adaptive query execution may change the number of partitions in different batches, it may break streaming queries. Hence, we should disallow this feature in Structured Streaming.
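
    For reference, adaptive execution is gated by the existing SQL conf, shown here only to identify the feature being disallowed for streaming (assuming a running `SparkSession` named `spark`):

    ```scala
    // Batch queries may enable adaptive execution; after this change streaming
    // queries disallow it, since it can change the number of shuffle partitions
    // between micro-batches.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    ```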
    
    ## How was this patch tested?
    
    `test("SPARK-19268: Adaptive query execution should be disallowed")`.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16683 from zsxwing/SPARK-19268.
    
    (cherry picked from commit 60bd91a34078a9239fbf5e8f49ce8b680c11635d)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 9c04e427d0a4b99bfdb6af1ea1bc8c4bdaee724e
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-01-24T08:23:23Z

    [SPARK-18823][SPARKR] add support for assigning to column
    
    ## What changes were proposed in this pull request?
    
    Support for
    ```
    df[[myname]] <- 1
    df[[2]] <- df$eruptions
    ```
    
    ## How was this patch tested?
    
    manual tests, unit tests
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16663 from felixcheung/rcolset.
    
    (cherry picked from commit f27e024768e328b96704a9ef35b77381da480328)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit d128b6a39ebafd56041e1fb44d71c61033ae6f8e
Author: Ilya Matiach <il...@microsoft.com>
Date:   2017-01-23T21:34:27Z

    [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case
    
    [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments
    
    ## What changes were proposed in this pull request?
    
    Fix a bug in which BisectingKMeans fails with error:
    java.util.NoSuchElementException: key not found: 166
            at scala.collection.MapLike$class.default(MapLike.scala:228)
            at scala.collection.AbstractMap.default(Map.scala:58)
            at scala.collection.MapLike$class.apply(MapLike.scala:141)
            at scala.collection.AbstractMap.apply(Map.scala:58)
            at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
            at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
            at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
            at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
            at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
            at scala.collection.immutable.List.foldLeft(List.scala:84)
            at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
            at scala.collection.immutable.List.reduceLeft(List.scala:84)
            at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
            at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
            at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
            at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
            at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
            at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    
    ## How was this patch tested?
    
    The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.
    
    Author: Ilya Matiach <il...@microsoft.com>
    
    Closes #16355 from imatiach-msft/ilmat/fix-kmeans.

commit b94fb284b93c763cf6e604705509a4e970d6ce6e
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2017-01-24T22:31:06Z

    [SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.
    
    Example: the query

    ```sql
    select a1,b1
    from   t1
    where  (a1,b1) not in (select a2,b2
                           from   t2);
    ```

    has its null-aware (a1, b1) = (a2, b2) comparison rewritten from (before this fix):

    ```
    Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
    ```

    to (after this fix):

    ```
    Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
    ```
    
    ## How was this patch tested?
    
    sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
    
    Author: Nattavut Sutyanyong <ns...@gmail.com>
    
    Closes #16467 from nsyca/19017.
    
    (cherry picked from commit cdb691eb4da5dbf52dccf1da0ae57a9b1874f010)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit c133787965e65e19c0aab636c941b5673e6a68e5
Author: Liwei Lin <lw...@gmail.com>
Date:   2017-01-25T00:36:17Z

    [SPARK-19330][DSTREAMS] Also show tooltip for successful batches
    
    ## What changes were proposed in this pull request?
    
    ### Before
    ![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)
    
    ### After
    ![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)
    
    ## How was this patch tested?
    
    Manually
    
    Author: Liwei Lin <lw...@gmail.com>
    
    Closes #16673 from lw-lin/streaming.
    
    (cherry picked from commit 40a4cfc7c7911107d1cf7a2663469031dcf1f576)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18163: Branch 2.1

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18163
  
    Can one of the admins verify this patch?



[GitHub] spark pull request #18163: Branch 2.1

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18163



[GitHub] spark issue #18163: Branch 2.1

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18163
  
    @ShobhaEndigeri it looks like this PR was opened by mistake. Could you close it, please?

