Posted to reviews@spark.apache.org by sohum2002 <gi...@git.apache.org> on 2017/10/06 11:35:11 UTC

[GitHub] spark pull request #19446: Dataset optimization

GitHub user sohum2002 opened a pull request:

    https://github.com/apache/spark/pull/19446

    Dataset optimization

The two proposed additional functions help select all the columns in a Dataset except for the given columns.
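
As an illustration of the intent, here is a minimal sketch of such a helper in Scala (the name `selectExcept` and its signature are hypothetical, not necessarily the API proposed in this PR):

```scala
import org.apache.spark.sql.DataFrame

// Keep every column of the DataFrame except those listed in `except`.
def selectExcept(df: DataFrame, except: String*): DataFrame = {
  val remaining = df.columns.filterNot(except.contains)
  df.select(remaining.map(df.col): _*)
}

// Hypothetical usage: keep all columns of df other than "id" and "tmp".
// val trimmed = selectExcept(df, "id", "tmp")
```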


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sohum2002/spark dataset_optimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19446.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19446
    
----
commit 0e80ecae300f3e2033419b2d98da8bf092c105bb
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-07-10T05:53:27Z

    [SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for Dataset.summary
    
    ## What changes were proposed in this pull request?
    
    Some code cleanup and additional comments to make the code more readable. Changed the way result rows are generated to make it clearer.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #18570 from cloud-fan/summary.

commit 96d58f285bc98d4c2484150eefe7447db4784a86
Author: Eric Vandenberg <er...@fb.com>
Date:   2017-07-10T06:40:20Z

    [SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting
    
    ## What changes were proposed in this pull request?
    
    There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask); as a result, the task might re-execute on the same executor.  This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure).  Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed.  Sample logs showing the issue are in https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
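
    A minimal, self-contained sketch of the corrected ordering (the class and its state below are simplified stand-ins for illustration, not Spark's actual TaskSetManager internals):

    ```scala
    import scala.collection.mutable

    class MiniTaskSetManager {
      private val pendingTasks = mutable.Queue.empty[Long]
      private val blacklistedExecutors = mutable.Set.empty[String]

      private def updateBlacklistForFailedTask(executorId: String): Unit =
        blacklistedExecutors += executorId

      private def addPendingTask(taskId: Long): Unit =
        pendingTasks.enqueue(taskId)

      // A resource offer from a blacklisted executor must not be matched with pending tasks.
      def canRunOn(executorId: String): Boolean = !blacklistedExecutors.contains(executorId)

      def handleFailedTask(taskId: Long, executorId: String): Unit = {
        // Blacklist the executor *before* the task becomes schedulable again, so a
        // concurrent offer from the same executor cannot pick up the retry.
        updateBlacklistForFailedTask(executorId)
        addPendingTask(taskId)
      }
    }
    ```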
    
    ## How was this patch tested?
    
    Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks.  The unit test fails without the fix and passes with it.
    
    Author: Eric Vandenberg <er...@fb.com>
    
    Closes #18427 from ericvandenbergfb/blacklistFix.

commit c444d10868c808f4ae43becd5506bf944d9c2e9b
Author: Dongjoon Hyun <do...@apache.org>
Date:   2017-07-10T06:46:47Z

    [MINOR][DOC] Remove obsolete `ec2-scripts.md`
    
    ## What changes were proposed in this pull request?
    
    Since this document became obsolete, we had better remove it for Apache Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016, and it is currently just a redirection page. The only reference on the Apache Spark website will go directly to the destination via https://github.com/apache/spark-website/pull/54.
    
    ## How was this patch tested?
    
    N/A. This is a removal of documentation.
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.

commit 647963a26a2d4468ebd9b68111ebe68bee501fde
Author: Takeshi Yamamuro <ya...@apache.org>
Date:   2017-07-10T07:58:34Z

    [SPARK-20460][SQL] Make it more consistent to handle column name duplication
    
    ## What changes were proposed in this pull request?
    This PR makes the handling of column name duplication more consistent. In the current master, error handling differs when column name duplication is hit:
    ```
    // json
    scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
    scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
    scala> spark.read.format("json").schema(schema).load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
    
    scala> spark.read.format("json").load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
      at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
      at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
      at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
    
    // csv
    scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
    scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
    scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.;
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
    
    // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
    scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
    +---+---+
    | a0| a1|
    +---+---+
    |  1|  1|
    +---+---+
    
    // parquet
    scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
    scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
    scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.;
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    ```
    When this patch is applied, the results change to:
    ```
    
    // json
    scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
    scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
    scala> spark.read.format("json").schema(schema).load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
      at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
      at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
    
    scala> spark.read.format("json").load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
      at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
      at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
    
    // csv
    scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
    scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
    scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
      at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
      at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    
    scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
    +---+---+
    | a0| a1|
    +---+---+
    |  1|  1|
    +---+---+
    
    // parquet
    scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
    scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
    scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
    org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
      at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
      at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
    ```
    
    ## How was this patch tested?
    Added tests in `DataFrameReaderWriterSuite` and `SQLQueryTestSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #17758 from maropu/SPARK-20460.

commit 6a06c4b03c4dd86241fb9d11b4360371488f0e53
Author: jinxing <ji...@126.com>
Date:   2017-07-10T13:06:58Z

    [SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher.
    
    ## What changes were proposed in this pull request?
    
    When `RetryingBlockFetcher` retries fetching blocks, there could be two `DownloadCallback`s downloading the same content to the same target file, which could cause `ShuffleBlockFetcherIterator` to read a partial result.
    
    This PR proposes to create and delete the tmp files in `OneForOneBlockFetcher`.
    
    Author: jinxing <ji...@126.com>
    Author: Shixiong Zhu <zs...@gmail.com>
    
    Closes #18565 from jinxing64/SPARK-21342.

commit 18b3b00ecfde6c694fb6fee4f4d07d04e3d08ccf
Author: Juliusz Sompolski <ju...@databricks.com>
Date:   2017-07-10T16:26:42Z

    [SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows
    
    ## What changes were proposed in this pull request?
    
    Updating the numOutputRows metric was missing from one return path of LeftAnti SortMergeJoin.
    
    ## How was this patch tested?
    
    Non-zero output rows manually seen in metrics.
    
    Author: Juliusz Sompolski <ju...@databricks.com>
    
    Closes #18494 from juliuszsompolski/SPARK-21272.

commit 2bfd5accdce2ae31feeeddf213a019cf8ec97663
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-07-10T17:40:03Z

    [SPARK-21266][R][PYTHON] Support schema in a DDL-formatted string in dapply/gapply/from_json
    
    ## What changes were proposed in this pull request?
    
    This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.
    
    Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.
    
    **Python**
    
    `from_json`
    
    ```python
    from pyspark.sql.functions import from_json
    
    data = [(1, '''{"a": 1}''')]
    df = spark.createDataFrame(data, ("key", "value"))
    df.select(from_json(df.value, "a INT").alias("json")).show()
    ```
    
    **R**
    
    `from_json`
    
    ```R
    df <- sql("SELECT named_struct('name', 'Bob') as people")
    df <- mutate(df, people_json = to_json(df$people))
    head(select(df, from_json(df$people_json, "name STRING")))
    ```
    
    `structType.character`
    
    ```R
    structType("a STRING, b INT")
    ```
    
    `dapply`
    
    ```R
    dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
    ```
    
    `gapply`
    
    ```R
    gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
    ```
    
    ## How was this patch tested?
    
    Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #18498 from HyukjinKwon/SPARK-21266.

commit d03aebbe6508ba441dc87f9546f27aeb27553d77
Author: Bryan Cutler <cu...@gmail.com>
Date:   2017-07-10T22:21:03Z

    [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
    
    ## What changes were proposed in this pull request?
    Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`.  This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads, where they are combined and converted to a Pandas DataFrame.  Data types except complex, date, timestamp, and decimal are currently supported; otherwise an `UnsupportedOperation` exception is thrown.
    
    Additions to Spark include a Scala package-private method `Dataset.toArrowPayload` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads, and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (the old conversion is used by default).
    
    ## How was this patch tested?
    Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types.  The suite generates a Dataset and matching Arrow JSON data, then the Dataset is converted to an Arrow payload and finally validated against the JSON data.  This ensures that the schema and data have been converted correctly.
    
    Added PySpark tests to verify the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
    
    Author: Bryan Cutler <cu...@gmail.com>
    Author: Li Jin <ic...@gmail.com>
    Author: Li Jin <li...@twosigma.com>
    Author: Wes McKinney <we...@twosigma.com>
    
    Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.

commit c3713fde86204bf3f027483914ff9e60e7aad261
Author: chie8842 <ch...@gmail.com>
Date:   2017-07-11T01:56:54Z

    [SPARK-21358][EXAMPLES] Argument of repartitionandsortwithinpartitions at pyspark
    
    ## What changes were proposed in this pull request?
    In the example of repartitionAndSortWithinPartitions in rdd.py, the third argument should be True or False.
    This patch fixes the example code accordingly.
    
    ## How was this patch tested?
    * I renamed test_repartitionAndSortWithinPartitions to test_repartitionAndSortWithinPartitions_asc to specify the boolean argument.
    * I added test_repartitionAndSortWithinPartitions_desc to test the False pattern for the third argument.
    
    Author: chie8842 <ch...@gmail.com>
    
    Closes #18586 from chie8842/SPARK-21358.

commit a2bec6c92a063f4a8e9ed75a9f3f06808485b6d7
Author: Takeshi Yamamuro <ya...@apache.org>
Date:   2017-07-11T03:16:29Z

    [SPARK-21043][SQL] Add unionByName in Dataset
    
    ## What changes were proposed in this pull request?
    This PR adds `unionByName` to `Dataset`.
    Here is how to use it:
    ```
    val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
    val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
    df1.unionByName(df2).show
    
    // output:
    // +----+----+----+
    // |col0|col1|col2|
    // +----+----+----+
    // |   1|   2|   3|
    // |   6|   4|   5|
    // +----+----+----+
    ```
    
    ## How was this patch tested?
    Added tests in `DataFrameSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #18300 from maropu/SPARK-21043-2.

commit 1471ee7af5a9952b60cf8c56d60cb6a7ec46cc69
Author: gatorsmile <ga...@gmail.com>
Date:   2017-07-11T03:19:59Z

    [SPARK-21350][SQL] Fix the error message when the number of arguments is wrong when invoking a UDF
    
    ### What changes were proposed in this pull request?
    Users get a very confusing error when they specify the wrong number of parameters.
    ```Scala
        val df = spark.emptyDataFrame
        spark.udf.register("foo", (_: String).length)
        df.selectExpr("foo(2, 3, 4)")
    ```
    ```
    org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3
    java.lang.ClassCastException: org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3
    	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:109)
    ```
    
    This PR is to capture the exception and issue an error message that is consistent with what we did for built-in functions. After the fix, the error message is improved to
    ```
    Invalid number of arguments for function foo; line 1 pos 0
    org.apache.spark.sql.AnalysisException: Invalid number of arguments for function foo; line 1 pos 0
    	at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:119)
    ```
    
    ### How was this patch tested?
    Added a test case
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #18574 from gatorsmile/statsCheck.

commit 833eab2c9bd273ee9577fbf9e480d3e3a4b7d203
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-07-11T03:26:17Z

    [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*
    
    ## What changes were proposed in this pull request?
    
    Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #18593 from zsxwing/SPARK-21369.

commit 97a1aa2c70b1bf726d5f572789e150d168ac61e5
Author: jinxing <ji...@126.com>
Date:   2017-07-11T03:47:47Z

    [SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
    
    ## What changes were proposed in this pull request?
    
    In the current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it iterates from the start to the lower bound every time the `write` method is called. When traversing the iterator, it is possible to skip some spilled files and thus save some time.
    
    ## How was this patch tested?
    
    Added unit test
    
    Did a small test for benchmark:
    
    Put 2000200 rows into `UnsafeExternalSorter` -- 2 spill files (each containing 1000000 rows) and an inMemSorter containing 200 rows.
    Move the iterator forward to index=2000001.
    
    *With this change*:
    `getIterator(2000001)` will cost almost 0ms~1ms;
    *Without this change*:
    `for(int i=0; i<2000001; i++) getIterator().loadNext()` will cost 300ms.
    
    Author: jinxing <ji...@126.com>
    
    Closes #18541 from jinxing64/SPARK-21315.

commit d4d9e17b31daab9d0c07de24383abd585145fd9c
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-07-11T06:23:03Z

    [SPARK-20456][PYTHON][FOLLOWUP] Fix timezone-dependent doctests in unix_timestamp and from_unixtime
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to simply ignore the results in examples that are timezone-dependent in `unix_timestamp` and `from_unixtime`.
    
    ```
    Failed example:
        time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
    Expected:
        [Row(unix_time=1428476400)]
    Got:
        [Row(unix_time=1428418800)]
    ```
    
    ```
    Failed example:
        time_df.select(from_unixtime('unix_time').alias('ts')).collect()
    Expected:
        [Row(ts=u'2015-04-08 00:00:00')]
    Got:
        [Row(ts=u'2015-04-08 16:00:00')]
    ```
    
    ## How was this patch tested?
    
    Manually tested and `./run-tests --modules pyspark-sql`.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #18597 from HyukjinKwon/SPARK-20456.

commit a4baa8f48fe611e5c4d147f22fb2bb4c78d58a09
Author: Michael Allman <mi...@videoamp.com>
Date:   2017-07-11T06:50:11Z

    [SPARK-20331][SQL] Enhanced Hive partition pruning predicate pushdown
    
    (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20331)
    
    ## What changes were proposed in this pull request?
    
    Spark 2.1 introduced scalable support for Hive tables with huge numbers of partitions. Key to leveraging this support is the ability to prune unnecessary table partitions to answer queries. Spark supports a subset of the class of partition pruning predicates that the Hive metastore supports. If a user writes a query with a partition pruning predicate that is *not* supported by Spark, Spark falls back to loading all partitions and pruning client-side. We want to broaden Spark's current partition pruning predicate pushdown capabilities.
    
    One of the key missing capabilities is support for disjunctions. For example, for a table partitioned by date, writing a query with a predicate like
    
        date = 20161011 or date = 20161014
    
    will result in Spark fetching all partitions. For a table partitioned by date and hour, querying a range of hours across dates can be quite difficult to accomplish without fetching all partition metadata.
    
    The current partition pruning implementation supports only comparisons against literals. We can expand that to foldable expressions by evaluating them at planning time.
    
    We can also implement support for the "IN" comparison by expanding it to a sequence of "OR"s.
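
    For illustration, a sketch of queries that benefit (the table `logs` and its integer `date` partition column are hypothetical):

    ```scala
    // With this change, a disjunction on the partition column can be pushed to the
    // Hive metastore instead of forcing Spark to fetch all partition metadata.
    spark.sql("SELECT * FROM logs WHERE date = 20161011 OR date = 20161014")

    // An IN predicate is handled by expanding it into a sequence of ORs.
    spark.sql("SELECT * FROM logs WHERE date IN (20161011, 20161014)")
    ```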
    
    ## How was this patch tested?
    
    The `HiveClientSuite` and `VersionsSuite` were refactored and simplified to make Hive client-based, version-specific testing more modular and conceptually simpler. There are now two Hive test suites: `HiveClientSuite` and `HivePartitionFilteringSuite`. These test suites have a single-argument constructor taking a `version` parameter. As such, these test suites cannot be run by themselves. Instead, they have been bundled into "aggregation" test suites which run each suite for each Hive client version. These aggregation suites are called `HiveClientSuites` and `HivePartitionFilteringSuites`. The `VersionsSuite` and `HiveClientSuite` have been refactored into each of these aggregation suites, respectively.
    
    `HiveClientSuite` and `HivePartitionFilteringSuite` subclass a new abstract class, `HiveVersionSuite`. `HiveVersionSuite` collects functionality related to testing a single Hive version and overrides relevant test suite methods to display version-specific information.
    
    A new trait, `HiveClientVersions`, has been added with a sequence of Hive test versions.
    
    Author: Michael Allman <mi...@videoamp.com>
    
    Closes #17633 from mallman/spark-20331-enhanced_partition_pruning_pushdown.

commit 7514db1deca22b44ce18e4c571275ce79addc100
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-07-11T10:11:08Z

    [SPARK-21263][SQL] Do not allow partially parsing double and floats via NumberFormat in CSV
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to remove the use of `NumberFormat.parse` to disallow cases of partially parsed data. For example:
    
    ```
    scala> spark.read.schema("a DOUBLE").option("mode", "FAILFAST").csv(Seq("10u12").toDS).show()
    +----+
    |   a|
    +----+
    |10.0|
    +----+
    ```
    
    ## How was this patch tested?
    
    Unit tests added in `UnivocityParserSuite` and `CSVSuite`.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #18532 from HyukjinKwon/SPARK-21263.

commit 66d21686556681457aab6e44e19f5614c5635f0c
Author: Xingbo Jiang <xi...@databricks.com>
Date:   2017-07-11T13:52:54Z

    [SPARK-21366][SQL][TEST] Add sql test for window functions
    
    ## What changes were proposed in this pull request?
    
    Add a SQL test for window functions, and also remove unnecessary test cases in `WindowQuerySuite`.
    
    ## How was this patch tested?
    
    Added `window.sql` and the corresponding output file.
    
    Author: Xingbo Jiang <xi...@databricks.com>
    
    Closes #18591 from jiangxb1987/window.

commit ebc124d4c44d4c84f7868f390f778c0ff5cd66cb
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-07-11T14:03:10Z

    [SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition
    
    ## What changes were proposed in this pull request?
    
    This PR deals with four points as below:
    
    - Reuse existing DDL parser APIs rather than reimplementing within PySpark
    
    - Support a DDL-formatted string, `field type, field type`.
    
    - Support case-insensitivity for parsing.
    
    - Support nested data types as below:
    
      **Before**
      ```
      >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
      ...
      ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
      ```
    
      ```
      >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
      ...
      ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
      ```
    
      ```
      >>> spark.createDataFrame([[1]], "a int").show()
      ...
      ValueError: Could not parse datatype: a int
      ```
    
      **After**
      ```
      >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
      +---+
      |  a|
      +---+
      |[1]|
      +---+
      ```
    
      ```
      >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
      +---+
      |  a|
      +---+
      |[1]|
      +---+
      ```
    
      ```
      >>> spark.createDataFrame([[1]], "a int").show()
      +---+
      |  a|
      +---+
      |  1|
      +---+
      ```
    
    ## How was this patch tested?
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #18590 from HyukjinKwon/deduplicate-python-ddl.

commit 1cad31f00644d899d8e74d58c6eb4e9f72065473
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2017-07-11T18:25:40Z

    [SPARK-16019][YARN] Use separate RM poll interval when starting client AM.
    
    Currently the code monitoring the launch of the client AM uses the value of
    spark.yarn.report.interval as the interval for polling the RM; if someone
    sets that value to a really large interval, it would take that long to detect
    that the client AM has started, which is not expected.
    
    Instead, have a separate config for the interval to use when the client AM is
    starting. The other config is still used in cluster mode, and to detect the
    status of the client AM after it is already running.
    
    Tested by running client and cluster mode apps with a modified value of
    spark.yarn.report.interval, verifying client AM launch is detected before
    that interval elapses.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #18380 from vanzin/SPARK-16019.

commit d3e071658f931f601cd4caaf00997ae411593a44
Author: gatorsmile <ga...@gmail.com>
Date:   2017-07-11T22:44:29Z

    [SPARK-19285][SQL] Implement UDF0
    
    ### What changes were proposed in this pull request?
    This PR implements UDF0. `UDF0` is needed when users need to implement a Java UDF with no arguments.
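
    A sketch of registering a zero-argument Java UDF from Scala, assuming the new `UDF0` interface is registered through `spark.udf.register` like the other `UDFn` interfaces (the function name and return value here are illustrative):

    ```scala
    import org.apache.spark.sql.api.java.UDF0
    import org.apache.spark.sql.types.StringType

    // A Java-style UDF that takes no arguments and always returns the same string.
    spark.udf.register("greeting", new UDF0[String] {
      override def call(): String = "hello"
    }, StringType)

    spark.sql("SELECT greeting()").show()
    ```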
    
    ### How was this patch tested?
    Added a test case
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #18598 from gatorsmile/udf0.

commit 2cbfc975ba937a4eb761de7a6473b7747941f386
Author: Jane Wang <ja...@fb.com>
Date:   2017-07-12T05:00:36Z

    [SPARK-12139][SQL] REGEX Column Specification
    
    ## What changes were proposed in this pull request?
    Hive interprets regular expressions, e.g., `(a)?+.+`, in query specifications. This PR enables Spark to support this feature when hive.support.quoted.identifiers is set to true.
    
    ## How was this patch tested?
    
    - Added unit tests in SQLQuerySuite.scala
    - Ran spark-shell and tested the originally failing query:
    scala> hc.sql("SELECT `(a|b)?+.+` from test1").collect.foreach(println)
    
    Author: Jane Wang <ja...@fb.com>
    
    Closes #18023 from janewangfb/support_select_regex.

commit 24367f23f77349a864da340573e39ab2168c5403
Author: liuzhaokun <li...@zte.com.cn>
Date:   2017-07-12T06:02:20Z

    [SPARK-21382] The note about Scala 2.10 in building-spark.md is wrong.
    
    [https://issues.apache.org/jira/browse/SPARK-21382](https://issues.apache.org/jira/browse/SPARK-21382)
    There should be "Note that support for Scala 2.10 is deprecated as of Spark 2.1.0 and may be removed in Spark 2.3.0", right?
    
    Author: liuzhaokun <li...@zte.com.cn>
    
    Closes #18606 from liu-zhaokun/new07120923.

commit e16e8c7ad31762aaca5e2bc874de1540af9cc4b7
Author: Devaraj K <de...@apache.org>
Date:   2017-07-12T07:14:58Z

    [SPARK-21146][CORE] Master/Worker should handle and shutdown when any thread gets UncaughtException
    
    ## What changes were proposed in this pull request?
    
    Adding the default UncaughtExceptionHandler to the Worker.
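
    As a generic illustration of the mechanism (a sketch, not the actual Spark handler), installing a default handler looks like this:

    ```scala
    // Any thread that dies with an uncaught exception now triggers a controlled
    // shutdown instead of failing silently.
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"Uncaught exception in thread ${t.getName}")
        e.printStackTrace()
        sys.exit(1)
      }
    })
    ```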
    
    ## How was this patch tested?
    
    I verified it manually: when any of the worker threads gets an uncaught exception, the default UncaughtExceptionHandler handles it.
    
    Author: Devaraj K <de...@apache.org>
    
    Closes #18357 from devaraj-kavali/SPARK-21146.

commit e0af76a36a67d409776bd379c6d6ef6d60356c06
Author: Burak Yavuz <br...@gmail.com>
Date:   2017-07-12T07:39:09Z

    [SPARK-21370][SS] Add test for state reliability when one read-only state store aborts after read-write state store commits
    
    ## What changes were proposed in this pull request?
    
    During Streaming Aggregation, we have two StateStores per task, one used as read-only in
    `StateStoreRestoreExec`, and one read-write used in `StateStoreSaveExec`. `StateStore.abort`
    will be called for these StateStores if they haven't committed their results. We need to
    make sure that `abort` in read-only store after a `commit` in the read-write store doesn't
    accidentally lead to the deletion of state.
    
    This PR adds a test for this condition.
    
    ## How was this patch tested?
    
    This PR adds a test.
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #18603 from brkyvz/ss-test.

commit f587d2e3fa133051a64e4ec1aa788b554b552690
Author: Xiao Li <ga...@gmail.com>
Date:   2017-07-12T07:48:44Z

    [SPARK-20842][SQL] Upgrade to 1.2.2 for Hive Metastore Client 1.2
    
    ### What changes were proposed in this pull request?
    The Hive 1.2.2 release is available. Below is the list of bugs fixed in 1.2.2:
    
    https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332952&styleName=Text&projectId=12310843
    
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <ga...@gmail.com>
    
    Closes #18063 from gatorsmile/upgradeHiveClientTo1.2.2.

commit 5ed134ee213060882c6e3ed713473fa6cc158d36
Author: Peng Meng <pe...@intel.com>
Date:   2017-07-12T10:02:04Z

    [SPARK-21305][ML][MLLIB] Add options to disable multi-threading of native BLAS
    
    ## What changes were proposed in this pull request?
    
    Many ML/MLlib algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to improve performance.
    Many popular native BLAS libraries, like Intel MKL and OpenBLAS, use multi-threading, which conflicts with Spark's own task-level parallelism.  Spark should provide options to disable multi-threading of native BLAS (a sketch follows the links below).
    
    https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
    https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
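
    As a sketch of the idea, following the guidance in the links above (the exact option added by this PR may differ), the native libraries' thread counts can be pinned through executor environment variables:

    ```scala
    import org.apache.spark.SparkConf

    // Ask OpenBLAS / MKL to use a single thread per process so BLAS threading
    // does not compete with Spark's own task parallelism.
    val conf = new SparkConf()
      .setExecutorEnv("OPENBLAS_NUM_THREADS", "1")
      .setExecutorEnv("MKL_NUM_THREADS", "1")
    ```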
    
    ## How was this patch tested?
    The existing UT.
    
    Author: Peng Meng <pe...@intel.com>
    
    Closes #18551 from mpjlu/optimzeBLAS.

commit aaad34dc2f537f7eef50fc5f72a7f178800e8d38
Author: liuxian <li...@zte.com.cn>
Date:   2017-07-12T10:51:19Z

    [SPARK-21007][SQL] Add SQL function - RIGHT && LEFT
    
    ## What changes were proposed in this pull request?
    Add SQL functions RIGHT and LEFT, the same as in MySQL (a usage sketch follows the links below):
    https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left
    https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right
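
    A quick usage sketch, assuming the MySQL semantics described in the links above:

    ```scala
    // LEFT(str, n) returns the leftmost n characters; RIGHT(str, n) the rightmost n,
    // so this yields "Spark" and "SQL".
    spark.sql("SELECT left('Spark SQL', 5), right('Spark SQL', 3)").show()
    ```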
    
    ## How was this patch tested?
    unit test
    
    Author: liuxian <li...@zte.com.cn>
    
    Closes #18228 from 10110346/lx-wip-0607.

commit d2d2a5de186ddf381d0bdb353b23d64ff0224e7f
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2017-07-12T14:09:03Z

    [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
    
    ## What changes were proposed in this pull request?
    1. Make HasHandleInvalid support overriding its handleInvalid param.
    2. Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid (a usage sketch follows below).
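
    A brief usage sketch of the shared handleInvalid param (the column names are made up; each estimator documents its own accepted values):

    ```scala
    import org.apache.spark.ml.feature.{Bucketizer, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip")   // drop rows with labels unseen during fitting

    val bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("bucket")
      .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
      .setHandleInvalid("skip")   // drop rows whose value is NaN
    ```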
    
    ## How was this patch tested?
    existing tests
    
    [JIRA](https://issues.apache.org/jira/browse/SPARK-18619)
    
    Author: Zheng RuiFeng <ru...@foxmail.com>
    
    Closes #18582 from zhengruifeng/heritate_HasHandleInvalid.

commit 780586a9f2400c3fdfdb9a6b954001a3c9663941
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-07-12T16:23:54Z

    [SPARK-17701][SQL] Refactor RowDataSourceScanExec so its sameResult call does not compare strings
    
    ## What changes were proposed in this pull request?
    
    Currently, `RowDataSourceScanExec` and `FileSourceScanExec` rely on a "metadata" string map to implement equality comparison, since the RDDs they depend on cannot be directly compared. This has resulted in a number of correctness bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818.
    
    To make these comparisons less brittle, we should refactor these classes to compare constructor parameters directly instead of relying on the metadata map.
    
    This PR refactors `RowDataSourceScanExec`; `FileSourceScanExec` will be fixed in a follow-up PR.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #18600 from cloud-fan/minor.

commit e08d06b37bc96cc48fec1c5e40f73e0bca09c616
Author: Kohki Nishio <ta...@me.com>
Date:   2017-07-13T00:22:40Z

    [SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader
    
    ## What changes were proposed in this pull request?
    
    `ClassLoader` preferentially loads classes from its `parent`. Only when `parent` is null or the load fails does it call the overridden `findClass` function. To avoid potential issues caused by loading classes with an inappropriate class loader, we should set the `parent` of the `ClassLoader` to null, so that we can fully control which class loader is used.
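
    A generic illustration of that delegation behavior (a sketch, not the actual `ExecutorClassLoader` code):

    ```scala
    // With a null parent, only the bootstrap loader is consulted before our own
    // findClass, so this loader fully controls where user classes come from.
    class NullParentClassLoader extends ClassLoader(null: ClassLoader) {
      override protected def findClass(name: String): Class[_] = {
        // A real implementation would fetch and define the class bytes here
        // (e.g. classes shipped from the driver); throwing keeps the sketch minimal.
        throw new ClassNotFoundException(name)
      }
    }
    ```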
    
    This is a takeover of #17074; the primary author of this PR is taroplus.
    
    #17074 should be closed after this PR gets merged.
    
    ## How was this patch tested?
    
    Add test case in `ExecutorClassLoaderSuite`.
    
    Author: Kohki Nishio <ta...@me.com>
    Author: Xingbo Jiang <xi...@databricks.com>
    
    Closes #18614 from jiangxb1987/executor_classloader.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19446: Dataset optimization

Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 closed the pull request at:

    https://github.com/apache/spark/pull/19446


---
