
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2014/05/13 18:52:43 UTC

[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/758

    [SPARK-1368][SQL] Optimized HiveTableScan

    JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
    
    This PR introduces two major updates:
    
    - Replaced FP-style code with a `while` loop and a reusable `GenericMutableRow` object in the critical path of `HiveTableScan`.
    - Used `ColumnProjectionUtils` to help optimize RCFile and ORC column pruning.
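    
    For illustration, the first change can be sketched roughly as follows. This is a self-contained approximation, not the code in this PR: `MutableRowSketch` is a hypothetical stand-in for `GenericMutableRow`, and the `extractors` array is an assumed substitute for the per-attribute functions in `HiveTableScan`.
    
    ```scala
    // Hypothetical stand-in for Spark SQL's GenericMutableRow; not the real class.
    final class MutableRowSketch(size: Int) {
      private val values = new Array[Any](size)
      def update(i: Int, v: Any): Unit = values(i) = v
      def apply(i: Int): Any = values(i)
      def length: Int = size
    }
    
    object ReusableRowSketch {
      // FP style: allocates a fresh Array for every input record.
      def fpStyle(
          input: Iterator[String],
          extractors: Array[String => Any]): Iterator[Array[Any]] =
        input.map(record => extractors.map(_(record)))
    
      // Imperative style: one row object is allocated up front and refilled
      // in a while loop for each record.
      def reusedRow(
          input: Iterator[String],
          extractors: Array[String => Any]): Iterator[MutableRowSketch] = {
        val row = new MutableRowSketch(extractors.length)
        input.map { record =>
          var i = 0
          while (i < extractors.length) {
            row(i) = extractors(i)(record)
            i += 1
          }
          row
        }
      }
    }
    ```
    
    Because the same row object is handed out repeatedly, downstream code has to consume or copy each row before advancing the iterator.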
    
    My quick micro benchmark suggests that, together, these two optimizations make the scan about 2x faster for the CSV table and about 2.5x faster for the RCFile table:
    
    ```
    Original:
    
    [info] CSV: 27676 ms, RCFile: 26415 ms
    [info] CSV: 27703 ms, RCFile: 26029 ms
    [info] CSV: 27511 ms, RCFile: 25962 ms
    
    Optimized:
    
    [info] CSV: 13820 ms, RCFile: 10402 ms
    [info] CSV: 14158 ms, RCFile: 10691 ms
    [info] CSV: 13606 ms, RCFile: 10346 ms
    ```
    
    The micro benchmark loads a 609MB CSV file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and into an RCFile table, then scans each table.
    
    Preparation code:
    
    ```scala
    package org.apache.spark.examples.sql.hive
    
    import org.apache.spark.sql.hive.LocalHiveContext
    import org.apache.spark.{SparkConf, SparkContext}
    
    object HiveTableScanPrepare extends App {
      val sparkContext = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
    
      val hiveContext = new LocalHiveContext(sparkContext)
    
      import hiveContext._
    
      hql("drop table scan_csv")
      hql("drop table scan_rcfile")
    
      hql("""create table scan_csv (key int, value string)
            |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
            |  with serdeproperties ('field.delim'=',')
          """.stripMargin)
    
      hql(s"""load data local inpath "${args(0)}" into table scan_csv""")
    
      hql("""create table scan_rcfile (key int, value string)
            |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
            |stored as
            |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
            |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
          """.stripMargin)
    
      hql(
        """
          |from scan_csv
          |insert overwrite table scan_rcfile
          |select scan_csv.key, scan_csv.value
        """.stripMargin)
    }
    ```
    
    Benchmark code:
    
    ```scala
    package org.apache.spark.examples.sql.hive
    
    import org.apache.spark.sql.hive.LocalHiveContext
    import org.apache.spark.{SparkConf, SparkContext}
    
    object HiveTableScanBenchmark extends App {
      val sparkContext = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
    
      val hiveContext = new LocalHiveContext(sparkContext)
    
      import hiveContext._
    
      val scanCsv = hql("select key from scan_csv")
      val scanRcfile = hql("select key from scan_rcfile")
    
      val csvDuration = benchmark(scanCsv.count())
      val rcfileDuration = benchmark(scanRcfile.count())
    
      println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")
    
      def benchmark(f: => Unit) = {
        val begin = System.currentTimeMillis()
        f
        val end = System.currentTimeMillis()
        end - begin
      }
    }
    ```
    
    @marmbrus Please help review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark fastHiveTableScan

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/758.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #758
    
----
commit 964087fd96c5f5034d79988a0d7d76733561b610
Author: Cheng Lian <li...@gmail.com>
Date:   2014-05-11T06:41:42Z

    [SPARK-1368] Optimized HiveTableScan

commit a3c272b04852bb5135847504d0ad1258fd583ec1
Author: Cheng Lian <li...@gmail.com>
Date:   2014-05-13T16:33:06Z

    Using ColumnProjectionUtils to optimise RCFile and ORC column pruning

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-42990418
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-44405869
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43173435
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43180729
  
    @marmbrus I worked around the test failure by adding a `SortedOperation` pattern that conservatively matches *some* definitely sorted operations (false negative rather than false positive). This may slow down the test suite a bit. Since most test outputs are empty or very small, this shouldn't be an issue right now.
    
    Two new optimizations applied:
    
    - Using mutable pairs
    - Avoiding pattern matching function calls (`Array.unapplySeq`)
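    
    A minimal sketch of what these two changes amount to, under stated assumptions: `MutablePairSketch` is a hypothetical stand-in for a reusable pair (Spark ships `org.apache.spark.util.MutablePair`), and each record is assumed to be a two-element array of deserialized row and partition keys.
    
    ```scala
    // Hypothetical reusable pair; illustrative only.
    final class MutablePairSketch[A, B](var _1: A, var _2: B) {
      def set(a: A, b: B): this.type = { _1 = a; _2 = b; this }
    }
    
    object UnapplySketch {
      // Pattern-matching version: every record goes through the Array
      // extractor and allocates a new Tuple2.
      def viaPatternMatch(
          records: Iterator[Array[AnyRef]]): Iterator[(AnyRef, Array[String])] =
        records.map { case Array(row, keys: Array[String]) => (row, keys) }
    
      // Indexing version: no extractor call, and a single pair object is
      // reused for the whole iterator.
      def viaIndexing(
          records: Iterator[Array[AnyRef]])
        : Iterator[MutablePairSketch[AnyRef, Array[String]]] = {
        val pair = new MutablePairSketch[AnyRef, Array[String]](null, null)
        records.map(r => pair.set(r(0), r(1).asInstanceOf[Array[String]]))
      }
    }
    ```
    
    As with any reused object, each pair has to be read before the iterator advances.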
    
    New micro benchmark data:
    
    ```
    Original:
    
    [info] CSV: 27676 ms, RCFile: 26415 ms
    [info] CSV: 27703 ms, RCFile: 26029 ms
    [info] CSV: 27511 ms, RCFile: 25962 ms
    
    Optimized:
    
    [info] CSV: 12357 ms, RCFile: 9283 ms
    [info] CSV: 12291 ms, RCFile: 9298 ms
    [info] CSV: 12325 ms, RCFile: 9242 ms
    ```
    
    As for Hive data unwrapping, I couldn't find a "static" method to eliminate right now. Any hints?



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43178499
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-42981451
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43178500
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15017/



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43562306
  
    @marmbrus I updated `HiveComparisonTest` and removed `SortedOperation`. How about this version?



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/758#discussion_r12598706
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala ---
    @@ -143,20 +185,35 @@ case class HiveTableScan(
       }
     
       def execute() = {
    -    inputRdd.map { row =>
    -      val values = row match {
    -        case Array(deserializedRow: AnyRef, partitionKeys: Array[String]) =>
    -          attributeFunctions.map(_(deserializedRow, partitionKeys))
    -        case deserializedRow: AnyRef =>
    -          attributeFunctions.map(_(deserializedRow, Array.empty))
    +    inputRdd.mapPartitions { iterator =>
    +      if (iterator.isEmpty) {
    +        Iterator.empty
    +      } else {
    +        val mutableRow = new GenericMutableRow(attributes.length)
    +        val buffered = iterator.buffered
    +        val rowsAndPartitionKeys = buffered.head match {
    +          case Array(_, _) =>
    +            buffered.map { case Array(deserializedRow, partitionKeys: Array[String]) =>
    +              (deserializedRow, partitionKeys)
    +            }
    +
    +          case _ =>
    +            buffered.map {
    +              (_, Array.empty[String])
    --- End diff --
    
    I think this is allocating a new array every time.
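    
    One way to avoid a possible per-record allocation, sketched under the assumption that nothing downstream mutates the array or relies on its identity, is to hoist the empty array out of the closure. `EmptyArraySketch` and its method names are illustrative only:
    
    ```scala
    object EmptyArraySketch {
      // Evaluates Array.empty[String] once per record, which may allocate
      // a new zero-length array each time.
      def allocatingEachTime(
          rows: Iterator[AnyRef]): Iterator[(AnyRef, Array[String])] =
        rows.map((_, Array.empty[String]))
    
      // Creates the zero-length array once and shares it across all records.
      def sharingOne(
          rows: Iterator[AnyRef]): Iterator[(AnyRef, Array[String])] = {
        val noPartitionKeys = Array.empty[String]
        rows.map((_, noPartitionKeys))
      }
    }
    ```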



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43242728
  
    > @marmbrus I worked around the test failure by adding a SortedOperation pattern that conservatively matches some definitely sorted operations (false negative rather than false positive). This may slow down the test suite a bit. Since most test outputs are empty or very small, this shouldn't be an issue right now.
    
    I think false negatives are the wrong direction to go here.  A false negative means that we think the query is not ordered when it should be and thus are disregarding the order when we should in fact be checking it.
    
    Maybe it would be better to recursively walk the tree looking explicitly for nodes that do not preserve order (aggregation, join, base relations) and then return false.  Sorts would return true.  Thoughts?
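    
    A rough sketch of such a walk, using a hypothetical miniature plan tree rather than Catalyst's real `LogicalPlan` classes. Which operators count as order-preserving (`Project` and `Filter` here) is an assumption made for illustration:
    
    ```scala
    object OrderingSketch {
      // Hypothetical plan nodes; not the Catalyst classes.
      sealed trait Plan
      case class Relation(name: String) extends Plan
      case class Sort(child: Plan) extends Plan
      case class Project(child: Plan) extends Plan
      case class Filter(child: Plan) extends Plan
      case class Aggregate(child: Plan) extends Plan
      case class Join(left: Plan, right: Plan) extends Plan
    
      // True only if a Sort is reached before any node that discards ordering.
      def isSorted(plan: Plan): Boolean = plan match {
        case Sort(_)        => true
        case Project(child) => isSorted(child)
        case Filter(child)  => isSorted(child)
        case _: Aggregate | _: Join | _: Relation => false
      }
    }
    ```
    
    With this shape, a `Sort` buried under an `Aggregate` is reported as unordered, while a `Sort` under a `Project` is still reported as ordered.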
    
    > New micro benchmark data:
    
    Sweet, looks like we shaved off a little bit more, so these optimizations were worth it!  It would be good to make notes on which changes led to what kind of speedup here.  That way, we can better focus our efforts when we optimize in the future.
    
    > As for Hive data unwrapping, I couldn't find a "static" method to eliminate right now. Any hints?
    
    My thought was that you will create an `Array` of `Any => Any` functions that can be applied to each column.  This way you only match on the datatype once, at the beginning, and then simply index into this array instead of matching for each data item.
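    
    Roughly, and only as a sketch: the column type tags and `WrappedText` below are hypothetical stand-ins for Hive object inspectors and Hive wrapper values, not real Hive or Spark classes.
    
    ```scala
    object UnwrapSketch {
      // Hypothetical column type tags standing in for object inspector kinds.
      sealed trait ColType
      case object TextCol extends ColType
      case object IntCol extends ColType
    
      // Hypothetical wrapper value that needs unwrapping to a plain String.
      final case class WrappedText(value: String)
    
      private val unwrapText: Any => Any =
        v => if (v == null) null else v.asInstanceOf[WrappedText].value
      private val passThrough: Any => Any = v => v
    
      // Match on the column type once, up front, yielding one function per column.
      def buildUnwrappers(schema: Seq[ColType]): Array[Any => Any] =
        schema.map {
          case TextCol => unwrapText
          case IntCol  => passThrough
        }.toArray
    
      // Per row: index into the prepared array instead of matching per value.
      def unwrapRow(
          raw: Array[Any],
          unwrappers: Array[Any => Any],
          out: Array[Any]): Unit = {
        var i = 0
        while (i < raw.length) {
          out(i) = unwrappers(i)(raw(i))
          i += 1
        }
      }
    }
    ```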



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43438588
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43438593
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-44415302
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/758#discussion_r12617940
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala ---
    @@ -143,20 +185,35 @@ case class HiveTableScan(
       }
     
       def execute() = {
    -    inputRdd.map { row =>
    -      val values = row match {
    -        case Array(deserializedRow: AnyRef, partitionKeys: Array[String]) =>
    -          attributeFunctions.map(_(deserializedRow, partitionKeys))
    -        case deserializedRow: AnyRef =>
    -          attributeFunctions.map(_(deserializedRow, Array.empty))
    +    inputRdd.mapPartitions { iterator =>
    +      if (iterator.isEmpty) {
    +        Iterator.empty
    +      } else {
    +        val mutableRow = new GenericMutableRow(attributes.length)
    +        val buffered = iterator.buffered
    +        val rowsAndPartitionKeys = buffered.head match {
    +          case Array(_, _) =>
    +            buffered.map { case Array(deserializedRow, partitionKeys: Array[String]) =>
    +              (deserializedRow, partitionKeys)
    +            }
    +
    +          case _ =>
    +            buffered.map {
    +              (_, Array.empty[String])
    +            }
    +        }
    +
    +        rowsAndPartitionKeys.map { case (deserializedRow, partitionKeys) =>
    --- End diff --
    
    Yes, there is a `Product2[A, B].unapply` function call cost. Removing it gives a 0.3% speedup.



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/758#discussion_r12598783
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala ---
    @@ -143,20 +185,35 @@ case class HiveTableScan(
       }
     
       def execute() = {
    -    inputRdd.map { row =>
    -      val values = row match {
    -        case Array(deserializedRow: AnyRef, partitionKeys: Array[String]) =>
    -          attributeFunctions.map(_(deserializedRow, partitionKeys))
    -        case deserializedRow: AnyRef =>
    -          attributeFunctions.map(_(deserializedRow, Array.empty))
    +    inputRdd.mapPartitions { iterator =>
    +      if (iterator.isEmpty) {
    +        Iterator.empty
    +      } else {
    +        val mutableRow = new GenericMutableRow(attributes.length)
    +        val buffered = iterator.buffered
    +        val rowsAndPartitionKeys = buffered.head match {
    +          case Array(_, _) =>
    +            buffered.map { case Array(deserializedRow, partitionKeys: Array[String]) =>
    +              (deserializedRow, partitionKeys)
    +            }
    +
    +          case _ =>
    +            buffered.map {
    +              (_, Array.empty[String])
    +            }
    +        }
    +
    +        rowsAndPartitionKeys.map { case (deserializedRow, partitionKeys) =>
    --- End diff --
    
    I'm curious if there is a cost to pattern matching here instead of using `_1` and `_2`?



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43440123
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15069/



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43440122
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43173426
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-42981436
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-42991927
  
    Nice speed up! :)
    
    I looked at the test failure. Looks like [this TODO](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L136) is finally coming back to bite us.  Instead of looking for _any_ Sort we should walk the tree until we find either a Sort or an operation that doesn't preserve ordering (join, aggregate, etc.).
    
    Once we fix that I'd propose merging this right away and then addressing the other possible suggestions in a followup PR.



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-44405885
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/758#discussion_r12598841
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala ---
    @@ -143,20 +185,35 @@ case class HiveTableScan(
       }
     
       def execute() = {
    -    inputRdd.map { row =>
    -      val values = row match {
    -        case Array(deserializedRow: AnyRef, partitionKeys: Array[String]) =>
    -          attributeFunctions.map(_(deserializedRow, partitionKeys))
    -        case deserializedRow: AnyRef =>
    -          attributeFunctions.map(_(deserializedRow, Array.empty))
    +    inputRdd.mapPartitions { iterator =>
    +      if (iterator.isEmpty) {
    +        Iterator.empty
    +      } else {
    +        val mutableRow = new GenericMutableRow(attributes.length)
    +        val buffered = iterator.buffered
    +        val rowsAndPartitionKeys = buffered.head match {
    +          case Array(_, _) =>
    +            buffered.map { case Array(deserializedRow, partitionKeys: Array[String]) =>
    +              (deserializedRow, partitionKeys)
    --- End diff --
    
    Can we use mutable pair instead?



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-44593628
  
    First merge as a committer :)
    
    Thanks for doing this!



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-42990425
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14941/



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-44415308
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15251/



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/758#discussion_r12598965
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala ---
    @@ -102,16 +105,55 @@ case class HiveTableScan(
               .getOrElse(sys.error(s"Can't find attribute $a"))
             (row: Any, _: Array[String]) => {
               val data = objectInspector.getStructFieldData(row, ref)
    -          unwrapData(data, ref.getFieldObjectInspector)
    +          unwrapHiveData(unwrapData(data, ref.getFieldObjectInspector))
             }
           }
         }
       }
     
    +  private def unwrapHiveData(value: Any) = value match {
    +    case maybeNull: String if maybeNull.toLowerCase == "null" => null
    --- End diff --
    
    Would it be possible to calculate the required unwrapping statically based on the object inspector instead of doing this match for every data item?



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/758#discussion_r12598829
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala ---
    @@ -143,20 +185,35 @@ case class HiveTableScan(
       }
     
       def execute() = {
    -    inputRdd.map { row =>
    -      val values = row match {
    -        case Array(deserializedRow: AnyRef, partitionKeys: Array[String]) =>
    -          attributeFunctions.map(_(deserializedRow, partitionKeys))
    -        case deserializedRow: AnyRef =>
    -          attributeFunctions.map(_(deserializedRow, Array.empty))
    +    inputRdd.mapPartitions { iterator =>
    +      if (iterator.isEmpty) {
    +        Iterator.empty
    +      } else {
    +        val mutableRow = new GenericMutableRow(attributes.length)
    +        val buffered = iterator.buffered
    +        val rowsAndPartitionKeys = buffered.head match {
    +          case Array(_, _) =>
    +            buffered.map { case Array(deserializedRow, partitionKeys: Array[String]) =>
    --- End diff --
    
    Same question about the performance of pattern matching vs indexing into the array manually.



[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/758

