Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/06/13 12:32:36 UTC

[GitHub] spark pull request #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDeci...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/21556

    [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType support push down

    ## What changes were proposed in this pull request?
    
    Add support for pushing filters on [32BitDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L208) and [64BitDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L219) columns down to the data sources.
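    [Editorial note, not part of the PR text: the reason these two decimal types are pushable is that Parquet stores small decimals as plain 32-bit or 64-bit integers holding the unscaled value. A minimal Python sketch of that mapping, with thresholds mirroring Spark's `Decimal.MAX_INT_DIGITS` (9) and `Decimal.MAX_LONG_DIGITS` (18); all names here are illustrative, not Spark's API.]

    ```python
    from decimal import Decimal

    # Illustrative thresholds, matching Spark's Decimal.MAX_INT_DIGITS
    # and Decimal.MAX_LONG_DIGITS.
    MAX_INT_DIGITS = 9
    MAX_LONG_DIGITS = 18

    def physical_type(precision: int) -> str:
        """Parquet physical type a decimal of this precision maps to."""
        if precision <= MAX_INT_DIGITS:
            return "INT32"   # 32BitDecimalType
        if precision <= MAX_LONG_DIGITS:
            return "INT64"   # 64BitDecimalType
        return "FIXED_LEN_BYTE_ARRAY"  # ByteArrayDecimalType

    def unscaled_value(value: Decimal, scale: int) -> int:
        """The integer Parquet actually stores: value * 10^scale."""
        return int(value.scaleb(scale))
    ```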
    
    ## How was this patch tested?
    
    Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-24549

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21556
    
----
commit 9832661e735fbbbfc4da4cc96f4ff7a537c3eca2
Author: Yuming Wang <yu...@...>
Date:   2018-06-13T12:28:33Z

    32BitDecimalType and 64BitDecimalType support push down to the data sources

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92414 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92414/testReport)** for PR 21556 at commit [`0b5d0e7`](https://github.com/apache/spark/commit/0b5d0e70fc4baff48ac5656e01a4ab98020c2166).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92413/testReport)** for PR 21556 at commit [`0b5d0e7`](https://github.com/apache/spark/commit/0b5d0e70fc4baff48ac5656e01a4ab98020c2166).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92675/testReport)** for PR 21556 at commit [`341fe59`](https://github.com/apache/spark/commit/341fe5913b439322a1d3a08df53ed8acd84e0cbb).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92843/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/537/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91769/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202447544
  
    --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
    @@ -292,120 +292,120 @@ Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     
     Select 1 decimal(9, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     ------------------------------------------------------------------------------------------------
    -Parquet Vectorized                            3785 / 3867          4.2         240.6       1.0X
    -Parquet Vectorized (Pushdown)                 3820 / 3928          4.1         242.9       1.0X
    -Native ORC Vectorized                         3981 / 4049          4.0         253.1       1.0X
    -Native ORC Vectorized (Pushdown)               702 /  735         22.4          44.6       5.4X
    +Parquet Vectorized                            4407 / 4852          3.6         280.2       1.0X
    +Parquet Vectorized (Pushdown)                 1602 / 1634          9.8         101.8       2.8X
    --- End diff --
    
    Okay, I see. The tenths and hundredths are always 0, which makes the precision-8 numbers actually precision-10. It is still odd that this is causing Parquet to have no stats, but I'm happy with the fix. Thanks for explaining.
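    [Editorial illustration of the point above, with a value of my own choosing rather than the reviewer's exact numbers: when the tenths and hundredths are always zero, those trailing zeros still count toward the unscaled integer Parquet stores, so the effective precision grows by the scale.]

    ```python
    from decimal import Decimal

    def unscaled_digits(value: str, scale: int) -> int:
        # Number of digits in the unscaled integer Parquet would store.
        return len(str(abs(int(Decimal(value).scaleb(scale)))))

    # 7864320 alone needs 7 digits, but stored as decimal(p, 2) the two
    # always-zero fractional digits push the required precision to 9.
    print(unscaled_digits("7864320", 0))     # 7
    print(unscaled_digits("7864320.00", 2))  # 9
    ```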


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21556


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200419939
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    That doesn't validate the value against the decimal scale from the file, which is what I'm suggesting. The decimal scale must match exactly and this is a good place to check because this has the file information. If the scale doesn't match, then the schema used to read this file is incorrect, which would cause data corruption.
    
    In my opinion, it is better to add a check if it is cheap instead of debating whether or not some other part of the code covers the case. If this were happening per record then I would opt for a different strategy, but because this is at the file level it is a good idea to add it here.
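    [Editorial sketch of the check rdblue describes, in illustrative Python; the real code compares a `java.math.BigDecimal` against Parquet's `DecimalMetadata`, and the function name here is mine.]

    ```python
    from decimal import Decimal

    def is_decimal_matched(value, file_scale: int) -> bool:
        """Push the filter down only if the literal's scale matches the
        file's decimal scale exactly; a mismatch means the read schema
        is wrong and could silently corrupt results."""
        if isinstance(value, Decimal):
            return -value.as_tuple().exponent == file_scale
        return False

    # Literal 1.00 (scale 2) against a file written with scale 2: pushable.
    print(is_decimal_matched(Decimal("1.00"), 2))   # True
    # Same magnitude but scale 3 does not match the file: reject.
    print(is_decimal_matched(Decimal("1.000"), 2))  # False
    ```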


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/179/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92938 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92938/testReport)** for PR 21556 at commit [`f73eab2`](https://github.com/apache/spark/commit/f73eab288f3c1b825521b4e4c8ab841c0840f48a).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91881 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91881/testReport)** for PR 21556 at commit [`51d8540`](https://github.com/apache/spark/commit/51d854000186dcef1385e4b8bcd84c2b9fd763c6).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/702/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Another performance test:
    <img width="1008" alt="spark_24549" src="https://user-images.githubusercontent.com/5399861/41448622-437d029a-708e-11e8-9c18-5d9f17cd1edf.png">



---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92683/testReport)** for PR 21556 at commit [`341fe59`](https://github.com/apache/spark/commit/341fe5913b439322a1d3a08df53ed8acd84e0cbb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r199442189
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -359,6 +369,70 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         }
       }
     
    +  test("filter pushdown - decimal") {
    +    Seq(true, false).foreach { legacyFormat =>
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) {
    +        Seq(s"_1 decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType
    +          s"_1 decimal(${Decimal.MAX_LONG_DIGITS}, 2)",  // 64BitDecimalType
    +          "_1 decimal(38, 18)"                           // ByteArrayDecimalType
    +        ).foreach { schemaDDL =>
    +          val schema = StructType.fromDDL(schemaDDL)
    +          val rdd =
    +            spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i))))
    +          val dataFrame = spark.createDataFrame(rdd, schema)
    +          testDecimalPushDown(dataFrame) { implicit df =>
    +            assert(df.schema === schema)
    +            checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row])
    +            checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(Row.apply(_)))
    +
    +            checkFilterPredicate('_1 === 1, classOf[Eq[_]], 1)
    +            checkFilterPredicate('_1 <=> 1, classOf[Eq[_]], 1)
    +            checkFilterPredicate('_1 =!= 1, classOf[NotEq[_]], (2 to 4).map(Row.apply(_)))
    +
    +            checkFilterPredicate('_1 < 2, classOf[Lt[_]], 1)
    +            checkFilterPredicate('_1 > 3, classOf[Gt[_]], 4)
    +            checkFilterPredicate('_1 <= 1, classOf[LtEq[_]], 1)
    +            checkFilterPredicate('_1 >= 4, classOf[GtEq[_]], 4)
    +
    +            checkFilterPredicate(Literal(1) === '_1, classOf[Eq[_]], 1)
    +            checkFilterPredicate(Literal(1) <=> '_1, classOf[Eq[_]], 1)
    +            checkFilterPredicate(Literal(2) > '_1, classOf[Lt[_]], 1)
    +            checkFilterPredicate(Literal(3) < '_1, classOf[Gt[_]], 4)
    +            checkFilterPredicate(Literal(1) >= '_1, classOf[LtEq[_]], 1)
    +            checkFilterPredicate(Literal(4) <= '_1, classOf[GtEq[_]], 4)
    +
    +            checkFilterPredicate(!('_1 < 4), classOf[GtEq[_]], 4)
    +            checkFilterPredicate('_1 < 2 || '_1 > 3, classOf[Operators.Or], Seq(Row(1), Row(4)))
    +          }
    +        }
    +      }
    +    }
    +  }
    +
    +  test("incompatible parquet file format will throw exception") {
    --- End diff --
    
    I have created a PR: https://github.com/apache/spark/pull/21696
    After that PR, decimal push down support should look like this: https://github.com/wangyum/spark/blob/refactor-decimal-pushdown/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L118-L146


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/723/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93017/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202239380
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Decimal type must make sure that filter value's scale matched the file.
    --- End diff --
    
    Shall we leave this comment around the decimal `case`s below or around `isDecimalMatched`?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    cc @gatorsmile  @rdblue 


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r198907669
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -359,6 +369,70 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         }
       }
     
    +  test("filter pushdown - decimal") {
    +    Seq(true, false).foreach { legacyFormat =>
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) {
    +        Seq(s"_1 decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType
    +          s"_1 decimal(${Decimal.MAX_LONG_DIGITS}, 2)",  // 64BitDecimalType
    +          "_1 decimal(38, 18)"                           // ByteArrayDecimalType
    +        ).foreach { schemaDDL =>
    +          val schema = StructType.fromDDL(schemaDDL)
    +          val rdd =
    +            spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i))))
    +          val dataFrame = spark.createDataFrame(rdd, schema)
    +          testDecimalPushDown(dataFrame) { implicit df =>
    +            assert(df.schema === schema)
    +            checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row])
    +            checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(Row.apply(_)))
    +
    +            checkFilterPredicate('_1 === 1, classOf[Eq[_]], 1)
    +            checkFilterPredicate('_1 <=> 1, classOf[Eq[_]], 1)
    +            checkFilterPredicate('_1 =!= 1, classOf[NotEq[_]], (2 to 4).map(Row.apply(_)))
    +
    +            checkFilterPredicate('_1 < 2, classOf[Lt[_]], 1)
    +            checkFilterPredicate('_1 > 3, classOf[Gt[_]], 4)
    +            checkFilterPredicate('_1 <= 1, classOf[LtEq[_]], 1)
    +            checkFilterPredicate('_1 >= 4, classOf[GtEq[_]], 4)
    +
    +            checkFilterPredicate(Literal(1) === '_1, classOf[Eq[_]], 1)
    +            checkFilterPredicate(Literal(1) <=> '_1, classOf[Eq[_]], 1)
    +            checkFilterPredicate(Literal(2) > '_1, classOf[Lt[_]], 1)
    +            checkFilterPredicate(Literal(3) < '_1, classOf[Gt[_]], 4)
    +            checkFilterPredicate(Literal(1) >= '_1, classOf[LtEq[_]], 1)
    +            checkFilterPredicate(Literal(4) <= '_1, classOf[GtEq[_]], 4)
    +
    +            checkFilterPredicate(!('_1 < 4), classOf[GtEq[_]], 4)
    +            checkFilterPredicate('_1 < 2 || '_1 > 3, classOf[Operators.Or], Seq(Row(1), Row(4)))
    +          }
    +        }
    +      }
    +    }
    +  }
    +
    +  test("incompatible parquet file format will throw exception") {
    --- End diff --
    
    If we can detect the case where the data is written with the legacy format, then why do we need a property to read with the legacy format? Why not do the right thing without a property?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    retest this please.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Benchmark results:
    ```
    ###############################[ Pushdown benchmark for Decimal ]################################
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    
    Select 1 decimal(9, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            4004 / 5309          3.9         254.5       1.0X
    Parquet Vectorized (Pushdown)                 1401 / 1431         11.2          89.1       2.9X
    Native ORC Vectorized                         4499 / 4567          3.5         286.0       0.9X
    Native ORC Vectorized (Pushdown)               899 /  961         17.5          57.2       4.5X
    
    Select 10% decimal(9, 2) rows (value < 1572864): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            5376 / 6437          2.9         341.8       1.0X
    Parquet Vectorized (Pushdown)                 2696 / 2754          5.8         171.4       2.0X
    Native ORC Vectorized                         5458 / 5623          2.9         347.0       1.0X
    Native ORC Vectorized (Pushdown)              2230 / 2255          7.1         141.8       2.4X
    
    Select 50% decimal(9, 2) rows (value < 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            8280 / 8487          1.9         526.4       1.0X
    Parquet Vectorized (Pushdown)                 7716 / 7757          2.0         490.6       1.1X
    Native ORC Vectorized                         9144 / 9495          1.7         581.4       0.9X
    Native ORC Vectorized (Pushdown)              7918 / 8118          2.0         503.4       1.0X
    
    Select 90% decimal(9, 2) rows (value < 14155776): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            9648 / 9676          1.6         613.4       1.0X
    Parquet Vectorized (Pushdown)                 9647 / 9778          1.6         613.3       1.0X
    Native ORC Vectorized                       10782 / 10867          1.5         685.5       0.9X
    Native ORC Vectorized (Pushdown)            10108 / 10269          1.6         642.6       1.0X
    
    Select 1 decimal(18, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            4066 / 4147          3.9         258.5       1.0X
    Parquet Vectorized (Pushdown)                   84 /   89        188.0           5.3      48.6X
    Native ORC Vectorized                         5430 / 5512          2.9         345.3       0.7X
    Native ORC Vectorized (Pushdown)              1054 / 1076         14.9          67.0       3.9X
    
    Select 10% decimal(18, 2) rows (value < 1572864): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            5028 / 5154          3.1         319.7       1.0X
    Parquet Vectorized (Pushdown)                 1360 / 1421         11.6          86.5       3.7X
    Native ORC Vectorized                         6266 / 6360          2.5         398.4       0.8X
    Native ORC Vectorized (Pushdown)              2513 / 2550          6.3         159.8       2.0X
    
    Select 50% decimal(18, 2) rows (value < 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            8571 / 8600          1.8         544.9       1.0X
    Parquet Vectorized (Pushdown)                 6455 / 6713          2.4         410.4       1.3X
    Native ORC Vectorized                       10138 / 10353          1.6         644.5       0.8X
    Native ORC Vectorized (Pushdown)              8166 / 8418          1.9         519.2       1.0X
    
    Select 90% decimal(18, 2) rows (value < 14155776): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                          12184 / 12253          1.3         774.7       1.0X
    Parquet Vectorized (Pushdown)               11720 / 11743          1.3         745.1       1.0X
    Native ORC Vectorized                       14024 / 14172          1.1         891.6       0.9X
    Native ORC Vectorized (Pushdown)            13966 / 13996          1.1         887.9       0.9X
    
    Select 1 decimal(38, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            5650 / 6140          2.8         359.2       1.0X
    Parquet Vectorized (Pushdown)                  112 /  186        140.5           7.1      50.5X
    Native ORC Vectorized                         5890 / 6826          2.7         374.5       1.0X
    Native ORC Vectorized (Pushdown)              1151 / 1396         13.7          73.2       4.9X
    
    Select 10% decimal(38, 2) rows (value < 1572864): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            6668 / 6863          2.4         424.0       1.0X
    Parquet Vectorized (Pushdown)                 1766 / 1781          8.9         112.3       3.8X
    Native ORC Vectorized                         6421 / 6448          2.4         408.2       1.0X
    Native ORC Vectorized (Pushdown)              2648 / 2701          5.9         168.3       2.5X
    
    Select 50% decimal(38, 2) rows (value < 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                          11268 / 11292          1.4         716.4       1.0X
    Parquet Vectorized (Pushdown)                 8544 / 8574          1.8         543.2       1.3X
    Native ORC Vectorized                       10975 / 11455          1.4         697.8       1.0X
    Native ORC Vectorized (Pushdown)              9065 / 9166          1.7         576.3       1.2X
    
    Select 90% decimal(38, 2) rows (value < 14155776): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                          15878 / 15972          1.0        1009.5       1.0X
    Parquet Vectorized (Pushdown)               15141 / 15328          1.0         962.6       1.0X
    Native ORC Vectorized                       15407 / 16003          1.0         979.5       1.0X
    Native ORC Vectorized (Pushdown)            15347 / 15414          1.0         975.7       1.0X
    ```
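A note on reading these tables (my own gloss, not part of the benchmark output): the Relative column is the baseline row's time divided by each row's time, computed before rounding, so recomputing it from the rounded figures only approximates the printed value:

```java
// Sketch of how the Relative column is derived: baseline best time divided by
// this configuration's best time. Recomputed from the rounded table values,
// so the result only approximates the printed "N.NX" figures.
public class Speedup {
    public static double relative(double baselineMs, double configMs) {
        return baselineMs / configMs;
    }
}
```

For the decimal(38, 2) point lookup above, 5650 / 112 is roughly 50.4, matching the table's 50.5X up to rounding.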


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202075842
  
    --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
    @@ -292,120 +292,120 @@ Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     
     Select 1 decimal(9, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     ------------------------------------------------------------------------------------------------
    -Parquet Vectorized                            3785 / 3867          4.2         240.6       1.0X
    -Parquet Vectorized (Pushdown)                 3820 / 3928          4.1         242.9       1.0X
    -Native ORC Vectorized                         3981 / 4049          4.0         253.1       1.0X
    -Native ORC Vectorized (Pushdown)               702 /  735         22.4          44.6       5.4X
    +Parquet Vectorized                            4407 / 4852          3.6         280.2       1.0X
    +Parquet Vectorized (Pushdown)                 1602 / 1634          9.8         101.8       2.8X
    --- End diff --
    
    Because `1024 * 1024 * 15` is out of the `decimal(9, 2)` range, I will update the benchmark.
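To make the range argument concrete, a minimal sketch (hypothetical helper, not Spark code): the largest magnitude decimal(p, s) can hold is (10^p - 1) scaled by s, so decimal(9, 2) tops out at 9999999.99 and 1024 * 1024 * 15 = 15728640 falls outside it:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalRange {
    // Maximum magnitude of decimal(precision, scale) is the unscaled value
    // 10^precision - 1 interpreted at the given scale.
    public static boolean fits(BigDecimal value, int precision, int scale) {
        BigDecimal max = new BigDecimal(
            BigInteger.TEN.pow(precision).subtract(BigInteger.ONE), scale);
        return value.abs().compareTo(max) <= 0;
    }
}
```

The benchmark's other filter bound, 7864320, still fits decimal(9, 2), which is why only the 15728640 case needed changing.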


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92843/testReport)** for PR 21556 at commit [`16528f3`](https://github.com/apache/spark/commit/16528f3b8ef1f97a4445ecc238fbc758d2454334).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    cc @gatorsmile @cloud-fan @gengliangwang @michal-databricks @mswit-databricks 


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92414/testReport)** for PR 21556 at commit [`0b5d0e7`](https://github.com/apache/spark/commit/0b5d0e70fc4baff48ac5656e01a4ab98020c2166).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/729/
    Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93014/
    Test FAILed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/189/
    Test PASSed.


---


[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202240358
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Decimal type must make sure that filter value's scale matched the file.
    +    // If doesn't matched, which would cause data corruption.
    +    // Other types must make sure that filter value's type matched the file.
    --- End diff --
    
    I would say something like: Parquet's type in the given file should match the value's type in the pushed filter in order to push down the filter to Parquet.


---


[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201754998
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -202,6 +283,16 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
         case ParquetDateType if pushDownDate =>
           (n: String, v: Any) =>
             FilterApi.gtEq(intColumn(n), dateToDays(v.asInstanceOf[Date]).asInstanceOf[Integer])
    +
    +    case ParquetSchemaType(DECIMAL, INT32, 0, _) if pushDownDecimal =>
    --- End diff --
    
    Why match 0 instead of _?
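For readers following the INT32 case: Parquet stores small-precision decimals (precision up to 9) as INT32 unscaled values, and the `length` field of `ParquetSchemaType` is only meaningful for FIXED_LEN_BYTE_ARRAY, which is why it is 0 here. A minimal sketch (hypothetical helper, mirroring the conversion this case must perform on the filter value):

```java
import java.math.BigDecimal;

public class DecimalToInt32 {
    // A decimal stored as INT32 keeps only its unscaled value in the file;
    // the pushed-down filter value has to be converted the same way.
    public static int unscaledInt32(BigDecimal decimal) {
        // intValueExact throws if the unscaled value overflows 32 bits,
        // which cannot happen for precision <= 9 (max unscaled 999999999).
        return decimal.unscaledValue().intValueExact();
    }
}
```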


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92619 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92619/testReport)** for PR 21556 at commit [`f160648`](https://github.com/apache/spark/commit/f1606484867fbf1ce08be4793f8836c5126ec1b2).


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    @dongjoon-hyun benchmark code: https://github.com/apache/spark/blob/bf67f70c48881ee99751f7d51fbcbda1e593d90a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L284-L314


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92837 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92837/testReport)** for PR 21556 at commit [`16528f3`](https://github.com/apache/spark/commit/16528f3b8ef1f97a4445ecc238fbc758d2454334).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92938/
    Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92710/testReport)** for PR 21556 at commit [`7128539`](https://github.com/apache/spark/commit/712853999442a611ce7b97db8dad43039268573e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Jenkins, retest this please.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/836/
    Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91909/testReport)** for PR 21556 at commit [`51d8540`](https://github.com/apache/spark/commit/51d854000186dcef1385e4b8bcd84c2b9fd763c6).


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Thanks for pinging me, @maropu. I'd like to see the benchmark code.
    > In the benchmarks above, the results of ORC except for the case decimal(9, 2) have worse performance values as compared to the Parquet ones. Is this expected?


---


[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200170975
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    `DecimalType` carries a variable, `decimalMetadata`, so it seems difficult to make it a constant like before.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92837/
    Test FAILed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4069/
    Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92710/testReport)** for PR 21556 at commit [`7128539`](https://github.com/apache/spark/commit/712853999442a611ce7b97db8dad43039268573e).


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92938 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92938/testReport)** for PR 21556 at commit [`f73eab2`](https://github.com/apache/spark/commit/f73eab288f3c1b825521b4e4c8ab841c0840f48a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201925142
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts parquet file schema to convert to
    +    // data source Filter. This must make sure that filter value matched the Filter.
    +    // If doesn't matched, then the schema used to read the file is incorrect,
    +    // which would cause data corruption.
    +    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
    +      value == null || (nameToType(name) match {
    +        case ParquetBooleanType => value.isInstanceOf[JBoolean]
    +        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    +        case ParquetLongType => value.isInstanceOf[JLong]
    +        case ParquetFloatType => value.isInstanceOf[JFloat]
    +        case ParquetDoubleType => value.isInstanceOf[JDouble]
    +        case ParquetStringType => value.isInstanceOf[String]
    +        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    +        case ParquetDateType => value.isInstanceOf[Date]
    --- End diff --
    
    Originally it was not supported. Do we need to support it?
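To summarize the scale check being discussed, a minimal, simplified sketch (hypothetical helper, not the actual Spark code): a decimal filter value can only be pushed down when its runtime type is a `java.math.BigDecimal` whose scale matches the scale recorded in the file schema:

```java
import java.math.BigDecimal;

public class FilterValueCheck {
    // Pushing down a decimal predicate whose scale differs from the file's
    // scale would compare mismatched unscaled values, i.e. corrupt results.
    public static boolean decimalScaleMatches(Object value, int fileScale) {
        return value instanceof BigDecimal
            && ((BigDecimal) value).scale() == fileScale;
    }
}
```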


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91881/
    Test FAILed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/898/
    Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Jenkins, retest this please.


---


[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    @rdblue, ah, I misunderstood then. thanks for clarifying it.


---


[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202327362
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -37,41 +39,64 @@ import org.apache.spark.unsafe.types.UTF8String
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) {
    +private[parquet] class ParquetFilters(
    +    pushDownDate: Boolean,
    +    pushDownDecimal: Boolean,
    +    pushDownStartWith: Boolean) {
     
       private case class ParquetSchemaType(
           originalType: OriginalType,
           primitiveTypeName: PrimitiveTypeName,
    -      decimalMetadata: DecimalMetadata)
    -
    -  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, null)
    -  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, null)
    -  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, null)
    -  private val ParquetIntegerType = ParquetSchemaType(null, INT32, null)
    -  private val ParquetLongType = ParquetSchemaType(null, INT64, null)
    -  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, null)
    -  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, null)
    -  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, null)
    -  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, null)
    -  private val ParquetDateType = ParquetSchemaType(DATE, INT32, null)
    +      length: Int,
    +      decimalMeta: DecimalMetadata)
    +
    +  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, 0, null)
    +  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, 0, null)
    +  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, 0, null)
    +  private val ParquetIntegerType = ParquetSchemaType(null, INT32, 0, null)
    +  private val ParquetLongType = ParquetSchemaType(null, INT64, 0, null)
    +  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, 0, null)
    +  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, 0, null)
    +  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, 0, null)
    +  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, 0, null)
    +  private val ParquetDateType = ParquetSchemaType(DATE, INT32, 0, null)
     
       private def dateToDays(date: Date): SQLDate = {
         DateTimeUtils.fromJavaDate(date)
       }
     
    +  private def decimalToInt32(decimal: JBigDecimal): Integer = decimal.unscaledValue().intValue()
    +
    +  private def decimalToInt64(decimal: JBigDecimal): JLong = decimal.unscaledValue().longValue()
    +
    +  private def decimalToByteArray(decimal: JBigDecimal, numBytes: Int): Binary = {
    +    val decimalBuffer = new Array[Byte](numBytes)
    +    val bytes = decimal.unscaledValue().toByteArray
    +
    +    val fixedLengthBytes = if (bytes.length == numBytes) {
    +      bytes
    +    } else {
    +      val signByte = if (bytes.head < 0) -1: Byte else 0: Byte
    +      java.util.Arrays.fill(decimalBuffer, 0, numBytes - bytes.length, signByte)
    +      System.arraycopy(bytes, 0, decimalBuffer, numBytes - bytes.length, bytes.length)
    +      decimalBuffer
    +    }
    +    Binary.fromReusedByteArray(fixedLengthBytes, 0, numBytes)
    +  }
    +
       private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = {
    --- End diff --
    
    `ParquetBooleanType`, `ParquetLongType`, `ParquetFloatType` and `ParquetDoubleType` do not need `Option`. Here is an example:
    ```scala
    scala> import org.apache.parquet.io.api.Binary
    import org.apache.parquet.io.api.Binary
    
    scala> Option(null).map(s => Binary.fromString(s.asInstanceOf[String])).orNull
    res7: org.apache.parquet.io.api.Binary = null
    
    scala> Binary.fromString(null.asInstanceOf[String])
    java.lang.NullPointerException
      at org.apache.parquet.io.api.Binary$FromStringBinary.encodeUTF8(Binary.java:224)
      at org.apache.parquet.io.api.Binary$FromStringBinary.<init>(Binary.java:214)
      at org.apache.parquet.io.api.Binary.fromString(Binary.java:554)
      ... 52 elided
    
    scala> null.asInstanceOf[java.lang.Long]
    res9: Long = null
    
    scala> null.asInstanceOf[java.lang.Boolean]
    res10: Boolean = null
    
    scala> Option(null).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull
    res11: Integer = null
    
    scala> null.asInstanceOf[Number].intValue.asInstanceOf[Integer]
    java.lang.NullPointerException
      ... 52 elided
    ```



---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92641/testReport)** for PR 21556 at commit [`c7308ab`](https://github.com/apache/spark/commit/c7308ab8758f395930c92c7d1f9e5183699b31f3).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91894/testReport)** for PR 21556 at commit [`51d8540`](https://github.com/apache/spark/commit/51d854000186dcef1385e4b8bcd84c2b9fd763c6).


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200526464
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    I see. I will do it.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92683/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201754882
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts parquet file schema to convert to
    --- End diff --
    
    Is this issue reference correct? The PR says this is for SPARK-24549.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    @rdblue, so basically you mean that equality comparison and null-safe equality comparison are pushed down identically and should be distinguished; otherwise, there could be a potential problem? If so, yup, I agree.
    
    I think the optimizer won't actually give us a chance to push down an equality or null-safe equality comparison with an actual `null` value. However, sure, we shouldn't rely on that. I think we should actually disallow either null-safe equality comparison or equality comparison with `null` in `ParquetFilters`.
    
    Thing is, I remember checking a few years ago that Parquet's equality comparison API is itself null-safe - this of course should be double-checked.
    
    Since this PR doesn't change the existing behaviour here and this looks to need some more investigation (e.g., checking whether what I remembered about Parquet's equality comparison is still true), it's probably okay to leave it as is.
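    The distinction under discussion can be sketched as follows (illustrative Scala, not code from the PR), modeling SQL's three-valued logic with `Option[Int]`: `a = b` is unknown when either side is null (so the row is filtered out), while `a <=> b` is null-safe and treats `null <=> null` as true.
    
    ```scala
    // Illustrative sketch of the two comparison semantics being discussed.
    // "a = b": null-propagating; comparing with a null operand yields unknown (None).
    def equalTo(a: Option[Int], b: Option[Int]): Option[Boolean] =
      for (x <- a; y <- b) yield x == y
    
    // "a <=> b": null-safe; two nulls (None) compare as equal.
    def equalNullSafe(a: Option[Int], b: Option[Int]): Boolean =
      a == b
    ```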


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91894/testReport)** for PR 21556 at commit [`51d8540`](https://github.com/apache/spark/commit/51d854000186dcef1385e4b8bcd84c2b9fd763c6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4080/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91909/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92413/
    Test FAILed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202093508
  
    --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
    @@ -292,120 +292,120 @@ Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     
     Select 1 decimal(9, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     ------------------------------------------------------------------------------------------------
    -Parquet Vectorized                            3785 / 3867          4.2         240.6       1.0X
    -Parquet Vectorized (Pushdown)                 3820 / 3928          4.1         242.9       1.0X
    -Native ORC Vectorized                         3981 / 4049          4.0         253.1       1.0X
    -Native ORC Vectorized (Pushdown)               702 /  735         22.4          44.6       5.4X
    +Parquet Vectorized                            4407 / 4852          3.6         280.2       1.0X
    +Parquet Vectorized (Pushdown)                 1602 / 1634          9.8         101.8       2.8X
    --- End diff --
    
    I'm not sure I understand. That's less than 2^24, so it should fit in an int. It should also fit in 8 base-ten digits so decimal(9,2) should work. And last, if the values don't fit in an int, I'm not sure how we would be able to store them in the first place, regardless of how stats are handled.
    
    Did you verify that there are no stats for the file produced here? If that's the case, it would make sense with these numbers. I think we just need to look for a different reason why stats are missing.
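    The arithmetic here is easy to confirm (a quick check, with the benchmark value assumed from the results above):
    
    ```scala
    // The benchmark's filter value: 7864320.
    val v = 7864320L
    val fitsIn24Bits = v < (1L << 24)               // 2^24 = 16777216, so it fits
    
    // Stored as decimal(9, 2): unscaled value 786432000, scale 2, 9 digits total.
    val d = new java.math.BigDecimal("7864320.00")
    val fitsPrecision9 = d.precision <= 9           // at most 9 significant digits
    val unscaledFitsInt = d.unscaledValue().bitLength() < 32  // fits in an Int
    ```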


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91769/testReport)** for PR 21556 at commit [`9832661`](https://github.com/apache/spark/commit/9832661e735fbbbfc4da4cc96f4ff7a537c3eca2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200246162
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    It seems invalid values are already filtered out by: https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L439


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    retest this please


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201763489
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -37,41 +39,64 @@ import org.apache.spark.unsafe.types.UTF8String
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) {
    +private[parquet] class ParquetFilters(
    +    pushDownDate: Boolean,
    +    pushDownDecimal: Boolean,
    +    pushDownStartWith: Boolean) {
     
       private case class ParquetSchemaType(
           originalType: OriginalType,
           primitiveTypeName: PrimitiveTypeName,
    -      decimalMetadata: DecimalMetadata)
    -
    -  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, null)
    -  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, null)
    -  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, null)
    -  private val ParquetIntegerType = ParquetSchemaType(null, INT32, null)
    -  private val ParquetLongType = ParquetSchemaType(null, INT64, null)
    -  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, null)
    -  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, null)
    -  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, null)
    -  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, null)
    -  private val ParquetDateType = ParquetSchemaType(DATE, INT32, null)
    +      length: Int,
    +      decimalMeta: DecimalMetadata)
    +
    +  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, 0, null)
    +  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, 0, null)
    +  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, 0, null)
    +  private val ParquetIntegerType = ParquetSchemaType(null, INT32, 0, null)
    +  private val ParquetLongType = ParquetSchemaType(null, INT64, 0, null)
    +  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, 0, null)
    +  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, 0, null)
    +  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, 0, null)
    +  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, 0, null)
    +  private val ParquetDateType = ParquetSchemaType(DATE, INT32, 0, null)
     
       private def dateToDays(date: Date): SQLDate = {
         DateTimeUtils.fromJavaDate(date)
       }
     
    +  private def decimalToInt32(decimal: JBigDecimal): Integer = decimal.unscaledValue().intValue()
    +
    +  private def decimalToInt64(decimal: JBigDecimal): JLong = decimal.unscaledValue().longValue()
    +
    +  private def decimalToByteArray(decimal: JBigDecimal, numBytes: Int): Binary = {
    +    val decimalBuffer = new Array[Byte](numBytes)
    +    val bytes = decimal.unscaledValue().toByteArray
    +
    +    val fixedLengthBytes = if (bytes.length == numBytes) {
    +      bytes
    +    } else {
    +      val signByte = if (bytes.head < 0) -1: Byte else 0: Byte
    +      java.util.Arrays.fill(decimalBuffer, 0, numBytes - bytes.length, signByte)
    +      System.arraycopy(bytes, 0, decimalBuffer, numBytes - bytes.length, bytes.length)
    +      decimalBuffer
    +    }
    +    Binary.fromReusedByteArray(fixedLengthBytes, 0, numBytes)
    +  }
    +
       private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = {
    --- End diff --
    
    Since `makeEq` is called for `EqualsNullSafe` and `valueCanMakeFilterOn` allows null values through, I think these could be null, like the String case. I think this should use the `Option` pattern from String for all values, unless I'm missing some reason why these will never be null.
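    The `Option` pattern being suggested looks roughly like this (a sketch under assumptions, not the PR's final code; `decimalToInt32` is copied from the diff above): a null filter value coming from `EqualNullSafe` maps to a null boxed `Integer` instead of triggering an NPE inside the conversion.
    
    ```scala
    import java.math.{BigDecimal => JBigDecimal}
    
    // Conversion from the diff: unscaled value of an INT32-backed decimal.
    def decimalToInt32(decimal: JBigDecimal): Integer = decimal.unscaledValue().intValue()
    
    // Null-safe wrapper following the Option pattern used for the String case:
    // Option(null) is None, so .orNull yields a null Integer without dereferencing.
    def toInt32OrNull(v: Any): Integer =
      Option(v).map(d => decimalToInt32(d.asInstanceOf[JBigDecimal])).orNull
    ```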


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    thanks, merging to master!


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    @wangyum, can you explain what was happening with the `decimal(9,2)` benchmark more clearly? I asked additional questions, but the thread is on a line that changed so it's collapsed by default.
    
    Also, `valueCanMakeFilterOn` returns true for all null values, so I think we still have a problem there. Conversion from EqualNullSafe needs to support null filter values.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92619/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92936/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92837 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92837/testReport)** for PR 21556 at commit [`16528f3`](https://github.com/apache/spark/commit/16528f3b8ef1f97a4445ecc238fbc758d2454334).


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201755353
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts parquet file schema to convert to
    +    // data source Filter. This must make sure that filter value matched the Filter.
    +    // If doesn't matched, then the schema used to read the file is incorrect,
    +    // which would cause data corruption.
    +    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
    +      value == null || (nameToType(name) match {
    +        case ParquetBooleanType => value.isInstanceOf[JBoolean]
    +        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    +        case ParquetLongType => value.isInstanceOf[JLong]
    +        case ParquetFloatType => value.isInstanceOf[JFloat]
    +        case ParquetDoubleType => value.isInstanceOf[JDouble]
    +        case ParquetStringType => value.isInstanceOf[String]
    +        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    +        case ParquetDateType => value.isInstanceOf[Date]
    +        case ParquetSchemaType(DECIMAL, INT32, 0, decimalMeta) =>
    --- End diff --
    
    Can the decimal cases be collapsed to a single case on `ParquetSchemaType(DECIMAL, _, _, decimalMetadata)`?
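    The collapsing being asked about relies on a wildcard in the pattern match. A minimal sketch with hypothetical stand-in case classes (the real `ParquetSchemaType`/`DecimalMetadata` shapes are assumed, not reproduced):
    
    ```scala
    // Stand-ins for the Parquet types, only to show the pattern-match shape.
    case class DecimalMeta(precision: Int, scale: Int)
    case class PType(originalType: String, primitive: String, length: Int, meta: DecimalMeta)
    
    // One case with wildcards for the physical type and length covers the
    // INT32, INT64 and FIXED_LEN_BYTE_ARRAY backings at once.
    def decimalMetaOf(t: PType): Option[DecimalMeta] = t match {
      case PType("DECIMAL", _, _, m) => Some(m)
      case _ => None
    }
    ```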


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    @wangyum Thanks for the benchmarks!
    @dongjoon-hyun In the benchmarks above, the ORC results, except for the `decimal(9, 2)` case, show worse performance than the Parquet ones. Is this expected?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDeci...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r195283330
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -62,6 +62,16 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +    case decimal: DecimalType if DecimalType.is32BitDecimalType(decimal) =>
    +      (n: String, v: Any) => FilterApi.eq(
    +        intColumn(n),
    +        Option(v).map(_.asInstanceOf[java.math.BigDecimal].unscaledValue().intValue()
    --- End diff --
    
    REF: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L219
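    For context on what `unscaledValue()` does here (an illustration only, values hypothetical): for small-precision decimals it yields exactly the backing integer that Parquet stores for the INT32/INT64 physical types.
    
    ```scala
    import java.math.{BigDecimal => JBigDecimal}
    
    // A decimal(5, 2) value: unscaledValue() strips the scale, giving the
    // integer representation that is compared against Parquet statistics.
    val d = new JBigDecimal("123.45")
    val unscaled = d.unscaledValue().intValue()   // 12345
    ```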


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201928990
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -248,29 +371,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
         // Probably I missed something and obviously this should be changed.
     
         predicate match {
    -      case sources.IsNull(name) if canMakeFilterOn(name) =>
    +      case sources.IsNull(name) if canMakeFilterOn(name, null) =>
             makeEq.lift(nameToType(name)).map(_(name, null))
    -      case sources.IsNotNull(name) if canMakeFilterOn(name) =>
    +      case sources.IsNotNull(name) if canMakeFilterOn(name, null) =>
             makeNotEq.lift(nameToType(name)).map(_(name, null))
     
    -      case sources.EqualTo(name, value) if canMakeFilterOn(name) =>
    +      case sources.EqualTo(name, value) if canMakeFilterOn(name, value) =>
             makeEq.lift(nameToType(name)).map(_(name, value))
    -      case sources.Not(sources.EqualTo(name, value)) if canMakeFilterOn(name) =>
    +      case sources.Not(sources.EqualTo(name, value)) if canMakeFilterOn(name, value) =>
             makeNotEq.lift(nameToType(name)).map(_(name, value))
     
    -      case sources.EqualNullSafe(name, value) if canMakeFilterOn(name) =>
    +      case sources.EqualNullSafe(name, value) if canMakeFilterOn(name, value) =>
             makeEq.lift(nameToType(name)).map(_(name, value))
    -      case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name) =>
    +      case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name, value) =>
             makeNotEq.lift(nameToType(name)).map(_(name, value))
    --- End diff --
    
    I handled null values in `valueCanMakeFilterOn`:
    ```scala
    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
      value == null || (nameToType(name) match {
        case ParquetBooleanType => value.isInstanceOf[JBoolean]
        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
        case ParquetLongType => value.isInstanceOf[JLong]
        case ParquetFloatType => value.isInstanceOf[JFloat]
        case ParquetDoubleType => value.isInstanceOf[JDouble]
        case ParquetStringType => value.isInstanceOf[String]
        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
        case ParquetDateType => value.isInstanceOf[Date]
        case ParquetSchemaType(DECIMAL, INT32, _, decimalMeta) =>
          isDecimalMatched(value, decimalMeta)
        case ParquetSchemaType(DECIMAL, INT64, _, decimalMeta) =>
          isDecimalMatched(value, decimalMeta)
        case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, _, decimalMeta) =>
          isDecimalMatched(value, decimalMeta)
        case _ => false
      })
    }
    ```
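    To illustrate the short-circuit above outside the patch: a null value is accepted for any column type, which is what lets `IsNull`/`IsNotNull` reuse the same check as the value-carrying predicates. A simplified, hypothetical stand-in:
    ```scala
    object NullShortCircuitSketch {
      // Simplified stand-in for valueCanMakeFilterOn: null always passes;
      // otherwise the value's runtime class must match the column type.
      def valueMatches(columnType: String, value: Any): Boolean =
        value == null || (columnType match {
          case "int"    => value.isInstanceOf[java.lang.Integer]
          case "string" => value.isInstanceOf[String]
          case _        => false
        })

      def main(args: Array[String]): Unit = {
        println(valueMatches("int", null))                // true: IsNull/IsNotNull path
        println(valueMatches("int", Integer.valueOf(1)))  // true: type matches
        println(valueMatches("int", "1"))                 // false: schema mismatch
      }
    }
    ```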


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201927646
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -202,6 +283,16 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
         case ParquetDateType if pushDownDate =>
           (n: String, v: Any) =>
             FilterApi.gtEq(intColumn(n), dateToDays(v.asInstanceOf[Date]).asInstanceOf[Integer])
    +
    +    case ParquetSchemaType(DECIMAL, INT32, 0, _) if pushDownDecimal =>
    --- End diff --
    
    In fact, the length is always 0, so I replaced it with `_`.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/900/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/91/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r198904504
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -62,6 +98,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +    case decimal: DecimalType
    +      if pushDownDecimal && (DecimalType.is32BitDecimalType(decimal) && !readLegacyFormat) =>
    +      (n: String, v: Any) => FilterApi.eq(
    +        intColumn(n),
    +        Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().intValue()
    +          .asInstanceOf[Integer]).orNull)
    +    case decimal: DecimalType
    +      if pushDownDecimal && (DecimalType.is64BitDecimalType(decimal) && !readLegacyFormat) =>
    +      (n: String, v: Any) => FilterApi.eq(
    +        longColumn(n),
    +        Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().longValue()
    +            .asInstanceOf[java.lang.Long]).orNull)
    +    case decimal: DecimalType
    +      if pushDownDecimal && ((DecimalType.is32BitDecimalType(decimal) && readLegacyFormat)
    --- End diff --
    
    Please add comments here to explain what differs when `readLegacyFormat` is true.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201756667
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -37,41 +39,64 @@ import org.apache.spark.unsafe.types.UTF8String
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) {
    +private[parquet] class ParquetFilters(
    +    pushDownDate: Boolean,
    +    pushDownDecimal: Boolean,
    +    pushDownStartWith: Boolean) {
     
       private case class ParquetSchemaType(
           originalType: OriginalType,
           primitiveTypeName: PrimitiveTypeName,
    -      decimalMetadata: DecimalMetadata)
    -
    -  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, null)
    -  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, null)
    -  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, null)
    -  private val ParquetIntegerType = ParquetSchemaType(null, INT32, null)
    -  private val ParquetLongType = ParquetSchemaType(null, INT64, null)
    -  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, null)
    -  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, null)
    -  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, null)
    -  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, null)
    -  private val ParquetDateType = ParquetSchemaType(DATE, INT32, null)
    +      length: Int,
    +      decimalMeta: DecimalMetadata)
    +
    +  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, 0, null)
    +  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, 0, null)
    +  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, 0, null)
    +  private val ParquetIntegerType = ParquetSchemaType(null, INT32, 0, null)
    +  private val ParquetLongType = ParquetSchemaType(null, INT64, 0, null)
    +  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, 0, null)
    +  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, 0, null)
    +  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, 0, null)
    +  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, 0, null)
    +  private val ParquetDateType = ParquetSchemaType(DATE, INT32, 0, null)
     
       private def dateToDays(date: Date): SQLDate = {
         DateTimeUtils.fromJavaDate(date)
       }
     
    +  private def decimalToInt32(decimal: JBigDecimal): Integer = decimal.unscaledValue().intValue()
    +
    +  private def decimalToInt64(decimal: JBigDecimal): JLong = decimal.unscaledValue().longValue()
    +
    +  private def decimalToByteArray(decimal: JBigDecimal, numBytes: Int): Binary = {
    +    val decimalBuffer = new Array[Byte](numBytes)
    +    val bytes = decimal.unscaledValue().toByteArray
    +
    +    val fixedLengthBytes = if (bytes.length == numBytes) {
    +      bytes
    +    } else {
    +      val signByte = if (bytes.head < 0) -1: Byte else 0: Byte
    +      java.util.Arrays.fill(decimalBuffer, 0, numBytes - bytes.length, signByte)
    +      System.arraycopy(bytes, 0, decimalBuffer, numBytes - bytes.length, bytes.length)
    +      decimalBuffer
    +    }
    +    Binary.fromReusedByteArray(fixedLengthBytes, 0, numBytes)
    --- End diff --
    
    This byte array is not reused; it is allocated each time this function runs. This should use the `fromConstantByteArray` variant.
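    The suggested fix amounts to the following sketch (assuming parquet-column on the classpath; the sign-extension logic mirrors the diff above):
    ```scala
    import java.math.{BigDecimal => JBigDecimal}
    import org.apache.parquet.io.api.Binary

    object DecimalBinarySketch {
      // Pads the unscaled two's-complement bytes to a fixed length,
      // sign-extending with 0xFF for negative values and 0x00 otherwise.
      def decimalToByteArray(decimal: JBigDecimal, numBytes: Int): Binary = {
        val bytes = decimal.unscaledValue().toByteArray
        val fixed = if (bytes.length == numBytes) bytes else {
          val buf = new Array[Byte](numBytes)
          val signByte: Byte = if (bytes.head < 0) -1 else 0
          java.util.Arrays.fill(buf, 0, numBytes - bytes.length, signByte)
          System.arraycopy(bytes, 0, buf, numBytes - bytes.length, bytes.length)
          buf
        }
        // The array is freshly allocated per call and never mutated afterwards,
        // so the constant variant is the correct choice here.
        Binary.fromConstantByteArray(fixed, 0, numBytes)
      }
    }
    ```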


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92641/testReport)** for PR 21556 at commit [`c7308ab`](https://github.com/apache/spark/commit/c7308ab8758f395930c92c7d1f9e5183699b31f3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92710/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91769/testReport)** for PR 21556 at commit [`9832661`](https://github.com/apache/spark/commit/9832661e735fbbbfc4da4cc96f4ff7a537c3eca2).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3981/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202093955
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts parquet file schema to convert to
    +    // data source Filter. This must make sure that filter value matched the Filter.
    +    // If doesn't matched, then the schema used to read the file is incorrect,
    +    // which would cause data corruption.
    +    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
    +      value == null || (nameToType(name) match {
    +        case ParquetBooleanType => value.isInstanceOf[JBoolean]
    +        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    +        case ParquetLongType => value.isInstanceOf[JLong]
    +        case ParquetFloatType => value.isInstanceOf[JFloat]
    +        case ParquetDoubleType => value.isInstanceOf[JDouble]
    +        case ParquetStringType => value.isInstanceOf[String]
    +        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    +        case ParquetDateType => value.isInstanceOf[Date]
    +        case ParquetSchemaType(DECIMAL, INT32, 0, decimalMeta) =>
    --- End diff --
    
    Have you tried not using `|` and ignoring the physical type with `_`?


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202214356
  
    --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
    @@ -292,120 +292,120 @@ Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     
     Select 1 decimal(9, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     ------------------------------------------------------------------------------------------------
    -Parquet Vectorized                            3785 / 3867          4.2         240.6       1.0X
    -Parquet Vectorized (Pushdown)                 3820 / 3928          4.1         242.9       1.0X
    -Native ORC Vectorized                         3981 / 4049          4.0         253.1       1.0X
    -Native ORC Vectorized (Pushdown)               702 /  735         22.4          44.6       5.4X
    +Parquet Vectorized                            4407 / 4852          3.6         280.2       1.0X
    +Parquet Vectorized (Pushdown)                 1602 / 1634          9.8         101.8       2.8X
    --- End diff --
    
    Here is a test:
    ```scala
    // decimal(9, 2) max values is 9999999.99
    // 1024 * 1024 * 15 =          15728640
    val path = "/tmp/spark/parquet"
    spark.range(1024 * 1024 * 15).selectExpr("cast((id) as decimal(9, 2)) as id").orderBy("id").write.mode("overwrite").parquet(path)
    ```
    The generated parquet metadata:
    ```shell
    $ java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar meta  /tmp/spark/parquet
    file:        file:/tmp/spark/parquet/part-00000-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 
    
    file schema: spark_schema 
    --------------------------------------------------------------------------------
    id:          OPTIONAL INT32 O:DECIMAL R:0 D:1
    
    row group 1: RC:5728640 TS:36 OFFSET:4 
    --------------------------------------------------------------------------------
    id:           INT32 SNAPPY DO:0 FPO:4 SZ:38/36/0.95 VC:5728640 ENC:PLAIN,BIT_PACKED,RLE ST:[no stats for this column]
    file:        file:/tmp/spark/parquet/part-00001-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 
    
    file schema: spark_schema 
    --------------------------------------------------------------------------------
    id:          OPTIONAL INT32 O:DECIMAL R:0 D:1
    
    row group 1: RC:651016 TS:2604209 OFFSET:4 
    --------------------------------------------------------------------------------
    id:           INT32 SNAPPY DO:0 FPO:4 SZ:2604325/2604209/1.00 VC:651016 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 0.00, max: 651015.00, num_nulls: 0]
    file:        file:/tmp/spark/parquet/part-00002-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 
    
    file schema: spark_schema 
    --------------------------------------------------------------------------------
    id:          OPTIONAL INT32 O:DECIMAL R:0 D:1
    
    row group 1: RC:3231146 TS:12925219 OFFSET:4 
    --------------------------------------------------------------------------------
    id:           INT32 SNAPPY DO:0 FPO:4 SZ:12925864/12925219/1.00 VC:3231146 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 651016.00, max: 3882161.00, num_nulls: 0]
    file:        file:/tmp/spark/parquet/part-00003-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 
    
    file schema: spark_schema 
    --------------------------------------------------------------------------------
    id:          OPTIONAL INT32 O:DECIMAL R:0 D:1
    
    row group 1: RC:2887956 TS:11552408 OFFSET:4 
    --------------------------------------------------------------------------------
    id:           INT32 SNAPPY DO:0 FPO:4 SZ:11552986/11552408/1.00 VC:2887956 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 3882162.00, max: 6770117.00, num_nulls: 0]
    file:        file:/tmp/spark/parquet/part-00004-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 
    
    file schema: spark_schema 
    --------------------------------------------------------------------------------
    id:          OPTIONAL INT32 O:DECIMAL R:0 D:1
    
    row group 1: RC:3229882 TS:12920163 OFFSET:4 
    --------------------------------------------------------------------------------
    id:           INT32 SNAPPY DO:0 FPO:4 SZ:12920808/12920163/1.00 VC:3229882 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 6770118.00, max: 9999999.00, num_nulls: 0]
    ```
    As you can see, `file:/tmp/spark/parquet/part-00000-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet` has no stats generated for that column.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202090024
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -248,29 +371,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
         // Probably I missed something and obviously this should be changed.
     
         predicate match {
    -      case sources.IsNull(name) if canMakeFilterOn(name) =>
    +      case sources.IsNull(name) if canMakeFilterOn(name, null) =>
             makeEq.lift(nameToType(name)).map(_(name, null))
    -      case sources.IsNotNull(name) if canMakeFilterOn(name) =>
    +      case sources.IsNotNull(name) if canMakeFilterOn(name, null) =>
             makeNotEq.lift(nameToType(name)).map(_(name, null))
     
    -      case sources.EqualTo(name, value) if canMakeFilterOn(name) =>
    +      case sources.EqualTo(name, value) if canMakeFilterOn(name, value) =>
             makeEq.lift(nameToType(name)).map(_(name, value))
    -      case sources.Not(sources.EqualTo(name, value)) if canMakeFilterOn(name) =>
    +      case sources.Not(sources.EqualTo(name, value)) if canMakeFilterOn(name, value) =>
             makeNotEq.lift(nameToType(name)).map(_(name, value))
     
    -      case sources.EqualNullSafe(name, value) if canMakeFilterOn(name) =>
    +      case sources.EqualNullSafe(name, value) if canMakeFilterOn(name, value) =>
             makeEq.lift(nameToType(name)).map(_(name, value))
    -      case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name) =>
    +      case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name, value) =>
             makeNotEq.lift(nameToType(name)).map(_(name, value))
    --- End diff --
    
    Maybe I'm missing something, but that returns true for all null values.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/834/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92843/testReport)** for PR 21556 at commit [`16528f3`](https://github.com/apache/spark/commit/16528f3b8ef1f97a4445ecc238fbc758d2454334).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201763831
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -248,29 +371,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
         // Probably I missed something and obviously this should be changed.
     
         predicate match {
    -      case sources.IsNull(name) if canMakeFilterOn(name) =>
    +      case sources.IsNull(name) if canMakeFilterOn(name, null) =>
             makeEq.lift(nameToType(name)).map(_(name, null))
    -      case sources.IsNotNull(name) if canMakeFilterOn(name) =>
    +      case sources.IsNotNull(name) if canMakeFilterOn(name, null) =>
             makeNotEq.lift(nameToType(name)).map(_(name, null))
     
    -      case sources.EqualTo(name, value) if canMakeFilterOn(name) =>
    +      case sources.EqualTo(name, value) if canMakeFilterOn(name, value) =>
             makeEq.lift(nameToType(name)).map(_(name, value))
    -      case sources.Not(sources.EqualTo(name, value)) if canMakeFilterOn(name) =>
    +      case sources.Not(sources.EqualTo(name, value)) if canMakeFilterOn(name, value) =>
             makeNotEq.lift(nameToType(name)).map(_(name, value))
     
    -      case sources.EqualNullSafe(name, value) if canMakeFilterOn(name) =>
    +      case sources.EqualNullSafe(name, value) if canMakeFilterOn(name, value) =>
             makeEq.lift(nameToType(name)).map(_(name, value))
    -      case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name) =>
    +      case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name, value) =>
             makeNotEq.lift(nameToType(name)).map(_(name, value))
    --- End diff --
    
    Since makeNotEq is also used for EqualNullSafe, I think it should handle null values as well.
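    Null handling for the conversion functions can reuse the `Option(v).map(...).orNull` idiom already used by `makeEq` for other types, so a null filter value passes through unconverted instead of throwing. A runnable sketch of just that idiom:
    ```scala
    import java.math.{BigDecimal => JBigDecimal}

    object NullSafeConversionSketch {
      // A null filter value survives conversion as null; a non-null decimal
      // is converted to its unscaled INT32 representation.
      def toFilterValue(v: Any): Integer =
        Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().intValue()
          .asInstanceOf[Integer]).orNull

      def main(args: Array[String]): Unit = {
        println(toFilterValue(null))                     // null
        println(toFilterValue(new JBigDecimal("1.23")))  // 123
      }
    }
    ```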


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201761849
  
    --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
    @@ -292,120 +292,120 @@ Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     
     Select 1 decimal(9, 2) row (value = 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     ------------------------------------------------------------------------------------------------
    -Parquet Vectorized                            3785 / 3867          4.2         240.6       1.0X
    -Parquet Vectorized (Pushdown)                 3820 / 3928          4.1         242.9       1.0X
    -Native ORC Vectorized                         3981 / 4049          4.0         253.1       1.0X
    -Native ORC Vectorized (Pushdown)               702 /  735         22.4          44.6       5.4X
    +Parquet Vectorized                            4407 / 4852          3.6         280.2       1.0X
    +Parquet Vectorized (Pushdown)                 1602 / 1634          9.8         101.8       2.8X
    --- End diff --
    
    Maybe it is that the data is more dense, so we need to read more values in the row group that contains the one we're looking for?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92619 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92619/testReport)** for PR 21556 at commit [`f160648`](https://github.com/apache/spark/commit/f1606484867fbf1ce08be4793f8836c5126ec1b2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201924695
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts the parquet file schema to convert to a
    +    // data source Filter. This must make sure that the filter value matches the Filter.
    +    // If it doesn't match, then the schema used to read the file is incorrect,
    +    // which would cause data corruption.
    +    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
    +      value == null || (nameToType(name) match {
    +        case ParquetBooleanType => value.isInstanceOf[JBoolean]
    +        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    +        case ParquetLongType => value.isInstanceOf[JLong]
    +        case ParquetFloatType => value.isInstanceOf[JFloat]
    +        case ParquetDoubleType => value.isInstanceOf[JDouble]
    +        case ParquetStringType => value.isInstanceOf[String]
    +        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    +        case ParquetDateType => value.isInstanceOf[Date]
    +        case ParquetSchemaType(DECIMAL, INT32, 0, decimalMeta) =>
    --- End diff --
    
    No.
    ![image](https://user-images.githubusercontent.com/5399861/42616795-0c40b9e8-85e2-11e8-9697-ab5572b66bab.png)



---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91894/
    Test FAILed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    +1, I think this looks ready to go.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92936/testReport)** for PR 21556 at commit [`33d1f18`](https://github.com/apache/spark/commit/33d1f1836e10003e29bc439d2823a8b6d1821a82).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/538/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r198904779
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -62,6 +98,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +    case decimal: DecimalType
    +      if pushDownDecimal && (DecimalType.is32BitDecimalType(decimal) && !readLegacyFormat) =>
    +      (n: String, v: Any) => FilterApi.eq(
    +        intColumn(n),
    +        Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().intValue()
    --- End diff --
    
    Does this need to validate the scale of the decimal, or is scale adjusted in the analyzer?


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r201755545
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts the parquet file schema to convert to a
    +    // data source Filter. This must make sure that the filter value matches the Filter.
    +    // If it doesn't match, then the schema used to read the file is incorrect,
    +    // which would cause data corruption.
    +    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
    +      value == null || (nameToType(name) match {
    +        case ParquetBooleanType => value.isInstanceOf[JBoolean]
    +        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    +        case ParquetLongType => value.isInstanceOf[JLong]
    +        case ParquetFloatType => value.isInstanceOf[JFloat]
    +        case ParquetDoubleType => value.isInstanceOf[JDouble]
    +        case ParquetStringType => value.isInstanceOf[String]
    +        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    +        case ParquetDateType => value.isInstanceOf[Date]
    --- End diff --
    
    Why is there no support for timestamp?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    @HyukjinKwon, even if the values are null, the makeEq function only casts null to a Java Integer, so the handling is still safe. It just looks odd that `null.asInstanceOf[JInt]` is safe. Thanks to @wangyum for explaining it. Even if the null-safe equality predicate contains a null value, this should be safe.
    
    And, passing null in an equals predicate is supported by Parquet.
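    
    The null-cast behavior described above can be sketched in plain Java (a minimal illustration of the language semantics, not Spark code):
    
    ```java
    public class NullCastDemo {
        public static void main(String[] args) {
            Object v = null;
            // Casting null to a boxed type is legal in Java: it simply yields
            // null, with no NullPointerException. This is why makeEq's cast of
            // a null filter value to Integer is safe.
            Integer boxed = (Integer) v;
            System.out.println(boxed == null); // prints "true"
        }
    }
    ```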


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200275960
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -37,26 +39,50 @@ import org.apache.spark.unsafe.types.UTF8String
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) {
    +private[parquet] class ParquetFilters(
    +    pushDownDate: Boolean,
    +    pushDownDecimal: Boolean,
    +    pushDownStartWith: Boolean) {
     
       private case class ParquetSchemaType(
           originalType: OriginalType,
           primitiveTypeName: PrimitiveTypeName,
    -      decimalMetadata: DecimalMetadata)
    --- End diff --
    
    Don't need `DecimalMetadata`.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92996/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #93014 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93014/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4056/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    I misunderstood how it was safe as well. It was Yuming's clarification that helped.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91881/testReport)** for PR 21556 at commit [`51d8540`](https://github.com/apache/spark/commit/51d854000186dcef1385e4b8bcd84c2b9fd763c6).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #93014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93014/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaSummarizerExample `
      * `  class SerializableConfiguration(@transient var value: Configuration)`
      * `  class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)`
      * `  case class SchemaType(dataType: DataType, nullable: Boolean)`
      * `  implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) `
      * `  implicit class AvroDataFrameReader(reader: DataFrameReader) `
      * `class KMeansModel (@Since(\"1.0.0\") val clusterCenters: Array[Vector],`
      * `trait ComplexTypeMergingExpression extends Expression `
      * `case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes `
      * `abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast `
      * `case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike `


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r198906232
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -359,6 +369,70 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         }
       }
     
    +  test("filter pushdown - decimal") {
    +    Seq(true, false).foreach { legacyFormat =>
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) {
    +        Seq(s"_1 decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType
    --- End diff --
    
    Since this is providing a column name, it would be better to use something more readable than _1.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92414/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92413/testReport)** for PR 21556 at commit [`0b5d0e7`](https://github.com/apache/spark/commit/0b5d0e70fc4baff48ac5656e01a4ab98020c2166).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/681/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r198904089
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -378,6 +378,22 @@ object SQLConf {
         .booleanConf
         .createWithDefault(true)
     
    +  val PARQUET_FILTER_PUSHDOWN_DECIMAL_ENABLED =
    +    buildConf("spark.sql.parquet.filterPushdown.decimal")
    +      .doc(s"If true, enables Parquet filter push-down optimization for Decimal. " +
    +        "The default value is false to be compatible with the legacy Parquet format. " +
    +        s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " +
    +        "enabled and Decimal statistics are generated (since Spark 2.4).")
    +      .internal()
    +      .booleanConf
    +      .createWithDefault(true)
    +
    +  val PARQUET_READ_LEGACY_FORMAT = buildConf("spark.sql.parquet.readLegacyFormat")
    --- End diff --
    
    This property doesn't mention pushdown, but the description says it is only valid for push-down. Can you make the property name more clear?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92641/
    Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202448032
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -37,41 +39,64 @@ import org.apache.spark.unsafe.types.UTF8String
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) {
    +private[parquet] class ParquetFilters(
    +    pushDownDate: Boolean,
    +    pushDownDecimal: Boolean,
    +    pushDownStartWith: Boolean) {
     
       private case class ParquetSchemaType(
           originalType: OriginalType,
           primitiveTypeName: PrimitiveTypeName,
    -      decimalMetadata: DecimalMetadata)
    -
    -  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, null)
    -  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, null)
    -  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, null)
    -  private val ParquetIntegerType = ParquetSchemaType(null, INT32, null)
    -  private val ParquetLongType = ParquetSchemaType(null, INT64, null)
    -  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, null)
    -  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, null)
    -  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, null)
    -  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, null)
    -  private val ParquetDateType = ParquetSchemaType(DATE, INT32, null)
    +      length: Int,
    +      decimalMeta: DecimalMetadata)
    +
    +  private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, 0, null)
    +  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, 0, null)
    +  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, 0, null)
    +  private val ParquetIntegerType = ParquetSchemaType(null, INT32, 0, null)
    +  private val ParquetLongType = ParquetSchemaType(null, INT64, 0, null)
    +  private val ParquetFloatType = ParquetSchemaType(null, FLOAT, 0, null)
    +  private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, 0, null)
    +  private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, 0, null)
    +  private val ParquetBinaryType = ParquetSchemaType(null, BINARY, 0, null)
    +  private val ParquetDateType = ParquetSchemaType(DATE, INT32, 0, null)
     
       private def dateToDays(date: Date): SQLDate = {
         DateTimeUtils.fromJavaDate(date)
       }
     
    +  private def decimalToInt32(decimal: JBigDecimal): Integer = decimal.unscaledValue().intValue()
    +
    +  private def decimalToInt64(decimal: JBigDecimal): JLong = decimal.unscaledValue().longValue()
    +
    +  private def decimalToByteArray(decimal: JBigDecimal, numBytes: Int): Binary = {
    +    val decimalBuffer = new Array[Byte](numBytes)
    +    val bytes = decimal.unscaledValue().toByteArray
    +
    +    val fixedLengthBytes = if (bytes.length == numBytes) {
    +      bytes
    +    } else {
    +      val signByte = if (bytes.head < 0) -1: Byte else 0: Byte
    +      java.util.Arrays.fill(decimalBuffer, 0, numBytes - bytes.length, signByte)
    +      System.arraycopy(bytes, 0, decimalBuffer, numBytes - bytes.length, bytes.length)
    +      decimalBuffer
    +    }
    +    Binary.fromReusedByteArray(fixedLengthBytes, 0, numBytes)
    +  }
    +
       private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = {
    --- End diff --
    
    Sounds good. Thanks!
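    
    The sign-extension logic in `decimalToByteArray` quoted above can be sketched in plain Java (a standalone illustration without the Parquet `Binary` wrapper; the class and method names here are hypothetical):
    
    ```java
    import java.math.BigDecimal;
    import java.util.Arrays;
    
    public class FixedLenDemo {
        // Pad a decimal's unscaled two's-complement bytes out to a fixed
        // length, filling the leading bytes with the sign byte (0x00 for
        // non-negative values, 0xFF for negative ones), as FIXED_LEN_BYTE_ARRAY
        // decimal storage requires.
        static byte[] toFixedLength(BigDecimal decimal, int numBytes) {
            byte[] bytes = decimal.unscaledValue().toByteArray();
            if (bytes.length == numBytes) {
                return bytes;
            }
            byte[] buffer = new byte[numBytes];
            byte signByte = (byte) (bytes[0] < 0 ? -1 : 0);
            Arrays.fill(buffer, 0, numBytes - bytes.length, signByte);
            System.arraycopy(bytes, 0, buffer, numBytes - bytes.length, bytes.length);
            return buffer;
        }
    
        public static void main(String[] args) {
            byte[] pos = toFixedLength(new BigDecimal("1.23"), 4);  // unscaled 123
            byte[] neg = toFixedLength(new BigDecimal("-1.23"), 4); // unscaled -123
            System.out.println(Arrays.toString(pos)); // [0, 0, 0, 123]
            System.out.println(Arrays.toString(neg)); // [-1, -1, -1, -123]
        }
    }
    ```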


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200813391
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    Added a check method, `canMakeFilterOn`, and a test case:
    ```scala
        val decimal = new JBigDecimal(10).setScale(scale)
        assert(decimal.scale() === scale)
        assertResult(Some(lt(intColumn("cdecimal1"), 1000: Integer))) {
          parquetFilters.createFilter(parquetSchema, sources.LessThan("cdecimal1", decimal))
        }
    
        val decimal1 = new JBigDecimal(10).setScale(scale + 1)
        assert(decimal1.scale() === scale + 1)
    
        assertResult(None) {
          parquetFilters.createFilter(parquetSchema, sources.LessThan("cdecimal1", decimal1))
        }
    ```


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92996 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92996/testReport)** for PR 21556 at commit [`e713698`](https://github.com/apache/spark/commit/e7136984b482b3652e7ecfa542aec83aadc58320).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r202093665
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -225,12 +316,44 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
         val nameToType = getFieldMap(schema)
     
    +    def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
    +      case decimal: JBigDecimal =>
    +        decimal.scale == decimalMeta.getScale
    +      case _ => false
    +    }
    +
    +    // Since SPARK-24716, ParquetFilter accepts the parquet file schema to convert to a
    +    // data source Filter. This must make sure that the filter value matches the Filter.
    +    // If it doesn't match, then the schema used to read the file is incorrect,
    +    // which would cause data corruption.
    +    def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
    +      value == null || (nameToType(name) match {
    +        case ParquetBooleanType => value.isInstanceOf[JBoolean]
    +        case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    +        case ParquetLongType => value.isInstanceOf[JLong]
    +        case ParquetFloatType => value.isInstanceOf[JFloat]
    +        case ParquetDoubleType => value.isInstanceOf[JDouble]
    +        case ParquetStringType => value.isInstanceOf[String]
    +        case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    +        case ParquetDateType => value.isInstanceOf[Date]
    --- End diff --
    
    Not in this PR that adds Decimal support. We should consider it in the future, though.
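    
    The scale check in `isDecimalMatched` quoted above matters because a `BigDecimal`'s unscaled value depends on its scale; the same numeric value has different unscaled representations at different scales. A minimal Java illustration (not Spark code):
    
    ```java
    import java.math.BigDecimal;
    
    public class ScaleDemo {
        public static void main(String[] args) {
            // 1.23 at scale 2 has unscaled value 123; the same number at
            // scale 3 (1.230) has unscaled value 1230. A pushed-down literal
            // must therefore match the file schema's scale before its
            // unscaledValue() can be compared against stored values.
            BigDecimal a = new BigDecimal("1.23"); // scale 2
            BigDecimal b = a.setScale(3);          // scale 3, same numeric value
            System.out.println(a.unscaledValue()); // 123
            System.out.println(b.unscaledValue()); // 1230
        }
    }
    ```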


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/963/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Can you add benchmark code and results (on your env) in `FilterPushdownBenchmark` for this type?


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Jenkins, retest this please.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/946/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    retest this please


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/748/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92675/testReport)** for PR 21556 at commit [`341fe59`](https://github.com/apache/spark/commit/341fe5913b439322a1d3a08df53ed8acd84e0cbb).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92936/testReport)** for PR 21556 at commit [`33d1f18`](https://github.com/apache/spark/commit/33d1f1836e10003e29bc439d2823a8b6d1821a82).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #91909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91909/testReport)** for PR 21556 at commit [`51d8540`](https://github.com/apache/spark/commit/51d854000186dcef1385e4b8bcd84c2b9fd763c6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92675/
    Test FAILed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/165/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92996/testReport)** for PR 21556 at commit [`e713698`](https://github.com/apache/spark/commit/e7136984b482b3652e7ecfa542aec83aadc58320).


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200181749
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    +      (n: String, v: Any) => FilterApi.eq(
    +        intColumn(n),
    +        Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().intValue()
    +          .asInstanceOf[Integer]).orNull)
    +    case ParquetSchemaType(DECIMAL, INT64, decimal) if pushDownDecimal =>
    +      (n: String, v: Any) => FilterApi.eq(
    +        longColumn(n),
    +        Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().longValue()
    +          .asInstanceOf[java.lang.Long]).orNull)
    +    // Legacy DecimalType
    +    case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, decimal) if pushDownDecimal &&
    --- End diff --
    
    The binary used for the legacy type and for fixed-length storage should be the same, so I don't understand why there are two different conversion methods. Also, because this is using the Parquet schema now, there's no need to base the length of this binary on what older versions of Spark did -- in other words, if the underlying Parquet type is fixed, then just convert the decimal to that fixed size without worrying about legacy types.
    
    I think this should pass in the fixed array's length and convert the BigDecimal value to that length array for all cases. That works no matter what the file contains.
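    The conversion described above -- taking a `BigDecimal` and producing a fixed-length, big-endian two's-complement byte array of the length the file schema declares -- can be sketched as follows. This is a hypothetical helper (`toFixedLenBytes` is not the PR's actual method name), assuming the value's unscaled integer fits in the requested length:
    
    ```java
    import java.math.BigDecimal;
    import java.util.Arrays;
    
    public class DecimalToFixed {
        // Convert a BigDecimal's unscaled value into a fixed-length,
        // big-endian two's-complement byte array of the given length,
        // sign-extending on the left as needed.
        static byte[] toFixedLenBytes(BigDecimal decimal, int length) {
            byte[] unscaled = decimal.unscaledValue().toByteArray();
            if (unscaled.length > length) {
                throw new IllegalArgumentException(
                    "Decimal " + decimal + " does not fit in " + length + " bytes");
            }
            byte[] fixed = new byte[length];
            // Leading pad bytes are 0xFF for negative values, 0x00 otherwise.
            byte pad = (byte) (decimal.signum() < 0 ? 0xFF : 0x00);
            Arrays.fill(fixed, 0, length - unscaled.length, pad);
            System.arraycopy(unscaled, 0, fixed, length - unscaled.length, unscaled.length);
            return fixed;
        }
    
        public static void main(String[] args) {
            // 1.23 has unscaled value 123 -> 0x00 0x00 0x00 0x7B
            System.out.println(Arrays.toString(toFixedLenBytes(new BigDecimal("1.23"), 4)));
            // -1.23 has unscaled value -123 -> 0xFF 0xFF 0xFF 0x85
            System.out.println(Arrays.toString(toFixedLenBytes(new BigDecimal("-1.23"), 4)));
        }
    }
    ```
    
    Because the pad byte depends only on the sign, the same routine works for any declared fixed length, which is why basing the length on the file's Parquet type is enough.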


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #93017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93017/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).


---



[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200181894
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    Since this uses the file schema, I think it should validate that the file uses the same scale as the value passed in. That's a cheap sanity check to ensure correctness.
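    The sanity check suggested here amounts to comparing the filter literal's scale against the scale recorded in the file schema: the same numeric value at a different scale has a different unscaled integer, so comparing raw INT32/INT64 values across scales would silently return wrong rows. A minimal sketch (hypothetical method names, not the PR's code):
    
    ```java
    import java.math.BigDecimal;
    
    public class ScaleCheck {
        // A pushed-down decimal literal is only safe to compare against the
        // file's stored unscaled values if both use the same scale.
        static boolean scaleMatches(Object value, int fileScale) {
            return value instanceof BigDecimal
                && ((BigDecimal) value).scale() == fileScale;
        }
    
        public static void main(String[] args) {
            BigDecimal v = new BigDecimal("1.23");
            // Same numeric value, different scales, different unscaled integers:
            System.out.println(v.unscaledValue());             // 123  (scale 2)
            System.out.println(v.setScale(3).unscaledValue()); // 1230 (scale 3)
            System.out.println(scaleMatches(v, 2));            // true
            System.out.println(scaleMatches(v, 3));            // false
        }
    }
    ```
    
    Rejecting a mismatched scale simply disables the pushdown for that predicate, which is always correct; Spark's row-level filtering still applies afterwards.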


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/178/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #92683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92683/testReport)** for PR 21556 at commit [`341fe59`](https://github.com/apache/spark/commit/341fe5913b439322a1d3a08df53ed8acd84e0cbb).


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/960/
    Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21556
  
    **[Test build #93017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93017/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaSummarizerExample `
      * `  class SerializableConfiguration(@transient var value: Configuration)`
      * `  class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)`
      * `  case class SchemaType(dataType: DataType, nullable: Boolean)`
      * `  implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) `
      * `  implicit class AvroDataFrameReader(reader: DataFrameReader) `
      * `class KMeansModel (@Since(\"1.0.0\") val clusterCenters: Array[Vector],`
      * `trait ComplexTypeMergingExpression extends Expression `
      * `case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes `
      * `abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast `
      * `case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike `


---
