You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2015/12/29 05:45:16 UTC

[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/10502

    [SPARK-12355][SQL] Implement unhandledFilter interface for Parquet

    https://issues.apache.org/jira/browse/SPARK-12355
    This is similar with https://github.com/apache/spark/pull/10427.
    
    As discussed here https://github.com/apache/spark/pull/10221, this PR implemented `unhandledFilter` to remove duplicated Spark-side filtering.
    
    In case of Parquet, the columns referenced in pushed down filters should be given to `org.apache.spark.sql.parquet.row.requested_schema` whereas general datasources such as JDBC do not require the columns.
    
    However, the current interface of `DataSourceStrategy` removes the columns only referenced in pushed down filters. Therefore, this PR resolved this problem by manually generating the columns referenced in pushed down filters.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-12355

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10502
    
----
commit a55ad54a820f095e5116df05979f169c0fe8e0cf
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-12-29T04:34:53Z

    Implement unhandled filters for Parquet

commit cf331a453c3f99ee40ee5ca6f5029dadee3d07f6
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-12-29T04:36:42Z

    Correct existing tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48592707
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    +    val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap
    +    val referencedColumns = ArrayBuffer.empty[String]
    +    def getDataTypeOf(name: String): DataType = {
    +      referencedColumns += name
    +      dataTypeOf(name)
    +    }
    +    createParquetFilter(getDataTypeOf, predicate)
    --- End diff --
    
    Understood though, I think easy-to-read is better.
    This entry is a little hacky and I was not clearly sure what it made.
    Anyway, we should follow comitters comments.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167933518
  
    # Benchmark (Removed Spark-side Filter)
    
    ## Motivation
    
    This PR simplifies the query plans for Parquet files by stripping duplicated Spark-side filtering,
    
    from:
    
    ```
    == Physical Plan ==
    Filter (a#8 = 2)
      +- Scan ParquetRelation[a#8] InputPaths: file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-ef271ec6-95e1-43ae-9b3e-1d4dae6f69c3/part=1, PushedFilters: [EqualTo(a,2)]
    ```
    
    to :
    
    ```
    == Physical Plan ==
    Scan ParquetRelation[a#8] InputPaths: file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-ef271ec6-95e1-43ae-9b3e-1d4dae6f69c3/part=1, PushedFilters: [EqualTo(a,2)]
    ```
    
    However, in terms of performance, it is unkown if there is benefit. So, this benchmark was performed.
    Simply, several quries were executed with duplicated Spark-side filtering and without it.
    
    ## Environment
    
    - Machine: MacBook Pro Retina
    - CPU: 4
    - Memory: 8G
    
    ## Method
    
    In this benchmark, 
     - The filters that Parquet can generate were composed with 9 simple quries. `IS NULL`, `IS NOT NULL`, `=`, `!=`, `<=>`, `<`, `>`, `<=` and `>=` operators were tested here. Each query was executed 5 times and calculated the average times.
     - The [commit right before this PR](https://github.com/apache/spark/commit/73862a1) was used against this PR. Namely, The tests were performed with Spark-side filtering and without Spark-side filtering.
     - In order to test cleanly, the target Parquet file was uncompressed and in order to enable Parquet filtering row by row, `spark.sql.parquet.enableUnsafeRowRecordReader` was disabled.
    
    
    ## Dataset
    #### Raw Data
    - [TPC-H](http://www.tpc.org/tpch/) Lineitem Table created with factor 1 ([generate data](https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)) 
    - Size : 724.66 MB
    
    #### Create Target Parquet File
    - `case` class for Lineitem table
    ```scala
    case class Lineitem(l_orderkey: Int,
                        l_partkey: Int,
                        l_suppkey: Int,
                        l_linenumber: Int,
                        l_quantity: Float,
                        l_extendedprice: Float,
                        l_discount: Float,
                        l_tax: Float,
                        l_returnflag: String,
                        l_linestatus: String,
                        l_shipdate: String,
                        l_commitdate: String,
                        l_receiptdate: String,
                        l_shipinstruct: String,
                        l_shipmode: String,
                        l_comment: String)
    ```
    
    - Create Parquet file
    ```scala
    import sqlContext.implicits._
    var conf = new SparkConf()
    conf.setAppName("Test").setMaster("local")
    conf.set("spark.sql.parquet.compression.codec", "uncompressed")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sc.textFile("lineitem.tbl").map(_.split('|')).map { v =>
      Lineitem(
        v(0).trim.toInt,
        v(1).trim.toInt,
        v(2).trim.toInt,
        v(3).trim.toInt,
        v(4).trim.toFloat,
        v(5).trim.toFloat,
        v(6).trim.toFloat,
        v(7).trim.toFloat,
        v(8),
        v(9),
        v(10),
        v(11),
        v(12),
        v(13),
        v(14),
        v(15))
    }.toDF()
    df.save("lineitem", "parquet")
    ```
    
    #### Parquet file
    
    ```
    -rw-r--r--  hyukjinkwon hyukjinkwon 0 B         _SUCCESS
    -rw-r--r--  hyukjinkwon hyukjinkwon 1.54 KB     _common_metadata
    -rw-r--r--  hyukjinkwon hyukjinkwon 13.72 KB    _metadata
    -rw-r--r--  hyukjinkwon hyukjinkwon 52.16 MB    part-r-00000-2aa739b2-6194-47db-9dd5-461905cea976.parquet
    -rw-r--r--  hyukjinkwon hyukjinkwon 51.74 MB    part-r-00001-2aa739b2-6194-47db-9dd5-461905cea976.parquet
    -rw-r--r--  hyukjinkwon hyukjinkwon 51.97 MB    part-r-00002-2aa739b2-6194-47db-9dd5-461905cea976.parquet
    -rw-r--r--  hyukjinkwon hyukjinkwon 51.71 MB    part-r-00003-2aa739b2-6194-47db-9dd5-461905cea976.parquet
    -rw-r--r--  hyukjinkwon hyukjinkwon 51.89 MB    part-r-00004-2aa739b2-6194-47db-9dd5-461905cea976.parquet
    -rw-r--r--  hyukjinkwon hyukjinkwon 34.25 MB    part-r-00005-2aa739b2-6194-47db-9dd5-461905cea976.parquet
    ```
    
    ### Test Codes
    
    - Function to measure time
    ```scala
    def time[A](f: => A) = {
      val s = System.nanoTime
      val ret = f
      println("time: "+(System.nanoTime-s)/1e6+"ms")
      ret
    }
    ```
    
    - Configuration and `SQLContext`
    ```scala
    var conf = new SparkConf()
    conf.setAppName("Test").setMaster("local")
    conf.set("spark.sql.parquet.enableUnsafeRowRecordReader", "false")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    ```
    
    - `IS NULL`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey IS NULL").select("l_orderkey")
    time(df.collect())
    ```
    
    - `IS NOT NULL`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey IS NOT NULL").select("l_orderkey")
    time(df.collect())
    ```
    
    - `=`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey = 1").select("l_orderkey")
    time(df.collect())
    ```
    
    - `!=`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey != 1").select("l_orderkey")
    time(df.collect())
    ```
    - `<=>`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey <=> 1").select("l_orderkey")
    time(df.collect())
    ```
    - `<`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey < 3000000").select("l_orderkey")
    time(df.collect())
    ```
    
    - `>`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey > 3000000").select("l_orderkey")
    time(df.collect())
    ```
    
    - `<=`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey <= 3000000").select("l_orderkey")
    time(df.collect())
    ```
    
    - `>=`
    ```scala
    val source = sqlContext.read.parquet("lineitem")
    val df = source.filter("l_orderkey >= 3000000").select("l_orderkey")
    time(df.collect())
    ```
    
    ## Results
    
    |Operator|Without Spark Filtering(ms)|With Spark Filtering(ms)|Time Decreased(%)|
    |--------|---------------------------|------------------------|--------------|
    |`IS NULL`|645.015|669.038|3.590|
    |`IS NOT NULL`|8040.593|8394.950|4.221|
    |`=`| 885.764|906.658|2.304|
    |`!=`|7844.248|8082.113|2.943|
    |`<=>`|869.402|912.457|4.718|
    |`<`|4510.599|4625.214|2.478|
    |`>`|4732.729|4940.915|4.213|
    |`<=`|4868.453|4918.441|1.016|
    |`>=`|4751.772|4946.939|3.945|
    
    
    Basically, in a simple view, the difference was below.
    
    The original codes would work as below (**With Spark Filtering**):
    ```scala
    data
      // Parquet-side filtering
      .filter(pushedFilter)
      // Spark-side filtering
      .filter(pushedFilter)
    ```
    
    This PR would change this into below (**Without Spark Filtering**):
    ```scala
    data
      // Parquet-side filtering
      .filter(pushedFilter)
    ```
    
    Although both the same O(n) time complexity, the former was 2n and the latter was n. So, it seems there is performance benefit.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189100946
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-182688789
  
    @yhuai Would you look through this please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189130651
  
    yea will do. Thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-182670583
  
    @yhuai Would you look through this please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170542473
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49122/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189100796
  
    **[Test build #52014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52014/consoleFull)** for PR 10502 at commit [`0e149da`](https://github.com/apache/spark/commit/0e149dad44d1787f30122795961a97fd44c6761c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-182686261
  
    **[Test build #51075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51075/consoleFull)** for PR 10502 at commit [`ad757be`](https://github.com/apache/spark/commit/ad757beb5922350a7db556203d05a2c84f7f7fe1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167750071
  
    **[Test build #48405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48405/consoleFull)** for PR 10502 at commit [`2ad8182`](https://github.com/apache/spark/commit/2ad81822100229318d1d80796edb92916a10d820).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/10502


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170542471
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48590599
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    --- End diff --
    
    Oh, yes it looks so. I think I might also have to change `createFilter()` in that way because I just followed up in the way of `createFilter()` for this function because both `createFilter()` and `referencedColumns()` are called in the same places. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167737190
  
    **[Test build #48402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48402/consoleFull)** for PR 10502 at commit [`8c376af`](https://github.com/apache/spark/commit/8c376af467b939d3687e24f7d0e1ccf4e0ea5fca).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-182686745
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170470378
  
    **[Test build #49122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49122/consoleFull)** for PR 10502 at commit [`dfc7506`](https://github.com/apache/spark/commit/dfc7506fc579bcd4539f2ab26aebaa802ed5dc17).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189080574
  
    **[Test build #52014 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52014/consoleFull)** for PR 10502 at commit [`0e149da`](https://github.com/apache/spark/commit/0e149dad44d1787f30122795961a97fd44c6761c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167738930
  
    **[Test build #48405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48405/consoleFull)** for PR 10502 at commit [`2ad8182`](https://github.com/apache/spark/commit/2ad81822100229318d1d80796edb92916a10d820).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167729051
  
    I see. `UnsafeRowParquetRecordReader` at Parquet does not support filter record by record but just block. So, even with `=` operator produces the same results below. 
    
    ```
    +---+
    | _1|
    +---+
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    +---+
    ```
    
    I think I should disable `unhandledFilters` when 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167737261
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-180324662
  
    I will resolve the conflict on Monday.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170467935
  
    **[Test build #49115 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49115/consoleFull)** for PR 10502 at commit [`dfc7506`](https://github.com/apache/spark/commit/dfc7506fc579bcd4539f2ab26aebaa802ed5dc17).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189116088
  
    @HyukjinKwon Thank you for working on it. We have been actively improving the efficiency of code-gen and unsafe parquet reader. For the long term, letting parquet to evaluate filters like `a=10` row by row is not as efficient as using the code gen version of Spark filter. We do plan to let unsafe parquet reader to evaluate very cheap predicates like `IsNotNull` and `IsNull` for every row. However, before it happens, can we hold off the work on this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167750147
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48405/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167750403
  
    cc @yhuai @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48593890
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -288,20 +293,28 @@ private[sql] class ParquetRelation(
         }
       }
     
    +  override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
    +    // The unsafe row RecordReader does not support row by row filtering so for this case
    +    // it should wrap this with Spark-side filtering.
    +    val enableUnsafeRowParquetReader =
    +      sqlContext.getConf(SQLConf.PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED.key).toBoolean
    --- End diff --
    
    Hm.. I think we can implement this for now and then simply change the condition when `UnsafeRowParquetRecordReader` supports filtering row by row.
    
    The option for Parquet filter push down was also false by default but it was implemented. I don't think a default value can deicde if a feature should be implemented or not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167722257
  
    The test is failed from wrong results from Parquet.
    
    The test result was below:
    ```
    == Physical Plan ==
    Scan ParquetRelation[_1#4] InputPaths: file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-2fa8f75a-8ac1-4273-afcd-f271529a24a7, PushedFilters: [GreaterThan(_1,5)]
    ```
    ```
    == Results ==
    !== Correct Answer - 4 ==   == Spark Answer - 5 ==
    ![6]                        [5]
    ![7]                        [6]
    ![8]                        [7]
    ![9]                        [8]
    !                           [9]
    ```
    
    As can be seen, `GreaterThan(_1,5)` is pushed down correctly but the results includes `5`.
    I am looking into this deeper.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170468892
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-182672430
  
    **[Test build #51075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51075/consoleFull)** for PR 10502 at commit [`ad757be`](https://github.com/apache/spark/commit/ad757beb5922350a7db556203d05a2c84f7f7fe1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170542317
  
    **[Test build #49122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49122/consoleFull)** for PR 10502 at commit [`dfc7506`](https://github.com/apache/spark/commit/dfc7506fc579bcd4539f2ab26aebaa802ed5dc17).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167895015
  
    @yhuai Sure. I will try!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-182686749
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51075/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167720165
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167736372
  
    **[Test build #48403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48403/consoleFull)** for PR 10502 at commit [`8c376af`](https://github.com/apache/spark/commit/8c376af467b939d3687e24f7d0e1ccf4e0ea5fca).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170501209
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-213244434
  
    Let me close this for now. Please let me know although this is closed. I will reopen this when I start to work on this again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167750144
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167736506
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48403/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167733162
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48590244
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    +    val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap
    +    val referencedColumns = ArrayBuffer.empty[String]
    +    def getDataTypeOf(name: String): DataType = {
    +      referencedColumns += name
    +      dataTypeOf(name)
    +    }
    +    createParquetFilter(getDataTypeOf, predicate)
    --- End diff --
    
    We need this here? ISTM that `ParquetFilters.createFilter' calls `createParquetFilter' again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #10502: [SPARK-12355][SQL] Implement unhandledFilter interface f...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/10502
  
    Hi @yhuai ! Would this be okay if I give a try for this one again maybe?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189119103
  
    @yhuai No problem. Then, please inform me when I am supposed to do something else.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-189100947
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52014/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48589974
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    --- End diff --
    
    Better to add `private[parquet]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167736500
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170500927
  
    **[Test build #49115 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49115/consoleFull)** for PR 10502 at commit [`dfc7506`](https://github.com/apache/spark/commit/dfc7506fc579bcd4539f2ab26aebaa802ed5dc17).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48603240
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    +    val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap
    +    val referencedColumns = ArrayBuffer.empty[String]
    +    def getDataTypeOf(name: String): DataType = {
    +      referencedColumns += name
    +      dataTypeOf(name)
    +    }
    +    createParquetFilter(getDataTypeOf, predicate)
    --- End diff --
    
    For me I think I would not be sure why `createFilter` retuens filters with referenced columns. I see `OrcFilters.createFilter` returns only filters and I do not think `Parquet.createFilter` returning multiple objects is eazy to read.
    
    If you think we don't have to run the function twice  and the implementation is hacky, please suggest possible implementation.
    
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48593431
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -288,20 +293,28 @@ private[sql] class ParquetRelation(
         }
       }
     
    +  override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
    +    // The unsafe row RecordReader does not support row by row filtering so for this case
    +    // it should wrap this with Spark-side filtering.
    +    val enableUnsafeRowParquetReader =
    +      sqlContext.getConf(SQLConf.PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED.key).toBoolean
    --- End diff --
    
    Since `SQLConf.PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED` is internally true by default, this `unhandledFilters` always returns `filters` as it is. So, this optimization is little meaningful to end users.
    IMO if `UnsafeRowParquetRecordReader` supports row-by-row filtering, then we implement this.
    Though?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167720172
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48396/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167835640
  
    @HyukjinKwon Thank you for the PR? Can you post some benchmarking results (with your testing code)? It will be good to have these numbers to help others understand if it can provide benefit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48590538
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    +    val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap
    +    val referencedColumns = ArrayBuffer.empty[String]
    +    def getDataTypeOf(name: String): DataType = {
    +      referencedColumns += name
    +      dataTypeOf(name)
    +    }
    +    createParquetFilter(getDataTypeOf, predicate)
    --- End diff --
    
    `createParquetFilter ` needs to be called to have the list of referenced columns while creating filters.
    
    If you meant `ParquetFilters.createFilter` should be called once as returning both the columns and filter, I originally wanted to do so but I realised calling twice like this would reduce the changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10502#discussion_r48591978
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
       }
     
       /**
    +   *  Return referenced columns in [[sources.Filter]].
    +   */
    +  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
    --- End diff --
    
    Agreed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167933544
  
    @yhuai Would you look through this please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-170501213
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49115/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167733908
  
    **[Test build #48403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48403/consoleFull)** for PR 10502 at commit [`8c376af`](https://github.com/apache/spark/commit/8c376af467b939d3687e24f7d0e1ccf4e0ea5fca).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167720147
  
    **[Test build #48396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48396/consoleFull)** for PR 10502 at commit [`cf331a4`](https://github.com/apache/spark/commit/cf331a453c3f99ee40ee5ca6f5029dadee3d07f6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167719205
  
    **[Test build #48396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48396/consoleFull)** for PR 10502 at commit [`cf331a4`](https://github.com/apache/spark/commit/cf331a453c3f99ee40ee5ca6f5029dadee3d07f6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167734521
  
    **[Test build #48402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48402/consoleFull)** for PR 10502 at commit [`8c376af`](https://github.com/apache/spark/commit/8c376af467b939d3687e24f7d0e1ccf4e0ea5fca).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12355][SQL] Implement unhandledFilter i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10502#issuecomment-167737265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48402/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org