Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/26 10:49:17 UTC

[GitHub] [iceberg] nvitucci opened a new issue #2517: Predicate pushdown not visible in Spark plan

nvitucci opened a new issue #2517:
URL: https://github.com/apache/iceberg/issues/2517


   I have looked into Spark plans to make sure that a partition filter is actually being pushed down to Iceberg, but the plan does not explicitly show any such information. The table is created like this:
   
   ```
   spark.sql("CREATE TABLE local.db.table (\n"
           + "    orderkey string,\n"
           + "    custkey string,\n"
           + "    orderstatus string,\n"
           + "    totalprice double,\n"
           + "    orderdate date,\n"
           + "    orderpriority string,\n"
           + "    clerk string,\n"
           + "    shippriority bigint,\n"
           + "    comment string"
           + ")\n"
           + "USING iceberg\n"
           + "PARTITIONED BY (orderpriority)");
   ```
   
   with `local` being a Hive catalog added to the Spark configuration as follows:
   
   ```
   ...
       .set("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
       .set("spark.sql.catalog.local.type", "hive")
       .set("spark.sql.catalog.local.uri", hiveConf.get(METASTOREURIS.varname))
   ...
   ```
   
   The table is populated from a 1.6GB, 15M-row `orders` table file generated with the [TPCH kit](https://github.com/gregrahn/tpch-kit). When I run a query with `explain` like so:
   
   ```
   spark.sql("SELECT * FROM local.db.table WHERE orderpriority = '1-URGENT'").explain(false);
   ```
   
   the resulting plan is the following:
   
   ```
   == Physical Plan ==
   *(1) Project [orderkey#46, custkey#47, orderstatus#48, totalprice#49, orderdate#50, orderpriority#51, clerk#52, shippriority#53L, comment#54]
   +- *(1) Filter (isnotnull(orderpriority#51) AND (orderpriority#51 = 1-URGENT))
      +- BatchScan[orderkey#46, custkey#47, orderstatus#48, totalprice#49, orderdate#50, orderpriority#51, clerk#52, shippriority#53L, comment#54] local.db.table [filters=orderpriority IS NOT NULL, orderpriority = '1-URGENT']
   ```
   
   Am I wrong in expecting a `ScanV2 iceberg` (or similar) section of the plan instead of `BatchScan`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on issue #2517: Predicate pushdown not visible in Spark plan

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2517:
URL: https://github.com/apache/iceberg/issues/2517#issuecomment-827106131


   The pushdown information is in the BatchScan info ```[filters=orderpriority IS NOT NULL, orderpriority = '1-URGENT']``` which shows the filters as presented to the Iceberg Table.
   
   See
   https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L210-L214
   
   That description is then put into the plan string by the DSv2 ScanExec node (Spark):
   ```
     override def simpleString(maxFields: Int): String = {
       val result =
         s"$nodeName${truncatedString(output, "[", ", ", "]", maxFields)} ${scan.description()}"
       redact(result)
     }
   ```
   
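   `nodeName` is the name of the physical plan node, which in this case is `BatchScanExec` (also Spark; the `Exec` suffix is dropped when the name is printed, hence `BatchScan` in the plan). Its documentation reads: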
   ```
   Physical plan node for scanning a batch of data from a data source v2.
   ```
   
   So everything is working as expected here.
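
To check from code which filters were presented to Iceberg, one option is to read them back out of the plan text. The following is only a minimal sketch: it assumes the `BatchScan` line ends with a `[filters=...]` suffix exactly as in the plan quoted in this issue, and the class and method names (`PushedFilters`, `pushedFilters`) are made up for illustration. It does not prove that partition pruning took place, only that the filters reached the scan description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark or Iceberg.
class PushedFilters {
    // Assumption: the scan description ends with "[filters=...]",
    // as in the plan shown in this thread.
    private static final Pattern FILTERS =
        Pattern.compile("\\[filters=(.*)\\]\\s*$");

    // Returns the filters listed on every BatchScan line of a plan string.
    public static List<String> pushedFilters(String plan) {
        List<String> result = new ArrayList<>();
        for (String line : plan.split("\\R")) {
            if (!line.contains("BatchScan")) {
                continue;
            }
            Matcher m = FILTERS.matcher(line);
            if (m.find()) {
                // Naive split on commas: a filter that itself contains a
                // comma (e.g. an IN list) would be split incorrectly.
                for (String f : m.group(1).split(",")) {
                    String trimmed = f.trim();
                    if (!trimmed.isEmpty()) {
                        result.add(trimmed);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Plan lines copied (with column lists shortened) from this issue.
        String plan = String.join("\n",
            "== Physical Plan ==",
            "*(1) Project [orderkey#46, orderpriority#51]",
            "+- *(1) Filter (isnotnull(orderpriority#51) AND (orderpriority#51 = 1-URGENT))",
            "   +- BatchScan[orderkey#46, orderpriority#51] local.db.table "
                + "[filters=orderpriority IS NOT NULL, orderpriority = '1-URGENT']");
        System.out.println(pushedFilters(plan));
    }
}
```

In a Spark job the plan string could be taken from the `Dataset` instead of being hard-coded (for example via `queryExecution().executedPlan().toString()`); that wiring is left out here so the sketch runs with the JDK alone.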



[GitHub] [iceberg] RussellSpitzer commented on issue #2517: Predicate pushdown not visible in Spark plan

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2517:
URL: https://github.com/apache/iceberg/issues/2517#issuecomment-827169378


   I think we could definitely add some more details to the description; pull requests are welcome :) We basically have full control over everything that gets printed after the Spark "BatchScan Info".
   
   I think the only way to really tell at the moment is to check the partitions being created for the read: the (Spark) partition information should list all the files required to load that particular table. So I would probably start tinkering with that, but it would require a bit of digging.
   
   Probably something like: take `DF.rdd`, traverse to the parent Scan RDD, then print its partitions (`partitions foreach println`).



[GitHub] [iceberg] nvitucci commented on issue #2517: Predicate pushdown not visible in Spark plan

Posted by GitBox <gi...@apache.org>.

nvitucci commented on issue #2517:
URL: https://github.com/apache/iceberg/issues/2517#issuecomment-827167227


   Thanks for your reply. I was expecting some more information in the Iceberg node, a bit like you would see if you were to run the same query on the original table:
   
   ```
   ...
      +- FileScan csv [orderkey#0,custkey#1,orderstatus#2,totalprice#3,orderdate#4,orderpriority#5,clerk#6,shippriority#7,comment#8] Batched: false, DataFilters: [isnotnull(orderpriority#5), (orderpriority#5 = 1-URGENT)], Format: CSV, Location: InMemoryFileIndex[file:/Users/nvitucci/temp/tpch-kit/generated-data/orders.tbl], PartitionFilters: [], PushedFilters: [IsNotNull(orderpriority), EqualTo(orderpriority,1-URGENT)], ReadSchema: struct<orderkey:string,custkey:string,orderstatus:string,totalprice:double,orderdate:date,orderpr...
   ```
   
   or like in #1483, where `iceberg` is clearly visible in the nodes it is related to along with path information (although I now realize that it is probably a Spark 2-only behaviour).
   
   Since a filter on a non-partition column results in basically the same plan, is there another way to see that the partition filter is actually being used for pruning? In other words, how can I programmatically check that only the files in the selected partition are being read?



[GitHub] [iceberg] nautilus28 commented on issue #2517: Predicate pushdown not visible in Spark plan

Posted by GitBox <gi...@apache.org>.

nautilus28 commented on issue #2517:
URL: https://github.com/apache/iceberg/issues/2517#issuecomment-870692655


   Have you made any progress with this @nvitucci? We are also interested to see if the filter is actually being used for partition pruning.



[GitHub] [iceberg] nvitucci commented on issue #2517: Predicate pushdown not visible in Spark plan

Posted by GitBox <gi...@apache.org>.

nvitucci commented on issue #2517:
URL: https://github.com/apache/iceberg/issues/2517#issuecomment-873086910


   > Have you made any progress with this @nvitucci? We are also interested to see if the filter is actually being used for partition pruning.
   
   I have created [a draft PR](https://github.com/apache/iceberg/pull/2780) to start discussing a potential solution. Could you please take a look too?



[GitHub] [iceberg] nvitucci commented on issue #2517: Predicate pushdown not visible in Spark plan

Posted by GitBox <gi...@apache.org>.

nvitucci commented on issue #2517:
URL: https://github.com/apache/iceberg/issues/2517#issuecomment-827177905


   Thanks for the pointers, I'll take a look and report back.

