Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/11 19:11:53 UTC

[GitHub] [iceberg] dramaticlly opened a new issue, #6567: pyiceberg table scan problem with row filter set to non-partition columns

dramaticlly opened a new issue, #6567:
URL: https://github.com/apache/iceberg/issues/6567

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I really like the new table scan feature in the latest pyiceberg release, 0.2.1. Thanks, @Fokko. It works well when the row filter is on the partition column, but not as expected when the expression references other columns. `scan.plan_files()` should return only the Parquet files that can satisfy the row filter predicate, but it returns all files instead.
   
   Here are my repro steps. I created a simple table, `hongyue_zhang.mls23`, to start with.
   
   ### schema
   ```ddl
                                          Create Table
   ------------------------------------------------------------------------------------------
    CREATE TABLE iceberg.hongyue_zhang.mls23 (
       id bigint NOT NULL,
       data varchar,
       ts date
    )
    WITH (
       format = 'PARQUET',
       location = 's3a://warehouse-default/warehouse/hongyue_zhang.db/mls23',
       partitioning = ARRAY['ts']
    )
   (1 row)
   ```
   
   ### Setup
   The table has 2 partitions and 198 records in total. For the sake of simplicity, each write produced its own Parquet file.
   ```
        partition    | record_count | file_count | total_size |                                        data
   -----------------+--------------+------------+------------+-------------------------------------------------------------------------------------
    {ts=2023-01-04} |           99 |         99 |     115300 | {id={min=1, max=1, null_count=0}, data={min=b, max=bbbbbbbbbbbbbbbc, null_count=0}}
    {ts=2023-01-05} |           99 |         99 |     115303 | {id={min=0, max=0, null_count=0}, data={min=a, max=aaaaaaaaaaaaaaab, null_count=0}}
   (2 rows)
   ```
   
   ### Python code
   ```python
   import os
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import GreaterThanOrEqual, And, EqualTo
   
   catalog = load_catalog("prod")
   table = catalog.load_table("hongyue_zhang.mls23")
   table.location()
   
   scan1 = table.scan(
       row_filter=EqualTo("ts", "2023-01-04"))
   yesterday_files = [task.file.file_path for task in scan1.plan_files()]
   print(len(yesterday_files))
   # expected 99, actual 99: all Parquet files of the single partition
   
   scan2 = table.scan(
       row_filter=EqualTo("data", "a"))
   a_files = [task.file.file_path for task in scan2.plan_files()]
   print(len(a_files))
   # expected 1, actual 198: every Parquet file in the table is returned
   
   scan3 = table.scan(
       row_filter=And(EqualTo("ts", "2023-01-04"), EqualTo("data", "a")))
   yesterday_and_a_files = [task.file.file_path for task in scan3.plan_files()]
   print(len(yesterday_and_a_files))
   # expected 1, actual 99: the filter on partition column ts is applied, but the filter on data is ignored
   ```
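   
   To make the expectation concrete, here is a minimal sketch of the min/max pruning I assumed would happen for a non-partition column. It is plain Python, not pyiceberg's implementation: `FileStats`, `might_contain`, and `prune` are hypothetical names, and the bounds are illustrative values rather than the real manifest contents of `mls23`.
   
   ```python
   from dataclasses import dataclass
   from typing import List
   
   
   @dataclass(frozen=True)
   class FileStats:
       """Hypothetical stand-in for the per-file bounds of one column."""
       path: str
       lower: str  # smallest value of the column in this file
       upper: str  # largest value of the column in this file
   
   
   def might_contain(stats: FileStats, value: str) -> bool:
       """True when value lies within [lower, upper], so the file cannot be skipped."""
       return stats.lower <= value <= stats.upper
   
   
   def prune(files: List[FileStats], value: str) -> List[str]:
       """Keep only the files whose bounds admit an equality match on value."""
       return [f.path for f in files if might_contain(f, value)]
   
   
   # Illustrative bounds only (assumption): one record per file, so lower == upper.
   files = [
       FileStats("f1.parquet", lower="a", upper="a"),
       FileStats("f2.parquet", lower="aa", upper="aa"),
       FileStats("f3.parquet", lower="b", upper="b"),
   ]
   print(prune(files, "a"))
   ```
   
   With bounds like these, an equality filter on `data` would keep a single file, which is the behavior I expected from `scan2`.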
   
   For validation, I also ran a query with a similar condition in Spark, and it returns 1 file as expected:
   ```spark
   val result = spark.sql("select id, data, input_file_name(), ts from iceberg.hongyue_zhang.mls23 where data = 'a'")
   +---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
   |id |data|input_file_name()                                                                                                                                    |ts        |
   +---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
   |0  |a   |s3a://warehouse-default/warehouse/hongyue_zhang.db/mls23/data/ts=2023-01-05/00000-6-5573682f-d72c-4a68-a08f-8fe4dbca8581-00001.parquet               |2023-01-05|
   +---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
   ```
   
   
   I can't seem to figure out why it fails, and I'm happy to contribute if anyone can share insights.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

