Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/11 19:11:53 UTC

[GitHub] [iceberg] dramaticlly opened a new issue, #6567: pyiceberg table scan problem with row filter set to non-partition columns

dramaticlly opened a new issue, #6567:
URL: https://github.com/apache/iceberg/issues/6567

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I really like the new table scan feature in the latest pyiceberg release, 0.2.1. Thanks @Fokko. It works well when the row filter uses the partition column, but not as expected when the expression references other columns. `scan.plan_files()` should return only the parquet files that can satisfy the row-filter predicate, but it returns all of them instead.
   
   Here are my repro steps. I created a simple table, `hongyue_zhang.mls23`, to start with:
   
   ### schema
   ```ddl
                                          Create Table
   ------------------------------------------------------------------------------------------
    CREATE TABLE iceberg.hongyue_zhang.mls23 (
       id bigint NOT NULL,
       data varchar,
       ts date
    )
    WITH (
       format = 'PARQUET',
       location = 's3a://warehouse-default/warehouse/hongyue_zhang.db/mls23',
       partitioning = ARRAY['ts']
    )
   (1 row)
   ```
   
   ### Setup
   The table has 2 partitions and 198 records in total. For simplicity, each write produced its own parquet file.
   ```
        partition    | record_count | file_count | total_size |                                        data
   -----------------+--------------+------------+------------+-------------------------------------------------------------------------------------
    {ts=2023-01-04} |           99 |         99 |     115300 | {id={min=1, max=1, null_count=0}, data={min=b, max=bbbbbbbbbbbbbbbc, null_count=0}}
    {ts=2023-01-05} |           99 |         99 |     115303 | {id={min=0, max=0, null_count=0}, data={min=a, max=aaaaaaaaaaaaaaab, null_count=0}}
   (2 rows)
   ```
   
   ### Python code
   ```python
   import os
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import GreaterThanOrEqual, And, EqualTo
   
   catalog = load_catalog("prod")
   table = catalog.load_table("hongyue_zhang.mls23")
   table.location()
   
   scan1 = table.scan(
       row_filter=EqualTo("ts", "2023-01-04"))
   yesterday_files = [task.file.file_path for task in scan1.plan_files()]
   print(len(yesterday_files))
   # expected 99, actual 99: the parquet files of the single matching partition
   
   scan2 = table.scan(
       row_filter=EqualTo("data", "a"))
   a_files = [task.file.file_path for task in scan2.plan_files()]
   print(len(a_files))
   # expected 1, actual 198: every parquet file in the table is returned
   
   scan3 = table.scan(
       row_filter=And(EqualTo("ts", "2023-01-04"), EqualTo("data", "a")))
   yesterday_and_a_files = [task.file.file_path for task in scan3.plan_files()]
   print(len(yesterday_and_a_files))
   # expected 1, actual 99: the row filter applies the 1st expression (partition column ts) but not the 2nd expression on data
   ```
   
   To validate, I also ran a similar query in Spark, and it returns 1 file as expected:
   ```spark
   val result = spark.sql("select id, data, input_file_name(), ts from iceberg.hongyue_zhang.mls23 where data = 'a'")
   +---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
   |id |data|input_file_name()                                                                                                                                    |ts        |
   +---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
   |0  |a   |s3a://warehouse-default/warehouse/hongyue_zhang.db/mls23/data/ts=2023-01-05/00000-6-5573682f-d72c-4a68-a08f-8fe4dbca8581-00001.parquet               |2023-01-05|
   +---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
   ```
   
   
   I can't seem to figure out why it fails, and I'm happy to contribute if anyone can share insights.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #6567: Python: Table scan problem with row filter set to non-partition columns

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue closed issue #6567: Python: Table scan problem with row filter set to non-partition columns
URL: https://github.com/apache/iceberg/issues/6567




[GitHub] [iceberg] Fokko commented on issue #6567: Python: Table scan problem with row filter set to non-partition columns

Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1381041533

   @rdblue I think that's an excellent idea: https://github.com/apache/iceberg/pull/6574




[GitHub] [iceberg] rdblue commented on issue #6567: Python: Table scan problem with row filter set to non-partition columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1380753760

   @Fokko, for the delete files, should we add some detection and fail if there are delete files?
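
A minimal sketch of what such detection could look like, assuming a scan planner that walks manifest entries. The `Entry` class and the `data_files_or_fail` helper are hypothetical and are not pyiceberg APIs; only the content codes (0 for data, 1 for position deletes, 2 for equality deletes) come from the Iceberg table spec.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class FileContent(IntEnum):
    """Content codes as defined by the Iceberg table spec."""
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a manifest entry; not a pyiceberg class."""
    path: str
    content: FileContent


def data_files_or_fail(entries: Iterable[Entry]) -> List[str]:
    """Return data file paths, failing loudly if any delete file is present."""
    paths = []
    for entry in entries:
        if entry.content != FileContent.DATA:
            # Ignoring a delete file would silently return rows that were deleted.
            raise NotImplementedError(
                f"delete files are not supported yet: {entry.path}"
            )
        paths.append(entry.path)
    return paths
```

Failing early is the conservative choice here: a reader that skips delete files would otherwise hand back data files that still physically contain rows another engine considers deleted.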




[GitHub] [iceberg] dramaticlly commented on issue #6567: Python: Table scan problem with row filter set to non-partition columns

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1380881675

   @Fokko, thank you for the quick answer. So, if I understand correctly, pyiceberg 0.2.1 applies the row filter and column projection against partition spec ranges but not against data file ranges?
   
   All I want to know is what the current code covers. I can definitely wait for the 0.3 release. Thank you again for the great work!




[GitHub] [iceberg] erikcw commented on issue #6567: pyiceberg table scan problem with row filter set to non-partition columns

Posted by GitBox <gi...@apache.org>.
erikcw commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1379415205

   I stumbled into the same issue with a slight twist. I deleted all the rows from my table; however, pyiceberg is still returning parquet files with those records. Shouldn't those files no longer be in the current manifest?
   
   ```sql
   -- Executed in Athena
   DELETE FROM iceberg_test WHERE uid = '200441';
   
   select count(uid) from "iceberg_test"
   where uid = '200441';
   
   -- Returns 0.
   
   
   ```
   
   ```python
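   # Imports added so the snippet is self-contained.
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import NotEqualTo
   
   # Glue catalog type.
   catalog = load_catalog("default")
   table = catalog.load_table("testing.iceberg_test")
   
   scan = table.scan(
       row_filter=NotEqualTo("uid", "200441"),  # Doesn't seem to make a difference with or without this line.
       selected_fields=("uid",),  # trailing comma makes this a one-element tuple
   )
   files = [task.file.file_path for task in scan.plan_files()]
   # The returned files all contain the deleted value.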
   
   ```




[GitHub] [iceberg] Fokko commented on issue #6567: pyiceberg table scan problem with row filter set to non-partition columns

Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1379472226

   @dramaticlly I just checked, and I can confirm that we don't filter on the data file ranges yet. This will be implemented very soon 👍🏻 
   
   @erikcw Thanks for raising the issue. We're not handling deleted files right now. Could you create a separate issue so we can keep track of it?
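
To illustrate what filtering on data file ranges means for the repro in this thread, here is a minimal sketch in plain Python. It assumes each data file carries a lower and upper bound for the `data` column, as the min/max values in the partition listing suggest. `FileStats`, `might_contain_eq`, and `prune` are hypothetical names for illustration and are not pyiceberg APIs, and the file paths are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file bounds for one column; not a pyiceberg class."""
    path: str
    lower: str  # smallest value of the column in this file
    upper: str  # largest value of the column in this file


def might_contain_eq(stats: FileStats, value: str) -> bool:
    """Return False only when the bounds prove no row can equal `value`."""
    return stats.lower <= value <= stats.upper


def prune(files: List[FileStats], value: str) -> List[FileStats]:
    """Keep only the files whose bounds admit `value`."""
    return [f for f in files if might_contain_eq(f, value)]


# Illustrative files: one record per file, so lower == upper.
files = [
    FileStats("ts=2023-01-05/f1.parquet", "a", "a"),
    FileStats("ts=2023-01-05/f2.parquet", "aa", "aa"),
    FileStats("ts=2023-01-04/f3.parquet", "b", "b"),
]
kept = prune(files, "a")
print([f.path for f in kept])
```

With bounds like these, an equality predicate on `data` keeps a single file, which matches the count Spark reports above. Without this step only partition values are checked, so every file in a matching partition is returned.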

