Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/26 08:10:01 UTC

[GitHub] [iceberg] dotjdk opened a new issue, #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

dotjdk opened a new issue, #5641:
URL: https://github.com/apache/iceberg/issues/5641

   ### Apache Iceberg version
   
   0.14.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   If we create a table with `format-version = 2` and replace the partition spec after table creation, we are unable to read data from the table after executing `rewrite_manifests`.
   
   If we look at the metadata after the manifest rewrite, the min/max stats for the partition have been nulled in both the manifest and the manifest list. As a result, any query on the table that relies on metadata pruning returns no results.
   
   The issue does not appear if the table has the correct partition spec from creation, or if it uses `format-version = 1`.
   
   See the attached script for steps to reproduce, along with similar steps showing that it works correctly with format 1 tables, or when the partition spec is specified at table creation and not modified afterwards.
   
   We are using Spark 3.3.0 with Iceberg 0.14.0 and can easily reproduce the issue with a few steps in the Spark shell on a newly created table, as in the attached script.
   
   [script.scala.zip](https://github.com/apache/iceberg/files/9431000/script.scala.zip)
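
A minimal Python sketch of why nulled bounds make a pruned query come back empty. It is a simplified model, not Iceberg's actual evaluator: `FieldSummary` and `might_match_less_than` are hypothetical names, and bounds are plain integers (days since epoch) instead of serialized bytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSummary:
    """Simplified stand-in for one entry of a manifest's `partitions` array."""
    contains_null: bool
    lower_bound: Optional[int]  # assumed: days since epoch for a day partition
    upper_bound: Optional[int]


def might_match_less_than(summary: FieldSummary, literal: int) -> bool:
    """Return True if the manifest may hold partitions with a value < literal."""
    if summary.lower_bound is None:
        # No lower bound recorded: the summary claims there are no non-null
        # values, so a `<` predicate cannot match and the manifest is skipped.
        return False
    return summary.lower_bound < literal


healthy = FieldSummary(contains_null=False, lower_bound=18993, upper_bound=18993)
nulled = FieldSummary(contains_null=True, lower_bound=None, upper_bound=None)

today = 19230  # 2022-08-26 expressed as days since 1970-01-01
print(might_match_less_than(healthy, today))  # True
print(might_match_less_than(nulled, today))   # False
```

Under this model the data file is still present in the manifest; the query returns nothing only because the manifest is skipped before its entries are read.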


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1228213426

   ## Broken V2
   ```
   ➜  python git:(master) pyiceberg --uri thrift://localhost:9083 describe data.rewrite_test                                                                                                               
   Table format version  2                                                                                                                                                                                                         
   Metadata location     file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/00004-23f9a545-96f2-4c67-a6cd-e415676bfdf0.metadata.json                                                        
   Table UUID            325fd058-afd1-48fd-846a-26a491ec8d68                                                                                                                                                                      
   Last Updated          1661498735055                                                                                                                                                                                             
   Partition spec        [                                                                                                                                                                                                         
                           1001: ts_day: unknown(2)                                                                                                                                                                                
                         ]                                                                                                                                                                                                         
   Sort order            []                                                                                                                                                                                                        
   Current schema        Schema, id=0                                                                                                                                                                                              
                         ├── 1: id: optional int                                                                                                                                                                                   
                         ├── 2: ts: optional timestamptz                                                                                                                                                                           
                         └── 3: day_of_ts: optional date                                                                                                                                                                           
   Current snapshot      Operation.REPLACE: id=5015440624103305225, parent_id=4893783402811059498, schema_id=0                                                                                                                     
   Snapshots             Snapshots                                                                                                                                                                                                 
                         ├── Snapshot 4893783402811059498, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/snap-4893783402811059498-1-1bd5b04a-7a85-4014-a703-6a56e8a7e741.avro
                         └── Snapshot 5015440624103305225, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/snap-5015440624103305225-1-58d11599-f485-4b55-9e36-198e7ad1be7e.avro
   Properties            owner  root                                                                                                                                                                                               
   ```
   
   Manifest list entry for the broken V2 table (`lower_bound` and `upper_bound` are null):
   
   ```json
   ➜  python git:(master) avro-tools tojson /Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/snap-5015440624103305225-1-58d11599-f485-4b55-9e36-198e7ad1be7e.avro | jq
   22/08/26 10:26:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   {
     "manifest_path": "file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/58d11599-f485-4b55-9e36-198e7ad1be7e-m0.avro",
     "manifest_length": 6874,
     "partition_spec_id": 1,
     "content": 0,
     "sequence_number": 2,
     "min_sequence_number": 1,
     "added_snapshot_id": 5015440624103305000,
     "added_data_files_count": 0,
     "existing_data_files_count": 1,
     "deleted_data_files_count": 0,
     "added_rows_count": 0,
     "existing_rows_count": 1,
     "deleted_rows_count": 0,
     "partitions": {
       "array": [
         {
           "contains_null": true,
           "contains_nan": {
             "boolean": false
           },
           "lower_bound": null,
           "upper_bound": null
         }
       ]
     }
   }
   ```
   
   ## Working V1
   
   ```bash
   ➜  python git:(master) pyiceberg --uri thrift://localhost:9083 describe data.rewrite_test3                                                                                                             
   Table format version  1                                                                                                                                                                                                         
   Metadata location     file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/00003-d324b9a3-d18c-4e6d-92b3-a6a991fa21b4.metadata.json                                                        
   Table UUID            c099df65-a04a-449f-8ab0-f1e2c7028d1f                                                                                                                                                                       
   Last Updated          1661499222654                                                                                                                                                                                              
   Partition spec        [                                                                                                                                                                                                          
                           1000: ts_day: unknown(2)                                                                                                                                                                                 
                         ]                                                                                                                                                                                                          
   Sort order            []                                                                                                                                                                                                         
   Current schema        Schema, id=0                                                                                                                                                                                               
                         ├── 1: id: optional int                                                                                                                                                                                    
                         ├── 2: ts: optional timestamptz                                                                                                                                                                            
                         └── 3: day_of_ts: optional date                                                                                                                                                                            
   Current snapshot      Operation.REPLACE: id=7085493087562750676, parent_id=2500621515342040057, schema_id=0                                                                                                                      
   Snapshots             Snapshots                                                                                                                                                                                                  
                         ├── Snapshot 2500621515342040057, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/snap-2500621515342040057-1-ab084d85-0f7c-4c75-b647-3dcd1107b07a.avro
                         └── Snapshot 7085493087562750676, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/snap-7085493087562750676-1-2bbe10ad-b311-4397-a645-c64c0ae44b45.avro
   Properties            owner  root                                                                                                                                                                                                
   ```
   
   ```json
   ➜  python git:(master) avro-tools tojson file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/snap-7085493087562750676-1-2bbe10ad-b311-4397-a645-c64c0ae44b45.avro | jq
   22/08/26 10:26:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   {
     "manifest_path": "file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/2bbe10ad-b311-4397-a645-c64c0ae44b45-m0.avro",
     "manifest_length": 6878,
     "partition_spec_id": 0,
     "content": 0,
     "sequence_number": 2,
     "min_sequence_number": 1,
     "added_snapshot_id": 7085493087562751000,
     "added_data_files_count": 0,
     "existing_data_files_count": 1,
     "deleted_data_files_count": 0,
     "added_rows_count": 0,
     "existing_rows_count": 1,
     "deleted_rows_count": 0,
     "partitions": {
       "array": [
         {
           "contains_null": false,
           "contains_nan": {
             "boolean": false
           },
           "lower_bound": {
             "bytes": "1J\u0000\u0000"
           },
           "upper_bound": {
             "bytes": "1J\u0000\u0000"
           }
         }
       ]
     }
   }
   ```
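
The `bytes` value in the working output can be decoded to check that it matches the inserted row. The sketch below assumes the bound is a 4-byte little-endian day count since 1970-01-01 and that `avro-tools tojson` prints each byte as one character; `decode_day_bound` is a hypothetical helper name.

```python
import struct
from datetime import date, timedelta


def decode_day_bound(json_bytes: str) -> date:
    """Decode a 4-byte day bound as printed by `avro-tools tojson`."""
    raw = json_bytes.encode("latin-1")   # one character per original byte
    (days,) = struct.unpack("<i", raw)   # little-endian signed 32-bit integer
    return date(1970, 1, 1) + timedelta(days=days)


print(decode_day_bound("1J\u0000\u0000"))  # 2022-01-01
```

The decoded date is the day of the single row inserted into `data.rewrite_test3`, so the lower and upper bounds of the working table are consistent with its data.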




[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1230001969

   Found the root issue:
   ![image](https://user-images.githubusercontent.com/1134248/187167760-7342b309-5d79-4c4a-b3a1-73a895e4e902.png)
   
   It turns out that we select all the partition fields (both the old and the new ones), but we only update the statistics for the current partition keys, and this one is null. A PR will follow.
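
A toy Python illustration of how a mismatch between "all partition fields" and "current partition keys" could produce the summary seen in the broken manifest list. The positional lookup in `buggy_summary` is an assumption made for illustration only; it is not taken from the Iceberg code base or from the fix, and every name below is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Summary = Tuple[bool, Optional[int], Optional[int]]  # (contains_null, lower, upper)

# Hypothetical unified partition layout covering every spec the table has had.
ALL_FIELDS: List[str] = ["day_of_ts", "ts_day"]  # old field, then new field
CURRENT_SPEC: List[str] = ["ts_day"]             # spec after the field was replaced

# One data file written under the new spec: the old field has no value.
file_partition: Dict[str, Optional[int]] = {"day_of_ts": None, "ts_day": 18993}


def summarize(values: List[Optional[int]]) -> Summary:
    """Build a min/max summary for one partition field."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return (True, None, None)
    return (len(non_null) < len(values), min(non_null), max(non_null))


def buggy_summary(partition: Dict[str, Optional[int]]) -> List[Summary]:
    # Assumed failure mode: the i-th current field is read from position i
    # of the unified tuple, which here is the old, unset field.
    row = [partition[name] for name in ALL_FIELDS]
    return [summarize([row[i]]) for i in range(len(CURRENT_SPEC))]


def fixed_summary(partition: Dict[str, Optional[int]]) -> List[Summary]:
    # Each current field is looked up by name, so the right value is used.
    return [summarize([partition[name]]) for name in CURRENT_SPEC]


print(buggy_summary(file_partition))  # [(True, None, None)]
print(fixed_summary(file_partition))  # [(False, 18993, 18993)]
```

The first result has the same shape as the broken V2 entry (`contains_null` true, both bounds null), and the second matches the working one.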




[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1235113896

   This has been fixed in https://github.com/apache/iceberg/pull/5691, thanks @rdblue. And thanks @dotjdk for reporting, much appreciated. Otherwise we wouldn't have caught this 🐛 




[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1228197861

   I'm able to reproduce it on my side:
   
   ## Works with V1:
   
   ```sql
   %%sql
   
   DROP TABLE if EXISTS data.rewrite_test3
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   create table data.rewrite_test3 (id int, ts timestamp, day_of_ts date) using iceberg partitioned by (days(ts))
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   alter table data.rewrite_test3 SET TBLPROPERTIES ('format-version' = '2')
   ```
   
       22/08/26 07:33:13 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
   
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   insert into data.rewrite_test3 values (1, CAST('2022-01-01 10:00:00' AS TIMESTAMP), CAST('2022-01-01' AS DATE))
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   select * from data.rewrite_test3 where ts < current_timestamp()
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>id</th>
               <th>ts</th>
               <th>day_of_ts</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>1</td>
               <td>2022-01-01 10:00:00</td>
               <td>2022-01-01</td>
           </tr>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql 
   
   call system.rewrite_manifests(table => 'data.rewrite_test3')
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>rewritten_manifests_count</th>
               <th>added_manifests_count</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>1</td>
               <td>1</td>
           </tr>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql
   
   select * from data.rewrite_test3 where ts < current_timestamp()
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>id</th>
               <th>ts</th>
               <th>day_of_ts</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>1</td>
               <td>2022-01-01 10:00:00</td>
               <td>2022-01-01</td>
           </tr>
       </tbody>
   </table>
   
   ## Seems to be broken with V2
   
   
   ```sql
   %%sql
   
   drop table if exists data.rewrite_test
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   create table data.rewrite_test (id int, ts timestamp, day_of_ts date) using iceberg partitioned by (day_of_ts)
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   describe table extended data.rewrite_test
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>col_name</th>
               <th>data_type</th>
               <th>comment</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>id</td>
               <td>int</td>
               <td></td>
           </tr>
           <tr>
               <td>ts</td>
               <td>timestamp</td>
               <td></td>
           </tr>
           <tr>
               <td>day_of_ts</td>
               <td>date</td>
               <td></td>
           </tr>
           <tr>
               <td></td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td># Partitioning</td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td>Part 0</td>
               <td>day_of_ts</td>
               <td></td>
           </tr>
           <tr>
               <td></td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td># Metadata Columns</td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td>_spec_id</td>
               <td>int</td>
               <td></td>
           </tr>
           <tr>
               <td>_partition</td>
               <td>struct&lt;day_of_ts:date&gt;</td>
               <td></td>
           </tr>
           <tr>
               <td>_file</td>
               <td>string</td>
               <td></td>
           </tr>
           <tr>
               <td>_pos</td>
               <td>bigint</td>
               <td></td>
           </tr>
           <tr>
               <td>_deleted</td>
               <td>boolean</td>
               <td></td>
           </tr>
           <tr>
               <td></td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td># Detailed Table Information</td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td>Name</td>
               <td>demo.data.rewrite_test</td>
               <td></td>
           </tr>
           <tr>
               <td>Location</td>
               <td>file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test</td>
               <td></td>
           </tr>
           <tr>
               <td>Provider</td>
               <td>iceberg</td>
               <td></td>
           </tr>
           <tr>
               <td>Owner</td>
               <td>root</td>
               <td></td>
           </tr>
           <tr>
               <td>Table Properties</td>
               <td>[current-snapshot-id=none,format=iceberg/parquet,format-version=1]</td>
               <td></td>
           </tr>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql
   
   alter table data.rewrite_test SET TBLPROPERTIES ('format-version' = '2')
   ```
   
       22/08/26 07:24:13 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
   
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   describe table extended data.rewrite_test
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>col_name</th>
               <th>data_type</th>
               <th>comment</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>id</td>
               <td>int</td>
               <td></td>
           </tr>
           <tr>
               <td>ts</td>
               <td>timestamp</td>
               <td></td>
           </tr>
           <tr>
               <td>day_of_ts</td>
               <td>date</td>
               <td></td>
           </tr>
           <tr>
               <td></td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td># Partitioning</td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td>Part 0</td>
               <td>day_of_ts</td>
               <td></td>
           </tr>
           <tr>
               <td></td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td># Metadata Columns</td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td>_spec_id</td>
               <td>int</td>
               <td></td>
           </tr>
           <tr>
               <td>_partition</td>
               <td>struct&lt;day_of_ts:date&gt;</td>
               <td></td>
           </tr>
           <tr>
               <td>_file</td>
               <td>string</td>
               <td></td>
           </tr>
           <tr>
               <td>_pos</td>
               <td>bigint</td>
               <td></td>
           </tr>
           <tr>
               <td>_deleted</td>
               <td>boolean</td>
               <td></td>
           </tr>
           <tr>
               <td></td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td># Detailed Table Information</td>
               <td></td>
               <td></td>
           </tr>
           <tr>
               <td>Name</td>
               <td>demo.data.rewrite_test</td>
               <td></td>
           </tr>
           <tr>
               <td>Location</td>
               <td>file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test</td>
               <td></td>
           </tr>
           <tr>
               <td>Provider</td>
               <td>iceberg</td>
               <td></td>
           </tr>
           <tr>
               <td>Owner</td>
               <td>root</td>
               <td></td>
           </tr>
           <tr>
               <td>Table Properties</td>
               <td>[current-snapshot-id=none,format=iceberg/parquet,format-version=2]</td>
               <td></td>
           </tr>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql
   
   ALTER TABLE data.rewrite_test REPLACE PARTITION FIELD day_of_ts WITH days(ts)
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   insert into data.rewrite_test values (1, CAST('2022-01-01 10:00:00' AS TIMESTAMP), CAST('2022-01-01' AS DATE))
   ```
   
   
   
   
   
   
   
   
   ```sql
   %%sql
   
   select * from data.rewrite_test where ts < current_timestamp()
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>id</th>
               <th>ts</th>
               <th>day_of_ts</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>1</td>
               <td>2022-01-01 10:00:00</td>
               <td>2022-01-01</td>
           </tr>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql
   
   call system.rewrite_manifests(table => 'data.rewrite_test')
   ```
   
       22/08/26 07:25:34 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>rewritten_manifests_count</th>
               <th>added_manifests_count</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>1</td>
               <td>1</td>
           </tr>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql
   
   select * from data.rewrite_test where ts < current_timestamp()
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>id</th>
               <th>ts</th>
               <th>day_of_ts</th>
           </tr>
       </thead>
       <tbody>
       </tbody>
   </table>
   
   
   
   
   ```sql
   %%sql
   
   select * from data.rewrite_test where day_of_ts < current_timestamp()
   ```
   
   
   
   
   <table>
       <thead>
           <tr>
               <th>id</th>
               <th>ts</th>
               <th>day_of_ts</th>
           </tr>
       </thead>
       <tbody>
           <tr>
               <td>1</td>
               <td>2022-01-01 10:00:00</td>
               <td>2022-01-01</td>
           </tr>
       </tbody>
   </table>
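
The V2 walkthrough above can be condensed into a small Python helper for re-running it, for example from a PySpark session. This is a sketch, not the attached script: `repro_statements` and `run_repro` are hypothetical names, the `describe` steps are left out, and `spark` is assumed to be an existing `SparkSession` with the Iceberg catalog already configured.

```python
from typing import List


def repro_statements(table: str) -> List[str]:
    """Return the SQL statements of the V2 walkthrough, in order."""
    return [
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} (id int, ts timestamp, day_of_ts date) "
        f"USING iceberg PARTITIONED BY (day_of_ts)",
        f"ALTER TABLE {table} SET TBLPROPERTIES ('format-version' = '2')",
        f"ALTER TABLE {table} REPLACE PARTITION FIELD day_of_ts WITH days(ts)",
        f"INSERT INTO {table} VALUES (1, CAST('2022-01-01 10:00:00' AS TIMESTAMP), "
        f"CAST('2022-01-01' AS DATE))",
        f"CALL system.rewrite_manifests(table => '{table}')",
    ]


def run_repro(spark, table: str = "data.rewrite_test") -> int:
    """Run the steps and return the row count seen by the pruned query."""
    for statement in repro_statements(table):
        spark.sql(statement)
    query = f"SELECT * FROM {table} WHERE ts < current_timestamp()"
    return spark.sql(query).count()
```

Going by the outputs above, an affected build would be expected to return 0 here, while a build containing the fix should return 1.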
   
   
   




[GitHub] [iceberg] Fokko closed issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

Posted by GitBox <gi...@apache.org>.
Fokko closed issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
URL: https://github.com/apache/iceberg/issues/5641




[GitHub] [iceberg] dotjdk commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests

Posted by GitBox <gi...@apache.org>.
dotjdk commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1228200072

   Yes, v1 doesn't have the issue. And v2 doesn't have the issue if the partition spec was specified on table creation and not modified after.

