Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/22 05:28:00 UTC

[GitHub] [iceberg] arunb2w opened a new issue, #6244: Iceberg metadata not stored properly

arunb2w opened a new issue, #6244:
URL: https://github.com/apache/iceberg/issues/6244

   I tried creating a sample Iceberg table with the schema below:
   
   ```
   CREATE TABLE glue_dev.db.datatype_test (
       id bigint,
       data string,
       category string
   )
   USING iceberg
   TBLPROPERTIES ('read.split.target-size'='134217728', 'write.metadata.metrics.default'='full')
   ```
   
   Then I inserted around 100 records and ran a rewrite of the data files so that all the inserted data would be rewritten into a single file.
   
   ```
   # Assumes an active SparkSession named `spark` (e.g. in a PySpark shell).
   for num in range(1, 100):
       spark.sql(f"INSERT INTO glue_dev.db.datatype_test VALUES ({num}, 'data{num}', 'catagory{num}')")
   ```
   
   After that I tried to query the data_files metadata.
   
   ```
    select * from glue_dev.db.datatype_test.data_files limit 10;
   content	file_path	file_format	spec_id	record_count	file_size_in_bytes	column_sizes	value_counts	null_value_counts	nan_value_counts	lower_bounds	upper_bounds	key_metadata	split_offsets	equality_ids	sort_order_id
   0	s3://bucket/folder/db.db/datatype_test/data/00000-0-4bb9ce80-c7a7-4192-98c6-ed6e7289a981-00001.parquet	PARQUET	0	559	4234	{1:991,2:1188,3:1258}	{1:559,2:559,3:559}	{1:0,2:0,3:0}	{}	{1:,2:data1,3:catagory1}	{1:/,2:data99,3:catagory99}	NULL	[4]	NULL	0
   Time taken: 15.614 seconds, Fetched 1 row(s)
   ```
   
   In this metadata, the lower_bounds and upper_bounds values for the id column, which is of type bigint, do not look like the correct values.
   
   Does that mean Iceberg is not storing the metadata correctly?
   In this case, if I join on the id column, how will Iceberg properly scan or skip files, given that the lower and upper bound metadata looks incorrect?
   
   This is test data, but I have actual data with around 10000 files, and they exhibit similar behaviour for integer columns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #6244: Iceberg metadata not stored properly

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1560279324

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


[GitHub] [iceberg] hililiwei commented on issue #6244: Iceberg metadata not stored properly

Posted by GitBox <gi...@apache.org>.
hililiwei commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1325832061

   Agree with @ajantha-bhat.
   
   > As a workaround, I believe https://github.com/hililiwei/iceberg-tools#manifest2json can convert them and show them.
   > 
   > cc: @hililiwei
   
   This little toolbox was an experiment from a long time ago. I have uploaded a snapshot version; you can try it.


[GitHub] [iceberg] ajantha-bhat commented on issue #6244: Iceberg metadata not stored properly

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1323230663

   > Does that mean iceberg is not storing the metadata correctly?
   
   Lower and upper bounds are stored as byte arrays in the manifest files. What you are seeing is the string representation of those byte arrays, which is not human-readable. During file pruning, however, these values are converted back to their proper data types and compared.
   
   We can enhance these metadata tables to show actual values. I believe @szehon-ho has an open PR for this.
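
To illustrate the point about byte arrays, here is a minimal Python sketch, not Iceberg's actual code. It relies on the Iceberg spec's single-value serialization, where a long (bigint) is stored as 8 little-endian bytes. The helper names (`decode_long_bound`, `encode_long_bound`, `may_contain`) are hypothetical, and the sample bounds 1 and 99 are assumed values chosen for illustration, not values read from the table above.

```python
import struct


def decode_long_bound(raw: bytes) -> int:
    """Decode a bigint bound: 8 bytes, little-endian, signed."""
    return struct.unpack("<q", raw)[0]


def encode_long_bound(value: int) -> bytes:
    """Inverse of decode_long_bound; used here only to build sample bytes."""
    return struct.pack("<q", value)


def may_contain(lower: bytes, upper: bytes, wanted: int) -> bool:
    """Hypothetical pruning check: a file can be skipped when this is False."""
    return decode_long_bound(lower) <= wanted <= decode_long_bound(upper)


# Assumed sample bounds for illustration only.
lower = encode_long_bound(1)
upper = encode_long_bound(99)

# The raw bytes are mostly non-printable, which is why a string rendering
# of them looks empty or like a stray character.
print(lower)

# Decoded back to the column type, the bounds are ordinary integers.
print(decode_long_bound(lower), decode_long_bound(upper))

# Comparison happens on the decoded values, not on the string rendering.
print(may_contain(lower, upper, 50), may_contain(lower, upper, 500))
```

A value whose low byte happens to be a printable ASCII code shows up as that character when the bytes are printed as text, while the trailing zero bytes stay invisible.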


[GitHub] [iceberg] szehon-ho closed issue #6244: Iceberg metadata not stored properly

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho closed issue #6244: Iceberg metadata not stored properly
URL: https://github.com/apache/iceberg/issues/6244


[GitHub] [iceberg] ajantha-bhat commented on issue #6244: Iceberg metadata not stored properly

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1323232753

   As a workaround, I believe https://github.com/hililiwei/iceberg-tools#manifest2json can convert them and show them. 
   
   cc: @hililiwei 


[GitHub] [iceberg] szehon-ho commented on issue #6244: Iceberg metadata not stored properly

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1561998472

   Yeah, this (at least the display) should be fixed by #5376.