Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/12 06:44:53 UTC

[GitHub] [iceberg] arunb2w opened a new issue, #6406: Overlapping data in data files even after sorting

arunb2w opened a new issue, #6406:
URL: https://github.com/apache/iceberg/issues/6406

   ### Apache Iceberg version
   
   0.14.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I performed the steps below to analyze the table metadata after a rewrite using the sort strategy.
   
   1) Run rewrite_data_files with sort by _CONTEXT_ID_
   spark.sql(f"CALL {catalog}.system.rewrite_data_files(table => '{db}.{table_name}', strategy => 'sort', sort_order => '_CONTEXT_ID_', options => map('rewrite-all','true','max-concurrent-file-group-rewrites', '3'))")
   2) Download the metadata JSON and manifest Avro files from the metadata folder after the rewrite completes.
   3) Run the [manifest2json](https://github.com/hililiwei/iceberg-tools#manifest2json) tool with the metadata JSON and manifest Avro files as input.
   4) Parse the JSON and get the lower and upper bound values for the columns that are part of the metadata metrics. Below is my table config:
   .tableProperty("format-version", "2") \
   .tableProperty("read.parquet.vectorization.enabled", "true") \
   .tableProperty("write.metadata.metrics.default", "none") \
   .tableProperty("write.metadata.metrics.column._CONTEXT_ID_", "full") \
   .tableProperty("write.metadata.metrics.column.ID", "full") \
   .tableProperty("write.target-file-size-bytes", "134217728") \
   5) Write the result to a CSV file with human-readable metadata for each Parquet file, for further analysis.
   
   I have attached the resulting CSV file here. If we order it by context, we can see that the files do not respect the sort order and that the value ranges overlap across multiple files. Overlaps can legitimately occur when the data for a particular context is large, but the files should still respect the sort order, and that is not happening here.
   
   Note that this does not happen for all tables. So far I have noticed it only in large tables with many columns, and I am not sure whether the number of columns is the cause. The table behind the attached manifest metadata has 115 columns.
   [manifest_metadata.csv](https://github.com/apache/iceberg/files/10205213/manifest_metadata.csv)
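
The overlap check described above can be sketched in a few lines of Python. This is only an illustration, not the reporter's actual analysis: the tuple layout `(file_path, lower_bound, upper_bound)`, the function name `find_overlaps`, and the `allow_shared_boundary` flag are all hypothetical, and the bounds are assumed to have already been decoded into comparable values for `_CONTEXT_ID_`.

```python
from typing import List, Tuple

# Hypothetical row layout: (file_path, lower_bound, upper_bound) for one column.
FileBounds = Tuple[str, int, int]


def find_overlaps(
    files: List[FileBounds], allow_shared_boundary: bool = True
) -> List[Tuple[str, str]]:
    """Return pairs of files whose [lower, upper] ranges intersect.

    With allow_shared_boundary=True, two files that only touch at a single
    value (one context spilling over into the next file) are not reported.
    """
    ordered = sorted(files, key=lambda f: (f[1], f[2]))
    overlaps = []
    for i, (path_a, _, upper_a) in enumerate(ordered):
        for path_b, lower_b, _ in ordered[i + 1:]:
            if allow_shared_boundary:
                disjoint = lower_b >= upper_a
            else:
                disjoint = lower_b > upper_a
            if disjoint:
                # Sorted by lower bound, so no later file can overlap path_a.
                break
            overlaps.append((path_a, path_b))
    return overlaps


sample = [("a", 1, 10), ("b", 10, 20), ("c", 15, 30), ("d", 40, 50)]
relaxed = find_overlaps(sample)
strict = find_overlaps(sample, allow_shared_boundary=False)
```

In a fully sorted partition the relaxed check would be expected to return an empty list; any pair it reports points at files whose ranges genuinely interleave.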
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #6406: Overlapping data in data files even after sorting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6406:
URL: https://github.com/apache/iceberg/issues/6406#issuecomment-1346431638

   Are you sure you are only checking the live manifest files for the table? How do the metrics compare with those in the metadata view of the table?
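
One way to make the comparison suggested above is to read the per-file bounds from the table's `files` metadata table, which lists only the live data files of the current snapshot. The sketch below just builds the query text; the helper name `build_files_metrics_query` is hypothetical, and it assumes the catalog, database, and table names are trusted identifiers.

```python
def build_files_metrics_query(catalog: str, db: str, table_name: str) -> str:
    """Build a query against the Iceberg `files` metadata table.

    lower_bounds and upper_bounds come back as maps keyed by field id with
    binary values, so they still need decoding per column type.
    """
    return (
        "SELECT file_path, record_count, lower_bounds, upper_bounds "
        f"FROM {catalog}.{db}.{table_name}.files"
    )


query = build_files_metrics_query("my_catalog", "my_db", "my_table")
```

The resulting string could then be passed to `spark.sql(query)` in a session where the catalog is configured.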




[GitHub] [iceberg] arunb2w closed issue #6406: Overlapping data in data files even after sorting

Posted by GitBox <gi...@apache.org>.
arunb2w closed issue #6406: Overlapping data in data files even after sorting
URL: https://github.com/apache/iceberg/issues/6406




[GitHub] [iceberg] arunb2w commented on issue #6406: Overlapping data in data files even after sorting

Posted by GitBox <gi...@apache.org>.
arunb2w commented on issue #6406:
URL: https://github.com/apache/iceberg/issues/6406#issuecomment-1358863100

   The issue was resolved based on input from @RussellSpitzer and @ajantha-bhat - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1670827665482209. Closing this ticket.




[GitHub] [iceberg] RussellSpitzer commented on issue #6406: Overlapping data in data files even after sorting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6406:
URL: https://github.com/apache/iceberg/issues/6406#issuecomment-1359000800

   Just so no one has to click the link to Slack: there is a property called "max file group size", which defaults to 100 GB. Any partition larger than this threshold is split into independent compaction jobs. For sorts this results in overlapping files, because each job sorts only its own group of files. We keep the default low because larger shuffles than this tend to be problematic without powerful clusters. Increasing this property to the size of the data being compacted ensures a single full shuffle of all the data in the partition.
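
As a rough sketch of the fix described above, the original procedure call can be extended with a larger file group size. The option key is assumed here to be `max-file-group-size-bytes` in the `rewrite_data_files` options map; the helper name `build_rewrite_call` and the 500 GB figure are purely illustrative, and the right value depends on the size of the largest partition and on what the cluster can shuffle.

```python
def build_rewrite_call(
    catalog: str,
    db: str,
    table_name: str,
    sort_order: str,
    max_file_group_size_bytes: int,
) -> str:
    """Build the CALL statement from the issue, plus a larger file group size."""
    options = {
        "rewrite-all": "true",
        "max-concurrent-file-group-rewrites": "3",
        # Assumed option key; should be at least the size of the largest partition.
        "max-file-group-size-bytes": str(max_file_group_size_bytes),
    }
    options_sql = ", ".join(f"'{key}','{value}'" for key, value in options.items())
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{db}.{table_name}', "
        "strategy => 'sort', "
        f"sort_order => '{sort_order}', "
        f"options => map({options_sql}))"
    )


FIVE_HUNDRED_GB = 500 * 1024 ** 3  # illustrative value only
sql = build_rewrite_call("my_catalog", "my_db", "my_table", "_CONTEXT_ID_", FIVE_HUNDRED_GB)
```

The statement would be run with `spark.sql(sql)`. A larger file group means a larger shuffle, so this trades cluster load for a globally sorted partition.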

