
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/06 05:27:36 UTC

[GitHub] [iceberg] RussellSpitzer commented on issue #6364: Optimise POS reads

RussellSpitzer commented on issue #6364:
URL: https://github.com/apache/iceberg/issues/6364#issuecomment-1338783816

What is a POS file? I’m not familiar with the acronym.

On Dec 5, 2022, at 9:00 PM, rbalamohan ***@***.***> wrote:
Apache Iceberg version
0.14.1
Query engine
Spark
Please describe the bug 🐞
Currently a combinedFileTask can have more than 1 file. Depending on the nature of the workload, it can even have 30-50+ files in a single split. When there are 4+ POS files, "select" queries take much longer to process. This is because every data file needs to process the POS files, which leads to read amplification.
The request is to optimise the way POS file reading is done.

Optimise parquet reader with cached filestatus and footer

Optimise within a combinedFileTask, in a single task on a single executor. The task can have more than 1 file in a single split; typically there can be 10-50+ files, depending on the size of the files.
For simplicity, let us start with 1 POS file. This POS file can have delete information about all of the 50+ files in the combined task.

Currently, every data file that is opened needs its "delete row positions", so it invokes "DeleteFilter::deletedRowPositions". This opens the POS file, reads the footer, and reads the snippet for the specific file path.
The above step happens for all of the 50+ files, in sequential order.
Internally, it opens the POS file and reads the footer information 50+ times, which is not needed.
What is needed is a lightweight parquet reader that can accept readerConfs etc. and take the footer information as an argument. Basically, cache footer details and file status details to reduce round trips to object stores.

Otherwise, pass the POS reader along during data reading, so that it doesn't need to reopen the file and read the footer again.
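
To make the idea concrete, here is a minimal sketch of reading a POS file once per combined task and serving every data file from an in-memory index. It is plain Java and does not use Iceberg's or Parquet's actual APIs: `PosDelete`, `PosDeleteIndex` and the `Supplier` standing in for "open the file, read the footer, scan the rows" are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for one POS file row: (data file path, deleted row position). */
final class PosDelete {
    final String dataFilePath;
    final long position;

    PosDelete(String dataFilePath, long position) {
        this.dataFilePath = dataFilePath;
        this.position = position;
    }
}

/** Hypothetical per-task index: scans the POS file once, then answers lookups from memory. */
final class PosDeleteIndex {
    // Stands in for the expensive part: open the file, read the footer, scan the rows.
    private final Supplier<List<PosDelete>> posFileReader;
    private Map<String, List<Long>> positionsByPath; // built lazily on the first lookup
    private int reads = 0;

    PosDeleteIndex(Supplier<List<PosDelete>> posFileReader) {
        this.posFileReader = posFileReader;
    }

    /** Returns the deleted row positions for one data file, reading the POS file at most once. */
    List<Long> deletedRowPositions(String dataFilePath) {
        if (positionsByPath == null) {
            reads++;
            positionsByPath = new HashMap<>();
            for (PosDelete d : posFileReader.get()) {
                positionsByPath
                    .computeIfAbsent(d.dataFilePath, k -> new ArrayList<>())
                    .add(d.position);
            }
        }
        return positionsByPath.getOrDefault(dataFilePath, Collections.emptyList());
    }

    /** Number of times the underlying POS file was actually read. */
    int reads() {
        return reads;
    }
}
```

With 50 data files in the task, the supplier in this sketch runs once instead of 50 times. The trade-off is that all positions from the POS file stay in memory for the lifetime of the task, which may not be acceptable for very large delete files.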

Optimise on reading POS

Though the path is dictionary encoded, the reader ends up materializing the path again and again. This needs to be optimised to reduce CPU burn when reading POS files.
Covered in #5863
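
As a rough illustration of why dictionary encoding can help here, the sketch below resolves the target path to its dictionary id once and then filters rows by integer comparison, so no path string is decoded per row. It models a dictionary-encoded column with plain arrays and is not how Parquet or Iceberg expose dictionaries; `DictionaryPathFilter` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of filtering a dictionary-encoded path column by id. */
final class DictionaryPathFilter {
    /**
     * dictionary: the distinct paths; pathIds[row]: index into dictionary for each row;
     * positions[row]: deleted row position stored in the same row.
     */
    static List<Long> positionsFor(
            String[] dictionary, int[] pathIds, long[] positions, String target) {
        // Resolve the target path to a dictionary id once: string compares happen
        // per dictionary entry, not per row.
        int targetId = -1;
        for (int i = 0; i < dictionary.length; i++) {
            if (dictionary[i].equals(target)) {
                targetId = i;
                break;
            }
        }

        List<Long> result = new ArrayList<>();
        if (targetId < 0) {
            return result; // the path does not occur in this column chunk
        }

        // Per-row work is an int comparison; no path string is materialized.
        for (int row = 0; row < pathIds.length; row++) {
            if (pathIds[row] == targetId) {
                result.add(positions[row]);
            }
        }
        return result;
    }
}
```

For example, with `dictionary = {"a.parquet", "b.parquet"}`, `pathIds = {0, 0, 1, 0}` and `positions = {1, 5, 2, 8}`, a lookup for `"a.parquet"` yields positions 1, 5 and 8 after at most two string comparisons.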


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org