Posted to issues@iceberg.apache.org by "zombee0 (via GitHub)" <gi...@apache.org> on 2023/02/17 09:11:00 UTC

[GitHub] [iceberg] zombee0 opened a new issue, #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

zombee0 opened a new issue, #6868:
URL: https://github.com/apache/iceberg/issues/6868

   ### Feature Request / Improvement
   
   I think Iceberg decodes each manifest Avro file with a single thread. Is there a way to split a manifest file and decode it in parallel?
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zombee0 commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "zombee0 (via GitHub)" <gi...@apache.org>.
zombee0 commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1434387258

   > @zombee0 thanks for raising this question.
   > 
   > Avro supports parallelization by decoding the blocks in parallel. For example, we could reduce the block size to increase parallelism. But before doing so, I think it is important to understand why the Manifest file is so big. It can be that there are many columns, or there are many small files that you can compact.
   
   Yes, we have an Iceberg table with almost 1000 columns. I don't know why the manifest files were compacted into several files, and they are very big. When I call Iceberg's TableScan.planTasks, it takes a long time.


[GitHub] [iceberg] Fokko commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1437249484

   @zombee0 Currently manifests are decoded in parallel, but a single manifest isn't decoded by multiple threads. I'm not aware of any plans to make this possible. If you're interested in working on this, you're more than welcome.


[GitHub] [iceberg] zombee0 commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "zombee0 (via GitHub)" <gi...@apache.org>.
zombee0 commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1435481371

   @Fokko Does the Iceberg team have any plans to support decoding a big manifest file with several threads in parallel?


[GitHub] [iceberg] Fokko commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1434374131

   @zombee0 thanks for raising this question.
   
   Avro supports parallelization by decoding the blocks in parallel. For example, we could reduce the block size to increase parallelism. But before doing so, I think it is important to understand why the Manifest file is so big. It can be that there are many columns, or there are many small files that you can compact.
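To make the block idea concrete, here is a minimal sketch of an Avro-style object container file: a header (magic bytes, metadata map, 16-byte sync marker) followed by blocks that each carry a record count, a byte size, and a payload. Because every block states its own size, a reader can walk the headers cheaply and hand each payload to a separate worker. This is a toy under stated assumptions: it only handles the `"long"` schema with the null codec, the helper names (`write_container`, `split_blocks`, `decode_parallel`) are hypothetical, and it is not how Iceberg reads manifests today. The thread pool is only there to show that blocks are independent units of work.

```python
import io
import os
from concurrent.futures import ThreadPoolExecutor

MAGIC = b"Obj\x01"
SYNC_SIZE = 16


def write_long(out, n):
    """Append n as an Avro long (zigzag plus variable-length encoding)."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))


def read_long(buf):
    """Read one Avro long from a binary stream."""
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_bytes(out, data):
    """Append a length-prefixed byte string."""
    write_long(out, len(data))
    out.write(data)


def write_container(values, block_len):
    """Write a toy container file of longs, block_len records per block."""
    out = io.BytesIO()
    sync = os.urandom(SYNC_SIZE)
    out.write(MAGIC)
    meta = {"avro.schema": b'"long"', "avro.codec": b"null"}
    write_long(out, len(meta))  # one map block holding all entries
    for key, value in meta.items():
        write_bytes(out, key.encode())
        write_bytes(out, value)
    write_long(out, 0)  # end of the metadata map
    out.write(sync)
    for i in range(0, len(values), block_len):
        chunk = values[i:i + block_len]
        payload = io.BytesIO()
        for v in chunk:
            write_long(payload, v)
        write_long(out, len(chunk))  # records in this block
        write_bytes(out, payload.getvalue())  # byte size, then payload
        out.write(sync)
    return out.getvalue()


def split_blocks(raw):
    """Walk the block headers and return (count, payload) pairs, undecoded."""
    buf = io.BytesIO(raw)
    assert buf.read(4) == MAGIC
    while (count := read_long(buf)) != 0:  # skip the metadata map
        if count < 0:
            read_long(buf)  # a negative count is followed by a byte size
            count = -count
        for _ in range(count):
            buf.seek(read_long(buf), io.SEEK_CUR)  # key
            buf.seek(read_long(buf), io.SEEK_CUR)  # value
    sync = buf.read(SYNC_SIZE)
    blocks = []
    while buf.tell() < len(raw):
        count = read_long(buf)
        size = read_long(buf)
        blocks.append((count, buf.read(size)))
        assert buf.read(SYNC_SIZE) == sync
    return blocks


def decode_block(block):
    """Decode one block independently of all the others."""
    count, payload = block
    buf = io.BytesIO(payload)
    return [read_long(buf) for _ in range(count)]


def decode_parallel(raw, workers=4):
    """Decode every block on a pool and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(decode_block, split_blocks(raw))
        return [v for part in parts for v in part]
```

Writing the same values with a smaller `block_len` yields more blocks, and therefore more units that can be decoded at the same time, which is the trade-off behind "reduce the block size to increase parallelism".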


Re: [I] [Big manifest] If the manifest file is very big, the decode cost time is very long [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1877934801

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


[GitHub] [iceberg] Fokko commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1434626492

   @zombee0 Got it, and do you use the statistics over all the columns? You can also configure how to collect all the statistics: https://iceberg.apache.org/docs/latest/configuration/#write-properties
   
   ![image](https://user-images.githubusercontent.com/1134248/219660666-67572dc0-ed12-48eb-8399-b8c72d36a0c0.png)
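As a rough illustration of the write properties mentioned above: each manifest entry stores per-column statistics for a data file, so on a very wide table it can help to turn metrics off by default and keep them only for the columns used in filters. The sketch below just builds the property map. `write.metadata.metrics.default` and `write.metadata.metrics.column.<name>` are the documented property names, and `none`, `counts`, `truncate(length)` and `full` are the documented modes. The helper `metrics_properties` and the column names are hypothetical, and how the map is applied depends on the engine.

```python
def metrics_properties(default_mode, per_column=None):
    """Build Iceberg table properties that control per-column metrics."""
    props = {"write.metadata.metrics.default": default_mode}
    for column, mode in (per_column or {}).items():
        props[f"write.metadata.metrics.column.{column}"] = mode
    return props


# Hypothetical columns: keep bounds only where queries actually filter.
props = metrics_properties(
    "none",
    {"event_time": "full", "user_id": "truncate(16)"},
)
for key, value in props.items():
    print(f"{key} = {value}")
```

Changed properties only affect files written afterwards, so manifests that already exist keep the statistics they were written with until they are rewritten.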
   


Re: [I] [Big manifest] If the manifest file is very big, the decode cost time is very long [iceberg]

Posted by "zombee0 (via GitHub)" <gi...@apache.org>.
zombee0 closed issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long
URL: https://github.com/apache/iceberg/issues/6868


[GitHub] [iceberg] rustyconover commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "rustyconover (via GitHub)" <gi...@apache.org>.
rustyconover commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1550601530

   I'm also seeing this behavior. I have 10 manifest files, each 8 MB in size.
   
   It seems that there is a lot of contention for Python's GIL across all of the threads. It may be better to use a ProcessPool rather than a thread pool to decode the Avro files. That way there would be no contention around the GIL, and the result can easily be serialized back to the calling function. If I have time I will build a comparison between ThreadPool-based and ProcessPool-based loading.
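A minimal sketch of such a comparison, using only `concurrent.futures` from the standard library. `decode_manifest` is a hypothetical stand-in (a pure-Python, CPU-bound loop), not the real Avro decoder, and `decode_all` is an illustrative helper that takes the executor class as a parameter so both pools run the same workload.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def decode_manifest(payload: bytes) -> int:
    """Hypothetical stand-in for Avro decoding: a CPU-bound pure-Python loop."""
    total = 0
    for byte in payload:
        total = (total * 31 + byte) % 1_000_003
    return total


def decode_all(payloads, executor_cls, workers=4):
    """Decode every payload on the given pool; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so results line up with payloads.
        results = list(pool.map(decode_manifest, payloads))
    return results, time.perf_counter() - start
```

To compare, call `decode_all(payloads, ThreadPoolExecutor)` and `decode_all(payloads, ProcessPoolExecutor)` on the same payloads, with the process-pool call placed under an `if __name__ == "__main__":` guard so worker processes can import the module safely. With a process pool the payloads and results are pickled between processes, so the gain depends on how large the decoded output is relative to the decoding work.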


[GitHub] [iceberg] zombee0 commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "zombee0 (via GitHub)" <gi...@apache.org>.
zombee0 commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1439911944

   @Fokko Thank you very much, I might do this work when I have some time. Another question: is any team working on caching the decoded manifest information? Decoding Avro files is really CPU-intensive.


[GitHub] [iceberg] hongli-my commented on issue #6868: [Big manifest] If the manifest file is very big, the decode cost time is very long

Posted by "hongli-my (via GitHub)" <gi...@apache.org>.
hongli-my commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1480530826

   
   > there are many small files that you can compact.
   
   We have over two years of data in Iceberg. If we compact the data into big files, for example 2 GB, I think filtering will not work well.

