Posted to issues@iceberg.apache.org by "fengguangyuan (via GitHub)" <gi...@apache.org> on 2023/02/17 13:02:39 UTC

[GitHub] [iceberg] fengguangyuan commented on issue #3340: Should be synchronized on current()/refresh() overrode in HiveTableOperations?

fengguangyuan commented on issue #3340:
URL: https://github.com/apache/iceberg/issues/3340#issuecomment-1434622518

   > > # Why is this a problem?
   > > Consider the following case: each `AppendFiles` instance may hold a stale table metadata instance (referenced by `base`, a member variable of `SnapshotProducer`) once other threads or tasks commit new snapshots:
   > > ```java
   > > AppendFiles af1 = table.newAppend().appendFile(thread1File);
   > > AppendFiles af2 = table.newAppend().appendFile(thread2File);
   > > AppendFiles af3 = table.newAppend().appendFile(thread3File);
   > > ...
   > > ```
   > > 
   > > With so many `AppendFiles` instances alive, the stale `TableMetadata` instances they reference cannot be reclaimed by GC in time. Because the size of a `TableMetadata` instance grows with the number of snapshots, GC problems follow, commonly surfacing as a `GC overhead limit exceeded` error.
   > 
   > Hi @fengguangyuan! Firstly, thank you for your contributions.
   > 
   > I wanted to ask you about your comment here about the GC overhead limit exceeded error and your concern with having too many snapshots (which is definitely valid in general, though there are configurations to rewrite snapshots after a certain number and table maintenance operations to keep snapshots at a healthy amount for your needs). I'm hoping you can help me better understand your practices: how are you using the library to achieve this additional parallelism (is this via `.par` to make a Scala parallel collection with Spark, do you have custom code, is it just a configuration property you've raised)? Also, what catalog are you using and what filestore are you writing to? And also (important, but I understand if it's not easy to answer right away), at what rate are you accumulating additional snapshots based on your incoming data (maybe how many files do you have per snapshot in general)? And also, how often are you calling `AppendFiles` per commit (you mentioned `AppendFiles`, so I'm wondering if you're using the library a bit more directly)?
   > 
   > I know this isn't necessarily a small ask, but you've given a very thorough description here, and I'd really like to better understand the problems that you're seeing arise as well as your usage of the library. I think it would be really valuable.
   > 
   > Anything you can provide, starting with the basics of:
   > 
   > * system you're using for writing (eg Flink, Spark, Trino, etc),
   > * catalog you're using
   > * iceberg version you're using
   > * any non-default configuration values you're setting, particularly that would affect commit rate and snapshot production
   > 
   > Also, ideally:
   > 
   > * How are you achieving this added parallelism (a config value, high level code in your job, code that you've written using the Iceberg library, etc)
   > * If you've written your own code using the Java API, the relationship between commits and `AppendFiles`
   > 
   > Anything you can share would greatly help in understanding your use case and the problems you're facing more. And there might be learnings to be had from your usage. 🙂
   > 
   > Thanks, Kyle!
   Best regards, Kyle :)
   
   Sorry for taking so long to reply. Thanks for your recommendations; it's a lesson I have taken to heart!
   
   After learning more about the code, I agree that it is indeed a bad practice to share a table across different threads.
   It is the caller's responsibility NOT to do that, to avoid generating a large number of stale `metadata` objects through Hive operations in a short time, which puts pressure on the JVM.
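
To make the retention effect concrete, here is a minimal, self-contained sketch. It deliberately does not use the Iceberg API: `ToyTable`, `Metadata`, and `PendingAppend` are hypothetical stand-ins, and the only assumption carried over from the discussion is that a pending append captures the table metadata that is current when it is created. Under that assumption, holding many pending appends while other commits land pins one metadata version per append, whereas committing each append promptly leaves nothing pending.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

final class StaleMetadataSketch {
    // Hypothetical stand-in for TableMetadata: every version carries the full snapshot list.
    static final class Metadata {
        final List<String> snapshots;
        Metadata(List<String> snapshots) { this.snapshots = snapshots; }
    }

    // Hypothetical stand-in for a table; this is NOT Iceberg's Table or TableOperations.
    static final class ToyTable {
        final AtomicReference<Metadata> current =
            new AtomicReference<>(new Metadata(Collections.emptyList()));
        PendingAppend newAppend(String file) { return new PendingAppend(this, file); }
    }

    // Hypothetical pending operation that captures its base metadata at creation time.
    static final class PendingAppend {
        final ToyTable table;
        final String file;
        final Metadata base; // stays reachable for as long as this object is referenced

        PendingAppend(ToyTable table, String file) {
            this.table = table;
            this.file = file;
            this.base = table.current.get();
        }

        void commit() {
            // Each commit publishes a new, larger metadata version.
            table.current.updateAndGet(m -> {
                List<String> next = new ArrayList<>(m.snapshots);
                next.add(file);
                return new Metadata(next);
            });
        }
    }

    // Anti-pattern: keep n uncommitted appends alive while other writers keep committing.
    // Returns how many distinct metadata versions the pending appends keep reachable.
    static int retainedWhenHoarding(int n) {
        ToyTable table = new ToyTable();
        List<PendingAppend> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pending.add(table.newAppend("file-" + i));
            table.newAppend("other-" + i).commit(); // simulates a commit from another writer
        }
        Set<Metadata> distinct = Collections.newSetFromMap(new IdentityHashMap<>());
        for (PendingAppend a : pending) {
            distinct.add(a.base);
        }
        return distinct.size();
    }

    // Preferred: create, commit, and drop each append right away, so no stale base is held.
    // Returns the number of snapshots in the final metadata.
    static int snapshotsAfterPromptCommits(int n) {
        ToyTable table = new ToyTable();
        for (int i = 0; i < n; i++) {
            table.newAppend("file-" + i).commit();
        }
        return table.current.get().snapshots.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct stale bases retained: " + retainedWhenHoarding(5));
        System.out.println("snapshots after prompt commits: " + snapshotsAfterPromptCommits(5));
    }
}
```

In this toy model the hoarding loop keeps one distinct metadata version alive per pending append, which mirrors the memory pressure described above when each version grows with the snapshot count.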


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org