Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/21 17:09:10 UTC

[GitHub] [iceberg] davidwilcox opened a new issue #2130: Ways To Alleviate Load For Tables With Many Snapshots

davidwilcox opened a new issue #2130:
URL: https://github.com/apache/iceberg/issues/2130


   I have a process that reads tables stored in Iceberg and processes them, many at a time. Lately, the process has hit scalability problems because of the number of Hadoop FileSystem objects that Iceberg creates for tables with many snapshots. These tables can have tens of thousands of snapshots, but I only want to read the latest one. The Hadoop FileSystem creation code that is called for every snapshot takes process-level locks, and these end up locking up my whole process.
   
   TableMetadataParser appears to read in every snapshot, even though the reader likely wants only one. This loop is what locks up my process:
   https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
   
   My process does not need the whole snapshot list; it is interested in just one particular snapshot. I would like to contribute a change so that the snapshot list is computed lazily inside TableMetadata, where it is actually used. TableMetadataParser would no longer create each Snapshot itself; instead, it would pass in something like a SnapshotCreator that knows how to create a snapshot. All of the SnapshotCreators would be passed into TableMetadata, which would create snapshots only when needed.
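   
   To make the proposal concrete, here is a minimal sketch of the lazy approach in plain Java. It is not Iceberg code: `LazySnapshots`, `Lazy`, `register`, and the stand-in `Snapshot` class are hypothetical names used only for illustration, and the real TableMetadata and Snapshot APIs differ. The lambda passed to `register` plays the role of the SnapshotCreator described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch, not Iceberg code: snapshots are registered as creators
// and only materialized the first time a caller asks for them.
final class LazySnapshots {

  // Stand-in for an Iceberg Snapshot; it carries only what the sketch needs.
  static final class Snapshot {
    final long id;
    final String manifestList;

    Snapshot(long id, String manifestList) {
      this.id = id;
      this.manifestList = manifestList;
    }
  }

  // Memoizing wrapper: runs the creator at most once and caches the result.
  static final class Lazy<T> implements Supplier<T> {
    private Supplier<T> creator;
    private T value;

    Lazy(Supplier<T> creator) {
      this.creator = creator;
    }

    @Override
    public synchronized T get() {
      if (creator != null) {
        value = creator.get();
        creator = null; // drop the creator once the value is cached
      }
      return value;
    }
  }

  private final Map<Long, Lazy<Snapshot>> creators = new LinkedHashMap<>();

  // Called by the (hypothetical) parser: cheap, no snapshot is built here.
  void register(long id, Supplier<Snapshot> creator) {
    creators.put(id, new Lazy<>(creator));
  }

  // Materializes only the requested snapshot; returns null for unknown ids.
  Snapshot snapshot(long id) {
    Lazy<Snapshot> lazy = creators.get(id);
    return lazy == null ? null : lazy.get();
  }

  public static void main(String[] args) {
    AtomicInteger built = new AtomicInteger();
    LazySnapshots snapshots = new LazySnapshots();
    for (long id = 1; id <= 10_000; id++) {
      long snapshotId = id; // effectively final copy for the lambda
      snapshots.register(snapshotId, () -> {
        built.incrementAndGet(); // the expensive work would happen here
        return new Snapshot(snapshotId, "snap-" + snapshotId + ".avro");
      });
    }
    Snapshot latest = snapshots.snapshot(10_000L);
    // prints "snap-10000.avro, snapshots built: 1"
    System.out.println(latest.manifestList + ", snapshots built: " + built.get());
  }
}
```

   In this sketch, registering ten thousand snapshots does no expensive work, and reading the latest one builds exactly one Snapshot. The cost is the trade-off of any lazy scheme: a failure in a creator surfaces at first access rather than at parse time.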
   
   Would you be amenable to such a change? I want to make sure this sounds like something you would accept before I spend time coding it up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] davidwilcox closed issue #2130: Ways To Alleviate Load For Tables With Many Snapshots

Posted by GitBox <gi...@apache.org>.
davidwilcox closed issue #2130:
URL: https://github.com/apache/iceberg/issues/2130


   



[GitHub] [iceberg] davidwilcox commented on issue #2130: Ways To Alleviate Load For Tables With Many Snapshots

Posted by GitBox <gi...@apache.org>.
davidwilcox commented on issue #2130:
URL: https://github.com/apache/iceberg/issues/2130#issuecomment-765111545


   Thanks @HeartSaVioR! I already sent an email to the mailing list.


[GitHub] [iceberg] HeartSaVioR commented on issue #2130: Ways To Alleviate Load For Tables With Many Snapshots

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #2130:
URL: https://github.com/apache/iceberg/issues/2130#issuecomment-765109855


   If your workload doesn't need the full history of snapshots, why not do some maintenance work in the background?
   Specifically, expiring snapshots: https://iceberg.apache.org/maintenance/#expire-snapshots
   
   IMHO the idea looks feasible, but it adds complexity and makes things harder to debug, so it sounds like a trade-off. I'm just a contributor and others may have different opinions. It might be better to post to the dev mailing list about possible improvements; it could get more traction from active committers/PMC members.


[GitHub] [iceberg] davidwilcox commented on issue #2130: Ways To Alleviate Load For Tables With Many Snapshots

Posted by GitBox <gi...@apache.org>.
davidwilcox commented on issue #2130:
URL: https://github.com/apache/iceberg/issues/2130#issuecomment-765116951


   Looking at this a bit more, I think this was actually fixed about six months ago in this commit: a5d105d20a1ad816d9a662901780195da7a450fb
   
   Notice that the call to `io.newInputFile` was moved from an eager call into a lazy one. I must still be using an old version of Iceberg.

