Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/12/21 11:14:00 UTC

[jira] [Comment Edited] (HADOOP-18526) Leak of S3AInstrumentation instances via hadoop Metrics references

    [ https://issues.apache.org/jira/browse/HADOOP-18526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649710#comment-17649710 ] 

Steve Loughran edited comment on HADOOP-18526 at 12/21/22 11:13 AM:
--------------------------------------------------------------------



Don't take my unwillingness to backport personally... I have to deal with that internally for too much of my life. The problem is that often you can't cleanly distinguish "fix" from "feature", and what seems a simple fix (fs.s3a.endpoint.region) is actually a significant feature which can't be backported without cherrypicking a whole chain of commits (HADOOP-17705). As any commit can cause a regression, you end up having to keep all branches up to date, and at that point you have to ask "why?". If all changes go back, there is no more stability in the older branch than in the newer one - so if people want the fixes, they may as well upgrade. Or take on the cherrypick, retest and follow-up fix maintenance themselves. We are open source and volunteers are welcome, but not me: keeping trunk and 3.3.9 up to date, and getting 3.3.5 out, is enough for me.
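For context, here is roughly what the user-facing side of that option looks like - a minimal sketch only, not a recipe; the bucket name and region value are made up:

    // Sketch: setting fs.s3a.endpoint.region (the option added under
    // HADOOP-17705) on a Hadoop Configuration. Bucket and region here
    // are illustrative values, not defaults.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RegionExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.endpoint.region", "us-west-2");
        // newInstance() creates a fresh FS rather than a cached one,
        // so closing it here doesn't break other users of the bucket.
        try (FileSystem fs = FileSystem.newInstance(
            new URI("s3a://example-bucket/"), conf)) {
          System.out.println("created " + fs.getUri());
        }
      }
    }

One configuration key on the surface; a chain of SDK and store changes underneath to make it actually work. That is exactly why it can't be cherrypicked in isolation.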

I do have to deal with cherrypicking things internally. There we are mostly up to date with the s3a/abfs code, though a few things are missing (multipart api), and dependencies haven't been updated where they are shared with other modules, except when coordinated and the other teams are all happy. But we do end up having to do all that retesting, follow-up fixing etc. One benefit of that work, however, is that the full stack tests often find regressions (example: HADOOP-18410) before they get into the ASF releases. But it does mean a very large fraction of my week is spent backporting changes to multiple internal branches and dealing with the consequences. Oh, and getting region support into older releases was a serious PITA until we worked out the role of awssdk_config_default.json.

To close then: you get to embrace backporting or, as I recommend, pick up the latest release with a full set of current fixes and the changes needed for them to apply.




was (Author: stevel@apache.org):
Downgrade from fail to warn when trying to use dynamic partition output on a committer which doesn't declare support (it works, just slowly)
Pass partitions up to the partition list of the superclass (needs a new addPartition() method)
HadoopMapReduceCommitProtocol scaladocs to describe the abspath part of the dynamic partition output protocol


> Leak of S3AInstrumentation instances via hadoop Metrics references
> ------------------------------------------------------------------
>
>                 Key: HADOOP-18526
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18526
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.4
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.5
>
>
> A heap dump of a process which ran out of memory shows that if a process creates then destroys lots of S3AFS instances, you seem to run out of heap due to references to S3AInstrumentation and the IOStatisticsStore kept via the hadoop metrics registry.
> It doesn't look like S3AInstrumentation.close() is being invoked in S3AFS.close(). It should be, with the IOStats being snapshotted to a local reference before this happens. That allows the stats of a closed fs to still be examined.
> If you look at org.apache.hadoop.ipc.DecayRpcScheduler.MetricsProxy, it uses a WeakReference to refer back to the larger object. We should do the same for the abfs/s3a bindings, ideally with a shared template proxy class in hadoop-common that both can use.
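The DecayRpcScheduler.MetricsProxy pattern the description points at, as a minimal standalone sketch - this is illustrative, not the actual HADOOP-18526 patch; the class name and wiring are invented:

    // Sketch of a metrics source holding only a WeakReference to the
    // real instrumentation, so the metrics registry no longer pins a
    // closed S3AInstrumentation instance in the heap.
    import java.lang.ref.WeakReference;
    import org.apache.hadoop.metrics2.MetricsCollector;
    import org.apache.hadoop.metrics2.MetricsSource;

    class WeakRefMetricsSource implements MetricsSource {
      private final WeakReference<MetricsSource> sourceRef;

      WeakRefMetricsSource(MetricsSource source) {
        this.sourceRef = new WeakReference<>(source);
      }

      @Override
      public void getMetrics(MetricsCollector collector, boolean all) {
        MetricsSource source = sourceRef.get();
        if (source != null) {
          // Instrumentation still alive: delegate as normal.
          source.getMetrics(collector, all);
        }
        // Once the referent has been garbage collected, the proxy
        // reports nothing; only this small object stays registered.
      }
    }

On top of this, per the second paragraph of the description, S3AFS.close() would still need to snapshot the IOStatistics to a local reference and then invoke S3AInstrumentation.close().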



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
