Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/12/28 03:14:03 UTC

[GitHub] [dolphinscheduler] oldbelvey opened a new issue, #13288: HadoopUtils's causing OOM issue

oldbelvey opened a new issue, #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   The worker ran out of memory (OOM). Looking into the heap dump,
   there are 91879 instances of `org.apache.hadoop.conf.Configuration`, which consume 94% of the total 8 GB of memory (retained heap).
   Below is the path to the GC root of one `Configuration` object.
   ```
   Class Name                                                                                                                         | Shallow Heap | Retained Heap
   ------------------------------------------------------------------------------------------------------------------------------------------------------------------
   org.apache.hadoop.conf.Configuration @ 0x5f880d4f8                                                                                 |           40 |        85,072
   |- conf org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider @ 0x5f880d450                                  |           96 |           296
   |  |- proxyProvider org.apache.hadoop.io.retry.RetryInvocationHandler @ 0x5f8837850                                                |           40 |         1,592
   |  |  |- h com.sun.proxy.$Proxy138 @ 0x5f883f410                                                                                   |           16 |         1,608
   |  |  |  |- federatedNamenode org.apache.hadoop.hdfs.DFSClient @ 0x7f877c610                                                       |          128 |         4,160
   |  |  |  |  |- dfs org.apache.hadoop.hdfs.DistributedFileSystem @ 0x7f877c2f0                                                      |           56 |         5,336
   |  |  |  |  |  |- fs org.apache.hadoop.fs.viewfs.ChRootedFileSystem @ 0x7f873e400                                                  |           56 |         6,256
   |  |  |  |  |  |  |- fs org.apache.hadoop.fs.viewfs.MergedInodeTree$INodeMerge @ 0x7f873d7d0                                       |           32 |         6,984
   |  |  |  |  |  |  |  |- target org.apache.hadoop.fs.viewfs.InodeTree$MountPoint @ 0x5f8841af0                                      |           24 |           136
   |  |  |  |  |  |  |  |  |- [171] java.lang.Object[244] @ 0x7f7ec7330                                                               |          992 |        27,480
   |  |  |  |  |  |  |  |  |  |- elementData java.util.ArrayList @ 0x7e9599498                                                        |           24 |        27,504
   |  |  |  |  |  |  |  |  |  |  |- mountPoints org.apache.hadoop.fs.viewfs.ViewFileSystem$1 @ 0x7e9599478                            |           32 |     1,616,104
   |  |  |  |  |  |  |  |  |  |  |  |- fsState org.apache.hadoop.fs.viewfs.ViewFileSystem @ 0x6bb0d4318                               |           72 |     1,617,440
   |  |  |  |  |  |  |  |  |  |  |  |  |- viewFs org.apache.hadoop.hdfs.FederatedDFSFileSystem @ 0x6bb405ad8                          |           64 |     1,618,016
   |  |  |  |  |  |  |  |  |  |  |  |  |  |- this$0 org.apache.hadoop.hdfs.FederatedDFSFileSystem$1 @ 0x6bb55eab8                     |           16 |            16
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- renewMpt org.apache.hadoop.hdfs.MountPointRenewer @ 0x6bb55a0a0                       |           64 |        57,720
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- this$0 org.apache.hadoop.hdfs.MountPointRenewer$3 @ 0x5fb360928                    |           40 |        57,776
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- <Java Local> java.util.TimerThread @ 0x5fb360748  Timer-35086 Thread            |          128 |           304
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- [1] java.util.TimerTask[128] @ 0x5fb360538                                      |          528 |           528
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- queue java.util.TaskQueue @ 0x5fb360520 Busy Monitor                         |           24 |           552
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  '- Total: 1 entry                                                               |              |              
   
   ```
   
   
   
   
   
   
   
   ### What you expected to happen
   
   
   It is the Timer in `org.apache.hadoop.hdfs.MountPointRenewer` that keeps these objects from being garbage-collected.
   
   The MountPointRenewer most likely comes from the FileSystem class, and there are 192 FileSystem instances.
   
   I searched the source code for the FileSystem class: it is only used in `org.apache.dolphinscheduler.common.utils.HadoopUtils`, which contains the code below.
   
   ```
       private static final LoadingCache<String, HadoopUtils> cache = CacheBuilder
               .newBuilder()
               .expireAfterWrite(PropertyUtils.getInt(Constants.KERBEROS_EXPIRE_TIME, 2), TimeUnit.HOURS)
               .build(new CacheLoader<String, HadoopUtils>() {
                   @Override
                   public HadoopUtils load(String key) throws Exception {
                       return new HadoopUtils();
                   }
               });
   ```
   
   By default, a new `HadoopUtils` is generated every 2 hours, and the old FileSystem is never closed.
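   
   The effect can be modelled with the JDK alone. The sketch below is a simplified illustration, not DolphinScheduler or Hadoop code: `RenewingResource` and `ResourceHolder` are hypothetical names standing in for a FileSystem that owns a background `java.util.Timer` and for a cache that periodically replaces its value. It assumes the timer is only cancelled when the resource is closed; whether closing the federated FileSystem actually cancels the `MountPointRenewer` timer is not confirmed here.
   
   ```java
   import java.util.Timer;
   import java.util.TimerTask;
   import java.util.concurrent.atomic.AtomicBoolean;
   import java.util.concurrent.atomic.AtomicInteger;
   
   // Hypothetical stand-in for a FileSystem that owns a background Timer.
   class RenewingResource implements AutoCloseable {
       // Number of resources whose timer has not been cancelled yet.
       static final AtomicInteger OPEN = new AtomicInteger();
   
       private final Timer timer = new Timer("renewer", true); // daemon thread
       private final AtomicBoolean closed = new AtomicBoolean();
   
       RenewingResource() {
           OPEN.incrementAndGet();
           // The anonymous task captures `this`, so the timer thread keeps the
           // whole resource reachable until the timer is cancelled.
           timer.schedule(new TimerTask() {
               @Override
               public void run() {
                   renew();
               }
           }, 60_000L, 60_000L);
       }
   
       void renew() {
           // Periodic work would go here.
       }
   
       @Override
       public void close() {
           // Cancel only once, even if close() is called repeatedly.
           if (closed.compareAndSet(false, true)) {
               timer.cancel();
               OPEN.decrementAndGet();
           }
       }
   }
   
   // Hypothetical holder that replaces its resource, like an expiring cache entry.
   class ResourceHolder {
       private RenewingResource current = new RenewingResource();
   
       // Leaky: the old resource is dropped, but its timer keeps running.
       void reloadWithoutClose() {
           current = new RenewingResource();
       }
   
       // Safe: the old resource is closed once the replacement exists.
       void reloadWithClose() {
           RenewingResource old = current;
           current = new RenewingResource();
           old.close();
       }
   }
   ```
   
   In this model every call to `reloadWithoutClose()` leaves one more live timer (and everything it references) behind, while `reloadWithClose()` keeps the number of live timers constant.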
   
   ### How to reproduce
   
   1. DolphinScheduler version 1.3.6 (later versions behave the same)
   2. DolphinScheduler common.properties: resource.storage.type=HDFS
   3. Hadoop core-site.xml: fs.AbstractFileSystem.hdfs.impl = org.apache.hadoop.fs.FederatedHdfs
   
   With this configuration the worker runs out of memory every few days or months, depending on how much memory is configured.
   
   ### Anything else
   
   In my opinion there is no need to recreate HadoopUtils every 2 hours using the Guava cache (LoadingCache).
   
   I am changing it to a single instance and using `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();` to refresh the Kerberos ticket.
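   
   The single-instance approach might look roughly like the following. This is a minimal sketch using only the JDK: `SharedClient`, `refreshCredentials()` and the counters are hypothetical stand-ins for `HadoopUtils`, and calling the refresh on every `getInstance()` is an assumption about where the relogin call could go, not a description of the actual change.
   
   ```java
   import java.util.concurrent.atomic.AtomicInteger;
   
   // Hypothetical stand-in for HadoopUtils; the real class wraps a Hadoop FileSystem.
   class SharedClient {
       // Counts how many instances were ever constructed (for illustration only).
       static final AtomicInteger CREATED = new AtomicInteger();
   
       private final AtomicInteger refreshes = new AtomicInteger();
   
       private SharedClient() {
           CREATED.incrementAndGet();
       }
   
       // Initialization-on-demand holder: the JVM creates INSTANCE lazily,
       // exactly once, and in a thread-safe way.
       private static final class Holder {
           static final SharedClient INSTANCE = new SharedClient();
       }
   
       // Returns the one shared instance instead of building a new one on expiry.
       static SharedClient getInstance() {
           SharedClient client = Holder.INSTANCE;
           client.refreshCredentials();
           return client;
       }
   
       void refreshCredentials() {
           // Assumption: in the real class this is where the Kerberos relogin
           // from the issue would be called, i.e.
           // UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
           refreshes.incrementAndGet();
       }
   
       int refreshCount() {
           return refreshes.get();
       }
   }
   ```
   
   Because only one instance ever exists, only one underlying FileSystem (and one renewer timer) is created, and the credentials are refreshed in place rather than by discarding the whole object.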
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] SbloodyS commented on issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
SbloodyS commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1366398783

   You can try upgrading to version 2.0.X to avoid this problem, since version 1.3.6 is no longer on the maintenance list.




[GitHub] [dolphinscheduler] github-actions[bot] closed issue #13288: HadoopUtils's causing OOM issue

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #13288: HadoopUtils's  causing OOM issue 
URL: https://github.com/apache/dolphinscheduler/issues/13288




[GitHub] [dolphinscheduler] SbloodyS commented on issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
SbloodyS commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1367313109

   > By changing HadoopUtils to a single instance and using `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();` to refresh the Kerberos ticket, it seems that this problem is solved.
   
   Would you like to submit a PR to fix this?




[GitHub] [dolphinscheduler] oldbelvey commented on issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
oldbelvey commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1367687490

   I will keep watching my DolphinScheduler cluster for a few days. If it really works, I will commit my changes later.
   
   




[GitHub] [dolphinscheduler] SbloodyS closed issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
SbloodyS closed issue #13288: HadoopUtils's  causing OOM issue 
URL: https://github.com/apache/dolphinscheduler/issues/13288




[GitHub] [dolphinscheduler] oldbelvey commented on issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
oldbelvey commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1367252631

   By changing HadoopUtils to a single instance and using `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();` to refresh the Kerberos ticket, it seems that this problem is solved.




[GitHub] [dolphinscheduler] oldbelvey commented on issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
oldbelvey commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1366499954

   It seems that versions < 3.1.1 have the same problem.
   
   https://github.com/apache/dolphinscheduler/blob/2.0.7/dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/HadoopUtils.java
   
   https://github.com/apache/dolphinscheduler/blob/3.1.0-release/dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/HadoopUtils.java




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13288: HadoopUtils's causing OOM issue

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1414544548

   This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13288: HadoopUtils's causing OOM issue

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1366336131

   Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * In order for us to understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to channel `#troubleshooting`




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13288: HadoopUtils's causing OOM issue

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288#issuecomment-1425011879

   This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.

