Posted to issues@ozone.apache.org by "Bharat Viswanadham (Jira)" <ji...@apache.org> on 2020/11/19 22:44:00 UTC

[jira] [Resolved] (HDDS-4478) Large deletedKeyset slows down OM via listStatus

     [ https://issues.apache.org/jira/browse/HDDS-4478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharat Viswanadham resolved HDDS-4478.
--------------------------------------
    Fix Version/s: 1.1.0
       Resolution: Fixed

> Large deletedKeyset slows down OM via listStatus
> ------------------------------------------------
>
>                 Key: HDDS-4478
>                 URL: https://issues.apache.org/jira/browse/HDDS-4478
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.1.0
>
>
> During TPC-DS test runs with Hive backed by Ozone storage I noticed the following:
> As part of data generation we create tables, and with auto partitioning turned on, some of the tables have a couple thousand partitions created.
> This process involves a couple of renames of the different data files, as the MR containers work to create the data files and then move them into their final position.
> After this, when I start to run queries, certain parts of the query run work to split the work for mappers, and in order to do this they issue listStatus calls. As I tracked down and understood the code, a recursive listing happens that filters out files on the client side. This means that for a table where we have 3K partitions, Hive issues 3K listStatus calls on the partition directories.
> During this phase the query is stuck INITIALIZING containers and query phases for a long time, which gave me the opportunity to take a couple of jstacks, and all of them have 5-10 threads with this particular stack trace:
> {code}
> "IPC Server handler 90 on 9862" #147 daemon prio=5 os_prio=0 tid=0x00007fec32dd7800 nid=0x19ac2 runnable [0x00007febe13d9000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.TreeMap.put(TreeMap.java:575)
>         at java.util.TreeSet.add(TreeSet.java:255)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatusFindKeyInTableCache(KeyManagerImpl.java:2041)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:2117)
>         at org.apache.hadoop.ozone.om.OzoneManager.listStatus(OzoneManager.java:2938)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.listStatus(OzoneManagerRequestHandler.java:573)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:197)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:218)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:145)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$88/1748961453.apply(Unknown Source)
>         at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
>         at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> {code}
> This is happening with a recent (at most one week old) master build.
> After I took heapdumps from the OM during query runs, at two different times, I saw the following concerning numbers:
> Of CacheKey instances we have 78K in the earlier one, and 80K in the later one.
> Of CacheValue instances we have 78K in both.
> Almost all of them have the OMMetadataManager as GC root, and belong to the key table cache.
> In listStatusFindKeyInTableCache, we iterate through the cache entries via the cache iterator, and if the cacheKey is not equal to the keyArgs and the cacheValue is null, we add the key to the deletedKeySet.
> This method is called from KeyManager's listStatus method, which creates a new deletedKeySet as a TreeSet for every call, then goes over all cached keys in the bucket to populate it. We seem to need this deletedKeySet in order to leave out of the listing keys that are already deleted in the cache but still present in the db, which is a valid reason, but it causes a significant slowdown (1-2 orders of magnitude) compared to runtimes right after an OM restart.
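To make the cost concrete, here is a hypothetical, simplified model of the per-call deletedKeySet scan described above; the class, method, and key names are illustrative and do not match the real KeyManagerImpl code, which works with CacheKey/CacheValue types:

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.TreeSet;

public class DeletedKeySetSketch {

  // Cache: key name -> Optional key info; Optional.empty() stands in for a
  // CacheValue with a null value, i.e. a key deleted in the cache but
  // possibly still visible in the DB.
  static TreeSet<String> buildDeletedKeySet(
      Map<String, Optional<String>> keyTableCache, String searchedKey) {
    // A brand-new TreeSet is built on every listStatus call, so the cost
    // grows with the total cache size, not with the listed directory.
    TreeSet<String> deletedKeySet = new TreeSet<>();
    for (Map.Entry<String, Optional<String>> e : keyTableCache.entrySet()) {
      if (!e.getKey().equals(searchedKey) && !e.getValue().isPresent()) {
        deletedKeySet.add(e.getKey());
      }
    }
    return deletedKeySet;
  }

  public static void main(String[] args) {
    Map<String, Optional<String>> cache = new TreeMap<>();
    cache.put("/vol/bucket/a", Optional.of("info-a"));
    cache.put("/vol/bucket/b", Optional.empty()); // deleted in cache
    cache.put("/vol/bucket/c", Optional.empty()); // deleted in cache
    System.out.println(buildDeletedKeySet(cache, "/vol/bucket/a"));
  }
}
```

With ~80K cached entries and 3K listStatus calls per query, this full scan plus TreeSet insertion on every call matches the TreeMap.put frames seen in the jstacks above.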
> In the heap dump the cache values are either an instance of Present or of Absent, and the majority seems to be Absent. Based on the epoch, both present and absent values span various epochs, from 1 to 300K+.
> In TableCacheImpl, we have an evictCache method that runs in an executor service, and which should run on DoubleBuffer flush.
> It gets a list of epochs to evict and iterates through the cache's epochEntries; since we are talking about the KeyTable, which has a MANUAL eviction policy, it removes an entry from the cache only if the entry's epoch is the current epoch in the iteration and the supplied epochs to evict contain that epoch.
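A hypothetical, simplified model of that MANUAL eviction pass, based on the description above (the real TableCacheImpl has its own entry types and runs this in an executor):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ManualEvictionSketch {

  // epochEntries: epoch -> keys cached at that epoch; cache: key -> value.
  static void evictCache(Map<Long, List<String>> epochEntries,
                         Map<String, String> cache,
                         List<Long> epochsToEvict) {
    // With MANUAL eviction, an entry is removed only when its epoch is in
    // the supplied list; an epoch never reported for cleanup keeps its
    // entries in the cache indefinitely.
    for (Long epoch : new ArrayList<>(epochEntries.keySet())) {
      if (epochsToEvict.contains(epoch)) {
        for (String key : epochEntries.get(epoch)) {
          cache.remove(key);
        }
        epochEntries.remove(epoch);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> cache = new HashMap<>();
    cache.put("keyA", "a");
    cache.put("keyB", "b");
    Map<Long, List<String>> epochEntries = new HashMap<>();
    epochEntries.put(1L, List.of("keyA"));
    epochEntries.put(2L, List.of("keyB"));
    // Epoch 2 is never reported for cleanup, so keyB survives forever.
    evictCache(epochEntries, cache, List.of(1L));
    System.out.println(cache.keySet());
  }
}
```

This is why a table whose responses never report it for cleanup can accumulate cache entries without bound.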
> Epochs come from DoubleBuffer#flushTransactions.
> These start out as a HashMap from table name (String) to a List of Longs; DoubleBuffer puts together the list of epochs for each table during flush, based on the table names registered on the Responses via the CleanupTableInfo annotation.
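A hypothetical sketch of how that per-table epoch map could be assembled during flush, following the description above; the FlushedResponse type and its fields are illustrative stand-ins, not the real DoubleBuffer or Response API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CleanupEpochsSketch {

  // Stand-in for a flushed OM response: its epoch plus the table names
  // it registered for cleanup (via the @CleanupTableInfo annotation).
  static class FlushedResponse {
    final long epoch;
    final List<String> cleanupTables;
    FlushedResponse(long epoch, List<String> cleanupTables) {
      this.epoch = epoch;
      this.cleanupTables = cleanupTables;
    }
  }

  static Map<String, List<Long>> collectCleanupEpochs(
      List<FlushedResponse> responses) {
    Map<String, List<Long>> cleanupEpochs = new HashMap<>();
    for (FlushedResponse r : responses) {
      // A table absent from a response's cleanup list never receives that
      // response's epoch, so evictCache is never asked to evict it.
      for (String table : r.cleanupTables) {
        cleanupEpochs
            .computeIfAbsent(table, t -> new ArrayList<>())
            .add(r.epoch);
      }
    }
    return cleanupEpochs;
  }
}
```

Under this model, a response that populates the key table cache but omits the key table from its cleanup annotation leaves those entries uncollectible, which is exactly the suspicion raised next.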
> The only suspicious thing I found is that OMFileCreateResponse does not define the key table as something to clean up, and none of its parent classes has it defined for cleanup either, while the OMFileCreateRequest adds the parents of the created file to the key table cache [here|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileCreateRequest.java#L299-L301].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org