Posted to issues@sentry.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2017/07/01 07:57:00 UTC

[jira] [Commented] (SENTRY-1827) Minimize TPathsDump thrift message used in HDFS sync

    [ https://issues.apache.org/jira/browse/SENTRY-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071091#comment-16071091 ] 

Hadoop QA commented on SENTRY-1827:
-----------------------------------

Here are the results of testing the latest attachment
https://issues.apache.org/jira/secure/attachment/12875342/SENTRY-1827.03.patch against master.

{color:red}Overall:{color} -1 due to 2 errors

{color:red}ERROR:{color} mvn test exited 1
{color:red}ERROR:{color} Failed: org.apache.sentry.tests.e2e.dbprovider.TestDbCrossOperations

Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/2950/console

This message is automatically generated.

> Minimize TPathsDump thrift message used in HDFS sync
> ----------------------------------------------------
>
>                 Key: SENTRY-1827
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1827
>             Project: Sentry
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>             Fix For: 1.8.0
>
>         Attachments: SENTRY-1827.01.patch, SENTRY-1827.02.patch, SENTRY-1827.03.patch
>
>
> We obtained a heap dump from the JVM running the Hive Metastore at the time a Sentry HDFS sync operation was performed. I analyzed this dump with jxray (www.jxray.com) and found that a significant percentage of memory is wasted on duplicate strings:
> {code}
> 7. DUPLICATE STRINGS
> Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
> {code}
> Of these, more than a third come from Sentry:
> {code}
>   917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  {j.u.HashMap}.values <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> The duplicate strings in memory were eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even though there is only one interned copy of the "foo" string in memory) a separate copy of "foo" is written to the serialized message for each of these 3 TPathEntries. This is one reason why serialized TPathsDump messages can grow very large, consume a lot of memory, and take a long time to send over the network.
> To address this problem, we may use a form of custom compression: instead of writing multiple copies of duplicate strings, we substitute each occurrence with a shorter "string id".
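>
> A minimal sketch of this dictionary-encoding idea (hypothetical names, not the actual patch): each distinct pathElement is assigned a small integer id and written to a shared string table once; serialized entries then carry the id instead of another copy of the string.
> {code}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> // Hypothetical encoder illustrating the "string id" substitution.
> public class PathDictEncoder {
>     private final Map<String, Integer> idByString = new HashMap<>();
>     private final List<String> stringTable = new ArrayList<>();
>
>     // Returns the id for s, assigning a new one on first sight.
>     public int idFor(String s) {
>         Integer id = idByString.get(s);
>         if (id == null) {
>             id = stringTable.size();
>             stringTable.add(s);
>             idByString.put(s, id);
>         }
>         return id;
>     }
>
>     // The table is serialized once alongside the entries, so "foo"
>     // appearing in 3 entries costs one string plus 3 small integers.
>     public List<String> getStringTable() {
>         return stringTable;
>     }
>
>     public static void main(String[] args) {
>         PathDictEncoder enc = new PathDictEncoder();
>         System.out.println(enc.idFor("foo"));       // 0
>         System.out.println(enc.idFor("bar"));       // 1
>         System.out.println(enc.idFor("foo"));       // 0 again: no duplicate written
>         System.out.println(enc.getStringTable());   // [foo, bar]
>     }
> }
> {code}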



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)