You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sentry.apache.org by Arjun Mishra <am...@cloudera.com> on 2017/07/13 18:03:30 UTC

Review Request 60843: SENTRY-1827 Minimize TPathsDump thrift message used in HDFS sync - Adding change to sentry-ha-redesign branch

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60843/
-----------------------------------------------------------

Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda.


Repository: sentry


Description
-------

We obtained a heap dump taken from the JVM running Hive Metastore at the time when Sentry HDFS sync operation was performed. I've analyzed this dump with jxray (www.jxray.com) and found that  a significant percentage of memory is wasted due to duplicate strings:

{code}
7. DUPLICATE STRINGS

Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
{code}

Of them, more than 1/3 come from sentry:

{code}
  917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
     <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  {j.u.HashMap}.values <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
{code}

The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even if there is only one interned copy of the "foo" string in memory), a separate copy of "foo" will be written to the serialized message for each of these 3 TPathEntries. This is one reason why TPathsDump serialized messages may get very big, consume a lot of memory and take long time to send over the network.

To address this problem we may use some form of custom compression, where we don't write multiple copies of duplicate strings, but rather substitute them with some shorter "string ids".


Diffs
-----

  sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java 722ad76d9 
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java 095095710 
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java 479188e51 
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java e777e4b1a 
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java 08a3b3e92 
  sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift b0a1f877b 
  sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java 194ffb755 
  sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java 9a726da27 
  sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java 89a3297d4 
  sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java 2426b4079 


Diff: https://reviews.apache.org/r/60843/diff/1/


Testing
-------


Thanks,

Arjun Mishra


Re: Review Request 60843: SENTRY-1827 Minimize TPathsDump thrift message used in HDFS sync - Adding change to sentry-ha-redesign branch

Posted by Brian Towles <bt...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60843/#review180537
-----------------------------------------------------------


Ship it!




Looks good to me.

- Brian Towles


On July 13, 2017, 1:03 p.m., Arjun Mishra wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60843/
> -----------------------------------------------------------
> 
> (Updated July 13, 2017, 1:03 p.m.)
> 
> 
> Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda.
> 
> 
> Repository: sentry
> 
> 
> Description
> -------
> 
> We obtained a heap dump taken from the JVM running Hive Metastore at the time when Sentry HDFS sync operation was performed. I've analyzed this dump with jxray (www.jxray.com) and found that  a significant percentage of memory is wasted due to duplicate strings:
> 
> {code}
> 7. DUPLICATE STRINGS
> 
> Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
> {code}
> 
> Of them, more than 1/3 come from sentry:
> 
> {code}
>   917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  {j.u.HashMap}.values <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> 
> The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even if there is only one interned copy of the "foo" string in memory), a separate copy of "foo" will be written to the serialized message for each of these 3 TPathEntries. This is one reason why TPathsDump serialized messages may get very big, consume a lot of memory and take long time to send over the network.
> 
> To address this problem we may use some form of custom compression, where we don't write multiple copies of duplicate strings, but rather substitute them with some shorter "string ids".
> 
> 
> Diffs
> -----
> 
>   sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java 722ad76d9 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java 095095710 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java 479188e51 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java e777e4b1a 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java 08a3b3e92 
>   sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift b0a1f877b 
>   sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java 194ffb755 
>   sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java 9a726da27 
>   sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java 89a3297d4 
>   sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java 2426b4079 
> 
> 
> Diff: https://reviews.apache.org/r/60843/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Arjun Mishra
> 
>


Re: Review Request 60843: SENTRY-1827 Minimize TPathsDump thrift message used in HDFS sync - Adding change to sentry-ha-redesign branch

Posted by Arjun Mishra <am...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60843/#review180548
-----------------------------------------------------------


Ship it!




Ship It!

- Arjun Mishra


On July 13, 2017, 6:03 p.m., Arjun Mishra wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60843/
> -----------------------------------------------------------
> 
> (Updated July 13, 2017, 6:03 p.m.)
> 
> 
> Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda.
> 
> 
> Repository: sentry
> 
> 
> Description
> -------
> 
> We obtained a heap dump taken from the JVM running Hive Metastore at the time when Sentry HDFS sync operation was performed. I've analyzed this dump with jxray (www.jxray.com) and found that  a significant percentage of memory is wasted due to duplicate strings:
> 
> {code}
> 7. DUPLICATE STRINGS
> 
> Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
> {code}
> 
> Of them, more than 1/3 come from sentry:
> 
> {code}
>   917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  {j.u.HashMap}.values <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> 
> The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even if there is only one interned copy of the "foo" string in memory), a separate copy of "foo" will be written to the serialized message for each of these 3 TPathEntries. This is one reason why TPathsDump serialized messages may get very big, consume a lot of memory and take long time to send over the network.
> 
> To address this problem we may use some form of custom compression, where we don't write multiple copies of duplicate strings, but rather substitute them with some shorter "string ids".
> 
> 
> Diffs
> -----
> 
>   sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java 722ad76d9 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java 095095710 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java 479188e51 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java e777e4b1a 
>   sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java 08a3b3e92 
>   sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift b0a1f877b 
>   sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java 194ffb755 
>   sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java 9a726da27 
>   sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java 89a3297d4 
>   sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java 2426b4079 
> 
> 
> Diff: https://reviews.apache.org/r/60843/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Arjun Mishra
> 
>