Posted to issues@spark.apache.org by "Arnaud Nauwynck (Jira)" <ji...@apache.org> on 2022/10/03 20:34:00 UTC

[jira] [Updated] (SPARK-40642) wrong doc on memory tuning regarding String object memory size, changed since version>=9

     [ https://issues.apache.org/jira/browse/SPARK-40642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arnaud Nauwynck updated SPARK-40642:
------------------------------------
    Description: 
The documentation is wrong regarding the memory consumption of java.lang.String:
https://spark.apache.org/docs/latest/tuning.html#memory-tuning

Internally, the source for this doc section is here:
https://github.com/apache/spark/blob/master/docs/tuning.md?plain=1#L100

{noformat}
* Java `String`s have about 40 bytes of overhead over the raw string data (since they store it in an
  array of `Char`s and keep extra data such as the length), and store each character
  as *two* bytes due to `String`'s internal usage of UTF-16 encoding. Thus a 10-character string can
  easily consume 60 bytes.
{noformat}


Reason: since Java 9, the JVM has optimized away the problem described in the doc.
It used to be roughly 16 bytes of object header plus characters internally coded as UTF-16 in a char[].

Notice that before JDK 9 (since JDK 6), there was also an internal HotSpot JVM flag, -XX:+UseCompressedStrings, but it was not enabled by default.

Since OpenJDK 9, with the implementation of JEP 254 "Compact Strings" ( https://openjdk.org/jeps/254 ), Strings are internally encoded with one byte per character (Latin-1) when the text is plain Latin-1, and as UTF-16 as before otherwise. There is now an extra byte field, "coder", in class java.lang.String saying whether the compact Latin-1 encoding is used.

This field is described here in the OpenJDK source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L170
The computation for the memory size of a String used to be "40 + 2*charCount"; it is now "44 + 1*charCount" for Latin-1 text, and "44 + 2*charCount" when it is not Latin-1 text.

The object overhead is 44 rather than 41 because of 8-byte alignment, not simply 40 + 1 for the added "byte" field.
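For illustration, here is a minimal sketch of the size estimate above (assuming a 64-bit HotSpot JVM with compressed oops; the 44-byte constant comes from the formula above, and the class/method names are mine, not from the JDK):

```java
// Sketch: estimate the memory footprint of a java.lang.String on JDK >= 9,
// per the "44 + 1*charCount" (Latin-1) vs "44 + 2*charCount" formulas above.
public class StringSizeEstimate {

    // JEP 254 uses the compact coder only when every char fits in one byte
    static boolean isLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    // ~44 bytes of overhead (String object + byte[] header, aligned),
    // then 1 or 2 bytes per character depending on the coder
    static long estimateBytes(String s) {
        int n = s.length();
        return isLatin1(s) ? 44 + (long) n : 44 + 2L * n;
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes("abcdefghij")); // 10 Latin-1 chars: 44 + 10 = 54
        System.out.println(estimateBytes("日本"));        // 2 non-Latin-1 chars: 44 + 4 = 48
    }
}
```

So the doc's 10-character example now costs about 54 bytes for Latin-1 text, not 60.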



Notice that I am surprised that java.lang.String declares its fields in this order:
{noformat}
    private final byte[] value;
    private final byte coder;  // <== takes 4 bytes, for alignment with the next "int" field
    private int hash;
    private boolean hashIsZero; 
{noformat}
instead of
{noformat}
    private final byte[] value;
    private int hash;
    private final byte coder;  // <== would take only 1 byte, since no alignment is needed with the next "boolean" field
    private boolean hashIsZero; 
{noformat}
Maybe it would also be worth filing a Jira/PR against OpenJDK for this change?
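The alignment cost can be sketched with a toy padding calculation (assuming each primitive field is aligned to its own size; the class and method names are hypothetical, and note that HotSpot is in practice free to reorder fields regardless of declaration order, so the actual layout may differ):

```java
// Sketch: pad each field's offset to that field's alignment, illustrating
// why declaring `coder` (byte) before `hash` (int) costs 4 bytes instead of 1.
public class FieldPadding {

    // sizes in declaration order; a primitive's alignment equals its size
    static int layoutSize(int[] fieldSizes) {
        int offset = 0;
        for (int size : fieldSizes) {
            int align = size;
            offset = (offset + align - 1) / align * align; // round up to alignment
            offset += size;
        }
        return offset;
    }

    public static void main(String[] args) {
        // current order in String.java: byte coder, int hash, boolean hashIsZero
        System.out.println(layoutSize(new int[]{1, 4, 1})); // 9 (3 padding bytes after coder)
        // reordered: int hash, byte coder, boolean hashIsZero
        System.out.println(layoutSize(new int[]{4, 1, 1})); // 6 (no padding)
    }
}
```

The OpenJDK JOL tool (org.openjdk.jol) can print the real layout chosen by the JVM.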






> wrong doc on memory tuning regarding String object memory size, changed since version>=9
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-40642
>                 URL: https://issues.apache.org/jira/browse/SPARK-40642
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2
>            Reporter: Arnaud Nauwynck
>            Priority: Trivial
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org