Posted to issues@spark.apache.org by "Misha Dmitriev (JIRA)" <ji...@apache.org> on 2018/07/16 22:49:00 UTC

[jira] [Created] (SPARK-24827) Some memory waste in History Server by strings in AccumulableInfo objects

Misha Dmitriev created SPARK-24827:
--------------------------------------

             Summary: Some memory waste in History Server by strings in AccumulableInfo objects
                 Key: SPARK-24827
                 URL: https://issues.apache.org/jira/browse/SPARK-24827
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.2
            Reporter: Misha Dmitriev


I've analyzed a heap dump of the Spark History Server with jxray ([www.jxray.com|http://www.jxray.com]) and found that 42% of the heap is wasted due to duplicate strings. The biggest sources of such strings are the {{name}} and {{value}} data fields of {{AccumulableInfo}} objects:
{code:java}
7. Duplicate Strings:  overhead 42.1% 

  Total strings   Unique strings   Duplicate values   Overhead
     13,732,278          729,234            354,032    867,177K (42.1%)

Expensive data fields:


318,421K (15.4%), 3669685 / 100% dup strings (8 unique), 3669685 dup backing arrays:

 ↖org.apache.spark.scheduler.AccumulableInfo.name

178,994K (8.7%), 3674403 / 99% dup strings (35640 unique), 3674403 dup backing arrays:

 ↖scala.Some.x

168,601K (8.2%), 3401960 / 92% dup strings (175826 unique), 3401960 dup backing arrays:

 ↖org.apache.spark.scheduler.AccumulableInfo.value{code}
That is, 15.4% of the heap is wasted by {{AccumulableInfo.name}} and 8.2% is wasted by {{AccumulableInfo.value}}.

It turns out that this problem has already been partially addressed in Spark 2.3+, e.g.

[https://github.com/apache/spark/blob/b045315e5d87b7ea3588436053aaa4d5a7bd103f/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L590]

However, this code has two minor problems:
 # Only strings for {{AccumulableInfo.name}} are interned in the above code; strings for {{AccumulableInfo.value}} are not.
 # For interning, the {{weakIntern(String)}} method uses a Guava interner ({{stringInterner = Interners.newWeakInterner[String]()}}). This is an old, less efficient way of interning strings. Since a JDK 7 update several years ago, the built-in JVM {{String.intern()}} method has been considerably more efficient, in terms of both CPU and memory.
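To illustrate the deduplication being discussed, here is a minimal Scala sketch of what {{String.intern()}} does: equal strings are collapsed into a single shared instance from the JVM's native string pool. The accumulator name used below is a made-up example value, not taken from Spark:
{code:java}
object InternDemo {
  def main(args: Array[String]): Unit = {
    // Construct two equal but distinct String instances; using `new`
    // avoids compile-time pooling of identical literals.
    val a = new String("records.read.counter")
    val b = new String("records.read.counter")

    assert(!(a eq b))                 // distinct objects, duplicate contents
    assert(a.intern() eq b.intern())  // after interning, one shared object
  }
}
{code}
With millions of {{AccumulableInfo}} objects but only a handful of distinct names, this collapsing is what recovers the wasted backing arrays reported by jxray.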

It is therefore suggested to intern {{value}} as well, and to replace the Guava interner with {{String.intern()}}.
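A hedged sketch of the suggested change follows. The {{weakIntern}} name mirrors the method in the LiveEntity.scala code linked above; the call site shown in the comment is hypothetical, since the exact shape of the surrounding code is not reproduced here:
{code:java}
// Suggested replacement: drop the Guava Interners.newWeakInterner[String]()
// field and delegate to the JVM's built-in string pool instead.
def weakIntern(s: String): String =
  if (s == null) null else s.intern()

// Hypothetical call site, by analogy with the linked code: intern the
// value string the same way the name is already interned, e.g.
//   name  = weakIntern(acc.name)
//   value = weakIntern(acc.value)
{code}
Keeping the existing method name confines the change to the interning strategy, so callers are untouched.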



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org