Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/08/04 03:27:00 UTC

[jira] [Commented] (FLINK-7368) MetricStore makes cpu spin at 100%

    [ https://issues.apache.org/jira/browse/FLINK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113865#comment-16113865 ] 

ASF GitHub Bot commented on FLINK-7368:
---------------------------------------

GitHub user nicochen opened a pull request:

    https://github.com/apache/flink/pull/4472

    FLINK-7368: MetricStore makes cpu spin at 100%

    Flink's `MetricStore` is not thread-safe. Multiple threads may access the Java `HashMap` inside `MetricStore` concurrently, which can trigger an infinite loop inside `HashMap`.
    
    Recently I hit a case where the Flink JobManager consumed 100% CPU. Part of the stack trace is shown below; the full jstack output is in the attachment.
    {code:java}
    "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 runnable [0x00007fbd7d1c2000]
       java.lang.Thread.State: RUNNABLE
            at java.util.HashMap.put(HashMap.java:494)
            at org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176)
            at org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188)
            at akka.dispatch.OnSuccess.internal(Future.scala:212)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
            at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
            at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
            at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
            at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265)
            at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334)
            at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604)
            at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784)
            at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646)
            at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398)
    {code}
    
    24 threads show the same stack trace as above, which indicates that they are spinning at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). Many posts report that concurrent access to a HashMap from multiple threads causes this problem, and I was able to reproduce it as well. Even though `MetricFetcher` enforces a minimum interval of 10 seconds between metrics queries, that does not guarantee that query responses never access `MetricStore`'s HashMap concurrently.
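
To illustrate the kind of change that avoids unsynchronized `HashMap.put` calls, here is a minimal sketch. It is not Flink's actual `MetricStore` and not necessarily the approach taken in this pull request: `SafeMetricStore`, `addMetric`, `getMetric`, `size` and `hammer` are hypothetical names chosen for the example. It assumes that swapping the plain `HashMap` for a `java.util.concurrent.ConcurrentHashMap` is acceptable; guarding every access with `synchronized` would be another option.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a metric store; NOT Flink's actual MetricStore. */
class SafeMetricStore {
    // ConcurrentHashMap supports concurrent put() calls; java.util.HashMap does not.
    private final Map<String, String> metrics = new ConcurrentHashMap<>();

    void addMetric(String name, String value) {
        metrics.put(name, value);
    }

    String getMetric(String name) {
        return metrics.get(name);
    }

    int size() {
        return metrics.size();
    }

    /**
     * Starts `threads` writers that each add `perThread` distinct metrics
     * at the same time, then returns the number of stored entries.
     */
    static int hammer(int threads, int perThread) {
        final SafeMetricStore store = new SafeMetricStore();
        final CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                try {
                    start.await(); // hold every writer until all are ready
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    store.addMetric("thread" + id + ".metric" + i, String.valueOf(i));
                }
            });
        }
        start.countDown(); // release all writers together
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return store.size();
    }
}
```

Calling `SafeMetricStore.hammer(8, 1000)` writes 8000 distinct keys from 8 threads and should return 8000. With a plain `HashMap` the same workload has no such guarantee: entries can be lost, and on Java 7 a concurrent resize can corrupt a bucket chain so that later calls loop forever, which matches the spinning threads in the stack trace above.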


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nicochen/flink FLINK-7368

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4472
    
----
commit abfa571fbf99be4b98d8d690ed10df1440dd21d5
Author: nicochen2012 <16...@cnsuning.com>
Date:   2017-08-04T03:21:49Z

    FLINK-7368: MetricStore makes cpu spin at 100%

----


> MetricStore makes cpu spin at 100%
> ----------------------------------
>
>                 Key: FLINK-7368
>                 URL: https://issues.apache.org/jira/browse/FLINK-7368
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>            Reporter: Nico Chen
>         Attachments: jm-jstack.log
>
>
> Flink's `MetricStore` is not thread-safe. Multiple threads may access the Java `HashMap` inside `MetricStore` concurrently, which can trigger an infinite loop inside `HashMap`.
> Recently I hit a case where the Flink JobManager consumed 100% CPU. Part of the stack trace is shown below; the full jstack output is in the attachment.
> {code:java}
> "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 runnable [0x00007fbd7d1c2000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.HashMap.put(HashMap.java:494)
>         at org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176)
>         at org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121)
>         at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198)
>         at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58)
>         at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188)
>         at akka.dispatch.OnSuccess.internal(Future.scala:212)
>         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
>         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>         at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
>         at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
>         at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>         at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265)
>         at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334)
>         at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604)
>         at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784)
>         at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646)
>         at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398)
> {code}
> 24 threads show the same stack trace as above, which indicates that they are spinning at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). Many posts report that concurrent access to a HashMap from multiple threads causes this problem, and I was able to reproduce it as well. Even though `MetricFetcher` enforces a minimum interval of 10 seconds between metrics queries, that does not guarantee that query responses never access `MetricStore`'s HashMap concurrently.
>  


