Posted to issues@flink.apache.org by nicochen <gi...@git.apache.org> on 2017/08/04 03:26:16 UTC

[GitHub] flink pull request #4472: FLINK-7368: MetricStore makes cpu spin at 100%

GitHub user nicochen opened a pull request:

    https://github.com/apache/flink/pull/4472

    FLINK-7368: MetricStore makes cpu spin at 100%

    Flink's `MetricStore` is not thread-safe. Multiple threads may access the Java `HashMap` inside `MetricStore` concurrently, which can trigger an infinite loop in `HashMap`.
    
    Recently I hit a case where the Flink JobManager consumed 100% CPU. Part of the stack trace is shown below; the full jstack output is in the attachment.
    {code:java}
    "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 runnable [0x00007fbd7d1c2000]
       java.lang.Thread.State: RUNNABLE
            at java.util.HashMap.put(HashMap.java:494)
            at org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176)
            at org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188)
            at akka.dispatch.OnSuccess.internal(Future.scala:212)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
            at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
            at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
            at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
            at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265)
            at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334)
            at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604)
            at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784)
            at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646)
            at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398)
    {code}
    
    24 threads show the same stack trace as above, which indicates they are spinning at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). Many posts indicate that accessing a HashMap from multiple threads causes this problem, and I reproduced the case as well. Even though `MetricFetcher` enforces a minimum interval of 10 seconds between metrics queries, it still cannot guarantee that query responses do not access `MetricStore`'s HashMap concurrently.
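
The direction of the diff in this pull request (replacing `HashMap` with `ConcurrentHashMap`) can be illustrated with a minimal sketch. It is not Flink's actual `MetricStore`: the class `ConcurrentMetricMapSketch`, its `addMetric` method, and the helper `runConcurrentWriters` are hypothetical names chosen for illustration, and the sketch uses Java 8 lambdas for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for one metric map; NOT Flink's actual MetricStore.
class ConcurrentMetricMapSketch {
    // ConcurrentHashMap supports concurrent put() calls without external locking.
    final Map<String, String> metrics = new ConcurrentHashMap<>();

    void addMetric(String name, String value) {
        metrics.put(name, value);
    }

    // Starts `writers` threads that each add `perWriter` distinct metrics,
    // waits for them to finish, and returns the resulting map size.
    static int runConcurrentWriters(int writers, int perWriter) {
        ConcurrentMetricMapSketch store = new ConcurrentMetricMapSketch();
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        for (int w = 0; w < writers; w++) {
            final int id = w;
            pool.execute(() -> {
                for (int i = 0; i < perWriter; i++) {
                    store.addMetric("writer" + id + ".metric" + i, String.valueOf(i));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.metrics.size();
    }

    public static void main(String[] args) {
        // 24 writers mirrors the thread count in the report; every key is distinct.
        System.out.println(runConcurrentWriters(24, 1000));
    }
}
```

With a plain `HashMap` in place of the `ConcurrentHashMap`, the same workload would be a data race, and the resulting size (or even termination, on older JDKs) would not be guaranteed.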


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nicochen/flink FLINK-7368

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4472
    
----
commit abfa571fbf99be4b98d8d690ed10df1440dd21d5
Author: nicochen2012 <16...@cnsuning.com>
Date:   2017-08-04T03:21:49Z

    FLINK-7368: MetricStore makes cpu spin at 100%

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4472: FLINK-7368: MetricStore makes cpu spin at 100%

Posted by nicochen <gi...@git.apache.org>.
Github user nicochen commented on the issue:

    https://github.com/apache/flink/pull/4472
  
    @zentol Thanks for replying. Indeed, the problem is caused by `MetricFetcher` not synchronizing on the `MetricStore` object in MetricFetcher#addMetrics(). But in my opinion, synchronizing on the `MetricStore` is less efficient. `MetricStore` wraps more than one metric store, and they serve different components (e.g. JobManager, TaskManagers) individually. If we synchronize on the `MetricStore`, a call to addMetrics() for the JobManager's metrics may have to wait for an addMetrics() call for a TaskManager's metrics to finish, because both acquire the same lock, which is unnecessary.
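
The finer-grained alternative argued for above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in the pull request: `PerComponentStoreSketch`, `ComponentStore`, and all method names are hypothetical. Each component gets its own sub-store with its own monitor, so an update for the JobManager never waits on an update for a TaskManager.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical layout: one sub-store per component, each guarded by its own lock.
class PerComponentStoreSketch {
    static final class ComponentStore {
        private final Map<String, String> metrics = new HashMap<>();

        // Synchronizes on this sub-store only, not on the enclosing store.
        synchronized void addMetric(String name, String value) {
            metrics.put(name, value);
        }

        synchronized String getMetric(String name) {
            return metrics.get(name);
        }
    }

    private final ComponentStore jobManager = new ComponentStore();
    // The registry of TaskManager sub-stores is itself safe for concurrent access.
    private final Map<String, ComponentStore> taskManagers = new ConcurrentHashMap<>();

    void addJobManagerMetric(String name, String value) {
        jobManager.addMetric(name, value);
    }

    void addTaskManagerMetric(String tmId, String name, String value) {
        // computeIfAbsent creates the sub-store atomically on first use.
        taskManagers.computeIfAbsent(tmId, id -> new ComponentStore()).addMetric(name, value);
    }

    String getJobManagerMetric(String name) {
        return jobManager.getMetric(name);
    }

    String getTaskManagerMetric(String tmId, String name) {
        ComponentStore store = taskManagers.get(tmId);
        return store == null ? null : store.getMetric(name);
    }

    public static void main(String[] args) {
        PerComponentStoreSketch store = new PerComponentStoreSketch();
        store.addJobManagerMetric("numRunningJobs", "3");
        store.addTaskManagerMetric("tm-1", "heapUsed", "512");
        System.out.println(store.getJobManagerMetric("numRunningJobs"));
        System.out.println(store.getTaskManagerMetric("tm-1", "heapUsed"));
    }
}
```

The trade-off is that every access pays for a lock acquisition, which is the per-access cost raised elsewhere in this thread.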



---

[GitHub] flink pull request #4472: FLINK-7368: MetricStore makes cpu spin at 100%

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/4472


---

[GitHub] flink pull request #4472: FLINK-7368: MetricStore makes cpu spin at 100%

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4472#discussion_r132429138
  
    --- Diff: flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/metrics/MetricStore.java ---
    @@ -24,8 +24,8 @@
     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;
     
    -import java.util.HashMap;
    -import java.util.HashSet;
    +import java.util.concurrent.ConcurrentHashMap;
    +import java.util.concurrent.ConcurrentSkipListSet;
    --- End diff --
    
    To configure the checkstyle plugin: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/ide_setup.html#checkstyle


---

[GitHub] flink issue #4472: FLINK-7368: MetricStore makes cpu spin at 100%

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/4472
  
    The problem is that the `MetricFetcher` isn't synchronizing on the `MetricStore` object in `MetricFetcher#addMetrics()` as it should.
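
A rough sketch of the coarse-grained approach described in this comment, where the caller holds the store's monitor for a whole batch, might look like the following. It is not Flink's `MetricFetcher` or `MetricStore`: `CoarseLockSketch`, `PlainStore`, and the method names are hypothetical, and the store is deliberately left unsynchronized internally.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of caller-side locking around a non-thread-safe store.
class CoarseLockSketch {
    static final class PlainStore {
        // Plain HashMap: safe only if every caller synchronizes on the store.
        final Map<String, String> metrics = new HashMap<>();

        void add(String name, String value) {
            metrics.put(name, value);
        }
    }

    // Takes the store's monitor once and applies the whole batch under it.
    static void addMetrics(PlainStore store, Map<String, String> batch) {
        synchronized (store) {
            for (Map.Entry<String, String> entry : batch.entrySet()) {
                store.add(entry.getKey(), entry.getValue());
            }
        }
    }

    // Readers must take the same monitor to see a consistent view.
    static int size(PlainStore store) {
        synchronized (store) {
            return store.metrics.size();
        }
    }

    public static void main(String[] args) {
        PlainStore store = new PlainStore();
        Map<String, String> batch = new LinkedHashMap<>();
        batch.put("jobmanager.heapUsed", "256");
        batch.put("taskmanager.tm-1.heapUsed", "512");
        addMetrics(store, batch);
        System.out.println(size(store));
    }
}
```

One lock acquisition covers the entire batch, but correctness depends on every reader and writer remembering to synchronize on the same object.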


---

[GitHub] flink issue #4472: FLINK-7368: MetricStore makes cpu spin at 100%

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/4472
  
    I do see the benefit of a more fine-grained synchronization. What I dislike about using plain ConcurrentHashMaps everywhere is that accesses to the metrics 99% of the time are done in batches, and I don't really want to pay the synchronization cost every time.
    
    That said, the whole "you have to synchronize manually" is a rather big source of bugs so we may just have to bite the bullet.


---

[GitHub] flink pull request #4472: FLINK-7368: MetricStore makes cpu spin at 100%

Posted by asdf2014 <gi...@git.apache.org>.
Github user asdf2014 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4472#discussion_r132358492
  
    --- Diff: flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/metrics/MetricStore.java ---
    @@ -24,8 +24,8 @@
     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;
     
    -import java.util.HashMap;
    -import java.util.HashSet;
    +import java.util.concurrent.ConcurrentHashMap;
    +import java.util.concurrent.ConcurrentSkipListSet;
    --- End diff --
    
    Hi, @nicochen. Thank you for the `PR`. There is an import order problem: you should change the order of those imports as in the following code. Otherwise it will not pass the `checkstyle` check.
    ```java
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentSkipListSet;
    ```


---