Posted to dev@zookeeper.apache.org by "Bogdan Kanivets (JIRA)" <ji...@apache.org> on 2018/07/19 06:26:00 UTC

[jira] [Commented] (ZOOKEEPER-3046) testManyChildWatchersAutoReset is flaky

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548844#comment-16548844 ] 

Bogdan Kanivets commented on ZOOKEEPER-3046:
--------------------------------------------

[~hanm]

[~phunt]

I've done some digging into the performance of this test by adding more system-metric and timing logging.

I've tested using a throwaway [PR|https://github.com/apache/zookeeper/pull/538] to collect data in the Apache CI system. The PR adds a fake 'testCreateManyChildren' test that just creates 10k nodes.
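
For context, the fake test is nothing more than a tight create loop. A minimal sketch of its shape (illustrative only; the real code is in the PR, and the parent path and harness details below are assumptions):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.junit.Test;

    // Assumed to live in a JUnit test class extending
    // org.apache.zookeeper.test.ClientBase, which provides createClient().
    @Test
    public void testCreateManyChildren() throws Exception {
        ZooKeeper zk = createClient();
        zk.create("/big", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        for (int i = 0; i < 10000; i++) {
            zk.create("/big/node-" + i, new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }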

I've also tested locally by running all the tests in a loop and recording all logs until the disk filled up.

My hypothesis:

some other write-heavy test or build running on the same host (maybe Hadoop tests) slows down the creation of the 10K children.

Here the fake test took 10s:

[https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1912/testReport/]

Here the fake test took 3m 20s:

[https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1907/testReport/]
 # I don't think there is a deadlock: I'm logging the server's 'stat' output after every 1K nodes created (see the sketch after this list), and the latency is fairly consistent within a run:
 ** "Latency min/avg/max: 2/32/596" on slow
 ** "Latency min/avg/max: 0/0/25" on fast
 # Locally, when the disk is getting full, I've seen this test slow down.
 # I've tried logging vmstat output, but it wasn't very helpful:
 ** the load average is a bit higher on slow runs: "load average: 12.54, 12.04, 10.95"
 ** "load average: 4.16, 4.56, 4.86" on fast
 # Since iostat isn't available, I've logged the contents of /proc/diskstats. The numbers aren't very meaningful yet because the samples are taken at unevenly spaced timestamps, but one observation is that column 12 ("I/Os currently in progress") is usually 1 or 2 on slow runs and usually 0 on fast ones.
 # Looking at column 9 of /proc/diskstats (writes completed) for the 'sda' device and calculating writes per ms from the LOG.info timestamps (worked example after this list):
 ** 0.18 writes/ms on slow
 ** 9.1 writes/ms on fast
 ** warning: this calculation shouldn't be trusted yet; I've only done it on one sample
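
To make the writes-per-ms arithmetic concrete (the numbers here are made up purely for illustration): if sda's "writes completed" counter reads 5,000,000 at one LOG.info timestamp and 5,005,400 at a timestamp 30,000 ms later, the rate is (5,005,400 - 5,000,000) / 30,000 = 0.18 writes/ms, i.e. the slow-host figure above.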
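
For reference, the 'stat' samples in the first item are just ZooKeeper's four-letter-word command sent to the test server. A minimal sketch of the sampling inside the create loop, where 'clientPort' and 'LOG' are assumed to come from the test harness:

    // Sample the server's 'stat' output (which includes the latency line)
    // after every 1K creates; FourLetterWordMain ships with ZooKeeper.
    if (i > 0 && i % 1000 == 0) {
        String stat = org.apache.zookeeper.client.FourLetterWordMain
                .send4LetterWord("127.0.0.1", clientPort, "stat");
        LOG.info("stat after {} creates:\n{}", i, stat);
    }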

Based on this, I've tried to pinpoint specific slow Apache Jenkins hosts, but wasn't successful; I don't see a pattern. I also tried looking at jobs that run at the same time, again with no luck.

Possible next steps:
 # Try running this test with memory-mapped files
 # Somehow pin a host just for ZooKeeper builds
 # Get access to the box and do better profiling
 # Write a better version of the /proc/diskstats stats collector (rough sketch below)
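
On the last point, a rough sketch of how such a collector could work, as hypothetical standalone code (not in the repo; field indexes follow the kernel's Documentation/iostats.txt):

    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Hypothetical sampler: print completed writes per ms for one device by
    // diffing successive /proc/diskstats snapshots at a fixed interval.
    public class DiskStatsSampler {
        public static void main(String[] args) throws Exception {
            String device = args.length > 0 ? args[0] : "sda";
            long intervalMs = 5000;
            long prevWrites = -1;
            long prevTime = -1;
            while (true) {
                for (String line : Files.readAllLines(Paths.get("/proc/diskstats"))) {
                    String[] f = line.trim().split("\\s+");
                    if (f.length < 12 || !f[2].equals(device)) {
                        continue;
                    }
                    long now = System.currentTimeMillis();
                    long writes = Long.parseLong(f[7]);    // writes completed
                    long inFlight = Long.parseLong(f[11]); // I/Os currently in progress
                    if (prevWrites >= 0) {
                        double writesPerMs =
                                (writes - prevWrites) / (double) (now - prevTime);
                        System.out.printf("%s: %.2f writes/ms, %d I/Os in progress%n",
                                device, writesPerMs, inFlight);
                    }
                    prevWrites = writes;
                    prevTime = now;
                }
                Thread.sleep(intervalMs);
            }
        }
    }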


> testManyChildWatchersAutoReset is flaky
> ---------------------------------------
>
>                 Key: ZOOKEEPER-3046
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3046
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.5.3, 3.4.12
>            Reporter: Bogdan Kanivets
>            Assignee: Bogdan Kanivets
>            Priority: Minor
>              Labels: flaky, pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> According to the [dashboard|https://builds.apache.org/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html] testManyChildWatchersAutoReset is flaky in 3.4 and 3.5
> [ZooKeeper_branch34_java10|https://builds.apache.org/job/ZooKeeper_branch34_java10//13]
> [ZooKeeper_branch35_java9|https://builds.apache.org/job/ZooKeeper_branch35_java9/253]
> The test times out, and because of that ant doesn't capture any output.


