Posted to issues@drill.apache.org by "Timothy Farkas (JIRA)" <ji...@apache.org> on 2018/05/02 16:06:00 UTC

[jira] [Commented] (DRILL-6380) Mongo db storage plugin tests can hang on jenkins.

    [ https://issues.apache.org/jira/browse/DRILL-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461232#comment-16461232 ] 

Timothy Farkas commented on DRILL-6380:
---------------------------------------

Doing the following seems to fix the tests on Jenkins:

 1. Put the replica sets into a TreeMap instead of a HashMap (see the sketch after this list).
 2. Flapdoodle iterates over the entry set of that map. With a TreeMap, the config servers are guaranteed to be the first entry Flapdoodle iterates over.
 3. This guarantees that when Flapdoodle starts the replica sets, the config servers are started first.
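
For illustration, here is a minimal sketch of the ordering idea. The key names and values are hypothetical, not Drill's actual test code; the point is only the iteration-order difference between the two map types:

{code}
import java.util.Map;
import java.util.TreeMap;

public class ReplicaSetOrdering {
    public static void main(String[] args) {
        // A TreeMap iterates its entries in sorted key order, so a key
        // that sorts first ("config" here) is always visited first.
        // A HashMap makes no iteration-order guarantee at all.
        Map<String, String> replicaSets = new TreeMap<>();
        replicaSets.put("shard_1", "shard 1 members");
        replicaSets.put("config", "config server members");
        replicaSets.put("shard_2", "shard 2 members");

        replicaSets.forEach((name, members) ->
            System.out.println("starting " + name));
        // prints: starting config, starting shard_1, starting shard_2
    }
}
{code}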

I suspect this works because the config servers get more time to fully initialize and create necessary data, like the lockpings document, when they are started first. This is obviously fragile, but Flapdoodle doesn't give us a way to pause until a server is completely booted, and even internally Flapdoodle uses Thread.sleep calls to wait for things to start up. We should probably file some issues with Flapdoodle to get these things cleaned up. The kind of readiness wait we'd want looks roughly like the sketch below.
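
A minimal sketch of such a wait, assuming the synchronous MongoDB Java driver is on the classpath (this helper is hypothetical, not a Flapdoodle or Drill API). Note that a successful ping only shows the server is accepting commands; it does not prove that sharding metadata like the lockpings document exists yet, which is the deeper problem described above:

{code}
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class MongoReadiness {
    // Hypothetical helper: poll the server with "ping" until it answers
    // or we give up. Passing a short serverSelectionTimeoutMS in the URI
    // (e.g. "mongodb://localhost:27017/?serverSelectionTimeoutMS=1000")
    // keeps each failed attempt quick.
    public static void waitUntilUp(String uri) throws InterruptedException {
        try (MongoClient client = MongoClients.create(uri)) {
            for (int attempt = 0; attempt < 30; attempt++) {
                try {
                    client.getDatabase("admin")
                          .runCommand(new Document("ping", 1));
                    return; // the server is accepting commands
                } catch (RuntimeException e) {
                    Thread.sleep(500); // not reachable yet, retry
                }
            }
            throw new IllegalStateException("mongod never came up: " + uri);
        }
    }
}
{code}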

> Mongo db storage plugin tests can hang on jenkins.
> --------------------------------------------------
>
>                 Key: DRILL-6380
>                 URL: https://issues.apache.org/jira/browse/DRILL-6380
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>            Priority: Major
>
> When running on our Jenkins server, the MongoDB tests hang because the config servers take up to 5 seconds to process each request (see *Error 1*). This causes the tests to never finish within a reasonable span of time. Online reports tie this problem to mixing versions of MongoDB, but that is not happening in our tests. A possible cause is *Error 2*, which seems to indicate that the MongoDB config servers are not completely initialized, since the config servers should have a lockping document after starting up.
> *Error 1*
> {code}
> [mongod output] 2018-05-01T23:38:47.468-0700 I COMMAND  [replSetDistLockPinger] command config.lockpings command: findAndModify { findAndModify: "lockpings", query: { _id: "ConfigServer" }, update: { $set: { ping: new Date(1525243123413) } }, upsert: true, writeConcern: { w: "majority", wtimeout: 15000 } } planSummary: IDHACK update: { $set: { ping: new Date(1525243123413) } } keysExamined:0 docsExamined:0 nMatched:0 nModified:0 upsert:1 keysInserted:2 numYields:0 reslen:198 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 4055ms
> [mongod output] 2018-05-01T23:38:47.469-0700 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document
> [mongod output] 2018-05-01T23:38:47.498-0700 I SHARDING [Balancer] lock 'balancer' successfully forced
> [mongod output] 2018-05-01T23:38:47.498-0700 I SHARDING [Balancer] distributed lock 'balancer' acquired, ts : 5ae95cd5d1023488104e6282
> [mongod output] 2018-05-01T23:38:47.498-0700 I SHARDING [Balancer] CSRS balancer thread is recovering
> [mongod output] 2018-05-01T23:38:47.498-0700 I SHARDING [Balancer] CSRS balancer thread is recovered
> [mongod output] 2018-05-01T23:38:48.056-0700 I NETWORK  [thread2] connection accepted from 127.0.0.1:50244 #10 (7 connections now open)
> {code}
> *Error 2*
> {code}
> [mongod output] 2018-05-01T23:39:37.690-0700 I COMMAND  [conn7] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1525243172000|1, t: 1 } }, limit: 1, maxTimeMS: 30000 } planSummary: EOF keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:0 nreturned:0 reslen:354 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4988ms
> {code}


