Posted to issues@lucene.apache.org by "Ilan Ginzburg (Jira)" <ji...@apache.org> on 2020/06/01 09:25:00 UTC

[jira] [Commented] (SOLR-14524) Harden MultiThreadedOCPTest

    [ https://issues.apache.org/jira/browse/SOLR-14524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120884#comment-17120884 ] 

Ilan Ginzburg commented on SOLR-14524:
--------------------------------------

[https://github.com/apache/lucene-solr/pull/1548] does three things:
* Makes the test wait until processing has started in the Overseer, then checks that processing has not completed yet,
* Fails with a meaningful message when the test did not really fail but the execution order made it impossible to run the test correctly,
* Significantly increases the runtime of the "long" task (from 1 to 10 seconds), while removing the wait for that task to complete. The idea is to further reduce the timing issues that cause test failures without slowing down the test (the test is likely faster now than it was, although the specific subtest changed here accounts for only a small fraction of total test runtime).
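
To illustrate the three points above, here is a minimal sketch of the synchronization pattern using plain java.util.concurrent. It is not the code from the pull request: the class {{HardenedOrderingSketch}}, the method {{run}}, the 5 second timeouts and the messages are all hypothetical, and a two-thread pool stands in for the Overseer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test logic; not the actual MultiThreadedOCPTest code.
final class HardenedOrderingSketch {

    /**
     * Returns "ok" when the short task completed while the long task was still
     * running, or a message explaining why the run could not prove anything.
     */
    static String run(long longTaskMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch longStarted = new CountDownLatch(1);
        try {
            Future<?> longTask = pool.submit(() -> {
                longStarted.countDown(); // signal that processing has started
                try {
                    Thread.sleep(longTaskMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // 1. Wait for the long task to start, then check it is still running.
            if (!longStarted.await(5, TimeUnit.SECONDS)) {
                return "inconclusive: long task did not start in time";
            }
            if (longTask.isDone()) {
                return "inconclusive: long task finished before the short task was enqueued";
            }

            // 2. Enqueue the short task and wait only for it, never for the long task.
            Future<?> shortTask = pool.submit(() -> { });
            shortTask.get(5, TimeUnit.SECONDS);

            // 3. The ordering is only proven if the long task is still running now.
            if (longTask.isDone()) {
                return "inconclusive: long task finished before the ordering check";
            }
            return "ok";
        } catch (Exception e) {
            return "error: " + e;
        } finally {
            pool.shutdownNow(); // interrupts the long task so the caller does not wait for it
        }
    }
}
```

With a long task of 10 seconds, {{HardenedOrderingSketch.run(10_000)}} returns well before those 10 seconds have elapsed, because the long task is interrupted on shutdown rather than awaited.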

> Harden MultiThreadedOCPTest
> ---------------------------
>
>                 Key: SOLR-14524
>                 URL: https://issues.apache.org/jira/browse/SOLR-14524
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (9.0)
>            Reporter: Ilan Ginzburg
>            Priority: Minor
>              Labels: test
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{MultiThreadedOCPTest.test()}} fails occasionally in Jenkins because of the timing of task enqueues to the Collection API queue.
> In {{testFillWorkQueue()}}, this test enqueues a large number of tasks (115, more than the 100 Collection API parallel executors) to the Collection API queue for a collection COLL_A, then waits for a short delay and enqueues a task for another collection COLL_B.
>  It verifies that the COLL_B task (that does not require the same lock as the COLL_A tasks) completes before the third COLL_A task.
> Test failures happen because, when enqueues are slowed down enough, the first 3 tasks on COLL_A complete even before the COLL_B task gets enqueued!
> In one sample failed Jenkins test execution, the COLL_B task enqueue happened 1275ms after the enqueue of the first COLL_A task, leaving plenty of time for a few (and possibly all) COLL_A tasks to complete.
> The fix will be along the lines of:
>  * Make the “blocking” COLL_A task longer to execute (currently 1 second) to compensate for slow enqueues.
>  * Verify that the COLL_B task (a 1ms task) finishes before the long-running COLL_A task does. This would be a good indication that even though the collection queue was filled with tasks waiting for a busy lock, a non-competing task was picked and executed right away.
>  * Delay the enqueue of the COLL_B task to the end of processing of the first COLL_A task. This would guarantee that COLL_B is enqueued once at least some COLL_A tasks started processing at the Overseer. Possibly also verify that the long running task of COLL_A didn't finish execution yet when the COLL_B task is enqueued...
>  * It might be possible to set a (very) long duration for the slow task of COLL_A (to be less vulnerable to execution delays) without requiring the test to wait for that task to complete, but only wait for the COLL_B task to complete (so the test doesn't run for too long).
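
The behavior that the quoted description wants the test to verify (a non-competing task being picked even though the queue is full of tasks waiting on a busy lock) can be modeled with a small, single-threaded toy. This is only a sketch under assumptions, not how the Overseer is implemented: {{LockAwareQueueSketch}}, {{Task}}, {{pickNext}} and {{demo}} are hypothetical names, and a set of busy collection names stands in for the per-collection lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Toy model only; the real Overseer task processing is more involved.
final class LockAwareQueueSketch {

    static final class Task {
        final String id;
        final String collection;

        Task(String id, String collection) {
            this.id = id;
            this.collection = collection;
        }
    }

    /** Removes and returns the first queued task whose collection is not busy, or null if none. */
    static Task pickNext(Deque<Task> queue, Set<String> busyCollections) {
        for (Iterator<Task> it = queue.iterator(); it.hasNext();) {
            Task t = it.next();
            if (!busyCollections.contains(t.collection)) {
                it.remove();
                return t;
            }
        }
        return null;
    }

    /** Builds the scenario from the description and returns the ids of the first two tasks picked. */
    static String demo() {
        Deque<Task> queue = new ArrayDeque<>();
        for (int i = 1; i <= 115; i++) {
            queue.add(new Task("A" + i, "COLL_A"));
        }
        queue.add(new Task("B1", "COLL_B")); // enqueued after all COLL_A tasks

        Set<String> busy = new HashSet<>();
        Task first = pickNext(queue, busy);  // first COLL_A task takes the COLL_A lock
        busy.add(first.collection);
        Task second = pickNext(queue, busy); // remaining COLL_A tasks are skipped
        return first.id + "," + second.id;
    }
}
```

In this model {{demo()}} returns "A1,B1": once the first COLL_A task holds the lock, the COLL_B task is picked ahead of the 114 COLL_A tasks still queued in front of it.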


