Posted to dev@devlake.apache.org by Zhenmian Huang <zh...@merico.dev.INVALID> on 2022/05/30 03:26:26 UTC

[DISCUSSION] refactor api_collector to prevent deadlock once and for all (if possible)

Hi, Houston, we have a problem


  The source code of `plugins/helper/api_collector.go` has grown to 
around 500 lines, partly because of the complex nature of the problems 
it tries to solve:

 1. Most of the APIs we fetch are rate-limited, so we have to control
    the frequency of the requests we send out.
 2. To fetch as fast as possible, e.g. multiple pages of jira issues,
    we have to parallelize requests, without exceeding the `ratelimit`
    of course. This part is easy enough: with an asynchronous task
    queue, or pool if you will, we can do both (see the sketch after
    this list).
 3. We have to retry a couple of times when a request fails. With the
    previous 2 points in mind, things get messy, because a retry is a
    Nested Async Task, which can deadlock a SINGLE task queue unless
    the logic is handled carefully.
 4. If that wasn't hard enough, allow me to lay out another factor:
    some fetching tasks depend on another table. To fetch jira
    changelogs, for example, we have to iterate over jira issues and
    fetch the changelogs for each one of them, AND we don't want to
    load them all into memory at once, because we might operate on
    some kind of BIG DATA SYSTEM in the near future; that could easily
    eat up all available memory and crash the host right away, and
    even if the host survived the hit, a GC language like golang would
    likely suffer a heavy performance penalty.
 5. Another small factor: we have to support task cancellation.
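
To make points 1 and 2 concrete, here is a minimal, self-contained 
sketch (NOT the actual `api_collector.go` code) of how a worker pool 
and a rate limit can be combined: an `ants` pool bounds parallelism, 
while a shared ticker gates how often a request may start. The URL and 
the 100ms interval are made up for illustration.

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"

    "github.com/panjf2000/ants/v2"
)

func main() {
    // a pool of 10 workers bounds how many requests run in parallel
    pool, _ := ants.NewPool(10)
    defer pool.Release()

    // a shared ticker gates the global request rate: at most 1 request per 100ms
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()

    var wg sync.WaitGroup
    for page := 1; page <= 5; page++ {
        page := page
        wg.Add(1)
        _ = pool.Submit(func() {
            defer wg.Done()
            <-ticker.C // wait for a rate-limit slot before firing the request
            res, err := http.Get(fmt.Sprintf("https://example.com/issues?page=%d", page))
            if err != nil {
                return
            }
            res.Body.Close()
        })
    }
    wg.Wait()
}

This only covers throttling and parallelism; retry, nested 
dependencies and cancellation are where things start to bite, as 
described above.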

  Now, in our current implementation, we have `ants` acting as the 
task queue, or pool if you will; it takes care of parallelization with 
a certain number of `workers` that run `func`s for us. As for the 
`ratelimit`, we have a `scheduler` to control request frequency. This 
is relatively plain and simple. Let's assume we have to fetch multiple 
pages of jira issues (pseudo-code):


func fetchPages() {
    // Submit enqueues a task; it will be executed later, when the timing is right
    // NOTICE: it blocks if the queue is full
    scheduler.Submit(func() { // let's name this callback Foo
        res := http.Get(apiUrl)
        saveResBodyToDb(res.Body)
        fetchRestPages(res.Body.totalPages)
    })
}

func fetchRestPages(totalPages int) {
    for page := 2; page <= totalPages; page++ {
        scheduler.Submit(func() { // and this callback is Bar
            res := http.Get(apiUrl)
            saveResBodyToDb(res.Body)
        })
    }
}


This seems OK, but it is NOT! What if the queue/pool has only ONE 
worker? `Foo` would occupy that worker and block inside `Submit` once 
the queue filled up, while `Bar` would never get a chance to run, 
because `Foo` wouldn't let go of the only worker.

Fixing it is relatively easy:

 1. breaking the chain: `Foo` / `Bar` do NOT call `Submit` directly;
    instead they wrap the call in a `goroutine`, so the worker is
    released right away (see the sketch below)
 2. infinite queue: make `Submit` non-blocking, which is NOT practical,
    since memory is not infinite
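
For fix 1 the change is tiny: the nested enqueue happens in a fresh 
goroutine, so the worker running `Foo` returns right away instead of 
blocking on a full queue. Still pseudo-code, same made-up `scheduler` 
as above:

func fetchPages() {
    scheduler.Submit(func() { // Foo
        res := http.Get(apiUrl)
        saveResBodyToDb(res.Body)
        // break the chain: enqueue the remaining pages from a fresh goroutine,
        // so this worker is released even while the queue is full
        go fetchRestPages(res.Body.totalPages)
    })
}

It works, but every call site has to remember to do it, which is 
exactly the kind of discipline that tends to slip.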

Now, why do I bother bringing this up if we already know the problem 
and how to solve it? Because everything I laid out above is just the 
tip of the iceberg. This pattern has appeared MANY TIMES, at DIFFERENT 
TIMES, in MULTIPLE PLACES. That is what makes it so sly and hard to 
spot: it could easily slip through our hands and cause INCIDENTS.

What I want is to prevent this kind of thing from happening in the 
first place, not just to fix the deadlock itself. I don't have a 
perfect plan at this point, only some vague approaches:

 1. Testing phase: adopt some kind of static or dynamic method to
    detect direct calls, via a unit test or something similar
 2. Compile/coding phase: a brand new mechanism that does NOT exhibit
    this kind of problem at all

I have been pondering this for a while, and I don't think approach 2 
is easy, even though I crave it. I would suggest we go with approach 1 
for now, until we figure out something better; a rough sketch of what 
such a check could look like follows.
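
To give approach 1 a bit of shape, here is one possible dynamic check. 
This is purely a sketch: `checkedScheduler`, its `inner` field, the 
`scheduler` interface and the package placement are all made up, and 
the goroutine-id parsing is a well-known hack. The idea is that the 
scheduler remembers which goroutines are its own workers, and a test 
build panics when `Submit` is called directly from one of them.

package helper

import (
    "runtime"
    "strconv"
    "strings"
    "sync"
)

// hypothetical interface of our real scheduler
type scheduler interface {
    Submit(task func())
}

// checkedScheduler is a test-only wrapper that detects direct nested Submit calls
type checkedScheduler struct {
    inner   scheduler
    workers sync.Map // goroutine id -> true while that goroutine runs a submitted task
}

// goroutineID parses the current goroutine id from the stack header,
// e.g. "goroutine 42 [running]:"
func goroutineID() uint64 {
    buf := make([]byte, 64)
    n := runtime.Stack(buf, false)
    fields := strings.Fields(string(buf[:n]))
    id, _ := strconv.ParseUint(fields[1], 10, 64)
    return id
}

func (s *checkedScheduler) Submit(task func()) {
    if _, fromWorker := s.workers.Load(goroutineID()); fromWorker {
        // a worker is enqueueing directly: this is exactly the deadlock-prone pattern
        panic("Submit called directly from a worker goroutine, wrap it in `go ...`")
    }
    s.inner.Submit(func() {
        id := goroutineID()
        s.workers.Store(id, true)
        defer s.workers.Delete(id)
        task()
    })
}

The point is that a direct nested call would fail loudly in tests 
instead of occasionally deadlocking at runtime.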

You are more than welcome to give it a try, but keep in mind that all 
the factors I listed at the top must be considered.


   It is long and abstract, thank you for reading.   😃

   Best



Klesh Wong

2022-05-28