Posted to dev@devlake.apache.org by Zhenmian Huang <zh...@merico.dev.INVALID> on 2022/05/30 03:26:26 UTC
[DISCUSSION] refactor api_collector to prevent deadlock once and for all (if possible)
Hi, Houston, we have a problem.
The source code of `plugins/helper/api_collector.go` has grown to around
500 lines, partly because of the complex nature of the problems it tries
to solve:
1. Most of the APIs we fetch have a `ratelimit`, so we have to control
   the frequency of the requests we send out.
2. To fetch as fast as possible, e.g. fetching multiple pages of Jira
   issues, we have to parallelize requests, without exceeding the
   `ratelimit` of course. This much is easy to solve: with an
   asynchronous task queue, or pool if you will, we can do both.
3. We have to retry a couple of times when a request fails. With the
   previous 2 points in mind, things get messy, because a retry is a
   Nested Async Task, which can lead to a deadlock on a SINGLE task
   queue unless the logic is handled very carefully.
4. If that wasn't hard enough, allow me to lay out another factor: some
   fetching tasks depend on another table. For example, to fetch Jira
   changelogs, we have to iterate over Jira issues and fetch the
   changelogs for each one of them, AND we don't want to load all the
   issues into memory at once, because we might operate on some kind of
   BIG DATA SYSTEM in the near future. That could easily eat up all
   available memory and crash the host right away, and even if the host
   survived the hit, with a GC language like Go we would likely suffer a
   heavy performance penalty.
5. Another small factor: we have to support task cancellation.
Now, in our current implementation, we have `ants` acting as the task
queue, or pool if you will; it takes care of parallelization by running
`func`s on a certain number of `workers` for us. As for the `ratelimit`,
we have a `scheduler` to control request frequency. This is relatively
plain and simple. Let's assume that we have to fetch multiple pages of
Jira issues (pseudo-code):
func fetchPages() {
    // Submit enqueues a task to be executed later, when the timing is right.
    // NOTICE: it will block if the queue is full.
    scheduler.Submit(func() { // let's name this callback Foo
        res := http.Get(apiUrl)
        saveResBodyToDb(res.Body)
        fetchRestPages(res.Body.totalPages)
    })
}

func fetchRestPages(totalPages) {
    for page := 2; page <= totalPages; page++ {
        scheduler.Submit(func() { // and this callback is Bar
            res := http.Get(apiUrl)
            saveResBodyToDb(res.Body)
        })
    }
}
This seems OK, but it is NOT! What if the queue/pool has only ONE
worker? `Foo` would occupy the worker until its nested `Submit` calls
return, and `Bar` would never get to run because `Foo` wouldn't let go
of the worker.
Fixing it is relatively easy:
1. break the chain: make sure `Foo` and `Bar` never call `Submit`
   directly, by wrapping the call in a `goroutine`
2. infinite queue: make `Submit` non-blocking, which is NOT practical,
   since memory is not infinite
Now, why do I bother to bring this up when we already know the problem
and how to solve it? Because everything I laid out above is just the tip
of the iceberg. This pattern has appeared MANY TIMES, at DIFFERENT
times, in MULTIPLE PLACES. That reveals its sly, hard-to-spot nature. It
could easily slip through our hands and cause INCIDENTS.
What I want is to prevent such a thing from ever happening again, not
just to fix the deadlock problem itself. I don't have a perfect plan at
this point, but here are some vague approaches:
1. Testing phase: adopt some kind of static or dynamic method to detect
   direct `Submit` calls, via unit tests or something else
2. Compile/coding phase: a brand new mechanism that does NOT exhibit
   this kind of problem at all
I have been pondering this for a while, and I don't think approach 2 is
easy, even though I crave it. I would suggest that we go with approach 1
for now, until we figure out something better.
You are more than welcome to give it a try, but keep in mind that all
the factors I listed at the top must be considered.
It is long and abstract, thank you for reading. 😃
Best
Klesh Wong
2022-05-28