Posted to issues@lucene.apache.org by "Houston Putman (Jira)" <ji...@apache.org> on 2020/08/19 21:27:04 UTC
[jira] [Updated] (SOLR-14713) Single thread on streaming updates

     [ https://issues.apache.org/jira/browse/SOLR-14713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Houston Putman updated SOLR-14713:
----------------------------------
    Security:     (was: Public)

> Single thread on streaming updates
> ----------------------------------
>
>                 Key: SOLR-14713
>                 URL: https://issues.apache.org/jira/browse/SOLR-14713
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Or greatly simplify SolrCmdDistributor
> h2. How Solr currently fans out updates
> Currently, on receiving an updateRequest, Solr creates a new chain of UpdateProcessors to handle that request, then parses the documents from the request one by one and lets the processors handle each of them.
> {code:java}
> onReceiving(UpdateRequest update):
>   processors = createNewProcessors();  // a fresh processor chain per request
>   for (Document doc : update) {
>     processors.handle(doc);
>   }
> {code}
> Say the current shard has N replicas. The update processor then creates N-1 queues and N-1 runners, one pair for each of the other replicas.
>  A runner is essentially a thread that dequeues updates from its queue and sends them to the corresponding replica endpoint.
> Note 1: all runners share the same client, and therefore the same connection pool and the same thread pool.
>  Note 2: a runner sends all documents of its UpdateRequest in a single HTTP POST request (to reduce the number of threads handling requests on the receiving side). Its lifetime therefore equals the total time spent handling that UpdateRequest. Below is a typical activity that happens in a runner's life cycle.
> h2. Problems of the current approach
> The current approach has two problems:
>  - Problem 1: it uses many threads to fan out requests.
>  - Problem 2, which is more important: it is very complex. Solr uses ConcurrentUpdateSolrClient (CUSC for short) for the fan-out, and the CUSC implementation allows multiple runners to share a single queue (although we use at most one runner). This raises the complexity of the whole flow, all the way up to the top. A fix for one problem can introduce several new ones. For example, in SOLR-13975, while trying to handle the case where the other endpoint hangs for a long time, we introduced a bug that lets the runner keep running even after the updateRequest has been fully handled on the leader.
> h2. Doing everything in a single thread
> Since we already support sending requests asynchronously, why not let the main thread that handles the update request send the updates to all the other replicas itself, without runners or queues? The code would look something like this:
> {code:java}
>  class UpdateProcessor:
>    Map<String, OutputStream> pendingOutStreams  // one open stream per replica
>    
>    func handleAddDoc(doc):
>       for (replica : replicas):
>          pendingOutStreams.get(replica).send(doc)
>    
>    func onEndUpdateRequest():
>       pendingOutStreams.values().forEach(out -> closeAndHandleResponse(out)){code}
>  
> By doing this we will use fewer threads, and the code will be much simpler and cleaner. Of course, handling a single updateRequest will take somewhat longer, since we send to the replicas serially instead of concurrently. More formally:
> {code:java}
>  oldTime = timeForIndexing(update) + timeForSendingUpdates(update)
>  newTime = timeForIndexing(update) + (N-1) * timeForSendingUpdates(update){code}
> However, I believe timeForIndexing is much larger than timeForSendingUpdates, so this should not be a real concern. Even if it does become a problem, users can simply use more threads for indexing.
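
The single-threaded fan-out proposed in the quoted description can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not Solr code: the class name SingleThreadFanOut, the use of String documents, and the in-memory streams standing in for HTTP request bodies are all hypothetical, and response handling is reduced to closing each stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the proposed update processor; not Solr's actual API. */
final class SingleThreadFanOut {
    // One pending output stream per replica, keyed by a (hypothetical) replica name.
    private final Map<String, OutputStream> pendingOutStreams = new LinkedHashMap<>();

    SingleThreadFanOut(Map<String, OutputStream> replicaStreams) {
        pendingOutStreams.putAll(replicaStreams);
    }

    /** Writes the document to every replica stream, one after another, on the calling thread. */
    void handleAddDoc(String doc) {
        byte[] bytes = (doc + "\n").getBytes(StandardCharsets.UTF_8);
        for (OutputStream out : pendingOutStreams.values()) {
            try {
                out.write(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Closes every stream; a real implementation would also read each replica's response here. */
    void onEndUpdateRequest() {
        for (OutputStream out : pendingOutStreams.values()) {
            try {
                out.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        pendingOutStreams.clear();
    }

    public static void main(String[] args) {
        // In-memory streams stand in for the open request bodies to two other replicas.
        Map<String, OutputStream> streams = new LinkedHashMap<>();
        streams.put("replica1", new ByteArrayOutputStream());
        streams.put("replica2", new ByteArrayOutputStream());

        SingleThreadFanOut processor = new SingleThreadFanOut(streams);
        for (String doc : List.of("doc1", "doc2")) {
            processor.handleAddDoc(doc);
        }
        processor.onEndUpdateRequest();

        streams.forEach((name, out) ->
            System.out.println(name + " received " + out.toString().trim().replace("\n", ",")));
    }
}
```

No queue or runner thread appears anywhere in the sketch: the thread that parses the request is also the one that writes to each replica, which is the simplification the proposal is after.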

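To make the two timing formulas from the quoted description concrete, here is a small worked example. The class name FanOutTiming and the input numbers (100 ms for indexing, 10 ms for sending to one replica, N = 3 replicas) are hypothetical values chosen only for illustration, not measurements.

```java
/** Worked example of the oldTime / newTime formulas; all inputs are hypothetical. */
final class FanOutTiming {
    // Current approach: updates go to all replicas concurrently, so sending costs one unit.
    static long oldTime(long indexingMs, long sendingMs) {
        return indexingMs + sendingMs;
    }

    // Proposed approach: the single thread sends to each of the N-1 other replicas in turn.
    static long newTime(long indexingMs, long sendingMs, int replicas) {
        return indexingMs + (long) (replicas - 1) * sendingMs;
    }

    public static void main(String[] args) {
        long indexingMs = 100; // hypothetical timeForIndexing(update)
        long sendingMs = 10;   // hypothetical timeForSendingUpdates(update)
        int replicas = 3;      // hypothetical N

        System.out.println("oldTime = " + oldTime(indexingMs, sendingMs));           // oldTime = 110
        System.out.println("newTime = " + newTime(indexingMs, sendingMs, replicas)); // newTime = 120
    }
}
```

With these assumed numbers the serial approach costs about 9% more per request. The gap grows with N and with the ratio of sending time to indexing time, which is why the argument depends on indexing dominating.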

