Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2021/01/04 08:06:00 UTC

[jira] [Commented] (SOLR-14923) Indexing performance is unacceptable when child documents are involved

    [ https://issues.apache.org/jira/browse/SOLR-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258050#comment-17258050 ] 

David Smiley commented on SOLR-14923:
-------------------------------------

I've been *obsessed* with this issue over the whole holiday break.  My PR is a bit big; I'd prefer to show more isolated, and thus easier to review, changes.  I've actually gone further with some internal improvements but thankfully managed to shelve them for another issue.

As it stands, my PR rather insists that the user _somehow_ tell Solr what the root ID is when doing an atomic update to a child doc; it will complain if you don't.  Solr currently doesn't insist: it will figure out the root ID at some cost in performance, namely a re-open of a realtime searcher and more.  That's the root of the whole matter.  I think Solr ought to insist, and I think it's worth making this a breaking change.

To reduce my PR size, I spent some time today _trying_ to extricate the improvement to update resolution in AtomicUpdateDocumentMerger that I did a week ago, which nicely means there's no longer a use-case for setting stored=true on the root field.  But everything is interrelated, and that change doesn't quite work / isn't safe on its own.  Despite the PR being bigger than I'd like, I'm really happy with it.  It needs a bit more work to un-document the stored=true requirement for the root field, and also some important upgrade notes, but it is otherwise in committable shape IMO.

 

CC [~hossman] and [~ichattopadhyaya] – you both did herculean work years ago on getting the in-place-update functionality in, which overlaps a lot with the code in my PR.  You may want to review this PR.  

> Indexing performance is unacceptable when child documents are involved
> ----------------------------------------------------------------------
>
>                 Key: SOLR-14923
>                 URL: https://issues.apache.org/jira/browse/SOLR-14923
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update, UpdateRequestProcessors
>    Affects Versions: 8.3, 8.4, 8.5, 8.6, 8.7, master (9.0)
>            Reporter: Thomas Wöckinger
>            Priority: Critical
>              Labels: performance, pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Parallel indexing does not make sense at the moment when child documents are used.
> At the end of its doVersionAdd method, org.apache.solr.update.processor.DistributedUpdateProcessor checks whether the Ulog caches should be refreshed.
> This check returns true if any child document is included in the AddUpdateCommand.
> If so, ulog.openRealtimeSearcher() is called. This call is very expensive and is executed in a block synchronized on the UpdateLog instance, so all other operations on the UpdateLog are blocked too.
> Because every important UpdateLog method (add, delete, ...) runs in a synchronized block, almost every operation is blocked.
> This reduces multi-threaded index updates to single-threaded behavior.
> The described behavior does not depend on any option of the UpdateRequest, so it makes no difference whether 'waitFlush', 'waitSearcher' or 'softCommit' is true or false.
> The described behavior makes ChildDocuments unusable, because the performance is unacceptable.
>  
>  
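The contention described in the issue can be illustrated with a minimal, self-contained sketch. This is not Solr code: the class SynchronizedLogDemo and its methods addWithRefresh and maxConcurrentWriters are hypothetical names, and the Thread.sleep call merely stands in for an expensive operation such as re-opening a realtime searcher. The sketch only shows the general effect: when every operation runs under the same monitor, adding more writer threads never lets more than one of them make progress at a time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a log whose operations all share one monitor. Hypothetical; not Solr code. */
public class SynchronizedLogDemo {
    // Number of threads currently inside the guarded section.
    private final AtomicInteger active = new AtomicInteger();
    // Highest value of 'active' ever observed.
    private final AtomicInteger maxActive = new AtomicInteger();

    /** Stand-in for an add that also triggers an expensive refresh while holding the monitor. */
    public synchronized void addWithRefresh() throws InterruptedException {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        Thread.sleep(5); // simulated expensive work (assumption: stands in for a searcher re-open)
        active.decrementAndGet();
    }

    /** Runs the given number of writer threads and reports the peak concurrency inside the monitor. */
    public static int maxConcurrentWriters(int threads, int opsPerThread) throws InterruptedException {
        SynchronizedLogDemo log = new SynchronizedLogDemo();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    log.addWithRefresh();
                }
                return null; // returning a value makes this a Callable, so checked exceptions are allowed
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return log.maxActive.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Four writer threads, yet only one is ever inside the synchronized method at once.
        System.out.println("max concurrent writers: " + maxConcurrentWriters(4, 5)); // prints 1
    }
}
```

With four threads and five operations each, total wall-clock time in this toy model is roughly the sum of all the simulated work rather than a quarter of it, which is the single-threaded behavior the reporter describes.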



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
