Posted to issues@lucene.apache.org by "Vigya Sharma (Jira)" <ji...@apache.org> on 2022/05/01 01:10:00 UTC

[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

    [ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530476#comment-17530476 ] 

Vigya Sharma commented on LUCENE-10216:
---------------------------------------

I think the PR is ready for review, with existing tests passing and added tests for new changes.

{{OneMerge}} distribution is now provided by a new {{findMerges(CodecReaders[])}} API in {{MergePolicy}}, and executed by {{MergeScheduler}} threads. I've also modified the {{MockRandomMergePolicy}} to randomly pick a highly concurrent (one segment per reader) {{findMerges(...)}} implementation 50% of the time, and confirmed manually that tests pass in both scenarios (with this new impl. as well as with the default impl. being picked). Thanks to Michael McCandless for the suggestion.

> Add concurrency to addIndexes(CodecReader…) API
> -----------------------------------------------
>
>                 Key: LUCENE-10216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the e-commerce platform. I’m working on a project that involves applying metadata+ETL transforms and indexing documents on n different _indexing_ boxes, combining them into a single index on a separate _reducer_ box, and making it available for queries on m different _search_ boxes (replicas). Segments are asynchronously copied from indexers to reducers to searchers as they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the reducer boxes. Since we also have taxonomy data, we need to remap facet field ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version of this API. The API leverages {{SegmentMerger.merge()}} to create segments with new ordinal values while also merging all provided segments in the process.
> _This is however a blocking call that runs in a single thread._ Until we have written segments with new ordinal values, we cannot copy them to searcher boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, each with only a single reader, creating a concurrently running 1:1 conversion from old segments to new ones (with new ordinal values). We follow this up with non-blocking background merges. This lets us copy the segments to searchers and replicas as soon as they are available, and later replace them with merged segments as background jobs complete. On the Amazon dataset I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} API with a {{boolean}} flag for concurrency, that internally submits multiple merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, and waits for them to complete before returning.
> While this is doable from outside Lucene by using your own thread pool, starting multiple addIndexes() calls and waiting for them to complete, I felt it requires some understanding of what addIndexes does, why you need to wait on the merge, and why it makes sense to pass a single reader to the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
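
For illustration, the "doable from outside Lucene" approach described above (your own thread pool, one task per reader, wait for all of them) could look roughly like the sketch below. It is not code from the PR. {{ConcurrentAddSketch}}, {{ReaderTask}} and {{addAllConcurrently}} are hypothetical names, and the per-reader task is a stand-in for a real call such as {{writer.addIndexes(reader)}} with a single {{CodecReader}}; only java.util.concurrent is used so that the sketch compiles without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentAddSketch {

    /** Hypothetical stand-in for a per-reader call such as writer.addIndexes(reader). */
    public interface ReaderTask<R> {
        void add(R reader) throws Exception;
    }

    /**
     * Submits one task per reader to a fixed-size pool and blocks until all
     * of them have finished. The first failure is rethrown to the caller,
     * wrapped in an ExecutionException by Future.get().
     * The threads argument must be greater than zero.
     */
    public static <R> void addAllConcurrently(List<R> readers, ReaderTask<R> task, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (R reader : readers) {
                // One reader per task: each task is an independent 1:1 conversion.
                // Returning null makes the lambda a Callable, so the task may throw.
                pending.add(pool.submit(() -> {
                    task.add(reader);
                    return null;
                }));
            }
            // Wait for every task, mirroring the blocking behaviour of addIndexes().
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass its list of readers and something like {{reader -> writer.addIndexes(reader)}} as the task. Merging the resulting per-reader segments is then left to the background merges mentioned in the description.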


