Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2014/05/06 06:07:15 UTC

[jira] [Commented] (LUCENE-5646) stored fields bulk merging doesn't quite work right

    [ https://issues.apache.org/jira/browse/LUCENE-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990259#comment-13990259 ] 

Robert Muir commented on LUCENE-5646:
-------------------------------------

Perhaps the reason I fall "out of sync" is that the first segment ended on a non-chunk boundary (I have no deletions).

So when the merge moves to the next segment, it falls out of sync and never "recovers". I'm not sure what we can do here, but it seems that unless you have very large docs, you aren't going to get a "pure" bulk copy even with my fix, because the chances of everything aligning are quite slim.
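
To make the alignment problem concrete, here is a toy calculation (not the actual merge code; the 128-doc chunk limit and the segment sizes are just illustrative assumptions, and it ignores the byte-size trigger for closing a chunk):

{code}
// Toy illustration; not CompressingStoredFieldsWriter code.
// Assumes small docs, so every chunk closes at exactly 128 documents.
public class ChunkAlignmentDemo {
  public static void main(String[] args) {
    final int docsPerChunk = 128;   // illustrative chunk limit
    final int seg1Docs = 300;       // first segment: two full chunks plus a 44-doc chunk
    final int seg2Docs = 512;       // second segment: four full chunks

    // Where does each chunk of segment 2 start in the merged segment?
    for (int localStart = 0; localStart < seg2Docs; localStart += docsPerChunk) {
      int mergedStart = seg1Docs + localStart;   // 300, 428, 556, 684
      boolean aligned = mergedStart % docsPerChunk == 0;
      System.out.println("segment-2 chunk at merged doc " + mergedStart
          + (aligned ? ": aligned, bulk copy possible" : ": misaligned"));
    }
    // None of 300, 428, 556, 684 is a multiple of 128, so the writer is always
    // mid-chunk when a reader chunk begins and the bulk-copy path never triggers.
  }
}
{code}

Unless a segment happens to end on an exact multiple of the chunk limit, every chunk of the following segments is offset by the same remainder, so the bulk path never gets another chance.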

Maybe there is a way we could (temporarily, for that merge) force a flush() at each segment transition to avoid this, so that the optimization would continue, and then recombine the undersized chunks in a later merge to eventually recover?
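
Extending the toy numbers above: if the writer force-flushed its 44 pending docs as an undersized chunk at the segment transition, its buffer would be empty when segment 2 starts and every reader chunk would line up again. This is only a sketch of the idea (the flush-at-transition hook is hypothetical, and it again assumes chunks close at exactly 128 docs):

{code}
// Toy continuation of the example above; not Lucene code.
public class ForcedFlushDemo {
  public static void main(String[] args) {
    final int docsPerChunk = 128;
    final int seg1Docs = 300;
    final int seg2Docs = 512;

    for (boolean flushAtBoundary : new boolean[] { false, true }) {
      // Docs still buffered in the merged writer when segment 2 begins:
      // 44 without the forced flush, 0 with it.
      int pending = flushAtBoundary ? 0 : seg1Docs % docsPerChunk;
      System.out.println("flushAtBoundary=" + flushAtBoundary);
      for (int localStart = 0; localStart < seg2Docs; localStart += docsPerChunk) {
        // Bulk copy is only possible when the writer's buffer is empty
        // at the start of a reader chunk.
        boolean bulkCopy = (pending + localStart) % docsPerChunk == 0;
        System.out.println("  reader chunk at local doc " + localStart + " -> "
            + (bulkCopy ? "bulk copy" : "doc-at-a-time"));
      }
    }
  }
}
{code}

The cost would be one undersized chunk per segment boundary, but those short chunks should fail the "chunk is large enough" check on the next merge and go through the doc-at-a-time path, which would recombine them: that's the "eventually recover" part.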

> stored fields bulk merging doesn't quite work right
> ---------------------------------------------------
>
>                 Key: LUCENE-5646
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5646
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: 4.9, 5.0
>
>
> From doing some profiling of merging:
> CompressingStoredFieldsWriter has 3 codepaths (as I see it):
> 1. optimized bulk copy (no deletions in chunk). In this case compressed data is copied over.
> 2. semi-optimized copy: in this case it's optimized for an existing storedfieldswriter, and it decompresses and recompresses doc-at-a-time around any deleted docs in the chunk.
> 3. ordinary merging
> In my dataset, I only see #2 happening, never #1. The logic for determining if we can do #1 seems to be:
> {code}
> onChunkBoundary && chunkSmallEnough && chunkLargeEnough && noDeletions
> {code}
> I think the logic for "chunkLargeEnough" is out of sync with the MAX_DOCUMENTS_PER_CHUNK limit? E.g., instead of:
> {code}
> startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize // chunk is large enough
> {code}
> it should be something like:
> {code}
> (it.chunkDocs >= MAX_DOCUMENTS_PER_CHUNK || startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize) // chunk is large enough
> {code}
> But this only works "at first" and then falls out of sync in my tests. Once this happens, it never reverts to the #1 algorithm and sticks with #2. So it's still not quite right.
> Maybe [~jpountz] knows off the top of his head...



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org