Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2018/10/31 14:59:00 UTC

[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

    [ https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670205#comment-16670205 ] 

Erick Erickson commented on LUCENE-8551:
----------------------------------------

David:

It would be way cool to have this happen on merge. We've had situations where some runaway program added over a million fields and the only remedy was to re-index.

I'm working on SOLR-12259 which will, I hope, allow us to "do things" to the index. If this is too expensive to make happen as part of the regular merging process, that might be an alternative way to go about it on a one-off basis. I'd rather have it happen as part of regular merging of course.

Even if this becomes part of regular segment merging, we should still be able to trigger it through SOLR-12259 to cover segments that are never merged because they're full and aren't having records deleted.

> Purge unused FieldInfo on segment merge
> ---------------------------------------
>
>                 Key: LUCENE-8551
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8551
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: David Smiley
>            Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, docValues, stored value, points index), it will nonetheless hang around in FieldInfos indefinitely.  It would be nice to be able to recognize an unused FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing it does.  After that, it merges the different index parts, before it's known what's "used" or not.  After writing, we theoretically know which fields are used or not, though we're not doing any bookkeeping to track it.  Maybe we should track the fields used during writing so we write a filtered merged FieldInfos at the end instead of an unfiltered one up front?  Or perhaps upon reading a segment, we make it cheap/easy for each index type (e.g. terms index, stored fields, ...) to know which fields have data for the corresponding type.  Then, on a subsequent merge, we know up front to filter the FieldInfos.
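
The first idea in the description (track which fields are written during the merge, then write a filtered FieldInfos at the end) could be sketched roughly as follows. This is only an illustration in plain Java: UsedFieldTracker, markUsed and filter are hypothetical names, not Lucene classes or methods, and fields are modeled as bare names rather than real FieldInfo objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical bookkeeping helper for illustration only; not part of Lucene. */
final class UsedFieldTracker {
    // Names of fields for which at least one index part wrote data during the merge.
    private final Set<String> used = new LinkedHashSet<>();

    /** Assumed to be called by each index-part writer whenever it writes data for a field. */
    void markUsed(String fieldName) {
        used.add(fieldName);
    }

    /** Returns only the merged field names that were marked used, preserving their order. */
    List<String> filter(List<String> mergedFieldNames) {
        List<String> kept = new ArrayList<>();
        for (String name : mergedFieldNames) {
            if (used.contains(name)) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

In this sketch the unfiltered merged field list would still be built up front, each writer would call markUsed as it goes, and filter would run just before the field metadata is written out. Whether that ordering is practical inside SegmentMerger is exactly the open question raised in the issue.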



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
