Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/08/28 15:50:52 UTC

[jira] [Updated] (LUCENE-5189) Numeric DocValues Updates

     [ https://issues.apache.org/jira/browse/LUCENE-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-5189:
-------------------------------

    Attachment: LUCENE-5189.patch

The patch adds numeric-DV field update capabilities:

* IndexWriter.updateNumericDocValue(term, field, value) updates the value of 'field' of all documents associated with 'term' to the new 'value'

* When you update the value of field 'f' for a few documents, a new pair of .dvd/.dvm files is created, holding the values of all documents for that field.
** That way you can end up with e.g. _0_Lucene45_0_1.dvd and *.dvm for field 'f' and the _0.cfs for other fields which were not updated.
** SegmentInfoPerCommit tracks for each field in which 'gen' it's recorded, and SegmentCoreReaders uses that map to read the values of the field from the respective gen.

* TestNumericDocValuesUpdates contains a dozen or so testcases which cover different angles, from simple updates, to unsetting values, merging segments, deletes etc. During development I ran into many interesting scenarios :).

* ReaderAndLiveDocs.writeLiveDocs now applies the field updates in addition to the deletes. BufferedDeletes tracks the updates, similar to how it tracks deletes.

* SegmentCoreReaders no longer has a single DVConsumer it uses, but rather per field it uses the appropriate DVConsumer (depends on the 'gen').
** I put a nocommit to move the DVConsumers out of SegCoreReaders into a RefCount'd object in SegmentReader, so that SegCoreReaders keeps managing the 'readers' that are shared between all SegReaders, and so that the same DVConsumer is reused by multiple SegReaders. I'll do that later.

* Segment merging is supported in that when a segment with updates is merged, the correct values are written to the merged segment and the resulting segment has no 'gen' .dvd.

* I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen. The default impl throws UnsupportedOpException, while Lucene45 implements it.
** I want to have only one variant of that method, thereby breaking the API. This is important IMO cause we need to ensure that whatever custom DVFormats out there pay attention to the new fieldInfosGen parameter, or otherwise they might overwrite previously created files.
** There is also a nocommit touching that with a suggestion to forbid createOutput call in TrackingDir if the file is already referenced by an IndexCommit.
** It is important that we break something here so that users/apps pay attention to the new feature -- suggestions are welcome!
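To make the points above a bit more concrete, here is a toy sketch in plain Java. It is not code from the patch and uses no Lucene classes: ToySegment, its fields, and the simplified file names (e.g. _0_f_1.dvd instead of _0_Lucene45_0_1.dvd) are all assumptions made for illustration. It only models the idea that an update writes a full new column of values under the next 'gen', that reads go through a per-field gen map, that a merge folds everything back into a segment with no gen'd files, and that an already written file is never overwritten.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one segment; not Lucene code. All names here are hypothetical. */
class ToySegment {
    final String name;
    final List<String> docTerms = new ArrayList<>();    // one identifying term per document
    final Map<String, Long> fieldGen = new HashMap<>(); // field -> gen holding its current values
    final Map<String, long[]> files = new HashMap<>();  // "file name" -> full column of values

    ToySegment(String name) { this.name = name; }

    // Simplified naming: gen 0 stands for the values written with the segment itself.
    String fileName(String field, long gen) {
        return gen == 0 ? name + ".cfs:" + field : name + "_" + field + "_" + gen + ".dvd";
    }

    // Refuses to overwrite an existing file, loosely mirroring the createOutput suggestion.
    private void createOutput(String file, long[] column) {
        if (files.putIfAbsent(file, column) != null) {
            throw new IllegalStateException("refusing to overwrite " + file);
        }
    }

    /** Writes the initial (gen 0) column for a field. */
    void addField(String field, long[] values) {
        createOutput(fileName(field, 0), values.clone());
        fieldGen.put(field, 0L);
    }

    /** Sets 'field' to 'value' for every document matching 'term'; returns the new file name. */
    String updateNumericDocValue(String term, String field, long value) {
        Long gen = fieldGen.get(field);
        if (gen == null) {
            throw new IllegalArgumentException("can only update an existing field: " + field);
        }
        // Copy the whole column, so the new gen holds the values of all documents.
        long[] column = files.get(fileName(field, gen)).clone();
        for (int doc = 0; doc < docTerms.size(); doc++) {
            if (docTerms.get(doc).equals(term)) {
                column[doc] = value;
            }
        }
        String newFile = fileName(field, gen + 1);
        createOutput(newFile, column);
        fieldGen.put(field, gen + 1);
        return newFile;
    }

    /** Reads the current value by resolving the field's gen first. */
    long get(String field, int doc) {
        return files.get(fileName(field, fieldGen.get(field)))[doc];
    }

    /** Merges two segments that share the same fields; the result has only gen-0 values. */
    static ToySegment merge(String name, ToySegment a, ToySegment b) {
        ToySegment merged = new ToySegment(name);
        merged.docTerms.addAll(a.docTerms);
        merged.docTerms.addAll(b.docTerms);
        int offset = a.docTerms.size();
        for (String field : a.fieldGen.keySet()) {
            long[] column = new long[merged.docTerms.size()];
            for (int doc = 0; doc < a.docTerms.size(); doc++) {
                column[doc] = a.get(field, doc);
            }
            for (int doc = 0; doc < b.docTerms.size(); doc++) {
                column[offset + doc] = b.get(field, doc);
            }
            merged.addField(field, column);
        }
        return merged;
    }
}
```

In this model, updating 'f' on segment _0 leaves the original column in place and adds _0_f_1.dvd next to it, while untouched fields keep reading from gen 0. The real patch works against the codec's DocValues files and IndexWriter, so treat this only as a picture of the bookkeeping.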

A few remarks:

* For now, only updating by a single term is supported (simplicity).
* You cannot add a new field through a field update, only update existing fields. Adding a field is a 'schema' change, and there are other ways to do it, e.g. through addIndexes and FilterAtomicReader. Attempting to support it means that we would also need to create a gen'd FieldInfosFormat, which complicates matters.
* I dropped some nocommits about renaming classes/methods. I didn't want to do it yet, cause it creates an unnecessarily bloated patch. Feel free to comment, we can take care of the renames later.
* I will probably create a branch for this feature, cause there are some things that still need to be taken care of (add some tests, finish Codecs support etc.)
* Also, I haven't yet benchmarked the effect of field updates on indexing/search ... I will get to it at some point, but if someone wants to help, I promise not to say no :).

I may have forgotten to describe some changes, feel free to ask for clarification!
                
> Numeric DocValues Updates
> -------------------------
>
>                 Key: LUCENE-5189
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5189
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5189.patch
>
>
> In LUCENE-4258 we started to work on incremental field updates, however the amount of changes is immense and hard to follow/consume. The reason is that we targeted postings, stored fields, DV etc., all from the get-go.
> I'd like to start afresh here, with numeric-dv-field updates only. There are a couple of reasons for that:
> * NumericDV fields should be easier to update, if e.g. we write all the values of all the documents in a segment for the updated field (similar to how livedocs work, and previously norms).
> * It's a fairly contained issue, attempting to handle just one data type to update, yet requires many changes to core code which will also be useful for updating other data types.
> * It has value in and of itself, and we don't need to allow updating all the data types in Lucene at once ... we can do that gradually.
> I have some working patch already which I'll upload next, explaining the changes.
