Posted to dev@lucene.apache.org by "Andrzej Bialecki (Created) (JIRA)" <ji...@apache.org> on 2012/03/01 17:39:59 UTC

[jira] [Created] (LUCENE-3837) A modest proposal for updateable fields

A modest proposal for updateable fields
---------------------------------------

                 Key: LUCENE-3837
                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
             Project: Lucene - Java
          Issue Type: New Feature
          Components: core/index
    Affects Versions: 4.0
            Reporter: Andrzej Bialecki 


I'd like to propose a simple design for implementing updateable fields in Lucene. This design has some limitations, so I'm not claiming it will be appropriate for every use case, and it's obvious it has some performance consequences, but at least it's a start...

This proposal uses a concept of "overlays" or "stacked updates", where the original data is not removed but instead it's overlaid with the new data. I propose to reuse as much of the existing APIs as possible, and represent updates as an IndexReader. Updates to documents in a specific segment would be collected in an "overlay" index specific to that segment, i.e. there would be as many overlay indexes as there are segments in the primary index. 

A field update would be represented as a new document in the overlay index. The document would consist of just the updated fields, plus a field that records the id, in the primary segment, of the document affected by the update. These updates would be processed as usual via secondary IndexWriter-s, one per primary segment, so the same analysis chains, the same field types, etc. would be used.

When opening a segment with updates, the SegmentReader (see also LUCENE-3836) would check for the presence of the "overlay" index. If one exists, it would open it first (as an AtomicReader? or would it open individual codec format readers? perhaps it should load the whole thing into memory?) and construct an in-memory map between the primary's docId-s and the overlay's docId-s. Finally, it would wrap the original format readers with "overlay readers", each also initialized with the id map.
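
The in-memory id map described above could look roughly like the sketch below. It is plain Java and does not use any Lucene API; the class name OverlayIdMap, the NO_OVERLAY sentinel and the "later update wins" rule are assumptions made for illustration, not part of the proposal.

```java
import java.util.Arrays;

/** Hypothetical sketch: maps primary-segment docIds to overlay docIds. */
final class OverlayIdMap {
    /** Sentinel meaning "this primary doc has no overlay doc" (an assumption). */
    static final int NO_OVERLAY = -1;

    private final int[] primaryToOverlay;

    /**
     * @param primaryMaxDoc     number of documents in the primary segment
     * @param overlayPrimaryIds overlayPrimaryIds[o] is the primary docId recorded
     *                          in the id field of overlay document o
     */
    OverlayIdMap(int primaryMaxDoc, int[] overlayPrimaryIds) {
        primaryToOverlay = new int[primaryMaxDoc];
        Arrays.fill(primaryToOverlay, NO_OVERLAY);
        for (int overlayDoc = 0; overlayDoc < overlayPrimaryIds.length; overlayDoc++) {
            int primaryDoc = overlayPrimaryIds[overlayDoc];
            if (primaryDoc < 0 || primaryDoc >= primaryMaxDoc) {
                throw new IllegalArgumentException("bad primary docId: " + primaryDoc);
            }
            // Assumption: a later overlay doc supersedes an earlier one
            // that targets the same primary doc.
            primaryToOverlay[primaryDoc] = overlayDoc;
        }
    }

    /** Returns the overlay docId for a primary docId, or NO_OVERLAY. */
    int overlayDoc(int primaryDoc) {
        return primaryToOverlay[primaryDoc];
    }

    boolean hasOverlay(int primaryDoc) {
        return primaryToOverlay[primaryDoc] != NO_OVERLAY;
    }
}
```

A dense int[] costs four bytes per primary document even when few documents are updated; a sparse map would trade lookup speed for memory.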

Now, when consumers of the 4D API ask for specific data, the "overlay readers" would first re-map the primary's docId to the overlay's docId and check whether overlay data exists for that docId and that type of data (e.g. postings, stored fields, vectors). If it does, they would return it instead of the original; otherwise they would return the original data.
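
A minimal sketch of such an "overlay reader", assuming a much simplified stand-in for a codec format reader: the FieldSource interface, its value() method and the class OverlayFieldSource are hypothetical names, and real format readers expose far richer APIs than a single string lookup.

```java
/** Hypothetical stand-in for one codec format reader (e.g. stored fields). */
interface FieldSource {
    /** Returns the field's value for the doc, or null if this source has none. */
    String value(int docId, String field);
}

/** Wraps a primary source: consults the overlay first, then falls back. */
final class OverlayFieldSource implements FieldSource {
    private final FieldSource primary;
    private final FieldSource overlay;
    private final int[] primaryToOverlay; // -1 means "no overlay doc" (assumption)

    OverlayFieldSource(FieldSource primary, FieldSource overlay, int[] primaryToOverlay) {
        this.primary = primary;
        this.overlay = overlay;
        this.primaryToOverlay = primaryToOverlay;
    }

    @Override
    public String value(int primaryDoc, String field) {
        // Step 1: re-map the primary docId to the overlay docId.
        int overlayDoc = primaryToOverlay[primaryDoc];
        if (overlayDoc >= 0) {
            // Step 2: return overlay data if it exists for this doc and field.
            String updated = overlay.value(overlayDoc, field);
            if (updated != null) {
                return updated;
            }
        }
        // Step 3: otherwise return the original data.
        return primary.value(primaryDoc, field);
    }
}
```

Because the wrapper implements the same interface as the reader it wraps, callers would not need to know whether a segment has an overlay.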

One obvious performance issue with this approach is that sequential access to the primary data would translate into random access to the overlay data. This could be solved by sorting the overlay index so that at least the overlay ids increase monotonically as the primary ids do.
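
To illustrate why sorting helps: if overlay documents are ordered by the primary docId they update, a forward-only cursor can follow a sequential scan of the primary segment without ever seeking backwards. The sketch below is plain Java under that assumption; SortedOverlayCursor and advance() are hypothetical names, and it assumes at most one overlay doc per primary doc.

```java
/** Hypothetical forward-only cursor over an overlay sorted by primary docId. */
final class SortedOverlayCursor {
    // overlayPrimaryIds[o] = primary docId updated by overlay doc o,
    // strictly ascending (assumption: one overlay doc per primary doc).
    private final int[] overlayPrimaryIds;
    private int pos = 0;

    SortedOverlayCursor(int[] overlayPrimaryIds) {
        for (int i = 1; i < overlayPrimaryIds.length; i++) {
            if (overlayPrimaryIds[i] <= overlayPrimaryIds[i - 1]) {
                throw new IllegalArgumentException("overlay is not sorted by primary docId");
            }
        }
        this.overlayPrimaryIds = overlayPrimaryIds.clone();
    }

    /**
     * Returns the overlay docId for primaryDoc, or -1 if it has none.
     * Calls must pass non-decreasing primaryDoc values, as a sequential
     * scan of the primary segment would.
     */
    int advance(int primaryDoc) {
        // Skip overlay docs that belong to earlier primary docs.
        while (pos < overlayPrimaryIds.length && overlayPrimaryIds[pos] < primaryDoc) {
            pos++;
        }
        boolean match = pos < overlayPrimaryIds.length && overlayPrimaryIds[pos] == primaryDoc;
        return match ? pos : -1;
    }
}
```

Over a full scan the cursor touches each overlay entry at most once, so the overlay is read in docId order rather than at random.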

Updates to the primary index would be handled as usual: segment merges would just work (since the segments with updates would pretend to have no overlays); only the overlay index would have to be deleted once the primary segment is deleted after the merge.

Updates to existing documents that already had some fields updated would again be handled as usual, except that underneath they would open an IndexWriter on the overlay index for the specific segment.

That's the broad idea. Feel free to chime in - I started some coding at the codec level but got stuck using the approach in LUCENE-3836. The approach that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220885#comment-13220885 ] 

Shai Erera commented on LUCENE-3837:
------------------------------------

bq. it merges updates on the fly, at the cost of keeping a static map of primary->secondary ids

Ah OK, I missed that part.
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220387#comment-13220387 ] 

Shai Erera commented on LUCENE-3837:
------------------------------------

Andrzej, this brings back [old memories|http://mail-archives.apache.org/mod_mbox/lucene-dev/201004.mbox/%3Cu2s786fde51004250432gd50bec64m9b2f6ee6dd495987@mail.gmail.com%3E] :-).

The core difference in your proposal is that the updates are processed in a separate index, and that at runtime we use a PQ to match documents and collapse all the updates, right? And these updates will be reflected in the main index on segment merges, right?

I personally prefer a more integrated solution than one that's based on matching PQs, but since I have barely done anything with my proposal for 2 years, I guess that your progress is better than no progress at all.

One comment -- when the updates are collapsed, they may not just simply 'replace' what exists before them. I could see an update to a document which adds a stored field, and therefore if I call IndexReader.document(i), I'd expect to see that stored field along with all the ones that existed before it.

At the time I felt that modifying Lucene to add stacked segments was way too complicated, and the indexing internals kept changing by the day. But now Codecs seem to be very stable, and the pace of code changes on trunk has relaxed, so perhaps it'll be worthwhile taking a second look at that proposal? (but only if you feel like it)
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220420#comment-13220420 ] 

Andrzej Bialecki  commented on LUCENE-3837:
-------------------------------------------

bq. I guess that your progress is better than no progress at all.
That's my perspective too, and it's reflected in the title of this issue... I remember your description, and in fact my proposal is somewhat similar. It does not use PQs, but indeed it merges updates on the fly, at the cost of keeping a static map of primary->secondary ids and random seeking in the secondary index to retrieve matching data. Please check the description above. Then, once a segment merge is executed, the overlay data will be integrated into the main data, because the merge process will pull in this mix of new and old without being aware of it: it will be hidden by the Codec's read formats. Codec abstractions are great for this kind of manipulation.
bq. One comment -- when the updates are collapsed, they may not just simply 'replace' what exists before them.
Right, old data will be returned if not overlaid by new data, meaning that e.g. old stored field values will be returned for all other fields except the updated field, and for that field the data from the overlay will be returned.
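
As a rough illustration of these semantics for stored fields, the sketch below merges an original document with its overlay so that updated fields come from the overlay and all other fields keep their old values. It is plain Java, not the Lucene stored-fields API: StackedDocument and merge() are hypothetical names, and it assumes single-valued string fields for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of how a stacked document's stored fields combine. */
final class StackedDocument {
    private StackedDocument() {}

    /**
     * Returns a new map holding every original field, with values replaced
     * by the overlay where the overlay has that field, plus any fields that
     * exist only in the overlay. Neither input map is modified.
     */
    static Map<String, String> merge(Map<String, String> original,
                                     Map<String, String> overlay) {
        Map<String, String> merged = new LinkedHashMap<>(original);
        merged.putAll(overlay); // overlay values win for updated fields
        return merged;
    }
}
```

With an original of {title=A, tag=old} and an overlay of {tag=new, rating=5}, the merged document holds title=A, tag=new and rating=5, so a field added by an update appears alongside the fields that existed before it.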
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Alan Woodward (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396847#comment-13396847 ] 

Alan Woodward commented on LUCENE-3837:
---------------------------------------

How much more work needs to be done on the patch?  We have a client who is very interested in getting updateable fields working for document tagging, and could probably be persuaded to support working on this if it's not too far from viability.
                
> A modest proposal for updateable fields
> ---------------------------------------
>
>                 Key: LUCENE-3837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: LUCENE-3837.patch


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220208#comment-13220208 ] 

Robert Muir commented on LUCENE-3837:
-------------------------------------

{quote}
Ad 1. I don't think it's such a big deal, we already return approximate stats (too high counts) in presence of deletes. I think we should go all the way, at least initially, and ignore stats from an overlay completely, unless the data is present only in the overlay - e.g. for terms not present in the main index.
{quote}

I disagree: it may not be a big deal for DefaultSimilarity, but it's important for other scoring implementations. Initially, it's extremely important
that we get this stuff right before committing anything!

Large problems can result when the statistics are inconsistent with what is 'discovered' in the DocsEnum. This is because many scoring models expect
certain relationships to hold true, for example that a single doc's tf value won't exceed totalTermFreq. We had to do significant work already to ensure
consistency, though in some cases the problems could not be totally solved (BasicModelD, BasicModelP, BasicModelBE+NormalizationH3, etc.) and we
unfortunately had to resort to only leaving warnings in the javadocs.
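
The kind of relationship meant here can be written down as a small check. The sketch below is plain Java with hypothetical names (TermStatsCheck, firstViolation); the invariants it tests are the one named above (tf <= totalTermFreq), docFreq <= maxDoc, and totalTermFreq >= docFreq, the last of which follows from every matching doc having tf >= 1.

```java
/** Hypothetical consistency check for term statistics versus observed postings. */
final class TermStatsCheck {
    private TermStatsCheck() {}

    /**
     * Returns a description of the first violated invariant, or null if the
     * reported statistics are consistent with the observed per-document tf values.
     */
    static String firstViolation(int maxDoc, int docFreq, long totalTermFreq,
                                 int[] observedTfs) {
        if (docFreq > maxDoc) {
            return "docFreq > maxDoc";
        }
        if (totalTermFreq < docFreq) {
            return "totalTermFreq < docFreq";
        }
        for (int tf : observedTfs) {
            if (tf > totalTermFreq) {
                return "tf > totalTermFreq";
            }
        }
        return null;
    }
}
```

For example, if the statistics come only from the primary segment (totalTermFreq = 3) while an overlay document is then found to contain the term 5 times, the check reports "tf > totalTermFreq".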

I'm fairly certain that in all cases we avoid things like NaN or negative scores, but a function that 'inverts relevance' is awful too.

So I think we need a consistent model for stats: that's why I lean towards maxDoc(field), which is consistent in every way with how we handle
deletes, and it won't yield any surprises.
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220163#comment-13220163 ] 

Robert Muir commented on LUCENE-3837:
-------------------------------------

Some concerns about scoring:

# the stats problem: maybe we should allow overlay readers to just return -1 for docFreq? I don't like the
  situation today where the preflex codec doesn't implement all the stats (the whole -1 situation and 'optional' stats
  is frustrating), but I think it's worse to return out-of-bounds stuff, e.g. where docFreq > maxDoc. I think
  totalTermFreq is safe to just sum up though (it's wrong, but not out of bounds), and a similarity could safely use
  it to compute an expected IDF instead. Still, this part will be messy: unlike the
  newer stats in 4.0, lots of code, I think, expects that docFreq is always supported. Another possibility, which
  I think I like more, is to treat this conceptually just like deletes in every way, so all stats are supported
  but "maxDoc" is wrong (it includes masked-away documents); then nothing is out of bounds. So in this case we
  would add maxDoc(field), which is only used for scoring. For a normal reader this just returns maxDoc() as
  implemented today...
# the norms problem: although norms are implemented as docValues, currently all similarities assume that
  getArray()/hasArray() is implemented... but here I'm not sure that would be the case? We
  should probably measure whether the method call really even hurts; in general I think it's a burden on the codec
  to require that norms actually be representable as an array (maybe other use cases would want
  other data structures for less RAM)...

We could solve both of these issues separately and independently if we decide what we want to do.
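
To make the out-of-bounds concern in the first point concrete, here is a small sketch. It uses a common BM25-style IDF formula, ln(1 + (N - n + 0.5) / (n + 0.5)), purely as an example scoring function. The class FieldMaxDocIdf and the reading of maxDoc(field) as "primary maxDoc plus overlay docs carrying that field" are assumptions for illustration, not something settled in this issue.

```java
/** Hypothetical illustration of why docFreq must stay within the doc count. */
final class FieldMaxDocIdf {
    private FieldMaxDocIdf() {}

    /** BM25-style IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Assumed reading of maxDoc(field): count masked-away documents too,
     * i.e. primary docs plus overlay docs that carry this field.
     */
    static long fieldMaxDoc(long primaryMaxDoc, long overlayDocsWithField) {
        return primaryMaxDoc + overlayDocsWithField;
    }
}
```

With 10 primary docs, a primary docFreq of 8 and 5 overlay docs that all contain the term, the summed docFreq is 13. Using the primary maxDoc of 10 as the doc count makes idf(13, 10) negative, whereas idf(13, fieldMaxDoc(10, 5)) stays positive because docFreq can no longer exceed the doc count.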

                
> A modest proposal for updateable fields
> ---------------------------------------
>
>                 Key: LUCENE-3837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in Lucene. This design has some limitations, so I'm not claiming it will be appropriate for every use case, and it's obvious it has some performance consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the original data is not removed but instead it's overlaid with the new data. I propose to reuse as much of the existing APIs as possible, and represent updates as an IndexReader. Updates to documents in a specific segment would be collected in an "overlay" index specific to that segment, i.e. there would be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. The document would consist of just the updated fields, plus a field that records the id in the primary segment of the document affected by the update. These updates would be processed as usual via secondary IndexWriter-s, as many as there are primary segments, so the same analysis chains would be used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) would check for the presence of the "overlay" index, and if so it would open it first (as an AtomicReader? or it would open individual codec format readers? perhaps it should load the whole thing into memory?), and it would construct an in-memory map between the primary's docId-s and the overlay's docId-s. And finally it would wrap the original format readers with "overlay readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay readers" would first re-map the primary's docId to the overlay's docId, and check whether overlay data exists for that docId and this type of data (e.g. postings, stored fields, vectors) and return this data instead of the original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential access to primary data would translate into random access to the overlay data. This could be solved by sorting the overlay index so that at least the overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges (since the segments with updates would pretend to have no overlays) would just work as usual; only the overlay index would have to be deleted once the primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would be again handled as usual, only underneath they would open an IndexWriter on the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the codec level but got stuck using the approach in LUCENE-3836. The approach that uses a modified SegmentReader seems more promising.
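
The lookup path described in the proposal (re-map the primary docId, check the overlay, fall back to the original) might look something like the sketch below for one type of data. OverlayStoredValues is a hypothetical class, not anything in Lucene; plain String arrays stand in for the real format readers, and a null overlay entry stands in for "no overlay data of this type".

```java
import java.util.Arrays;

/** Hypothetical overlay lookup for a single kind of per-document data. */
final class OverlayStoredValues {
    private static final int NO_OVERLAY = -1;

    private final String[] primary;        // indexed by primary docId
    private final String[] overlay;        // indexed by overlay docId
    private final int[] primaryToOverlay;  // in-memory id map, NO_OVERLAY if none

    /**
     * @param overlayTargets for each overlay docId, the primary docId it updates
     *                       (the extra field recorded in each overlay document)
     */
    OverlayStoredValues(String[] primary, String[] overlay, int[] overlayTargets) {
        this.primary = primary;
        this.overlay = overlay;
        this.primaryToOverlay = new int[primary.length];
        Arrays.fill(primaryToOverlay, NO_OVERLAY);
        // Assumption: if several overlay docs target one primary doc, the last wins.
        for (int o = 0; o < overlayTargets.length; o++) {
            primaryToOverlay[overlayTargets[o]] = o;
        }
    }

    /** Returns the overlay value if one exists for this doc, else the original. */
    String get(int primaryDocId) {
        int o = primaryToOverlay[primaryDocId];
        if (o != NO_OVERLAY && overlay[o] != null) {
            return overlay[o];
        }
        return primary[primaryDocId];
    }
}
```

Note that the map here is a dense int[] sized to the primary segment, so its memory cost depends on the segment size rather than on the number of updates.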

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220315#comment-13220315 ] 

Andrzej Bialecki  edited comment on LUCENE-3837 at 3/1/12 8:17 PM:
-------------------------------------------------------------------

bq. Could we use the actual docID (ie same docID as the base segment)?
Updates may arrive out of order, so the updates will naturally get different internal IDs (also, if you wanted to use the same ids they would have gaps). I don't know if various parts of Lucene can handle out of order ids coming from iterators? If we wanted to match the ids early then we would have to sort them, a la IndexSorter, on every flush and on every merge, which seems too costly. So, a re-mapping structure seems like a decent compromise. Yes, it could be large - we could put artificial limits on the number of updates before we force a merge.

bq. Also, can't we directly write the stacked segments ourselves? (Ie, within a single IW).
I don't know, it didn't seem likely to me - AFAIK IW operates on a single segment before flushing it? And updates could refer to docs outside the current segment.
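
To illustrate the out-of-order point: if overlay documents get their ids in arrival order, matching them to primary order means computing a permutation like the one below. This is a minimal sketch with a hypothetical OverlayOrder class; it is not how IndexSorter is implemented, it only shows the reordering that would have to be redone on every flush and merge.

```java
import java.util.Arrays;

/** Hypothetical helper: orders overlay docIds by the primary docId they update. */
final class OverlayOrder {
    private OverlayOrder() {}

    /**
     * @param overlayTargets for each overlay docId (arrival order), the primary docId it updates
     * @return overlay docIds sorted by target primary docId; ties keep arrival order
     */
    static int[] sortedByPrimary(int[] overlayTargets) {
        Integer[] ids = new Integer[overlayTargets.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on an object array is stable, so repeated updates to the
        // same primary doc stay in the order they arrived.
        Arrays.sort(ids, (Integer a, Integer b) ->
                Integer.compare(overlayTargets[a], overlayTargets[b]));
        int[] sorted = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            sorted[i] = ids[i];
        }
        return sorted;
    }
}
```

For example, updates arriving for primary docs 5, 2, 9, 2 get overlay ids 0..3, and the order that follows the primary ids is overlay 1, 3, 0, 2.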
                
                  


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220300#comment-13220300 ] 

Andrzej Bialecki  commented on LUCENE-3837:
-------------------------------------------

That was my point, we should be able to come up with estimates that yield "slightly wrong yet consistent" stats. I don't know the details of new similarities, so it's up to you Robert to come up with suggestions :)
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220194#comment-13220194 ] 

Andrzej Bialecki  commented on LUCENE-3837:
-------------------------------------------

Ad 1. I don't think it's such a big deal, we already return approximate stats (too high counts) in the presence of deletes. I think we should go all the way, at least initially, and ignore stats from an overlay completely, unless the data is present only in the overlay - e.g. for terms not present in the main index.

Ad 2. I think that if getArray() is supported then on the first call we have to roll in all updates to the main array created from the primary.
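
A minimal sketch of what "roll in all updates on the first call" could mean, assuming norms are one byte per document. OverlayNorms is a hypothetical class; only the method names hasArray() and getArray() come from the discussion above, and their signatures here are assumptions.

```java
/** Hypothetical norms source that merges overlay norms into one array lazily. */
final class OverlayNorms {
    private final byte[] primaryNorms;   // indexed by primary docId
    private final int[] overlayTargets;  // overlay docId -> primary docId
    private final byte[] overlayNorms;   // indexed by overlay docId
    private byte[] merged;               // built on the first getArray() call

    OverlayNorms(byte[] primaryNorms, int[] overlayTargets, byte[] overlayNorms) {
        this.primaryNorms = primaryNorms;
        this.overlayTargets = overlayTargets;
        this.overlayNorms = overlayNorms;
    }

    boolean hasArray() {
        return true;
    }

    synchronized byte[] getArray() {
        if (merged == null) {
            // Copy the primary array, then overwrite the updated docs.
            byte[] copy = primaryNorms.clone();
            for (int o = 0; o < overlayTargets.length; o++) {
                copy[overlayTargets[o]] = overlayNorms[o];
            }
            merged = copy;
        }
        return merged;
    }
}
```

The cost is one extra array the size of the primary segment per field, paid once on first access.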
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220207#comment-13220207 ] 

Michael McCandless commented on LUCENE-3837:
--------------------------------------------

Could we use the actual docID (ie same docID as the base segment)?  This way we wouldn't need the (possibly large) int[] to remap on each access.  I guess for postings this is OK (we can pass PostingsFormat any docIDs), but for eg stored fields, term vectors, doc values, it's not (they can't handle "sparse" docIDs).

Also, can't we directly write the stacked segments ourselves?  (Ie, within a single IW).

We'd need to extend SegmentInfo(s) to record which segments stack on which, and fix MP to understand stacking (and aggressively target the stacks).
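
On the size of the remap: one possible middle ground, not proposed anywhere in this thread, is a sparse map that stores only the updated docs and uses binary search. The sketch below uses a hypothetical SparseRemap class; it trades the single array read of a dense int[] for an O(log n) lookup, with memory proportional to the number of updates.

```java
import java.util.Arrays;

/** Hypothetical sparse primary-to-overlay docId map. */
final class SparseRemap {
    private final int[] primaryIds;  // sorted ascending, only docs that have updates
    private final int[] overlayIds;  // overlayIds[i] is the overlay doc for primaryIds[i]

    SparseRemap(int[] sortedPrimaryIds, int[] overlayIds) {
        if (sortedPrimaryIds.length != overlayIds.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.primaryIds = sortedPrimaryIds;
        this.overlayIds = overlayIds;
    }

    /** Returns the overlay docId for this primary doc, or -1 if it has no update. */
    int overlayFor(int primaryDocId) {
        int pos = Arrays.binarySearch(primaryIds, primaryDocId);
        return pos >= 0 ? overlayIds[pos] : -1;
    }
}
```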
                


[jira] [Issue Comment Edited] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271484#comment-13271484 ] 

Andrzej Bialecki  edited comment on LUCENE-3837 at 5/10/12 12:54 AM:
---------------------------------------------------------------------

Initial patch. I also created a branch lucene3837 based on the current trunk, so that others may join in.

Edit: note that this is NOT a working implementation yet, it's missing many key pieces. Jump in if you want to help fill in the blanks ;)
                
                  
> A modest proposal for updateable fields
> ---------------------------------------
>
>                 Key: LUCENE-3837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: LUCENE-3837.patch
>
>


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396855#comment-13396855 ] 

Andrzej Bialecki  commented on LUCENE-3837:
-------------------------------------------

It's still quite incomplete, I'm afraid. The patch needs to be updated for the changes in trunk, and then many missing pieces need to be implemented. Still, after discussing this with other developers it looks like the proposed design should work as intended, so it's viable in this sense ...
                


[jira] [Updated] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-3837:
--------------------------------------

    Attachment: LUCENE-3837.patch

Initial patch. I also created a branch, lucene3837, based on the current trunk so that others can join in.
                


[jira] [Assigned] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned LUCENE-3837:
-----------------------------------------

    Assignee: Andrzej Bialecki 
    


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220211#comment-13220211 ] 

Michael McCandless commented on LUCENE-3837:
--------------------------------------------

I think for scoring the "wrong yet consistent stats" approach is good?  (Just like deletes).

So, an update would affect scoring (e.g. after the update the field has 4 occurrences of "python" versus only 1 before, so it now gets a better score), but the scores will not precisely match those I'd get from a full re-index instead of an update.
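
To make the "wrong yet consistent stats" idea concrete, here is a small sketch in plain Java. It assumes a classic TF-IDF shape (square root of the term frequency times a log-based idf); the class name StaleStatsScore and the parameter values in the note below are hypothetical, and nothing here is taken from the patch. The point is only that the per-document term frequency comes from the updated field while the segment-wide statistics stay as they were before the update.

```java
/** Hypothetical sketch: per-document tf reflects the update, segment-wide stats stay stale. */
public class StaleStatsScore {

    /**
     * @param freq     occurrences of the term in the (possibly updated) field
     * @param docFreq  number of documents containing the term, as recorded in the segment
     * @param numDocs  number of documents in the segment
     */
    public static double score(int freq, int docFreq, int numDocs) {
        double tf = Math.sqrt(freq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf;
    }
}
```

With the same stale statistics, score(4, 9, 100) is higher than score(1, 9, 100), so the update does improve the document's score. If a full re-index would have produced different statistics, say a docFreq of 10 instead of 9, the resulting score would differ slightly from the one computed after the update.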
                


[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220315#comment-13220315 ] 

Andrzej Bialecki  commented on LUCENE-3837:
-------------------------------------------

> Could we use the actual docID (ie same docID as the base segment)?
Updates may arrive out of order, so the updates will naturally get different internal IDs (also, if you wanted to use the same ids they would have gaps). I don't know whether the various parts of Lucene can handle out-of-order ids coming from iterators. If we wanted to match the ids early then we would have to sort them, a la IndexSorter, on every flush and on every merge, which seems too costly. So, a re-mapping structure seems like a decent compromise. Yes, it could be large; we could put artificial limits on the number of updates before we do a merge.
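
One possible shape for such a re-mapping structure, sketched in plain Java under stated assumptions: instead of sorting the overlay documents themselves, only the small id table is sorted by primary docId, and lookups use binary search. The class name DocIdRemap is hypothetical, and the sketch assumes each primary document is updated at most once.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of a sparse primary-to-overlay docId map; not a Lucene class. */
public class DocIdRemap {
    private final int[] primaryIds; // sorted ascending
    private final int[] overlayIds; // overlayIds[i] is the overlay doc for primaryIds[i]

    /** @param overlayToPrimary for each overlay docId (in arrival order), the primary docId it updates */
    public DocIdRemap(int[] overlayToPrimary) {
        int n = overlayToPrimary.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort overlay docIds by the primary docId they refer to.
        Arrays.sort(order, Comparator.comparingInt(o -> overlayToPrimary[o]));
        primaryIds = new int[n];
        overlayIds = new int[n];
        for (int i = 0; i < n; i++) {
            overlayIds[i] = order[i];
            primaryIds[i] = overlayToPrimary[order[i]];
        }
    }

    /** Returns the overlay docId for a primary docId, or -1 if the document has no update. */
    public int overlayFor(int primaryId) {
        int idx = Arrays.binarySearch(primaryIds, primaryId);
        return idx >= 0 ? overlayIds[idx] : -1;
    }
}
```

Memory use grows with the number of updates rather than with the size of the segment, which fits the idea of capping the number of updates before forcing a merge. The overlay data itself is still accessed in arrival order, so the random-access cost mentioned in the proposal remains.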

> Also, can't we directly write the stacked segments ourselves? (Ie, within a single IW).
I don't know, it didn't seem likely to me. AFAIK IW operates on a single segment before flushing it? And updates could refer to docs outside the current segment.
                