Posted to java-user@lucene.apache.org by Babak Farhang <fa...@gmail.com> on 2010/01/14 10:23:39 UTC

incremental document field update

Hi,

I've been thinking about how to update a single field of a document
without touching its other fields. This is an old problem and I was
considering a solution along the lines of Andrzej Bialecki's post to
the dev list back in '07:


<quote  http://markmail.org/message/tbkgmnilhvrt6bii >

I have the following scenario: I want to use ParallelReader to
maintain parts of the index that are changing quickly, and where
changes are limited to specific fields only.

Let's say I have a "main" index (many fields, slowly changing, large
updates), and an "aux" index (fast changing, usually single doc and
single field updates). I'd like to "replace" documents in the "aux"
index - that is, delete one doc and add another - but in a way that
doesn't change the internal document numbers, so that I can keep the
mapping required by ParallelReader intact.

I think this is possible to achieve by using a FilterIndexReader,
which keeps a map of updated documents, and re-maps old doc ids to the
new ones on the fly.

From time to time I'd like to optimize the "aux" index to get rid of
deleted docs. At this time I need to figure out how to preserve the
old->new mapping during the optimization.

So, here's the question: is this scenario feasible? If so, then in the
trunk/ version of Lucene, is there any way to figure out (predictably)
how internal document numbers are reassigned after calling optimize()
?

</quote>


Reading that trail, I wish the original poster hadn't given up on his
idea (
http://markmail.org/message/tbkgmnilhvrt6bii#query:+page:1+mid:kn77zpiu43kd2ufn+state:results
)


<quote>
Thanks for the input - for now I gave up on this, after discovering
that I would have no way to ensure in TermDocs.skipTo() that document
id-s are monotonically increasing (which seems to be a part of the
contract).
</quote>

I imagine that if Andrzej's proposed FilterIndexReader maintains two
sorted (ordered) maps, one from internal document-ids to "view"
document-ids, and another from "view" document-ids back to internal
document-ids, then things like skipTo() can be implemented reasonably
efficiently. Only the remapped ids are kept in these structures.
(Note also that a mapped "view" document-id stands in for an
internally deleted document with that id.)
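Concretely, the two maps might look something like this (a rough
sketch only; class and method names are hypothetical, not Lucene API,
and the handling of deleted internal ids is glossed over):

```java
import java.util.TreeMap;

public class DocIdRemapper {
    // internal doc-id -> "view" doc-id
    private final TreeMap<Integer, Integer> internalToView = new TreeMap<Integer, Integer>();
    // "view" doc-id -> internal doc-id
    private final TreeMap<Integer, Integer> viewToInternal = new TreeMap<Integer, Integer>();

    // Record that internal doc internalId is presented as view doc viewId.
    public void map(int internalId, int viewId) {
        internalToView.put(internalId, viewId);
        viewToInternal.put(viewId, internalId);
    }

    // Resolve the internal id to seek to (e.g. for skipTo) for a view id;
    // ids absent from the map pass through unchanged.
    public int toInternal(int viewId) {
        Integer mapped = viewToInternal.get(viewId);
        return mapped != null ? mapped.intValue() : viewId;
    }

    // Resolve the view id to report for an internal id.
    public int toView(int internalId) {
        Integer mapped = internalToView.get(internalId);
        return mapped != null ? mapped.intValue() : internalId;
    }

    public static void main(String[] args) {
        DocIdRemapper r = new DocIdRemapper();
        r.map(100, 7);              // internal doc 100 appears as view doc 7
        assert r.toView(100) == 7;
        assert r.toInternal(7) == 100;
        assert r.toView(42) == 42;  // only updated docs cost an entry
        System.out.println("ok");
    }
}
```

Since only remapped ids live in the maps, memory cost is proportional
to the number of updated docs, not the index size.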

And if we can find a way to merge the segments of this "aux" index
whenever the segments of its associated "main" index are merged or
optimized (so that the [internal] doc-ids in the merged aux index end
up in sync with those of the main index), then there shouldn't be all
that many doc-ids to map anyway (if we merge frequently enough).

So to go back to Andrzej's question: is there any way to figure out
(predictably) how internal document numbers [in the main index] are
reassigned after calling optimize()? And how does LUCENE-847, which
Doug Cutting suggests in that trail, help?

Sorry if that was long-winded; had to start somewhere ;)

-Babak

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: incremental document field update

Posted by Babak Farhang <fa...@gmail.com>.
> OK; this approach (modifying an already written & possibly in-use (by
> an IndexReader) file) would be problematic for Lucene...

If you have N slots, there would have to be N-1 commits plus an Nth
commit in progress while reading the "entry-count block" for there to
be any possibility of a bad read. Make N large enough (max 256), and
that should close the window, I think.

Anyway, just want to thank you, Mike, for sharing your thoughts and
ideas. Time to try some of them.

Cheers,
-Babak

On Wed, Jan 20, 2010 at 3:41 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Tue, Jan 19, 2010 at 10:45 PM, Babak Farhang <fa...@gmail.com> wrote:
>>> I see -- so your file format allows you to append to the same file
>>> without affecting prior readers?  We never do that in Lucene today
>>> (all files are "write once").
>>
>> Yes. For the most part it only appends. The exception is when the
>> log's entry count is updated (when the appends actually "commit").
>> That count is written into one of many (>= 2 and <= 257) slots in round
>> robin fashion and finally a byte-size "keystone" is updated to determine
>> the active slot. The idea is that by flushing the file just before and just
>> after the keystone byte update we ensure persistence (assumes writing
>> one byte is always all-or-nothing). Increasing the number of slots
>> the count is written to narrows the chance of a bad read..
>
> OK; this approach (modifying an already written & possibly in-use (by
> an IndexReader) file) would be problematic for Lucene...
>
>>> Good questions (terms dict & stored fields) -- I think what we'd do is
>>> simply write a new segment, so stored fields, term vectors, postings,
>>> terms dict, etc., are all private to that new segment.  But, we'd mark
>>> the segment as being a delta segment, referencing the original
>>> segment, and we'd remap the docIDs when flushing that segment.
>>
>> The .fdx and .tvx files use fixed-width tables, so it seems if there are
>> large gaps in the updated doc-ids, we'd have to fill in "null" rows for
>> those gaps.  Or do you have another data-structure in mind for these
>> .*x files?
>
> Yeah, good point; we'd have to allow for a sparse index storage for fdx/tvx.
>
>> One solution may be to combine the approaches we described: maintain
>> doc-id mappings for the .fdx and .tvx files for per-document
>> data at search time, and index-time mapped doc-ids for the posting
>> lists.
>
> True, we could do both...
>
> Mike
>



Re: incremental document field update

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Jan 19, 2010 at 10:45 PM, Babak Farhang <fa...@gmail.com> wrote:
>> I see -- so your file format allows you to append to the same file
>> without affecting prior readers?  We never do that in Lucene today
>> (all files are "write once").
>
> Yes. For the most part it only appends. The exception is when the
> log's entry count is updated (when the appends actually "commit").
> That count is written into one of many (>= 2 and <= 257) slots in round
> robin fashion and finally a byte-size "keystone" is updated to determine
> the active slot. The idea is that by flushing the file just before and just
> after the keystone byte update we ensure persistence (assumes writing
> one byte is always all-or-nothing). Increasing the number of slots
> the count is written to narrows the chance of a bad read..

OK; this approach (modifying an already written & possibly in-use (by
an IndexReader) file) would be problematic for Lucene...

>> Good questions (terms dict & stored fields) -- I think what we'd do is
>> simply write a new segment, so stored fields, term vectors, postings,
>> terms dict, etc., are all private to that new segment.  But, we'd mark
>> the segment as being a delta segment, referencing the original
>> segment, and we'd remap the docIDs when flushing that segment.
>
> The .fdx and .tvx files use fixed-width tables, so it seems if there are
> large gaps in the updated doc-ids, we'd have to fill in "null" rows for
> those gaps.  Or do you have another data-structure in mind for these
> .*x files?

Yeah, good point; we'd have to allow for a sparse index storage for fdx/tvx.

> One solution may be to combine the approaches we described: maintain
> doc-id mappings for the .fdx and .tvx files for per-document
> data at search time, and index-time mapped doc-ids for the posting
> lists.

True, we could do both...

Mike



Re: incremental document field update

Posted by Babak Farhang <fa...@gmail.com>.
> I see -- so your file format allows you to append to the same file
> without affecting prior readers?  We never do that in Lucene today
> (all files are "write once").

Yes. For the most part it only appends. The exception is when the
log's entry count is updated (when the appends actually "commit").
That count is written into one of many (>= 2 and <= 257) slots in round
robin fashion and finally a byte-size "keystone" is updated to determine
the active slot. The idea is that by flushing the file just before and just
after the keystone byte update we ensure persistence (assumes writing
one byte is always all-or-nothing). Increasing the number of slots
the count is written to narrows the chance of a bad read..
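Sketched in plain Java (a hypothetical in-memory stand-in for the
file-backed layout; the actual flushes and byte I/O are elided):

```java
public class KeystoneLog {
    private final long[] slots;  // the entry count is written round-robin here
    private int keystone;        // the one-byte field naming the active slot

    public KeystoneLog(int slotCount) {
        this.slots = new long[slotCount];
        this.keystone = 0;
    }

    // Commit a new entry count: write it into the next slot (flush #1
    // would go here in the file-backed version), then flip the keystone
    // (flush #2). The keystone update is the single-byte atomic switch.
    public void commit(long entryCount) {
        int next = (keystone + 1) % slots.length;
        slots[next] = entryCount;
        keystone = next;
    }

    // A reader trusts only the slot the keystone currently names, so it
    // sees either the old or the new count, never a torn value.
    public long read() {
        return slots[keystone];
    }

    public static void main(String[] args) {
        KeystoneLog log = new KeystoneLog(4);
        log.commit(10);
        assert log.read() == 10;
        log.commit(25);
        assert log.read() == 25;
        System.out.println("ok");
    }
}
```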

> Good questions (terms dict & stored fields) -- I think what we'd do is
> simply write a new segment, so stored fields, term vectors, postings,
> terms dict, etc., are all private to that new segment.  But, we'd mark
> the segment as being a delta segment, referencing the original
> segment, and we'd remap the docIDs when flushing that segment.

The .fdx and .tvx files use fixed-width tables, so it seems if there are
large gaps in the updated doc-ids, we'd have to fill in "null" rows for
those gaps.  Or do you have another data-structure in mind for these
.*x files?

One solution may be to combine the approaches we described: maintain
doc-id mappings for the .fdx and .tvx files for per-document
data at search time, and index-time mapped doc-ids for the posting
lists.

-Babak

On Tue, Jan 19, 2010 at 3:48 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang <fa...@gmail.com> wrote:
>>> This is about multiple sessions with the writer.  Ie, open writer,
>>> update a few docs, close.  Do the same again, but, that 2nd session
>>> cannot overwrite the same files from the first one, since readers may
>>> have those files open.  The "write once" model gives Lucene its
>>> transactional semantics, too -- we can keep multiple commits in the
>>> index, open an arbitrary commit, rollback, etc.
>>
>> Ahh, I see. We're worried about the read-consistency of the open readers
>> while there is a concurrent writer. I think that's already taken care of with
>> the fact that under the hood we are only appending data into the logical
>> segment.  When a reader first loads one of our slave segments, it first
>> loads into memory the number of entries in the write-ahead log of mapped
>> doc-ids before reading that number of entries from the log. It could note
>> the max unmapped document number represented in the last entry and
>> discard any document numbers above that max in the postings.
>
> I see -- so your file format allows you to append to the same file
> without affecting prior readers?  We never do that in Lucene today
> (all files are "write once").
>
>>> I think both approaches are O(total size of updates) in indexing cost,
>>> assuming each writes only the new changes, to a new generation
>>> filename.
>>>
>>> But I think at search time the "remap during indexing" approach should be
>>> simpler, since you "just" have to OR together N posting lists... and
>>> performance should be better (by a constant factor) since there's no
>>> remapping.
>>
>> I think I see how this approach can work for indexed-only fields. I imagine
>> we also need a parallel dictionary for these mapped postings lists in order
>> to deal with new terms encountered during the update. Not sure how this
>> would work. Can you elaborate?
>>
>> And how would we deal with updated stored fields?
>
> Good questions (terms dict & stored fields) -- I think what we'd do is
> simply write a new segment, so stored fields, term vectors, postings,
> terms dict, etc., are all private to that new segment.  But, we'd mark
> the segment as being a delta segment, referencing the original
> segment, and we'd remap the docIDs when flushing that segment.  Then
> at search time, a single SegmentReader would open its primary segment
> plus any series of delta segments, merging them.
>
> So eg StoredFieldsReader would first load the doc from the primary
> segment, then go and load any deltas, overwriting the field values.
>
> Mike
>



Re: incremental document field update

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang <fa...@gmail.com> wrote:
>> This is about multiple sessions with the writer.  Ie, open writer,
>> update a few docs, close.  Do the same again, but, that 2nd session
>> cannot overwrite the same files from the first one, since readers may
>> have those files open.  The "write once" model gives Lucene its
>> transactional semantics, too -- we can keep multiple commits in the
>> index, open an arbitrary commit, rollback, etc.
>
> Ahh, I see. We're worried about the read-consistency of the open readers
> while there is a concurrent writer. I think that's already taken care of with
> the fact that under the hood we are only appending data into the logical
> segment.  When a reader first loads one of our slave segments, it first
> loads into memory the number of entries in the write-ahead log of mapped
> doc-ids before reading that number of entries from the log. It could note
> the max unmapped document number represented in the last entry and
> discard any document numbers above that max in the postings.

I see -- so your file format allows you to append to the same file
without affecting prior readers?  We never do that in Lucene today
(all files are "write once").

>> I think both approaches are O(total size of updates) in indexing cost,
>> assuming each writes only the new changes, to a new generation
>> filename.
>>
>> But I think at search time the "remap during indexing" approach should be
>> simpler, since you "just" have to OR together N posting lists... and
>> performance should be better (by a constant factor) since there's no
>> remapping.
>
> I think I see how this approach can work for indexed-only fields. I imagine
> we also need a parallel dictionary for these mapped postings lists in order
> to deal with new terms encountered during the update. Not sure how this
> would work. Can you elaborate?
>
> And how would we deal with updated stored fields?

Good questions (terms dict & stored fields) -- I think what we'd do is
simply write a new segment, so stored fields, term vectors, postings,
terms dict, etc., are all private to that new segment.  But, we'd mark
the segment as being a delta segment, referencing the original
segment, and we'd remap the docIDs when flushing that segment.  Then
at search time, a single SegmentReader would open its primary segment
plus any series of delta segments, merging them.

So eg StoredFieldsReader would first load the doc from the primary
segment, then go and load any deltas, overwriting the field values.
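That overlay semantics could be sketched like so (plain maps standing
in for the hypothetical primary and delta segment readers; not actual
StoredFieldsReader code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeltaFieldsReader {
    // Load a doc's stored fields: start from the primary segment, then
    // apply delta segments oldest-to-newest, newer values overwriting.
    public static Map<String, String> document(Map<String, String> primary,
                                               List<Map<String, String>> deltas) {
        Map<String, String> doc = new HashMap<String, String>(primary);
        for (Map<String, String> delta : deltas) {
            doc.putAll(delta);  // fields present in the delta win
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> primary = new HashMap<String, String>();
        primary.put("title", "old title");
        primary.put("body", "unchanged body");
        Map<String, String> delta = new HashMap<String, String>();
        delta.put("title", "new title");  // only this field was updated

        Map<String, String> doc = document(primary, Arrays.asList(delta));
        assert doc.get("title").equals("new title");
        assert doc.get("body").equals("unchanged body");
        System.out.println("ok");
    }
}
```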

Mike



Re: incremental document field update

Posted by Babak Farhang <fa...@gmail.com>.
> This is about multiple sessions with the writer.  Ie, open writer,
> update a few docs, close.  Do the same again, but, that 2nd session
> cannot overwrite the same files from the first one, since readers may
> have those files open.  The "write once" model gives Lucene its
> transactional semantics, too -- we can keep multiple commits in the
> index, open an arbitrary commit, rollback, etc.

Ahh, I see. We're worried about the read-consistency of the open readers
while there is a concurrent writer. I think that's already taken care of with
the fact that under the hood we are only appending data into the logical
segment.  When a reader first loads one of our slave segments, it first
loads into memory the number of entries in the write-ahead log of mapped
doc-ids before reading that number of entries from the log. It could note
the max unmapped document number represented in the last entry and
discard any document numbers above that max in the postings.

> I think both approaches are O(total size of updates) in indexing cost,
> assuming each writes only the new changes, to a new generation
> filename.
>
> But I think at search time the "remap during indexing" approach should be
> simpler, since you "just" have to OR together N posting lists... and
> performance should be better (by a constant factor) since there's no
> remapping.

I think I see how this approach can work for indexed-only fields. I imagine
we also need a parallel dictionary for these mapped postings lists in order
to deal with new terms encountered during the update. Not sure how this
would work. Can you elaborate?

And how would we deal with updated stored fields?

-Babak

On Mon, Jan 18, 2010 at 4:42 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Mon, Jan 18, 2010 at 12:35 AM, Babak Farhang <fa...@gmail.com> wrote:
>
>>> Got it.  You'd presumably have to add a generation to this file, so
>>> that multiple sessions of updates + commit write to private files
>>> ("write once")?  And then the reader merges all of them.
>>
>> Actually, I hadn't considered concurrent update/commit semantics;
>> I was thinking more along the lines of a single writer / many readers model..
>> Concurrent update is an altogether different beast than concurrent
>> add semantics, and I was hoping to avoid complexity.
>
> This isn't about concurrency (we should just keep single writer +
> multiple readers model).
>
> This is about multiple sessions with the writer.  Ie, open writer,
> update a few docs, close.  Do the same again, but, that 2nd session
> cannot overwrite the same files from the first one, since readers may
> have those files open.  The "write once" model gives Lucene its
> transactional semantics, too -- we can keep multiple commits in the
> index, open an arbitrary commit, rollback, etc.
>
> Each newly written file only has the updates from that one session,
> so write cost is in proportion to how much updating you did in that
> one session.
>
>>> It sounds like this'd require the full posting list for a given term X
>>> to be loaded into RAM, mapped, and sorted, before the caller can
>>> actually pull the first docID, right?  I guess often this is likely an
>>> acceptable cost, since "usually" the fields you update are smallish.
>>
>> I was thinking, for large posting lists, we can avoid loading the entire
>> list into RAM in most cases. Given that we have a sorted set of mapped
>> surrogate doc-ids (the key-set in the TreeMap), we should be able to
>> search any surrogate doc-ids that need to be mapped in the underlying
>> posting list using judicious calls to skipTo() on the underlying TermDocs.
>> (This is just like an efficient implementation of finding the intersection
>> of 2 sorted arrays of integers: the performance is linear in the size of
>> the shortest array, and often, not all the elements of the shorter array
>> need even be examined.)
>>
>> So when the
>> [Filter]TermDocs instance is first positioned to a term, it first gathers
>> the to-be-mapped (surrogate) doc-ids in a sorted list (hopefully a small
>> subset of postings list), then maps these to their "view" ids in a separate
>> list, and finally sorts this list of mapped "view" ids. So one list contains
>> the (sorted) doc-ids that must be removed from the underlying posting
>> list, and another contains the (sorted) doc-ids that must be interleaved
>> into the postings.  Using these two lists, we should be able to implement
>> a proper view of the postings.
>
> Ahhh I see.  Nice!
>
>>> Thinking out loud... I wonder if, instead of deferring the mapping
>>> until search time (which is costly since every search must re-do it),
>>> we could do the mapping at indexing time?
>>>
>>> Say we add an IndexWriter.updateDocumentFields(Term, Document) method,
>>
>> This method name/signature/semantics is much better than
>> the one I proposed..
>>
>>> which will update only the specified fields.  The Term must uniquely
>>> refer to 1 existing doc in the index.  On flush, we resolve that Term
>>> -> docID (we do this already, to apply deletions), and then we write
>>> to a new set of "delta postings" files using the correct docID.  Then
>>> the reader would have to merge these postings files when reading.
>>
>> [Thinking out loud too] Interesting. So instead of rewriting the entire
>> slave segment on update we rewrite things like the "delta postings"
>> files on committing an update?
>>
>> I'm thinking this approach would probably best work on "batch"
>> update/commits.
>
> It should work well for single doc updates as well, ie, its net
> indexing cost is also in proportion to the number of updated fields.
> It's just that it does the remapping once (during indexing) instead of
> during searching.
>
>> If on the other hand the use case is a sequence of N single-doc
>> update/commits, then the total cost of the N updates would likely
>> approach O(N**2) -- <N rewrites> * <avg delta postings file size = O(N)>
>> So as ever, there are tradeoffs.
>
> I think both approaches are O(total size of updates) in indexing cost,
> assuming each writes only the new changes, to a new generation
> filename.
>
> But I think at search time the "remap during indexing" approach should be
> simpler, since you "just" have to OR together N posting lists... and
> performance should be better (by a constant factor) since there's no
> remapping.
>
> Mike
>



Re: incremental document field update

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jan 18, 2010 at 12:35 AM, Babak Farhang <fa...@gmail.com> wrote:

>> Got it.  You'd presumably have to add a generation to this file, so
>> that multiple sessions of updates + commit write to private files
>> ("write once")?  And then the reader merges all of them.
>
> Actually, I hadn't considered concurrent update/commit semantics;
> I was thinking more along the lines of a single writer / many readers model..
> Concurrent update is an altogether different beast than concurrent
> add semantics, and I was hoping to avoid complexity.

This isn't about concurrency (we should just keep single writer +
multiple readers model).

This is about multiple sessions with the writer.  Ie, open writer,
update a few docs, close.  Do the same again, but, that 2nd session
cannot overwrite the same files from the first one, since readers may
have those files open.  The "write once" model gives Lucene its
transactional semantics, too -- we can keep multiple commits in the
index, open an arbitrary commit, rollback, etc.

Each newly written file only has the updates from that one session,
so write cost is in proportion to how much updating you did in that
one session.

>> It sounds like this'd require the full posting list for a given term X
>> to be loaded into RAM, mapped, and sorted, before the caller can
>> actually pull the first docID, right?  I guess often this is likely an
>> acceptable cost, since "usually" the fields you update are smallish.
>
> I was thinking, for large posting lists, we can avoid loading the entire
> list into RAM in most cases. Given that we have a sorted set of mapped
> surrogate doc-ids (the key-set in the TreeMap), we should be able to
> search any surrogate doc-ids that need to be mapped in the underlying
> posting list using judicious calls to skipTo() on the underlying TermDocs.
> (This is just like an efficient implementation of finding the intersection
> of 2 sorted arrays of integers: the performance is linear in the size of
> the shortest array, and often, not all the elements of the shorter array
> need even be examined.)
>
> So when the
> [Filter]TermDocs instance is first positioned to a term, it first gathers
> the to-be-mapped (surrogate) doc-ids in a sorted list (hopefully a small
> subset of postings list), then maps these to their "view" ids in a separate
> list, and finally sorts this list of mapped "view" ids. So one list contains
> the (sorted) doc-ids that must be removed from the underlying posting
> list, and another contains the (sorted) doc-ids that must be interleaved
> into the postings.  Using these two lists, we should be able to implement
> a proper view of the postings.

Ahhh I see.  Nice!

>> Thinking out loud... I wonder if, instead of deferring the mapping
>> until search time (which is costly since every search must re-do it),
>> we could do the mapping at indexing time?
>>
>> Say we add an IndexWriter.updateDocumentFields(Term, Document) method,
>
> This method name/signature/semantics is much better than
> the one I proposed..
>
>> which will update only the specified fields.  The Term must uniquely
>> refer to 1 existing doc in the index.  On flush, we resolve that Term
>> -> docID (we do this already, to apply deletions), and then we write
>> to a new set of "delta postings" files using the correct docID.  Then
>> the reader would have to merge these postings files when reading.
>
> [Thinking out loud too] Interesting. So instead of rewriting the entire
> slave segment on update we rewrite things like the "delta postings"
> files on committing an update?
>
> I'm thinking this approach would probably best work on "batch"
> update/commits.

It should work well for single doc updates as well, ie, its net
indexing cost is also in proportion to the number of updated fields.
It's just that it does the remapping once (during indexing) instead of
during searching.

> If on the other hand the use case is a sequence of N single-doc
> update/commits, then the total cost of the N updates would likely
> approach O(N**2) -- <N rewrites> * <avg delta postings file size = O(N)>
> So as ever, there are tradeoffs.

I think both approaches are O(total size of updates) in indexing cost,
assuming each writes only the new changes, to a new generation
filename.

But I think at search time the "remap during indexing" approach should be
simpler, since you "just" have to OR together N posting lists... and
performance should be better (by a constant factor) since there's no
remapping.

Mike



Re: incremental document field update

Posted by Babak Farhang <fa...@gmail.com>.
> Got it.  You'd presumably have to add a generation to this file, so
> that multiple sessions of updates + commit write to private files
> ("write once")?  And then the reader merges all of them.

Actually, I hadn't considered concurrent update/commit semantics;
I was thinking more along the lines of a single writer / many readers model..
Concurrent update is an altogether different beast than concurrent
add semantics, and I was hoping to avoid complexity.


> of RAM, especially the Integer overhead.  Though, these will be sparse
> right?  So it only uses RAM for docs that have updates, and when
> segment merging hasn't merged them away.

Yes.

> It sounds like this'd require the full posting list for a given term X
> to be loaded into RAM, mapped, and sorted, before the caller can
> actually pull the first docID, right?  I guess often this is likely an
> acceptable cost, since "usually" the fields you update are smallish.

I was thinking, for large posting lists, we can avoid loading the entire
list into RAM in most cases. Given that we have a sorted set of mapped
surrogate doc-ids (the key-set in the TreeMap), we should be able to
search any surrogate doc-ids that need to be mapped in the underlying
posting list using judicious calls to skipTo() on the underlying TermDocs.
(This is just like an efficient implementation of finding the intersection
of 2 sorted arrays of integers: the performance is linear in the size of
the shortest array, and often, not all the elements of the shorter array
need even be examined.)

So when the
[Filter]TermDocs instance is first positioned to a term, it first gathers
the to-be-mapped (surrogate) doc-ids in a sorted list (hopefully a small
subset of postings list), then maps these to their "view" ids in a separate
list, and finally sorts this list of mapped "view" ids. So one list contains
the (sorted) doc-ids that must be removed from the underlying posting
list, and another contains the (sorted) doc-ids that must be interleaved
into the postings.  Using these two lists, we should be able to implement
a proper view of the postings.
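The leap-frog intersection above could be sketched like this (sorted
int arrays standing in for TermDocs; names hypothetical, not Lucene
API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PostingIntersect {
    // Both inputs are sorted ascending. Returns the surrogate ids that
    // actually occur in the posting list, probing with binary search
    // rather than scanning the whole list.
    public static List<Integer> intersect(int[] postings, int[] surrogates) {
        List<Integer> hits = new ArrayList<Integer>();
        int pos = 0;  // resume point in the posting list
        for (int target : surrogates) {
            if (pos >= postings.length) break;
            // "skipTo(target)": find the first posting >= target
            int idx = Arrays.binarySearch(postings, pos, postings.length, target);
            if (idx >= 0) {
                hits.add(target);
                pos = idx + 1;
            } else {
                pos = -idx - 1;  // insertion point; keep going from here
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> hits = intersect(new int[] {1, 5, 9, 12, 20},
                                       new int[] {5, 10, 12, 25});
        assert hits.equals(Arrays.asList(5, 12));
        System.out.println("ok");
    }
}
```

As noted above, the cost is governed by the shorter (surrogate) list,
and postings past the last surrogate id are never touched.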


> Thinking out loud... I wonder if, instead of deferring the mapping
> until search time (which is costly since every search must re-do it),
> we could do the mapping at indexing time?
>
> Say we add an IndexWriter.updateDocumentFields(Term, Document) method,

This method name/signature/semantics is much better than
the one I proposed..

> which will update only the specified fields.  The Term must uniquely
> refer to 1 existing doc in the index.  On flush, we resolve that Term
> -> docID (we do this already, to apply deletions), and then we write
> to a new set of "delta postings" files using the correct docID.  Then
> the reader would have to merge these postings files when reading.

[Thinking out loud too] Interesting. So instead of rewriting the entire
slave segment on update we rewrite things like the "delta postings"
files on committing an update?

I'm thinking this approach would probably best work on "batch"
update/commits.

If on the other hand the use case is a sequence of N single-doc
update/commits, then the total cost of the N updates would likely
approach O(N**2) -- <N rewrites> * <avg delta postings file size = O(N)>
So as ever, there are tradeoffs.

-Babak



On Sun, Jan 17, 2010 at 6:39 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Sun, Jan 17, 2010 at 7:45 AM, Babak Farhang <fa...@gmail.com> wrote:
>>> So the idea is, I can change the field for only a few docs in a
>>> massive index, and the amount of "work" done, and bytes written, is in
>>> proportion only to how many docs were changed?
>>
>> Exactly. We append auxiliary data to the parallel segment and
>> delay rewriting the segment to when it'll be merged with another segment.
>
> OK.
>
>>> What would the actual data structure look like on disk?
>>
>> Something like a surrogate docs log file
>>
>> <rowCount><<int><int>>**rowCount
>>
>> a fixed-width table, each row mapping a deleted doc-id to its "surrogate"
>> doc-id. The mapping entries are not sorted; it's a log file. If 2
>> entries reference the
>> same deleted doc-id, the last one wins. The <rowCount> header makes the
>> structure self-delimiting, and ensures clean writes (perhaps using a byte-size
>> keystone).
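To make that layout concrete, here is a minimal sketch of writing and replaying such a log in plain Java. This is hypothetical illustration code, not Lucene code; the class and method names are invented, and it assumes the proposed <rowCount><<int><int>>**rowCount layout with last-entry-wins semantics:

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed "surrogate docs log": an int rowCount
// header followed by rowCount (deletedDocId, surrogateDocId) int pairs.
// On replay, a later entry for the same deleted doc-id wins.
public class SurrogateLog {

    // Serialize the (deletedId, surrogateId) entries in append order.
    public static byte[] write(int[][] rows) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * rows.length);
        buf.putInt(rows.length);              // self-delimiting header
        for (int[] row : rows) {
            buf.putInt(row[0]);               // deleted doc-id
            buf.putInt(row[1]);               // surrogate doc-id
        }
        return buf.array();
    }

    // Replay the log into a map; the last entry for a deleted doc-id wins.
    public static Map<Integer, Integer> replay(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int rowCount = buf.getInt();
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < rowCount; i++) {
            int deleted = buf.getInt();
            int surrogate = buf.getInt();
            map.put(deleted, surrogate);      // overwrites any earlier entry
        }
        return map;
    }
}
```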
>
> Got it.  You'd presumably have to add a generation to this file, so
> that multiple sessions of updates + commit write to private files
> ("write once")?  And then the reader merges all of them.
>
>>> And what
>> would be loaded into RAM such that e.g. enumerating a postings list
>>> would work correctly?  You'd have to remap docIDs on the fly, right?
>>
>> Yes, on the fly. I think it boils down to 3 tasks:
>>
>> 1. Map "view" doc-ids to surrogate doc-ids. Supports input, e.g. referring
>>  to a doc by id.
>> 2. Map surrogate doc-ids to "view" doc-ids. Supports output.
>> 3. Re-sort postings quickly on output, and provide correct
>> TermDocs.skipTo() semantics.
>>
>> For (1), I think an unordered HashMap<Integer,Integer> is good enough.
>> For (2) and (3), something like a TreeMap<Integer,Integer> I imagine
>> should be good enough too.  The sorted order of its keys (surrogate
>> doc-ids) should also allow us to intersect the set of surrogate docs
>> quickly with the already sorted outbound postings data, for example.
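A rough sketch of those three tasks using exactly those two map types (class and method names are hypothetical, not Lucene APIs; assumes physical postings carry surrogate doc-ids for updated docs):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch: "view" doc-ids are the stable ids callers see; "surrogate" ids are
// the physical ids of the re-added docs. Built by replaying the log.
public class DocIdRemapper {
    private final Map<Integer, Integer> viewToSurrogate = new HashMap<>();
    private final TreeMap<Integer, Integer> surrogateToView = new TreeMap<>();

    // Replaying one log entry updates both directions.
    public void record(int deletedViewId, int surrogateId) {
        viewToSurrogate.put(deletedViewId, surrogateId);
        surrogateToView.put(surrogateId, deletedViewId);
    }

    // Task (1): input side -- resolve a caller-visible id to a physical id.
    public int toPhysical(int viewId) {
        return viewToSurrogate.getOrDefault(viewId, viewId);
    }

    // Tasks (2)+(3): map physical ids back to view ids on output, then
    // re-sort, since the mapping can reorder ids.
    public int[] remapPostings(int[] physicalDocs) {
        int[] out = new int[physicalDocs.length];
        for (int i = 0; i < physicalDocs.length; i++) {
            out[i] = surrogateToView.getOrDefault(physicalDocs[i], physicalDocs[i]);
        }
        Arrays.sort(out);
        return out;
    }
}
```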
>
> OK, though for an index with many documents, that's gonna use up a lot
> of RAM, especially the Integer overhead.  Though, these will be sparse,
> right?  So it only uses RAM for docs that have updates, and when
> segment merging hasn't merged them away.
>
>> Both maps would be built by replaying the surrogate docs log file.
>
> It sounds like this'd require the full posting list for a given term X
> to be loaded into RAM, mapped, and sorted, before the caller can
> actually pull the first docID, right?  I guess often this is likely an
> acceptable cost, since "usually" the fields you update are smallish.
>
> Thinking out loud... I wonder if, instead of deferring the mapping
> until search time (which is costly since every search must re-do it),
> we could do the mapping at indexing time?
>
> Say we add an IndexWriter.updateDocumentFields(Term, Document) method,
> which will update only the specified fields.  The Term must uniquely
> refer to 1 existing doc in the index.  On flush, we resolve that Term
> -> docID (we do this already, to apply deletions), and then we write
> to a new set of "delta postings" files using the correct docID.  Then
> the reader would have to merge these postings files when reading.
> We'd also have generations...
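The read-side merge that proposal implies might look roughly like this. Note this is only a sketch: updateDocumentFields is a proposed API, not an existing one, and real postings carry freqs/positions, not just docIDs. Where base and delta postings contain the same docID, the delta entry wins:

```java
import java.util.Arrays;

// Sketch of merging a term's base postings with its "delta postings",
// both sorted by docID, into one sorted stream on the fly.
public class DeltaPostingsMerge {
    public static int[] merge(int[] base, int[] delta) {
        int[] out = new int[base.length + delta.length];
        int i = 0, j = 0, n = 0;
        while (i < base.length && j < delta.length) {
            if (base[i] < delta[j])       out[n++] = base[i++];
            else if (base[i] > delta[j])  out[n++] = delta[j++];
            else { out[n++] = delta[j++]; i++; }   // same doc: delta wins
        }
        while (i < base.length)  out[n++] = base[i++];
        while (j < delta.length) out[n++] = delta[j++];
        return Arrays.copyOf(out, n);
    }
}
```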
>
>> Maybe I should first try to bang out a prototype that works at the
>> index level, not the
>> segment level?
>
> Sounds great!
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: incremental document field update

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Jan 17, 2010 at 4:33 AM, Babak Farhang <fa...@gmail.com> wrote:
> Thanks Mike!  This is pretty cool..
>
> So LUCENE-1879 takes care of aligning (syncing) doc-ids across
> parallel index / segment merges. Missing is the machinery for
> updating a field (or fields) in a parallel slave index: to do this the
> appropriate segment in the slave index must somehow be rewritten.
> (right?)

I think you're right -- though I haven't looked closely in a while.  I
think under LUCENE-1879, the entire secondary index must be fully
rewritten whenever any changes are made.  But the thinking is that the
secondary index has smallish fields so this might be OK (but I agree,
does not scale up well).

> In order to support this feature, I'm thinking one could implement
> an UpdateableSlaveIndexWriter / UpdateableSlaveIndexReader
> pair. UpdateableSlaveIndexWriter is an IndexWriter with a special
> updateField[s] method that uses a (I think, per-segment, not sure)
> write-ahead log to map a to-be-deleted doc-id into a newly created
> doc-id.  It also maintains other statistics such as a virtual maxDocId.
> The UpdateableSlaveIndexReader would play back this log file (if
> any) and maintain the mappings in-memory.
>
> I imagine there would also have to be a UpdateableSlaveIndexReader
> variant that works on individual slave index segments rather than the
> entire collection of segments that comprise the slave index. This, say,
> UpdateableSlaveIndexSegReader would be used in lieu of the
> SegmentReader instances currently used for merging segments.

This sounds very interesting!

So the idea is, I can change the field for only a few docs in a
massive index, and the amount of "work" done, and bytes written, is in
proportion only to how many docs were changed?

What would the actual data structure look like on disk?  And what
would be loaded into RAM such that e.g. enumerating a postings list
would work correctly?  You'd have to remap docIDs on the fly, right?

> The SegmentMerger class appears to be sufficiently agnostic about the
> exact type of the IndexReader instances that represent the inputs to the
> merge (though it does favor SegmentReader subclasses). So far so
> good.  However, there appear to be a number of other junctures in the
> merge code that depend on the static SegmentReader.get methods.
> These would have to be indirected through some factory mechanism
> so that we could return our UpdateableSlaveIndexSegReader.

Yeah this has been a growing problem... or at least others have bumped
up against this.  We are wanting to factor out the ReaderPool
(currently embedded in IndexWriter) to a public standalone class that
would be the "standard" factory for getting new SegmentReaders.  I
think SegmentMerger would then use this, and apps could specify their
own ReaderPool impl to produce whichever impl they wanted.  This is
being tracked (well, only started at this point) in LUCENE-2026.

> Is the approach I'm describing sensible? Or did I get it all wrong? (This
> was my first look at this code, so it is entirely possible that I'm making
> a fool of myself ;)

This sounds great so far!  Individually updateable fields would be an
awesome addition to Lucene (it's often asked about).

Mike

Re: incremental document field update

Posted by Michael McCandless <lu...@mikemccandless.com>.
Parallel incremental indexing
(http://issues.apache.org/jira/browse/LUCENE-1879) is one way to solve
this.

Mike

On Thu, Jan 14, 2010 at 4:27 AM, Babak Farhang <fa...@gmail.com> wrote:
>> Reading that trail, I wish the original poster gave up on his idea (
>
> Err, that should have read..
>
> "Reading that trail, I wish the original poster hadn't given up on his idea"
>
>
> On Thu, Jan 14, 2010 at 2:23 AM, Babak Farhang <fa...@gmail.com> wrote:
>> Hi,
>>
>> I've been thinking about how to update a single field of a document
>> without touching its other fields. This is an old problem and I was
>> considering a solution along the lines of Andrzej Bialecki's post to
>> the dev list back in '07:
>>
>>
>> <quote  http://markmail.org/message/tbkgmnilhvrt6bii >
>>
>> I have the following scenario: I want to use ParallelReader to
>> maintain parts of the index that are changing quickly, and where
>> changes are limited to specific fields only.
>>
>> Let's say I have a "main" index (many fields, slowly changing, large
>> updates), and an "aux" index (fast changing, usually single doc and
>> single field updates). I'd like to "replace" documents in the "aux"
>> index - that is, delete one doc and add another - but in a way that
>> doesn't change the internal document numbers, so that I can keep the
>> mapping required by ParallelReader intact.
>>
>> I think this is possible to achieve by using a FilterIndexReader,
>> which keeps a map of updated documents, and re-maps old doc ids to the
>> new ones on the fly.
>>
>> From time to time I'd like to optimize the "aux" index to get rid of
>> deleted docs. At this time I need to figure out how to preserve the
>> old->new mapping during the optimization.
>>
>> So, here's the question: is this scenario feasible? If so, then in the
>> trunk/ version of Lucene, is there any way to figure out (predictably)
>> how internal document numbers are reassigned after calling optimize()
>> ?
>>
>> </quote>
>>
>>
>> Reading that trail, I wish the original poster gave up on his idea (
>> http://markmail.org/message/tbkgmnilhvrt6bii#query:+page:1+mid:kn77zpiu43kd2ufn+state:results
>> )
>>
>>
>> <quote>
>> Thanks for the input - for now I gave up on this, after discovering
>> that I would have no way to ensure in TermDocs.skipTo() that document
>> id-s are monotonically increasing (which seems to be a part of the
>> contract).
>> </quote>
>>
>> I imagine if Andrzej's proposed FilterIndexReader maintains 2 sorted
>> (ordered) maps, one from internal document-ids to "view" document-ids,
>> and another mapping from  "view" document-ids to internal
>> document-ids, then things like skipTo() can be implemented reasonably
>> efficiently. Only the mapped ids are maintained in these structures.
>> (Also note that a mapped "view" document-id represents an internally
>> deleted document with that id.)
>>
>> And if we can find a way to merge the segments of this "aux" index
>> whenever the segments of its associated "main" index are merged
>> or optimized (such that the [internal] doc-ids in the merged aux index
>> end up getting sync'ed with those of the trunk), then there shouldn't
>> be all that many doc-ids to map anyway (if we merge frequently
>> enough).
>>
>> So to go back Andrzej's question: is there any way to figure out
>> (predictably) how internal document numbers [in the main index] are
>> reassigned after calling optimize() ? How does LUCENE-847, as Doug
>> Cutting suggests in that trail, help?
>>
>> Sorry if that was long winded, had to start somewhere ;)
>>
>> -Babak
>>
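The skipTo() worry quoted above is the crux, and it is where the second sorted map earns its keep. A minimal sketch of the idea (hypothetical names, not Lucene's TermDocs): if the reader keeps a TreeMap from "view" doc-ids to internal doc-ids, skipTo(target) can use ceilingEntry() to land on the smallest mapped view id >= target, so the ids handed back stay monotonically increasing:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: monotonic skipTo over remapped doc-ids via a sorted map.
public class MappedSkipTo {
    private final TreeMap<Integer, Integer> viewToInternal = new TreeMap<>();

    public void map(int viewId, int internalId) {
        viewToInternal.put(viewId, internalId);
    }

    // Returns the view doc-id of the next mapped doc at or after target,
    // or -1 if none; ids returned across calls never decrease.
    public int skipTo(int target) {
        Map.Entry<Integer, Integer> e = viewToInternal.ceilingEntry(target);
        return e == null ? -1 : e.getKey();
    }
}
```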