You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Babak Farhang <fa...@gmail.com> on 2010/04/02 08:50:13 UTC

Re: Incremental Field Updates

[Late to this party, but thought I'd chime in]

I think this "layer" concept is right on.  But I'm wondering about the
life cycle of these layers.  Do layers live forever? Or do they
collapse at some point? (Like, as I think was already pointed out,
deletes are when segments are merged today.)

-Babak

On Sat, Mar 27, 2010 at 5:25 AM, Grant Ingersoll <gs...@apache.org> wrote:
> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today.   When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers.  In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed.  This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in.  Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes.  That is, is it
> something that goes in the posting list, or is it a side structure.  I think
> in the posting list would be to slow.  Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
>  In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

Grant,

Reading your post a 3rd time, I see my "suggestion" is in fact the
approach you describe.  Sorry for being redundant.

-Babak

On Fri, Apr 2, 2010 at 11:25 PM, Babak Farhang <fa...@gmail.com> wrote:
>> I think they get merged in by the merger, ideally in the background.
>
> That sounds sensible. (In other words, we wont concern ourselves with
> roll backs--something possible while a "layer" is still around.)
>
> I've been thinking about this problem also. One approach discussed
> earlier in these mailing lists has been to somehow maintain a parallel
> index of the update-able of the fields in such a way that the docIds
> of the parallel index remain in sync with the "master" index. Mike
> McCandless and I were discussing some variants of this approach a few
> months back: http://markmail.org/message/uifz5v37k6qxxhvz?q=%22incremental+document+field+update%22+site:markmail%2Eorg&page=1&refer=ipebtbf24y7rleps
>  That approach involved the concept of mapping (chaining, if you will)
> internal docIds to view ids.  That docid mapping concept sounds
> analogous to this layer concept we are discussing now.
>
> I now think the parallel index approach may not be such a great idea,
> after all: it simply pushes the problem to the edge--the slave index.
> If we can solve update problem in the slave index, I reason, then
> shouldn't we also be able to solve the same update problem in the
> master index (and thereby remove the necessity of maintaining a
> (user-level) parallel index in the first place)?
>
> Which seems to align with the approach being discussed here..
>
> I imagine the "layers" being discussed here are somehow threaded by
> docId. That is, given a docId, you can quickly find it's "layers."  If
> so, then the docId mapping idea may be one way to thread these layers.
> (A logical document would be constructed by a chain of docIds, each
> overriding the previous for each field it defines (or deletes).  Such
> a construction would have to be "merge-aware" (perhaps using machinery
> similar to that used in LUCENE-1879) in order that it may maintain the
> docId chain.
>
> What do you think?
>
>
> On Fri, Apr 2, 2010 at 4:56 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:
>>
>>> [Late to this party, but thought I'd chime in]
>>>
>>> I think this "layer" concept is right on.  But I'm wondering about the
>>> life cycle of these layers.  Do layers live forever? Or do they
>>> collapse at some point? (Like, as I think was already pointed out,
>>> deletes are when segments are merged today.)
>>
>> I think they get merged in by the merger, ideally in the background.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

One thing that makes IFU especially compelling now is that we have
improved how we track buffered deletes, which records by Term or Query
mapped against certain segments which documents should be deleted.

I think this same "channel" can be generalized to also hold pending
updates to documents from prior segments.  We should be able to reuse
much of the buffered deletes code to also hold buffered doc updates.

IFU would be an *awesome* addition ;)

I think doc blocks (LUCENE-3112) makes IFU even more important.

Mike

http://blog.mikemccandless.com

On Mon, May 23, 2011 at 12:08 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 10/7/10 11:07 AM, Shai Erera wrote:
>>
>> Not yet. I actually plan to start working on it next week, but it will
>> take some time until I post the first patch. Also, I'll probably develop
>> it on top of trunk only, utilizing flexible indexing. At the moment, I
>> have no plans, nor can I estimate how much work is required, to develop
>> it on top of 3x.
>>
>> Unfortunately my regular projects keep me very busy, but it's time I
>> allocate some time to work on this one too :). Stay tuned !
>
> I still am :)
>
> I'm resurrecting this old thread in the hopes that this subject again
> attracts attention of the Lucene mind collective. The design proposal that
> you presented at that time seemed workable ... so I guess it just requires
> someone to roll up the sleeves and start implementing it?
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 5/23/11 8:24 PM, Shai Erera wrote:
> Heh, almost a year has passed since I sent that email. Soon, "next week"
> will be applicable again :).

:)

> Besides my day-to-day job, the changes to trunk kept me away from
> pursuing this. DWPT, BufferedDeletes and all the crazy improvements and
> enhancements Mike and Simon (and others) have been working on lately
> changed much of the code that will be involved in the development.
>
> Things seem to be more stable (and "at peace") at trunk-land now (heck,
> we've even started discussing releasing 4.0 !), so it's time to get back
> on that issue.
>
> I need to reorganize my thoughts and design details into a document or
> something, or plainly in a JIRA issue. I'm afraid of setting any dates
> on this just yet though.
>
> While I very much like to implement it, I realize the community cannot
> wait for me forever. So if there's anyone interested in picking up the
> gloves, I will understand :) (and of course help as much as I can ...)

4.0 internals is still an uncharted area for me ... I'd like to use this 
functionality for a project, so I'm generally willing to help (review, 
design, maybe code if I can catch-up with trunk...).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

Heh, almost a year has passed since I sent that email. Soon, "next week"
will be applicable again :).

Besides my day-to-day job, the changes to trunk kept me away from pursuing
this. DWPT, BufferedDeletes and all the crazy improvements and enhancements
Mike and Simon (and others) have been working on lately changed much of the
code that will be involved in the development.

Things seem to be more stable (and "at peace") at trunk-land now (heck,
we've even started discussing releasing 4.0 !), so it's time to get back on
that issue.

I need to reorganize my thoughts and design details into a document or
something, or plainly in a JIRA issue. I'm afraid of setting any dates on
this just yet though.

While I very much like to implement it, I realize the community cannot wait
for me forever. So if there's anyone interested in picking up the gloves, I
will understand :) (and of course help as much as I can ...)

Shai

On Mon, May 23, 2011 at 7:08 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 10/7/10 11:07 AM, Shai Erera wrote:
>
>> Not yet. I actually plan to start working on it next week, but it will
>> take some time until I post the first patch. Also, I'll probably develop
>> it on top of trunk only, utilizing flexible indexing. At the moment, I
>> have no plans, nor can I estimate how much work is required, to develop
>> it on top of 3x.
>>
>> Unfortunately my regular projects keep me very busy, but it's time I
>> allocate some time to work on this one too :). Stay tuned !
>>
>
> I still am :)
>
> I'm resurrecting this old thread in the hopes that this subject again
> attracts attention of the Lucene mind collective. The design proposal that
> you presented at that time seemed workable ... so I guess it just requires
> someone to roll up the sleeves and start implementing it?
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 10/7/10 11:07 AM, Shai Erera wrote:
> Not yet. I actually plan to start working on it next week, but it will
> take some time until I post the first patch. Also, I'll probably develop
> it on top of trunk only, utilizing flexible indexing. At the moment, I
> have no plans, nor can I estimate how much work is required, to develop
> it on top of 3x.
>
> Unfortunately my regular projects keep me very busy, but it's time I
> allocate some time to work on this one too :). Stay tuned !

I still am :)

I'm resurrecting this old thread in the hopes that this subject again 
attracts attention of the Lucene mind collective. The design proposal 
that you presented at that time seemed workable ... so I guess it just 
requires someone to roll up the sleeves and start implementing it?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

Not yet. I actually plan to start working on it next week, but it will take
some time until I post the first patch. Also, I'll probably develop it on
top of trunk only, utilizing flexible indexing. At the moment, I have no
plans, nor can I estimate how much work is required, to develop it on top of
3x.

Unfortunately my regular projects keep me very busy, but it's time I
allocate some time to work on this one too :). Stay tuned !

Shai

On Thu, Oct 7, 2010 at 10:59 AM, Jan Høydahl / Cominvent <
jan.asf@cominvent.com> wrote:

> Picking up on this very interesting discussion..
> Great and innovative piece of work, Shai!
>
> I think we come a long way addressing common scenarios through this
> approach. Many customers really just need ACL or other metadata updates. One
> example is a customer of mine who have an index of large docs for which the
> source data is archived on tape. It is way too costly to retrieve the
> original data to compile a new document for a metadata update only.
>
> Also, if I want to have the ability to update a whole field, I would be
> happy to make it stored, rather than having to supply the original value to
> the API. Seems like a reasonable tradeoff for getting incremental update -
> nobody would expect it to be free.
>
> +1 for solving the "simple metadata" update case first, with full-field
> update support for stored fields only.
>
> Does this particular solution currently have an associated JIRA issue?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 10. mai 2010, at 10.40, Michael McCandless wrote:
>
> > On Mon, May 10, 2010 at 4:05 AM, Shai Erera <se...@gmail.com> wrote:
> >> That's an interesting scenario Mike.
> >>
> >> Previously, I only handled boolean-like terms, as the scenarios we were
> >> asked to support involved just those types of terms. Obviously, when the
> >> approach allows for more, more scenarios pop to mind :).
> >
> > OK.
> >
> >> I think we may still be able to resolve that case, but it becomes much
> more
> >> complicated. My design approach of adding the +/- affected the entire
> >> posting element, whereas the scenario you describe affects the positions
> of
> >> the posting element. This calls for a more complicated design and
> solution.
> >
> > Right.
> >
> >> My take on it is that if someone wants to update the catch-all field,
> then
> >> reindexing the document may not be such a bad idea anyway. The purpose
> of
> >> those incremental updates is to cope w/ high frequency of updates, which
> >> usually happen on metadata fields, and not title.
> >
> > I agree.
> >
> >> But since one could add the 'tags' to the catch-all field as well, it
> brings
> >> us to the same point - how do I remove the positions of term X that
> relate
> >> to the tag X and not the potentially original term X that existed in the
> >> document?
> >>
> >> This is a very advanced case (and interesting). I don't want to hold up
> the
> >> discussion on it, but want to make sure we do not deviate from getting
> the
> >> more simpler cases in first. Depending on the API, this might be very
> easy
> >> to solve, but might also complicate matters. Maybe, for a
> >> incr-field-updates-v1, we can do without it?
> >
> > Definitely, let's take this (incrementally updating the positions as
> > well) out of scope for the first cut, when we actually start building
> > things.  One simple way to do this might be to only allow incremental
> > update on fields that have omitTFAP=true.
> >
> > When brainstorming/designing a new feature, I like to cast a wide net
> > during the discussion/thinking (what we are doing now), but then when
> > it comes to what to actually build for phase one well pull it way back
> > in and aim for baby steps / progress not perfection.  We are able to
> > do much more imagining than we can actually writing code :)
> >
> > The wide net during brainstorming gives us a better view/context of
> > the road ahead, eg to validate that the baby step is in the right
> > direction, so that it doesn't preclude other things we might imagine
> > later.
> >
> > In this case, it does sound like the approach should work (in theory)
> > fine w/ positions, too.
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.

Picking up on this very interesting discussion..
Great and innovative piece of work, Shai!

I think we come a long way addressing common scenarios through this approach. Many customers really just need ACL or other metadata updates. One example is a customer of mine who have an index of large docs for which the source data is archived on tape. It is way too costly to retrieve the original data to compile a new document for a metadata update only.

Also, if I want to have the ability to update a whole field, I would be happy to make it stored, rather than having to supply the original value to the API. Seems like a reasonable tradeoff for getting incremental update - nobody would expect it to be free.

+1 for solving the "simple metadata" update case first, with full-field update support for stored fields only.

Does this particular solution currently have an associated JIRA issue?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. mai 2010, at 10.40, Michael McCandless wrote:

> On Mon, May 10, 2010 at 4:05 AM, Shai Erera <se...@gmail.com> wrote:
>> That's an interesting scenario Mike.
>> 
>> Previously, I only handled boolean-like terms, as the scenarios we were
>> asked to support involved just those types of terms. Obviously, when the
>> approach allows for more, more scenarios pop to mind :).
> 
> OK.
> 
>> I think we may still be able to resolve that case, but it becomes much more
>> complicated. My design approach of adding the +/- affected the entire
>> posting element, whereas the scenario you describe affects the positions of
>> the posting element. This calls for a more complicated design and solution.
> 
> Right.
> 
>> My take on it is that if someone wants to update the catch-all field, then
>> reindexing the document may not be such a bad idea anyway. The purpose of
>> those incremental updates is to cope w/ high frequency of updates, which
>> usually happen on metadata fields, and not title.
> 
> I agree.
> 
>> But since one could add the 'tags' to the catch-all field as well, it brings
>> us to the same point - how do I remove the positions of term X that relate
>> to the tag X and not the potentially original term X that existed in the
>> document?
>> 
>> This is a very advanced case (and interesting). I don't want to hold up the
>> discussion on it, but want to make sure we do not deviate from getting the
>> more simpler cases in first. Depending on the API, this might be very easy
>> to solve, but might also complicate matters. Maybe, for a
>> incr-field-updates-v1, we can do without it?
> 
> Definitely, let's take this (incrementally updating the positions as
> well) out of scope for the first cut, when we actually start building
> things.  One simple way to do this might be to only allow incremental
> update on fields that have omitTFAP=true.
> 
> When brainstorming/designing a new feature, I like to cast a wide net
> during the discussion/thinking (what we are doing now), but then when
> it comes to what to actually build for phase one well pull it way back
> in and aim for baby steps / progress not perfection.  We are able to
> do much more imagining than we can actually writing code :)
> 
> The wide net during brainstorming gives us a better view/context of
> the road ahead, eg to validate that the baby step is in the right
> direction, so that it doesn't preclude other things we might imagine
> later.
> 
> In this case, it does sound like the approach should work (in theory)
> fine w/ positions, too.
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

I think this would work perfectly fine w/ Shai's approach...

To Lucene a NumericField is just a series of terms w/ no positions indexed.

So when a value is changed, we'd get a new series of terms, do the
delta, and then subtract & add accordingly in the stacked segments.

Mike

On Wed, May 12, 2010 at 5:27 AM, Babak Farhang <fa...@gmail.com> wrote:
>> Of course, it raises an interesting point, what are the implications for numeric fields?
>
> Not sure whether you're referring to the general or the specific, but
> with the approach Shai is proposing, if the numeric fields are indexed
> using the new trie structures, then it would be important to properly
> remove the postings for the old value (I imagine range queries would
> break o.w.). Again, that could be achieved by having the update API
> take the old value as well as the new one.
>
> -Babak
>
> On Tue, May 11, 2010 at 1:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> On May 11, 2010, at 12:26 AM, Shai Erera wrote:
>>
>>> but because of the cost of preparing the inputs (i.e. text
>>> extraction) to Lucene.
>>>
>>> You're right ! That and also the cost of fetching the document, in systems where the content lives on other servers/systems. Reindexing is usually (depends on your analysis chain) the cheapest step.
>>
>> Depends on the type of application, though, I suppose.  Many times the thing being updated is just a number, like a rating/price/inventory as well, in which case there is very little analysis.  Of course, it raises an interesting point, what are the implications for numeric fields?
>>
>> -Grant
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

> Of course, it raises an interesting point, what are the implications for numeric fields?

Not sure whether you're referring to the general or the specific, but
with the approach Shai is proposing, if the numeric fields are indexed
using the new trie structures, then it would be important to properly
remove the postings for the old value (I imagine range queries would
break o.w.). Again, that could be achieved by having the update API
take the old value as well as the new one.

-Babak

On Tue, May 11, 2010 at 1:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On May 11, 2010, at 12:26 AM, Shai Erera wrote:
>
>> but because of the cost of preparing the inputs (i.e. text
>> extraction) to Lucene.
>>
>> You're right ! That and also the cost of fetching the document, in systems where the content lives on other servers/systems. Reindexing is usually (depends on your analysis chain) the cheapest step.
>
> Depends on the type of application, though, I suppose.  Many times the thing being updated is just a number, like a rating/price/inventory as well, in which case there is very little analysis.  Of course, it raises an interesting point, what are the implications for numeric fields?
>
> -Grant
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Grant Ingersoll <gs...@apache.org>.

On May 11, 2010, at 12:26 AM, Shai Erera wrote:

> but because of the cost of preparing the inputs (i.e. text
> extraction) to Lucene.
>  
> You're right ! That and also the cost of fetching the document, in systems where the content lives on other servers/systems. Reindexing is usually (depends on your analysis chain) the cheapest step.

Depends on the type of application, though, I suppose.  Many times the thing being updated is just a number, like a rating/price/inventory as well, in which case there is very little analysis.  Of course, it raises an interesting point, what are the implications for numeric fields?

-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

>
> but because of the cost of preparing the inputs (i.e. text
> extraction) to Lucene.
>

You're right ! That and also the cost of fetching the document, in systems
where the content lives on other servers/systems. Reindexing is usually
(depends on your analysis chain) the cheapest step.

Shai

On Tue, May 11, 2010 at 7:22 AM, Babak Farhang <fa...@gmail.com> wrote:

> >> My take on it is that if someone wants to update the catch-all field,
> then
> >> reindexing the document may not be such a bad idea anyway. The purpose
> of
> >> those incremental updates is to cope w/ high frequency of updates, which
> >> usually happen on metadata fields, and not title.
> >
> > I agree.
>
> I too agree with the general gist of this argument.
>
> As an aside, just to add another dimension to this discussion (perhaps
> now the net is cast too wide), Lucene users often want incremental
> updates not because of the cost of reindexing the document inside
> Lucene, but because of the cost of preparing the inputs (i.e. text
> extraction) to Lucene.
>
>
> On Mon, May 10, 2010 at 2:40 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > On Mon, May 10, 2010 at 4:05 AM, Shai Erera <se...@gmail.com> wrote:
> >> That's an interesting scenario Mike.
> >>
> >> Previously, I only handled boolean-like terms, as the scenarios we were
> >> asked to support involved just those types of terms. Obviously, when the
> >> approach allows for more, more scenarios pop to mind :).
> >
> > OK.
> >
> >> I think we may still be able to resolve that case, but it becomes much
> more
> >> complicated. My design approach of adding the +/- affected the entire
> >> posting element, whereas the scenario you describe affects the positions
> of
> >> the posting element. This calls for a more complicated design and
> solution.
> >
> > Right.
> >
> >> My take on it is that if someone wants to update the catch-all field,
> then
> >> reindexing the document may not be such a bad idea anyway. The purpose
> of
> >> those incremental updates is to cope w/ high frequency of updates, which
> >> usually happen on metadata fields, and not title.
> >
> > I agree.
> >
> >> But since one could add the 'tags' to the catch-all field as well, it
> brings
> >> us to the same point - how do I remove the positions of term X that
> relate
> >> to the tag X and not the potentially original term X that existed in the
> >> document?
> >>
> >> This is a very advanced case (and interesting). I don't want to hold up
> the
> >> discussion on it, but want to make sure we do not deviate from getting
> the
> >> more simpler cases in first. Depending on the API, this might be very
> easy
> >> to solve, but might also complicate matters. Maybe, for a
> >> incr-field-updates-v1, we can do without it?
> >
> > Definitely, let's take this (incrementally updating the positions as
> > well) out of scope for the first cut, when we actually start building
> > things.  One simple way to do this might be to only allow incremental
> > update on fields that have omitTFAP=true.
> >
> > When brainstorming/designing a new feature, I like to cast a wide net
> > during the discussion/thinking (what we are doing now), but then when
> > it comes to what to actually build for phase one well pull it way back
> > in and aim for baby steps / progress not perfection.  We are able to
> > do much more imagining than we can actually writing code :)
> >
> > The wide net during brainstorming gives us a better view/context of
> > the road ahead, eg to validate that the baby step is in the right
> > direction, so that it doesn't preclude other things we might imagine
> > later.
> >
> > In this case, it does sound like the approach should work (in theory)
> > fine w/ positions, too.
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

>> My take on it is that if someone wants to update the catch-all field, then
>> reindexing the document may not be such a bad idea anyway. The purpose of
>> those incremental updates is to cope w/ high frequency of updates, which
>> usually happen on metadata fields, and not title.
>
> I agree.

I too agree with the general gist of this argument.

As an aside, just to add another dimension to this discussion (perhaps
now the net is cast too wide), Lucene users often want incremental
updates not because of the cost of reindexing the document inside
Lucene, but because of the cost of preparing the inputs (i.e. text
extraction) to Lucene.


On Mon, May 10, 2010 at 2:40 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Mon, May 10, 2010 at 4:05 AM, Shai Erera <se...@gmail.com> wrote:
>> That's an interesting scenario Mike.
>>
>> Previously, I only handled boolean-like terms, as the scenarios we were
>> asked to support involved just those types of terms. Obviously, when the
>> approach allows for more, more scenarios pop to mind :).
>
> OK.
>
>> I think we may still be able to resolve that case, but it becomes much more
>> complicated. My design approach of adding the +/- affected the entire
>> posting element, whereas the scenario you describe affects the positions of
>> the posting element. This calls for a more complicated design and solution.
>
> Right.
>
>> My take on it is that if someone wants to update the catch-all field, then
>> reindexing the document may not be such a bad idea anyway. The purpose of
>> those incremental updates is to cope w/ high frequency of updates, which
>> usually happen on metadata fields, and not title.
>
> I agree.
>
>> But since one could add the 'tags' to the catch-all field as well, it brings
>> us to the same point - how do I remove the positions of term X that relate
>> to the tag X and not the potentially original term X that existed in the
>> document?
>>
>> This is a very advanced case (and interesting). I don't want to hold up the
>> discussion on it, but want to make sure we do not deviate from getting the
>> more simpler cases in first. Depending on the API, this might be very easy
>> to solve, but might also complicate matters. Maybe, for a
>> incr-field-updates-v1, we can do without it?
>
> Definitely, let's take this (incrementally updating the positions as
> well) out of scope for the first cut, when we actually start building
> things.  One simple way to do this might be to only allow incremental
> update on fields that have omitTFAP=true.
>
> When brainstorming/designing a new feature, I like to cast a wide net
> during the discussion/thinking (what we are doing now), but then when
> it comes to what to actually build for phase one well pull it way back
> in and aim for baby steps / progress not perfection.  We are able to
> do much more imagining than we can actually writing code :)
>
> The wide net during brainstorming gives us a better view/context of
> the road ahead, eg to validate that the baby step is in the right
> direction, so that it doesn't preclude other things we might imagine
> later.
>
> In this case, it does sound like the approach should work (in theory)
> fine w/ positions, too.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, May 10, 2010 at 4:05 AM, Shai Erera <se...@gmail.com> wrote:
> That's an interesting scenario Mike.
>
> Previously, I only handled boolean-like terms, as the scenarios we were
> asked to support involved just those types of terms. Obviously, when the
> approach allows for more, more scenarios pop to mind :).

OK.

> I think we may still be able to resolve that case, but it becomes much more
> complicated. My design approach of adding the +/- affected the entire
> posting element, whereas the scenario you describe affects the positions of
> the posting element. This calls for a more complicated design and solution.

Right.

> My take on it is that if someone wants to update the catch-all field, then
> reindexing the document may not be such a bad idea anyway. The purpose of
> those incremental updates is to cope w/ high frequency of updates, which
> usually happen on metadata fields, and not title.

I agree.

> But since one could add the 'tags' to the catch-all field as well, it brings
> us to the same point - how do I remove the positions of term X that relate
> to the tag X and not the potentially original term X that existed in the
> document?
>
> This is a very advanced case (and interesting). I don't want to hold up the
> discussion on it, but want to make sure we do not deviate from getting the
> more simpler cases in first. Depending on the API, this might be very easy
> to solve, but might also complicate matters. Maybe, for a
> incr-field-updates-v1, we can do without it?

Definitely, let's take this (incrementally updating the positions as
well) out of scope for the first cut, when we actually start building
things.  One simple way to do this might be to only allow incremental
update on fields that have omitTFAP=true.

When brainstorming/designing a new feature, I like to cast a wide net
during the discussion/thinking (what we are doing now), but then when
it comes to what to actually build for phase one well pull it way back
in and aim for baby steps / progress not perfection.  We are able to
do much more imagining than we can actually writing code :)

The wide net during brainstorming gives us a better view/context of
the road ahead, eg to validate that the baby step is in the right
direction, so that it doesn't preclude other things we might imagine
later.

In this case, it does sound like the approach should work (in theory)
fine w/ positions, too.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

That's an interesting scenario Mike.

Previously, I only handled boolean-like terms, as the scenarios we were
asked to support involved just those types of terms. Obviously, when the
approach allows for more, more scenarios pop to mind :).

I think we may still be able to resolve that case, but it becomes much more
complicated. My design approach of adding the +/- affected the entire
posting element, whereas the scenario you describe affects the positions of
the posting element. This calls for a more complicated design and solution.

My take on it is that if someone wants to update the catch-all field, then
reindexing the document may not be such a bad idea anyway. The purpose of
those incremental updates is to cope w/ high frequency of updates, which
usually happen on metadata fields, and not title.

But since one could add the 'tags' to the catch-all field as well, it brings
us to the same point - how do I remove the positions of term X that relate
to the tag X and not the potentially original term X that existed in the
document?

This is a very advanced case (and interesting). I don't want to hold up the
discussion on it, but want to make sure we do not deviate from getting the
more simpler cases in first. Depending on the API, this might be very easy
to solve, but might also complicate matters. Maybe, for a
incr-field-updates-v1, we can do without it?

Shai

On Mon, May 10, 2010 at 10:43 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I think another example would be the catch-all field.
>
> EG say my app concatenates the title, abstract and body of a document
> into the catch-all field.
>
> But now I want to change just the title.
>
> I think in theory (assuming we can work out an intuitive user-level
> API exposure of this...), on changing the title, we could
> incrementally re-index the catch-all field, to remove the old title
> and add the new one.  Right?
>
> But: would this approach work on the positions too?
>
> EG say term X appears 4 times in the doc -- positions 1, 7, 22, 53.
> I've now replaced a part of this field, affecting only this term's
> occurrence at position 53, so that maybe that occurrence is now at 57.
>  Are we able to handle this?
>
> One possible low-level API (I think Doug originally suggested this)
> might be to allow .remove() calls on the enums (like many Java
> collection support from their iterators) -- eg as you are stepping
> through the DocsEnum, you could call .remove() to complete
> disassociate that term from that doc, or in DocsAndPositionsEnum you
> could call .removePosition to remove a specific position.
>
> We would still need a higher level user API... maybe you'd provide a
> "text to remove" and a "text to add" for each field, pre-init'd with
> position/offset for the analyzer (if the field is analyzed)?  I guess
> we'd make this a new attr on Field.  Ie, Field normally contains just
> "text/tokens to add to the index", but we could also include a
> "text/tokens to remove".  Or, each Field instance is marked as "to be
> added" vs "to be removed", and you have to add 2 Field instances to
> subtract then add.
>
> Mike
>
> On Sun, May 9, 2010 at 7:38 AM, Shai Erera <se...@gmail.com> wrote:
> >> When I update a field, I
> >> want to update all of it, not just part of it. No?
> >
> > Well ... might be. But the most common case for us are fields to which we
> > want to add data or remove. For example ACLs - you don't want to replace
> the
> > entire field w/ the document, but simply to add/remove access for certain
> > people/groups. Same goes for "social" fields, like tags, ratings,
> bookmarks
> > etc. - the granularity of the update is to associate/dissociate a
> particular
> > value w/ the field + doc, and not update the entire field.
> >
> > Shai
> >
> > On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <fa...@gmail.com>
> wrote:
> >>
> >> > No, actually, you can update index-only fields also. It all depends on
> >> > the
> >> > operation that you ask to do on the fields.
> >>
> >> I love the level of control this provides, but again, I was talking at
> >> the user level.
> >>
> >> > If you want to e.g. remove an entire field w/ such update operation,
> >> > then it
> >> > becomes more expensive
> >>
> >> That's the typical usage scenario, I imagine. When I update a field, I
> >> want to update all of it, not just part of it. No?
> >>
> >> (The lower level semantics of twiddling with the postings is poorly
> >> understood by the typical user..)
> >>
> >> > We didn't
> >> > face such a scenario though, and I think it's probably a rare one?
> >>
> >> As rare as any time you want to update an indexed-only field.  Not a
> >> serious limitation (but perhaps worth noting?)
> >> Perhaps at the API level, you provide an updateIndexedOnlyField that
> >> takes the old value as well as the new value for the field.
> >>
> >> Anyway, I think your approach would be a great addition to the
> >> toolkit. Would love to see it even in rough cut form :))
> >>
> >> -Babak
> >>
> >> On Sun, May 9, 2010 at 12:49 AM, Shai Erera <se...@gmail.com> wrote:
> >> > No, actually, you can update index-only fields also. It all depends on
> >> > the
> >> > operation that you ask to do on the fields. For example, if the query
> to
> >> > execute is something like "update all documents w/ tags:ibm -> remove
> >> > terms
> >> > t1, t2, t3 and add terms t4, t5", then the result of such request
> would
> >> > dissociate t1-3 from those docs that answer tags:ibm and associate
> them
> >> > w/
> >> > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
> >> > following posting updates will be done:
> >> > t1: -1, -4, -10
> >> > t2: -1, -4, -10
> >> > t3: -1, -4, -10
> >> > t4, 1, 4, 10
> >> > t5, 1, 4, 10
> >> > (in addition to whatever other updates that are associated with those
> >> > postings).
> >> >
> >> > At search time, if you search for "t1 OR t2", then the regular t1 and
> t2
> >> > postings will be merged on-the-fly w/ the updated ones to remove docs
> 1,
> >> > 4,
> >> > 10.
> >> >
> >> > If you want to e.g. remove an entire field w/ such update operation,
> >> > then it
> >> > becomes more expensive, but in general you'd need to iterate over the
> >> > field's terms and dissociate the documents from all the terms. We
> didn't
> >> > face such a scenario though, and I think it's probably a rare one?
> >> >
> >> > Shai
> >> >
> >> > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <fa...@gmail.com>
> wrote:
> >> >>
> >> >> Shai,
> >> >>
> >> >> I think this is an interesting approach. I can see how I could
> >> >> [incrementally] update a stored, indexed field this way, but I don't
> >> >> understand how I could update an indexed-only field. Here's why: for
> a
> >> >> stored (and indexed) field, I can always determine what terms to
> >> >> remove ('-') from the postings, but for an indexed-only field I'd
> have
> >> >> no [practical] way to know..
> >> >>
> >> >> So under this approach,  I'm thinking at a user level, a Lucene field
> >> >> would be updateable only if it's at least stored.
> >> >>
> >> >> Is that right?
> >> >>
> >> >> -Babak
> >> >>
> >> >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <se...@gmail.com>
> wrote:
> >> >> > Yes Mike - I don't know yet if two MPs will be used, one for the
> >> >> > stacked
> >> >> > segments and one for the general segments (which will include the
> >> >> > stacked
> >> >> > ones as well) .. feels like one MP should be enough, but this can
> be
> >> >> > decided
> >> >> > on as we progress.
> >> >> >
> >> >> > This approach allows you to update every term in every already
> >> >> > indexed
> >> >> > field, as well as add terms to already indexed fields ... and add
> >> >> > totally
> >> >> > new fields, with lots of text in them. So e.g. there are two neat
> use
> >> >> > cases
> >> >> > that come to mind:
> >> >> > 1) Allow users to rate search results, favor them etc.
> >> >> > 2) Or even comment them,
> >> >> > I think Google offers the 2nd. Both translate into updating the
> >> >> > search
> >> >> > result's already indexed document w/ the new rating, comment etc.
> w/o
> >> >> > needing to reindex the document.
> >> >> >
> >> >> > I will try to find perf results numbers. It's been long time ago,
> >> >> > hope
> >> >> > all
> >> >> > the documents are still where they were :). Indeed, for terms like
> >> >> > ACLs,
> >> >> > it
> >> >> > means that each query had to merge dozens of postings lists. For
> that
> >> >> > I
> >> >> > implemented an alternative solution, which uses a payload-like
> >> >> > structure
> >> >> > that registers for each document the list of ACLs that are
> associated
> >> >> > with
> >> >> > it (not as strings, it was more efficient). Then, if the query
> >> >> > included
> >> >> > dozens of such terms, I created a filter-like object which for
> every
> >> >> > matching document by the query checked if it matches the ACLs list
> of
> >> >> > the
> >> >> > document. This is usually slower, because the ACLs themselves don't
> >> >> > drive
> >> >> > the query, which means more matches will be found. That was a
> >> >> > tradeoff
> >> >> > which
> >> >> > one could configure based on the number of such terms in the query,
> >> >> > the
> >> >> > number of updated terms etc.
> >> >> >
> >> >> > But in essence you're right - if the solution is generic to cover
> any
> >> >> > term,
> >> >> > we cannot use such payload-based feature. We might need to merge
> the
> >> >> > stacked
> >> >> > segments more frequently. This is pending perf testing though.
> >> >> >
> >> >> > Shai
> >> >> >
> >> >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> >> >> > <lu...@mikemccandless.com> wrote:
> >> >> >>
> >> >> >> Catching up here :)
> >> >> >>
> >> >> >> This is great stuff Shai -- I like the notion of "negative"
> >> >> >> postings,
> >> >> >> that "subtract" docs from previous generations as you iterate
> them.
> >> >> >>
> >> >> >> And I like the term "stacked segments".  This fits very well with
> >> >> >> Lucene's write-once approach, ie, a writer can at any time stack a
> >> >> >> new
> >> >> >> segment (writes to new files) "over" an old segment, like the
> layers
> >> >> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
> >> >> >>
> >> >> >> And the merge policy now picks from 2 dimensions right?  Ie it may
> >> >> >> want to simply consolidate stacks on an old segment but otherwise
> >> >> >> not
> >> >> >> merge that segment (eg for very large segments that have
> accumulated
> >> >> >> updates), and normal merging would of course consolidate all
> stacks
> >> >> >> for all segments merged.
> >> >> >>
> >> >> >> Wouldn't this approach conceivably allow you to alter single terms
> >> >> >> within a single field (we'd have to figure out how we'd expose the
> >> >> >> API
> >> >> >> for this)?  EG if I appended some text to an already-indexed
> field?
> >> >> >> (In addition to adding a new field to an already indexed doc, or
> >> >> >> replacing an indexed field on a previously indexed doc).
> >> >> >>
> >> >> >> Did you have any hard perf numbers?  Merge sorting N streams is
> >> >> >> surprisingly costly... we may need/want to have a reader pre-merge
> >> >> >> (using up RAM) any "long tail" of stacked segments as long as they
> >> >> >> are
> >> >> >> small enough...
> >> >> >>
> >> >> >> This sounds great!!
> >> >> >>
> >> >> >> Mike
> >> >> >>
> >> >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > WARNING: following email is a bit long, but I think is worth the
> >> >> >> > reading
> >> >> >> > :)
> >> >> >> >
> >> >> >> > I would like to describe my implementation of incremental field
> >> >> >> > updates
> >> >> >> > in Juru (the former search library I've worked on in IBM). I
> will
> >> >> >> > later
> >> >> >> > discuss how I think it can be implemented in Lucene.
> >> >> >> >
> >> >> >> > The motivation/requirement came from a doc management system
> which
> >> >> >> > used
> >> >> >> > Juru as its search component. The system included document
> >> >> >> > libraries
> >> >> >> > where users could create files and upload documents. A user
> could
> >> >> >> > belong
> >> >> >> > to any number of libraries and complex ACLs model was used (down
> >> >> >> > to
> >> >> >> > the
> >> >> >> > level of the file). ACLs and Folders were modeled as categories
> in
> >> >> >> > the
> >> >> >> > index (boolean-like terms). Files and folders could be moved
> >> >> >> > around
> >> >> >> > and
> >> >> >> > access to a library/folder/file could be granted/revoked at any
> >> >> >> > given
> >> >> >> > time. Therefore, such updates usually affected hundreds (and
> >> >> >> > thousands)
> >> >> >> > of documents. Overall, the index managed millions of documents,
> >> >> >> > tens
> >> >> >> > of
> >> >> >> > thousands of libraries and hundreds of thousands of ACLs (large
> >> >> >> > organization :)). To get a rough understanding on the number of
> >> >> >> > such
> >> >> >> > updates - every several minutes, tens of thousands of documents
> >> >> >> > were
> >> >> >> > updated due to such changes only (in addition to the regular
> >> >> >> > content
> >> >> >> > updates).
> >> >> >> >
> >> >> >> > We were asked to support requests in the following form: "update
> >> >> >> > all
> >> >> >> > docs
> >> >> >> > that match <criteria> --> do <operation>" where:
> >> >> >> > * <criteria> was "a single doc", "docs belonging to a category"
> >> >> >> > and
> >> >> >> > "docs
> >> >> >> > belonging to a set of categories".
> >> >> >> > * <operation> was "add categories NEW" (NEW might not even exist
> >> >> >> > in
> >> >> >> > the
> >> >> >> > index yet, or already associated w/ the document), "remove
> >> >> >> > categories
> >> >> >> > OLD"
> >> >> >> > (irregardless if the docs were already associated w/ OLD or not)
> >> >> >> > and
> >> >> >> > "remove all OLD and add all NEW".
> >> >> >> >
> >> >> >> > The solution I've implemented to support these requests turned
> out
> >> >> >> > to
> >> >> >> > actually allow you to update every term (!) in the index:
> suppose
> >> >> >> > that
> >> >> >> > you have a table, which recorded tuples like <docid, term, +/->.
> >> >> >> > The
> >> >> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the
> term
> >> >> >> > "ibm",
> >> >> >> > and the record <1, "hp", '-'> means that the same document is
> not
> >> >> >> > associated w/ the word "hp". Then, you could very easily ask for
> >> >> >> > all
> >> >> >> > documents that are assoicated w/ "hp", and the result would not
> >> >> >> > include
> >> >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal
> >> >> >> > id
> >> >> >> > the
> >> >> >> > document received.
> >> >> >> >
> >> >> >> > I've kept two types of postings in the index: regular and
> updates.
> >> >> >> > Taking the above examples, "ibm" regular posting looked like
> >> >> >> > <"ibm",
> >> >> >> > 1,
> >> >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like
> >> >> >> > <"ibm",
> >> >> >> > +2,
> >> >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign).
> Similarly
> >> >> >> > for
> >> >> >> > "hp".
> >> >> >> >
> >> >> >> > During search time, when a query with the word "ibm" was
> >> >> >> > submitted, I
> >> >> >> > create a virtual posting which reads from both the regular and
> the
> >> >> >> > updates, and merges them on the fly according to the +/- signs.
> >> >> >> > Since
> >> >> >> > both postings are sorted in ascending order, the merge is very
> >> >> >> > efficient, and query time is hardly affected.
> >> >> >> >
> >> >> >> > Those postings are merged from time to time in a process that is
> >> >> >> > similar
> >> >> >> > to how Lucene works today, which keeps the update postings
> >> >> >> > relatively
> >> >> >> > small and manageable.
> >> >> >> >
> >> >> >> > Now here comes the fun part - how I think it can be implemented
> in
> >> >> >> > Lucene !
> >> >> >> >
> >> >> >> > To be honest, this sat on my TODO list for a very long time and
> >> >> >> > only
> >> >> >> > a
> >> >> >> > couple of months ago I figured out how to implement it in
> Lucene.
> >> >> >> > The
> >> >> >> > main difficulty I had was around the difference between the
> >> >> >> > write-once
> >> >> >> > policy in Juru and Lucene - in Lucene, once a segment is
> written,
> >> >> >> > it
> >> >> >> > cannot be changed. BUT, I've only recently realized that this
> >> >> >> > isn't
> >> >> >> > exactly true, because deleted docs do change existing segments.
> >> >> >> > The
> >> >> >> > deletes are kept in a separate file to the segment (.del) and
> have
> >> >> >> > their
> >> >> >> > own generation. Deletes, as I understood then, and Grant helped
> me
> >> >> >> > term
> >> >> >> > them better, can be defined as "Stacked Segments" - they add
> data
> >> >> >> > to
> >> >> >> > a
> >> >> >> > segment, which from time to time are integrated into the segment
> >> >> >> > (unlike
> >> >> >> > Photoshop Layers, but my understanding of Photoshop is limited).
> >> >> >> > And
> >> >> >> > the
> >> >> >> > Lucene engine knows how to combine the two, giving precedence to
> >> >> >> > the
> >> >> >> > deletes.
> >> >> >> >
> >> >> >> > By introducing an "Updates Stacked Segment", we can encode
> >> >> >> > postings
> >> >> >> > w/
> >> >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we
> >> >> >> > can
> >> >> >> > create a variation which merges the two lists. When segments are
> >> >> >> > merged,
> >> >> >> > the updates will be merged into the regular postings (just like
> >> >> >> > deletes)
> >> >> >> > and thus will be gone. In addition, this plays very nicely with
> >> >> >> > readers
> >> >> >> > that are currently reading the index, as well as we can have
> >> >> >> > generations
> >> >> >> > for the updates - really like deletes !
> >> >> >> >
> >> >> >> > I think that Lucene's architecture allows for such a solution
> very
> >> >> >> > cleanly and nicely (and I believe flex makes it even easier). We
> >> >> >> > can
> >> >> >> > (later, after you've digested the idea) discuss whether this
> >> >> >> > should
> >> >> >> > be
> >> >> >> > built into the current IW, or an extension like UpdateableIW.
> The
> >> >> >> > API
> >> >> >> > I've been thinking about should really be like deletes, allowing
> >> >> >> > to
> >> >> >> > update docs based on Term or Query. I defer the API discussion
> for
> >> >> >> > later
> >> >> >> > for now.
> >> >> >> >
> >> >> >> > As for stored fields, this was a real challenge to support in
> >> >> >> > Juru,
> >> >> >> > but
> >> >> >> > I think that w/ "Stacked Segments" and Lucene's architecture,
> this
> >> >> >> > should
> >> >> >> > be much easier - adding stacked stored fields ...
> >> >> >> >
> >> >> >> > As you've noticed, the update postings are not DGap encoded, and
> >> >> >> > sign
> >> >> >> > needs to be preserved. While I haven't implemented it in Juru, I
> >> >> >> > think
> >> >> >> > that perhaps this can be improved by keeping the '-' and '+'
> lists
> >> >> >> > separated. We will need to register somewhere which came before
> >> >> >> > which
> >> >> >> > because order matters a lot here (and I'm not talking about
> >> >> >> > concurrency
> >> >> >> > - simple update instructions order). I have some idea how this
> can
> >> >> >> > be
> >> >> >> > achieved, but I refrain from describing it now, to not make this
> >> >> >> > email
> >> >> >> > even longer :).
> >> >> >> >
> >> >> >> > I've mentioned that this approach can be applied to any term and
> >> >> >> > not
> >> >> >> > just categories under some circumstances. Basically, as soon as
> >> >> >> > you
> >> >> >> > update a term, its DF is no longer true, unless you are able to
> >> >> >> > take
> >> >> >> > the
> >> >> >> > updates into account. We can defer the discussion on that, but
> >> >> >> > clearly
> >> >> >> > for many fields, incrementally update them should not affect
> >> >> >> > precision,
> >> >> >> > as they're not used for that type of scoring ... Maybe, by
> keeping
> >> >> >> > separate '+' and '-' lists we can compute statistics precisely.
> >> >> >> > And I
> >> >> >> > haven't given much thought yet to how this and Mike's flex
> scoring
> >> >> >> > will
> >> >> >> > be integrated.
> >> >> >> >
> >> >> >> > BTW, a word on Parallel Indexing - the two are completely
> >> >> >> > orthogonal.
> >> >> >> > Once PI is introduced, one can index all the updateable fields
> in
> >> >> >> > a
> >> >> >> > dedicated slice, for perhaps improving search performance for
> >> >> >> > slices
> >> >> >> > that are not updateable (not involving code which attempts to
> read
> >> >> >> > and
> >> >> >> > merge update and regular lists on the fly). Also, incremental
> >> >> >> > field
> >> >> >> > updates support all of PI's scenarios, even though some will be
> >> >> >> > done
> >> >> >> > more efficiently w/ PI. But this too is a matter for a separate
> >> >> >> > discussion :).
> >> >> >> >
> >> >> >> > That's it ! I believe I've given you all the details I have
> about
> >> >> >> > the
> >> >> >> > approach and high level proposed solution for Lucene. Perhaps
> some
> >> >> >> > details slipped my mind, but if you ask the right questions, I'm
> >> >> >> > sure
> >> >> >> > I'll be able to answer them :). I would like to emphasize that
> >> >> >> > since
> >> >> >> > this was already implemented (in Juru) - this is more than just
> a
> >> >> >> > "I
> >> >> >> > think this approach can work" proposal ...
> >> >> >> >
> >> >> >> > I would appreciate your comments on this. I would like to start
> >> >> >> > implementing it soon, and so as a first step, please share your
> >> >> >> > comments
> >> >> >> > on the overall approach. I'll then write a more detailed
> >> >> >> > description
> >> >> >> > on
> >> >> >> > how I think to impl it in Lucene (been spending some time on
> >> >> >> > that),
> >> >> >> > and
> >> >> >> > we can have more detailed (and fun) discussions on the low level
> >> >> >> > details.
> >> >> >> >
> >> >> >> > Shai
> >> >> >> >
> >> >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <
> farhang@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Good point. I meant the model at the document level: i.e. what
> >> >> >> >> milestones does a document go through in its life cycle. Today:
> >> >> >> >>
> >> >> >> >> created --> deleted
> >> >> >> >>
> >> >> >> >> With incremental updates:
> >> >> >> >>
> >> >> >> >> created --> update1 --> update2 --> deleted
> >> >> >> >>
> >> >> >> >> I think what I'm trying to say is that this second threaded
> >> >> >> >> sequence
> >> >> >> >> of state changes seems intuitively more fragile under
> concurrent
> >> >> >> >> scenarios.  So for example, in a lock-free design, the system
> >> >> >> >> would
> >> >> >> >> also have to anticipate the following sequence of events:
> >> >> >> >>
> >> >> >> >> created --> update1 --> deleted --> update2
> >> >> >> >>
> >> >> >> >> and consider update2 a null op.  I'm imagining there are other
> >> >> >> >> cases
> >> >> >> >> that I can't think of..
> >> >> >> >>
> >> >> >> >> -Babak
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> >> >> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >> >> > write once, plus the option to the app to keep multiple
> commit
> >> >> >> >> > points
> >> >> >> >> > around (by customizing the deletion policy).
> >> >> >> >> >
> >> >> >> >> > Actually order of operations / commits very much matters in
> >> >> >> >> > Lucene
> >> >> >> >> > today.
> >> >> >> >> >
> >> >> >> >> > Deletions are not idempotent: if you add a doc w/ term X,
> >> >> >> >> > delete
> >> >> >> >> > by
> >> >> >> >> > term X, add a new doc with term X... that's very different
> than
> >> >> >> >> > if
> >> >> >> >> > you
> >> >> >> >> > moved the delete op to the end.  Ie the deletion only applies
> >> >> >> >> > to
> >> >> >> >> > the
> >> >> >> >> > docs added before it.
> >> >> >> >> >
> >> >> >> >> > Mike
> >> >> >> >> >
> >> >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang
> >> >> >> >> > <fa...@gmail.com>
> >> >> >> >> > wrote:
> >> >> >> >> >> Sure. Because of the write once principle.  But at some cost
> >> >> >> >> >> (duplicated data). I was just agreeing that it would not be
> a
> >> >> >> >> >> good
> >> >> >> >> >> idea to bake in version-ing by keeping the layers around
> >> >> >> >> >> forever
> >> >> >> >> >> in
> >> >> >> >> >> a
> >> >> >> >> >> merged index; I wasn't keying in on transactions per se.
> >> >> >> >> >>
> >> >> >> >> >> Speaking of transactions: I'm not sure if we should worry
> >> >> >> >> >> about
> >> >> >> >> >> this
> >> >> >> >> >> much yet, but with "updates" the order of the transaction
> >> >> >> >> >> commits
> >> >> >> >> >> seems important. I think commit order is less important
> today
> >> >> >> >> >> in
> >> >> >> >> >> Lucene because its model supports only 2 types of events:
> >> >> >> >> >> document
> >> >> >> >> >> creation--which only happens once, and document deletion,
> >> >> >> >> >> which
> >> >> >> >> >> is
> >> >> >> >> >> idempotent.  What do you think? Will commits have to be
> >> >> >> >> >> ordered
> >> >> >> >> >> if
> >> >> >> >> >> we
> >> >> >> >> >> introduce updates?  Or does the onus of maintaining order
> fall
> >> >> >> >> >> on
> >> >> >> >> >> the
> >> >> >> >> >> application?
> >> >> >> >> >>
> >> >> >> >> >> -Babak
> >> >> >> >> >>
> >> >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> >> >> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang
> >> >> >> >> >>> <fa...@gmail.com>
> >> >> >> >> >>> wrote:
> >> >> >> >> >>>>> I think they get merged in by the merger, ideally in the
> >> >> >> >> >>>>> background.
> >> >> >> >> >>>>
> >> >> >> >> >>>> That sounds sensible. (In other words, we wont concern
> >> >> >> >> >>>> ourselves
> >> >> >> >> >>>> with
> >> >> >> >> >>>> roll backs--something possible while a "layer" is still
> >> >> >> >> >>>> around.)
> >> >> >> >> >>>
> >> >> >> >> >>> Actually roll backs would still be very possible even if
> >> >> >> >> >>> layers
> >> >> >> >> >>> are
> >> >> >> >> >>> merged.
> >> >> >> >> >>>
> >> >> >> >> >>> Ie, one could keep multiple commits around, and the older
> >> >> >> >> >>> commits
> >> >> >> >> >>> would still be referring to the old postings + layers,
> >> >> >> >> >>> keeping
> >> >> >> >> >>> them
> >> >> >> >> >>> alive.
> >> >> >> >> >>>
> >> >> >> >> >>> Lucene would still be transactional with such an approach.
> >> >> >> >> >>>
> >> >> >> >> >>> Mike
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >> >> >>> To unsubscribe, e-mail:
> >> >> >> >> >>> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> >>> For additional commands, e-mail:
> >> >> >> >> >>> java-dev-help@lucene.apache.org
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> >> To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> >> For additional commands, e-mail:
> >> >> >> >> >> java-dev-help@lucene.apache.org
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> ---------------------------------------------------------------------
> >> >> >> >> > To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> > For additional commands, e-mail:
> >> >> >> >> > java-dev-help@lucene.apache.org
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

I think another example would be the catch-all field.

EG say my app concatenates the title, abstract and body of a document
into the catch-all field.

But now I want to change just the title.

I think in theory (assuming we can work out an intuitive user-level
API exposure of this...), on changing the title, we could
incrementally re-index the catch-all field, to remove the old title
and add the new one.  Right?

But: would this approach work on the positions too?

EG say term X appears 4 times in the doc -- positions 1, 7, 22, 53.
I've now replaced a part of this field, affecting only this term's
occurrence at position 53, so that maybe that occurrence is now at 57.
 Are we able to handle this?

One possible low-level API (I think Doug originally suggested this)
might be to allow .remove() calls on the enums (like many Java
collection support from their iterators) -- eg as you are stepping
through the DocsEnum, you could call .remove() to complete
disassociate that term from that doc, or in DocsAndPositionsEnum you
could call .removePosition to remove a specific position.

We would still need a higher level user API... maybe you'd provide a
"text to remove" and a "text to add" for each field, pre-init'd with
position/offset for the analyzer (if the field is analyzed)?  I guess
we'd make this a new attr on Field.  Ie, Field normally contains just
"text/tokens to add to the index", but we could also include a
"text/tokens to remove".  Or, each Field instance is marked as "to be
added" vs "to be removed", and you have to add 2 Field instances to
subtract then add.

Mike

On Sun, May 9, 2010 at 7:38 AM, Shai Erera <se...@gmail.com> wrote:
>> When I update a field, I
>> want to update all of it, not just part of it. No?
>
> Well ... might be. But the most common case for us are fields to which we
> want to add data or remove. For example ACLs - you don't want to replace the
> entire field w/ the document, but simply to add/remove access for certain
> people/groups. Same goes for "social" fields, like tags, ratings, bookmarks
> etc. - the granularity of the update is to associate/dissociate a particular
> value w/ the field + doc, and not update the entire field.
>
> Shai
>
> On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <fa...@gmail.com> wrote:
>>
>> > No, actually, you can update index-only fields also. It all depends on
>> > the
>> > operation that you ask to do on the fields.
>>
>> I love the level of control this provides, but again, I was talking at
>> the user level.
>>
>> > If you want to e.g. remove an entire field w/ such update operation,
>> > then it
>> > becomes more expensive
>>
>> That's the typical usage scenario, I imagine. When I update a field, I
>> want to update all of it, not just part of it. No?
>>
>> (The lower level semantics of twiddling with the postings is poorly
>> understood by the typical user..)
>>
>> > We didn't
>> > face such a scenario though, and I think it's probably a rare one?
>>
>> As rare as any time you want to update an indexed-only field.  Not a
>> serious limitation (but perhaps worth noting?)
>> Perhaps at the API level, you provide an updateIndexedOnlyField that
>> takes the old value as well as the new value for the field.
>>
>> Anyway, I think your approach would be a great addition to the
>> toolkit. Would love to see it even in rough cut form :))
>>
>> -Babak
>>
>> On Sun, May 9, 2010 at 12:49 AM, Shai Erera <se...@gmail.com> wrote:
>> > No, actually, you can update index-only fields also. It all depends on
>> > the
>> > operation that you ask to do on the fields. For example, if the query to
>> > execute is something like "update all documents w/ tags:ibm -> remove
>> > terms
>> > t1, t2, t3 and add terms t4, t5", then the result of such request would
>> > dissociate t1-3 from those docs that answer tags:ibm and associate them
>> > w/
>> > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
>> > following posting updates will be done:
>> > t1: -1, -4, -10
>> > t2: -1, -4, -10
>> > t3: -1, -4, -10
>> > t4, 1, 4, 10
>> > t5, 1, 4, 10
>> > (in addition to whatever other updates that are associated with those
>> > postings).
>> >
>> > At search time, if you search for "t1 OR t2", then the regular t1 and t2
>> > postings will be merged on-the-fly w/ the updated ones to remove docs 1,
>> > 4,
>> > 10.
>> >
>> > If you want to e.g. remove an entire field w/ such update operation,
>> > then it
>> > becomes more expensive, but in general you'd need to iterate over the
>> > field's terms and dissociate the documents from all the terms. We didn't
>> > face such a scenario though, and I think it's probably a rare one?
>> >
>> > Shai
>> >
>> > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <fa...@gmail.com> wrote:
>> >>
>> >> Shai,
>> >>
>> >> I think this is an interesting approach. I can see how I could
>> >> [incrementally] update a stored, indexed field this way, but I don't
>> >> understand how I could update an indexed-only field. Here's why: for a
>> >> stored (and indexed) field, I can always determine what terms to
>> >> remove ('-') from the postings, but for an indexed-only field I'd have
>> >> no [practical] way to know..
>> >>
>> >> So under this approach,  I'm thinking at a user level, a Lucene field
>> >> would be updateable only if it's at least stored.
>> >>
>> >> Is that right?
>> >>
>> >> -Babak
>> >>
>> >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Yes Mike - I don't know yet if two MPs will be used, one for the
>> >> > stacked
>> >> > segments and one for the general segments (which will include the
>> >> > stacked
>> >> > ones as well) .. feels like one MP should be enough, but this can be
>> >> > decided
>> >> > on as we progress.
>> >> >
>> >> > This approach allows you to update every term in every already
>> >> > indexed
>> >> > field, as well as add terms to already indexed fields ... and add
>> >> > totally
>> >> > new fields, with lots of text in them. So e.g. there are two neat use
>> >> > cases
>> >> > that come to mind:
>> >> > 1) Allow users to rate search results, favor them etc.
>> >> > 2) Or even comment them,
>> >> > I think Google offers the 2nd. Both translate into updating the
>> >> > search
>> >> > result's already indexed document w/ the new rating, comment etc. w/o
>> >> > needing to reindex the document.
>> >> >
>> >> > I will try to find perf results numbers. It's been long time ago,
>> >> > hope
>> >> > all
>> >> > the documents are still where they were :). Indeed, for terms like
>> >> > ACLs,
>> >> > it
>> >> > means that each query had to merge dozens of postings lists. For that
>> >> > I
>> >> > implemented an alternative solution, which uses a payload-like
>> >> > structure
>> >> > that registers for each document the list of ACLs that are associated
>> >> > with
>> >> > it (not as strings, it was more efficient). Then, if the query
>> >> > included
>> >> > dozens of such terms, I created a filter-like object which for every
>> >> > matching document by the query checked if it matches the ACLs list of
>> >> > the
>> >> > document. This is usually slower, because the ACLs themselves don't
>> >> > drive
>> >> > the query, which means more matches will be found. That was a
>> >> > tradeoff
>> >> > which
>> >> > one could configure based on the number of such terms in the query,
>> >> > the
>> >> > number of updated terms etc.
>> >> >
>> >> > But in essence you're right - if the solution is generic to cover any
>> >> > term,
>> >> > we cannot use such payload-based feature. We might need to merge the
>> >> > stacked
>> >> > segments more frequently. This is pending perf testing though.
>> >> >
>> >> > Shai
>> >> >
>> >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
>> >> > <lu...@mikemccandless.com> wrote:
>> >> >>
>> >> >> Catching up here :)
>> >> >>
>> >> >> This is great stuff Shai -- I like the notion of "negative"
>> >> >> postings,
>> >> >> that "subtract" docs from previous generations as you iterate them.
>> >> >>
>> >> >> And I like the term "stacked segments".  This fits very well with
>> >> >> Lucene's write-once approach, ie, a writer can at any time stack a
>> >> >> new
>> >> >> segment (writes to new files) "over" an old segment, like the layers
>> >> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
>> >> >>
>> >> >> And the merge policy now picks from 2 dimensions right?  Ie it may
>> >> >> want to simply consolidate stacks on an old segment but otherwise
>> >> >> not
>> >> >> merge that segment (eg for very large segments that have accumulated
>> >> >> updates), and normal merging would of course consolidate all stacks
>> >> >> for all segments merged.
>> >> >>
>> >> >> Wouldn't this approach conceivably allow you to alter single terms
>> >> >> within a single field (we'd have to figure out how we'd expose the
>> >> >> API
>> >> >> for this)?  EG if I appended some text to an already-indexed field?
>> >> >> (In addition to adding a new field to an already indexed doc, or
>> >> >> replacing an indexed field on a previously indexed doc).
>> >> >>
>> >> >> Did you have any hard perf numbers?  Merge sorting N streams is
>> >> >> surprisingly costly... we may need/want to have a reader pre-merge
>> >> >> (using up RAM) any "long tail" of stacked segments as long as they
>> >> >> are
>> >> >> small enough...
>> >> >>
>> >> >> This sounds great!!
>> >> >>
>> >> >> Mike
>> >> >>
>> >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > WARNING: following email is a bit long, but I think is worth the
>> >> >> > reading
>> >> >> > :)
>> >> >> >
>> >> >> > I would like to describe my implementation of incremental field
>> >> >> > updates
>> >> >> > in Juru (the former search library I've worked on in IBM). I will
>> >> >> > later
>> >> >> > discuss how I think it can be implemented in Lucene.
>> >> >> >
>> >> >> > The motivation/requirement came from a doc management system which
>> >> >> > used
>> >> >> > Juru as its search component. The system included document
>> >> >> > libraries
>> >> >> > where users could create files and upload documents. A user could
>> >> >> > belong
>> >> >> > to any number of libraries and complex ACLs model was used (down
>> >> >> > to
>> >> >> > the
>> >> >> > level of the file). ACLs and Folders were modeled as categories in
>> >> >> > the
>> >> >> > index (boolean-like terms). Files and folders could be moved
>> >> >> > around
>> >> >> > and
>> >> >> > access to a library/folder/file could be granted/revoked at any
>> >> >> > given
>> >> >> > time. Therefore, such updates usually affected hundreds (and
>> >> >> > thousands)
>> >> >> > of documents. Overall, the index managed millions of documents,
>> >> >> > tens
>> >> >> > of
>> >> >> > thousands of libraries and hundreds of thousands of ACLs (large
>> >> >> > organization :)). To get a rough understanding on the number of
>> >> >> > such
>> >> >> > updates - every several minutes, tens of thousands of documents
>> >> >> > were
>> >> >> > updated due to such changes only (in addition to the regular
>> >> >> > content
>> >> >> > updates).
>> >> >> >
>> >> >> > We were asked to support requests in the following form: "update
>> >> >> > all
>> >> >> > docs
>> >> >> > that match <criteria> --> do <operation>" where:
>> >> >> > * <criteria> was "a single doc", "docs belonging to a category"
>> >> >> > and
>> >> >> > "docs
>> >> >> > belonging to a set of categories".
>> >> >> > * <operation> was "add categories NEW" (NEW might not even exist
>> >> >> > in
>> >> >> > the
>> >> >> > index yet, or already associated w/ the document), "remove
>> >> >> > categories
>> >> >> > OLD"
>> >> >> > (irregardless if the docs were already associated w/ OLD or not)
>> >> >> > and
>> >> >> > "remove all OLD and add all NEW".
>> >> >> >
>> >> >> > The solution I've implemented to support these requests turned out
>> >> >> > to
>> >> >> > actually allow you to update every term (!) in the index: suppose
>> >> >> > that
>> >> >> > you have a table, which recorded tuples like <docid, term, +/->.
>> >> >> > The
>> >> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term
>> >> >> > "ibm",
>> >> >> > and the record <1, "hp", '-'> means that the same document is not
>> >> >> > associated w/ the word "hp". Then, you could very easily ask for
>> >> >> > all
>> >> >> > documents that are assoicated w/ "hp", and the result would not
>> >> >> > include
>> >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal
>> >> >> > id
>> >> >> > the
>> >> >> > document received.
>> >> >> >
>> >> >> > I've kept two types of postings in the index: regular and updates.
>> >> >> > Taking the above examples, "ibm" regular posting looked like
>> >> >> > <"ibm",
>> >> >> > 1,
>> >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like
>> >> >> > <"ibm",
>> >> >> > +2,
>> >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly
>> >> >> > for
>> >> >> > "hp".
>> >> >> >
>> >> >> > During search time, when a query with the word "ibm" was
>> >> >> > submitted, I
>> >> >> > create a virtual posting which reads from both the regular and the
>> >> >> > updates, and merges them on the fly according to the +/- signs.
>> >> >> > Since
>> >> >> > both postings are sorted in ascending order, the merge is very
>> >> >> > efficient, and query time is hardly affected.
>> >> >> >
>> >> >> > Those postings are merged from time to time in a process that is
>> >> >> > similar
>> >> >> > to how Lucene works today, which keeps the update postings
>> >> >> > relatively
>> >> >> > small and manageable.
>> >> >> >
>> >> >> > Now here comes the fun part - how I think it can be implemented in
>> >> >> > Lucene !
>> >> >> >
>> >> >> > To be honest, this sat on my TODO list for a very long time and
>> >> >> > only
>> >> >> > a
>> >> >> > couple of months ago I figured out how to implement it in Lucene.
>> >> >> > The
>> >> >> > main difficulty I had was around the difference between the
>> >> >> > write-once
>> >> >> > policy in Juru and Lucene - in Lucene, once a segment is written,
>> >> >> > it
>> >> >> > cannot be changed. BUT, I've only recently realized that this
>> >> >> > isn't
>> >> >> > exactly true, because deleted docs do change existing segments.
>> >> >> > The
>> >> >> > deletes are kept in a separate file to the segment (.del) and have
>> >> >> > their
>> >> >> > own generation. Deletes, as I understood then, and Grant helped me
>> >> >> > term
>> >> >> > them better, can be defined as "Stacked Segments" - they add data
>> >> >> > to
>> >> >> > a
>> >> >> > segment, which from time to time are integrated into the segment
>> >> >> > (unlike
>> >> >> > Photoshop Layers, but my understanding of Photoshop is limited).
>> >> >> > And
>> >> >> > the
>> >> >> > Lucene engine knows how to combine the two, giving precedence to
>> >> >> > the
>> >> >> > deletes.
>> >> >> >
>> >> >> > By introducing an "Updates Stacked Segment", we can encode
>> >> >> > postings
>> >> >> > w/
>> >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we
>> >> >> > can
>> >> >> > create a variation which merges the two lists. When segments are
>> >> >> > merged,
>> >> >> > the updates will be merged into the regular postings (just like
>> >> >> > deletes)
>> >> >> > and thus will be gone. In addition, this plays very nicely with
>> >> >> > readers
>> >> >> > that are currently reading the index, as well as we can have
>> >> >> > generations
>> >> >> > for the updates - really like deletes !
>> >> >> >
>> >> >> > I think that Lucene's architecture allows for such a solution very
>> >> >> > cleanly and nicely (and I believe flex makes it even easier). We
>> >> >> > can
>> >> >> > (later, after you've digested the idea) discuss whether this
>> >> >> > should
>> >> >> > be
>> >> >> > built into the current IW, or an extension like UpdateableIW. The
>> >> >> > API
>> >> >> > I've been thinking about should really be like deletes, allowing
>> >> >> > to
>> >> >> > update docs based on Term or Query. I defer the API discussion for
>> >> >> > later
>> >> >> > for now.
>> >> >> >
>> >> >> > As for stored fields, this was a real challenge to support in
>> >> >> > Juru,
>> >> >> > but
>> >> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this
>> >> >> > should
>> >> >> > be much easier - adding stacked stored fields ...
>> >> >> >
>> >> >> > As you've noticed, the update postings are not DGap encoded, and
>> >> >> > sign
>> >> >> > needs to be preserved. While I haven't implemented it in Juru, I
>> >> >> > think
>> >> >> > that perhaps this can be improved by keeping the '-' and '+' lists
>> >> >> > separated. We will need to register somewhere which came before
>> >> >> > which
>> >> >> > because order matters a lot here (and I'm not talking about
>> >> >> > concurrency
>> >> >> > - simple update instructions order). I have some idea how this can
>> >> >> > be
>> >> >> > achieved, but I refrain from describing it now, to not make this
>> >> >> > email
>> >> >> > even longer :).
>> >> >> >
>> >> >> > I've mentioned that this approach can be applied to any term and
>> >> >> > not
>> >> >> > just categories under some circumstances. Basically, as soon as
>> >> >> > you
>> >> >> > update a term, its DF is no longer true, unless you are able to
>> >> >> > take
>> >> >> > the
>> >> >> > updates into account. We can defer the discussion on that, but
>> >> >> > clearly
>> >> >> > for many fields, incrementally update them should not affect
>> >> >> > precision,
>> >> >> > as they're not used for that type of scoring ... Maybe, by keeping
>> >> >> > separate '+' and '-' lists we can compute statistics precisely.
>> >> >> > And I
>> >> >> > haven't given much thought yet to how this and Mike's flex scoring
>> >> >> > will
>> >> >> > be integrated.
>> >> >> >
>> >> >> > BTW, a word on Parallel Indexing - the two are completely
>> >> >> > orthogonal.
>> >> >> > Once PI is introduced, one can index all the updateable fields in
>> >> >> > a
>> >> >> > dedicated slice, for perhaps improving search performance for
>> >> >> > slices
>> >> >> > that are not updateable (not involving code which attempts to read
>> >> >> > and
>> >> >> > merge update and regular lists on the fly). Also, incremental
>> >> >> > field
>> >> >> > updates support all of PI's scenarios, even though some will be
>> >> >> > done
>> >> >> > more efficiently w/ PI. But this too is a matter for a separate
>> >> >> > discussion :).
>> >> >> >
>> >> >> > That's it ! I believe I've given you all the details I have about
>> >> >> > the
>> >> >> > approach and high level proposed solution for Lucene. Perhaps some
>> >> >> > details slipped my mind, but if you ask the right questions, I'm
>> >> >> > sure
>> >> >> > I'll be able to answer them :). I would like to emphasize that
>> >> >> > since
>> >> >> > this was already implemented (in Juru) - this is more than just a
>> >> >> > "I
>> >> >> > think this approach can work" proposal ...
>> >> >> >
>> >> >> > I would appreciate your comments on this. I would like to start
>> >> >> > implementing it soon, and so as a first step, please share your
>> >> >> > comments
>> >> >> > on the overall approach. I'll then write a more detailed
>> >> >> > description
>> >> >> > on
>> >> >> > how I think to impl it in Lucene (been spending some time on
>> >> >> > that),
>> >> >> > and
>> >> >> > we can have more detailed (and fun) discussions on the low level
>> >> >> > details.
>> >> >> >
>> >> >> > Shai
>> >> >> >
>> >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Good point. I meant the model at the document level: i.e. what
>> >> >> >> milestones does a document go through in its life cycle. Today:
>> >> >> >>
>> >> >> >> created --> deleted
>> >> >> >>
>> >> >> >> With incremental updates:
>> >> >> >>
>> >> >> >> created --> update1 --> update2 --> deleted
>> >> >> >>
>> >> >> >> I think what I'm trying to say is that this second threaded
>> >> >> >> sequence
>> >> >> >> of state changes seems intuitively more fragile under concurrent
>> >> >> >> scenarios.  So for example, in a lock-free design, the system
>> >> >> >> would
>> >> >> >> also have to anticipate the following sequence of events:
>> >> >> >>
>> >> >> >> created --> update1 --> deleted --> update2
>> >> >> >>
>> >> >> >> and consider update2 a null op.  I'm imagining there are other
>> >> >> >> cases
>> >> >> >> that I can't think of..
>> >> >> >>
>> >> >> >> -Babak
>> >> >> >>
>> >> >> >>
>> >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> >> >> >> <lu...@mikemccandless.com> wrote:
>> >> >> >> > write once, plus the option to the app to keep multiple commit
>> >> >> >> > points
>> >> >> >> > around (by customizing the deletion policy).
>> >> >> >> >
>> >> >> >> > Actually order of operations / commits very much matters in
>> >> >> >> > Lucene
>> >> >> >> > today.
>> >> >> >> >
>> >> >> >> > Deletions are not idempotent: if you add a doc w/ term X,
>> >> >> >> > delete
>> >> >> >> > by
>> >> >> >> > term X, add a new doc with term X... that's very different than
>> >> >> >> > if
>> >> >> >> > you
>> >> >> >> > moved the delete op to the end.  Ie the deletion only applies
>> >> >> >> > to
>> >> >> >> > the
>> >> >> >> > docs added before it.
>> >> >> >> >
>> >> >> >> > Mike
>> >> >> >> >
>> >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang
>> >> >> >> > <fa...@gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >> Sure. Because of the write once principle.  But at some cost
>> >> >> >> >> (duplicated data). I was just agreeing that it would not be a
>> >> >> >> >> good
>> >> >> >> >> idea to bake in version-ing by keeping the layers around
>> >> >> >> >> forever
>> >> >> >> >> in
>> >> >> >> >> a
>> >> >> >> >> merged index; I wasn't keying in on transactions per se.
>> >> >> >> >>
>> >> >> >> >> Speaking of transactions: I'm not sure if we should worry
>> >> >> >> >> about
>> >> >> >> >> this
>> >> >> >> >> much yet, but with "updates" the order of the transaction
>> >> >> >> >> commits
>> >> >> >> >> seems important. I think commit order is less important today
>> >> >> >> >> in
>> >> >> >> >> Lucene because its model supports only 2 types of events:
>> >> >> >> >> document
>> >> >> >> >> creation--which only happens once, and document deletion,
>> >> >> >> >> which
>> >> >> >> >> is
>> >> >> >> >> idempotent.  What do you think? Will commits have to be
>> >> >> >> >> ordered
>> >> >> >> >> if
>> >> >> >> >> we
>> >> >> >> >> introduce updates?  Or does the onus of maintaining order fall
>> >> >> >> >> on
>> >> >> >> >> the
>> >> >> >> >> application?
>> >> >> >> >>
>> >> >> >> >> -Babak
>> >> >> >> >>
>> >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> >> >> >> >> <lu...@mikemccandless.com> wrote:
>> >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang
>> >> >> >> >>> <fa...@gmail.com>
>> >> >> >> >>> wrote:
>> >> >> >> >>>>> I think they get merged in by the merger, ideally in the
>> >> >> >> >>>>> background.
>> >> >> >> >>>>
>> >> >> >> >>>> That sounds sensible. (In other words, we wont concern
>> >> >> >> >>>> ourselves
>> >> >> >> >>>> with
>> >> >> >> >>>> roll backs--something possible while a "layer" is still
>> >> >> >> >>>> around.)
>> >> >> >> >>>
>> >> >> >> >>> Actually roll backs would still be very possible even if
>> >> >> >> >>> layers
>> >> >> >> >>> are
>> >> >> >> >>> merged.
>> >> >> >> >>>
>> >> >> >> >>> Ie, one could keep multiple commits around, and the older
>> >> >> >> >>> commits
>> >> >> >> >>> would still be referring to the old postings + layers,
>> >> >> >> >>> keeping
>> >> >> >> >>> them
>> >> >> >> >>> alive.
>> >> >> >> >>>
>> >> >> >> >>> Lucene would still be transactional with such an approach.
>> >> >> >> >>>
>> >> >> >> >>> Mike
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> ---------------------------------------------------------------------
>> >> >> >> >>> To unsubscribe, e-mail:
>> >> >> >> >>> java-dev-unsubscribe@lucene.apache.org
>> >> >> >> >>> For additional commands, e-mail:
>> >> >> >> >>> java-dev-help@lucene.apache.org
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> ---------------------------------------------------------------------
>> >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >> >> For additional commands, e-mail:
>> >> >> >> >> java-dev-help@lucene.apache.org
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > ---------------------------------------------------------------------
>> >> >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >> > For additional commands, e-mail:
>> >> >> >> > java-dev-help@lucene.apache.org
>> >> >> >> >
>> >> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >>
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

>
> When I update a field, I
> want to update all of it, not just part of it. No?
>

Well ... might be. But the most common case for us are fields to which we
want to add data or remove. For example ACLs - you don't want to replace the
entire field w/ the document, but simply to add/remove access for certain
people/groups. Same goes for "social" fields, like tags, ratings, bookmarks
etc. - the granularity of the update is to associate/dissociate a particular
value w/ the field + doc, and not update the entire field.

Shai

On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <fa...@gmail.com> wrote:

> > No, actually, you can update index-only fields also. It all depends on
> the
> > operation that you ask to do on the fields.
>
> I love the level of control this provides, but again, I was talking at
> the user level.
>
> > If you want to e.g. remove an entire field w/ such update operation, then
> it
> > becomes more expensive
>
> That's the typical usage scenario, I imagine. When I update a field, I
> want to update all of it, not just part of it. No?
>
> (The lower level semantics of twiddling with the postings is poorly
> understood by the typical user..)
>
> > We didn't
> > face such a scenario though, and I think it's probably a rare one?
>
> As rare as any time you want to update an indexed-only field.  Not a
> serious limitation (but perhaps worth noting?)
> Perhaps at the API level, you provide an updateIndexedOnlyField that
> takes the old value as well as the new value for the field.
>
> Anyway, I think your approach would be a great addition to the
> toolkit. Would love to see it even in rough cut form :))
>
> -Babak
>
> On Sun, May 9, 2010 at 12:49 AM, Shai Erera <se...@gmail.com> wrote:
> > No, actually, you can update index-only fields also. It all depends on
> the
> > operation that you ask to do on the fields. For example, if the query to
> > execute is something like "update all documents w/ tags:ibm -> remove
> terms
> > t1, t2, t3 and add terms t4, t5", then the result of such request would
> > dissociate t1-3 from those docs that answer tags:ibm and associate them
> w/
> > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
> > following posting updates will be done:
> > t1: -1, -4, -10
> > t2: -1, -4, -10
> > t3: -1, -4, -10
> > t4, 1, 4, 10
> > t5, 1, 4, 10
> > (in addition to whatever other updates that are associated with those
> > postings).
> >
> > At search time, if you search for "t1 OR t2", then the regular t1 and t2
> > postings will be merged on-the-fly w/ the updated ones to remove docs 1,
> 4,
> > 10.
> >
> > If you want to e.g. remove an entire field w/ such update operation, then
> it
> > becomes more expensive, but in general you'd need to iterate over the
> > field's terms and dissociate the documents from all the terms. We didn't
> > face such a scenario though, and I think it's probably a rare one?
> >
> > Shai
> >
> > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <fa...@gmail.com> wrote:
> >>
> >> Shai,
> >>
> >> I think this is an interesting approach. I can see how I could
> >> [incrementally] update a stored, indexed field this way, but I don't
> >> understand how I could update an indexed-only field. Here's why: for a
> >> stored (and indexed) field, I can always determine what terms to
> >> remove ('-') from the postings, but for an indexed-only field I'd have
> >> no [practical] way to know..
> >>
> >> So under this approach,  I'm thinking at a user level, a Lucene field
> >> would be updateable only if it's at least stored.
> >>
> >> Is that right?
> >>
> >> -Babak
> >>
> >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <se...@gmail.com> wrote:
> >> > Yes Mike - I don't know yet if two MPs will be used, one for the
> stacked
> >> > segments and one for the general segments (which will include the
> >> > stacked
> >> > ones as well) .. feels like one MP should be enough, but this can be
> >> > decided
> >> > on as we progress.
> >> >
> >> > This approach allows you to update every term in every already indexed
> >> > field, as well as add terms to already indexed fields ... and add
> >> > totally
> >> > new fields, with lots of text in them. So e.g. there are two neat use
> >> > cases
> >> > that come to mind:
> >> > 1) Allow users to rate search results, favor them etc.
> >> > 2) Or even comment them,
> >> > I think Google offers the 2nd. Both translate into updating the search
> >> > result's already indexed document w/ the new rating, comment etc. w/o
> >> > needing to reindex the document.
> >> >
> >> > I will try to find perf results numbers. It's been long time ago, hope
> >> > all
> >> > the documents are still where they were :). Indeed, for terms like
> ACLs,
> >> > it
> >> > means that each query had to merge dozens of postings lists. For that
> I
> >> > implemented an alternative solution, which uses a payload-like
> structure
> >> > that registers for each document the list of ACLs that are associated
> >> > with
> >> > it (not as strings, it was more efficient). Then, if the query
> included
> >> > dozens of such terms, I created a filter-like object which for every
> >> > matching document by the query checked if it matches the ACLs list of
> >> > the
> >> > document. This is usually slower, because the ACLs themselves don't
> >> > drive
> >> > the query, which means more matches will be found. That was a tradeoff
> >> > which
> >> > one could configure based on the number of such terms in the query,
> the
> >> > number of updated terms etc.
> >> >
> >> > But in essence you're right - if the solution is generic to cover any
> >> > term,
> >> > we cannot use such payload-based feature. We might need to merge the
> >> > stacked
> >> > segments more frequently. This is pending perf testing though.
> >> >
> >> > Shai
> >> >
> >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> >> > <lu...@mikemccandless.com> wrote:
> >> >>
> >> >> Catching up here :)
> >> >>
> >> >> This is great stuff Shai -- I like the notion of "negative" postings,
> >> >> that "subtract" docs from previous generations as you iterate them.
> >> >>
> >> >> And I like the term "stacked segments".  This fits very well with
> >> >> Lucene's write-once approach, ie, a writer can at any time stack a
> new
> >> >> segment (writes to new files) "over" an old segment, like the layers
> >> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
> >> >>
> >> >> And the merge policy now picks from 2 dimensions right?  Ie it may
> >> >> want to simply consolidate stacks on an old segment but otherwise not
> >> >> merge that segment (eg for very large segments that have accumulated
> >> >> updates), and normal merging would of course consolidate all stacks
> >> >> for all segments merged.
> >> >>
> >> >> Wouldn't this approach conceivably allow you to alter single terms
> >> >> within a single field (we'd have to figure out how we'd expose the
> API
> >> >> for this)?  EG if I appended some text to an already-indexed field?
> >> >> (In addition to adding a new field to an already indexed doc, or
> >> >> replacing an indexed field on a previously indexed doc).
> >> >>
> >> >> Did you have any hard perf numbers?  Merge sorting N streams is
> >> >> surprisingly costly... we may need/want to have a reader pre-merge
> >> >> (using up RAM) any "long tail" of stacked segments as long as they
> are
> >> >> small enough...
> >> >>
> >> >> This sounds great!!
> >> >>
> >> >> Mike
> >> >>
> >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com>
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > WARNING: following email is a bit long, but I think is worth the
> >> >> > reading
> >> >> > :)
> >> >> >
> >> >> > I would like to describe my implementation of incremental field
> >> >> > updates
> >> >> > in Juru (the former search library I've worked on in IBM). I will
> >> >> > later
> >> >> > discuss how I think it can be implemented in Lucene.
> >> >> >
> >> >> > The motivation/requirement came from a doc management system which
> >> >> > used
> >> >> > Juru as its search component. The system included document
> libraries
> >> >> > where users could create files and upload documents. A user could
> >> >> > belong
> >> >> > to any number of libraries and complex ACLs model was used (down to
> >> >> > the
> >> >> > level of the file). ACLs and Folders were modeled as categories in
> >> >> > the
> >> >> > index (boolean-like terms). Files and folders could be moved around
> >> >> > and
> >> >> > access to a library/folder/file could be granted/revoked at any
> given
> >> >> > time. Therefore, such updates usually affected hundreds (and
> >> >> > thousands)
> >> >> > of documents. Overall, the index managed millions of documents,
> tens
> >> >> > of
> >> >> > thousands of libraries and hundreds of thousands of ACLs (large
> >> >> > organization :)). To get a rough understanding on the number of
> such
> >> >> > updates - every several minutes, tens of thousands of documents
> were
> >> >> > updated due to such changes only (in addition to the regular
> content
> >> >> > updates).
> >> >> >
> >> >> > We were asked to support requests in the following form: "update
> all
> >> >> > docs
> >> >> > that match <criteria> --> do <operation>" where:
> >> >> > * <criteria> was "a single doc", "docs belonging to a category" and
> >> >> > "docs
> >> >> > belonging to a set of categories".
> >> >> > * <operation> was "add categories NEW" (NEW might not even exist in
> >> >> > the
> >> >> > index yet, or already associated w/ the document), "remove
> categories
> >> >> > OLD"
> >> >> > (irregardless if the docs were already associated w/ OLD or not)
> and
> >> >> > "remove all OLD and add all NEW".
> >> >> >
> >> >> > The solution I've implemented to support these requests turned out
> to
> >> >> > actually allow you to update every term (!) in the index: suppose
> >> >> > that
> >> >> > you have a table, which recorded tuples like <docid, term, +/->.
> The
> >> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term
> >> >> > "ibm",
> >> >> > and the record <1, "hp", '-'> means that the same document is not
> >> >> > associated w/ the word "hp". Then, you could very easily ask for
> all
> >> >> > documents that are assoicated w/ "hp", and the result would not
> >> >> > include
> >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id
> >> >> > the
> >> >> > document received.
> >> >> >
> >> >> > I've kept two types of postings in the index: regular and updates.
> >> >> > Taking the above examples, "ibm" regular posting looked like
> <"ibm",
> >> >> > 1,
> >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm",
> >> >> > +2,
> >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly
> for
> >> >> > "hp".
> >> >> >
> >> >> > During search time, when a query with the word "ibm" was submitted,
> I
> >> >> > create a virtual posting which reads from both the regular and the
> >> >> > updates, and merges them on the fly according to the +/- signs.
> Since
> >> >> > both postings are sorted in ascending order, the merge is very
> >> >> > efficient, and query time is hardly affected.
> >> >> >
> >> >> > Those postings are merged from time to time in a process that is
> >> >> > similar
> >> >> > to how Lucene works today, which keeps the update postings
> relatively
> >> >> > small and manageable.
> >> >> >
> >> >> > Now here comes the fun part - how I think it can be implemented in
> >> >> > Lucene !
> >> >> >
> >> >> > To be honest, this sat on my TODO list for a very long time and
> only
> >> >> > a
> >> >> > couple of months ago I figured out how to implement it in Lucene.
> The
> >> >> > main difficulty I had was around the difference between the
> >> >> > write-once
> >> >> > policy in Juru and Lucene - in Lucene, once a segment is written,
> it
> >> >> > cannot be changed. BUT, I've only recently realized that this isn't
> >> >> > exactly true, because deleted docs do change existing segments. The
> >> >> > deletes are kept in a separate file to the segment (.del) and have
> >> >> > their
> >> >> > own generation. Deletes, as I understood then, and Grant helped me
> >> >> > term
> >> >> > them better, can be defined as "Stacked Segments" - they add data
> to
> >> >> > a
> >> >> > segment, which from time to time are integrated into the segment
> >> >> > (unlike
> >> >> > Photoshop Layers, but my understanding of Photoshop is limited).
> And
> >> >> > the
> >> >> > Lucene engine knows how to combine the two, giving precedence to
> the
> >> >> > deletes.
> >> >> >
> >> >> > By introducing an "Updates Stacked Segment", we can encode postings
> >> >> > w/
> >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
> >> >> > create a variation which merges the two lists. When segments are
> >> >> > merged,
> >> >> > the updates will be merged into the regular postings (just like
> >> >> > deletes)
> >> >> > and thus will be gone. In addition, this plays very nicely with
> >> >> > readers
> >> >> > that are currently reading the index, as well as we can have
> >> >> > generations
> >> >> > for the updates - really like deletes !
> >> >> >
> >> >> > I think that Lucene's architecture allows for such a solution very
> >> >> > cleanly and nicely (and I believe flex makes it even easier). We
> can
> >> >> > (later, after you've digested the idea) discuss whether this should
> >> >> > be
> >> >> > built into the current IW, or an extension like UpdateableIW. The
> API
> >> >> > I've been thinking about should really be like deletes, allowing to
> >> >> > update docs based on Term or Query. I defer the API discussion for
> >> >> > later
> >> >> > for now.
> >> >> >
> >> >> > As for stored fields, this was a real challenge to support in Juru,
> >> >> > but
> >> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this
> >> >> > should
> >> >> > be much easier - adding stacked stored fields ...
> >> >> >
> >> >> > As you've noticed, the update postings are not DGap encoded, and
> sign
> >> >> > needs to be preserved. While I haven't implemented it in Juru, I
> >> >> > think
> >> >> > that perhaps this can be improved by keeping the '-' and '+' lists
> >> >> > separated. We will need to register somewhere which came before
> which
> >> >> > because order matters a lot here (and I'm not talking about
> >> >> > concurrency
> >> >> > - simple update instructions order). I have some idea how this can
> be
> >> >> > achieved, but I refrain from describing it now, to not make this
> >> >> > email
> >> >> > even longer :).
> >> >> >
> >> >> > I've mentioned that this approach can be applied to any term and
> not
> >> >> > just categories under some circumstances. Basically, as soon as you
> >> >> > update a term, its DF is no longer true, unless you are able to
> take
> >> >> > the
> >> >> > updates into account. We can defer the discussion on that, but
> >> >> > clearly
> >> >> > for many fields, incrementally update them should not affect
> >> >> > precision,
> >> >> > as they're not used for that type of scoring ... Maybe, by keeping
> >> >> > separate '+' and '-' lists we can compute statistics precisely. And
> I
> >> >> > haven't given much thought yet to how this and Mike's flex scoring
> >> >> > will
> >> >> > be integrated.
> >> >> >
> >> >> > BTW, a word on Parallel Indexing - the two are completely
> orthogonal.
> >> >> > Once PI is introduced, one can index all the updateable fields in a
> >> >> > dedicated slice, for perhaps improving search performance for
> slices
> >> >> > that are not updateable (not involving code which attempts to read
> >> >> > and
> >> >> > merge update and regular lists on the fly). Also, incremental field
> >> >> > updates support all of PI's scenarios, even though some will be
> done
> >> >> > more efficiently w/ PI. But this too is a matter for a separate
> >> >> > discussion :).
> >> >> >
> >> >> > That's it ! I believe I've given you all the details I have about
> the
> >> >> > approach and high level proposed solution for Lucene. Perhaps some
> >> >> > details slipped my mind, but if you ask the right questions, I'm
> sure
> >> >> > I'll be able to answer them :). I would like to emphasize that
> since
> >> >> > this was already implemented (in Juru) - this is more than just a
> "I
> >> >> > think this approach can work" proposal ...
> >> >> >
> >> >> > I would appreciate your comments on this. I would like to start
> >> >> > implementing it soon, and so as a first step, please share your
> >> >> > comments
> >> >> > on the overall approach. I'll then write a more detailed
> description
> >> >> > on
> >> >> > how I think to impl it in Lucene (been spending some time on that),
> >> >> > and
> >> >> > we can have more detailed (and fun) discussions on the low level
> >> >> > details.
> >> >> >
> >> >> > Shai
> >> >> >
> >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Good point. I meant the model at the document level: i.e. what
> >> >> >> milestones does a document go through in its life cycle. Today:
> >> >> >>
> >> >> >> created --> deleted
> >> >> >>
> >> >> >> With incremental updates:
> >> >> >>
> >> >> >> created --> update1 --> update2 --> deleted
> >> >> >>
> >> >> >> I think what I'm trying to say is that this second threaded
> sequence
> >> >> >> of state changes seems intuitively more fragile under concurrent
> >> >> >> scenarios.  So for example, in a lock-free design, the system
> would
> >> >> >> also have to anticipate the following sequence of events:
> >> >> >>
> >> >> >> created --> update1 --> deleted --> update2
> >> >> >>
> >> >> >> and consider update2 a null op.  I'm imagining there are other
> cases
> >> >> >> that I can't think of..
> >> >> >>
> >> >> >> -Babak
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> >> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >> > write once, plus the option to the app to keep multiple commit
> >> >> >> > points
> >> >> >> > around (by customizing the deletion policy).
> >> >> >> >
> >> >> >> > Actually order of operations / commits very much matters in
> Lucene
> >> >> >> > today.
> >> >> >> >
> >> >> >> > Deletions are not idempotent: if you add a doc w/ term X, delete
> >> >> >> > by
> >> >> >> > term X, add a new doc with term X... that's very different than
> if
> >> >> >> > you
> >> >> >> > moved the delete op to the end.  Ie the deletion only applies to
> >> >> >> > the
> >> >> >> > docs added before it.
> >> >> >> >
> >> >> >> > Mike
> >> >> >> >
> >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <
> farhang@gmail.com>
> >> >> >> > wrote:
> >> >> >> >> Sure. Because of the write once principle.  But at some cost
> >> >> >> >> (duplicated data). I was just agreeing that it would not be a
> >> >> >> >> good
> >> >> >> >> idea to bake in version-ing by keeping the layers around
> forever
> >> >> >> >> in
> >> >> >> >> a
> >> >> >> >> merged index; I wasn't keying in on transactions per se.
> >> >> >> >>
> >> >> >> >> Speaking of transactions: I'm not sure if we should worry about
> >> >> >> >> this
> >> >> >> >> much yet, but with "updates" the order of the transaction
> commits
> >> >> >> >> seems important. I think commit order is less important today
> in
> >> >> >> >> Lucene because its model supports only 2 types of events:
> >> >> >> >> document
> >> >> >> >> creation--which only happens once, and document deletion, which
> >> >> >> >> is
> >> >> >> >> idempotent.  What do you think? Will commits have to be ordered
> >> >> >> >> if
> >> >> >> >> we
> >> >> >> >> introduce updates?  Or does the onus of maintaining order fall
> on
> >> >> >> >> the
> >> >> >> >> application?
> >> >> >> >>
> >> >> >> >> -Babak
> >> >> >> >>
> >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> >> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang
> >> >> >> >>> <fa...@gmail.com>
> >> >> >> >>> wrote:
> >> >> >> >>>>> I think they get merged in by the merger, ideally in the
> >> >> >> >>>>> background.
> >> >> >> >>>>
> >> >> >> >>>> That sounds sensible. (In other words, we wont concern
> >> >> >> >>>> ourselves
> >> >> >> >>>> with
> >> >> >> >>>> roll backs--something possible while a "layer" is still
> >> >> >> >>>> around.)
> >> >> >> >>>
> >> >> >> >>> Actually roll backs would still be very possible even if
> layers
> >> >> >> >>> are
> >> >> >> >>> merged.
> >> >> >> >>>
> >> >> >> >>> Ie, one could keep multiple commits around, and the older
> >> >> >> >>> commits
> >> >> >> >>> would still be referring to the old postings + layers, keeping
> >> >> >> >>> them
> >> >> >> >>> alive.
> >> >> >> >>>
> >> >> >> >>> Lucene would still be transactional with such an approach.
> >> >> >> >>>
> >> >> >> >>> Mike
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >> >>> To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >>> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> ---------------------------------------------------------------------
> >> >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> > For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

> No, actually, you can update index-only fields also. It all depends on the
> operation that you ask to do on the fields.

I love the level of control this provides, but again, I was talking at
the user level.

> If you want to e.g. remove an entire field w/ such update operation, then it
> becomes more expensive

That's the typical usage scenario, I imagine. When I update a field, I
want to update all of it, not just part of it. No?

(The lower level semantics of twiddling with the postings is poorly
understood by the typical user..)

> We didn't
> face such a scenario though, and I think it's probably a rare one?

As rare as any time you want to update an indexed-only field.  Not a
serious limitation (but perhaps worth noting?)
Perhaps at the API level, you provide an updateIndexedOnlyField that
takes the old value as well as the new value for the field.

Anyway, I think your approach would be a great addition to the
toolkit. Would love to see it even in rough cut form :))

-Babak

On Sun, May 9, 2010 at 12:49 AM, Shai Erera <se...@gmail.com> wrote:
> No, actually, you can update index-only fields also. It all depends on the
> operation that you ask to do on the fields. For example, if the query to
> execute is something like "update all documents w/ tags:ibm -> remove terms
> t1, t2, t3 and add terms t4, t5", then the result of such request would
> dissociate t1-3 from those docs that answer tags:ibm and associate them w/
> t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
> following posting updates will be done:
> t1: -1, -4, -10
> t2: -1, -4, -10
> t3: -1, -4, -10
> t4, 1, 4, 10
> t5, 1, 4, 10
> (in addition to whatever other updates that are associated with those
> postings).
>
> At search time, if you search for "t1 OR t2", then the regular t1 and t2
> postings will be merged on-the-fly w/ the updated ones to remove docs 1, 4,
> 10.
>
> If you want to e.g. remove an entire field w/ such update operation, then it
> becomes more expensive, but in general you'd need to iterate over the
> field's terms and dissociate the documents from all the terms. We didn't
> face such a scenario though, and I think it's probably a rare one?
>
> Shai
>
> On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <fa...@gmail.com> wrote:
>>
>> Shai,
>>
>> I think this is an interesting approach. I can see how I could
>> [incrementally] update a stored, indexed field this way, but I don't
>> understand how I could update an indexed-only field. Here's why: for a
>> stored (and indexed) field, I can always determine what terms to
>> remove ('-') from the postings, but for an indexed-only field I'd have
>> no [practical] way to know..
>>
>> So under this approach,  I'm thinking at a user level, a Lucene field
>> would be updateable only if it's at least stored.
>>
>> Is that right?
>>
>> -Babak
>>
>> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <se...@gmail.com> wrote:
>> > Yes Mike - I don't know yet if two MPs will be used, one for the stacked
>> > segments and one for the general segments (which will include the
>> > stacked
>> > ones as well) .. feels like one MP should be enough, but this can be
>> > decided
>> > on as we progress.
>> >
>> > This approach allows you to update every term in every already indexed
>> > field, as well as add terms to already indexed fields ... and add
>> > totally
>> > new fields, with lots of text in them. So e.g. there are two neat use
>> > cases
>> > that come to mind:
>> > 1) Allow users to rate search results, favor them etc.
>> > 2) Or even comment them,
>> > I think Google offers the 2nd. Both translate into updating the search
>> > result's already indexed document w/ the new rating, comment etc. w/o
>> > needing to reindex the document.
>> >
>> > I will try to find perf results numbers. It's been long time ago, hope
>> > all
>> > the documents are still where they were :). Indeed, for terms like ACLs,
>> > it
>> > means that each query had to merge dozens of postings lists. For that I
>> > implemented an alternative solution, which uses a payload-like structure
>> > that registers for each document the list of ACLs that are associated
>> > with
>> > it (not as strings, it was more efficient). Then, if the query included
>> > dozens of such terms, I created a filter-like object which for every
>> > matching document by the query checked if it matches the ACLs list of
>> > the
>> > document. This is usually slower, because the ACLs themselves don't
>> > drive
>> > the query, which means more matches will be found. That was a tradeoff
>> > which
>> > one could configure based on the number of such terms in the query, the
>> > number of updated terms etc.
>> >
>> > But in essence you're right - if the solution is generic to cover any
>> > term,
>> > we cannot use such payload-based feature. We might need to merge the
>> > stacked
>> > segments more frequently. This is pending perf testing though.
>> >
>> > Shai
>> >
>> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Catching up here :)
>> >>
>> >> This is great stuff Shai -- I like the notion of "negative" postings,
>> >> that "subtract" docs from previous generations as you iterate them.
>> >>
>> >> And I like the term "stacked segments".  This fits very well with
>> >> Lucene's write-once approach, ie, a writer can at any time stack a new
>> >> segment (writes to new files) "over" an old segment, like the layers
>> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
>> >>
>> >> And the merge policy now picks from 2 dimensions right?  Ie it may
>> >> want to simply consolidate stacks on an old segment but otherwise not
>> >> merge that segment (eg for very large segments that have accumulated
>> >> updates), and normal merging would of course consolidate all stacks
>> >> for all segments merged.
>> >>
>> >> Wouldn't this approach conceivably allow you to alter single terms
>> >> within a single field (we'd have to figure out how we'd expose the API
>> >> for this)?  EG if I appended some text to an already-indexed field?
>> >> (In addition to adding a new field to an already indexed doc, or
>> >> replacing an indexed field on a previously indexed doc).
>> >>
>> >> Did you have any hard perf numbers?  Merge sorting N streams is
>> >> surprisingly costly... we may need/want to have a reader pre-merge
>> >> (using up RAM) any "long tail" of stacked segments as long as they are
>> >> small enough...
>> >>
>> >> This sounds great!!
>> >>
>> >> Mike
>> >>
>> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > WARNING: following email is a bit long, but I think is worth the
>> >> > reading
>> >> > :)
>> >> >
>> >> > I would like to describe my implementation of incremental field
>> >> > updates
>> >> > in Juru (the former search library I've worked on in IBM). I will
>> >> > later
>> >> > discuss how I think it can be implemented in Lucene.
>> >> >
>> >> > The motivation/requirement came from a doc management system which
>> >> > used
>> >> > Juru as its search component. The system included document libraries
>> >> > where users could create files and upload documents. A user could
>> >> > belong
>> >> > to any number of libraries and complex ACLs model was used (down to
>> >> > the
>> >> > level of the file). ACLs and Folders were modeled as categories in
>> >> > the
>> >> > index (boolean-like terms). Files and folders could be moved around
>> >> > and
>> >> > access to a library/folder/file could be granted/revoked at any given
>> >> > time. Therefore, such updates usually affected hundreds (and
>> >> > thousands)
>> >> > of documents. Overall, the index managed millions of documents, tens
>> >> > of
>> >> > thousands of libraries and hundreds of thousands of ACLs (large
>> >> > organization :)). To get a rough understanding on the number of such
>> >> > updates - every several minutes, tens of thousands of documents were
>> >> > updated due to such changes only (in addition to the regular content
>> >> > updates).
>> >> >
>> >> > We were asked to support requests in the following form: "update all
>> >> > docs
>> >> > that match <criteria> --> do <operation>" where:
>> >> > * <criteria> was "a single doc", "docs belonging to a category" and
>> >> > "docs
>> >> > belonging to a set of categories".
>> >> > * <operation> was "add categories NEW" (NEW might not even exist in
>> >> > the
>> >> > index yet, or already associated w/ the document), "remove categories
>> >> > OLD"
>> >> > (irregardless if the docs were already associated w/ OLD or not) and
>> >> > "remove all OLD and add all NEW".
>> >> >
>> >> > The solution I've implemented to support these requests turned out to
>> >> > actually allow you to update every term (!) in the index: suppose
>> >> > that
>> >> > you have a table, which recorded tuples like <docid, term, +/->. The
>> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term
>> >> > "ibm",
>> >> > and the record <1, "hp", '-'> means that the same document is not
>> >> > associated w/ the word "hp". Then, you could very easily ask for all
>> >> > documents that are assoicated w/ "hp", and the result would not
>> >> > include
>> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id
>> >> > the
>> >> > document received.
>> >> >
>> >> > I've kept two types of postings in the index: regular and updates.
>> >> > Taking the above examples, "ibm" regular posting looked like <"ibm",
>> >> > 1,
>> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm",
>> >> > +2,
>> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
>> >> > "hp".
>> >> >
>> >> > During search time, when a query with the word "ibm" was submitted, I
>> >> > create a virtual posting which reads from both the regular and the
>> >> > updates, and merges them on the fly according to the +/- signs. Since
>> >> > both postings are sorted in ascending order, the merge is very
>> >> > efficient, and query time is hardly affected.
>> >> >
>> >> > Those postings are merged from time to time in a process that is
>> >> > similar
>> >> > to how Lucene works today, which keeps the update postings relatively
>> >> > small and manageable.
>> >> >
>> >> > Now here comes the fun part - how I think it can be implemented in
>> >> > Lucene !
>> >> >
>> >> > To be honest, this sat on my TODO list for a very long time and only
>> >> > a
>> >> > couple of months ago I figured out how to implement it in Lucene. The
>> >> > main difficulty I had was around the difference between the
>> >> > write-once
>> >> > policy in Juru and Lucene - in Lucene, once a segment is written, it
>> >> > cannot be changed. BUT, I've only recently realized that this isn't
>> >> > exactly true, because deleted docs do change existing segments. The
>> >> > deletes are kept in a separate file to the segment (.del) and have
>> >> > their
>> >> > own generation. Deletes, as I understood then, and Grant helped me
>> >> > term
>> >> > them better, can be defined as "Stacked Segments" - they add data to
>> >> > a
>> >> > segment, which from time to time are integrated into the segment
>> >> > (unlike
>> >> > Photoshop Layers, but my understanding of Photoshop is limited). And
>> >> > the
>> >> > Lucene engine knows how to combine the two, giving precedence to the
>> >> > deletes.
>> >> >
>> >> > By introducing an "Updates Stacked Segment", we can encode postings
>> >> > w/
>> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
>> >> > create a variation which merges the two lists. When segments are
>> >> > merged,
>> >> > the updates will be merged into the regular postings (just like
>> >> > deletes)
>> >> > and thus will be gone. In addition, this plays very nicely with
>> >> > readers
>> >> > that are currently reading the index, as well as we can have
>> >> > generations
>> >> > for the updates - really like deletes !
>> >> >
>> >> > I think that Lucene's architecture allows for such a solution very
>> >> > cleanly and nicely (and I believe flex makes it even easier). We can
>> >> > (later, after you've digested the idea) discuss whether this should
>> >> > be
>> >> > built into the current IW, or an extension like UpdateableIW. The API
>> >> > I've been thinking about should really be like deletes, allowing to
>> >> > update docs based on Term or Query. I defer the API discussion for
>> >> > later
>> >> > for now.
>> >> >
>> >> > As for stored fields, this was a real challenge to support in Juru,
>> >> > but
>> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this
>> >> > should
>> >> > be much easier - adding stacked stored fields ...
>> >> >
>> >> > As you've noticed, the update postings are not DGap encoded, and sign
>> >> > needs to be preserved. While I haven't implemented it in Juru, I
>> >> > think
>> >> > that perhaps this can be improved by keeping the '-' and '+' lists
>> >> > separated. We will need to register somewhere which came before which
>> >> > because order matters a lot here (and I'm not talking about
>> >> > concurrency
>> >> > - simple update instructions order). I have some idea how this can be
>> >> > achieved, but I refrain from describing it now, to not make this
>> >> > email
>> >> > even longer :).
>> >> >
>> >> > I've mentioned that this approach can be applied to any term and not
>> >> > just categories under some circumstances. Basically, as soon as you
>> >> > update a term, its DF is no longer true, unless you are able to take
>> >> > the
>> >> > updates into account. We can defer the discussion on that, but
>> >> > clearly
>> >> > for many fields, incrementally update them should not affect
>> >> > precision,
>> >> > as they're not used for that type of scoring ... Maybe, by keeping
>> >> > separate '+' and '-' lists we can compute statistics precisely. And I
>> >> > haven't given much thought yet to how this and Mike's flex scoring
>> >> > will
>> >> > be integrated.
>> >> >
>> >> > BTW, a word on Parallel Indexing - the two are completely orthogonal.
>> >> > Once PI is introduced, one can index all the updateable fields in a
>> >> > dedicated slice, for perhaps improving search performance for slices
>> >> > that are not updateable (not involving code which attempts to read
>> >> > and
>> >> > merge update and regular lists on the fly). Also, incremental field
>> >> > updates support all of PI's scenarios, even though some will be done
>> >> > more efficiently w/ PI. But this too is a matter for a separate
>> >> > discussion :).
>> >> >
>> >> > That's it ! I believe I've given you all the details I have about the
>> >> > approach and high level proposed solution for Lucene. Perhaps some
>> >> > details slipped my mind, but if you ask the right questions, I'm sure
>> >> > I'll be able to answer them :). I would like to emphasize that since
>> >> > this was already implemented (in Juru) - this is more than just a "I
>> >> > think this approach can work" proposal ...
>> >> >
>> >> > I would appreciate your comments on this. I would like to start
>> >> > implementing it soon, and so as a first step, please share your
>> >> > comments
>> >> > on the overall approach. I'll then write a more detailed description
>> >> > on
>> >> > how I think to impl it in Lucene (been spending some time on that),
>> >> > and
>> >> > we can have more detailed (and fun) discussions on the low level
>> >> > details.
>> >> >
>> >> > Shai
>> >> >
>> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Good point. I meant the model at the document level: i.e. what
>> >> >> milestones does a document go through in its life cycle. Today:
>> >> >>
>> >> >> created --> deleted
>> >> >>
>> >> >> With incremental updates:
>> >> >>
>> >> >> created --> update1 --> update2 --> deleted
>> >> >>
>> >> >> I think what I'm trying to say is that this second threaded sequence
>> >> >> of state changes seems intuitively more fragile under concurrent
>> >> >> scenarios.  So for example, in a lock-free design, the system would
>> >> >> also have to anticipate the following sequence of events:
>> >> >>
>> >> >> created --> update1 --> deleted --> update2
>> >> >>
>> >> >> and consider update2 a null op.  I'm imagining there are other cases
>> >> >> that I can't think of..
>> >> >>
>> >> >> -Babak
>> >> >>
>> >> >>
>> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> >> >> <lu...@mikemccandless.com> wrote:
>> >> >> > write once, plus the option to the app to keep multiple commit
>> >> >> > points
>> >> >> > around (by customizing the deletion policy).
>> >> >> >
>> >> >> > Actually order of operations / commits very much matters in Lucene
>> >> >> > today.
>> >> >> >
>> >> >> > Deletions are not idempotent: if you add a doc w/ term X, delete
>> >> >> > by
>> >> >> > term X, add a new doc with term X... that's very different than if
>> >> >> > you
>> >> >> > moved the delete op to the end.  Ie the deletion only applies to
>> >> >> > the
>> >> >> > docs added before it.
>> >> >> >
>> >> >> > Mike
>> >> >> >
>> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com>
>> >> >> > wrote:
>> >> >> >> Sure. Because of the write once principle.  But at some cost
>> >> >> >> (duplicated data). I was just agreeing that it would not be a
>> >> >> >> good
>> >> >> >> idea to bake in version-ing by keeping the layers around forever
>> >> >> >> in
>> >> >> >> a
>> >> >> >> merged index; I wasn't keying in on transactions per se.
>> >> >> >>
>> >> >> >> Speaking of transactions: I'm not sure if we should worry about
>> >> >> >> this
>> >> >> >> much yet, but with "updates" the order of the transaction commits
>> >> >> >> seems important. I think commit order is less important today in
>> >> >> >> Lucene because its model supports only 2 types of events:
>> >> >> >> document
>> >> >> >> creation--which only happens once, and document deletion, which
>> >> >> >> is
>> >> >> >> idempotent.  What do you think? Will commits have to be ordered
>> >> >> >> if
>> >> >> >> we
>> >> >> >> introduce updates?  Or does the onus of maintaining order fall on
>> >> >> >> the
>> >> >> >> application?
>> >> >> >>
>> >> >> >> -Babak
>> >> >> >>
>> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> >> >> >> <lu...@mikemccandless.com> wrote:
>> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang
>> >> >> >>> <fa...@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>>>> I think they get merged in by the merger, ideally in the
>> >> >> >>>>> background.
>> >> >> >>>>
>> >> >> >>>> That sounds sensible. (In other words, we wont concern
>> >> >> >>>> ourselves
>> >> >> >>>> with
>> >> >> >>>> roll backs--something possible while a "layer" is still
>> >> >> >>>> around.)
>> >> >> >>>
>> >> >> >>> Actually roll backs would still be very possible even if layers
>> >> >> >>> are
>> >> >> >>> merged.
>> >> >> >>>
>> >> >> >>> Ie, one could keep multiple commits around, and the older
>> >> >> >>> commits
>> >> >> >>> would still be referring to the old postings + layers, keeping
>> >> >> >>> them
>> >> >> >>> alive.
>> >> >> >>>
>> >> >> >>> Lucene would still be transactional with such an approach.
>> >> >> >>>
>> >> >> >>> Mike
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> ---------------------------------------------------------------------
>> >> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > ---------------------------------------------------------------------
>> >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

No, actually, you can update index-only fields also. It all depends on the
operation that you ask to do on the fields. For example, if the query to
execute is something like "update all documents w/ tags:ibm -> remove terms
t1, t2, t3 and add terms t4, t5", then the result of such request would
dissociate t1-3 from those docs that answer tags:ibm and associate them w/
t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
following posting updates will be done:
t1: -1, -4, -10
t2: -1, -4, -10
t3: -1, -4, -10
t4, 1, 4, 10
t5, 1, 4, 10
(in addition to whatever other updates that are associated with those
postings).

At search time, if you search for "t1 OR t2", then the regular t1 and t2
postings will be merged on-the-fly w/ the updated ones to remove docs 1, 4,
10.

If you want to e.g. remove an entire field w/ such update operation, then it
becomes more expensive, but in general you'd need to iterate over the
field's terms and dissociate the documents from all the terms. We didn't
face such a scenario though, and I think it's probably a rare one?

Shai

On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <fa...@gmail.com> wrote:

> Shai,
>
> I think this is an interesting approach. I can see how I could
> [incrementally] update a stored, indexed field this way, but I don't
> understand how I could update an indexed-only field. Here's why: for a
> stored (and indexed) field, I can always determine what terms to
> remove ('-') from the postings, but for an indexed-only field I'd have
> no [practical] way to know..
>
> So under this approach,  I'm thinking at a user level, a Lucene field
> would be updateable only if it's at least stored.
>
> Is that right?
>
> -Babak
>
> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <se...@gmail.com> wrote:
> > Yes Mike - I don't know yet if two MPs will be used, one for the stacked
> > segments and one for the general segments (which will include the stacked
> > ones as well) .. feels like one MP should be enough, but this can be
> decided
> > on as we progress.
> >
> > This approach allows you to update every term in every already indexed
> > field, as well as add terms to already indexed fields ... and add totally
> > new fields, with lots of text in them. So e.g. there are two neat use
> cases
> > that come to mind:
> > 1) Allow users to rate search results, favor them etc.
> > 2) Or even comment them,
> > I think Google offers the 2nd. Both translate into updating the search
> > result's already indexed document w/ the new rating, comment etc. w/o
> > needing to reindex the document.
> >
> > I will try to find perf results numbers. It's been long time ago, hope
> all
> > the documents are still where they were :). Indeed, for terms like ACLs,
> it
> > means that each query had to merge dozens of postings lists. For that I
> > implemented an alternative solution, which uses a payload-like structure
> > that registers for each document the list of ACLs that are associated
> with
> > it (not as strings, it was more efficient). Then, if the query included
> > dozens of such terms, I created a filter-like object which for every
> > matching document by the query checked if it matches the ACLs list of the
> > document. This is usually slower, because the ACLs themselves don't drive
> > the query, which means more matches will be found. That was a tradeoff
> which
> > one could configure based on the number of such terms in the query, the
> > number of updated terms etc.
> >
> > But in essence you're right - if the solution is generic to cover any
> term,
> > we cannot use such payload-based feature. We might need to merge the
> stacked
> > segments more frequently. This is pending perf testing though.
> >
> > Shai
> >
> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Catching up here :)
> >>
> >> This is great stuff Shai -- I like the notion of "negative" postings,
> >> that "subtract" docs from previous generations as you iterate them.
> >>
> >> And I like the term "stacked segments".  This fits very well with
> >> Lucene's write-once approach, ie, a writer can at any time stack a new
> >> segment (writes to new files) "over" an old segment, like the layers
> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
> >>
> >> And the merge policy now picks from 2 dimensions right?  Ie it may
> >> want to simply consolidate stacks on an old segment but otherwise not
> >> merge that segment (eg for very large segments that have accumulated
> >> updates), and normal merging would of course consolidate all stacks
> >> for all segments merged.
> >>
> >> Wouldn't this approach conceivably allow you to alter single terms
> >> within a single field (we'd have to figure out how we'd expose the API
> >> for this)?  EG if I appended some text to an already-indexed field?
> >> (In addition to adding a new field to an already indexed doc, or
> >> replacing an indexed field on a previously indexed doc).
> >>
> >> Did you have any hard perf numbers?  Merge sorting N streams is
> >> surprisingly costly... we may need/want to have a reader pre-merge
> >> (using up RAM) any "long tail" of stacked segments as long as they are
> >> small enough...
> >>
> >> This sounds great!!
> >>
> >> Mike
> >>
> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > WARNING: following email is a bit long, but I think is worth the
> reading
> >> > :)
> >> >
> >> > I would like to describe my implementation of incremental field
> updates
> >> > in Juru (the former search library I've worked on in IBM). I will
> later
> >> > discuss how I think it can be implemented in Lucene.
> >> >
> >> > The motivation/requirement came from a doc management system which
> used
> >> > Juru as its search component. The system included document libraries
> >> > where users could create files and upload documents. A user could
> belong
> >> > to any number of libraries and complex ACLs model was used (down to
> the
> >> > level of the file). ACLs and Folders were modeled as categories in the
> >> > index (boolean-like terms). Files and folders could be moved around
> and
> >> > access to a library/folder/file could be granted/revoked at any given
> >> > time. Therefore, such updates usually affected hundreds (and
> thousands)
> >> > of documents. Overall, the index managed millions of documents, tens
> of
> >> > thousands of libraries and hundreds of thousands of ACLs (large
> >> > organization :)). To get a rough understanding on the number of such
> >> > updates - every several minutes, tens of thousands of documents were
> >> > updated due to such changes only (in addition to the regular content
> >> > updates).
> >> >
> >> > We were asked to support requests in the following form: "update all
> >> > docs
> >> > that match <criteria> --> do <operation>" where:
> >> > * <criteria> was "a single doc", "docs belonging to a category" and
> >> > "docs
> >> > belonging to a set of categories".
> >> > * <operation> was "add categories NEW" (NEW might not even exist in
> the
> >> > index yet, or already associated w/ the document), "remove categories
> >> > OLD"
> >> > (irregardless if the docs were already associated w/ OLD or not) and
> >> > "remove all OLD and add all NEW".
> >> >
> >> > The solution I've implemented to support these requests turned out to
> >> > actually allow you to update every term (!) in the index: suppose that
> >> > you have a table, which recorded tuples like <docid, term, +/->. The
> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term
> "ibm",
> >> > and the record <1, "hp", '-'> means that the same document is not
> >> > associated w/ the word "hp". Then, you could very easily ask for all
> >> > documents that are assoicated w/ "hp", and the result would not
> include
> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id
> the
> >> > document received.
> >> >
> >> > I've kept two types of postings in the index: regular and updates.
> >> > Taking the above examples, "ibm" regular posting looked like <"ibm",
> 1,
> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm",
> +2,
> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
> >> > "hp".
> >> >
> >> > During search time, when a query with the word "ibm" was submitted, I
> >> > create a virtual posting which reads from both the regular and the
> >> > updates, and merges them on the fly according to the +/- signs. Since
> >> > both postings are sorted in ascending order, the merge is very
> >> > efficient, and query time is hardly affected.
> >> >
> >> > Those postings are merged from time to time in a process that is
> similar
> >> > to how Lucene works today, which keeps the update postings relatively
> >> > small and manageable.
> >> >
> >> > Now here comes the fun part - how I think it can be implemented in
> >> > Lucene !
> >> >
> >> > To be honest, this sat on my TODO list for a very long time and only a
> >> > couple of months ago I figured out how to implement it in Lucene. The
> >> > main difficulty I had was around the difference between the write-once
> >> > policy in Juru and Lucene - in Lucene, once a segment is written, it
> >> > cannot be changed. BUT, I've only recently realized that this isn't
> >> > exactly true, because deleted docs do change existing segments. The
> >> > deletes are kept in a separate file to the segment (.del) and have
> their
> >> > own generation. Deletes, as I understood then, and Grant helped me
> term
> >> > them better, can be defined as "Stacked Segments" - they add data to a
> >> > segment, which from time to time are integrated into the segment
> (unlike
> >> > Photoshop Layers, but my understanding of Photoshop is limited). And
> the
> >> > Lucene engine knows how to combine the two, giving precedence to the
> >> > deletes.
> >> >
> >> > By introducing an "Updates Stacked Segment", we can encode postings w/
> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
> >> > create a variation which merges the two lists. When segments are
> merged,
> >> > the updates will be merged into the regular postings (just like
> deletes)
> >> > and thus will be gone. In addition, this plays very nicely with
> readers
> >> > that are currently reading the index, as well as we can have
> generations
> >> > for the updates - really like deletes !
> >> >
> >> > I think that Lucene's architecture allows for such a solution very
> >> > cleanly and nicely (and I believe flex makes it even easier). We can
> >> > (later, after you've digested the idea) discuss whether this should be
> >> > built into the current IW, or an extension like UpdateableIW. The API
> >> > I've been thinking about should really be like deletes, allowing to
> >> > update docs based on Term or Query. I defer the API discussion for
> later
> >> > for now.
> >> >
> >> > As for stored fields, this was a real challenge to support in Juru,
> but
> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this
> >> > should
> >> > be much easier - adding stacked stored fields ...
> >> >
> >> > As you've noticed, the update postings are not DGap encoded, and sign
> >> > needs to be preserved. While I haven't implemented it in Juru, I think
> >> > that perhaps this can be improved by keeping the '-' and '+' lists
> >> > separated. We will need to register somewhere which came before which
> >> > because order matters a lot here (and I'm not talking about
> concurrency
> >> > - simple update instructions order). I have some idea how this can be
> >> > achieved, but I refrain from describing it now, to not make this email
> >> > even longer :).
> >> >
> >> > I've mentioned that this approach can be applied to any term and not
> >> > just categories under some circumstances. Basically, as soon as you
> >> > update a term, its DF is no longer true, unless you are able to take
> the
> >> > updates into account. We can defer the discussion on that, but clearly
> >> > for many fields, incrementally update them should not affect
> precision,
> >> > as they're not used for that type of scoring ... Maybe, by keeping
> >> > separate '+' and '-' lists we can compute statistics precisely. And I
> >> > haven't given much thought yet to how this and Mike's flex scoring
> will
> >> > be integrated.
> >> >
> >> > BTW, a word on Parallel Indexing - the two are completely orthogonal.
> >> > Once PI is introduced, one can index all the updateable fields in a
> >> > dedicated slice, for perhaps improving search performance for slices
> >> > that are not updateable (not involving code which attempts to read and
> >> > merge update and regular lists on the fly). Also, incremental field
> >> > updates support all of PI's scenarios, even though some will be done
> >> > more efficiently w/ PI. But this too is a matter for a separate
> >> > discussion :).
> >> >
> >> > That's it ! I believe I've given you all the details I have about the
> >> > approach and high level proposed solution for Lucene. Perhaps some
> >> > details slipped my mind, but if you ask the right questions, I'm sure
> >> > I'll be able to answer them :). I would like to emphasize that since
> >> > this was already implemented (in Juru) - this is more than just a "I
> >> > think this approach can work" proposal ...
> >> >
> >> > I would appreciate your comments on this. I would like to start
> >> > implementing it soon, and so as a first step, please share your
> comments
> >> > on the overall approach. I'll then write a more detailed description
> on
> >> > how I think to impl it in Lucene (been spending some time on that),
> and
> >> > we can have more detailed (and fun) discussions on the low level
> >> > details.
> >> >
> >> > Shai
> >> >
> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com>
> wrote:
> >> >>
> >> >> Good point. I meant the model at the document level: i.e. what
> >> >> milestones does a document go through in its life cycle. Today:
> >> >>
> >> >> created --> deleted
> >> >>
> >> >> With incremental updates:
> >> >>
> >> >> created --> update1 --> update2 --> deleted
> >> >>
> >> >> I think what I'm trying to say is that this second threaded sequence
> >> >> of state changes seems intuitively more fragile under concurrent
> >> >> scenarios.  So for example, in a lock-free design, the system would
> >> >> also have to anticipate the following sequence of events:
> >> >>
> >> >> created --> update1 --> deleted --> update2
> >> >>
> >> >> and consider update2 a null op.  I'm imagining there are other cases
> >> >> that I can't think of..
> >> >>
> >> >> -Babak
> >> >>
> >> >>
> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> >> >> <lu...@mikemccandless.com> wrote:
> >> >> > write once, plus the option to the app to keep multiple commit
> points
> >> >> > around (by customizing the deletion policy).
> >> >> >
> >> >> > Actually order of operations / commits very much matters in Lucene
> >> >> > today.
> >> >> >
> >> >> > Deletions are not idempotent: if you add a doc w/ term X, delete by
> >> >> > term X, add a new doc with term X... that's very different than if
> >> >> > you
> >> >> > moved the delete op to the end.  Ie the deletion only applies to
> the
> >> >> > docs added before it.
> >> >> >
> >> >> > Mike
> >> >> >
> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com>
> >> >> > wrote:
> >> >> >> Sure. Because of the write once principle.  But at some cost
> >> >> >> (duplicated data). I was just agreeing that it would not be a good
> >> >> >> idea to bake in version-ing by keeping the layers around forever
> in
> >> >> >> a
> >> >> >> merged index; I wasn't keying in on transactions per se.
> >> >> >>
> >> >> >> Speaking of transactions: I'm not sure if we should worry about
> this
> >> >> >> much yet, but with "updates" the order of the transaction commits
> >> >> >> seems important. I think commit order is less important today in
> >> >> >> Lucene because its model supports only 2 types of events: document
> >> >> >> creation--which only happens once, and document deletion, which is
> >> >> >> idempotent.  What do you think? Will commits have to be ordered if
> >> >> >> we
> >> >> >> introduce updates?  Or does the onus of maintaining order fall on
> >> >> >> the
> >> >> >> application?
> >> >> >>
> >> >> >> -Babak
> >> >> >>
> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <farhang@gmail.com
> >
> >> >> >>> wrote:
> >> >> >>>>> I think they get merged in by the merger, ideally in the
> >> >> >>>>> background.
> >> >> >>>>
> >> >> >>>> That sounds sensible. (In other words, we wont concern ourselves
> >> >> >>>> with
> >> >> >>>> roll backs--something possible while a "layer" is still around.)
> >> >> >>>
> >> >> >>> Actually roll backs would still be very possible even if layers
> are
> >> >> >>> merged.
> >> >> >>>
> >> >> >>> Ie, one could keep multiple commits around, and the older commits
> >> >> >>> would still be referring to the old postings + layers, keeping
> them
> >> >> >>> alive.
> >> >> >>>
> >> >> >>> Lucene would still be transactional with such an approach.
> >> >> >>>
> >> >> >>> Mike
> >> >> >>>
> >> >> >>>
> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

Shai,

I think this is an interesting approach. I can see how I could
[incrementally] update a stored, indexed field this way, but I don't
understand how I could update an indexed-only field. Here's why: for a
stored (and indexed) field, I can always determine what terms to
remove ('-') from the postings, but for an indexed-only field I'd have
no [practical] way to know..

So under this approach,  I'm thinking at a user level, a Lucene field
would be updateable only if it's at least stored.

Is that right?

-Babak

On Wed, May 5, 2010 at 11:17 AM, Shai Erera <se...@gmail.com> wrote:
> Yes Mike - I don't know yet if two MPs will be used, one for the stacked
> segments and one for the general segments (which will include the stacked
> ones as well) .. feels like one MP should be enough, but this can be decided
> on as we progress.
>
> This approach allows you to update every term in every already indexed
> field, as well as add terms to already indexed fields ... and add totally
> new fields, with lots of text in them. So e.g. there are two neat use cases
> that come to mind:
> 1) Allow users to rate search results, favor them etc.
> 2) Or even comment them,
> I think Google offers the 2nd. Both translate into updating the search
> result's already indexed document w/ the new rating, comment etc. w/o
> needing to reindex the document.
>
> I will try to find perf results numbers. It's been long time ago, hope all
> the documents are still where they were :). Indeed, for terms like ACLs, it
> means that each query had to merge dozens of postings lists. For that I
> implemented an alternative solution, which uses a payload-like structure
> that registers for each document the list of ACLs that are associated with
> it (not as strings, it was more efficient). Then, if the query included
> dozens of such terms, I created a filter-like object which for every
> matching document by the query checked if it matches the ACLs list of the
> document. This is usually slower, because the ACLs themselves don't drive
> the query, which means more matches will be found. That was a tradeoff which
> one could configure based on the number of such terms in the query, the
> number of updated terms etc.
>
> But in essence you're right - if the solution is generic to cover any term,
> we cannot use such payload-based feature. We might need to merge the stacked
> segments more frequently. This is pending perf testing though.
>
> Shai
>
> On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Catching up here :)
>>
>> This is great stuff Shai -- I like the notion of "negative" postings,
>> that "subtract" docs from previous generations as you iterate them.
>>
>> And I like the term "stacked segments".  This fits very well with
>> Lucene's write-once approach, ie, a writer can at any time stack a new
>> segment (writes to new files) "over" an old segment, like the layers
>> in photoshop.  A reader merges all stacks on-the-fly when reading.
>>
>> And the merge policy now picks from 2 dimensions right?  Ie it may
>> want to simply consolidate stacks on an old segment but otherwise not
>> merge that segment (eg for very large segments that have accumulated
>> updates), and normal merging would of course consolidate all stacks
>> for all segments merged.
>>
>> Wouldn't this approach conceivably allow you to alter single terms
>> within a single field (we'd have to figure out how we'd expose the API
>> for this)?  EG if I appended some text to an already-indexed field?
>> (In addition to adding a new field to an already indexed doc, or
>> replacing an indexed field on a previously indexed doc).
>>
>> Did you have any hard perf numbers?  Merge sorting N streams is
>> surprisingly costly... we may need/want to have a reader pre-merge
>> (using up RAM) any "long tail" of stacked segments as long as they are
>> small enough...
>>
>> This sounds great!!
>>
>> Mike
>>
>> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com> wrote:
>> > Hi,
>> >
>> > WARNING: following email is a bit long, but I think is worth the reading
>> > :)
>> >
>> > I would like to describe my implementation of incremental field updates
>> > in Juru (the former search library I've worked on in IBM). I will later
>> > discuss how I think it can be implemented in Lucene.
>> >
>> > The motivation/requirement came from a doc management system which used
>> > Juru as its search component. The system included document libraries
>> > where users could create files and upload documents. A user could belong
>> > to any number of libraries and complex ACLs model was used (down to the
>> > level of the file). ACLs and Folders were modeled as categories in the
>> > index (boolean-like terms). Files and folders could be moved around and
>> > access to a library/folder/file could be granted/revoked at any given
>> > time. Therefore, such updates usually affected hundreds (and thousands)
>> > of documents. Overall, the index managed millions of documents, tens of
>> > thousands of libraries and hundreds of thousands of ACLs (large
>> > organization :)). To get a rough understanding on the number of such
>> > updates - every several minutes, tens of thousands of documents were
>> > updated due to such changes only (in addition to the regular content
>> > updates).
>> >
>> > We were asked to support requests in the following form: "update all
>> > docs
>> > that match <criteria> --> do <operation>" where:
>> > * <criteria> was "a single doc", "docs belonging to a category" and
>> > "docs
>> > belonging to a set of categories".
>> > * <operation> was "add categories NEW" (NEW might not even exist in the
>> > index yet, or already associated w/ the document), "remove categories
>> > OLD"
>> > (irregardless if the docs were already associated w/ OLD or not) and
>> > "remove all OLD and add all NEW".
>> >
>> > The solution I've implemented to support these requests turned out to
>> > actually allow you to update every term (!) in the index: suppose that
>> > you have a table, which recorded tuples like <docid, term, +/->. The
>> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm",
>> > and the record <1, "hp", '-'> means that the same document is not
>> > associated w/ the word "hp". Then, you could very easily ask for all
>> > documents that are assoicated w/ "hp", and the result would not include
>> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the
>> > document received.
>> >
>> > I've kept two types of postings in the index: regular and updates.
>> > Taking the above examples, "ibm" regular posting looked like <"ibm", 1,
>> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2,
>> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
>> > "hp".
>> >
>> > During search time, when a query with the word "ibm" was submitted, I
>> > create a virtual posting which reads from both the regular and the
>> > updates, and merges them on the fly according to the +/- signs. Since
>> > both postings are sorted in ascending order, the merge is very
>> > efficient, and query time is hardly affected.
>> >
>> > Those postings are merged from time to time in a process that is similar
>> > to how Lucene works today, which keeps the update postings relatively
>> > small and manageable.
>> >
>> > Now here comes the fun part - how I think it can be implemented in
>> > Lucene !
>> >
>> > To be honest, this sat on my TODO list for a very long time and only a
>> > couple of months ago I figured out how to implement it in Lucene. The
>> > main difficulty I had was around the difference between the write-once
>> > policy in Juru and Lucene - in Lucene, once a segment is written, it
>> > cannot be changed. BUT, I've only recently realized that this isn't
>> > exactly true, because deleted docs do change existing segments. The
>> > deletes are kept in a separate file to the segment (.del) and have their
>> > own generation. Deletes, as I understood then, and Grant helped me term
>> > them better, can be defined as "Stacked Segments" - they add data to a
>> > segment, which from time to time are integrated into the segment (unlike
>> > Photoshop Layers, but my understanding of Photoshop is limited). And the
>> > Lucene engine knows how to combine the two, giving precedence to the
>> > deletes.
>> >
>> > By introducing an "Updates Stacked Segment", we can encode postings w/
>> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
>> > create a variation which merges the two lists. When segments are merged,
>> > the updates will be merged into the regular postings (just like deletes)
>> > and thus will be gone. In addition, this plays very nicely with readers
>> > that are currently reading the index, as well as we can have generations
>> > for the updates - really like deletes !
>> >
>> > I think that Lucene's architecture allows for such a solution very
>> > cleanly and nicely (and I believe flex makes it even easier). We can
>> > (later, after you've digested the idea) discuss whether this should be
>> > built into the current IW, or an extension like UpdateableIW. The API
>> > I've been thinking about should really be like deletes, allowing to
>> > update docs based on Term or Query. I defer the API discussion for later
>> > for now.
>> >
>> > As for stored fields, this was a real challenge to support in Juru, but
>> > I think that w/ "Stacked Segments" and Lucene's architecture, this
>> > should
>> > be much easier - adding stacked stored fields ...
>> >
>> > As you've noticed, the update postings are not DGap encoded, and sign
>> > needs to be preserved. While I haven't implemented it in Juru, I think
>> > that perhaps this can be improved by keeping the '-' and '+' lists
>> > separated. We will need to register somewhere which came before which
>> > because order matters a lot here (and I'm not talking about concurrency
>> > - simple update instructions order). I have some idea how this can be
>> > achieved, but I refrain from describing it now, to not make this email
>> > even longer :).
>> >
>> > I've mentioned that this approach can be applied to any term and not
>> > just categories under some circumstances. Basically, as soon as you
>> > update a term, its DF is no longer true, unless you are able to take the
>> > updates into account. We can defer the discussion on that, but clearly
>> > for many fields, incrementally update them should not affect precision,
>> > as they're not used for that type of scoring ... Maybe, by keeping
>> > separate '+' and '-' lists we can compute statistics precisely. And I
>> > haven't given much thought yet to how this and Mike's flex scoring will
>> > be integrated.
>> >
>> > BTW, a word on Parallel Indexing - the two are completely orthogonal.
>> > Once PI is introduced, one can index all the updateable fields in a
>> > dedicated slice, for perhaps improving search performance for slices
>> > that are not updateable (not involving code which attempts to read and
>> > merge update and regular lists on the fly). Also, incremental field
>> > updates support all of PI's scenarios, even though some will be done
>> > more efficiently w/ PI. But this too is a matter for a separate
>> > discussion :).
>> >
>> > That's it ! I believe I've given you all the details I have about the
>> > approach and high level proposed solution for Lucene. Perhaps some
>> > details slipped my mind, but if you ask the right questions, I'm sure
>> > I'll be able to answer them :). I would like to emphasize that since
>> > this was already implemented (in Juru) - this is more than just a "I
>> > think this approach can work" proposal ...
>> >
>> > I would appreciate your comments on this. I would like to start
>> > implementing it soon, and so as a first step, please share your comments
>> > on the overall approach. I'll then write a more detailed description on
>> > how I think to impl it in Lucene (been spending some time on that), and
>> > we can have more detailed (and fun) discussions on the low level
>> > details.
>> >
>> > Shai
>> >
>> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com> wrote:
>> >>
>> >> Good point. I meant the model at the document level: i.e. what
>> >> milestones does a document go through in its life cycle. Today:
>> >>
>> >> created --> deleted
>> >>
>> >> With incremental updates:
>> >>
>> >> created --> update1 --> update2 --> deleted
>> >>
>> >> I think what I'm trying to say is that this second threaded sequence
>> >> of state changes seems intuitively more fragile under concurrent
>> >> scenarios.  So for example, in a lock-free design, the system would
>> >> also have to anticipate the following sequence of events:
>> >>
>> >> created --> update1 --> deleted --> update2
>> >>
>> >> and consider update2 a null op.  I'm imagining there are other cases
>> >> that I can't think of..
>> >>
>> >> -Babak
>> >>
>> >>
>> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >> > write once, plus the option to the app to keep multiple commit points
>> >> > around (by customizing the deletion policy).
>> >> >
>> >> > Actually order of operations / commits very much matters in Lucene
>> >> > today.
>> >> >
>> >> > Deletions are not idempotent: if you add a doc w/ term X, delete by
>> >> > term X, add a new doc with term X... that's very different than if
>> >> > you
>> >> > moved the delete op to the end.  Ie the deletion only applies to the
>> >> > docs added before it.
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com>
>> >> > wrote:
>> >> >> Sure. Because of the write once principle.  But at some cost
>> >> >> (duplicated data). I was just agreeing that it would not be a good
>> >> >> idea to bake in version-ing by keeping the layers around forever in
>> >> >> a
>> >> >> merged index; I wasn't keying in on transactions per se.
>> >> >>
>> >> >> Speaking of transactions: I'm not sure if we should worry about this
>> >> >> much yet, but with "updates" the order of the transaction commits
>> >> >> seems important. I think commit order is less important today in
>> >> >> Lucene because its model supports only 2 types of events: document
>> >> >> creation--which only happens once, and document deletion, which is
>> >> >> idempotent.  What do you think? Will commits have to be ordered if
>> >> >> we
>> >> >> introduce updates?  Or does the onus of maintaining order fall on
>> >> >> the
>> >> >> application?
>> >> >>
>> >> >> -Babak
>> >> >>
>> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> >> >> <lu...@mikemccandless.com> wrote:
>> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com>
>> >> >>> wrote:
>> >> >>>>> I think they get merged in by the merger, ideally in the
>> >> >>>>> background.
>> >> >>>>
>> >> >>>> That sounds sensible. (In other words, we wont concern ourselves
>> >> >>>> with
>> >> >>>> roll backs--something possible while a "layer" is still around.)
>> >> >>>
>> >> >>> Actually roll backs would still be very possible even if layers are
>> >> >>> merged.
>> >> >>>
>> >> >>> Ie, one could keep multiple commits around, and the older commits
>> >> >>> would still be referring to the old postings + layers, keeping them
>> >> >>> alive.
>> >> >>>
>> >> >>> Lucene would still be transactional with such an approach.
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>>
>> >> >>> ---------------------------------------------------------------------
>> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

Yes Mike - I don't know yet if two MPs will be used, one for the stacked
segments and one for the general segments (which will include the stacked
ones as well) .. feels like one MP should be enough, but this can be decided
on as we progress.

This approach allows you to update every term in every already indexed
field, as well as add terms to already indexed fields ... and add totally
new fields, with lots of text in them. So e.g. there are two neat use cases
that come to mind:
1) Allow users to rate search results, favor them etc.
2) Or even comment them,
I think Google offers the 2nd. Both translate into updating the search
result's already indexed document w/ the new rating, comment etc. w/o
needing to reindex the document.

I will try to find perf results numbers. It's been long time ago, hope all
the documents are still where they were :). Indeed, for terms like ACLs, it
means that each query had to merge dozens of postings lists. For that I
implemented an alternative solution, which uses a payload-like structure
that registers for each document the list of ACLs that are associated with
it (not as strings, it was more efficient). Then, if the query included
dozens of such terms, I created a filter-like object which for every
matching document by the query checked if it matches the ACLs list of the
document. This is usually slower, because the ACLs themselves don't drive
the query, which means more matches will be found. That was a tradeoff which
one could configure based on the number of such terms in the query, the
number of updated terms etc.

But in essence you're right - if the solution is generic to cover any term,
we cannot use such payload-based feature. We might need to merge the stacked
segments more frequently. This is pending perf testing though.

Shai

On Wed, May 5, 2010 at 6:54 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Catching up here :)
>
> This is great stuff Shai -- I like the notion of "negative" postings,
> that "subtract" docs from previous generations as you iterate them.
>
> And I like the term "stacked segments".  This fits very well with
> Lucene's write-once approach, ie, a writer can at any time stack a new
> segment (writes to new files) "over" an old segment, like the layers
> in photoshop.  A reader merges all stacks on-the-fly when reading.
>
> And the merge policy now picks from 2 dimensions right?  Ie it may
> want to simply consolidate stacks on an old segment but otherwise not
> merge that segment (eg for very large segments that have accumulated
> updates), and normal merging would of course consolidate all stacks
> for all segments merged.
>
> Wouldn't this approach conceivably allow you to alter single terms
> within a single field (we'd have to figure out how we'd expose the API
> for this)?  EG if I appended some text to an already-indexed field?
> (In addition to adding a new field to an already indexed doc, or
> replacing an indexed field on a previously indexed doc).
>
> Did you have any hard perf numbers?  Merge sorting N streams is
> surprisingly costly... we may need/want to have a reader pre-merge
> (using up RAM) any "long tail" of stacked segments as long as they are
> small enough...
>
> This sounds great!!
>
> Mike
>
> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi,
> >
> > WARNING: following email is a bit long, but I think is worth the reading
> :)
> >
> > I would like to describe my implementation of incremental field updates
> > in Juru (the former search library I've worked on in IBM). I will later
> > discuss how I think it can be implemented in Lucene.
> >
> > The motivation/requirement came from a doc management system which used
> > Juru as its search component. The system included document libraries
> > where users could create files and upload documents. A user could belong
> > to any number of libraries and complex ACLs model was used (down to the
> > level of the file). ACLs and Folders were modeled as categories in the
> > index (boolean-like terms). Files and folders could be moved around and
> > access to a library/folder/file could be granted/revoked at any given
> > time. Therefore, such updates usually affected hundreds (and thousands)
> > of documents. Overall, the index managed millions of documents, tens of
> > thousands of libraries and hundreds of thousands of ACLs (large
> > organization :)). To get a rough understanding on the number of such
> > updates - every several minutes, tens of thousands of documents were
> > updated due to such changes only (in addition to the regular content
> > updates).
> >
> > We were asked to support requests in the following form: "update all docs
> > that match <criteria> --> do <operation>" where:
> > * <criteria> was "a single doc", "docs belonging to a category" and "docs
> > belonging to a set of categories".
> > * <operation> was "add categories NEW" (NEW might not even exist in the
> > index yet, or already associated w/ the document), "remove categories
> OLD"
> > (irregardless if the docs were already associated w/ OLD or not) and
> > "remove all OLD and add all NEW".
> >
> > The solution I've implemented to support these requests turned out to
> > actually allow you to update every term (!) in the index: suppose that
> > you have a table, which recorded tuples like <docid, term, +/->. The
> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm",
> > and the record <1, "hp", '-'> means that the same document is not
> > associated w/ the word "hp". Then, you could very easily ask for all
> > documents that are assoicated w/ "hp", and the result would not include
> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the
> > document received.
> >
> > I've kept two types of postings in the index: regular and updates.
> > Taking the above examples, "ibm" regular posting looked like <"ibm", 1,
> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2,
> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
> > "hp".
> >
> > During search time, when a query with the word "ibm" was submitted, I
> > create a virtual posting which reads from both the regular and the
> > updates, and merges them on the fly according to the +/- signs. Since
> > both postings are sorted in ascending order, the merge is very
> > efficient, and query time is hardly affected.
> >
> > Those postings are merged from time to time in a process that is similar
> > to how Lucene works today, which keeps the update postings relatively
> > small and manageable.
> >
> > Now here comes the fun part - how I think it can be implemented in Lucene
> !
> >
> > To be honest, this sat on my TODO list for a very long time and only a
> > couple of months ago I figured out how to implement it in Lucene. The
> > main difficulty I had was around the difference between the write-once
> > policy in Juru and Lucene - in Lucene, once a segment is written, it
> > cannot be changed. BUT, I've only recently realized that this isn't
> > exactly true, because deleted docs do change existing segments. The
> > deletes are kept in a separate file to the segment (.del) and have their
> > own generation. Deletes, as I understood then, and Grant helped me term
> > them better, can be defined as "Stacked Segments" - they add data to a
> > segment, which from time to time are integrated into the segment (unlike
> > Photoshop Layers, but my understanding of Photoshop is limited). And the
> > Lucene engine knows how to combine the two, giving precedence to the
> > deletes.
> >
> > By introducing an "Updates Stacked Segment", we can encode postings w/
> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
> > create a variation which merges the two lists. When segments are merged,
> > the updates will be merged into the regular postings (just like deletes)
> > and thus will be gone. In addition, this plays very nicely with readers
> > that are currently reading the index, as well as we can have generations
> > for the updates - really like deletes !
> >
> > I think that Lucene's architecture allows for such a solution very
> > cleanly and nicely (and I believe flex makes it even easier). We can
> > (later, after you've digested the idea) discuss whether this should be
> > built into the current IW, or an extension like UpdateableIW. The API
> > I've been thinking about should really be like deletes, allowing to
> > update docs based on Term or Query. I defer the API discussion for later
> > for now.
> >
> > As for stored fields, this was a real challenge to support in Juru, but
> > I think that w/ "Stacked Segments" and Lucene's architecture, this should
> > be much easier - adding stacked stored fields ...
> >
> > As you've noticed, the update postings are not DGap encoded, and sign
> > needs to be preserved. While I haven't implemented it in Juru, I think
> > that perhaps this can be improved by keeping the '-' and '+' lists
> > separated. We will need to register somewhere which came before which
> > because order matters a lot here (and I'm not talking about concurrency
> > - simple update instructions order). I have some idea how this can be
> > achieved, but I refrain from describing it now, to not make this email
> > even longer :).
> >
> > I've mentioned that this approach can be applied to any term and not
> > just categories under some circumstances. Basically, as soon as you
> > update a term, its DF is no longer true, unless you are able to take the
> > updates into account. We can defer the discussion on that, but clearly
> > for many fields, incrementally update them should not affect precision,
> > as they're not used for that type of scoring ... Maybe, by keeping
> > separate '+' and '-' lists we can compute statistics precisely. And I
> > haven't given much thought yet to how this and Mike's flex scoring will
> > be integrated.
> >
> > BTW, a word on Parallel Indexing - the two are completely orthogonal.
> > Once PI is introduced, one can index all the updateable fields in a
> > dedicated slice, for perhaps improving search performance for slices
> > that are not updateable (not involving code which attempts to read and
> > merge update and regular lists on the fly). Also, incremental field
> > updates support all of PI's scenarios, even though some will be done
> > more efficiently w/ PI. But this too is a matter for a separate
> > discussion :).
> >
> > That's it ! I believe I've given you all the details I have about the
> > approach and high level proposed solution for Lucene. Perhaps some
> > details slipped my mind, but if you ask the right questions, I'm sure
> > I'll be able to answer them :). I would like to emphasize that since
> > this was already implemented (in Juru) - this is more than just a "I
> > think this approach can work" proposal ...
> >
> > I would appreciate your comments on this. I would like to start
> > implementing it soon, and so as a first step, please share your comments
> > on the overall approach. I'll then write a more detailed description on
> > how I think to impl it in Lucene (been spending some time on that), and
> > we can have more detailed (and fun) discussions on the low level
> > details.
> >
> > Shai
> >
> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com> wrote:
> >>
> >> Good point. I meant the model at the document level: i.e. what
> >> milestones does a document go through in its life cycle. Today:
> >>
> >> created --> deleted
> >>
> >> With incremental updates:
> >>
> >> created --> update1 --> update2 --> deleted
> >>
> >> I think what I'm trying to say is that this second threaded sequence
> >> of state changes seems intuitively more fragile under concurrent
> >> scenarios.  So for example, in a lock-free design, the system would
> >> also have to anticipate the following sequence of events:
> >>
> >> created --> update1 --> deleted --> update2
> >>
> >> and consider update2 a null op.  I'm imagining there are other cases
> >> that I can't think of..
> >>
> >> -Babak
> >>
> >>
> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> >> <lu...@mikemccandless.com> wrote:
> >> > write once, plus the option to the app to keep multiple commit points
> >> > around (by customizing the deletion policy).
> >> >
> >> > Actually order of operations / commits very much matters in Lucene
> >> > today.
> >> >
> >> > Deletions are not idempotent: if you add a doc w/ term X, delete by
> >> > term X, add a new doc with term X... that's very different than if you
> >> > moved the delete op to the end.  Ie the deletion only applies to the
> >> > docs added before it.
> >> >
> >> > Mike
> >> >
> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com>
> >> > wrote:
> >> >> Sure. Because of the write once principle.  But at some cost
> >> >> (duplicated data). I was just agreeing that it would not be a good
> >> >> idea to bake in version-ing by keeping the layers around forever in a
> >> >> merged index; I wasn't keying in on transactions per se.
> >> >>
> >> >> Speaking of transactions: I'm not sure if we should worry about this
> >> >> much yet, but with "updates" the order of the transaction commits
> >> >> seems important. I think commit order is less important today in
> >> >> Lucene because its model supports only 2 types of events: document
> >> >> creation--which only happens once, and document deletion, which is
> >> >> idempotent.  What do you think? Will commits have to be ordered if we
> >> >> introduce updates?  Or does the onus of maintaining order fall on the
> >> >> application?
> >> >>
> >> >> -Babak
> >> >>
> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> >> <lu...@mikemccandless.com> wrote:
> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com>
> >> >>> wrote:
> >> >>>>> I think they get merged in by the merger, ideally in the
> background.
> >> >>>>
> >> >>>> That sounds sensible. (In other words, we wont concern ourselves
> with
> >> >>>> roll backs--something possible while a "layer" is still around.)
> >> >>>
> >> >>> Actually roll backs would still be very possible even if layers are
> >> >>> merged.
> >> >>>
> >> >>> Ie, one could keep multiple commits around, and the older commits
> >> >>> would still be referring to the old postings + layers, keeping them
> >> >>> alive.
> >> >>>
> >> >>> Lucene would still be transactional with such an approach.
> >> >>>
> >> >>> Mike
> >> >>>
> >> >>>
> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>>
> >> >>>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>
> >> >>
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

Catching up here :)

This is great stuff Shai -- I like the notion of "negative" postings,
that "subtract" docs from previous generations as you iterate them.

And I like the term "stacked segments".  This fits very well with
Lucene's write-once approach, ie, a writer can at any time stack a new
segment (writes to new files) "over" an old segment, like the layers
in photoshop.  A reader merges all stacks on-the-fly when reading.

And the merge policy now picks from 2 dimensions right?  Ie it may
want to simply consolidate stacks on an old segment but otherwise not
merge that segment (eg for very large segments that have accumulated
updates), and normal merging would of course consolidate all stacks
for all segments merged.

Wouldn't this approach conceivably allow you to alter single terms
within a single field (we'd have to figure out how we'd expose the API
for this)?  EG if I appended some text to an already-indexed field?
(In addition to adding a new field to an already indexed doc, or
replacing an indexed field on a previously indexed doc).

Did you have any hard perf numbers?  Merge sorting N streams is
surprisingly costly... we may need/want to have a reader pre-merge
(using up RAM) any "long tail" of stacked segments as long as they are
small enough...

This sounds great!!

Mike

On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <se...@gmail.com> wrote:
> Hi,
>
> WARNING: following email is a bit long, but I think is worth the reading :)
>
> I would like to describe my implementation of incremental field updates
> in Juru (the former search library I've worked on in IBM). I will later
> discuss how I think it can be implemented in Lucene.
>
> The motivation/requirement came from a doc management system which used
> Juru as its search component. The system included document libraries
> where users could create files and upload documents. A user could belong
> to any number of libraries and complex ACLs model was used (down to the
> level of the file). ACLs and Folders were modeled as categories in the
> index (boolean-like terms). Files and folders could be moved around and
> access to a library/folder/file could be granted/revoked at any given
> time. Therefore, such updates usually affected hundreds (and thousands)
> of documents. Overall, the index managed millions of documents, tens of
> thousands of libraries and hundreds of thousands of ACLs (large
> organization :)). To get a rough understanding on the number of such
> updates - every several minutes, tens of thousands of documents were
> updated due to such changes only (in addition to the regular content
> updates).
>
> We were asked to support requests in the following form: "update all docs
> that match <criteria> --> do <operation>" where:
> * <criteria> was "a single doc", "docs belonging to a category" and "docs
> belonging to a set of categories".
> * <operation> was "add categories NEW" (NEW might not even exist in the
> index yet, or already associated w/ the document), "remove categories OLD"
> (irregardless if the docs were already associated w/ OLD or not) and
> "remove all OLD and add all NEW".
>
> The solution I've implemented to support these requests turned out to
> actually allow you to update every term (!) in the index: suppose that
> you have a table, which recorded tuples like <docid, term, +/->. The
> record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm",
> and the record <1, "hp", '-'> means that the same document is not
> associated w/ the word "hp". Then, you could very easily ask for all
> documents that are assoicated w/ "hp", and the result would not include
> doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the
> document received.
>
> I've kept two types of postings in the index: regular and updates.
> Taking the above examples, "ibm" regular posting looked like <"ibm", 1,
> 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2,
> -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
> "hp".
>
> During search time, when a query with the word "ibm" was submitted, I
> create a virtual posting which reads from both the regular and the
> updates, and merges them on the fly according to the +/- signs. Since
> both postings are sorted in ascending order, the merge is very
> efficient, and query time is hardly affected.
>
> Those postings are merged from time to time in a process that is similar
> to how Lucene works today, which keeps the update postings relatively
> small and manageable.
>
> Now here comes the fun part - how I think it can be implemented in Lucene !
>
> To be honest, this sat on my TODO list for a very long time and only a
> couple of months ago I figured out how to implement it in Lucene. The
> main difficulty I had was around the difference between the write-once
> policy in Juru and Lucene - in Lucene, once a segment is written, it
> cannot be changed. BUT, I've only recently realized that this isn't
> exactly true, because deleted docs do change existing segments. The
> deletes are kept in a separate file to the segment (.del) and have their
> own generation. Deletes, as I understood then, and Grant helped me term
> them better, can be defined as "Stacked Segments" - they add data to a
> segment, which from time to time are integrated into the segment (unlike
> Photoshop Layers, but my understanding of Photoshop is limited). And the
> Lucene engine knows how to combine the two, giving precedence to the
> deletes.
>
> By introducing an "Updates Stacked Segment", we can encode postings w/
> the '+'/'-' signs, and when TermDocs/Positions is requested, we can
> create a variation which merges the two lists. When segments are merged,
> the updates will be merged into the regular postings (just like deletes)
> and thus will be gone. In addition, this plays very nicely with readers
> that are currently reading the index, as well as we can have generations
> for the updates - really like deletes !
>
> I think that Lucene's architecture allows for such a solution very
> cleanly and nicely (and I believe flex makes it even easier). We can
> (later, after you've digested the idea) discuss whether this should be
> built into the current IW, or an extension like UpdateableIW. The API
> I've been thinking about should really be like deletes, allowing to
> update docs based on Term or Query. I defer the API discussion for later
> for now.
>
> As for stored fields, this was a real challenge to support in Juru, but
> I think that w/ "Stacked Segments" and Lucene's architecture, this should
> be much easier - adding stacked stored fields ...
>
> As you've noticed, the update postings are not DGap encoded, and sign
> needs to be preserved. While I haven't implemented it in Juru, I think
> that perhaps this can be improved by keeping the '-' and '+' lists
> separated. We will need to register somewhere which came before which
> because order matters a lot here (and I'm not talking about concurrency
> - simple update instructions order). I have some idea how this can be
> achieved, but I refrain from describing it now, to not make this email
> even longer :).
>
> I've mentioned that this approach can be applied to any term and not
> just categories under some circumstances. Basically, as soon as you
> update a term, its DF is no longer true, unless you are able to take the
> updates into account. We can defer the discussion on that, but clearly
> for many fields, incrementally update them should not affect precision,
> as they're not used for that type of scoring ... Maybe, by keeping
> separate '+' and '-' lists we can compute statistics precisely. And I
> haven't given much thought yet to how this and Mike's flex scoring will
> be integrated.
>
> BTW, a word on Parallel Indexing - the two are completely orthogonal.
> Once PI is introduced, one can index all the updateable fields in a
> dedicated slice, for perhaps improving search performance for slices
> that are not updateable (not involving code which attempts to read and
> merge update and regular lists on the fly). Also, incremental field
> updates support all of PI's scenarios, even though some will be done
> more efficiently w/ PI. But this too is a matter for a separate
> discussion :).
>
> That's it ! I believe I've given you all the details I have about the
> approach and high level proposed solution for Lucene. Perhaps some
> details slipped my mind, but if you ask the right questions, I'm sure
> I'll be able to answer them :). I would like to emphasize that since
> this was already implemented (in Juru) - this is more than just a "I
> think this approach can work" proposal ...
>
> I would appreciate your comments on this. I would like to start
> implementing it soon, and so as a first step, please share your comments
> on the overall approach. I'll then write a more detailed description on
> how I think to impl it in Lucene (been spending some time on that), and
> we can have more detailed (and fun) discussions on the low level
> details.
>
> Shai
>
> On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com> wrote:
>>
>> Good point. I meant the model at the document level: i.e. what
>> milestones does a document go through in its life cycle. Today:
>>
>> created --> deleted
>>
>> With incremental updates:
>>
>> created --> update1 --> update2 --> deleted
>>
>> I think what I'm trying to say is that this second threaded sequence
>> of state changes seems intuitively more fragile under concurrent
>> scenarios.  So for example, in a lock-free design, the system would
>> also have to anticipate the following sequence of events:
>>
>> created --> update1 --> deleted --> update2
>>
>> and consider update2 a null op.  I'm imagining there are other cases
>> that I can't think of..
>>
>> -Babak
>>
>>
>> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> > write once, plus the option to the app to keep multiple commit points
>> > around (by customizing the deletion policy).
>> >
>> > Actually order of operations / commits very much matters in Lucene
>> > today.
>> >
>> > Deletions are not idempotent: if you add a doc w/ term X, delete by
>> > term X, add a new doc with term X... that's very different than if you
>> > moved the delete op to the end.  Ie the deletion only applies to the
>> > docs added before it.
>> >
>> > Mike
>> >
>> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com>
>> > wrote:
>> >> Sure. Because of the write once principle.  But at some cost
>> >> (duplicated data). I was just agreeing that it would not be a good
>> >> idea to bake in version-ing by keeping the layers around forever in a
>> >> merged index; I wasn't keying in on transactions per se.
>> >>
>> >> Speaking of transactions: I'm not sure if we should worry about this
>> >> much yet, but with "updates" the order of the transaction commits
>> >> seems important. I think commit order is less important today in
>> >> Lucene because its model supports only 2 types of events: document
>> >> creation--which only happens once, and document deletion, which is
>> >> idempotent.  What do you think? Will commits have to be ordered if we
>> >> introduce updates?  Or does the onus of maintaining order fall on the
>> >> application?
>> >>
>> >> -Babak
>> >>
>> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com>
>> >>> wrote:
>> >>>>> I think they get merged in by the merger, ideally in the background.
>> >>>>
>> >>>> That sounds sensible. (In other words, we wont concern ourselves with
>> >>>> roll backs--something possible while a "layer" is still around.)
>> >>>
>> >>> Actually roll backs would still be very possible even if layers are
>> >>> merged.
>> >>>
>> >>> Ie, one could keep multiple commits around, and the older commits
>> >>> would still be referring to the old postings + layers, keeping them
>> >>> alive.
>> >>>
>> >>> Lucene would still be transactional with such an approach.
>> >>>
>> >>> Mike
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>>
>> >>>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

Hi,

WARNING: following email is a bit long, but I think is worth the reading :)

I would like to describe my implementation of incremental field updates
in Juru (the former search library I've worked on in IBM). I will later
discuss how I think it can be implemented in Lucene.

The motivation/requirement came from a doc management system which used
Juru as its search component. The system included document libraries
where users could create files and upload documents. A user could belong
to any number of libraries and complex ACLs model was used (down to the
level of the file). ACLs and Folders were modeled as categories in the
index (boolean-like terms). Files and folders could be moved around and
access to a library/folder/file could be granted/revoked at any given
time. Therefore, such updates usually affected hundreds (and thousands)
of documents. Overall, the index managed millions of documents, tens of
thousands of libraries and hundreds of thousands of ACLs (large
organization :)). To get a rough understanding on the number of such
updates - every several minutes, tens of thousands of documents were
updated due to such changes only (in addition to the regular content
updates).

We were asked to support requests in the following form: "update all docs
that match <criteria> --> do <operation>" where:
* <criteria> was "a single doc", "docs belonging to a category" and "docs
belonging to a set of categories".
* <operation> was "add categories NEW" (NEW might not even exist in the
index yet, or already associated w/ the document), "remove categories OLD"
(irregardless if the docs were already associated w/ OLD or not) and
"remove all OLD and add all NEW".

The solution I've implemented to support these requests turned out to
actually allow you to update every term (!) in the index: suppose that
you have a table, which recorded tuples like <docid, term, +/->. The
record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm",
and the record <1, "hp", '-'> means that the same document is not
associated w/ the word "hp". Then, you could very easily ask for all
documents that are assoicated w/ "hp", and the result would not include
doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the
document received.

I've kept two types of postings in the index: regular and updates.
Taking the above examples, "ibm" regular posting looked like <"ibm", 1,
3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2,
-3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
"hp".

During search time, when a query with the word "ibm" was submitted, I
create a virtual posting which reads from both the regular and the
updates, and merges them on the fly according to the +/- signs. Since
both postings are sorted in ascending order, the merge is very
efficient, and query time is hardly affected.

Those postings are merged from time to time in a process that is similar
to how Lucene works today, which keeps the update postings relatively
small and manageable.

Now here comes the fun part - how I think it can be implemented in Lucene !

To be honest, this sat on my TODO list for a very long time and only a
couple of months ago I figured out how to implement it in Lucene. The
main difficulty I had was around the difference between the write-once
policy in Juru and Lucene - in Lucene, once a segment is written, it
cannot be changed. BUT, I've only recently realized that this isn't
exactly true, because deleted docs do change existing segments. The
deletes are kept in a separate file to the segment (.del) and have their
own generation. Deletes, as I understood then, and Grant helped me term
them better, can be defined as "Stacked Segments" - they add data to a
segment, which from time to time are integrated into the segment (unlike
Photoshop Layers, but my understanding of Photoshop is limited). And the
Lucene engine knows how to combine the two, giving precedence to the
deletes.

By introducing an "Updates Stacked Segment", we can encode postings w/
the '+'/'-' signs, and when TermDocs/Positions is requested, we can
create a variation which merges the two lists. When segments are merged,
the updates will be merged into the regular postings (just like deletes)
and thus will be gone. In addition, this plays very nicely with readers
that are currently reading the index, as well as we can have generations
for the updates - really like deletes !

I think that Lucene's architecture allows for such a solution very
cleanly and nicely (and I believe flex makes it even easier). We can
(later, after you've digested the idea) discuss whether this should be
built into the current IW, or an extension like UpdateableIW. The API
I've been thinking about should really be like deletes, allowing to
update docs based on Term or Query. I defer the API discussion for later
for now.

As for stored fields, this was a real challenge to support in Juru, but
I think that w/ "Stacked Segments" and Lucene's architecture, this should
be much easier - adding stacked stored fields ...

As you've noticed, the update postings are not DGap encoded, and sign
needs to be preserved. While I haven't implemented it in Juru, I think
that perhaps this can be improved by keeping the '-' and '+' lists
separated. We will need to register somewhere which came before which
because order matters a lot here (and I'm not talking about concurrency
- simple update instructions order). I have some idea how this can be
achieved, but I refrain from describing it now, to not make this email
even longer :).

I've mentioned that this approach can be applied to any term and not
just categories under some circumstances. Basically, as soon as you
update a term, its DF is no longer true, unless you are able to take the
updates into account. We can defer the discussion on that, but clearly
for many fields, incrementally update them should not affect precision,
as they're not used for that type of scoring ... Maybe, by keeping
separate '+' and '-' lists we can compute statistics precisely. And I
haven't given much thought yet to how this and Mike's flex scoring will
be integrated.

BTW, a word on Parallel Indexing - the two are completely orthogonal.
Once PI is introduced, one can index all the updateable fields in a
dedicated slice, for perhaps improving search performance for slices
that are not updateable (not involving code which attempts to read and
merge update and regular lists on the fly). Also, incremental field
updates support all of PI's scenarios, even though some will be done
more efficiently w/ PI. But this too is a matter for a separate
discussion :).

That's it ! I believe I've given you all the details I have about the
approach and high level proposed solution for Lucene. Perhaps some
details slipped my mind, but if you ask the right questions, I'm sure
I'll be able to answer them :). I would like to emphasize that since
this was already implemented (in Juru) - this is more than just a "I
think this approach can work" proposal ...

I would appreciate your comments on this. I would like to start
implementing it soon, and so as a first step, please share your comments
on the overall approach. I'll then write a more detailed description on
how I think to impl it in Lucene (been spending some time on that), and
we can have more detailed (and fun) discussions on the low level
details.

Shai

On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <fa...@gmail.com> wrote:

> Good point. I meant the model at the document level: i.e. what
> milestones does a document go through in its life cycle. Today:
>
> created --> deleted
>
> With incremental updates:
>
> created --> update1 --> update2 --> deleted
>
> I think what I'm trying to say is that this second threaded sequence
> of state changes seems intuitively more fragile under concurrent
> scenarios.  So for example, in a lock-free design, the system would
> also have to anticipate the following sequence of events:
>
> created --> update1 --> deleted --> update2
>
> and consider update2 a null op.  I'm imagining there are other cases
> that I can't think of..
>
> -Babak
>
>
> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > write once, plus the option to the app to keep multiple commit points
> > around (by customizing the deletion policy).
> >
> > Actually order of operations / commits very much matters in Lucene today.
> >
> > Deletions are not idempotent: if you add a doc w/ term X, delete by
> > term X, add a new doc with term X... that's very different than if you
> > moved the delete op to the end.  Ie the deletion only applies to the
> > docs added before it.
> >
> > Mike
> >
> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com>
> wrote:
> >> Sure. Because of the write once principle.  But at some cost
> >> (duplicated data). I was just agreeing that it would not be a good
> >> idea to bake in version-ing by keeping the layers around forever in a
> >> merged index; I wasn't keying in on transactions per se.
> >>
> >> Speaking of transactions: I'm not sure if we should worry about this
> >> much yet, but with "updates" the order of the transaction commits
> >> seems important. I think commit order is less important today in
> >> Lucene because its model supports only 2 types of events: document
> >> creation--which only happens once, and document deletion, which is
> >> idempotent.  What do you think? Will commits have to be ordered if we
> >> introduce updates?  Or does the onus of maintaining order fall on the
> >> application?
> >>
> >> -Babak
> >>
> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> <lu...@mikemccandless.com> wrote:
> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com>
> wrote:
> >>>>> I think they get merged in by the merger, ideally in the background.
> >>>>
> >>>> That sounds sensible. (In other words, we wont concern ourselves with
> >>>> roll backs--something possible while a "layer" is still around.)
> >>>
> >>> Actually roll backs would still be very possible even if layers are
> merged.
> >>>
> >>> Ie, one could keep multiple commits around, and the older commits
> >>> would still be referring to the old postings + layers, keeping them
> >>> alive.
> >>>
> >>> Lucene would still be transactional with such an approach.
> >>>
> >>> Mike
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

Good point. I meant the model at the document level: i.e. what
milestones does a document go through in its life cycle. Today:

created --> deleted

With incremental updates:

created --> update1 --> update2 --> deleted

I think what I'm trying to say is that this second threaded sequence
of state changes seems intuitively more fragile under concurrent
scenarios.  So for example, in a lock-free design, the system would
also have to anticipate the following sequence of events:

created --> update1 --> deleted --> update2

and consider update2 a null op.  I'm imagining there are other cases
that I can't think of..

-Babak


On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> write once, plus the option to the app to keep multiple commit points
> around (by customizing the deletion policy).
>
> Actually order of operations / commits very much matters in Lucene today.
>
> Deletions are not idempotent: if you add a doc w/ term X, delete by
> term X, add a new doc with term X... that's very different than if you
> moved the delete op to the end.  Ie the deletion only applies to the
> docs added before it.
>
> Mike
>
> On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com> wrote:
>> Sure. Because of the write once principle.  But at some cost
>> (duplicated data). I was just agreeing that it would not be a good
>> idea to bake in version-ing by keeping the layers around forever in a
>> merged index; I wasn't keying in on transactions per se.
>>
>> Speaking of transactions: I'm not sure if we should worry about this
>> much yet, but with "updates" the order of the transaction commits
>> seems important. I think commit order is less important today in
>> Lucene because its model supports only 2 types of events: document
>> creation--which only happens once, and document deletion, which is
>> idempotent.  What do you think? Will commits have to be ordered if we
>> introduce updates?  Or does the onus of maintaining order fall on the
>> application?
>>
>> -Babak
>>
>> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com> wrote:
>>>>> I think they get merged in by the merger, ideally in the background.
>>>>
>>>> That sounds sensible. (In other words, we wont concern ourselves with
>>>> roll backs--something possible while a "layer" is still around.)
>>>
>>> Actually roll backs would still be very possible even if layers are merged.
>>>
>>> Ie, one could keep multiple commits around, and the older commits
>>> would still be referring to the old postings + layers, keeping them
>>> alive.
>>>
>>> Lucene would still be transactional with such an approach.
>>>
>>> Mike
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

write once, plus the option to the app to keep multiple commit points
around (by customizing the deletion policy).

Actually order of operations / commits very much matters in Lucene today.

Deletions are not idempotent: if you add a doc w/ term X, delete by
term X, add a new doc with term X... that's very different than if you
moved the delete op to the end.  Ie the deletion only applies to the
docs added before it.

Mike

On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <fa...@gmail.com> wrote:
> Sure. Because of the write once principle.  But at some cost
> (duplicated data). I was just agreeing that it would not be a good
> idea to bake in version-ing by keeping the layers around forever in a
> merged index; I wasn't keying in on transactions per se.
>
> Speaking of transactions: I'm not sure if we should worry about this
> much yet, but with "updates" the order of the transaction commits
> seems important. I think commit order is less important today in
> Lucene because its model supports only 2 types of events: document
> creation--which only happens once, and document deletion, which is
> idempotent.  What do you think? Will commits have to be ordered if we
> introduce updates?  Or does the onus of maintaining order fall on the
> application?
>
> -Babak
>
> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com> wrote:
>>>> I think they get merged in by the merger, ideally in the background.
>>>
>>> That sounds sensible. (In other words, we wont concern ourselves with
>>> roll backs--something possible while a "layer" is still around.)
>>
>> Actually roll backs would still be very possible even if layers are merged.
>>
>> Ie, one could keep multiple commits around, and the older commits
>> would still be referring to the old postings + layers, keeping them
>> alive.
>>
>> Lucene would still be transactional with such an approach.
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

Sure. Because of the write once principle.  But at some cost
(duplicated data). I was just agreeing that it would not be a good
idea to bake in version-ing by keeping the layers around forever in a
merged index; I wasn't keying in on transactions per se.

Speaking of transactions: I'm not sure if we should worry about this
much yet, but with "updates" the order of the transaction commits
seems important. I think commit order is less important today in
Lucene because its model supports only 2 types of events: document
creation--which only happens once, and document deletion, which is
idempotent.  What do you think? Will commits have to be ordered if we
introduce updates?  Or does the onus of maintaining order fall on the
application?

-Babak

On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com> wrote:
>>> I think they get merged in by the merger, ideally in the background.
>>
>> That sounds sensible. (In other words, we wont concern ourselves with
>> roll backs--something possible while a "layer" is still around.)
>
> Actually roll backs would still be very possible even if layers are merged.
>
> Ie, one could keep multiple commits around, and the older commits
> would still be referring to the old postings + layers, keeping them
> alive.
>
> Lucene would still be transactional with such an approach.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <fa...@gmail.com> wrote:
>> I think they get merged in by the merger, ideally in the background.
>
> That sounds sensible. (In other words, we wont concern ourselves with
> roll backs--something possible while a "layer" is still around.)

Actually roll backs would still be very possible even if layers are merged.

Ie, one could keep multiple commits around, and the older commits
would still be referring to the old postings + layers, keeping them
alive.

Lucene would still be transactional with such an approach.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Babak Farhang <fa...@gmail.com>.

> I think they get merged in by the merger, ideally in the background.

That sounds sensible. (In other words, we wont concern ourselves with
roll backs--something possible while a "layer" is still around.)

I've been thinking about this problem also. One approach discussed
earlier in these mailing lists has been to somehow maintain a parallel
index of the update-able of the fields in such a way that the docIds
of the parallel index remain in sync with the "master" index. Mike
McCandless and I were discussing some variants of this approach a few
months back: http://markmail.org/message/uifz5v37k6qxxhvz?q=%22incremental+document+field+update%22+site:markmail%2Eorg&page=1&refer=ipebtbf24y7rleps
 That approach involved the concept of mapping (chaining, if you will)
internal docIds to view ids.  That docid mapping concept sounds
analogous to this layer concept we are discussing now.

I now think the parallel index approach may not be such a great idea,
after all: it simply pushes the problem to the edge--the slave index.
If we can solve update problem in the slave index, I reason, then
shouldn't we also be able to solve the same update problem in the
master index (and thereby remove the necessity of maintaining a
(user-level) parallel index in the first place)?

Which seems to align with the approach being discussed here..

I imagine the "layers" being discussed here are somehow threaded by
docId. That is, given a docId, you can quickly find it's "layers."  If
so, then the docId mapping idea may be one way to thread these layers.
(A logical document would be constructed by a chain of docIds, each
overriding the previous for each field it defines (or deletes).  Such
a construction would have to be "merge-aware" (perhaps using machinery
similar to that used in LUCENE-1879) in order that it may maintain the
docId chain.

What do you think?

On Fri, Apr 2, 2010 at 4:56 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:
>
>> [Late to this party, but thought I'd chime in]
>>
>> I think this "layer" concept is right on.  But I'm wondering about the
>> life cycle of these layers.  Do layers live forever? Or do they
>> collapse at some point? (Like, as I think was already pointed out,
>> deletes are when segments are merged today.)
>
> I think they get merged in by the merger, ideally in the background.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Incremental Field Updates

Posted by Grant Ingersoll <gs...@apache.org>.

On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:

> [Late to this party, but thought I'd chime in]
> 
> I think this "layer" concept is right on.  But I'm wondering about the
> life cycle of these layers.  Do layers live forever? Or do they
> collapse at some point? (Like, as I think was already pointed out,
> deletes are when segments are merged today.)

I think they get merged in by the merger, ideally in the background.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org