Posted to dev@lucene.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/10/07 10:59:31 UTC

Re: Incremental Field Updates

Picking up on this very interesting discussion..
Great and innovative piece of work, Shai!

I think we can come a long way in addressing common scenarios through this approach. Many customers really just need ACL or other metadata updates. One example is a customer of mine who has an index of large docs for which the source data is archived on tape. It is way too costly to retrieve the original data to compile a new document for a metadata update only.

Also, if I want to have the ability to update a whole field, I would be happy to make it stored, rather than having to supply the original value to the API. Seems like a reasonable tradeoff for getting incremental update - nobody would expect it to be free.

+1 for solving the "simple metadata" update case first, with full-field update support for stored fields only.

Does this particular solution currently have an associated JIRA issue?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10 May 2010, at 10:40, Michael McCandless wrote:

> On Mon, May 10, 2010 at 4:05 AM, Shai Erera <se...@gmail.com> wrote:
>> That's an interesting scenario Mike.
>> 
>> Previously, I only handled boolean-like terms, as the scenarios we were
>> asked to support involved just those types of terms. Obviously, when the
>> approach allows for more, more scenarios pop to mind :).
> 
> OK.
> 
>> I think we may still be able to resolve that case, but it becomes much more
>> complicated. My design approach of adding the +/- affected the entire
>> posting element, whereas the scenario you describe affects the positions of
>> the posting element. This calls for a more complicated design and solution.
> 
> Right.
> 
>> My take on it is that if someone wants to update the catch-all field, then
>> reindexing the document may not be such a bad idea anyway. The purpose of
>> those incremental updates is to cope w/ high frequency of updates, which
>> usually happen on metadata fields, and not title.
> 
> I agree.
> 
>> But since one could add the 'tags' to the catch-all field as well, it brings
>> us to the same point - how do I remove the positions of term X that relate
>> to the tag X and not the potentially original term X that existed in the
>> document?
>> 
>> This is a very advanced case (and interesting). I don't want to hold up the
>> discussion on it, but want to make sure we do not deviate from getting the
>> simpler cases in first. Depending on the API, this might be very easy
>> to solve, but might also complicate matters. Maybe, for an
>> incr-field-updates-v1, we can do without it?
> 
> Definitely, let's take this (incrementally updating the positions as
> well) out of scope for the first cut, when we actually start building
> things.  One simple way to do this might be to only allow incremental
> update on fields that have omitTFAP=true.
> 
> When brainstorming/designing a new feature, I like to cast a wide net
> during the discussion/thinking (what we are doing now), but then when
> it comes to what to actually build for phase one we'll pull it way back
> in and aim for baby steps / progress not perfection.  We are able to
> do much more imagining than actually writing code :)
> 
> The wide net during brainstorming gives us a better view/context of
> the road ahead, eg to validate that the baby step is in the right
> direction, so that it doesn't preclude other things we might imagine
> later.
> 
> In this case, it does sound like the approach should work (in theory)
> fine w/ positions, too.
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
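
The "+/-" posting idea mentioned in the quoted exchange can be pictured with a small sketch. This is not Lucene code and not the actual design, which is not spelled out in this thread; it only assumes that each update is recorded as a signed entry against a whole posting (term, doc), with no positions involved. The function name and the tuple layout are hypothetical.

```python
def effective_postings(base, deltas):
    """Return the doc ids for one term after applying signed updates.

    base   -- doc ids already indexed for the term (assumed layout)
    deltas -- ordered (sign, doc_id) pairs; "+" adds the posting,
              "-" removes it (hypothetical encoding)
    """
    docs = set(base)
    for sign, doc_id in deltas:
        if sign == "+":
            docs.add(doc_id)
        elif sign == "-":
            docs.discard(doc_id)
        else:
            raise ValueError("unknown sign: %r" % (sign,))
    return sorted(docs)


# Term "tag:urgent" was indexed for docs 1, 4 and 7, then updated four times.
print(effective_postings([1, 4, 7], [("+", 3), ("-", 4), ("+", 4), ("-", 7)]))
```

Because an entry here covers the whole posting, the sketch cannot tell a tag occurrence apart from an original occurrence of the same term within one document, which is the harder, position-level case that the thread proposes to leave out of the first cut.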


Re: Incremental Field Updates

Posted by Michael McCandless <lu...@mikemccandless.com>.

One thing that makes IFU especially compelling now is that we have
improved how we track buffered deletes: they are recorded by Term or
Query and mapped against certain segments, to determine which documents
should be deleted.

I think this same "channel" can be generalized to also hold pending
updates to documents from prior segments.  We should be able to reuse
much of the buffered deletes code to also hold buffered doc updates.
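
A minimal sketch of that generalization, under loud assumptions: this is a toy model, not Lucene's buffered deletes code. `Segment`, `BufferedChanges` and their methods are hypothetical names, segments are mutated in place purely for illustration, and the order in which buffered operations arrive is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy segment: doc id -> {field name -> set of terms}."""
    name: str
    docs: dict
    deleted: set = field(default_factory=set)


class BufferedChanges:
    """Hypothetical buffer for pending deletes and pending field updates."""

    def __init__(self):
        self.deletes = []  # (field, term) selectors
        self.updates = []  # ((field, term), target field, add, remove)

    def delete_by_term(self, fld, term):
        self.deletes.append((fld, term))

    def update_by_term(self, fld, term, target, add=(), remove=()):
        self.updates.append(((fld, term), target, set(add), set(remove)))

    def apply(self, segment):
        """Apply everything buffered so far to one prior segment."""
        for doc_id, fields in segment.docs.items():
            if doc_id in segment.deleted:
                continue
            # Updates select documents the same way deletes do: by term.
            for (fld, term), target, add, remove in self.updates:
                if term in fields.get(fld, set()):
                    terms = fields.setdefault(target, set())
                    terms -= remove
                    terms |= add
            for fld, term in self.deletes:
                if term in fields.get(fld, set()):
                    segment.deleted.add(doc_id)


seg = Segment("_0", {
    0: {"id": {"a"}, "acl": {"alice"}},
    1: {"id": {"b"}, "acl": {"bob"}},
})
buf = BufferedChanges()
buf.update_by_term("id", "a", "acl", add={"carol"}, remove={"alice"})
buf.delete_by_term("id", "b")
buf.apply(seg)
```

The point of the sketch is only that a delete and an update can share one selector (a term matched against a prior segment) and therefore one buffer.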

IFU would be an *awesome* addition ;)

I think doc blocks (LUCENE-3112) make IFU even more important.

Mike

http://blog.mikemccandless.com

On Mon, May 23, 2011 at 12:08 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 10/7/10 11:07 AM, Shai Erera wrote:
>>
>> Not yet. I actually plan to start working on it next week, but it will
>> take some time until I post the first patch. Also, I'll probably develop
>> it on top of trunk only, utilizing flexible indexing. At the moment, I
>> have no plans, nor can I estimate how much work is required, to develop
>> it on top of 3x.
>>
>> Unfortunately my regular projects keep me very busy, but it's time I
>> allocate some time to work on this one too :). Stay tuned !
>
> I still am :)
>
> I'm resurrecting this old thread in the hope that this subject again
> attracts the attention of the Lucene mind collective. The design proposal
> that you presented at that time seemed workable ... so I guess it just
> requires someone to roll up their sleeves and start implementing it?
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Re: Incremental Field Updates

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 5/23/11 8:24 PM, Shai Erera wrote:
> Heh, almost a year has passed since I sent that email. Soon, "next week"
> will be applicable again :).

:)

> Besides my day-to-day job, the changes to trunk kept me away from
> pursuing this. DWPT, BufferedDeletes and all the crazy improvements and
> enhancements Mike and Simon (and others) have been working on lately
> changed much of the code that will be involved in the development.
>
> Things seem to be more stable (and "at peace") at trunk-land now (heck,
> we've even started discussing releasing 4.0 !), so it's time to get back
> on that issue.
>
> I need to reorganize my thoughts and design details into a document or
> something, or plainly in a JIRA issue. I'm afraid of setting any dates
> on this just yet though.
>
> While I would very much like to implement it, I realize the community
> cannot wait for me forever. So if there's anyone interested in picking up
> the gloves, I will understand :) (and of course help as much as I can ...)

The 4.0 internals are still uncharted territory for me ... I'd like to use
this functionality for a project, so I'm generally willing to help (review,
design, maybe code if I can catch up with trunk...).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

Heh, almost a year has passed since I sent that email. Soon, "next week"
will be applicable again :).

Besides my day-to-day job, the changes to trunk kept me away from pursuing
this. DWPT, BufferedDeletes and all the crazy improvements and enhancements
Mike and Simon (and others) have been working on lately changed much of the
code that will be involved in the development.

Things seem to be more stable (and "at peace") at trunk-land now (heck,
we've even started discussing releasing 4.0 !), so it's time to get back on
that issue.

I need to reorganize my thoughts and design details into a document or
something, or plainly in a JIRA issue. I'm afraid of setting any dates on
this just yet though.

While I would very much like to implement it, I realize the community cannot
wait for me forever. So if there's anyone interested in picking up the gloves,
I will understand :) (and of course help as much as I can ...)

Shai

Re: Incremental Field Updates

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 10/7/10 11:07 AM, Shai Erera wrote:
> Not yet. I actually plan to start working on it next week, but it will
> take some time until I post the first patch. Also, I'll probably develop
> it on top of trunk only, utilizing flexible indexing. At the moment, I
> have no plans, nor can I estimate how much work is required, to develop
> it on top of 3x.
>
> Unfortunately my regular projects keep me very busy, but it's time I
> allocate some time to work on this one too :). Stay tuned !

I still am :)

I'm resurrecting this old thread in the hope that this subject again
attracts the attention of the Lucene mind collective. The design proposal
that you presented at that time seemed workable ... so I guess it just
requires someone to roll up their sleeves and start implementing it?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Field Updates

Posted by Shai Erera <se...@gmail.com>.

Not yet. I actually plan to start working on it next week, but it will take
some time until I post the first patch. Also, I'll probably develop it on
top of trunk only, utilizing flexible indexing. At the moment, I have no
plans, nor can I estimate how much work is required, to develop it on top of
3x.

Unfortunately my regular projects keep me very busy, but it's time I
allocate some time to work on this one too :). Stay tuned !

Shai
