You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2008/02/05 00:07:24 UTC

Re: Performance guarantees and index format

: What this issue doesn't discuss is what to do with partial results obtained
: when a timeout occurred. As the original poster points out, document lists are
: traversed in the order they were added and not the order of their importance,
: which introduces a bias to partial results in that they reflect results from a
: random sample (which is likely not the most relevant, i.e. there could have
: been more relevant results later in the traversal order).
: 
: The answer to this issue is org.apache.nutch.indexer.IndexSorter, which

skimming this it doesn't seem like a refactored version that was less 
nutch specific cold make a handy contrib ... but it also seems like there 
may be a simpler approach for the (i assume) common case of prefering docs 
that were indexed later....

if we eliminate the requirement for *strict* preference of recent 
documents and make that a more loose desire, then we coulnd't we do a 
pretty good job if we just changed Segment merging to reorder reverse the 
order of the segments before each merge?  it wouldn't be very useful to 
start doing this on an index that's already a decent size, but if this was 
happening on every merge right from the very begining, then the most 
recent documents would percollate to the front of the index right?

The only downside i can think of would be that docids would frequently 
(not not very predictably) change even if there were no deletions .. but 
you'd pay that same penalty with something like the nutch's IndexSorter.

I'm not much of an expert on segment merging.. but with the exception of 
docid order i can'tthink of many reasons why there couldn't be a merger 
that revesed the order of hte segments.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Performance guarantees and index format

Posted by Doron Cohen <cd...@gmail.com>.

I was once involved in modified a search index
implementation (not Lucene) to write posting lists so that
they can be traversed (only) in reverse order. Docids
were preserved but you got higher IDs first. This was
a non-trivial code change.

Now the suggestion to (optionally) order merged
segments from new to old should be much simpler
to implement (I think) and would be an interesting add-on.

If in addition DocumentsWriter is modified to optionally
reverse the order of written docs, you get the docs
completely reversed.

Being optional, applications caring about docids
stability would not use this option.

On Fri, Feb 8, 2008 at 12:22 AM, Chris Hostetter <ho...@fucit.org>
wrote:

>
> : I think this would be too messy - currently we can be sure of the simple
> rule
> : that documents added to the index get incrementally higher docids, i.e.
> the
> : higher the docid the more recent is the document. I think it would be
> much
> : simpler to write a FilterIndexReader that simply reverses the order of
> docids.
>
> First off: you only have that garuntee while indexing ... if you
> frequently reorder docs using something like the IndexSorter then that
> rule no longer applies (and you must not care or you wouldn't have
> reordered everything)
>
> Second: using IndexSorter after an index is completley built is definitely
> a simpler, clearner, way of accomplishing something like this -- but it
> only seems adequate for situations in which "index building" is seperate
> and distinct from "index searching" ... I can't see how it would work very
> easily in situations where you are continuously performing incremental
> updates while searches are taking place.
>
> : The issue with Nutch's IndexSorter is that it allows you to reorder
> docids in
> : an arbitrary manner, using a user-supplied mapping between old and new
> docids,
> : which can be based on values retrieved from the current index or from
> any
> : other source. So I think this would be the main value of this class
> applicable
> : to various scenarios.
>
> No Argument what-so-ever.  IndexSorter seems like a sweet tool to have in
> the Lucene toolbox for letting people reordering the docs in an index by
> arbitrary criteria ... but for people with the specific case of
> *prefering* that recently added docs be in front of older docs, automatic
> segment reordering seems like it would also be a handy tool to have in the
> toolbox so that documents could "bubble up" gradually.  (maybe as a new
> MergePolicy? ... probably need some API changes to allow order to be
> specified)
>
> There would definitley be trade offs people would need to consdier before
> using it -- but those tradeoffs would probably also apply if they wanted
> to use IndexSorter.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Performance guarantees and index format

Posted by Chris Hostetter <ho...@fucit.org>.

: I think this would be too messy - currently we can be sure of the simple rule
: that documents added to the index get incrementally higher docids, i.e. the
: higher the docid the more recent is the document. I think it would be much
: simpler to write a FilterIndexReader that simply reverses the order of docids.

First off: you only have that garuntee while indexing ... if you 
frequently reorder docs using something like the IndexSorter then that 
rule no longer applies (and you must not care or you wouldn't have 
reordered everything)

Second: using IndexSorter after an index is completley built is definitely 
a simpler, clearner, way of accomplishing something like this -- but it 
only seems adequate for situations in which "index building" is seperate 
and distinct from "index searching" ... I can't see how it would work very 
easily in situations where you are continuously performing incremental 
updates while searches are taking place.

: The issue with Nutch's IndexSorter is that it allows you to reorder docids in
: an arbitrary manner, using a user-supplied mapping between old and new docids,
: which can be based on values retrieved from the current index or from any
: other source. So I think this would be the main value of this class applicable
: to various scenarios.

No Argument what-so-ever.  IndexSorter seems like a sweet tool to have in 
the Lucene toolbox for letting people reordering the docs in an index by 
arbitrary criteria ... but for people with the specific case of 
*prefering* that recently added docs be in front of older docs, automatic 
segment reordering seems like it would also be a handy tool to have in the 
toolbox so that documents could "bubble up" gradually.  (maybe as a new 
MergePolicy? ... probably need some API changes to allow order to be 
specified)

There would definitley be trade offs people would need to consdier before 
using it -- but those tradeoffs would probably also apply if they wanted 
to use IndexSorter.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Performance guarantees and index format

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Hostetter wrote:
> : What this issue doesn't discuss is what to do with partial results obtained
> : when a timeout occurred. As the original poster points out, document lists are
> : traversed in the order they were added and not the order of their importance,
> : which introduces a bias to partial results in that they reflect results from a
> : random sample (which is likely not the most relevant, i.e. there could have
> : been more relevant results later in the traversal order).
> : 
> : The answer to this issue is org.apache.nutch.indexer.IndexSorter, which
> 
> skimming this it doesn't seem like a refactored version that was less 
> nutch specific cold make a handy contrib ... but it also seems like there 
> may be a simpler approach for the (i assume) common case of prefering docs 
> that were indexed later....
> 
> if we eliminate the requirement for *strict* preference of recent 
> documents and make that a more loose desire, then we coulnd't we do a 
> pretty good job if we just changed Segment merging to reorder reverse the 
> order of the segments before each merge?  it wouldn't be very useful to 
> start doing this on an index that's already a decent size, but if this was 
> happening on every merge right from the very begining, then the most 
> recent documents would percollate to the front of the index right?
> 
> The only downside i can think of would be that docids would frequently 
> (not not very predictably) change even if there were no deletions .. but 
> you'd pay that same penalty with something like the nutch's IndexSorter.
> 
> I'm not much of an expert on segment merging.. but with the exception of 
> docid order i can'tthink of many reasons why there couldn't be a merger 
> that revesed the order of hte segments.

I think this would be too messy - currently we can be sure of the simple 
rule that documents added to the index get incrementally higher docids, 
i.e. the higher the docid the more recent is the document. I think it 
would be much simpler to write a FilterIndexReader that simply reverses 
the order of docids.

The issue with Nutch's IndexSorter is that it allows you to reorder 
docids in an arbitrary manner, using a user-supplied mapping between old 
and new docids, which can be based on values retrieved from the current 
index or from any other source. So I think this would be the main value 
of this class applicable to various scenarios.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org