Posted to java-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2008/02/05 00:07:24 UTC
Re: Performance guarantees and index format
: What this issue doesn't discuss is what to do with partial results obtained
: when a timeout occurred. As the original poster points out, document lists are
: traversed in the order they were added and not the order of their importance,
: which introduces a bias to partial results in that they reflect results from a
: random sample (which is likely not the most relevant, i.e. there could have
: been more relevant results later in the traversal order).
:
: The answer to this issue is org.apache.nutch.indexer.IndexSorter, which
skimming this it does seem like a refactored version that was less
Nutch-specific could make a handy contrib ... but it also seems like there
may be a simpler approach for the (i assume) common case of preferring docs
that were indexed later....
if we eliminate the requirement for *strict* preference of recent
documents and make that a looser desire, then couldn't we do a
pretty good job if we just changed segment merging to reverse the
order of the segments before each merge? it wouldn't be very useful to
start doing this on an index that's already a decent size, but if this was
happening on every merge right from the very beginning, then the most
recent documents would percolate to the front of the index, right?
The only downside i can think of would be that docids would frequently
(and not very predictably) change even if there were no deletions ... but
you'd pay that same penalty with something like Nutch's IndexSorter.
I'm not much of an expert on segment merging ... but with the exception of
docid order i can't think of many reasons why there couldn't be a merger
that reversed the order of the segments.
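To make the percolation idea concrete, here is a minimal toy simulation in
plain Java. It does not use any Lucene API: a segment is just a list of
arrival sequence numbers in docid order, and ReverseMergeSketch, index,
mergeNewestFirst, flushSize and mergeFactor are hypothetical names made up
for the sketch. The merge trigger is also an assumption (merge everything
as soon as mergeFactor segments exist), which is far simpler than what a
real merger does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only: documents are arrival sequence numbers, not Lucene documents. */
class ReverseMergeSketch {

    /** Concatenates the segments newest-first; order inside each segment is kept. */
    static List<Integer> mergeNewestFirst(List<List<Integer>> segmentsInIndexOrder) {
        List<List<Integer>> newestFirst = new ArrayList<>(segmentsInIndexOrder);
        Collections.reverse(newestFirst);
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> segment : newestFirst) {
            merged.addAll(segment);
        }
        return merged;
    }

    /**
     * Adds docs 0..docCount-1, flushing a new segment every flushSize docs and
     * merging all segments into one whenever mergeFactor segments exist.
     * (Hypothetical, simplified merge trigger.)
     */
    static List<List<Integer>> index(int docCount, int flushSize, int mergeFactor) {
        List<List<Integer>> segments = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (int doc = 0; doc < docCount; doc++) {
            buffer.add(doc);
            if (buffer.size() == flushSize || doc == docCount - 1) {
                segments.add(buffer); // flush the buffered docs as the newest segment
                buffer = new ArrayList<>();
                if (segments.size() == mergeFactor) {
                    List<Integer> merged = mergeNewestFirst(segments);
                    segments = new ArrayList<>();
                    segments.add(merged);
                }
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // 8 docs, flush every 2 docs, merge every 2 segments.
        System.out.println(index(8, 2, 2)); // prints [[6, 7, 4, 5, 2, 3, 0, 1]]
    }
}
```

In this run the most recent docs end up near the front but not in strict
order, and doc 0 moves from docid 0 to 2, then 4, then 6 without a single
deletion, which is the docid instability mentioned above.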
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Performance guarantees and index format
Posted by Doron Cohen <cd...@gmail.com>.
I was once involved in modifying a search index
implementation (not Lucene) to write posting lists so that
they can be traversed (only) in reverse order. Docids
were preserved but you got higher IDs first. This was
a non-trivial code change.
Now the suggestion to (optionally) order merged
segments from new to old should be much simpler
to implement (I think) and would be an interesting add-on.
If in addition DocumentsWriter is modified to optionally
reverse the order of written docs, you get the docs
completely reversed.
Being optional, applications caring about docids
stability would not use this option.
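As a rough illustration of why the two options together give a complete
reversal, here is a small toy model in plain Java. It is not Lucene code:
FullReversalSketch, build, flushSize and reverseWithinSegment are
hypothetical names, docs are plain sequence numbers, and the sketch assumes
a newest-first merge after every flush.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only: shows segment-level vs. full reversal of doc order. */
class FullReversalSketch {

    /**
     * Adds docs 0..docCount-1, flushing every flushSize docs. Each flushed
     * segment is merged in FRONT of the existing index (newest-first merge).
     * If reverseWithinSegment is true, the docs of each flushed segment are
     * also written in reverse order.
     */
    static List<Integer> build(int docCount, int flushSize, boolean reverseWithinSegment) {
        List<Integer> index = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (int doc = 0; doc < docCount; doc++) {
            buffer.add(doc);
            if (buffer.size() == flushSize || doc == docCount - 1) {
                if (reverseWithinSegment) {
                    Collections.reverse(buffer);
                }
                buffer.addAll(index); // new segment first, older docs after it
                index = buffer;
                buffer = new ArrayList<>();
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(6, 3, false)); // prints [3, 4, 5, 0, 1, 2]
        System.out.println(build(6, 3, true));  // prints [5, 4, 3, 2, 1, 0]
    }
}
```

Reversing only the segment order leaves each flushed batch in arrival
order; adding the reversal at write time yields a strictly newest-first
docid order.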
On Fri, Feb 8, 2008 at 12:22 AM, Chris Hostetter <ho...@fucit.org>
wrote:
>
> : I think this would be too messy - currently we can be sure of the simple rule
> : that documents added to the index get incrementally higher docids, i.e. the
> : higher the docid the more recent is the document. I think it would be much
> : simpler to write a FilterIndexReader that simply reverses the order of docids.
>
> First off: you only have that guarantee while indexing ... if you
> frequently reorder docs using something like the IndexSorter then that
> rule no longer applies (and you must not care or you wouldn't have
> reordered everything)
>
> Second: using IndexSorter after an index is completely built is definitely
> a simpler, cleaner way of accomplishing something like this -- but it
> only seems adequate for situations in which "index building" is separate
> and distinct from "index searching" ... I can't see how it would work very
> easily in situations where you are continuously performing incremental
> updates while searches are taking place.
>
> : The issue with Nutch's IndexSorter is that it allows you to reorder docids in
> : an arbitrary manner, using a user-supplied mapping between old and new docids,
> : which can be based on values retrieved from the current index or from any
> : other source. So I think this would be the main value of this class applicable
> : to various scenarios.
>
> No argument whatsoever. IndexSorter seems like a sweet tool to have in
> the Lucene toolbox for letting people reorder the docs in an index by
> arbitrary criteria ... but for people with the specific case of
> *preferring* that recently added docs be in front of older docs, automatic
> segment reordering seems like it would also be a handy tool to have in the
> toolbox so that documents could "bubble up" gradually. (maybe as a new
> MergePolicy? ... probably need some API changes to allow order to be
> specified)
>
> There would definitely be tradeoffs people would need to consider before
> using it -- but those tradeoffs would probably also apply if they wanted
> to use IndexSorter.
>
> -Hoss
Re: Performance guarantees and index format
Posted by Chris Hostetter <ho...@fucit.org>.
: I think this would be too messy - currently we can be sure of the simple rule
: that documents added to the index get incrementally higher docids, i.e. the
: higher the docid the more recent is the document. I think it would be much
: simpler to write a FilterIndexReader that simply reverses the order of docids.
First off: you only have that guarantee while indexing ... if you
frequently reorder docs using something like the IndexSorter then that
rule no longer applies (and you must not care or you wouldn't have
reordered everything)
Second: using IndexSorter after an index is completely built is definitely
a simpler, cleaner way of accomplishing something like this -- but it
only seems adequate for situations in which "index building" is separate
and distinct from "index searching" ... I can't see how it would work very
easily in situations where you are continuously performing incremental
updates while searches are taking place.
: The issue with Nutch's IndexSorter is that it allows you to reorder docids in
: an arbitrary manner, using a user-supplied mapping between old and new docids,
: which can be based on values retrieved from the current index or from any
: other source. So I think this would be the main value of this class applicable
: to various scenarios.
No argument whatsoever. IndexSorter seems like a sweet tool to have in
the Lucene toolbox for letting people reorder the docs in an index by
arbitrary criteria ... but for people with the specific case of
*preferring* that recently added docs be in front of older docs, automatic
segment reordering seems like it would also be a handy tool to have in the
toolbox so that documents could "bubble up" gradually. (maybe as a new
MergePolicy? ... probably need some API changes to allow order to be
specified)
There would definitely be tradeoffs people would need to consider before
using it -- but those tradeoffs would probably also apply if they wanted
to use IndexSorter.
-Hoss
Re: Performance guarantees and index format
Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Hostetter wrote:
> : What this issue doesn't discuss is what to do with partial results obtained
> : when a timeout occurred. As the original poster points out, document lists are
> : traversed in the order they were added and not the order of their importance,
> : which introduces a bias to partial results in that they reflect results from a
> : random sample (which is likely not the most relevant, i.e. there could have
> : been more relevant results later in the traversal order).
> :
> : The answer to this issue is org.apache.nutch.indexer.IndexSorter, which
>
> skimming this it does seem like a refactored version that was less
> Nutch-specific could make a handy contrib ... but it also seems like there
> may be a simpler approach for the (i assume) common case of preferring docs
> that were indexed later....
>
> if we eliminate the requirement for *strict* preference of recent
> documents and make that a looser desire, then couldn't we do a
> pretty good job if we just changed segment merging to reverse the
> order of the segments before each merge? it wouldn't be very useful to
> start doing this on an index that's already a decent size, but if this was
> happening on every merge right from the very beginning, then the most
> recent documents would percolate to the front of the index, right?
>
> The only downside i can think of would be that docids would frequently
> (and not very predictably) change even if there were no deletions ... but
> you'd pay that same penalty with something like Nutch's IndexSorter.
>
> I'm not much of an expert on segment merging ... but with the exception of
> docid order i can't think of many reasons why there couldn't be a merger
> that reversed the order of the segments.
I think this would be too messy - currently we can be sure of the simple
rule that documents added to the index get incrementally higher docids,
i.e. the higher the docid the more recent is the document. I think it
would be much simpler to write a FilterIndexReader that simply reverses
the order of docids.
The issue with Nutch's IndexSorter is that it allows you to reorder
docids in an arbitrary manner, using a user-supplied mapping between old
and new docids, which can be based on values retrieved from the current
index or from any other source. So I think this would be the main value
of this class applicable to various scenarios.
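To illustrate the difference between the two kinds of mapping, here is a
minimal sketch in plain Java. It deliberately avoids the real
FilterIndexReader and IndexSorter APIs: DocIdRemapSketch, reverseMapping,
mappingByScore and the per-document score array are hypothetical names
chosen for the example. Reversal is a fixed formula, while the arbitrary
case needs a full old-to-new table computed from some external value.

```java
import java.util.Arrays;

/** Toy model only: old-to-new docid tables, independent of any Lucene or Nutch class. */
class DocIdRemapSketch {

    /** Reversal: old docid i becomes maxDoc - 1 - i, so the newest doc gets docid 0. */
    static int[] reverseMapping(int maxDoc) {
        int[] oldToNew = new int[maxDoc];
        for (int oldId = 0; oldId < maxDoc; oldId++) {
            oldToNew[oldId] = maxDoc - 1 - oldId;
        }
        return oldToNew;
    }

    /**
     * Arbitrary reordering: docs with a higher score get a lower new docid.
     * Ties keep their original relative order (Arrays.sort on objects is stable).
     */
    static int[] mappingByScore(float[] scoreByOldId) {
        Integer[] oldIds = new Integer[scoreByOldId.length];
        for (int i = 0; i < oldIds.length; i++) {
            oldIds[i] = i;
        }
        // Sort old docids by descending score.
        Arrays.sort(oldIds, (a, b) -> Float.compare(scoreByOldId[b], scoreByOldId[a]));
        int[] oldToNew = new int[scoreByOldId.length];
        for (int newId = 0; newId < oldIds.length; newId++) {
            oldToNew[oldIds[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(reverseMapping(4)));
        // prints [3, 2, 1, 0]
        System.out.println(Arrays.toString(mappingByScore(new float[] {0.2f, 0.9f, 0.5f})));
        // prints [2, 0, 1]
    }
}
```

The reversal table never needs to be stored since it can be computed on the
fly from maxDoc, which is why a reader-level wrapper is enough for that
case; the score-based table has to be built from data before any docid can
be translated.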
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com