Posted to java-user@lucene.apache.org by neils <ne...@gmx.net> on 2006/07/29 10:43:04 UTC

Sorting

Hi,

Lucene sorts hits by relevance by default. Since I would like to sort them by
a special string field rather than by relevance, I was thinking about dropping
the default relevance sort and implementing sorting in alphabetical order
instead.

The reason is that sorting in alphabetical order takes a lot of time. Does
this make sense, and how can it be done? Or is there another fast way to sort
in alphabetical order?

Currently I'm using Lucene 1.9.1 (the .NET port). The index size is currently
about 2 GB; the index is split into two parts that are accessed by a
ParallelReader.
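For the alphabetical comparison itself, java.text.Collator gives locale-correct, case-aware ordering (unlike raw String.compareTo, which sorts all uppercase letters before lowercase). A plain-JDK sketch with hypothetical field values:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Sketch: locale-aware alphabetical ordering of field values -- the same
// comparison a string-field sort would apply. Field values are hypothetical.
public class AlphaSortSketch {
    public static void main(String[] args) {
        String[] titles = {"Zebra", "apple", "Mango"};
        // Collator compares by letter first, so "apple" sorts before "Mango"
        // even though 'a' > 'M' in raw char order.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        Arrays.sort(titles, collator);
        System.out.println(Arrays.toString(titles)); // [apple, Mango, Zebra]
    }
}
```

In Lucene itself the usual route is to pass a Sort on the string field to IndexSearcher.search(Query, Sort); the first such search populates the FieldCache, and subsequent sorted searches are fast.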

Hopefully you can give me a tip if there is a way :-))

Thanks a lot :-)
-- 
View this message in context: http://www.nabble.com/Sorting-tf2019404.html#a5552408
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Sorting

Posted by karl wettin <ka...@gmail.com>.
On Mon, 2006-07-31 at 11:54 +0200, Andrzej Bialecki wrote:
> === quoted message trimmed; Andrzej's message appears in full below ===

I love my 64bit Solaris and -XX:+AggressiveHeap.

:D




Re: Sorting

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Hostetter wrote:
> 1) I didn't know there were any JVMs that limited the heap size to 1GB ...
> a 32bit address space would impose a hard limit of 4GB, and I've heard
> that Windows limits process to 2GB, but I don't know of any JVMs that have
> 1GB limits.
>   

I believe all Win32 JVM-s have a limit of ~1.3GB (~1.9GB if using 
rebase.exe), which quite often can't be reached anyway due to memory 
fragmentation. Read here for a somewhat funny analysis:

http://www.oreillynet.com/digitalmedia/blog/2005/01/what_is_the_largest_text_file.html

*nix OS-es on 32-bit platforms indeed have 4GB addressing space, but at 
least 1GB of this space is reserved for kernel use ... If I'm not 
mistaken most 2.6.x Linux distros run now with 1GB/3GB split between 
kernel/user space, and 2.4.x kernels ran with 2GB/2GB split.
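Rather than reasoning from OS address-space rules, the actual heap ceiling is easy to check empirically from inside the JVM; a plain-JDK sketch:

```java
// Sketch: ask the running JVM for its heap ceiling directly. Run with
// e.g. -Xmx512m and the flag's value is reflected here, whatever the
// platform's theoretical address-space limit.
public class MaxHeap {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap MB: " + (maxBytes / (1024 * 1024)));
    }
}
```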

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Advice on Custom Sorting

Posted by Paul Lynch <pa...@yahoo.com>.
Thanks again, Erick, for taking the time.

I agree that the CachingWrapperFilter as described
under "using a custom filter" in LIA is probably my
best bet. I wanted to check whether anything I wasn't
aware of had been added in Lucene releases since the
book was written.

Cheers again.

--- Erick Erickson <er...@gmail.com> wrote:

> === quoted message trimmed; Erick's reply appears in full below ===




Re: Advice on Custom Sorting

Posted by Erick Erickson <er...@gmail.com>.
You were probably right. See below....

On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
>
> Thanks for the quick response Erick.
>
> "index the documents in your preferred list with a
> field and index your non-preferred docs with a field
> subid?"
>
> I considered this approach and dismissed it due to the
> actual list of preferred ids changing so frequently
> (every 10 mins...ish) but maybe I was a little hasty
> in doing so. I will investigate the overhead in
> updating all docs in the index each time my list
> refreshes. I had assumed it was too prohibitive but I
> know what they say about assumptions :)


Lots of overhead. There's really no capability of updating a doc in place.
This has been on several people's wish-list. You'd have to delete every doc
that you wanted to change and re-add it. I don't know how many documents
this would be, if just a few it'd be OK, but if many.... I was assuming (and
I *do* know what they say about assumptions <G>) that you were just adding
to your preferred doc list every few minutes, not changing existing
documents....

It really does sound like you want a filter. I was pleasantly surprised by
how quickly filters are built. You could use a CachingWrapperFilter
to have the filter kept around automatically (I guess you'd only have one
per index update) to minimize your overhead for building filters, and
perhaps warm up your cache by firing a canned query at your searcher when
you re-open your IndexReader after index update. I think you'd have to do
the two-query thing in this case. If you wanted to really get exotic, you
could build your filter when you created your index and store it in a *very
special document* and just read it in the first time you needed it. Although
I've never used it, I guess you can store binary data. From the Javadoc

Field(String name, byte[] value, Field.Store store)
          Create a stored field with binary value.
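Round-tripping a filter's java.util.BitSet (what Filter.bits() returned in this era of Lucene) through a byte[] for such a binary stored field could look like the following plain-JDK sketch (modern try-with-resources syntax; names hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.BitSet;

// Sketch: serialize a filter's BitSet to a byte[] (suitable for the
// Field(String, byte[], Field.Store) constructor quoted above) and
// read it back. Plain JDK serialization; no Lucene needed to test it.
public class FilterBytes {
    static byte[] toBytes(BitSet bits) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(bits);
        }
        return bos.toByteArray();
    }

    static BitSet fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (BitSet) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BitSet bits = new BitSet();
        bits.set(3);
        bits.set(42);
        BitSet back = fromBytes(toBytes(bits)); // round trip preserves set bits
        System.out.println(back); // {3, 42}
    }
}
```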

The only thing here is that the filters (probably wrapped in a
ConstantScoreQuery) lose relevance, but since you're sorting "one of several
ways", that probably doesn't matter.

Best
Erick



> === remainder of quoted exchange trimmed; the earlier messages appear in full below ===

Re: Advice on Custom Sorting

Posted by Paul Lynch <pa...@yahoo.com>.
Thanks for the quick response Erick.

"index the documents in your preferred list with a 
field and index your non-preferred docs with a field
subid?"

I considered this approach and dismissed it due to the
actual list of preferred ids changing so frequently
(every 10 mins...ish) but maybe I was a little hasty
in doing so. I will investigate the overhead in
updating all docs in the index each time my list
refreshes. I had assumed it was too prohibitive but I
know what they say about assumptions :)

Should I be able to make this workable, the beauty of
this solution would be that I would actually only need
to query once. If I had a field which indicates
whether it is a preferred doc or not, "all" I will
have to do is sort across the two fields.

Thanks again Erick. Any other suggestions are most
welcome.

Regards,
Paul

--- Erick Erickson <er...@gmail.com> wrote:

> === quoted message trimmed; Erick's message appears in full below ===




Re: Advice on Custom Sorting

Posted by Erick Erickson <er...@gmail.com>.
OK, a really "off the top of my head" response, but what the heck....

I'm not sure you need to worry about filters. Would it work for you to index
the documents in your preferred list with a  field (called, at the limit of
my creativity, preferredsubid <G>) and index your non-preferred docs with a
field subid? You'd still have to fire two queries, one on subid (to pick up
the ones in your non-preferred list) and one on preferredsubid.

Since there's no requirement that all docs have the same fields, your
preferred docs could have ONLY the preferredsubid field and your
non-preferred docs ONLY the subid field. That way you wouldn't have to worry
about picking the docs up twice.

Merging should be simple then, just iterate over however many hits you want
in your preferredHits object, then tack on however many you want from your
nonPreferredHits object. All the code for the two queries would be
identical, the only difference being whether you specify "subid" or
"preferredsubid"......

I can imagine several variations on this scenario, but they depend on your
problem space.

Whether this is the "best" or not, I leave as an exercise for the reader.

Best
Erick

On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
> === quoted message trimmed; Paul's original message appears in full below ===

RE: Sorting

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
> Scorers are by contract expected to score docs in docId order

This was my missing link. Now it makes sense to me to use a buffered
RandomAccessFile and not bother with the presort.

Many thanks, Chris, that was very well explained. 

I'll have a crack at a lean-memory SortComparatorSource implementation,
which uses a buffered RandomAccessFile, as described.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: 02 August 2006 04:32
To: java-user@lucene.apache.org
Subject: RE: Sorting

=== quoted message trimmed; Chris's message appears in full below ===



Advice on Custom Sorting

Posted by Paul Lynch <pa...@yahoo.com>.
Hi All,

I have an index containing documents which all have a
field called SubId which holds the ID of the
Subscriber that submitted the data. This field is
STORED and UN_TOKENIZED

When I am querying the index, the user can choose a
number of different ways to sort the Hits. The problem
is that I have a list of SubIds that should appear at
the top of the results list regardless of how the
index is sorted. In other words, let's suppose the Hits
should be sorted by DateAdded, I require the Hits to
be sorted by DateAdded for the SubIds in my list and
then by DateAdded for the SubIds not in my list.

From reading previous discussions on the mailing list,
I believe I could achieve what I need by writing
custom filters i.e. Run the query first with a custom
filter for the SubIds in my list and then a second
time with a custom filter for the SubIds not in my
list and then "merge" the results.

I suppose my question is simple: Is there a better way
to achieve this?

Couple of bits of info which would influence the best
design:

- Index contains roughly 5M documents
- There can be up to 10K different unique SubIds
- My "Preferred SubId List" could contain any
combination of the 10K SubIds including all or none of
them
- My "Preferred SubId List" gets updated about 10
times and hour so I could cache the custom filters
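Since the filter only changes when the list does, caching amounts to keeping one bit set per list version; a simplified plain-JDK sketch (the doc-to-SubId mapping is hypothetical; a real Lucene filter would walk TermDocs to set the bits):

```java
import java.util.BitSet;
import java.util.Set;

// Sketch: a "preferred" filter is just a BitSet over doc IDs, rebuilt
// only when the preferred-SubId list changes (roughly every 10 minutes
// in this scenario) and reused by every query in between.
public class PreferredFilterCache {
    private BitSet cached;
    private Set<String> cachedKey;

    BitSet filterFor(Set<String> preferredSubIds, int[] docToSub, String[] subIds) {
        if (preferredSubIds.equals(cachedKey)) return cached; // reuse until the list changes
        BitSet bits = new BitSet(docToSub.length);
        for (int doc = 0; doc < docToSub.length; doc++) {
            if (preferredSubIds.contains(subIds[docToSub[doc]])) bits.set(doc);
        }
        cachedKey = Set.copyOf(preferredSubIds);
        cached = bits;
        return bits;
    }

    public static void main(String[] args) {
        String[] subIds = {"A", "B"};
        int[] docToSub = {0, 1, 0, 1}; // doc -> index into subIds
        PreferredFilterCache cache = new PreferredFilterCache();
        BitSet bits = cache.filterFor(Set.of("A"), docToSub, subIds);
        System.out.println(bits); // docs 0 and 2 carry SubId "A": {0, 2}
    }
}
```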

Thanks in advance,
Paul



RE: Sorting

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm with you now. So you do seeks in your comparator. For a large index you
: might as well use java.io.RandomAccessFile for the "array", because there
: would be little value in buffering when the comparator is liable to jump all

yep .. that's what I was getting at ... but I'm not so sure that buffering
won't be useful.  If I'm not mistaken, all Scorers are by contract
expected to score docs in docId order, so when your hits are being
collected for sorting, you should always be moving forward in the file
-- but you may skip ahead a lot when the result set isn't a high percentage
of the total number of docs.
(I may be wrong about all Scorers going in docId order ... if you
explicitly use the 1.4 BooleanScorer you may not get that behavior, but I
think everything else works that way ... perhaps someone else can verify
that)

: around the file. This sounds very expensive, though. If you don't open a
: Searcher too frequently, it makes sense (in my muddled mind) to pre-sort to
: reduce the number of seeks. That was the half-baked idea of the third file,
: which essentially orders document IDs.

presort on what exactly, the field you want to sort on?  -- That's
essentially what the TermEnum is.  I'm not sure how having that helps you
... let's assume you've got some data structure (let's not worry about the
file/ram or TermEnum distinction just yet) containing every document in
your index of 100,000,000 products sorted on the price field, and you've
done a search for "apple" and there are 1,000,000 docIds for matching
products ready to be collected by your new custom scoring code ... how
does the full list of all docIds sorted by price help you as you are given
docIds and have to decide if that doc is better or worse than the docs
you've already collected?
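Chris's point can be sketched with a bounded priority queue: each incoming docId is compared only against the worst document kept so far, so the full pre-sorted list is never consulted (arrays and sizes hypothetical):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: hits arrive one docId at a time (in docId order); a bounded
// priority queue keeps only the current best N by the sort key, here a
// per-doc price looked up by docId.
public class TopNByPrice {
    public static void main(String[] args) {
        int[] priceByDoc = {900, 150, 700, 300, 50}; // sort key, indexed by docId
        int n = 3;
        // Max-heap on price: the WORST kept doc sits on top, ready to be
        // evicted when a cheaper doc arrives.
        PriorityQueue<Integer> worstFirst = new PriorityQueue<>(
            Comparator.comparingInt((Integer d) -> priceByDoc[d]).reversed());
        for (int docId = 0; docId < priceByDoc.length; docId++) {
            worstFirst.add(docId);
            if (worstFirst.size() > n) worstFirst.poll(); // drop current worst
        }
        List<Integer> kept = new ArrayList<>(worstFirst);
        kept.sort(Comparator.comparingInt(d -> priceByDoc[d]));
        System.out.println(kept); // three cheapest docIds: [4, 1, 3]
    }
}
```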

: > Bear in mind, there have been some improvements recently to the ability to
: grab individual stored fields per document....
:
: I can't see anything like that in 2.0. Is that something in the Lucene HEAD
: build?

I guess so ... search the java-dev archives for "lazy field loading" or
"Fieldable" .. that should find some of the discussion about it and the
jira issue with the changes.


-Hoss




RE: Sorting

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
>  file seeks instead of array lookups

I'm with you now. So you do seeks in your comparator. For a large index you
might as well use java.io.RandomAccessFile for the "array", because there
would be little value in buffering when the comparator is liable to jump all
around the file. This sounds very expensive, though. If you don't open a
Searcher too frequently, it makes sense (in my muddled mind) to pre-sort to
reduce the number of seeks. That was the half-baked idea of the third file,
which essentially orders document IDs.
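The disk-backed "array" described here is a file of fixed-width ints addressed at offset docId * 4; a plain-JDK sketch:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: one int sort key per document, written at offset docId * 4
// and read back with a seek -- the structure a lean-memory comparator
// would consult in place of an in-RAM FieldCache array.
public class DiskIntArray {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("sortkeys", ".bin");
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "rw")) {
            int[] keys = {5, 17, 2}; // hypothetical sort keys, indexed by docId
            for (int docId = 0; docId < keys.length; docId++) {
                file.seek((long) docId * 4);
                file.writeInt(keys[docId]);
            }
            file.seek(1L * 4); // random access: fetch the key for docId 1
            System.out.println(file.readInt()); // 17
        } finally {
            Files.delete(path);
        }
    }
}
```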

> Bear in mind, there have been some improvements recently to the ability to
grab individual stored fields per document....

I can't see anything like that in 2.0. Is that something in the Lucene HEAD
build?

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: 01 August 2006 09:37
To: java-user@lucene.apache.org
Subject: RE: Sorting


: I take your point that Berkeley DB would be much less clumsy, but an
: application that's already using a relational database for other purposes
: might as well use that relational database, no?

if you already have some need to access data about each matching doc from a
relational DB, then sure you might as well let it sort for you -- but just
because your APP has some DB connections open doesn't mean that's a
worthwhile reason to ask it to do the sort ... your app might have some
network connections open to an IMAP server as well .. that doesn't mean you
should convert the docs to email messages and ask the IMAP server to sort
them :)

: I'm not really with you on the random access file, Chris. Here's where I
am
: up to with my [mis-]understanding...
:
: I want to sort on 2 terms. Happily these can be ints (the first is an INT
: corresponding to a 10 minute timestamp "YYMMDDHHI" and the second INT is a
: hash of a string, used to group similar documents together within those 10
: minute timestamps). When I initially warm up the FieldCache (first search
: after opening the Searcher), I start by generating two random access files
: with int values at offsets corresponding to document IDs for each of
these;
: the first file would have ints corresponding to the timestamp and the
second
: would have integers corresponding to the hash. I'd then need to generate a
: third file which is equivalent to an array dimensioned by document ID,
with
: document IDs in compound sort order??

i'm not sure why you think you need the third file ... you should be able to
use the two files you created exactly the way the existing code would use
the two arrays if you were using an in memory FieldCache (with file seeks
instead of array lookups) .. i think the class you want to look at is
FieldSortedHitQueue

: In a big index, it will take a while to walk through all of the documents
to
: generate the first two random access files and the sort process required
to
: generate the sorted file is going to be hard work.

well .. yes.  but that's the trade off, the reason for the RAM based
FieldCache is speed .. if you don't have that RAM to use, then doing the
same things on disk gets slower.


Bear in mind, there have been some improvements recently to the ability to
grab individual stored fields per document (FieldSelector is the name of the
class i think) ... i haven't tried those out yet, but they could make
Sorting on a stored field (which wouldn't require building up any cache -
RAM or Disk based) feasible regardless of the size of your result sets ...
but i haven't tried that yet.



-Hoss



RE: Sorting

Posted by Chris Hostetter <ho...@fucit.org>.
: I take your point that Berkley DB would be much less clumsy, but an
: application that's already using a relational database for other purposes
: might as well use that relational database, no?

if you already have some need to access data about each matching doc from
a relational DB, then sure you might as well let it sort for you -- but
just because your APP has some DB connections open doesn't mean that's a
worthwhile reason to ask it to do the sort ... your app might have some
network connections open to an IMAP server as well .. that doesn't mean
you should convert the docs to email messages and ask the IMAP server to
sort them :)

: I'm not really with you on the random access file, Chris. Here's where I am
: up to with my [mis-]understanding...
:
: I want to sort on 2 terms. Happily these can be ints (the first is an INT
: corresponding to a 10 minute timestamp "YYMMDDHHI" and the second INT is a
: hash of a string, used to group similar documents together within those 10
: minute timestamps). When I initially warm up the FieldCache (first search
: after opening the Searcher), I start by generating two random access files
: with int values at offsets corresponding to document IDs for each of these;
: the first file would have ints corresponding to the timestamp and the second
: would have integers corresponding to the hash. I'd then need to generate a
: third file which is equivalent to an array dimensioned by document ID, with
: document IDs in compound sort order??

i'm not sure why you think you need the third file ... you should be
able to use the two files you created exactly the way the existing code
would use the two arrays if you were using an in memory FieldCache (with
file seeks instead of array lookups) .. i think the class you want to look
at is FieldSortedHitQueue

: In a big index, it will take a while to walk through all of the documents to
: generate the first two random access files and the sort process required to
: generate the sorted file is going to be hard work.

well .. yes.  but that's the trade off, the reason for the RAM based
FieldCache is speed .. if you don't have that RAM to use, then doing
the same things on disk gets slower.


Bear in mind, there have been some improvements recently to the ability to
grab individual stored fields per document (FieldSelector is the name of
the class i think) ... i haven't tried those out yet, but they could make
Sorting on a stored field (which wouldn't require building up any cache -
RAM or Disk based) feasible regardless of the size of your result sets ...
but i haven't tried that yet.



-Hoss




RE: Sorting

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
Ref 1: I was just about to show you a link at Sun but I realise that it was
my misread! OK, so the maximum heap is 2G on a 32-bit Linux platform, which
doubles the numbers, and yes indeed 64 bits seems like a good idea, if
having sort indexes in RAM is a good use of resources. But there must be a
better alternative to using 4 bytes of RAM per document per sort field.

Ref 2: "holding a laptop in both hands, and using the corner of it to type
letters on the keyboard of another computer."... I like that analogy... I
may even find a use for my laptop now :-) 

I take your point that Berkley DB would be much less clumsy, but an
application that's already using a relational database for other purposes
might as well use that relational database, no?

I'm not really with you on the random access file, Chris. Here's where I am
up to with my [mis-]understanding...

I want to sort on 2 terms. Happily these can be ints (the first is an INT
corresponding to a 10 minute timestamp "YYMMDDHHI" and the second INT is a
hash of a string, used to group similar documents together within those 10
minute timestamps). When I initially warm up the FieldCache (first search
after opening the Searcher), I start by generating two random access files
with int values at offsets corresponding to document IDs for each of these;
the first file would have ints corresponding to the timestamp and the second
would have integers corresponding to the hash. I'd then need to generate a
third file which is equivalent to an array dimensioned by document ID, with
document IDs in compound sort order??

In a big index, it will take a while to walk through all of the documents to
generate the first two random access files and the sort process required to
generate the sorted file is going to be hard work. 
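Sketched in plain Java (file names and the int-per-docId layout here are my own illustrative assumptions, not anything Lucene provides), the comparator needs only the first two files -- each lookup becomes a seek instead of an array access, and no third, pre-sorted file is required:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Comparator;

/**
 * Rough sketch of the two-file compound sort described above: one file of
 * timestamp ints and one file of hash ints, each storing a doc's value at
 * offset docId * 4. Primary key is the timestamp, tie-break is the hash.
 */
public class TwoKeyDiskComparator implements Comparator<Integer> {
    private final RandomAccessFile timestamps;
    private final RandomAccessFile hashes;

    public TwoKeyDiskComparator(RandomAccessFile timestamps, RandomAccessFile hashes) {
        this.timestamps = timestamps;
        this.hashes = hashes;
    }

    /** Seek to docId * 4 and read the stored int (the "array lookup" on disk). */
    private static int read(RandomAccessFile f, int docId) {
        try {
            f.seek((long) docId * 4);
            return f.readInt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public int compare(Integer a, Integer b) {
        int cmp = Integer.compare(read(timestamps, a), read(timestamps, b));
        return cmp != 0 ? cmp : Integer.compare(read(hashes, a), read(hashes, b));
    }
}
```

A comparator like this could back a priority queue during hit collection, the same way the in-memory FieldCache arrays back FieldSortedHitQueue -- just slower, because every comparison costs up to four seeks.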

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: 31 July 2006 09:34
To: java-user@lucene.apache.org
Subject: Re: Sorting


1) I didn't know there were any JVMs that limited the heap size to 1GB ...
a 32bit address space would impose a hard limit of 4GB, and I've heard that
Windows limits process to 2GB, but I don't know of any JVMs that have 1GB
limits.

If you really need to deal with indexes big enough for that to make a
difference, you probably want to look into 64bit hardware.

2) ...

: We're going to need to maintain a set of sort indexes for documents in a
: large index too, and I'm interested in suggestions for the best/easiest
: way to maintain non-RAM-based (or not entirely RAM-based) sort index
: which is external to Lucene. Would using MySQL for sort indexing be "a
: sledgehammer to crack a nut", I wonder? I've not really thought through
: the RAMifications (sorry!) of this approach. I wonder if anyone else
: here has attempted to integrate an external sort using a database?

The analogy that comes to mind for me is not "a sledgehammer to crack a nut"
... more along the lines of "holding a laptop in both hands, and using the
corner of it to type letters on the keyboard of another computer."  Using a
relational DB in conjunction with Lucene just to do some sorting on disk
seems like a really gratuitous and unnecessary use of a relational DB.

The only reason Field sorting in Lucene uses a lot of RAM is because of the
FieldCache, which provides an easy way to look up the sort value for a given
doc during hit collection in order to rank them in a priority queue
-- namely an array indexed by docId.  You could just as easily store that
data on disk, you just need an API that lets you look up things by numeric
id.  A Berkeley DB "map" comes to mind ... or even random access files where
the value is stored in offsets based on the docId (would have some trickiness
if you wanted String sorting but would work great for numerics).
This would eliminate the high RAM usage, but would be a lot slower because
of the disk access (especially on the first search when the "FieldCache"
was being built)


Alternatively, if you assume your result sets are going to be "small", you
could collect all of the docIds into a set and then iterate over a complete
pass of a TermEnum/TermDocs for your field looking up the sort values for
each match -- in essence doing the same work as when building the FieldCache
on each search, but only for the docs that match that search.  Really low
memory usage, no additional disk usage -- just much slower.



-Hoss



Re: Sorting

Posted by Chris Hostetter <ho...@fucit.org>.
1) I didn't know there were any JVMs that limited the heap size to 1GB ...
a 32bit address space would impose a hard limit of 4GB, and I've heard
that Windows limits process to 2GB, but I don't know of any JVMs that have
1GB limits.

If you really need to deal with indexes big enough for that to make a
difference, you probably want to look into 64bit hardware.

2) ...

: We're going to need to maintain a set of sort indexes for documents in a
: large index too, and I'm interested in suggestions for the best/easiest
: way to maintain non-RAM-based (or not entirely RAM-based) sort index
: which is external to Lucene. Would using MySQL for sort indexing be "a
: sledgehammer to crack a nut", I wonder? I've not really thought through
: the RAMifications (sorry!) of this approach. I wonder if anyone else
: here has attempted to integrate an external sort using a database?

The analogy that comes to mind for me is not "a sledgehammer to crack a
nut" ... more along the lines of "holding a laptop in both hands, and
using the corner of it to type letters on the keyboard of another
computer."  Using a relational DB in conjunction with Lucene just to do
some sorting on disk seems like a really gratuitous and unnecessary use
of a relational DB.

The only reason Field sorting in Lucene uses a lot of RAM is because of
the FieldCache, which provides an easy way to look up the sort value for a
given doc during hit collection in order to rank them in a priority queue
-- namely an array indexed by docId.  You could just as easily store that
data on disk, you just need an API that lets you look up things by numeric
id.  A Berkeley DB "map" comes to mind ... or even random access files
where the value is stored in offsets based on the docId (would have some
trickiness if you wanted String sorting but would work great for numerics).
This would eliminate the high RAM usage, but would be a lot slower because
of the disk access (especially on the first search when the "FieldCache"
was being built)
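For the numeric case that sketch is only a few lines of plain Java. The layout below -- one int per document at offset docId * 4, with an illustrative class and file name -- is my own assumption for demonstration, not a Lucene API:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Sketch of a disk-backed "FieldCache" for numeric sort values:
 * the value for a document lives at file offset docId * 4.
 */
public class DiskFieldCache implements AutoCloseable {
    private final RandomAccessFile file;

    public DiskFieldCache(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    /** Record a doc's sort value, as the one-time warm-up pass would. */
    public void put(int docId, int sortValue) throws IOException {
        file.seek((long) docId * 4);
        file.writeInt(sortValue);
    }

    /** Look up a doc's sort value during hit collection: a seek instead of array[docId]. */
    public int get(int docId) throws IOException {
        file.seek((long) docId * 4);
        return file.readInt();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Swapping the RAM array for this trades memory for seek latency, exactly as described above.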


Alternatively, if you assume your result sets are going to be "small",
you could collect all of the docIds into a set and then iterate over a
complete pass of a TermEnum/TermDocs for your field looking up the sort
values for each match -- in essence doing the same work as when building
the FieldCache on each search, but only for the docs that match that
search.  Really low memory usage, no additional disk usage -- just much
slower.



-Hoss




Re: Sorting

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
The limit is much less than Integer.MAX_VALUE (2,147,483,647), unless
you have a VM which can run in more than 1G of heap. 1G limits you to a
theoretical number of 256M (268,435,456) documents with 4 bytes per
array element. In practice it will be somewhat less, because there
are other things which need heap too.

We're going to need to maintain a set of sort indexes for documents in a
large index too, and I'm interested in suggestions for the best/easiest
way to maintain non-RAM-based (or not entirely RAM-based) sort index
which is external to Lucene. Would using MySQL for sort indexing be "a
sledgehammer to crack a nut", I wonder? I've not really thought through
the RAMifications (sorry!) of this approach. I wonder if anyone else
here has attempted to integrate an external sort using a database?

On Sat, 2006-07-29 at 22:42 +0200, karl wettin wrote:
> On Sat, 2006-07-29 at 12:39 -0700, Jason Calabrese wrote:
> > One way to make an alphabetic sort very fast is to presort your
> > docs before adding them to the index.  If you do this you can then
> > just sort by index order.  We are using this for a large index (1
> > million+ docs) and it works very well, and seems even slightly faster
> > than relevance sorting.
> > 
> > Using this approach may create some maintenance issues since you
> > can't add a new doc to the index at a specified position.  Instead you
> > will need to re-index everything. 
> 
> Instead of above I would probably choose an int[index size] where each
> position in the array represents the global order of that document. It's
> much easier to re-order that than re-indexing the whole corpus every
> time you want to insert something.
> 
> It limits your corpus to 2 billion items (Integer.MAX_VALUE). And it
> will consume 32 bits of RAM per document.
> 
> 




Re: Sorting

Posted by Chris Hostetter <ho...@fucit.org>.
: thanks a lot for your reply. Currently I'm using a parallelreader, because one
: part of index is in the memory and one part is on disk. It seems like
: parallelreader has a problem with sorting. So i have three questions:
:
: 1. Is there a known bug in the parallelreader?

not that i know of ... can you elaborate on why you think there is a
problem?

: 2. Is it true, that only fields with one single term can be sorted?

there can be at most one indexed term per document ... which means you either
need to index the Field UN_TOKENIZED, or you need to use KeywordAnalyzer, or
you need to know that you are only ever putting in values that get
analyzed to single Tokens.

: 3. The fields i would like to sort are indexed but not stored...is this a
: problem for sorting?

they must be indexed, whether they are stored or not is irrelevant.



-Hoss




Re: Sorting

Posted by neils <ne...@gmx.net>.
Hi Chris,

thanks a lot for your reply. Currently I'm using a parallelreader, because one
part of the index is in memory and one part is on disk. It seems like the
parallelreader has a problem with sorting. So I have three questions:

1. Is there a known bug in the parallelreader?
2. Is it true that only fields with one single term can be sorted?
3. The fields I would like to sort are indexed but not stored... is this a
problem for sorting?

Thank you very much ;-))
-- 
View this message in context: http://www.nabble.com/Sorting-tf2019404.html#a5561163
Sent from the Lucene - Java Users forum at Nabble.com.




Re: Sorting

Posted by Chris Hostetter <ho...@fucit.org>.
: thanks a lot for your helpful answers :-)) I think I will try it like karl
: suggests, because i have to update the index every day :-))

All of the suggestions so far assume that you have some way of mapping
each document to a number that indicates its relative position in the
total space of ordered documents -- which means that if you are updating
the index on a daily basis, you are going to need some serious juggling to
make sure that when a new document comes along, you figure out the "right"
integer to put it into the correct order -- not to mention you have no
solution for the day when you discover that "banana" is in your index with
a sort value of 12345 and "bandana" is in your index with sort value 12346
-- and now you want to add "banco" and you don't have any room for it.

A bigger problem is the fact that sorting by an int field takes
just as much time as sorting by a String field -- because Lucene's sorting
code is already doing the String->int mapping for you and putting it into
a FieldCache.  The only real difference between sorting on an int field
and a String field is how much RAM that FieldCache uses (typically more
for Strings)
(NOTE: there are some caveats to this when dealing with MultiSearcher, but
you didn't mention using a MultiSearcher, so i'm glossing over that)

What *does* take more time when dealing with String fields is building the
FieldCache (because sorting a bunch of strings tends to take longer than
sorting a bunch of ints) ... but this Cache will only be built up the
first time you sort on that field for a given IndexReader/IndexSearcher
... as long as you keep reusing the same IndexSearcher, things should be
"fast".

Without confirmation to the contrary, i'm guessing you aren't reusing the
same IndexSearcher, and that's why it seems like sorting on Strings is
slow.





-Hoss




Re: Sorting

Posted by neils <ne...@gmx.net>.
Hi,

thanks a lot for your helpful answers :-)) I think I will try it like karl
suggests, because I have to update the index every day :-))

Thanks a lot :-))
-- 
View this message in context: http://www.nabble.com/Sorting-tf2019404.html#a5558212
Sent from the Lucene - Java Users forum at Nabble.com.




Re: Sorting

Posted by karl wettin <ka...@gmail.com>.
On Sat, 2006-07-29 at 12:39 -0700, Jason Calabrese wrote:
> One way to make an alphabetic sort very fast is to presort your
> docs before adding them to the index.  If you do this you can then
> just sort by index order.  We are using this for a large index (1
> million+ docs) and it works very well, and seems even slightly faster
> than relevance sorting.
> 
> Using this approach may create some maintenance issues since you
> can't add a new doc to the index at a specified position.  Instead you
> will need to re-index everything. 

Instead of above I would probably choose an int[index size] where each
position in the array represents the global order of that document. It's
much easier to re-order that than re-indexing the whole corpus every
time you want to insert something.

It limits your corpus to 2 billion items (Integer.MAX_VALUE). And it
will consume 32 bits of RAM per document.
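This int[] idea can be sketched in a few lines of plain Java (the array contents and class name below are illustrative; keeping the ranks correct across daily index updates is the hard part):

```java
import java.util.Arrays;
import java.util.Comparator;

/**
 * Sketch of karl's suggestion: order[docId] holds the document's global
 * alphabetic rank, so hits can be sorted without comparing any strings.
 */
public class GlobalOrder {

    /** Sort a set of hit docIds by their precomputed global rank. */
    public static Integer[] sortHits(Integer[] hitDocIds, final int[] order) {
        Arrays.sort(hitDocIds, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                // int comparison only -- the strings were compared once, at build time
                return Integer.compare(order[a], order[b]);
            }
        });
        return hitDocIds;
    }

    public static void main(String[] args) {
        // ranks for docIds 0..3, e.g. doc 2 is alphabetically first
        int[] order = {3, 1, 0, 2};
        System.out.println(Arrays.toString(
                sortHits(new Integer[]{0, 1, 2, 3}, order)));
    }
}
```

Inserting a new document then means shifting ranks in the array rather than re-indexing the corpus.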




Re: Sorting

Posted by Jason Calabrese <ma...@jasoncalabrese.com>.
One way to make an alphabetic sort very fast is to presort your docs
before adding them to the index.  If you do this you can then just sort by
index order.  We are using this for a large index (1 million+ docs) and it
works very well, and seems even slightly faster than relevance sorting.

Using this approach may create some maintenance issues since you can't add a
new doc to the index at a specified position.  Instead you will need to 
re-index everything.


On Saturday 29 July 2006 01:43, neils wrote:
> Hi,
>
> Lucene sorts hits by relevance by default. Because I would like to sort them
> by a special string field and not by relevance, I was thinking about
> dropping the sorting by relevance as default and implementing sorting in
> alphabetic order.
>
> The reason is that sorting in alphabetic order takes a lot of time. Does this
> make sense, and how can it be done? Or is there another "fast" way to sort in
> alphabetic order?
>
> Currently I'm using lucene 1.9.1 (dotnet). The indexsize is currently about
> 2 GB and index is split in two parts and are accessed by a parallelreader.
>
> Hopefully you can give me a tip if there is a way :-))
>
> Thanks a lot :-)



Re: Sorting

Posted by karl wettin <ka...@gmail.com>.
On Sat, 2006-07-29 at 01:43 -0700, neils wrote:
> is there another "fast" way to sort by alphabetic order ?
>
> The indexsize is currently about 2 GB and index is split in two parts
> and are accessed by a parallelreader.

I don't know how much faster it is, but you could try to store the sort
order as a float value in a secondary field and sort by that value.

Find or calculate the value by using TermEnum and TermDocs at insert
time. This requires that the string field you are sorting by is one term
only. If there is more than one term you (might) have to figure this
out in your primary data storage.

(I've used this strategy in ORM CRUDs to minimize writing to data
storage when setting the order of an instance in an ordered list.)

This is what insert iterations would look like:

new insert: aaaa

[aaaa : 1.0]

new insert: aaab

[aaaa : 1.0]
[aaab : 2.0]

new insert: aaac

[aaaa : 1.0]
[aaab : 2.0]
[aaac : 3.0]

new insert: aaaba

[aaaa  : 1.0]
[aaab  : 2.0]
[aaaba : 2.5]
[aaac  : 3.0]

new insert: aaabb

[aaaa  : 1.0]
[aaab  : 2.0]
[aaaba : 2.5]
[aaabb : 2.75]
[aaac  : 3.0]

new insert: aaabc

[aaaa  : 1.0]
[aaab  : 2.0]
[aaaba : 2.5]
[aaabb : 2.75]
[aaabc : 2.875]
[aaac  : 3.0]

and so on.
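The iterations above can be sketched in plain Java. The TreeMap here is purely illustrative -- it stands in for the TermEnum/TermDocs lookup of the alphabetic neighbours at insert time, and the class name is made up:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of the midpoint scheme above: each key gets a numeric sort value
 * squeezed between its alphabetic neighbours, so inserts never require
 * re-numbering (or re-indexing) the existing entries.
 */
public class SortOrderIndex {
    private final TreeMap<String, Double> order = new TreeMap<String, Double>();

    /** Assign and record a sort value for key, between its neighbours. */
    public double insert(String key) {
        Map.Entry<String, Double> lo = order.lowerEntry(key);
        Map.Entry<String, Double> hi = order.higherEntry(key);
        double v;
        if (lo == null && hi == null) v = 1.0;                     // first entry
        else if (hi == null) v = lo.getValue() + 1.0;              // append at end
        else if (lo == null) v = hi.getValue() - 1.0;              // prepend at start
        else v = (lo.getValue() + hi.getValue()) / 2.0;            // squeeze between
        order.put(key, v);
        return v;
    }

    public static void main(String[] args) {
        SortOrderIndex idx = new SortOrderIndex();
        for (String s : new String[]{"aaaa", "aaab", "aaac", "aaaba", "aaabb", "aaabc"}) {
            System.out.println(s + " : " + idx.insert(s));
        }
    }
}
```

Note that repeatedly squeezing between two neighbours halves the gap each time, so the float precision eventually runs out and the values need re-spacing -- the same caveat applies to the field values stored in the index.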


> Currently I'm using lucene 1.9.1 (dotnet).

This is the Lucene java-user list. You might want to try the dotnet list instead.

