You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by neils <ne...@gmx.net> on 2006/07/29 10:43:04 UTC

Sorting

Hi,

Lucene sort hits by relevance as default. Cause i would like to sort them by
a special string field and not by relevance i was thinking about dropping
the sorting by relevance as default and implement sorting by alphabetic
order.

Reason that sorting by alpabetic order takes a lot of time. Makes this sense
and how can this be done? Or is there another "fast" way to sort by
alphabetic order ?

Currently I'm using lucene 1.9.1 (dotnet). The indexsize is currently about
2 GB and index is split in two parts and are accessed by a parallelreader.

Hopefully you can give me a tip if there is a way :-))

Thanks a lot :-)
-- 
View this message in context: http://www.nabble.com/Sorting-tf2019404.html#a5552408
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Sorting

Posted by karl wettin <ka...@gmail.com>.

On Mon, 2006-07-31 at 11:54 +0200, Andrzej Bialecki wrote:
> Chris Hostetter wrote:
> > 1) I didn't know there were any JVMs that limited the heap size to 1GB ...
> > a 32bit address space would impose a hard limit of 4GB, and I've heard
> > that Windows limits process to 2GB, but I don't know of any JVMs that have
> > 1GB limits.
> >   
> 
> I believe all Win32 JVM-s have a limit of ~1.3GB (~1.9GB if using 
> rebase.exe), which quite often can't be reached anyway due to memory 
> fragmentation. Read here for a somewhat funny analysis:
> 
> *http://www.oreillynet.com/digitalmedia/blog/2005/01/what_is_the_largest_text_file.html*
> 
> *nix OS-es on 32-bit platforms indeed have 4GB addressing space, but at 
> least 1GB of this space is reserved for kernel use ... If I'm not 
> mistaken most 2.6.x Linux distros run now with 1GB/3GB split between 
> kernel/user space, and 2.4.x kernels ran with 2GB/2GB split.

I love my 64bit Solaris and -XX:+AggressiveHeap.

:D


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Sorting

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Hostetter wrote:
> 1) I didn't know there were any JVMs that limited the heap size to 1GB ...
> a 32bit address space would impose a hard limit of 4GB, and I've heard
> that Windows limits process to 2GB, but I don't know of any JVMs that have
> 1GB limits.
>   

I believe all Win32 JVM-s have a limit of ~1.3GB (~1.9GB if using 
rebase.exe), which quite often can't be reached anyway due to memory 
fragmentation. Read here for a somewhat funny analysis:

*http://www.oreillynet.com/digitalmedia/blog/2005/01/what_is_the_largest_text_file.html*

*nix OS-es on 32-bit platforms indeed have 4GB addressing space, but at 
least 1GB of this space is reserved for kernel use ... If I'm not 
mistaken most 2.6.x Linux distros run now with 1GB/3GB split between 
kernel/user space, and 2.4.x kernels ran with 2GB/2GB split.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Advice on Custom Sorting

Posted by Paul Lynch <pa...@yahoo.com>.

Thanks again Erick for taking the time.

I agree that the CachingWrapperFilter as described
under "using a custom filter" in LIA is probably my
best bet. I wanted to check if anything had been added
in Lucene releases since the book was written I wasn't
aware of.

Cheers again.

--- Erick Erickson <er...@gmail.com> wrote:

> You were probably right. See below....
> 
> On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
> >
> > Thanks for the quick response Erick.
> >
> > "index the documents in your preferred list with a
> > field and index your non-preferred docs with a
> field
> > subid?"
> >
> > I considered this approach and dismissed it due to
> the
> > actual list of preferred ids changing so
> frequently
> > (every 10 mins...ish) but maybe I was a little
> hasty
> > in doing so. I will investigate the overhead in
> > updating all docs in the index each time my list
> > refreshes. I had assumed it was too prohibitive
> but I
> > know what they say about assumptions :)
> 
> 
> Lots of overhead. There's really no capability of
> updating a doc in place.
> This has been on several people's wish-list. You'd
> have to delete every doc
> that you wanted to change and re-add it. I don't
> know how many documents
> this would be, if just a few it'd be OK, but if
> many.... I was assuming (and
> I *do* know what they say about assumptions <G>)
> that you were just adding
> to your preferred doc list every few minutes, not
> changing existing
> documents....
> 
> It really does sound like you want a filter. I was
> pleasantly surprised by
> how very quickly a filters are built. You could use
> a CachingWrapperFilter
> to have the filter kept around automatically (I
> guess you'd only have one
> per index update) to minimize your overhead for
> building filters, and
> perhaps warm up your cache by firing a canned query
> at your searcher when
> you re-open your IndexReader after index update. I
> think you'd have to do
> the two-query thing in this case. If you wanted to
> really get exotic, you
> could build your filter when you created your index
> and store it in a *very
> special document* and just read it in the first time
> you needed it. Although
> I've never used it, I guess you can store binary
> data. From the Javadoc
> 
>
*Field<file:///C:/lucene-2.0.0/docs/api/org/apache/lucene/document/Field.html#Field%28java.lang.String,%20byte%5B%5D,%20org.apache.lucene.document.Field.Store%29>
> *(String
>
<http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html>
> name,
> byte[] value,
>
Field.Store<file:///C:/lucene-2.0.0/docs/api/org/apache/lucene/document/Field.Store.html>
>  store)
>           Create a stored field with binary value.
> 
> The only thing here is that the filters (probably
> wrapped in a
> ConstantScoreQuery) lose relevance, but since you're
> sorting "one of several
> ways", that probably doesn't matter.
> 
> Best
> Erick
> 
> 
> 
> Should I be able to make this workable, the beauty
> of
> > this solution would be that I would actually only
> need
> > to query once. If I had a field which indicates
> > whether it is a preferred doc or not, "all" I will
> > have to do is sort across the two fields.
> >
> > Thanks again Erick. Any other suggestions are most
> > welcome.
> >
> > Regards,
> > Paul
> >
> > --- Erick Erickson <er...@gmail.com>
> wrote:
> >
> > > OK, a really "off the top of my head" response,
> but
> > > what the heck....
> > >
> > > I'm not sure you need to worry about filters.
> Would
> > > it work for you to index
> > > the documents in your preferred list with a 
> field
> > > (called, at the limit of
> > > my creativity, preferredsubid <G>) and index
> your
> > > non-preferred docs with a
> > > field subid? You'd still have to fire two
> queries,
> > > one on subid (to pick up
> > > the ones in your non-preferred list) and one on
> > > preferredsubid.
> > >
> > > Since there's no requirement that all docs have
> the
> > > same fields, your
> > > preferred docs could have ONLY the
> preferredsubid
> > > field and your
> > > non-preferred docs ONLY the subid field. That
> way
> > > you wouldn't have to worry
> > > about picking the docs up twice.
> > >
> > > Merging should be simple then, just iterate over
> > > however many hits you want
> > > in your preferredHits object, then tack on
> however
> > > many you want from your
> > > nonPreferredHits object. All the code for the
> two
> > > queries would be
> > > identical, the only difference being whether you
> > > specify "subid" or
> > > "preferredsubid"......
> > >
> > > I can imagine several variations on this
> scenario,
> > > but they depend on your
> > > problem space.
> > >
> > > Whether this is the "best" or not, I leave as an
> > > exercise for the reader.
> > >
> > > Best
> > > Erick
> > >
> > > On 9/25/06, Paul Lynch <pa...@yahoo.com>
> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I have an index containing documents which all
> > > have a
> > > > field called SubId which holds the ID of the
> > > > Subscriber that submitted the data. This field
> is
> > > > STORED and UN_TOKENIZED
> > > >
> > > > When I am querying the index, the user can
> cloose
> > > a
> > > > number of different ways to sort the Hits. The
> > > problem
> > > > is that I have a list of SubIds that should
> appear
> > > at
> > > > the top of the results list regardless of how
> the
> > > > index is sorted. In other words, lets suppose
> the
> > > Hits
> > > > should be sorted by DateAdded, I require the
> Hits
> > > to
> > > > be sorted by DateAdded for the SubIds in my
> list
> > > and
> > > > then by DateAdded for the SubIds not in my
> list.
> > > >
> > > > From reading previous discussions on the
> mailing
> > > list,
> > > > I believe I could achieve what I need by
> writing
> > > > custom filters i.e. Run the query first with a
> > > custom
> > > > filter for the SubIds in my list and then a
> second
> > > > time with a custom filter for the SubIds not
> in my
> > > > list and then "merge" the results.
> > > >
> > > > I suppose my question is simple: Is there a
> better
> > > way
> > > > to achieve this?
> > > >
> 
=== message truncated ===


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Advice on Custom Sorting

Posted by Erick Erickson <er...@gmail.com>.

You were probably right. See below....

On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
>
> Thanks for the quick response Erick.
>
> "index the documents in your preferred list with a
> field and index your non-preferred docs with a field
> subid?"
>
> I considered this approach and dismissed it due to the
> actual list of preferred ids changing so frequently
> (every 10 mins...ish) but maybe I was a little hasty
> in doing so. I will investigate the overhead in
> updating all docs in the index each time my list
> refreshes. I had assumed it was too prohibitive but I
> know what they say about assumptions :)


Lots of overhead. There's really no capability of updating a doc in place.
This has been on several people's wish-list. You'd have to delete every doc
that you wanted to change and re-add it. I don't know how many documents
this would be, if just a few it'd be OK, but if many.... I was assuming (and
I *do* know what they say about assumptions <G>) that you were just adding
to your preferred doc list every few minutes, not changing existing
documents....

It really does sound like you want a filter. I was pleasantly surprised by
how very quickly a filters are built. You could use a CachingWrapperFilter
to have the filter kept around automatically (I guess you'd only have one
per index update) to minimize your overhead for building filters, and
perhaps warm up your cache by firing a canned query at your searcher when
you re-open your IndexReader after index update. I think you'd have to do
the two-query thing in this case. If you wanted to really get exotic, you
could build your filter when you created your index and store it in a *very
special document* and just read it in the first time you needed it. Although
I've never used it, I guess you can store binary data. From the Javadoc

*Field<file:///C:/lucene-2.0.0/docs/api/org/apache/lucene/document/Field.html#Field%28java.lang.String,%20byte%5B%5D,%20org.apache.lucene.document.Field.Store%29>
*(String <http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html> name,
byte[] value, Field.Store<file:///C:/lucene-2.0.0/docs/api/org/apache/lucene/document/Field.Store.html>
 store)
          Create a stored field with binary value.

The only thing here is that the filters (probably wrapped in a
ConstantScoreQuery) lose relevance, but since you're sorting "one of several
ways", that probably doesn't matter.

Best
Erick



Should I be able to make this workable, the beauty of
> this solution would be that I would actually only need
> to query once. If I had a field which indicates
> whether it is a preferred doc or not, "all" I will
> have to do is sort across the two fields.
>
> Thanks again Erick. Any other suggestions are most
> welcome.
>
> Regards,
> Paul
>
> --- Erick Erickson <er...@gmail.com> wrote:
>
> > OK, a really "off the top of my head" response, but
> > what the heck....
> >
> > I'm not sure you need to worry about filters. Would
> > it work for you to index
> > the documents in your preferred list with a  field
> > (called, at the limit of
> > my creativity, preferredsubid <G>) and index your
> > non-preferred docs with a
> > field subid? You'd still have to fire two queries,
> > one on subid (to pick up
> > the ones in your non-preferred list) and one on
> > preferredsubid.
> >
> > Since there's no requirement that all docs have the
> > same fields, your
> > preferred docs could have ONLY the preferredsubid
> > field and your
> > non-preferred docs ONLY the subid field. That way
> > you wouldn't have to worry
> > about picking the docs up twice.
> >
> > Merging should be simple then, just iterate over
> > however many hits you want
> > in your preferredHits object, then tack on however
> > many you want from your
> > nonPreferredHits object. All the code for the two
> > queries would be
> > identical, the only difference being whether you
> > specify "subid" or
> > "preferredsubid"......
> >
> > I can imagine several variations on this scenario,
> > but they depend on your
> > problem space.
> >
> > Whether this is the "best" or not, I leave as an
> > exercise for the reader.
> >
> > Best
> > Erick
> >
> > On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
> > >
> > > Hi All,
> > >
> > > I have an index containing documents which all
> > have a
> > > field called SubId which holds the ID of the
> > > Subscriber that submitted the data. This field is
> > > STORED and UN_TOKENIZED
> > >
> > > When I am querying the index, the user can cloose
> > a
> > > number of different ways to sort the Hits. The
> > problem
> > > is that I have a list of SubIds that should appear
> > at
> > > the top of the results list regardless of how the
> > > index is sorted. In other words, lets suppose the
> > Hits
> > > should be sorted by DateAdded, I require the Hits
> > to
> > > be sorted by DateAdded for the SubIds in my list
> > and
> > > then by DateAdded for the SubIds not in my list.
> > >
> > > From reading previous discussions on the mailing
> > list,
> > > I believe I could achieve what I need by writing
> > > custom filters i.e. Run the query first with a
> > custom
> > > filter for the SubIds in my list and then a second
> > > time with a custom filter for the SubIds not in my
> > > list and then "merge" the results.
> > >
> > > I suppose my question is simple: Is there a better
> > way
> > > to achieve this?
> > >
> > > Couple of bits of info which I would influence
> > best
> > > design:
> > >
> > > - Index contains roughly 5M documents
> > > - There can be up to 10K different unique SubIds
> > > - My "Preferred SubId List" could contain any
> > > combination of the 10K SubIds including all or
> > none of
> > > them
> > > - My "Preferred SubId List" gets updated about 10
> > > times and hour so I could cache the custom filters
> > >
> > > Thanks in advance,
> > > Paul
> > >
> > >
> >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail:
> > java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Advice on Custom Sorting

Posted by Paul Lynch <pa...@yahoo.com>.

Thanks for the quick response Erick.

"index the documents in your preferred list with a 
field and index your non-preferred docs with a field
subid?"

I considered this approach and dismissed it due to the
actual list of preferred ids changing so frequently
(every 10 mins...ish) but maybe I was a little hasty
in doing so. I will investigate the overhead in
updating all docs in the index each time my list
refreshes. I had assumed it was too prohibitive but I
know what they say about assumptions :)

Should I be able to make this workable, the beauty of
this solution would be that I would actually only need
to query once. If I had a field which indicates
whether it is a preferred doc or not, "all" I will
have to do is sort across the two fields.

Thanks again Erick. Any other suggestions are most
welcome.

Regards,
Paul

--- Erick Erickson <er...@gmail.com> wrote:

> OK, a really "off the top of my head" response, but
> what the heck....
> 
> I'm not sure you need to worry about filters. Would
> it work for you to index
> the documents in your preferred list with a  field
> (called, at the limit of
> my creativity, preferredsubid <G>) and index your
> non-preferred docs with a
> field subid? You'd still have to fire two queries,
> one on subid (to pick up
> the ones in your non-preferred list) and one on
> preferredsubid.
> 
> Since there's no requirement that all docs have the
> same fields, your
> preferred docs could have ONLY the preferredsubid
> field and your
> non-preferred docs ONLY the subid field. That way
> you wouldn't have to worry
> about picking the docs up twice.
> 
> Merging should be simple then, just iterate over
> however many hits you want
> in your preferredHits object, then tack on however
> many you want from your
> nonPreferredHits object. All the code for the two
> queries would be
> identical, the only difference being whether you
> specify "subid" or
> "preferredsubid"......
> 
> I can imagine several variations on this scenario,
> but they depend on your
> problem space.
> 
> Whether this is the "best" or not, I leave as an
> exercise for the reader.
> 
> Best
> Erick
> 
> On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
> >
> > Hi All,
> >
> > I have an index containing documents which all
> have a
> > field called SubId which holds the ID of the
> > Subscriber that submitted the data. This field is
> > STORED and UN_TOKENIZED
> >
> > When I am querying the index, the user can cloose
> a
> > number of different ways to sort the Hits. The
> problem
> > is that I have a list of SubIds that should appear
> at
> > the top of the results list regardless of how the
> > index is sorted. In other words, lets suppose the
> Hits
> > should be sorted by DateAdded, I require the Hits
> to
> > be sorted by DateAdded for the SubIds in my list
> and
> > then by DateAdded for the SubIds not in my list.
> >
> > From reading previous discussions on the mailing
> list,
> > I believe I could achieve what I need by writing
> > custom filters i.e. Run the query first with a
> custom
> > filter for the SubIds in my list and then a second
> > time with a custom filter for the SubIds not in my
> > list and then "merge" the results.
> >
> > I suppose my question is simple: Is there a better
> way
> > to achieve this?
> >
> > Couple of bits of info which I would influence
> best
> > design:
> >
> > - Index contains roughly 5M documents
> > - There can be up to 10K different unique SubIds
> > - My "Preferred SubId List" could contain any
> > combination of the 10K SubIds including all or
> none of
> > them
> > - My "Preferred SubId List" gets updated about 10
> > times and hour so I could cache the custom filters
> >
> > Thanks in advance,
> > Paul
> >
> >
>
---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> >
> >
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Advice on Custom Sorting

Posted by Erick Erickson <er...@gmail.com>.

OK, a really "off the top of my head" response, but what the heck....

I'm not sure you need to worry about filters. Would it work for you to index
the documents in your preferred list with a  field (called, at the limit of
my creativity, preferredsubid <G>) and index your non-preferred docs with a
field subid? You'd still have to fire two queries, one on subid (to pick up
the ones in your non-preferred list) and one on preferredsubid.

Since there's no requirement that all docs have the same fields, your
preferred docs could have ONLY the preferredsubid field and your
non-preferred docs ONLY the subid field. That way you wouldn't have to worry
about picking the docs up twice.

Merging should be simple then, just iterate over however many hits you want
in your preferredHits object, then tack on however many you want from your
nonPreferredHits object. All the code for the two queries would be
identical, the only difference being whether you specify "subid" or
"preferredsubid"......

I can imagine several variations on this scenario, but they depend on your
problem space.

Whether this is the "best" or not, I leave as an exercise for the reader.

Best
Erick

On 9/25/06, Paul Lynch <pa...@yahoo.com> wrote:
>
> Hi All,
>
> I have an index containing documents which all have a
> field called SubId which holds the ID of the
> Subscriber that submitted the data. This field is
> STORED and UN_TOKENIZED
>
> When I am querying the index, the user can cloose a
> number of different ways to sort the Hits. The problem
> is that I have a list of SubIds that should appear at
> the top of the results list regardless of how the
> index is sorted. In other words, lets suppose the Hits
> should be sorted by DateAdded, I require the Hits to
> be sorted by DateAdded for the SubIds in my list and
> then by DateAdded for the SubIds not in my list.
>
> From reading previous discussions on the mailing list,
> I believe I could achieve what I need by writing
> custom filters i.e. Run the query first with a custom
> filter for the SubIds in my list and then a second
> time with a custom filter for the SubIds not in my
> list and then "merge" the results.
>
> I suppose my question is simple: Is there a better way
> to achieve this?
>
> Couple of bits of info which I would influence best
> design:
>
> - Index contains roughly 5M documents
> - There can be up to 10K different unique SubIds
> - My "Preferred SubId List" could contain any
> combination of the 10K SubIds including all or none of
> them
> - My "Preferred SubId List" gets updated about 10
> times and hour so I could cache the custom filters
>
> Thanks in advance,
> Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Sorting

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

> Scorers are by contract expected to score docs in docId order

This was my missing link. Now it makes sense to me to use a buffered
RandomAccessFile and not bother with the presort.

Many thanks, Chris, that was very well explained. 

I'll have a crack at a lean-memory SortComparatorSource implementation,
which uses a buffered RandomAccessFile, as described.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: 02 August 2006 04:32
To: java-user@lucene.apache.org
Subject: RE: Sorting

: I'm with you now. So you do seeks in your comparator. For a large index
you
: might as well use java.io.RandomAccessFile for the "array", because there
: would be little value in buffering when the comparator is liable to jump
all

yep .. that's what i was getting at ... but i'm not so sure that buffering
won't be usefull.  I've i'm not mistaken, all Scorers are by contract
expected to score docs in docId order so when your hits are being collected
for sorting, you should allways be moving forward in the file
-- but you may skip ahead alot when the result set isn't a high percentage
of the total number of docs.
(i may be wrong about all Scorers going in docId order ... if you explicilty
use the 1.4 BooleanScorer you may not get that behavior, but i think
everything else works that way ... perhaps someone else can verify
that)

: around the file. This sounds very expensive, though. If you don't open a
: Searcher to frequently, it makes sense (in my muddled mind) to pre-sort to
: reduce the number of seeks. That was the half-baked idea of the third
file,
: which essentially orders document IDs.

presort on what exactly, the field you want to sort on?  -- That's
esentially what the TermEnum is.  I'm not sure how having that helps you ...
let's assume you've got some data structure (let's not worry about the
file/ram or TermEnum distinction just yet) containing every document in your
index of 100,000,000 products sorted on the price field, and you've done a
search for "apple" and there are 1,000,000 docIds for matching products
ready to be collected by your new custom Scoring code ... how does the full
list of all docIds sorted by price help you as you are given docIds and have
to decide if that doc is better or worse then the docs you've already
collected?

: > Bear in mind, there have been some improvements recently to the ability
to
: grab individual stored fields per document....
:
: I can't see anything like that in 2.0. Is that something in the Lucene
HEAD
: build?

I guess so ... search the java-dev archives for "lazy field loading" or
"Fieldable" .. that should find some of the discussion about it and the jira
issue with the changes.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org