Posted to java-user@lucene.apache.org by SOME ONE <su...@yahoo.com> on 2006/02/17 19:27:52 UTC

Custom Sorting

Hi,

I am using MultiFieldQueryParser (Lucene 1.9) to
search title and body fields in the documents. The
requirement is that documents with title match should
be returned before the documents with body match.
Using the default scoring, title matches do come
before the body matches. But, I also need the
documents with title matches sorted by date, and
documents with body matches sorted by date, i.e., there
will be two groups of docs in the results, one with
title match and the other with just body match, the
first group of docs should come before the second
group, and each group should be sorted by date.

I am not concerned with the Lucene scoring as the
documents are very short. So, I think, just two scores
as described above for the two groups of docs are
sufficient. I am new to Lucene and have been looking
into its code trying to figure out how to achieve the
desired behaviour but can't seem to.

Any help in this regard will be greatly appreciated.

Thanks and Regards
Wiseman

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Custom Sorting

Posted by "Michael D. Curtin" <mi...@curtin.com>.
SOME ONE wrote:

> Hi,
> 
> Yes, my queries are like the first case. And as there
> have been no other suggestions to do it in a single
> search operation, will have to do it the way you
> suggested. This technique will do the job particularly
> because title's text is always in the body as well. So
> finally I will have to run two search operations like
> 
> (body:a AND body:b AND body:c) AND
> (title:a OR title:b OR title:c)
> 
> to get the first group of results for title match, and
> 
> (body:a AND body:b AND body:c) NOT
> (title:a OR title:b OR title:c)
> 
> to get the second group of results with just body
> match, and sort each group by date.

Here's another way.  Search just the first part of the first query, i.e. 
(body:a AND body:b AND body:c).  Then use an IndexReader, TermDocs and related 
objects to get a list of document numbers for each of title:a, title:b, and 
title:c.  Then merge / compare the lists to separate out the title matches and 
the body-only matches.  Doing it this way will save whatever resources are 
consumed by score computations.  Resources needed by the ORing will still 
basically be consumed, because of the merge / compare step you must perform.
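
A minimal sketch of the merge / compare step described above, in plain Java with no Lucene calls. It assumes the document numbers have already been collected into ascending, duplicate-free int arrays (the body hits would need sorting by document number first if they come back in another order). The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: splits body hits into title matches and body-only matches. */
final class TitleBodySplit {

    /** Merges two ascending, duplicate-free doc-number lists into their union. */
    static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                out[n++] = a[i++];
            } else if (a[i] > b[j]) {
                out[n++] = b[j++];
            } else {            // same doc in both lists: keep one copy
                out[n++] = a[i];
                i++;
                j++;
            }
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }

    /**
     * Walks both ascending lists once.
     * Returns { titleMatches, bodyOnly }, each in ascending doc-number order.
     */
    static int[][] partition(int[] bodyHits, int[] titleDocs) {
        int[] title = new int[bodyHits.length];
        int[] bodyOnly = new int[bodyHits.length];
        int t = 0, b = 0, j = 0;
        for (int doc : bodyHits) {
            // Advance the title cursor until it reaches or passes this doc.
            while (j < titleDocs.length && titleDocs[j] < doc) j++;
            if (j < titleDocs.length && titleDocs[j] == doc) {
                title[t++] = doc;
            } else {
                bodyOnly[b++] = doc;
            }
        }
        return new int[][] { Arrays.copyOf(title, t), Arrays.copyOf(bodyOnly, b) };
    }
}
```

The lists for title:a, title:b and title:c would be combined with union() first, then partition() is run once against the body hits. Each group still has to be sorted by date afterwards.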

Of course, this approach wouldn't work if the query terms aren't simple terms, 
i.e. you use ranges, or wildcards, or prefixes, etc.  If you do use constructs 
like those, then I don't see an alternative to the two-search algorithm. 
Unless you have hundreds of millions of documents, each of the two searches 
could be simpler than what you have above, though (see my earlier email for 
details).  The way they're set up above, 2 sets of reads through the indexes 
are necessary, and 2 sets of boolean operations.  The way I suggested earlier 
would have only 1 read of the index and 1 merge / compare computation.  If you 
have a few tens of millions of documents, or less, the lists of document 
numbers should easily fit in RAM.
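
To put a rough number on "should easily fit in RAM", here is a back-of-the-envelope sketch. It assumes document numbers are held either as 4-byte ints or as one bit per document in the index, and it ignores JVM object overhead. The class and method names are made up for illustration.

```java
/** Hypothetical estimate of memory needed to hold a list of document numbers. */
final class DocListMemory {

    /** Bytes for a plain int list holding 'count' document numbers. */
    static long intListBytes(long count) {
        return 4L * count;
    }

    /** Bytes for a bit set with one bit per document, rounded up to 64-bit words. */
    static long bitSetBytes(long maxDoc) {
        return (maxDoc + 63) / 64 * 8;
    }
}
```

For 30 million documents this gives 120,000,000 bytes for an int list in the worst case where every document matches, and 3,750,000 bytes for a bit set.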

--MDC



Re: Custom Sorting

Posted by SOME ONE <su...@yahoo.com>.
Hi,

Yes, my queries are like the first case. And as there
have been no other suggestions to do it in a single
search operation, will have to do it the way you
suggested. This technique will do the job particularly
because title's text is always in the body as well. So
finally I will have to run two search operations like

(body:a AND body:b AND body:c) AND
(title:a OR title:b OR title:c)

to get the first group of results for title match, and

(body:a AND body:b AND body:c) NOT
(title:a OR title:b OR title:c)

to get the second group of results with just body
match, and sort each group by date.
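
A minimal sketch of the last step, combining the two result groups once both searches have run. It is plain Java with no Lucene calls. The Hit class, the date stored as a sortable long, and the newest-first order are all assumptions, since the thread does not say which direction the dates should go.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: title group first, body-only group second, each by date. */
final class GroupedResults {

    /** Hypothetical hit: a document number plus its date as a sortable long. */
    static final class Hit {
        final int doc;
        final long date;

        Hit(int doc, long date) {
            this.doc = doc;
            this.date = date;
        }
    }

    static List<Hit> combine(List<Hit> titleHits, List<Hit> bodyOnlyHits) {
        // Assumption: newest first. Swap x and y for oldest first.
        Comparator<Hit> newestFirst = (x, y) -> Long.compare(y.date, x.date);

        // Copy so the caller's lists are left untouched.
        List<Hit> first = new ArrayList<>(titleHits);
        List<Hit> second = new ArrayList<>(bodyOnlyHits);
        first.sort(newestFirst);
        second.sort(newestFirst);

        first.addAll(second);   // the title group always precedes the body-only group
        return first;
    }
}
```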

As the queries in two operations are very similar,
even if there is overhead involved by doing it in two
search operations, I think it can be improved by using
filters that cache results and reuse them in the
second search operation. But I'm not sure how much
overhead there would be in doing it in two search
operations, or whether this optimization is really needed.

Thanks once again for your help.

Regards
Wiseman




Re: Custom Sorting

Posted by "Michael D. Curtin" <mi...@curtin.com>.
I'm not sure you can do what you want in a single search.  But, I'm not sure I 
actually understand what your queries look like, either.  I *think* you want 
to search like

(title:a OR body:a) AND (title:b OR body:b) AND (title:c OR body:c)

not something like

(title:a OR title:b OR title:c) AND (body:a OR body:b OR body:c)

or maybe something else altogether.  If it's the former, and your data really 
has the title's text duplicated in the body, then I think you should run 2 
searches, like this:

#1	body:a AND body:b AND body:c
#2	title:a OR title:b OR title:c

#1 tells you whether you get a hit at all, and #2 tells you whether the title 
field was involved.  Putting the same criterion on title as on body in a given 
query is redundant, because there's nothing in title that isn't also in body. 
You might even be able to do something like running #1, then using its 
results as a Filter for #2.
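
The idea of using the results of #1 as a filter for #2 can be sketched with bit sets of document numbers. This is plain Java and does not use Lucene's own Filter class. It assumes the hits of each search have already been collected as document numbers, and all names are hypothetical.

```java
import java.util.BitSet;

/** Hypothetical sketch: one bit per document number, set when the document matched. */
final class FilterSketch {

    static BitSet toBits(int[] docs) {
        BitSet bits = new BitSet();
        for (int doc : docs) {
            bits.set(doc);
        }
        return bits;
    }

    /** Documents that matched search #1 (body) and also search #2 (title). */
    static BitSet titleMatches(BitSet bodyHits, BitSet titleHits) {
        BitSet out = (BitSet) bodyHits.clone();   // clone so the inputs stay untouched
        out.and(titleHits);
        return out;
    }

    /** Documents that matched search #1 (body) but not search #2 (title). */
    static BitSet bodyOnly(BitSet bodyHits, BitSet titleHits) {
        BitSet out = (BitSet) bodyHits.clone();
        out.andNot(titleHits);
        return out;
    }
}
```

Because every title term is also in the body, every hit of #1 lands in exactly one of the two sets.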

Good luck!

--MDC



Re: Custom Sorting

Posted by SOME ONE <su...@yahoo.com>.
Hi,

Well, I gave more thought to your suggestion and came
to the conclusion that I cannot even run 2 searches.
The reason is that, as I mentioned in my first message,
I am using MultiFieldQueryParser to search the title and
body fields. Search terms can be found anywhere,
either in title or body or both, so I cannot just run
2 searches, one with only the title criteria and then
another with only the body criteria.
By a "title match" I actually meant "at least one
search term finds a match in title", while the rest of the
terms must also be found somewhere (I use the AND operator
by default), and by a "body match" I meant "no term
got a match in title", so all the terms are found in
body only. I think I had better re-explain my question:

I am using MultiFieldQueryParser (Lucene 1.9) to
search title and body fields in the documents. The
requirement is that documents with title match (at
least one search term finds a match in title) should
be returned before the documents with just body match
(no term got a match in title, all found in body
only). I use AND operator by default, so all the terms
must be found somewhere.
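
The two definitions above can be written down as a small classification rule. This is only a sketch in plain Java over sets of already-analyzed terms. It does not use Lucene, and the class, enum and method names are made up for illustration.

```java
import java.util.Set;

/** Hypothetical classifier for the "title match" / "just body match" definitions. */
final class MatchClassifier {

    enum Group { TITLE_MATCH, BODY_ONLY, NO_MATCH }

    static Group classify(Set<String> queryTerms,
                          Set<String> titleTerms,
                          Set<String> bodyTerms) {
        boolean anyInTitle = false;
        for (String term : queryTerms) {
            boolean inTitle = titleTerms.contains(term);
            // AND semantics: every term must be found in at least one field.
            if (!inTitle && !bodyTerms.contains(term)) {
                return Group.NO_MATCH;
            }
            anyInTitle |= inTitle;
        }
        // At least one term in title puts the document in the first group.
        // Note: an empty query falls through to BODY_ONLY in this sketch.
        return anyInTitle ? Group.TITLE_MATCH : Group.BODY_ONLY;
    }
}
```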

Using the default scoring, title matches do come
before the just body matches. But, I also need the
documents with title matches sorted by date, and
documents with just body matches sorted by date, i.e.,
there will be two groups of docs in the results, one
with title match and the other with just body match,
the first group of docs should come before the second
group, and each group should be sorted by date.

I am not concerned with the Lucene scoring as the
documents are very short. So, I think, just two scores
as described above for the two groups of docs would be
sufficient?
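
The "just two scores" idea amounts to a two-level sort key: group first, date second. A minimal sketch in plain Java follows. It assumes each document has already been tagged as a title match or not, that dates are stored as sortable longs, and that newest first is wanted within each group. None of this is Lucene's own sorting API, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical two-level ordering: title matches first, then by date. */
final class TwoLevelSort {

    static final class Doc {
        final int id;
        final boolean titleMatch;
        final long date;

        Doc(int id, boolean titleMatch, long date) {
            this.id = id;
            this.titleMatch = titleMatch;
            this.date = date;
        }
    }

    static final Comparator<Doc> ORDER = (x, y) -> {
        // Level 1: title matches sort before body-only matches.
        int byGroup = Boolean.compare(y.titleMatch, x.titleMatch);
        if (byGroup != 0) {
            return byGroup;
        }
        // Level 2 (assumption): newest date first within a group.
        return Long.compare(y.date, x.date);
    };

    static List<Doc> sorted(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(ORDER);
        return copy;
    }
}
```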

BTW, all the terms present in title are always present
in body as well.

I really appreciate your interest and help. Any more
suggestions on how to achieve the desired behaviour,
please?

Thanks and Regards
Wiseman




Re: Custom Sorting

Posted by "Michael D. Curtin" <mi...@curtin.com>.
SOME ONE wrote:

> Yes, I could run two searches, but that means running
> two searches for each request from the user, which I
> think doubles the work and takes double the time. Any
> suggestions to do it more efficiently, please?

I think it would only take double time if the sets of hit documents have 
substantial overlap, i.e. most of the documents meet both sets of criteria. 
If the sets of documents are mainly disjoint, the time wouldn't matter much 
(since the same index data structures have to be traversed either way).

--MDC



Re: Custom Sorting

Posted by SOME ONE <su...@yahoo.com>.
Hi,

Yes, I could run two searches, but that means running
two searches for each request from the user, which I
think doubles the work and takes double the time. Any
suggestions to do it more efficiently, please?

Thanks and Regards
Wiseman




Re: Custom Sorting

Posted by "Michael D. Curtin" <mi...@curtin.com>.
SOME ONE wrote:

> Hi,
> 
> I am using MultiFieldQueryParser (Lucene 1.9) to
> search title and body fields in the documents. The
> requirement is that documents with title match should
> be returned before the documents with body match.
> Using the default scoring, title matches do come
> before the body matches. But, I also need the
> documents with title matches sorted by date, and
> documents with body matches sorted by date, i.e., there
> will be two groups of docs in the results, one with
> title match and the other with just body match, the
> first group of docs should come before the second
> group, and each group should be sorted by date.

Would it work to run 2 searches, one with only the title criteria and then 
another with only the body criteria?

--MDC
