Posted to java-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/07/23 20:29:42 UTC

Combining hits

Hi,
I am doing a search on my index for a query like this:

query = "\"Term 1\" \"Term 2\" \"Term 3\""

Where I want to find Term 1, Term 2 and Term 3 in the index.  However, I
only want to search for "Term 3" if I find "Term 1" and "Term 2" first, to
avoid doing processing on hits that only contain "Term 3".  To do this, I
was thinking of doing a search for "\"Term 1\" \"Term 2\"" and then if there
are hits for these terms, I would do another search for "Term 3" on these
resulting documents.  I am running a background search so I am not too
worried about performance issues caused by searching twice.

Is there a way to search on a subset of documents and then combine the hits
for the document?  For example, if Term 1 and Term 2 are found in Document1,
and Term3 is also later found in Document1, I want to be able to process the
hits on my highlighter as containing all three terms.

Sorry if it's confusing.

Thanks,
Max

Re: Combining hits

Posted by Max Lynch <ih...@gmail.com>.
> What do you mean by "first"? Would you want to process a doc that did NOT
> have a "Term 3"?
>
> Let's say you have the following:
> doc1: "Term 1"
> doc2: "Term 2"
> doc3: "Term 1" "Term 2"
> doc4: "Term 3"
> doc5: "Term 1" "Term 2" "Term 3"
> doc6: "Term 2" "Term 3"
>
> Which docs do you want to get from your search? And does order really
> matter?


I would want all of those.  What I wouldn't want would be

doc7: "Term 3"

I rank documents more highly based on how many of these terms they contain.
For example, my system ranks doc5 the highest in your example, then doc3,
then doc6. I don't need Term 3, but I need to have Term 1 or Term 2 before I
go looking for Term 3 to further organize my results.  Order doesn't matter,
they are put through a separate scoring system.  I am really just trying to
improve the performance a little bit since "Term 3" will, in my index, hit
on far more documents than Term 1 or Term 2, but I only care about the
documents where Term 1 or Term 2 were found first.
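
The requirement described here (Term 1 or Term 2 must be present, and Term 3 only counts when one of them matched) looks like a boolean query with one required group and one optional clause. In Lucene's classic query syntax, assuming the default OR operator, that could be written roughly as +("Term 1" "Term 2") "Term 3". This was not proposed in the thread. The plain-Java sketch below only models those semantics on in-memory term sets; the class and method names are hypothetical and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of "required OR-group plus optional term".
// No Lucene classes are used; all names here are illustrative.
class RequiredOptionalSketch {

    static Set<String> terms(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // The six example documents quoted above (doc1 .. doc6).
    static Map<String, Set<String>> exampleDocs() {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", terms("Term 1"));
        docs.put("doc2", terms("Term 2"));
        docs.put("doc3", terms("Term 1", "Term 2"));
        docs.put("doc4", terms("Term 3"));
        docs.put("doc5", terms("Term 1", "Term 2", "Term 3"));
        docs.put("doc6", terms("Term 2", "Term 3"));
        return docs;
    }

    // True if the document contains at least one of the required terms.
    static boolean matches(Set<String> docTerms, Set<String> required) {
        for (String term : required) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Ids of documents that satisfy the required group, in insertion order.
    static List<String> search(Map<String, Set<String>> docs, Set<String> required) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : docs.entrySet()) {
            if (matches(entry.getValue(), required)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    // How many of the query terms a document contains (usable for ranking).
    static int matchedTermCount(Set<String> docTerms, Set<String> queryTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = exampleDocs();
        Set<String> required = terms("Term 1", "Term 2");
        Set<String> all = terms("Term 1", "Term 2", "Term 3");
        for (String id : search(docs, required)) {
            System.out.println(id + " matched terms: " + matchedTermCount(docs.get(id), all));
        }
    }
}
```

In this model the document containing only Term 3 is never returned, so no second pass over the index is needed to exclude it.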

Thanks,
Max

Re: Combining hits

Posted by Erick Erickson <er...@gmail.com>.
What do you mean by "first"? Would you want to process a doc that did NOT
have a "Term 3"?

Let's say you have the following:
doc1: "Term 1"
doc2: "Term 2"
doc3: "Term 1" "Term 2"
doc4: "Term 3"
doc5: "Term 1" "Term 2" "Term 3"
doc6: "Term 2" "Term 3"

Which docs do you want to get from your search? And does order really
matter?

Best
Erick




Re: Combining hits

Posted by Max Lynch <ih...@gmail.com>.
> Couldn't you maybe get the same effect using some clever term boosting?
>
> I.. think something like
>
> "Term 1" OR "Term 2" OR "Term 3"^0.25
>
> would return in almost the exact order that you are asking for here, with
> the only real difference being that you would have some matches for only
> Term 3 way way at the bottom of your list score wise.
>
> It might be worth investigating something like this, where you cut off
> displaying documents that fall below a certain score threshold, thus
> cutting out the matches that you don't want (the Term 3-only ones).
>

Thanks Matthew, I'll take a look at this.  It seems like a much better
option than what I'm doing currently.

-max

Re: Combining hits

Posted by Matthew Hall <mh...@informatics.jax.org>.
Looking at what you wrote:

> I am doing a weighting system where I rank documents that have Term 1 AND
> Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term
> 2, and more highly than documents that just have Term 1 OR Term 2 but not
> both.

Couldn't you maybe get the same effect using some clever term boosting?

I.. think something like

"Term 1" OR "Term 2" OR "Term 3"^0.25

would return in almost the exact order that you are asking for here, 
with the only real difference being that you would have some matches for 
only Term 3 way way at the bottom of your list score wise.

It might be worth investigating something like this, where you cut off
displaying documents that fall below a certain score threshold, thus
cutting out the matches that you don't want (the Term 3-only ones).
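
To see why a low boost on Term 3 combined with a score cutoff would drop the Term 3-only matches, here is a toy model in plain Java. It assumes a score that is simply the sum of per-term weights, which is a deliberate simplification and not how Lucene computes scores (real scores also depend on term frequency, document length and so on, so a fixed threshold would need tuning against real data). The class name, weights and threshold are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy additive scoring model; NOT Lucene's similarity. All names are hypothetical.
class BoostThresholdSketch {

    // Per-term weights mirroring the boosted query: Term 3 is down-weighted to 0.25.
    static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("Term 1", 1.0);
        WEIGHTS.put("Term 2", 1.0);
        WEIGHTS.put("Term 3", 0.25);
    }

    // Sum the weights of the query terms present in the document.
    static double score(List<String> docTerms) {
        double total = 0.0;
        for (String term : docTerms) {
            Double weight = WEIGHTS.get(term);
            if (weight != null) {
                total += weight;
            }
        }
        return total;
    }

    // Keep documents scoring at or above the threshold, highest score first.
    static List<String> rank(Map<String, List<String>> docs, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : docs.entrySet()) {
            if (score(entry.getValue()) >= threshold) {
                kept.add(entry.getKey());
            }
        }
        // List.sort is stable, so ties keep their original order.
        kept.sort((a, b) -> Double.compare(score(docs.get(b)), score(docs.get(a))));
        return kept;
    }

    // The six example documents from the thread.
    static Map<String, List<String>> exampleDocs() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", Arrays.asList("Term 1"));
        docs.put("doc2", Arrays.asList("Term 2"));
        docs.put("doc3", Arrays.asList("Term 1", "Term 2"));
        docs.put("doc4", Arrays.asList("Term 3"));
        docs.put("doc5", Arrays.asList("Term 1", "Term 2", "Term 3"));
        docs.put("doc6", Arrays.asList("Term 2", "Term 3"));
        return docs;
    }

    public static void main(String[] args) {
        // A threshold of 1.0 requires at least one full-weight term.
        System.out.println(rank(exampleDocs(), 1.0));
    }
}
```

Under these assumed weights, a document containing only Term 3 scores 0.25 and falls below the cutoff, while every document containing Term 1 or Term 2 scores at least 1.0.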

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Combining hits

Posted by Max Lynch <ih...@gmail.com>.
> do a search on "Term 1" AND "Term 2"
> do a search on "Term 1" AND "Term 2" AND "Term 3"
>
> This would ensure that you have two objects back, one of which is
> guaranteed to be a subset of the other.


I did start doing this after sending the email.  My only concern is search
speed.  Right now I first search for "Term 1" OR "Term 2" and then if there
are hits, I search for all three terms again on the whole index.

I guess it is working just fine for me, but I wonder if I could speed it up
at all by only searching on a subset of documents from the first search and
then combining the hits to process later.


> Then, when you are iterating on your documents to do your highlighting over
> the results from the first search (at least I think that's what you are
> doing here), check to see if the current document exists in the hits or
> topDocs object that came from the second search.  If it does, use the three
> term highlighter; if it doesn't, use the two term highlighter.
>
> But, what sort of reordering are you trying to do here anyhow?


I am doing a weighting system where I rank documents that have Term 1 AND
Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term
2, and more highly than documents that just have Term 1 OR Term 2 but not
both.

Thanks,
Max

Re: Combining hits

Posted by Matthew Hall <mh...@informatics.jax.org>.
Erm.. I have to be missing something here, wouldn't you be able to just do
the following:

do a search on "Term 1" AND "Term 2"
do a search on "Term 1" AND "Term 2" AND "Term 3"

This would ensure that you have two objects back, one of which is 
guaranteed to be a subset of the other.

Then, when you are iterating on your documents to do your highlighting
over the results from the first search (at least I think that's what you
are doing here), check to see if the current document exists in the hits
or topDocs object that came from the second search.  If it does, use the
three term highlighter; if it doesn't, use the two term highlighter.
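
The membership check described above could be sketched as follows, in plain Java. It assumes the document ids from each search have already been collected (for example from the doc field of each ScoreDoc in a TopDocs); the class, method and enum names are hypothetical, and the actual highlighting is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: decides per document which highlighter to use.
class HighlighterChoiceSketch {

    enum Mode { TWO_TERMS, THREE_TERMS }

    // firstSearchIds:  doc ids matching "Term 1" AND "Term 2", in result order.
    // secondSearchIds: doc ids matching all three terms (a subset of the first).
    static Map<Integer, Mode> chooseModes(List<Integer> firstSearchIds,
                                          Set<Integer> secondSearchIds) {
        Map<Integer, Mode> modes = new LinkedHashMap<>();
        for (Integer id : firstSearchIds) {
            // A hash-set lookup keeps this check constant-time per document.
            modes.put(id, secondSearchIds.contains(id) ? Mode.THREE_TERMS : Mode.TWO_TERMS);
        }
        return modes;
    }

    public static void main(String[] args) {
        // Made-up doc ids for illustration only.
        List<Integer> first = Arrays.asList(3, 5, 9);
        Set<Integer> second = new HashSet<>(Arrays.asList(5));
        System.out.println(chooseModes(first, second));
    }
}
```

Collecting the second result's ids into a set first avoids rescanning that result list for every document from the first search.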

But, what sort of reordering are you trying to do here anyhow?

Doing just a normal search against "Term 1" OR "Term 2" OR "Term 3" with 
a standard highlighter would most likely get you ... well exactly the 
same results as what you are describing.  The only real difference I 
could see is the order that the documents are returned to you.

Matt



-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012


