Posted to java-user@lucene.apache.org by "Baldwin, David" <Da...@bmc.com> on 2014/09/06 05:05:12 UTC

How to properly correlate relevance in a search across multiple collections

I have a project with multiple collections, possibly dozens at a time, where a single result set needs to be generated by applying the same search criteria to each collection directory and then merging all the sub-searches into one result set with correlated relevance.

Does anyone have good experience with this, and could you share some tidbits or info I may not have run across yet?

-David


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: How to properly correlate relevance in a search across multiple collections

Posted by "Baldwin, David" <Da...@bmc.com>.

After my last question, I am now intrigued by the alternative suggested: defining a "super-corpus" (collection). We are using stock Lucene (not Solr or anything else). Is there a known method for integrating the DF of multiple collections to allow such a cross-collection DF?

I think I like the simplicity of the first method, but it remains to be seen if it would satisfy the relevancy needs of the application.

Thoughts?

RE: How to properly correlate relevance in a search across multiple collections

Posted by "Baldwin, David" <Da...@bmc.com>.

I did notice MultiSearcher and MultiReader. Given the statement on the Lucene feature page about "multiple-index searching with merged results" (see https://lucene.apache.org/core/), I am wondering whether my original question about searching multiple indexes with merged results also covers proper ranking during the merge. I would normally have assumed so, but given the discussion we are having here, I doubt that the results are merged in any way that keeps relevance comparable across indexes.

I hope I am wrong.

Anyone?

RE: How to properly correlate relevance in a search across multiple collections

Posted by Vincent Sevel <v....@lombardodier.com>.

Hi,

Does anyone know whether the source code for the Jira issue search example is available?
http://jirasearch.mikemccandless.com/

thanks,
vince

RE: How to properly correlate relevance in a search across multiple collections

Posted by atawfik <co...@gmail.com>.

Hi David,

It seems that MultiSearcher is deprecated in favor of MultiReader. Have a
look at https://issues.apache.org/jira/browse/LUCENE-2756.

Regarding the meta search approach, you can normalize the raw scores of
documents. There are many ways to do that; just search for "normalization
scores in meta search". The key here is the nature of your collections. If
they contain the same type of documents, then you can fuse them with
different aggregation methods. If raw scores are the issue, you can normalize
them or use the sum of reciprocal ranks, Borda count, or even a simple count.
If the documents are not of the same type, then you can try round robin.
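
As a rough illustration of the rank-based options mentioned above, here is a minimal sketch of the sum of reciprocal ranks and the Borda count. It uses only the Java standard library, not any Lucene API; the class name RankFusion, its method names, and the document ids are hypothetical, and each input list is assumed to hold document ids already ordered best-first by one collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative rank-based fusion; class and method names are hypothetical.
final class RankFusion {

    // Sum of reciprocal ranks: a document at 1-based rank r earns 1/r per list.
    static Map<String, Double> reciprocalRankSum(List<List<String>> rankedLists) {
        Map<String, Double> totals = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                totals.merge(list.get(i), 1.0 / (i + 1), Double::sum);
            }
        }
        return totals;
    }

    // Borda count: in a list of n documents the top one earns n points,
    // the next n - 1, and so on; a document missing from a list earns 0 there.
    static Map<String, Double> bordaCount(List<List<String>> rankedLists) {
        Map<String, Double> totals = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                totals.merge(list.get(i), (double) (list.size() - i), Double::sum);
            }
        }
        return totals;
    }

    // Order document ids by fused score, highest first; ties fall back to id order.
    static List<String> rank(final Map<String, Double> totals) {
        List<String> ids = new ArrayList<>(totals.keySet());
        ids.sort((x, y) -> {
            int byScore = Double.compare(totals.get(y), totals.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return ids;
    }

    public static void main(String[] args) {
        List<List<String>> lists = Arrays.asList(
                Arrays.asList("x", "y", "z"),   // ranking from collection A
                Arrays.asList("y", "z", "x"));  // ranking from collection B
        System.out.println(rank(reciprocalRankSum(lists)));
        System.out.println(rank(bordaCount(lists)));
    }
}
```

Both methods ignore the raw scores entirely and use only rank positions, which is why they sidestep the problem of scores that are not comparable between collections.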

My concern is not combining the search results, but rather maintaining good
relevant documents at the top of the merged result.

I have a master's degree in information retrieval, where I studied meta search
and distributed search for almost three years. However, the simple
workarounds suggested above will probably do the job.



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-properly-correlate-relevance-in-a-search-across-multiple-collections-tp4157240p4157555.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

RE: How to properly correlate relevance in a search across multiple collections

Posted by "Baldwin, David" <Da...@bmc.com>.

I am looking at MultiSearcher, which seems to have been around for a while (at least since 3.0.3), and I am wondering if it will do what I want. I just looked at the Lucene site again, and it states that Lucene searches multiple indexes with merged results. I also see a lot of comments about scores not being comparable from one index to another. I am confused. Does anyone have any additional thoughts on MultiSearcher? Reading Lucene in Action, it looks like it does what I want it to do.

Re: How to properly correlate relevance in a search across multiple collections

Posted by Erick Erickson <er...@gmail.com>.

I think the point got lost in the discussion. Raw scores are simply
_not_ comparable from different collections. They aren't even
comparable for different queries in the _same_ collection. They are
_only_ relevant for ranking in the same collection within a single
query.

And even then raw scores don't tell you much. A score of 2 isn't
"twice as good" as a score of 1, it's just "somewhat better".

So the bottom line is that you start resorting to some kind of clever
presentation of the different groups to the user: tabs for each
collection, round-robin inclusion, or meta-analysis where you query the
_same_ docs that exist in different indexes and try to create some
satisfactory heuristic, etc., as atawfik suggested.

Best,
Erick

RE: How to properly correlate relevance in a search across multiple collections

Posted by "Baldwin, David" <Da...@bmc.com>.

Would it be possible to use the raw score from each separate collection to order the results and then, after a merge, derive an overall relevancy? Does anyone have any experience with this?

Re: How to properly correlate relevance in a search across multiple collections

Posted by atawfik <co...@gmail.com>.

Hi,

If you have documents that might exist in multiple collections, then you can
use techniques from meta search, that is, combining multiple search results
from different collections. In this case, you can retrieve the top 100 or
1000 documents from each collection and merge them. You then rank the
documents using some aggregation method. Summing the relevance scores is
known to produce good results.
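
A minimal sketch of score summation (often called CombSUM in the meta-search literature) might look like the following. It is plain Java, not Lucene code: the class name CombSumDemo, the document ids, and the scores are made up, and the min-max normalization step is an added assumption so that a collection with a larger score scale does not dominate the sum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative score-sum fusion; names and sample data are hypothetical.
final class CombSumDemo {

    // Rescale one collection's scores to [0, 1].
    static Map<String, Double> minMaxNormalize(Map<String, Double> scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores.values()) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            // If all scores are equal, treat every hit as equally good.
            out.put(e.getKey(), range == 0.0 ? 1.0 : (e.getValue() - min) / range);
        }
        return out;
    }

    // Sum each document's normalized scores across all result lists,
    // then return the ids ordered by total, highest first.
    static List<String> fuse(List<Map<String, Double>> resultLists) {
        final Map<String, Double> totals = new HashMap<>();
        for (Map<String, Double> list : resultLists) {
            for (Map.Entry<String, Double> e : minMaxNormalize(list).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        List<String> ranked = new ArrayList<>(totals.keySet());
        ranked.sort((x, y) -> {
            int byScore = Double.compare(totals.get(y), totals.get(x));
            return byScore != 0 ? byScore : x.compareTo(y); // deterministic ties
        });
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("d1", 10.0); a.put("d2", 5.0); a.put("d3", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("d2", 0.9); b.put("d4", 0.5); b.put("d1", 0.1);
        System.out.println(fuse(Arrays.asList(a, b)));
    }
}
```

A document found in several collections accumulates score from each of them, so shared documents tend to rise in the merged list.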

If there are no shared documents between collections, you can still use the
same approach but with different aggregation methods. One method is round
robin: you start by selecting the first-ranked document from each
collection, then take the second-ranked document from each, and so on.
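
The round-robin merge described above can be sketched in a few lines of plain Java. The class name RoundRobinMerge, the limit parameter, and the sample ids are hypothetical; the inputs are assumed to be per-collection lists of document ids already sorted best-first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative round-robin interleaving; names and sample data are hypothetical.
final class RoundRobinMerge {

    // Take rank 1 from every list, then rank 2 from every list, and so on,
    // stopping once 'limit' distinct documents have been collected.
    static List<String> merge(List<List<String>> rankedLists, int limit) {
        Set<String> merged = new LinkedHashSet<>(); // keeps order, drops repeats
        int longest = 0;
        for (List<String> list : rankedLists) {
            longest = Math.max(longest, list.size());
        }
        for (int rank = 0; rank < longest && merged.size() < limit; rank++) {
            for (List<String> list : rankedLists) {
                if (rank < list.size() && merged.size() < limit) {
                    merged.add(list.get(rank));
                }
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<List<String>> lists = Arrays.asList(
                Arrays.asList("a1", "a2", "a3"),
                Arrays.asList("b1", "b2"),
                Arrays.asList("c1"));
        System.out.println(merge(lists, 10));
    }
}
```

Round robin never compares scores between collections, so it stays valid even when the raw scores are not comparable, at the cost of treating every collection as equally likely to hold relevant documents.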

If that does not fit your needs, you should probably search for "federated
or aggregated search techniques". These techniques are used by giant search
engines to combine results from their search engine parts (images, video, and
web). You can find a lot of academic resources on these topics.

Regards
Ameer 



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-properly-correlate-relevance-in-a-search-across-multiple-collections-tp4157240p4157321.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: How to properly correlate relevance in a search across multiple collections

Posted by Jack Krupansky <ja...@basetechnology.com>.

An observation: df (document frequency), and the IDF derived from it, is a
key driver of the whole relevancy framework on which stock Lucene is based.
There is no question about its significant value. But... that means that we
can't blindly "correlate" relevancy between "collections", in large part
because the document scores are so heavily driven by df, which is distinctly
based on the specific corpus of each collection.

My modest proposal: As valuable as df-based relevancy is, offer an
easy-to-use "switch" to drop back to a pure tf-based relevancy score
(primarily tf, though it can include other factors, as long as they are
limited to the contents of the document itself) to sidestep these
corpus-dependent scores. In other words, the score of the document could
depend on only the contents of the document itself, not the corpus. Yes,
that's a major loss of relevance, but the benefits for operations in a
multi-corpus, distributed world can be substantial.
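
To make the corpus dependence concrete, here is a toy model in plain Java. It is not Lucene's Similarity API: the class name CorpusDependenceDemo and the sample numbers are hypothetical, and the formulas (square-root tf, idf of the form 1 + ln(numDocs / (docFreq + 1))) are assumed as a typical classic TF-IDF shape.

```java
// Toy scoring model, not Lucene's Similarity API; names and numbers are hypothetical.
final class CorpusDependenceDemo {

    // Assumed classic-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Square-root damping of the raw term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Corpus-dependent score: changes with docFreq and numDocs.
    static double tfIdfScore(int termFreq, long docFreq, long numDocs) {
        return tf(termFreq) * idf(docFreq, numDocs);
    }

    // Corpus-independent score: depends only on the document's own contents.
    static double tfOnlyScore(int termFreq) {
        return tf(termFreq);
    }

    public static void main(String[] args) {
        int termFreq = 4; // the same document text indexed in two collections
        // Collection A: the term is rare (10 of 100,000 documents).
        double inA = tfIdfScore(termFreq, 10, 100_000);
        // Collection B: the term is common (5,000 of 10,000 documents).
        double inB = tfIdfScore(termFreq, 5_000, 10_000);
        System.out.printf("tf-idf in A: %.3f, in B: %.3f%n", inA, inB);
        System.out.printf("tf-only in both: %.3f%n", tfOnlyScore(termFreq));
    }
}
```

The identical document scores very differently in the two collections under the df-based formula, while the tf-only score is the same in both, which is the property the proposed "switch" relies on.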

Yes, you can do this yourself by just plugging in your own custom
"similarity" class, but it should be offered as a much easier-to-use
"switch" for Lucene itself (and Solr too!).

The alternative is to have some mechanism to define and work with a
"super-corpus" or "super-collection" that integrates the df for multiple
corpora, but... df is calculated or updated for the overall corpus, so a
cross-corpus df would require recalculating df for all terms in the index
whenever the multi-corpus structure changes, which can work in some cases,
but not for things like distributed searches in Solr. That might be a
superior solution, but might not be as easy or as performant as a simple
non-df similarity approach.
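
One way to picture the "super-corpus" idea is to add up the per-collection statistics for a term before computing idf, so every collection scores with the same value. The sketch below is plain Java under that assumption; GlobalDfDemo and TermStats are hypothetical names, not Lucene classes, and the idf formula is the same assumed classic form, 1 + ln(numDocs / (docFreq + 1)).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative cross-collection df aggregation; all names are hypothetical.
final class GlobalDfDemo {

    // Statistics for one term in one collection.
    static final class TermStats {
        final long docFreq; // documents containing the term
        final long numDocs; // documents in the collection

        TermStats(long docFreq, long numDocs) {
            this.docFreq = docFreq;
            this.numDocs = numDocs;
        }
    }

    // Sum df and document counts over all collections.
    static TermStats combine(List<TermStats> perCollection) {
        long docFreq = 0;
        long numDocs = 0;
        for (TermStats stats : perCollection) {
            docFreq += stats.docFreq;
            numDocs += stats.numDocs;
        }
        return new TermStats(docFreq, numDocs);
    }

    // Assumed classic-style idf computed from the given statistics.
    static double idf(TermStats stats) {
        return 1.0 + Math.log((double) stats.numDocs / (stats.docFreq + 1));
    }

    public static void main(String[] args) {
        TermStats a = new TermStats(10, 100_000);
        TermStats b = new TermStats(5_000, 10_000);
        TermStats global = combine(Arrays.asList(a, b));
        System.out.printf("idf A: %.3f, idf B: %.3f, global idf: %.3f%n",
                idf(a), idf(b), idf(global));
    }
}
```

The combined statistics must be recomputed whenever a collection is added, removed, or updated, which is the maintenance cost described above.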

It might also be nice for apps to offer users pure-tf scoring if it provides 
faster search results, and then the user could click on a "refine results" 
button to re-do the search with the more expensive cross-corpus df-based 
scoring.

Thoughts?

-- Jack Krupansky
