Posted to java-user@lucene.apache.org by Manuel Amoabeng <ma...@vjoon.com> on 2013/11/07 11:59:13 UTC

What is the best way to aggregate scores for sets of documents?

Hello everybody,


I am currently working on an index where the documents represent only parts of the entities that should be searchable:
We have text objects indexed as independent documents, but we actually want to find the articles that the text objects are placed on. We also need to provide an indication of the relevance of the matched articles.
In this scenario, the way an article's content is distributed across text objects determines how many hits representing the article are present in TopDocs.scoreDocs and what scores they carry.

Is there a way to aggregate the scores for logically connected ScoreDocs so that the result would be similar to the score that a single document containing all matched content would have gotten?


Thanks and best regards,

Manuel

Re: What is the best way to aggregate scores for sets of documents?

Posted by Alan Burlison <al...@gmail.com>.

On 07/11/2013 13:17, Manuel Amoabeng wrote:

> Sounds good, but wouldn't the aggregated scores of documents
> consisting of many sub-documents potentially be greater than the
> scores of docs with very few sub-documents even if the overall
> content is equal?

I don't pretend to understand Lucene scoring well enough to say. I just
did the easiest thing I could think of, and by observation it worked
well, so I didn't bother digging any further. You could always average
the scores within a grouping, which would reduce the influence of the
number of documents, but in my case I wanted groupings with higher
numbers of matching documents to score higher, so simple addition of the
scores worked well.

-- 
Alan Burlison
--
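
The trade-off Alan describes can be seen with a few made-up numbers. The sketch below is plain Java with no Lucene dependency; the class name and the scores are hypothetical and chosen only for illustration. It compares an article whose match is spread over four sub-documents with one whose match sits in a single sub-document.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: how summing and averaging treat groups of different sizes. */
final class SumVersusAverage {

    /** Adds up all sub-document scores in one group. */
    static double sum(List<Double> scores) {
        double total = 0.0;
        for (double s : scores) {
            total += s;
        }
        return total;
    }

    /** Mean sub-document score of one group; an empty group scores 0. */
    static double average(List<Double> scores) {
        return scores.isEmpty() ? 0.0 : sum(scores) / scores.size();
    }

    public static void main(String[] args) {
        List<Double> fourParts = Arrays.asList(0.5, 0.5, 0.5, 0.5); // made-up scores
        List<Double> onePart = Arrays.asList(1.0);                  // made-up score

        // Summing favours the article with more matching sub-documents.
        System.out.println("sum:     " + sum(fourParts) + " vs " + sum(onePart));
        // Averaging removes the influence of the sub-document count.
        System.out.println("average: " + average(fourParts) + " vs " + average(onePart));
    }
}
```

With these numbers the sum ranks the four-part article first (2.0 vs 1.0), while the average ranks the single-part article first (0.5 vs 1.0). Which behaviour is preferable depends on whether more matching sub-documents should count as more relevance.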

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is the best way to aggregate scores for sets of documents?

Posted by Manuel Amoabeng <ma...@vjoon.com>.

Sounds good, but wouldn't the aggregated scores of documents consisting of many sub-documents potentially be greater than the scores of docs with very few sub-documents, even if the overall content is equal?

Thanks,

Manuel


Re: What is the best way to aggregate scores for sets of documents?

Posted by Alan Burlison <al...@gmail.com>.

On 07/11/2013 10:59, Manuel Amoabeng wrote:

> Is there a way to aggregate the scores for logically connected
> ScoreDocs so that the result would be similar to the score a single
> document containing all matched content would have gotten?

I did something similar by just post-processing the query results, 
grouping by the upper-level construct and adding up all the scores for 
the sub-documents, then sorting by aggregated score. Crude, but gives 
good relevancy in the results.

-- 
Alan Burlison
--
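
A minimal sketch of the post-processing described above: group the hits by their upper-level construct, add up the scores, and sort by the total. It is plain Java and deliberately independent of Lucene; the Hit class and the article ids are hypothetical stand-ins for whatever the application reads from each ScoreDoc and its stored fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: sum sub-document scores per article, then rank articles by the total. */
final class ArticleScoreSum {

    /** Hypothetical stand-in for one hit: the owning article's id plus the hit's score. */
    static final class Hit {
        final String articleId;
        final float score;

        Hit(String articleId, float score) {
            this.articleId = articleId;
            this.score = score;
        }
    }

    /** Returns (articleId, summed score) entries ordered from highest to lowest total. */
    static List<Map.Entry<String, Float>> rankArticles(List<Hit> hits) {
        Map<String, Float> totals = new LinkedHashMap<>();
        for (Hit hit : hits) {
            // Add this hit's score to the running total of its article.
            totals.merge(hit.articleId, hit.score, Float::sum);
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("article-1", 1.5f),
                new Hit("article-2", 2.0f),
                new Hit("article-1", 1.0f));
        for (Map.Entry<String, Float> entry : rankArticles(hits)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
        // prints "article-1 2.5" then "article-2 2.0"
    }
}
```

Note that only hits actually returned by the search contribute, so the number of requested top hits has to be large enough to cover all sub-documents of the articles of interest.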


Re: What is the best way to aggregate scores for sets of documents?

Posted by Manuel Amoabeng <ma...@vjoon.com>.

Hmm, I am not sure how it could be achieved, but my task is to produce similar scores for articles with similar content, regardless of how that content is distributed across text objects.
Maybe something like creating a temporary document from the text objects and computing its score, instead of just aggregating the text object scores, would do the trick?

Thanks,

Manuel
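
To illustrate why scoring a merged document behaves differently from adding up per-object scores, here is a toy sketch. The scoring formula (square root of the term frequency divided by the square root of the token count) is an assumption made for this example and is not Lucene's actual scoring; the class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy comparison: score of the merged text vs. sum of per-text-object scores. */
final class MergedDocumentScore {

    /** Splits on whitespace after lower-casing; blank text yields no tokens. */
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.<String>emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Assumed toy formula: sqrt(term frequency) / sqrt(number of tokens). */
    static double toyScore(List<String> tokens, String term) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int freq = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                freq++;
            }
        }
        return Math.sqrt(freq) / Math.sqrt(tokens.size());
    }

    /** Scores the article as if all its text objects were one temporary document. */
    static double mergedScore(List<String> textObjects, String term) {
        List<String> all = new ArrayList<>();
        for (String text : textObjects) {
            all.addAll(tokenize(text));
        }
        return toyScore(all, term);
    }

    /** Scores each text object on its own and adds the results. */
    static double summedScore(List<String> textObjects, String term) {
        double total = 0.0;
        for (String text : textObjects) {
            total += toyScore(tokenize(text), term);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> oneObject = Arrays.asList("lucene scoring lucene join");
        List<String> twoObjects = Arrays.asList("lucene scoring", "lucene join");

        // Same content, different distribution: the merged scores are identical.
        System.out.println(mergedScore(oneObject, "lucene") + " vs " + mergedScore(twoObjects, "lucene"));
        // The summed scores differ (roughly 0.71 vs 1.41 under the toy formula).
        System.out.println(summedScore(oneObject, "lucene") + " vs " + summedScore(twoObjects, "lucene"));
    }
}
```

Under this toy formula the merged score does not depend on how the content is split, which is the property asked for here, while the summed score doubles when the same content is split into two objects.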
 

Re: What is the best way to aggregate scores for sets of documents?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Alas, the scoring is very simple: just what you see in the ScoreMode enum.

But this is something that we should fix, e.g. we should at least open
up a method so the app can do its own score aggregation.

What scoring/model do you have in mind?

Mike McCandless

http://blog.mikemccandless.com
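
As a sketch of what app-defined score aggregation could look like, the example below defines a small hook that reduces the scores of one parent's matching children to a single score. The ScoreAggregator interface is hypothetical and is not part of Lucene; the damping rule (sum divided by the square root of the child count) is just one assumed choice that sits between a plain total and an average.

```java
import java.util.Arrays;

/** Sketch of app-defined score aggregation; the interface is hypothetical, not a Lucene API. */
final class CustomAggregation {

    /** Hypothetical hook: reduces the scores of one parent's matching children to one score. */
    interface ScoreAggregator {
        double aggregate(double[] childScores);
    }

    /** Sum damped by the square root of the child count; no children score 0. */
    static final ScoreAggregator SQRT_DAMPED_SUM = childScores -> {
        if (childScores.length == 0) {
            return 0.0;
        }
        return Arrays.stream(childScores).sum() / Math.sqrt(childScores.length);
    };

    public static void main(String[] args) {
        double[] fourParts = {0.5, 0.5, 0.5, 0.5}; // made-up child scores
        double[] onePart = {1.0};                  // made-up child score

        System.out.println(SQRT_DAMPED_SUM.aggregate(fourParts)); // 2.0 / sqrt(4) = 1.0
        System.out.println(SQRT_DAMPED_SUM.aggregate(onePart));   // 1.0 / sqrt(1) = 1.0
    }
}
```

The numbers are chosen so that both parents end up with the same score; with real scores the damping would only reduce, not remove, the effect of the child count.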



Re: What is the best way to aggregate scores for sets of documents?

Posted by Manuel Amoabeng <ma...@vjoon.com>.

Thanks for pointing me to the lucene-join module.
Does the ToParentBlockJoinQuery produce the scores in a more sophisticated way than the ScoreMode enum suggests?
Actually, finding the related entities is not my problem; I am only having trouble producing scores that are consistent with the overall content of an article.

Thanks,

Manuel




Re: What is the best way to aggregate scores for sets of documents?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Maybe the join module fits here?  For example you can join "up" to a
single parent from multiple child hits.  I described one of the
options (now called ToParentBlockJoinQuery) here:
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
but there is also query-time joining now as well, which Martijn
described here:
http://blog.trifork.com/2012/01/22/query-time-joining-in-lucene/

Mike McCandless

http://blog.mikemccandless.com

