Posted to solr-user@lucene.apache.org by frank shi <fi...@gmail.com> on 2014/05/05 06:31:36 UTC

sort groups by the sum of the scores of the documents within each group

Currently, solr grouping (http://wiki.apache.org/solr/FieldCollapsing) sorts
groups "by the score of the top document within each group". E.g. 
[...] 
"groups":[{ 
    "groupValue":"81cb63020d0339adb019a924b2a9e0c2", 
    "doclist":{"numFound":9,"start":0,"maxScore":4.729042,"docs":[ 
        { 
          "id":"7481df771afe39fab368ce19dfeeb528", 
          [...], 
          "score":4.729042}, 
        { 
          "id":"c879e95b5f16343dad8b1248133727c2", 
          [...], 
          "score":4.6635237}, 
        { 
          "id":"485b9aec90fd3ef381f013c51ab6a4df", 
          [...], 
          "score":4.347174}] 
    }}, 
[...] 
Is there an out-of-the-box way to sort groups by the sum of the scores of
the documents within each group? E.g. 
[...] 
"groups":[{ 
    "groupValue":"81cb63020d0339adb019a924b2a9e0c2", 
    "doclist":{"numFound":9,"start":0,"scoreSum":13.739738,"docs":[ 
        { 
          "id":"7481df771afe39fab368ce19dfeeb528", 
          [...], 
          "score":4.729042}, 
        { 
          "id":"c879e95b5f16343dad8b1248133727c2", 
          [...], 
          "score":4.6635237}, 
        { 
          "id":"485b9aec90fd3ef381f013c51ab6a4df", 
          [...], 
          "score":4.347174}] 
    }}, 
[...] 
With the release of sorting by Function Query
(https://issues.apache.org/jira/browse/SOLR-1297), it seems that there
should be a way to use the sum() function
(http://wiki.apache.org/solr/FunctionQuery). But it's not quite close enough
since the "score" field is not part of the documents. 

I feel like I'm close but I'm missing some obvious piece. I'm using Solr
4.6.
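
As a quick check on the hypothetical "scoreSum" value above (a minimal sketch of the arithmetic only, not a Solr feature): adding the three displayed scores gives roughly 13.73974. Note that this covers only the three documents shown, while numFound is 9, so a sum over the whole group would also need the scores of the documents not listed.

```python
from math import fsum

# Scores of the three documents shown in the example group above.
displayed_scores = [4.729042, 4.6635237, 4.347174]

# fsum limits floating-point rounding error while adding the values.
score_sum = fsum(displayed_scores)

print(round(score_sum, 5))
```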



--
View this message in context: http://lucene.472066.n3.nabble.com/sort-groups-by-the-sum-of-the-scores-of-the-documents-within-each-group-tp4134607.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: sort groups by the sum of the scores of the documents within each group

Posted by Frankcis <fi...@gmail.com>.
Hi Erick, sorry to bother you again. I sent the client's requirement to you
on the Solr mailing list but have not received a reply. I would appreciate
your advice.


2014-05-06 13:24 GMT+08:00 Frankcis [via Lucene] <
ml-node+s472066n4134869h9@n3.nabble.com>:

> Thank you, Erick, you're a good man.
> This is the client's requirement:
> The forum contains many discussions, each filed under a subject. A keyword
> search returns the documents whose content or subject matches the query.
> The client wants these documents grouped by subject, and the groups sorted
> by the sum of the scores of the documents under each subject.
>
> I would be glad to hear your suggestions.





Re: sort groups by the sum of the scores of the documents within each group

Posted by Frankcis <fi...@gmail.com>.
Thank you, Erick, you're a good man.
This is the client's requirement:
The forum contains many discussions, each filed under a subject. A keyword
search returns the documents whose content or subject matches the query.
The client wants these documents grouped by subject, and the groups sorted
by the sum of the scores of the documents under each subject.

I would be glad to hear your suggestions.






Re: sort groups by the sum of the scores of the documents within each group

Posted by Erick Erickson <er...@gmail.com>.
Frankly, I really don't know how to make that happen. I took a quick
look at the function query stuff (I don't have them all memorized yet)
and I just can't seem to make them bend that way.

I can imagine writing custom code to make it work, but I don't really
know how much effort would be involved. I suspect it would not be
trivial.

What I'd do is go back to the client and ask _them_ why it would be
useful. I'd also give them an estimate for figuring out what is
necessary, say a week's worth of effort to scope the work involved,
and let _them_ figure out whether it is worth it. From my viewpoint, given that
the use of this feature is questionable at best, it's a service to the
client to force them to lay out a clear use-case for this capability
and also give them some kind of cost (in this case, just the cost to
figure out _how_ to do it, not actually do it).

Then they can make a rational decision whether the functionality is
worth it. One outcome for them is to say "yes, our use case is
compelling enough we're willing to pay you to figure out how to make
it happen". Another outcome is for them to say "Oh, if it's not OOB
functionality, it's not worth much effort". Yet a third response is
"You're right, that makes no sense whatsoever, don't bother".

Until and unless you give them the feedback that this is not OOB
functionality, and get them to explain why they think it's valuable
and let them know that it'll likely cost a significant amount, you're
not giving them the information to make a rational decision.

I've just seen way too many features implemented in various projects
that wind up taking a lot of effort without being useful...

There, rant finished.

Best,
Erick

On Mon, May 5, 2014 at 9:37 PM, Frankcis <fi...@gmail.com> wrote:
> Thank you, Erick, you're right: sorting by the maxScore of the documents
> within each group is more effective than sorting by the sum of the scores
> in a group, especially in cases like the one you described (group 1 could
> have 10M documents all with a score of .01 and group 2 could have 1
> document with a score of 1,000, and group 1 would sort first). But the
> client requires this function. Can you tell me how to achieve it?
>
>
>

Re: sort groups by the sum of the scores of the documents within each group

Posted by Frankcis <fi...@gmail.com>.
Thank you, Erick, you're right: sorting by the maxScore of the documents
within each group is more effective than sorting by the sum of the scores
in a group, especially in cases like the one you described (group 1 could
have 10M documents all with a score of .01 and group 2 could have 1
document with a score of 1,000, and group 1 would sort first). But the
client requires this function. Can you tell me how to achieve it?




Re: sort groups by the sum of the scores of the documents within each group

Posted by Erick Erickson <er...@gmail.com>.
You haven't answered _why_ this is a good idea. I'm having a hard
time understanding what would be _useful_ about sorting this way. Just
because the sum of scores in a group is greater than the sum of scores
in another says _nothing_ about how relevant any of the docs in the group
are relative to each other.

I mean group 1 could have 10M documents all with a score of .01 and group
2 could have 1 document with a score of 1,000 and group 1 would sort
first.
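
To put numbers on that example, here is a minimal sketch. It assumes every document in a group has exactly the stated score, so the group names and tuples below are only an illustration, not output from Solr.

```python
# The two hypothetical groups: (number of documents, score of each document).
groups = {
    "group 1": (10_000_000, 0.01),
    "group 2": (1, 1000.0),
}

# All documents in a group share one score here, so the sum is count * score
# and the maximum is simply the per-document score.
score_sums = {name: count * score for name, (count, score) in groups.items()}
max_scores = {name: score for name, (count, score) in groups.items()}

first_by_sum = max(score_sums, key=score_sums.get)
first_by_max = max(max_scores, key=max_scores.get)

print("first when sorting by sum of scores:", first_by_sum)
print("first when sorting by max score:", first_by_max)
```

Group 1 sums to roughly 100,000 against 1,000 for group 2, so it wins on the sum even though no single document in it is a strong match.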

So unless you have some unusual use-case which you haven't yet articulated,
this seems like a bad idea.

Best,
Erick

On Mon, May 5, 2014 at 7:20 PM, Frankcis <fi...@gmail.com> wrote:
> My schema.xml:
> <schema name="example core one" version="1.1">
>   <types>
>    <fieldtype name="string"  class="solr.StrField" sortMissingLast="true"
> omitNorms="true"/>
>    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
> positionIncrementGap="0"/>
>    <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
>    <fieldtype name="textComplex" class="solr.TextField"
> positionIncrementGap="100" omitNorms="false"
> autoGeneratePhraseQueries="false">
>    <analyzer type="query">
>                 <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
> mode="complex" dicPath="E:\solr-4.6.1\example\solr\dict"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>                 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="false" expand="true"/>
>         </analyzer>
>         <analyzer type="index">
>                 <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
> mode="complex" dicPath="E:\solr-4.6.1\example\solr\dict"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>                 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="false" expand="true"/>
>         </analyzer>
>   </fieldtype>
>   </types>
>
>  <fields>
>   <field name="id"                        type="uuid"            indexed="true"  stored="true"
> multiValued="false" required="true" />
>   <field name="name"      type="textComplex"    indexed="true"
> stored="true"  multiValued="false" />
>   <field name="type"      type="string"    indexed="true"  stored="true"
> multiValued="false" />
>   <field name="price"     type="long"            indexed="true"  stored="true" />
>
>   <field name="_version_" type="long"      indexed="true"  stored="true"/>
>  </fields>
>
>  <uniqueKey>id</uniqueKey>
>
>
>  <defaultSearchField>name</defaultSearchField>
>
>
>  <solrQueryParser defaultOperator="OR"/>
> </schema>
>
> update docs:
> "docs": [
>       {
>         "name": "苹果4s",
>         "type": "手机",
>         "price": 2000,
>         "id": "4017e35a-6b19-45b6-b945-382340ca1eec",
>         "_version_": 1466799722505175000
>       },
>       {
>         "name": "苹果5",
>         "type": "手机",
>         "price": 5000,
>         "id": "4052d9f3-f6d9-458f-8bb0-477b17852f37",
>         "_version_": 1466799735745544200
>       },
>       {
>         "name": "三星",
>         "type": "手机",
>         "price": 3000,
>         "id": "468abce8-8bb9-4f51-9900-8d4d6abc02ac",
>         "_version_": 1466799747596550100
>       },
>       {
>         "name": "摩托罗拉i3",
>         "type": "电脑",
>         "price": 1000,
>         "id": "db66bb02-3d6a-4ab0-9133-2e6e38b3d4dd",
>         "_version_": 1466799757491961900
>       },
>       {
>         "name": "摩托罗拉i5",
>         "type": "电脑",
>         "price": 1500,
>         "id": "f211525f-bc3c-4ea7-aded-1c46a94ecd1c",
>         "_version_": 1466799766311534600
>       }
>     ]
> Thank you, Erick.
> I want to sort groups by the sum of the documents' scores within each
> group. As you said, Solr excels at getting the score of single documents.
> In Solr 4.6, groups are sorted relative to each other by the maxScore of
> the documents within each group, not by the sum of the documents' scores.
> I can compute the sum of the documents' scores in the client program, but
> that is not a good idea. I know that Solr's stats component can compute
> statistics on a long field, so I had the idea of using it on the score
> field, but score is a pseudo-field and stats.field doesn't support it. In
> addition, as schema.xml shows, I group on the values of a string field
> (type), which is not tokenized.
>
>
>

Re: sort groups by the sum of the scores of the documents within each group

Posted by Frankcis <fi...@gmail.com>.
My schema.xml:
<schema name="example core one" version="1.1">
  <types>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
    <fieldtype name="textComplex" class="solr.TextField" positionIncrementGap="100"
               omitNorms="false" autoGeneratePhraseQueries="false">
      <analyzer type="query">
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex"
                   dicPath="E:\solr-4.6.1\example\solr\dict"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="false" expand="true"/>
      </analyzer>
      <analyzer type="index">
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex"
                   dicPath="E:\solr-4.6.1\example\solr\dict"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="false" expand="true"/>
      </analyzer>
    </fieldtype>
  </types>

  <fields>
    <field name="id" type="uuid" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="name" type="textComplex" indexed="true" stored="true" multiValued="false"/>
    <!-- "type" is the plain string field used for grouping -->
    <field name="type" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="price" type="long" indexed="true" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>
  <defaultSearchField>name</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>

update docs:
"docs": [
      {
        "name": "苹果4s",
        "type": "手机",
        "price": 2000,
        "id": "4017e35a-6b19-45b6-b945-382340ca1eec",
        "_version_": 1466799722505175000
      },
      {
        "name": "苹果5",
        "type": "手机",
        "price": 5000,
        "id": "4052d9f3-f6d9-458f-8bb0-477b17852f37",
        "_version_": 1466799735745544200
      },
      {
        "name": "三星",
        "type": "手机",
        "price": 3000,
        "id": "468abce8-8bb9-4f51-9900-8d4d6abc02ac",
        "_version_": 1466799747596550100
      },
      {
        "name": "摩托罗拉i3",
        "type": "电脑",
        "price": 1000,
        "id": "db66bb02-3d6a-4ab0-9133-2e6e38b3d4dd",
        "_version_": 1466799757491961900
      },
      {
        "name": "摩托罗拉i5",
        "type": "电脑",
        "price": 1500,
        "id": "f211525f-bc3c-4ea7-aded-1c46a94ecd1c",
        "_version_": 1466799766311534600
      }
    ]
Thank you, Erick.
I want to sort groups by the sum of the documents' scores within each
group. As you said, Solr excels at getting the score of single documents.
In Solr 4.6, groups are sorted relative to each other by the maxScore of
the documents within each group, not by the sum of the documents' scores.
I can compute the sum of the documents' scores in the client program, but
that is not a good idea. I know that Solr's stats component can compute
statistics on a long field, so I had the idea of using it on the score
field, but score is a pseudo-field and stats.field doesn't support it. In
addition, as schema.xml shows, I group on the values of a string field
(type), which is not tokenized.
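
For reference, the client-side approach mentioned above could look roughly like the sketch below. It assumes the grouped response has already been parsed into a list of groups shaped like the "groups" array in the original question, with a "score" on every returned document. The function name and the score values are hypothetical and only for illustration.

```python
from math import fsum


def sort_groups_by_score_sum(groups):
    """Return the groups ordered by the summed score of their returned docs, highest first."""
    def score_sum(group):
        return fsum(doc["score"] for doc in group["doclist"]["docs"])

    # sorted() is stable, so groups with equal sums keep the order Solr returned.
    return sorted(groups, key=score_sum, reverse=True)


# Hypothetical scores for the two "type" values in the sample documents.
groups = [
    {"groupValue": "电脑", "doclist": {"docs": [{"score": 3.0}]}},
    {"groupValue": "手机", "doclist": {"docs": [{"score": 2.0}, {"score": 1.5}]}},
]

ranked = sort_groups_by_score_sum(groups)
print([group["groupValue"] for group in ranked])
```

The sum only covers the documents Solr actually returns for each group, so group.limit would have to be large enough to include every document that should count, and the score has to be requested in fl. That cost is one reason this is not a good fit for large groups.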




Re: sort groups by the sum of the scores of the documents within each group

Posted by Erick Erickson <er...@gmail.com>.
I don't think so. Solr excels at getting the score of single
documents, not aggregation.

It's not at all clear to me, though, that the sum of documents' scores
is a reasonable thing to sort by. Consider grouping on a very common
term. You'd never do this, but group on the elements of a text field.
Then the group 'a' would sort to the top almost always (or maybe 'the'
or...).

This sounds like an XY problem, what use-case are you trying to solve?

Best,
Erick

On Sun, May 4, 2014 at 9:31 PM, frank shi <fi...@gmail.com> wrote:
> Currently, solr grouping (http://wiki.apache.org/solr/FieldCollapsing) sorts
> groups "by the score of the top document within each group". E.g.
> [...]
> "groups":[{
>     "groupValue":"81cb63020d0339adb019a924b2a9e0c2",
>     "doclist":{"numFound":9,"start":0,"maxScore":4.729042,"docs":[
>         {
>           "id":"7481df771afe39fab368ce19dfeeb528",
>           [...],
>           "score":4.729042},
>         {
>           "id":"c879e95b5f16343dad8b1248133727c2",
>           [...],
>           "score":4.6635237},
>         {
>           "id":"485b9aec90fd3ef381f013c51ab6a4df",
>           [...],
>           "score":4.347174}]
>     }},
> [...]
> Is there an out-of-the-box way to sort groups by the sum of the scores of
> the documents within each group? E.g.
> [...]
> "groups":[{
>     "groupValue":"81cb63020d0339adb019a924b2a9e0c2",
>     "doclist":{"numFound":9,"start":0,"scoreSum":13.739738,"docs":[
>         {
>           "id":"7481df771afe39fab368ce19dfeeb528",
>           [...],
>           "score":4.729042},
>         {
>           "id":"c879e95b5f16343dad8b1248133727c2",
>           [...],
>           "score":4.6635237},
>         {
>           "id":"485b9aec90fd3ef381f013c51ab6a4df",
>           [...],
>           "score":4.347174}]
>     }},
> [...]
> With the release of sorting by Function Query
> (https://issues.apache.org/jira/browse/SOLR-1297), it seems that there
> should be a way to use the sum() function
> (http://wiki.apache.org/solr/FunctionQuery). But it's not quite close enough
> since the "score" field is not part of the documents.
>
> I feel like I'm close but I'm missing some obvious piece. I'm using Solr
> 4.6.
>
>
>