Posted to dev@lucene.apache.org by "Diego Ceccarelli (JIRA)" <ji...@apache.org> on 2016/03/10 18:08:41 UTC
[jira] [Comment Edited] (SOLR-8776) Support RankQuery in grouping

    [ https://issues.apache.org/jira/browse/SOLR-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189552#comment-15189552 ] 

Diego Ceccarelli edited comment on SOLR-8776 at 3/10/16 5:08 PM:
-----------------------------------------------------------------

[~joel.bernstein] thanks for pointing out the {MergeStrategy}. I uploaded a new patch as a first step. I agree that the merge strategy must stay there, which is why I wrote "partially moved" :) Just as there are IndexSearcher and SolrIndexSearcher, I moved {RankQuery} into Lucene and created {SolrRankQuery}. The reason is that {RankQuery} works by manipulating the collector, through this method:

{code:java}
public abstract TopDocsCollector getTopDocsCollector(int len, QueryCommand cmd, IndexSearcher searcher) throws IOException;
{code}

At the moment, if the query is a RankQuery, this is what happens in SolrIndexSearcher:
{code:java}
  private TopDocsCollector buildTopDocsCollector(int len, QueryCommand cmd) throws IOException {

    Query q = cmd.getQuery();
    if (q instanceof RankQuery) {
      RankQuery rq = (RankQuery) q;
      return rq.getTopDocsCollector(len, cmd, this);
    }
    ..
{code}

Instead of creating a top collector directly with {TopScoreDocCollector.create}, we wrap a top-score collector in a 'RankQuery' collector.
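
To make the wrapping idea concrete, here is a minimal, self-contained sketch. It does not use the Lucene or Solr classes: {TopCollector}, {SimpleTopScoreCollector}, {ReRankingCollector} and {ScoredDoc} are hypothetical stand-ins invented for illustration, and the rerank function is an assumed placeholder for whatever (expensive) scoring a RankQuery would apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Hypothetical, simplified stand-ins; these are NOT the Lucene/Solr classes.
final class RerankWrapperSketch {

  /** A (docId, score) pair. */
  static final class ScoredDoc {
    final int doc;
    final double score;
    ScoredDoc(int doc, double score) { this.doc = doc; this.score = score; }
  }

  /** Minimal collector contract: see every hit, then report the best ones. */
  interface TopCollector {
    void collect(int doc, double score);
    List<ScoredDoc> topDocs();
  }

  /** Keeps the k highest-scoring hits (plays the role of a top-score collector). */
  static final class SimpleTopScoreCollector implements TopCollector {
    private final int k;
    private final List<ScoredDoc> hits = new ArrayList<>();

    SimpleTopScoreCollector(int k) { this.k = k; }

    @Override public void collect(int doc, double score) {
      hits.add(new ScoredDoc(doc, score));
    }

    @Override public List<ScoredDoc> topDocs() {
      List<ScoredDoc> sorted = new ArrayList<>(hits);
      sorted.sort((a, b) -> Double.compare(b.score, a.score)); // highest score first
      return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
  }

  /** Wraps another collector and rescores only the docs that collector kept. */
  static final class ReRankingCollector implements TopCollector {
    private final TopCollector inner;
    private final IntToDoubleFunction rerankScore; // assumed expensive scorer

    ReRankingCollector(TopCollector inner, IntToDoubleFunction rerankScore) {
      this.inner = inner;
      this.rerankScore = rerankScore;
    }

    @Override public void collect(int doc, double score) {
      inner.collect(doc, score); // collection is delegated unchanged
    }

    @Override public List<ScoredDoc> topDocs() {
      List<ScoredDoc> reranked = new ArrayList<>();
      for (ScoredDoc d : inner.topDocs()) {
        reranked.add(new ScoredDoc(d.doc, rerankScore.applyAsDouble(d.doc)));
      }
      reranked.sort((a, b) -> Double.compare(b.score, a.score));
      return reranked;
    }
  }

  public static void main(String[] args) {
    TopCollector c = new ReRankingCollector(
        new SimpleTopScoreCollector(2),
        doc -> doc == 3 ? 100 : (doc == 1 ? 10 : 5));
    c.collect(1, 0.5);
    c.collect(2, 0.9);
    c.collect(3, 0.1);
    for (ScoredDoc d : c.topDocs()) {
      System.out.println(d.doc + " " + d.score);
    }
  }
}
```

In this sketch the expensive scorer is only ever called on the documents the wrapped collector retained, so document 3 is never rescored even though its rerank score would have been the highest.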

As a reminder, grouping works in two separate stages:
   * in the first stage, we iterate over the documents, scoring them, and keep a map {<group -> score>}, where score is the highest score of a document in the group (the map contains only the top-k groups with the highest scores);
   * in the second stage, for each of the top groups, the documents in the group are ranked and the top documents for each group are returned.

This logic is mainly implemented in {Abstract(First|Second)PassGroupingCollector} (within Lucene).
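
The two stages can be sketched in plain Java as follows. This is only an illustration over an in-memory list of hits, not the Lucene collectors: {Hit}, {topGroups} and {topDocsPerGroup} are hypothetical names, and a real implementation streams over index segments rather than holding all hits in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of two-pass grouping; not the Lucene code.
final class TwoPassGroupingSketch {

  static final class Hit {
    final String group;
    final int doc;
    final double score;
    Hit(String group, int doc, double score) {
      this.group = group; this.doc = doc; this.score = score;
    }
  }

  /** First pass: track the best score per group and keep the top-N groups. */
  static List<String> topGroups(List<Hit> hits, int topNGroups) {
    Map<String, Double> best = new HashMap<>();
    for (Hit h : hits) {
      best.merge(h.group, h.score, Double::max); // highest score seen in the group
    }
    List<String> groups = new ArrayList<>(best.keySet());
    groups.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
    return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
  }

  /** Second pass: for each selected group, keep its highest-scoring documents. */
  static Map<String, List<Hit>> topDocsPerGroup(
      List<Hit> hits, List<String> groups, int maxDocsPerGroup) {
    Map<String, List<Hit>> result = new LinkedHashMap<>();
    for (String g : groups) {
      result.put(g, new ArrayList<>());
    }
    for (Hit h : hits) {
      List<Hit> docs = result.get(h.group);
      if (docs != null) { // hits from groups outside the top-N are ignored
        docs.add(h);
      }
    }
    for (Map.Entry<String, List<Hit>> e : result.entrySet()) {
      List<Hit> docs = e.getValue();
      docs.sort((a, b) -> Double.compare(b.score, a.score));
      e.setValue(new ArrayList<>(docs.subList(0, Math.min(maxDocsPerGroup, docs.size()))));
    }
    return result;
  }
}
```

The group order produced by the first pass is preserved in the second pass, which is why the result map is a {LinkedHashMap} in this sketch.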

We should probably discuss what reranking means for groups. In my opinion we should keep in mind that the idea behind RankQuery is that you don't want to apply the query to all the documents in the collection, so "group reranking" should work as follows:

   1. in the first stage, we iterate over the documents, scoring them as usual, and keep a map {<group -> score>};
   2. for each group, RankQuery is applied to the top documents in the group;
   3. groups are reranked according to the new scores.
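
A rough sketch of steps 2 and 3, again in plain Java with hypothetical names ({Hit}, {rerankGroup}, {reorderGroups}) rather than the Lucene classes. It assumes that a group's new score is the best reranked score among its top documents; that choice is an assumption made for illustration, and other definitions are possible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

// Hypothetical illustration of reranking inside groups, then reordering groups.
final class GroupRerankSketch {

  static final class Hit {
    final String group;
    final int doc;
    final double score;
    Hit(String group, int doc, double score) {
      this.group = group; this.doc = doc; this.score = score;
    }
  }

  /** Step 2: rescore only the top documents already selected for one group. */
  static List<Hit> rerankGroup(List<Hit> topDocs, IntToDoubleFunction rerankScore) {
    List<Hit> out = new ArrayList<>();
    for (Hit h : topDocs) {
      out.add(new Hit(h.group, h.doc, rerankScore.applyAsDouble(h.doc)));
    }
    out.sort((a, b) -> Double.compare(b.score, a.score)); // best reranked doc first
    return out;
  }

  /**
   * Step 3: order groups by the best reranked score they contain.
   * Each list is assumed to be sorted by descending score (as rerankGroup returns it).
   */
  static List<String> reorderGroups(Map<String, List<Hit>> rerankedGroups) {
    List<String> groups = new ArrayList<>();
    for (Map.Entry<String, List<Hit>> e : rerankedGroups.entrySet()) {
      if (!e.getValue().isEmpty()) { // empty groups have no score to rank by
        groups.add(e.getKey());
      }
    }
    groups.sort((a, b) -> Double.compare(
        rerankedGroups.get(b).get(0).score,
        rerankedGroups.get(a).get(0).score));
    return groups;
  }
}
```

With this definition a group that ranked first on the original scores can drop below another group once its top documents are rescored.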

In this patch, I'm able to perform step 2. I had to move RankQuery into Lucene because, in {AbstractSecondPassGroupingCollector}, a collector is created for each group:

{code:java}
 for (SearchGroup<GROUP_VALUE_TYPE> group : groups) {
      //System.out.println("  prep group=" + (group.groupValue == null ? "null" : group.groupValue.utf8ToString()));
      TopDocsCollector<?> collector;
      if (withinGroupSort.equals(Sort.RELEVANCE)) { // optimize to use TopScoreDocCollector
        // Sort by score
        collector = TopScoreDocCollector.create(maxDocsPerGroup);
    ...
{code}

... so there is no way to 'inject' the reranking collector from Solr. Having moved RankQuery into Lucene, I changed the code to:

{code:java}
        collector = TopScoreDocCollector.create(maxDocsPerGroup);
        if (query != null && query instanceof RankQuery){
          collector = ((RankQuery)query).getTopDocsCollector(collector, null, searcher);
        }
{code}

and now the documents in each group are reranked. I'll now work on step 3, i.e., reordering the groups based on the new rerank scores (I added a new test that fails at the moment).
Happy to discuss this first change if you have comments.

Minor notes: 
  - At the moment {SolrRankQuery} doesn't extend {ExtendedQueryBase}; I have to check whether that is a problem. RankQuery could perhaps become an interface.
  - I made some changes to the signature of {RankQuery.getTopDocsCollector}: {QueryCommand} lived in Solr but was used only to get the {Sort}, and len was never used. The method now takes the previous collector as input, instead of creating a new top-score collector inside {RankQuery}.


> Support RankQuery in grouping
> -----------------------------
>
>                 Key: SOLR-8776
>                 URL: https://issues.apache.org/jira/browse/SOLR-8776
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: master
>            Reporter: Diego Ceccarelli
>            Priority: Minor
>             Fix For: master
>
>         Attachments: 0001-SOLR-8776-Support-RankQuery-in-grouping.patch, 0001-SOLR-8776-Support-RankQuery-in-grouping.patch
>
>
> Currently it is not possible to use RankQuery [1] and Grouping [2] together (see also [3]). In some situations Grouping can be replaced by Collapse and Expand Results [4] (that supports reranking), but i) collapse cannot guarantee that at least a minimum number of groups will be returned for a query, and ii) in the Solr Cloud setting you will have constraints on how to partition the documents among the shards.
> I'm going to start working on supporting RankQuery in grouping. I'll start attaching a patch with a test that fails because grouping does not support the rank query and then I'll try to fix the problem, starting from the non distributed setting (GroupingSearch).
> My feeling is that since grouping is mostly performed by Lucene, RankQuery should be refactored and moved (or partially moved) there. 
> Any feedback is welcome.
> [1] https://cwiki.apache.org/confluence/display/solr/RankQuery+API 
> [2] https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> [3] http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201507.mbox/%3CCAHM-LpuvsPEsT-Sw63_8a6gt-wOr6dS_T_Nb2rOpe93e4+sTNQ@mail.gmail.com%3E
> [4] https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results


