Posted to dev@lucene.apache.org by "Varun Thacker (JIRA)" <ji...@apache.org> on 2017/01/18 08:15:26 UTC
[jira] [Updated] (SOLR-9978) Reduce collapse query memory usage

     [ https://issues.apache.org/jira/browse/SOLR-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Thacker updated SOLR-9978:
--------------------------------
    Description: 
- Single shard test with one replica 
- 10M documents, of which 9M have a unique collapse value. The collapse field in this test was a string field.
- Collapse query parser creates two arrays :
  - int array for unique documents ( 9M in this case )
  - float array for the corresponding scores ( 9M in this case )
- It goes through all documents and stores a document in the array if its score is better than the score previously stored for the same collapse value.
- So collapse creates a lot of garbage when the total number of documents is high and there are very few duplicates.
- Even for a query like this {{q={!cache=false}*:*&fq={!collapse field=collapseField_s cache=false}&sort=id desc}}
  which has a top-level sort, the collapse query parser creates the score array and scores every document.
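
The behaviour described above can be modelled with a small standalone sketch. This is a simplified illustration, not Solr's actual collector code: the class name, the method names and the {{ords}} input (one collapse-value ordinal per document) are assumptions made for the example. It shows why two arrays sized to the number of unique values are allocated, and roughly how much payload they hold for 9M unique values.

```java
import java.util.Arrays;

// Simplified model of score-based collapsing (hypothetical names, not Solr's classes).
public class ScoreCollapseSketch {

    /**
     * ords[doc]   = ordinal of the document's collapse value (assumed input)
     * scores[doc] = score of the document
     * Returns, per ordinal, the doc ID with the best score (-1 if none was seen).
     */
    static int[] collapse(int[] ords, float[] scores, int valueCount) {
        int[] bestDoc = new int[valueCount];       // one int per unique value
        float[] bestScore = new float[valueCount]; // one float per unique value
        Arrays.fill(bestDoc, -1);
        for (int doc = 0; doc < ords.length; doc++) {
            int ord = ords[doc];
            // Keep the document only if it beats the score stored for this value.
            if (bestDoc[ord] == -1 || scores[doc] > bestScore[ord]) {
                bestDoc[ord] = doc;
                bestScore[ord] = scores[doc];
            }
        }
        return bestDoc;
    }

    /** Payload bytes of the two arrays, ignoring JVM object headers. */
    static long arrayBytes(long valueCount) {
        return valueCount * (Integer.BYTES + Float.BYTES);
    }

    public static void main(String[] args) {
        int[] ords = {0, 1, 1, 2};
        float[] scores = {1.0f, 0.5f, 0.9f, 0.3f};
        System.out.println(Arrays.toString(collapse(ords, scores, 3))); // [0, 2, 3]
        System.out.println(arrayBytes(9_000_000L));                     // 72000000
    }
}
```

Under these assumptions the two arrays hold about 72 MB of payload per query for 9M unique values, regardless of whether the scores are ever used for sorting.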


Indexing script used to generate dummy data:
{code}
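    // Index 10M documents; every 10th document (except i=0) reuses the collapse value of the document before it.
    // client is assumed to be a SolrJ SolrClient and "ct" the target collection.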
    List<SolrInputDocument> docs = new ArrayList<>(1000);
    for(int i=0; i<1000*1000*10; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", i);
      if (i%10 ==0 && i!=0) {
        doc.addField("collapseField_s", i-1);
      } else {
        doc.addField("collapseField_s", i);
      }
      docs.add(doc);
      if (docs.size() == 1000) {
        client.add("ct", docs);
        docs.clear();
      }
    }
    client.commit("ct");
{code}

Query:
{{q=\{!cache=false\}*:*&fq=\{!collapse field=collapseField_s cache=false\}&sort=id desc}}

Improvements:
- We currently default to the SCORE implementation if no min|max|sort param is provided in the collapse query. Instead, check whether a global sort is provided; if it is, skip scoring and pick the first occurrence of each unique value.
- Instead of creating an int array for the unique documents, use a bitset.
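
A minimal sketch of how the two improvements might fit together, assuming documents are visited in collection order and each carries a collapse-value ordinal. The class and method names are hypothetical and this is not the committed Solr change: it keeps the first document seen for each ordinal, never computes a score, and tracks state in two {{java.util.BitSet}} instances instead of an int array and a float array.

```java
import java.util.BitSet;

// Hypothetical sketch: first-occurrence collapse without scoring, backed by bitsets.
public class FirstOccurrenceCollapseSketch {

    /**
     * ords[doc] = ordinal of the document's collapse value (assumed input).
     * Returns a bitset with one bit set for each document that is kept.
     */
    static BitSet collapse(int[] ords, int valueCount) {
        BitSet seenOrds = new BitSet(valueCount); // one bit per unique value
        BitSet keptDocs = new BitSet(ords.length); // one bit per document
        for (int doc = 0; doc < ords.length; doc++) {
            int ord = ords[doc];
            if (!seenOrds.get(ord)) { // first occurrence of this value
                seenOrds.set(ord);
                keptDocs.set(doc);
            }
        }
        return keptDocs;
    }

    /** Payload bytes of a bitset with the given number of bits (64-bit words). */
    static long bitsetBytes(long bits) {
        return ((bits + 63) / 64) * Long.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(collapse(new int[]{0, 1, 1, 2}, 3)); // {0, 1, 3}
        System.out.println(bitsetBytes(10_000_000L));            // 1250000
        System.out.println(bitsetBytes(9_000_000L));             // 1125000
    }
}
```

For the test data in this issue, the two bitsets would hold roughly 1.25 MB and 1.125 MB of payload, compared with about 72 MB for an int array plus a float array of 9M entries each.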





> Reduce collapse query memory usage
> ----------------------------------
>
>                 Key: SOLR-9978
>                 URL: https://issues.apache.org/jira/browse/SOLR-9978
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker


