
Posted to solr-dev@lucene.apache.org by "Brian Whitman (JIRA)" <ji...@apache.org> on 2008/07/25 16:07:31 UTC

[jira] Created: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr
-----------------------------------------------------------------------------------------------

Key: SOLR-659
URL: https://issues.apache.org/jira/browse/SOLR-659
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.3
Reporter: Brian Whitman
Priority: Minor
Fix For: 1.3
Attachments: shards.start_rows.patch

By default, distributed Solr (SOLR-303) sets start to 0 on every shard and sets rows to start+rows on each shard. This ensures all results are returned for any arbitrary start and rows setting. During "bulk queries" (where start is increased incrementally and rows is held constant), however, every shard has to return start+rows documents, so retrieving many thousands of documents becomes intractable as start grows. For such queries the client needs finer control over the per-shard start and rows parameters.

Attaching a patch that adds the &shards.start and &shards.rows parameters. If they are used, the logic that sets rows to start+rows per shard is overridden and each shard gets the exact start and rows given in shards.start and shards.rows. The client will receive up to shards.rows * nShards results and should set rows accordingly. This makes bulk queries across distributed Solr possible.
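
To make the difference concrete, here is a minimal sketch of how the per-shard start and rows could be derived in the two cases. It is an illustration of the behavior described above, not code from the patch: the class ShardWindow and the method perShardParams are hypothetical names, and treating each override independently (rather than requiring both) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Solr or of the attached patch.
final class ShardWindow {
    /**
     * Returns the start/rows values that would be sent to every shard.
     * shardsStart and shardsRows are null when the client did not supply
     * shards.start / shards.rows.
     */
    static Map<String, Integer> perShardParams(int start, int rows,
                                               Integer shardsStart, Integer shardsRows) {
        Map<String, Integer> params = new LinkedHashMap<>();
        // Default behavior: start=0 and rows=start+rows on each shard.
        // Assumption: each override is applied independently when present.
        params.put("start", shardsStart != null ? shardsStart : 0);
        params.put("rows", shardsRows != null ? shardsRows : start + rows);
        return params;
    }
}
```

With start=100000 and rows=100 and no overrides, every shard is asked for 100100 rows. With shards.start=100000 and shards.rows=100, every shard is asked for only 100 rows beginning at its own offset 100000.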

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616903#action_12616903 ] 

Brian Whitman commented on SOLR-659:
------------------------------------

An example of a bulk query using this patch. Without this patch such bulk queries will eventually time out or cause exceptions in the server as too much data is passed back and forth.

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// getNumberOfHosts() and query(q) are helper methods of the enclosing class (not shown).
public SolrDocumentList blockQuery(SolrQuery q, int blockSize, int maxResults) {
    SolrDocumentList allResults = new SolrDocumentList();
    if (blockSize > maxResults) { blockSize = maxResults; }
    // i is the per-shard offset; it advances by blockSize on every pass
    for (int i = 0; i < maxResults; i += blockSize) {
        // Set rows to the most results that could ever come back: blockSize * the number of shards
        q.setRows(blockSize * getNumberOfHosts());
        // Don't set a start on the main query
        q.setStart(0);
        // But do set start and rows on the individual shards
        q.set("shards.start", String.valueOf(i));
        q.set("shards.rows", String.valueOf(blockSize));
        // Perform the query
        QueryResponse sub = query(q);
        // Stop early once no shard has anything left to return
        if (sub.getResults().isEmpty()) { break; }
        // Append each returned document (up to blockSize * getNumberOfHosts() of them) to the main result
        for (SolrDocument s : sub.getResults()) {
            allResults.add(s);
            // Break once we've reached the requested limit (>= so we never return maxResults + 1)
            if (allResults.size() >= maxResults) { break; }
        }
        if (allResults.size() >= maxResults) { break; }
    }
    return allResults;
}
{code}


[jira] Commented: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683803#action_12683803 ] 

Shalin Shekhar Mangar commented on SOLR-659:
--------------------------------------------

If I understand this correctly, it makes bulk queries cheaper at the expense of less precise scoring. But if I'm paging through some results and you modify shards.start and shards.rows, then I'll get inconsistent results. Is that correct?

bq. The client will receive up to shards.rows * nShards results and should set rows accordingly. This makes bulk queries across distributed solr possible.

I do not understand that. Why will the client get more than rows? Or by client, did you mean the solr server to which the initial request is sent?


[jira] Updated: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Whitman updated SOLR-659:
-------------------------------

    Attachment: SOLR-659.patch

New patch syncs w/ trunk


[jira] Updated: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas updated SOLR-659:
----------------------------

    Fix Version/s:     (was: 1.3)

IMO it is too late in the release process for new features.


[jira] Commented: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "johnson.hong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769114#action_12769114 ] 

johnson.hong commented on SOLR-659:
-----------------------------------

This is really helpful for bulk queries, but how should pagination of the query results be handled?
For example, on the first query I set shards.start to 0 and shards.rows to 30. It may return 50 documents; I show 30 of them and the other 20 are discarded. How do I then get the next 30 documents?
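
The thread does not answer this question. One possible client-side approach, sketched here purely as an assumption, is to keep the surplus documents in a buffer instead of discarding them and to advance shards.start by shards.rows only when the buffer runs low. BufferedPager and fetchBlock are hypothetical names; fetchBlock stands in for whatever code issues the distributed query with a given shards.start value.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical client-side pager for illustration; not part of Solr or SolrJ.
final class BufferedPager<T> {
    private final IntFunction<List<T>> fetchBlock; // argument: the shards.start value to use
    private final int shardsRows;                  // the shards.rows value used for every fetch
    private final Deque<T> buffer = new ArrayDeque<>();
    private int nextShardsStart = 0;
    private boolean exhausted = false;

    BufferedPager(IntFunction<List<T>> fetchBlock, int shardsRows) {
        if (shardsRows <= 0) {
            throw new IllegalArgumentException("shardsRows must be positive");
        }
        this.fetchBlock = fetchBlock;
        this.shardsRows = shardsRows;
    }

    /** Returns up to pageSize documents, fetching further blocks only when needed. */
    List<T> nextPage(int pageSize) {
        while (buffer.size() < pageSize && !exhausted) {
            List<T> block = fetchBlock.apply(nextShardsStart);
            nextShardsStart += shardsRows;
            if (block.isEmpty()) {
                exhausted = true; // no shard returned anything at this offset
            } else {
                buffer.addAll(block); // surplus documents stay buffered for later pages
            }
        }
        List<T> page = new ArrayList<>();
        while (page.size() < pageSize && !buffer.isEmpty()) {
            page.add(buffer.poll());
        }
        return page;
    }
}
```

In the example above, the 20 surplus documents would be served first on the next page, and a second query with shards.start=30 would be issued only to fill the remaining 10 slots. Pages built this way follow the order in which blocks arrive, not a global relevance order.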


[jira] Commented: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748986#action_12748986 ] 

Yonik Seeley commented on SOLR-659:
-----------------------------------

I agree this makes sense to enable efficient bulk operations, and also fits in with a past idea I had about mapping shards.param=foo to param=foo during a sub-request.
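
As a rough illustration of that mapping idea (my reading of it, not Solr's actual implementation), a sub-request could be built by copying the incoming parameters and letting every shards.X value replace X. ShardParamMapper and toSubRequest are hypothetical names, and single-valued string parameters are assumed for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of "shards.param=foo becomes param=foo on the sub-request".
final class ShardParamMapper {
    private static final String PREFIX = "shards.";

    static Map<String, String> toSubRequest(Map<String, String> request) {
        Map<String, String> sub = new LinkedHashMap<>();
        // First copy every parameter that does not carry the prefix.
        for (Map.Entry<String, String> e : request.entrySet()) {
            if (!e.getKey().startsWith(PREFIX)) {
                sub.put(e.getKey(), e.getValue());
            }
        }
        // Then let each shards.X override X (a bare "shards." key is ignored).
        for (Map.Entry<String, String> e : request.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                sub.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return sub;
    }
}
```

For a request with start=0, rows=200, shards.start=500 and shards.rows=100, this produces a sub-request with start=500 and rows=100, which matches the override behavior the patch describes.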

I'll give it a couple of days and commit if there are no objections.


[jira] Updated: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-659:
----------------------------------

    Fix Version/s: 1.4

This looks simple enough.  I haven't tried it.  Brian, do you have a unit test you could attach?

Or would it make more sense to have a custom QueryComponent for something like this? (I don't know yet)



[jira] Assigned: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley reassigned SOLR-659:
---------------------------------

    Assignee: Yonik Seeley


[jira] Updated: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Whitman updated SOLR-659:
-------------------------------

    Attachment: shards.start_rows.patch

Attaching patch.


[jira] Resolved: (SOLR-659) Explicitly set start and rows per shard for more efficient bulk queries across distributed Solr

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-659.
-------------------------------

    Resolution: Fixed

Thanks Brian, I just committed this.
