Posted to solr-dev@lucene.apache.org by "Sharad Agarwal (JIRA)" <ji...@apache.org> on 2007/07/16 10:37:04 UTC

[jira] Created: (SOLR-303) Federated Search over HTTP

Federated Search over HTTP
--------------------------

Key: SOLR-303
URL: https://issues.apache.org/jira/browse/SOLR-303
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Sharad Agarwal
Priority: Minor

Motivated by http://wiki.apache.org/solr/FederatedSearch
The "Index view consistency between multiple requests" requirement is relaxed in this implementation.

This patch implements the query side of federated search. The update side is not done yet.

Tries to achieve:-
------------------------
- Client applications are completely agnostic to federated search. The federated search and the merging of results happen entirely behind the scenes, inside the Solr request handler. The response format remains the same after the results are merged.
The response from each shard is deserialized into a SolrQueryResponse object. The collection of SolrQueryResponse objects is then merged into a single SolrQueryResponse object. This allows the response writers to be used as they are, or with minimal changes.

- Efficient query processing, with highlighting and fields generated only for the merged documents. The query is executed in 2 phases. The first phase gets the document unique keys together with the sort criteria. The second phase fetches all requested fields and the highlighting information. This saves a lot of CPU when there are many shards and highlighting info is requested.
It should be easy to customize the query execution. For example, the user can specify that the query be executed in just 1 phase. (For some queries, when highlighting info is not required and the number of requested fields is small, this can be more efficient.)

- Ability to easily override the default federated capability with appropriate plugins and request parameters. Because federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml.

- Global weight calculation is done by querying the terms' doc frequencies from all shards.

- Federated search works over HTTP transport, so each shard's VIP can be queried. Load balancing and failover are taken care of by the VIP as usual.

- Sub-searcher response parsing is a plugin interface. Different implementations could be written based on JSON, XML SAX, etc. The current one is based on XML DOM.
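
The global weight calculation mentioned in the list above can be pictured as summing per-shard statistics before computing idf. The following is only a minimal sketch under stated assumptions: ShardStats and GlobalStats are hypothetical names (not classes from the patch), each shard is assumed to report its numDocs plus a term-to-docFreq map, and the idf formula is the classic Lucene-style 1 + ln(numDocs / (docFreq + 1)), used here purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for what one shard reports; not a class from the patch.
class ShardStats {
    final int numDocs;
    final Map<String, Integer> docFreqs;

    ShardStats(int numDocs, Map<String, Integer> docFreqs) {
        this.numDocs = numDocs;
        this.docFreqs = docFreqs;
    }
}

// Hypothetical aggregator: sums numDocs and per-term docFreqs over all shards.
class GlobalStats {
    int numDocs = 0;
    final Map<String, Integer> docFreqs = new HashMap<>();

    static GlobalStats sum(List<ShardStats> shards) {
        GlobalStats g = new GlobalStats();
        for (ShardStats s : shards) {
            g.numDocs += s.numDocs;
            for (Map.Entry<String, Integer> e : s.docFreqs.entrySet()) {
                g.docFreqs.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return g;
    }

    // Classic Lucene-style idf over the global statistics (illustration only).
    double idf(String term) {
        int df = docFreqs.getOrDefault(term, 0);
        return 1.0 + Math.log(numDocs / (double) (df + 1));
    }
}
```

With global numbers, a term that is rare on one shard but common overall gets the same weight on every shard, which is what makes scores from different shards comparable when they are merged.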

HOW:
-------
A new RequestHandler called MultiSearchRequestHandler performs the federated search on multiple sub-searchers (referred to as "shards" from here on). It extends RequestHandlerBase. The handleRequestBody method in RequestHandlerBase has been split into query-building and execute methods. This was done so that the global numDocs and docFreqs can be calculated and the query executed efficiently on multiple shards.
All "search" request handlers are expected to extend the MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.

Federated search kicks in if "shards" is present in the request parameters. Otherwise the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2 will search the local index and 2 remote indexes. The search responses from all 3 shards are merged and served back to the client.
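
As an illustration of how the "shards" parameter could be interpreted, here is a minimal sketch. ShardsParam is a hypothetical helper, not code from the patch, and treating an empty or blank value the same as a missing parameter is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the patch's actual parsing code is not shown in this thread.
class ShardsParam {
    // Name used in the example above for the local index.
    static final String LOCAL = "local";

    // Splits "local,host1:port1,host2:port2" into its entries.
    // Returns an empty list when the parameter is absent or blank (assumption).
    static List<String> parse(String shards) {
        List<String> result = new ArrayList<>();
        if (shards == null) {
            return result;
        }
        for (String s : shards.split(",")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Federated search is used only when at least one shard is listed.
    static boolean isFederated(String shards) {
        return !parse(shards).isEmpty();
    }
}
```

For the example value above, parse would yield three entries, the first of which equals ShardsParam.LOCAL.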

The search request processing on the set of shards is performed as follows:

STEP 1: The query is built and its terms are extracted. Global numDocs and docFreqs are calculated by querying all the shards and adding up the numDocs and docFreqs from each shard.

STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. Not all document fields are requested; only the document uniqFields and sort fields are. MoreLikeThis and highlighting information are NOT requested.

STEP 3: Responses from FirstQueryPhase are merged based on the "sort", "start" and "rows" params. The merged doc uniqFields and sort fields are collected. Other information, such as facet and debug info, is also merged.

STEP 4: (SecondQueryPhase) The merged doc uniqFields and sort fields are grouped by shard. Each shard in the grouping is queried for its merged doc uniqFields (from FirstQueryPhase), along with highlighting and moreLikeThis info.

STEP 5: Responses from all shards from SecondQueryPhase are merged.

STEP 6: Document fields, highlighting and moreLikeThis info from SecondQueryPhase are merged into the FirstQueryPhase response.
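
Steps 3 and 4 above can be sketched roughly as follows. This is a simplified illustration, not the patch's code: ShardHit and TwoPhaseMerge are hypothetical names, only sorting by descending score is handled (matching the TODO below), and each shard is assumed to have returned at least its top start + rows hits in the first phase.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical first-phase hit: unique key, score, and the shard it came from.
class ShardHit {
    final String uniqueKey;
    final float score;
    final String shard;

    ShardHit(String uniqueKey, float score, String shard) {
        this.uniqueKey = uniqueKey;
        this.score = score;
        this.shard = shard;
    }
}

class TwoPhaseMerge {
    // STEP 3: merge all first-phase hits by descending score and keep
    // the window [start, start + rows).
    static List<ShardHit> mergeFirstPhase(List<List<ShardHit>> perShard, int start, int rows) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((ShardHit h) -> h.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    // STEP 4: group the merged unique keys by shard, so that each shard is
    // asked only for the documents it owns.
    static Map<String, List<String>> groupByShard(List<ShardHit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (ShardHit h : merged) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.uniqueKey);
        }
        return byShard;
    }
}
```

Shards that contribute no document to the merged window do not appear in the grouping at all, so they are not contacted in the second phase.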

TODO:
- Support sort fields other than the default score
- Support ResponseDocs in writers other than XMLWriter
- HTTP connection timeouts

OPEN ISSUES:
- Merging of facets by "top n terms of field f"

Scope for performance optimization:
- Search shards in parallel threads
- HTTP connection Keep-Alive?
- Cache global numDocs and docFreqs
- Cache Query objects in handlers??

I would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by S DALAL <da...@gmail.com>.

Hi Zhang,
     Can you please give some more details about the error? Are you seeing
any exceptions? How are your partitions set up and what is the
request you are sending?

regards
dalal

On Nov 22, 2007 8:53 AM, zhang.zuxin (JIRA) <ji...@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544684 ]
>
> zhang.zuxin commented on SOLR-303:
> ----------------------------------
>
> to Sabyasachi Dalal:
> I updated Solr trunk to revision 597284 and the patch applied cleanly, but it doesn't work; it behaves as if distributed search were not supported.
> On the other hand, it works when I use Sharad Agarwal's patch. I don't know what's wrong; maybe you changed something?
>
> > Distributed Search over HTTP
> > ----------------------------
> >
> >                 Key: SOLR-303
> >                 URL: https://issues.apache.org/jira/browse/SOLR-303
> >             Project: Solr
> >          Issue Type: New Feature
> >          Components: search
> >            Reporter: Sharad Agarwal
> >         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
> >
> >
> > Searching over multiple shards and aggregating results.
> > Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557017#action_12557017 ] 

Ryan McKinley commented on SOLR-303:
------------------------------------

yonik, if you say "go", I'll add SOLR-446

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531997 ] 

Sharad Agarwal commented on SOLR-303:
-------------------------------------

>> Does this mean that this patch requires SOLR-281 to be applied first?
No. The current patch contains all the files. Once SOLR-281 gets into the trunk, this patch will need to be reworked.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557012#action_12557012 ] 

patrick o'leary commented on SOLR-303:
--------------------------------------

I was missing a file from an svn add, so the patch I uploaded is missing SolrFieldSortedHitQueue.
I'll remove it to reduce confusion.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548032 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

I've been prototyping distributed search in Python...
The current methods I have for a component are something like

{code}
  // returns the current stage this component is at... stage starts at -1 and the next stage is the minimum returned
  // by all components on the previous calls to process()
  int process(ResponseBuilder rb, int stage);

  // callback for a single response received (optional... this could be left out)
  // all components have this called, regardless of who queued the request
  void singleResponse(ResponseBuilder rb, int stage, Request req, Response rsp);

  // callback when all responses (from all shards) to a request have been received
  void allResponses(ResponseBuilder rb, int stage, Request req);
{code}

Any of these methods can add another request to the outgoing queue.  The current stage is only over after all
requests have been sent, responses received, and the outgoing queue is empty.
When all components return maxint from process(), we are done.
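
To make the stage rule concrete, here is a rough sketch of a driver loop. It is only an illustration under stated assumptions: StagedComponent and StageDriver are hypothetical names, the interface is reduced to the stage logic alone (no request queue, no response callbacks), and components are assumed to always return a stage greater than the one they were called with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reduction of the interface above to the stage logic only.
interface StagedComponent {
    // Returns the next stage this component wants to run at,
    // or Integer.MAX_VALUE when it has nothing left to do.
    int process(int stage);
}

class StageDriver {
    // Runs the components until all of them report Integer.MAX_VALUE.
    // Returns the sequence of stages that were executed.
    static List<Integer> run(List<StagedComponent> components) {
        List<Integer> executed = new ArrayList<>();
        int stage = -1; // stages start at -1
        while (stage != Integer.MAX_VALUE) {
            executed.add(stage);
            int next = Integer.MAX_VALUE;
            for (StagedComponent c : components) {
                next = Math.min(next, c.process(stage));
            }
            // In the design described above, queued requests would be sent and
            // all responses awaited here before the stage is considered over.
            if (next <= stage) {
                throw new IllegalStateException("components must advance the stage");
            }
            stage = next;
        }
        return executed;
    }
}
```

Taking the minimum means a component that needs an intermediate stage gets one, while components that asked for a later stage are simply called again with the lower stage number and can repeat their request.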


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543553 ] 

Sabyasachi Dalal commented on SOLR-303:
---------------------------------------

I mean I removed the files pertaining to SOLR-281. If you follow the discussion above, the files pertaining to SOLR-281 were added to this patch to make it easier to apply.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528253 ] 

Sharad Agarwal commented on SOLR-303:
-------------------------------------

>Is there any way ResponseDocs could extend Doclist so that all of the writers don't need to be modified?
ResponseDocs is based on the document unique key, while DocList is based on the internal doc id.
The purpose of ResponseDocs is to represent documents that live in a remote index, while DocList is meant for local internal doc ids.

I don't think there is an easy way to avoid modifying the writers. Currently the writers retrieve document data based on the local internal doc id, but for a remote index this has to be done differently.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Gereon Steffens (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556575#action_12556575 ] 

Gereon Steffens commented on SOLR-303:
--------------------------------------

Yonik, no matter what I try, I keep getting exceptions when querying anything that uses shards. 
Is the correct query URL still what I've used in my previous comment?

Excerpt from my logs:

{noformat}
SEVERE: org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:86)
[...]
Caused by: org.apache.solr.common.SolrException: /select

/select

request: http://localhost:8090/select?echoParams=explicit&q=id:1527426&start=0&rows=10&fsv=true&fl=id,score&isShard=true&wt=xml&version=2.2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
{noformat}

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512983 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Thanks for kicking this off Sharad!

> "Index view consistency between multiple requests" requirement is relaxed in this implementation. 

Do you have plans to remedy that?  Or do you think that most people are OK with inconsistencies that could arise?

> Load-balancing and Fail-over taken care by VIP as usual

In a static configuration, this works OK, but it might be nice to support a more dynamic environment where extra shards could be easily added.  It might also be the case that a custom partitioning function could be implemented (such as improving caching by partitioning queries, etc) or it may be more efficient to do the second phase of a query on the same shard copy as the first phase.
In that case it might make sense to do the load balancing across shards from within Solr. The VIP solution would map to the simplest case of a single copy of each shard, so a LB could still be used if desired.
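
One way to picture load balancing from within Solr, with the second phase going to the same shard copy as the first, is sketched below. Everything here is hypothetical (ShardCopyChooser, RequestRouting, and the round-robin policy are assumptions for illustration); nothing like it exists in the patch under discussion.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical: picks one copy of a logical shard in round-robin order.
class ShardCopyChooser {
    private final Map<String, List<String>> copiesByShard;
    private final Map<String, AtomicInteger> counters = new HashMap<>();

    ShardCopyChooser(Map<String, List<String>> copiesByShard) {
        this.copiesByShard = copiesByShard;
        for (String shard : copiesByShard.keySet()) {
            counters.put(shard, new AtomicInteger());
        }
    }

    String next(String shard) {
        List<String> copies = copiesByShard.get(shard);
        int i = Math.floorMod(counters.get(shard).getAndIncrement(), copies.size());
        return copies.get(i);
    }
}

// Hypothetical per-request state: remembers the copy picked in the first
// phase so that the second phase is sent to the same one.
class RequestRouting {
    private final ShardCopyChooser chooser;
    private final Map<String, String> pinned = new HashMap<>();

    RequestRouting(ShardCopyChooser chooser) {
        this.chooser = chooser;
    }

    String copyFor(String shard) {
        return pinned.computeIfAbsent(shard, chooser::next);
    }
}
```

With a single copy per shard, this degenerates to always returning that one address, which corresponds to the VIP case described above.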

> STEP 1: The query is built, terms are extracted. 

Where are terms extracted from (some queries require index access)?  This should be delegated to the shards, no?  It can be the same step that gets the docFreqs from the shards (pass the query, *not* the terms).  Step 1 should also be optional for those that can make do with local idf factors.
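
As a rough illustration of why global docFreqs matter, the sketch below sums per-shard statistics and compares local and global idf. The idf formula is the classic Lucene-style one and is used here only as an example; the shard numbers and function names are made up.

```python
import math

def global_stats(shard_stats):
    """Sum numDocs and per-term docFreqs reported by each shard."""
    num_docs = sum(s["numDocs"] for s in shard_stats)
    doc_freqs = {}
    for s in shard_stats:
        for term, df in s["docFreqs"].items():
            doc_freqs[term] = doc_freqs.get(term, 0) + df
    return num_docs, doc_freqs

def idf(doc_freq, num_docs):
    # Classic Lucene-style idf, for illustration only.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Made-up statistics: the term is rare on one shard and common on the other.
shards = [
    {"numDocs": 1000, "docFreqs": {"solr": 10}},
    {"numDocs": 1000, "docFreqs": {"solr": 400}},
]
num_docs, doc_freqs = global_stats(shards)

local_idfs = [idf(s["docFreqs"]["solr"], s["numDocs"]) for s in shards]
global_idf = idf(doc_freqs["solr"], num_docs)
```

With purely local idf the same term gets a different weight on each shard, so scores from different shards are not directly comparable. Whether that matters depends on how evenly terms are spread across the shards, which is why making this step optional seems reasonable.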

In order to facilitate custom logic in a distributed environment,
I think we should base the solution on something like
https://issues.apache.org/jira/browse/SOLR-281
With additional hooks for distributed search.
This should allow relatively independent parts of query processing to piggyback in the same network request (for example, the first steps to querying and faceting can be added to a single request, and highlighting and stored field retrieval can be done in conjunction).

Any thoughts on RMI vs HTTP for the searcher-subsearcher interface?  



> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Covers the query side of federated search. Updates are not yet done.
> Tries to achieve:
> ------------------------
> - Client applications are totally agnostic to federated search. The federated search and the merging of results happen entirely behind the scenes, inside the Solr request handler. The response format remains the same after the results are merged.
> The response from each individual shard is deserialized into a SolrQueryResponse object. The collection of SolrQueryResponse objects is merged to produce a single SolrQueryResponse object. This allows the response writers to be used as they are, or with minimal changes.
> - Efficient query processing, with highlighting and fields generated only for the merged documents. The query is executed in 2 phases. The first phase gets the doc unique keys along with the sort criteria. The second phase brings in all requested fields and the highlighting information. This saves a lot of CPU when there is a good number of shards and highlighting info is requested.
> It should be easy to customize the query execution. For example, the user can specify that the query be executed in just 1 phase. (For some queries, when highlighting info is not required and the number of requested fields is small, this can be more efficient.)
> - Ability to easily override the default federated capability through appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml.
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works over HTTP transport, so an individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> - Sub-searcher response parsing is a plugin interface. Different implementations could be written based on JSON, XML SAX, etc. The current one is based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler performs the federated search on multiple sub-searchers (referred to as "shards" from here on). It extends RequestHandlerBase. The handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done so that global numDocs and docFreqs can be calculated and the query can be executed efficiently on multiple shards.
> All "search" request handlers are expected to extend the MultiSearchRequestHandler class in order to enable the federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>
> Federated search kicks in if "shards" is present in the request parameters. Otherwise the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2 will search the local index and 2 remote indexes. The search responses from all 3 shards are merged and served back to the client.
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by querying all the shards and adding up the numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. The global numDocs and docFreqs are passed as request parameters. Not all document fields are requested: only the document uniqFields and sort fields are. MoreLikeThis and highlighting information are NOT requested.
> STEP 3: The responses from FirstQueryPhase are merged based on the "sort", "start" and "rows" params. The merged doc uniqFields and sort fields are collected. Other information, such as facet and debug info, is also merged.
> STEP 4: (SecondQueryPhase) The merged doc uniqFields and sort fields are grouped by shard. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: The responses from all shards from SecondQueryPhase are merged.
> STEP 6: The document fields, highlighting and moreLikeThis info from SecondQueryPhase are merged into the FirstQueryPhase response.
> TODO:
> - Support sort fields other than the default score
> - Support ResponseDocs in writers other than XMLWriter
> - HTTP connection timeouts
> OPEN ISSUES:
> - Merging of facets by "top n terms of field f"
> Scope for performance optimization:
> - Search shards in parallel threads
> - HTTP connection Keep-Alive?
> - Cache global numDocs and docFreqs
> - Cache Query objects in handlers?
> I would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.
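
The merge in STEP 3 and the grouping in STEP 4 of the quoted description can be sketched roughly as follows. This is a simplified illustration, not the patch's code: it assumes sorting by descending score only, that each shard returns its own top start+rows hits already sorted, and that there are no score ties within a shard. The function names and sample data are invented for the example.

```python
import heapq
from collections import defaultdict

def merge_first_phase(shard_results, start, rows):
    """Merge per-shard (score, id) lists, each sorted by descending score."""
    # Negate scores so that heapq.merge (ascending order) yields best hits first.
    tagged = (
        [(-score, doc_id, shard) for score, doc_id in hits]
        for shard, hits in shard_results.items()
    )
    merged = list(heapq.merge(*tagged))
    page = merged[start:start + rows]
    return [(doc_id, shard, -neg_score) for neg_score, doc_id, shard in page]

def group_by_shard(page):
    """Group the merged unique keys by the shard that owns them."""
    groups = defaultdict(list)
    for doc_id, shard, _score in page:
        groups[shard].append(doc_id)
    return dict(groups)

# First-phase responses: only unique keys and sort values, no stored fields.
first_phase = {
    "host1:8983/solr": [(0.9, "a"), (0.4, "b")],
    "host2:8983/solr": [(0.7, "c"), (0.1, "d")],
}
page = merge_first_phase(first_phase, start=0, rows=3)

# Second phase: each shard is asked only for the keys it contributed.
second_phase_requests = group_by_shard(page)
```

Document "d" never reaches the second phase, so no stored fields or highlighting are generated for it. That is where the CPU saving described above comes from.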


[jira] Issue Comment Edited: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613969#action_12613969 ] 

bwhitman edited comment on SOLR-303 at 7/16/08 7:28 AM:
-------------------------------------------------------------

I am getting "Form too large" from Jetty on otherwise normal shards requests with a large rows value (rows=40000). Is this related to SOLR-612?

The query was: http://x.x.x.x/solr/search?q=*:*&sort=indexed%20desc&fl=indexed&rows=40000, where x.x.x.x is a single shard and /search has the shards ivars mapped to it in solrconfig.

(Sorry for the mess, but that's how it appears)

Form_too_large__javalangIllegalStateException_
Form_too_large__at_orgmortbayjettyRequestextractParametersRequestjava1273__at_
orgmortbayjettyRequestgetParameterMapRequestjava650__at_
orgapachesolrrequestServletSolrParamsinitServletSolrParamsjava29__at_
orgapachesolrservletStandardRequestParserparseParamsAndFillStreamsSolrRequestParsersjava392__at_
orgapachesolrservletSolrRequestParsersparseSolrRequestParsersjava113__at_
orgapachesolrservletSolrDispatchFilterdoFilterSolrDispatchFilterjava240__at_
orgmortbayjettyservletServletHandler$CachedChaindoFilterServletHandlerjava1089__at_
orgmortbayjettyservletServletHandlerhandleServletHandlerjava365__at_
orgmortbayjettysecuritySecurityHandlerhandleSecurityHandlerjava216__at_
orgmortbayjettyservletSessionHandlerhandleSessionHandlerjava181__at_
orgmortbayjettyhandlerContextHandlerhandleContextHandlerjava712__at_
orgmortbayjettywebappWebAppContexthandleWebAppContextjava405__at_
orgmortbayjettyhandlerContextHandlerCollectionhandleContextHandlerCollectionjava211__at_
orgmortbayjettyhandlerHandlerCollectionhandleHandlerCollectionjava114__at_
orgmortbayjettyhandlerHandlerWrapperhandleHandlerWrapperjava139__at_
orgmortbayjettyServerhandleServerjava285__at_
orgmortbayjettyHttpConnectionhandleRequestHttpConnectionjava502__at_
orgmortbayjettyHttpConnection$RequestHandlercontentHttpConnectionjava835__at_
orgmortbayjettyHttpParserparseNextHttpParserjava641__at_
orgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_
orgmortbayjettyHttpConnectionhandleHttpConnectionjava378__at_
orgmortbayjettybioSocketConnector$ConnectionrunSocketConnectorjava226__at_
orgmortbaythreadBoundedThreadPool$PoolThreadrunBoundedThreadPooljava442_

request: http://x.x.x.x.y/solr/select (ed: this was a different shard than the one I called)

request: http://x.x.x.y/solr/select
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
	at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:371)
	at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:345)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
	at java.lang.Thread.run(Thread.java:619)



  
> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557525#action_12557525 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Note that for a normal facet query, this could result in 3 waves of requests.
1) query + facet
2) facet refinements
3) retrieve stored fields + highlight

We probably want to allow #2 to piggyback on #3 requests, provided that nothing needs final facet values before retrieving the stored fields.
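
A toy sketch of the piggybacking idea, assuming nothing needs the final facet values before the stored fields are fetched. The stage names and the plan_waves function are made up for the illustration.

```python
def plan_waves(needs_facet_refinement, piggyback=True):
    """Return the list of request waves, each a list of stages sent together."""
    waves = [["query", "facet"]]
    if needs_facet_refinement and not piggyback:
        # Refinement gets a network round trip of its own.
        waves.append(["facet_refinement"])
    last = ["stored_fields", "highlight"]
    if needs_facet_refinement and piggyback:
        # Refinement rides along with the stored-field retrieval.
        last = ["facet_refinement"] + last
    waves.append(last)
    return waves

separate = plan_waves(True, piggyback=False)   # 3 waves
combined = plan_waves(True, piggyback=True)    # 2 waves
```

The saving is one round trip to the shards per faceted query.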

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528083 ] 

Stu Hood commented on SOLR-303:
-------------------------------

I was trying to use the PHP serialized response writer with the federated search patch, and ran into some trouble. Then I noticed that you had made some changes in XMLWriter to support the federated.ResponseDocs class.

Is there any way ResponseDocs could extend DocList so that all of the writers don't need to be modified?

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556612#action_12556612 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

There is currently no "local" shard... is that causing your problem?
Use something like shards=localhost:8983/solr,localhost:8080/solr
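
For example, a request URL carrying an explicit shards list could be built like this. It is a small sketch: the /select path and the q value are only placeholders, and the safe characters are left unescaped purely for readability.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Every shard is listed explicitly, including the one on this host.
shards = ["localhost:8983/solr", "localhost:8080/solr"]
params = {"q": "*:*", "shards": ",".join(shards)}

# Keep ':', '/', ',' and '*' unescaped so the URL stays readable.
url = "http://localhost:8983/solr/select?" + urlencode(params, safe=":/,*")

# Round trip: what a receiver would see for the shards parameter.
parsed_shards = parse_qs(urlsplit(url).query)["shards"][0].split(",")
```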

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Jayson Minard (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jayson Minard updated SOLR-303:
-------------------------------

    Attachment: distributed_facet_count_bugfix.patch

Attached patch to fix issue with distributed search.  If you specified a facet.field that was valid for the schema but not contained in a shard, an unintentional exception (array index out of bounds) would be thrown instead of returning the facet as empty.
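
The intended behavior can be illustrated with a small sketch: when merging facet counts, a shard that has no values for the field is treated as contributing an empty facet. This is not the code from the attached patch; the function and the data layout are assumptions made for the example.

```python
from collections import Counter

def merge_facet_counts(shard_facets, field):
    """Sum facet counts for one field across the responses of all shards."""
    total = Counter()
    for facets in shard_facets:
        # A shard with no values for the field may not report it at all;
        # treat that as an empty facet instead of failing.
        total.update(facets.get(field, {}))
    return dict(total)

responses = [
    {"cat": {"books": 2, "music": 1}},
    {},                      # field is valid in the schema but absent on this shard
    {"cat": {"books": 3}},
]
merged = merge_facet_counts(responses, "cat")
empty = merge_facet_counts([{}, {}], "cat")
```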

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated SOLR-303:
--------------------------------

    Attachment: fedsearch.patch

Hi Stu, I have merged your fixes into my version of the patch.

It also includes the following changes:

-> Based the solution on SOLR-281 and did away with the MultiSearchRequestHandler base class. The federated features are now plain components that can be plugged in alongside regular components such as QueryComponent and HighlightComponent.
This makes it very easy to override the core federated functionality.

-> Renamed the federated components to:
GlobalCollectionStatComponent
MainQPhaseComponent
AuxiliaryQPhaseComponent

-> The request params are now URL-encoded in XMLResponseParser.
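
The component-based design can be pictured with a minimal sketch like the one below. The class names are simplified stand-ins for the components listed above, and the process signature is an assumption, not the interface from SOLR-281.

```python
class SearchComponent:
    """Hypothetical base class: each component adds its part to the response."""

    def process(self, request, response):
        raise NotImplementedError

class MainQueryPhase(SearchComponent):
    def process(self, request, response):
        # Placeholder for the merged unique keys from the first phase.
        response["ids"] = ["doc1", "doc2"]

class AuxiliaryQueryPhase(SearchComponent):
    def process(self, request, response):
        # Placeholder for fetching stored fields for the merged keys.
        response["docs"] = [{"id": doc_id} for doc_id in response["ids"]]

def handle(request, components):
    """Run the configured components in order against one shared response."""
    response = {}
    for component in components:
        component.process(request, response)
    return response

# Different handlers can be configured with different chains.
two_phase = handle({}, [MainQueryPhase(), AuxiliaryQueryPhase()])
one_phase = handle({}, [MainQueryPhase()])
```

Overriding the federated behavior then means swapping or removing a component in the chain rather than subclassing a handler.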





> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611230#action_12611230 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Fixed "debugQuery on a query with shards that returns 0 results will NPE".
There are still some issues with debugQuery=true, but it's not critical since it is just debugging.  I'll open another issue for that.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554513 ] 

Ryan McKinley commented on SOLR-303:
------------------------------------

I just took a quick look...  a few observations:

We should extract out a few simple things and commit them quickly to make this go more smoothly:
# move SearchHandler to o.a.s.handler.component -- I vote you go ahead and commit that change.
# Create a separate issue for adding SolrDocument to XMLWriter
# Move solrj into the main source tree.  I'm not sure of the best way to do this, but I don't think solrj should sit in its own source folder if the core depends on it.


Is there a good reason to use the same handler for distributed search?  Why not have a DistributedSearchHandler that extends SearchHandler and skip the if {} else {} checking?  Likewise, I wonder if a DistributedResponseBuilder could/should extend ResponseBuilder and add the necessary logic.
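
A minimal sketch of the subclassing idea, using hypothetical stand-in classes (LocalSearchHandlerSketch and DistributedSearchHandlerSketch are invented for illustration and are not the real Solr classes or signatures). The base class stays free of any "shards" check, and the subclass always fans out:

```java
import java.util.List;

// Hypothetical stand-in for a single-index search handler (not the real Solr class).
class LocalSearchHandlerSketch {
    String handle(String query) {
        return "local(" + query + ")";
    }
}

// Hypothetical subclass: it always fans out to its shards, so the base class
// needs no if/else on a "shards" parameter.
class DistributedSearchHandlerSketch extends LocalSearchHandlerSketch {
    private final List<String> shards;

    DistributedSearchHandlerSketch(List<String> shards) {
        this.shards = shards;
    }

    @Override
    String handle(String query) {
        StringBuilder merged = new StringBuilder("merged[");
        for (int i = 0; i < shards.size(); i++) {
            if (i > 0) {
                merged.append(", ");
            }
            // A real handler would issue an HTTP request here; this sketch
            // only records which shard would be asked.
            merged.append(shards.get(i)).append(":").append(query);
        }
        return merged.append("]").toString();
    }
}
```

Under this assumption, each handler would be registered separately in solrconfig.xml, and the choice between local and distributed search would be made by picking the handler rather than by branching inside one.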



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513167 ] 

Sharad Agarwal commented on SOLR-303:
-------------------------------------

> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
>>Do you have plans to remedy that? Or do you think that most people are OK with inconsistencies that could arise?
The thing to note here is that multi-phase execution is currently based on document unique fields, NOT on internal doc ids. So there won't be many inconsistencies between requests, since nothing depends on internal doc ids that can change.
One possibility is that a particular document has been deleted by the time the second phase executes, which in my opinion should be OK to live with.
Another possibility is that the document has changed and the original query terms are no longer present in it. This can be solved by ANDing the original query with the uniq field document query.
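
A rough sketch of how such a second-phase query string could be built, assuming Lucene's classic query parser syntax. The class SecondPhaseQuerySketch, its method names, and the field name "id" in the usage note are made up for illustration; the patch may build the query differently:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: restricts the original query to the merged unique keys.
final class SecondPhaseQuerySketch {

    // Escape backslashes and double quotes so a key can sit inside a quoted phrase.
    static String quote(String key) {
        return "\"" + key.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds: +(originalQuery) +uniqField:("k1" OR "k2" ...)
    // Both clauses are required, so a document that no longer matches the
    // original query is dropped even if its unique key is still present.
    static String build(String originalQuery, String uniqField, List<String> keys) {
        if (keys.isEmpty()) {
            throw new IllegalArgumentException("at least one unique key is required");
        }
        String keyClause = keys.stream()
                .map(SecondPhaseQuerySketch::quote)
                .collect(Collectors.joining(" OR "));
        return "+(" + originalQuery + ") +" + uniqField + ":(" + keyClause + ")";
    }
}
```

For example, build("title:solr", "id", Arrays.asList("doc1", "doc 2")) yields +(title:solr) +id:("doc1" OR "doc 2").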

If people think it is really crucial to have index view consistency, then it should be easy to implement "Consistency via Retry" as mentioned in http://wiki.apache.org/solr/FederatedSearch 
"Consistency via specifying Index version" would be a little more involved. Session management with "Sticky" load balancers could be explored.

>>It might also be the case that a custom partitioning function could be implemented (such as improving caching by partitioning queries, etc) or it may be more efficient to do the second phase of a query on the same shard copy as the first phase.
>>In that case it might make sense load balancing across shards from Solr. 
For the second phase of a query to execute on the same shard copy, third-party "sticky" load balancers can be used. I believe Apache already does that. All copies of a single partition can sit behind the Apache load balancer (which does the "sticky" routing). The merger just needs to know the load balancer's ip/port for each partition. Then, based on the query, the merger can search only the appropriate partitions.

To improve caching, Solr itself would have to do the load balancing. Another option could be to introduce a query result cache at the merger itself.

>>Where are terms extracted from (some queries require index access)? This should be delegated to the shards, no? It can be the same step that gets the docFreqs from the shards (pass the query, *not* the terms). 
Yes, if that's the case, it should be easy to implement as you have suggested.
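
As an illustration of the docFreq step, here is a sketch of summing per-shard numDocs and docFreqs into global values. GlobalStatsSketch and ShardStats are hypothetical names, and the idf formula is the classic Lucene DefaultSimilarity one, assumed here only to show why global counts matter:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical aggregate of the statistics returned by each shard.
final class GlobalStatsSketch {

    // What one shard is assumed to report for the query's terms.
    static final class ShardStats {
        final long numDocs;
        final Map<String, Long> docFreqs;

        ShardStats(long numDocs, Map<String, Long> docFreqs) {
            this.numDocs = numDocs;
            this.docFreqs = docFreqs;
        }
    }

    long numDocs;
    final Map<String, Long> docFreqs = new HashMap<>();

    // Adds up numDocs and per-term docFreqs across all shards.
    static GlobalStatsSketch sum(List<ShardStats> shards) {
        GlobalStatsSketch global = new GlobalStatsSketch();
        for (ShardStats shard : shards) {
            global.numDocs += shard.numDocs;
            for (Map.Entry<String, Long> e : shard.docFreqs.entrySet()) {
                global.docFreqs.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return global;
    }

    // Classic idf: ln(numDocs / (docFreq + 1)) + 1, computed from global counts
    // so that every shard scores a term with the same weight.
    double idf(String term) {
        long df = docFreqs.getOrDefault(term, 0L);
        return Math.log(numDocs / (double) (df + 1)) + 1.0;
    }
}
```

Whether the shards receive the terms or the whole query, the merger-side summation would look the same; only the request that produces each ShardStats changes.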

>>I think we should base the solution on something like https://issues.apache.org/jira/browse/SOLR-281 
Cool, I was looking for something like this. This looks like the way to go.

>>Any thoughts on RMI vs HTTP for the searcher-subsearcher interface? 
RMI could be supported as an option by enhancing the ResponseParser (better name ??) interface. The remote search server could then return the SolrQueryResponse object directly. I understand there would be some performance benefit from native Java marshalling/unmarshalling of objects, instead of writing the Solr response and then parsing it (as the HTTP approach does). The question we need to answer is: is the effort/complexity worth it?

In our organization we made a conscious decision to go for HTTP. The operations folks like HTTP because it is standard: load balancing, monitoring, etc., and lots of tools are already available for it. With RMI, I am not sure external sticky load balancing is possible; the merger itself would have to build that logic.
Moreover, I think HTTP fits more naturally with Solr's request handler model.





> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. All document fields are NOT requested, only document uniqFields and sort fields are requested. MoreLikeThis and Highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields , highlighting and moreLikeThis info from SecondQueryPhase are merged into FirstQueryPhase response.
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES;
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> Would appreciate feedback on my approach. I understand that there would be lot things I might have over-looked. 
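
STEP 3 and STEP 4 of the quoted description can be sketched roughly as follows, assuming the default sort by descending score. ShardMergeSketch, Hit, and their fields are hypothetical names for illustration and are not taken from the patch:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ShardMergeSketch {

    // One first-phase result: only the shard, the unique key and the sort value.
    static final class Hit {
        final String shard;
        final String uniqKey;
        final float score;

        Hit(String shard, String uniqKey, float score) {
            this.shard = shard;
            this.uniqKey = uniqKey;
            this.score = score;
        }
    }

    // STEP 3: merge all shard results by descending score (ties broken by key),
    // then keep the window selected by "start" and "rows".
    static List<Hit> merge(List<List<Hit>> perShard, int start, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
        all.sort(byScoreDesc.thenComparing((Hit h) -> h.uniqKey));
        int from = Math.min(start, all.size());
        int to = (int) Math.min((long) from + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    // STEP 4: group the merged unique keys by shard, so that each shard is
    // asked only for the documents it contributed.
    static Map<String, List<String>> groupByShard(List<Hit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (Hit h : merged) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.uniqKey);
        }
        return byShard;
    }
}
```

For this to be correct, each shard has to return at least start + rows hits in the first phase, since any of them may end up inside the merged window.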


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
----------------------------------

    Attachment: fedsearch.patch

I have updated the patch to remove the code pertaining to SOLR-281, because 281 has been committed.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528674 ] 

Sharad Agarwal commented on SOLR-303:
-------------------------------------

Thanks much, Stu, for pointing out the issues. I will take care of these in the next update.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614708#action_12614708 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

Lars- I'm using the jetty that comes with solr-trunk, jetty-6.1.3.

I found this: http://webteam.archive.org/jira/browse/HER-1173#action_14736

This indicates that the corresponding Jetty 6 property is org.mortbay.jetty.Request.maxFormContentSize.

I set that to 1000000, restarted my shards, and queries with &rows=40000 now work. So for those who have this problem, start jetty with:

java -Dorg.mortbay.jetty.Request.maxFormContentSize=1000000 -jar start.jar

I would only suggest that the jetty.xml included in the solr example somehow get this parameter hardcoded (I don't know how, personally). I understand this is not a solr issue, but it does cause a non-obvious result for an obvious query.




> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581747#action_12581747 ] 

Stu Hood commented on SOLR-303:
-------------------------------

Because the subqueries to Solr shards use GET requests (via SolrJ), they are limited in the number of documents they can request during the second phase by the maximum length of the query string.

One (API preserving) solution would be to modify SolrJ to use a POST request for queries if the query string is longer than some constant value.
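
A minimal sketch of that length-based switch. HttpMethodChooserSketch, its Method enum, and the 2048-byte default are assumptions made for illustration; they are not SolrJ API, and a sensible threshold depends on the servlet container and any proxies in between:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: picks the HTTP method from the encoded query string length.
final class HttpMethodChooserSketch {

    enum Method { GET, POST }

    // Assumed threshold; not a value taken from SolrJ or from this issue.
    static final int DEFAULT_MAX_GET_BYTES = 2048;

    // Uses GET for short query strings and falls back to POST once the
    // UTF-8 encoded length exceeds the given limit.
    static Method choose(String queryString, int maxGetBytes) {
        int length = queryString.getBytes(StandardCharsets.UTF_8).length;
        return length > maxGetBytes ? Method.POST : Method.GET;
    }

    static Method choose(String queryString) {
        return choose(queryString, DEFAULT_MAX_GET_BYTES);
    }
}
```

Because the decision is made inside the client, callers keep issuing the same query call; only second-phase requests that carry many unique keys would end up as POST.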

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Sean Timm (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556944#action_12556944 ] 

Sean Timm commented on SOLR-303:
--------------------------------

I'm receiving both patch errors and compile errors from Yonik's latest patch (03/Jan/08) against head on the trunk (r. 610010).  I ignored the errors on the two Test files.  The patch fails to remove handler/SearchHandler.java:
{noformat}
% patch -p0 -u < ~/distributed.patch
[...]
patching file src/java/org/apache/solr/handler/SearchHandler.java
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 1.
File src/java/org/apache/solr/handler/SearchHandler.java is not empty after patch, as expected
1 out of 1 hunk FAILED -- saving rejects to file src/java/org/apache/solr/handler/SearchHandler.java.rej
{noformat}

After removing handler/SearchHandler.java, the build errors that I am getting are:
{noformat}
% ant compile
Buildfile: build.xml

init-forrest-entities:

compile-common:

compile:
    [javac] Compiling 84 source files to /home/timmsc/svn.apache.org/lucene/solr/trunk/build/core
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:303: 'class' or 'interface' expected
    [javac] package org.apache.solr.handler.component;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:305: 'class' or 'interface' expected
    [javac] import org.apache.solr.handler.RequestHandlerBase;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:306: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.util.NamedList;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:307: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.util.RTimer;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:308: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.util.SimpleOrderedMap;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:309: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.params.CommonParams;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:310: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.params.ModifiableSolrParams;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:311: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.SolrException;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:312: 'class' or 'interface' expected
    [javac] import org.apache.solr.request.SolrQueryRequest;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:313: 'class' or 'interface' expected
    [javac] import org.apache.solr.request.SolrQueryResponse;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:314: 'class' or 'interface' expected
    [javac] import org.apache.solr.client.solrj.SolrServer;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:315: 'class' or 'interface' expected
    [javac] import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:316: 'class' or 'interface' expected
    [javac] import org.apache.solr.util.plugin.SolrCoreAware;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:317: 'class' or 'interface' expected
    [javac] import org.apache.solr.core.SolrCore;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:318: 'class' or 'interface' expected
    [javac] import org.apache.lucene.queryParser.ParseException;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:320: 'class' or 'interface' expected
    [javac] import java.util.logging.Logger;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:321: 'class' or 'interface' expected
    [javac] import java.util.Collection;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:322: 'class' or 'interface' expected
    [javac] import java.util.List;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:323: 'class' or 'interface' expected
    [javac] import java.util.ArrayList;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/SearchHandler.java:324: 'class' or 'interface' expected
    [javac] import java.util.LinkedList;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:273: 'class' or 'interface' expected
    [javac] package org.apache.solr.handler.component;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:275: 'class' or 'interface' expected
    [javac] import org.apache.lucene.search.SortComparatorSource;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:276: 'class' or 'interface' expected
    [javac] import org.apache.lucene.search.SortField;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:277: 'class' or 'interface' expected
    [javac] import org.apache.lucene.util.PriorityQueue;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:278: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.util.NamedList;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:279: 'class' or 'interface' expected
    [javac] import org.apache.solr.search.MissingStringLastComparatorSource;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:281: 'class' or 'interface' expected
    [javac] import java.text.Collator;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:282: 'class' or 'interface' expected
    [javac] import java.util.Comparator;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:283: 'class' or 'interface' expected
    [javac] import java.util.Locale;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:284: 'class' or 'interface' expected
    [javac] import java.util.List;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardDoc.java:285: 'class' or 'interface' expected
    [javac] import java.util.ArrayList;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardRequest.java:60: 'class' or 'interface' expected
    [javac] package org.apache.solr.handler.component;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardRequest.java:62: 'class' or 'interface' expected
    [javac] import org.apache.solr.client.solrj.response.QueryResponse;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardRequest.java:63: 'class' or 'interface' expected
    [javac] import org.apache.solr.common.params.ModifiableSolrParams;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardRequest.java:65: 'class' or 'interface' expected
    [javac] import java.util.ArrayList;
    [javac] ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/ShardRequest.java:66: 'class' or 'interface' expected
    [javac] import java.util.List;
    [javac] ^
    [javac] 36 errors

BUILD FAILED
/home/timmsc/svn.apache.org/lucene/solr/trunk/build.xml:224: The following error occurred while executing this line:
/home/timmsc/svn.apache.org/lucene/solr/trunk/build.xml:110: Compile failed; see the compiler error output for details.

Total time: 1 second
{noformat}

Many of these errors occur because 3 of the files end up with their contents duplicated inline after the patch is applied.  After fixing this I still get 17 errors.
{noformat}
% ant compile
Buildfile: build.xml

init-forrest-entities:

compile-common:

compile:
    [javac] Compiling 84 source files to /home/timmsc/svn.apache.org/lucene/solr/trunk/build/core
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/DisMaxDebugComponent.java:45: cannot find symbol
    [javac] symbol  : constructor SearchComponent(org.apache.solr.handler.component.SearchHandler)
    [javac] location: class org.apache.solr.handler.component.SearchComponent
    [javac]     super(handler);
    [javac]     ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/DisMaxDebugComponent.java:51: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/DisMaxQueryComponent.java:57: cannot find symbol
    [javac] symbol  : constructor QueryComponent(org.apache.solr.handler.component.SearchHandler)
    [javac] location: class org.apache.solr.handler.component.QueryComponent
    [javac]     super(handler);
    [javac]     ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/DisMaxQueryComponent.java:63: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/DisMaxQueryComponent.java:204: cannot find symbol
    [javac] symbol  : method getSortSpec(org.apache.solr.request.SolrQueryRequest)
    [javac] location: class org.apache.solr.util.SolrPluginUtils
    [javac]     builder.setSortSpec(SolrPluginUtils.getSortSpec(req) );
    [javac]                                        ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/component/DisMaxResponseBuilder.java:37: cannot find symbol
    [javac] symbol  : constructor ResponseBuilder(org.apache.solr.request.SolrQueryRequest)
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]     super(req);
    [javac]     ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/FedSearchComponent.java:64: cannot find symbol
    [javac] symbol  : constructor SearchComponent(org.apache.solr.handler.component.SearchHandler)
    [javac] location: class org.apache.solr.handler.component.SearchComponent
    [javac]     super(handler);
    [javac]     ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/FedSearchComponent.java:91: cannot find symbol
    [javac] symbol  : variable handler
    [javac] location: class org.apache.solr.handler.federated.component.FedSearchComponent
    [javac]     SolrCore.getSolrCore().execute(handler, localReq, response);
    [javac]                                    ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/AuxiliaryQPhaseComponent.java:83: cannot find symbol
    [javac] symbol  : variable request_HL_and_MLT_Info_InMainPhase
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]     rspBuilder.request_HL_and_MLT_Info_InMainPhase = false;
    [javac]               ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/AuxiliaryQPhaseComponent.java:84: cannot find symbol
    [javac] symbol  : variable request_fields_InMainPhase
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]     rspBuilder.request_fields_InMainPhase = false;
    [javac]               ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/AuxiliaryQPhaseComponent.java:88: cannot find symbol
    [javac] symbol  : method skipProcess(org.apache.solr.request.SolrQueryRequest,org.apache.solr.request.SolrQueryResponse)
    [javac] location: class org.apache.solr.handler.federated.component.FedSearchComponent
    [javac]     if(super.skipProcess(req, rsp) ||
    [javac]             ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/GlobalCollectionStatComponent.java:138: cannot find symbol
    [javac] symbol  : method skipProcess(org.apache.solr.request.SolrQueryRequest,org.apache.solr.request.SolrQueryResponse)
    [javac] location: class org.apache.solr.handler.federated.component.FedSearchComponent
    [javac]     if(super.skipProcess(req, rsp) ||
    [javac]             ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/GlobalCollectionStatComponent.java:195: cannot find symbol
    [javac] symbol  : variable extractedTerms
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]     rspBuilder.extractedTerms = extractedTerms;
    [javac]               ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/GlobalCollectionStatComponent.java:205: cannot find symbol
    [javac] symbol  : variable extractedTerms
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]     final Set<String> extractedTerms = rspBuilder.extractedTerms;
    [javac]                                                  ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/MainQPhaseComponent.java:74: cannot find symbol
    [javac] symbol  : method skipProcess(org.apache.solr.request.SolrQueryRequest,org.apache.solr.request.SolrQueryResponse)
    [javac] location: class org.apache.solr.handler.federated.component.FedSearchComponent
    [javac]     if(super.skipProcess(req, rsp) ||
    [javac]             ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/MainQPhaseComponent.java:208: cannot find symbol
    [javac] symbol  : variable request_fields_InMainPhase
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]     if(rspBuilder.request_fields_InMainPhase){
    [javac]                  ^
    [javac] /home/timmsc/svn.apache.org/lucene/solr/trunk/src/java/org/apache/solr/handler/federated/component/MainQPhaseComponent.java:238: cannot find symbol
    [javac] symbol  : variable request_HL_and_MLT_Info_InMainPhase
    [javac] location: class org.apache.solr.handler.component.ResponseBuilder
    [javac]           rspBuilder.request_HL_and_MLT_Info_InMainPhase
    [javac]                     ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 17 errors

BUILD FAILED
/home/timmsc/svn.apache.org/lucene/solr/trunk/build.xml:224: The following error occurred while executing this line:
/home/timmsc/svn.apache.org/lucene/solr/trunk/build.xml:110: Compile failed; see the compiler error output for details.

Total time: 2 seconds
{noformat}

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613969#action_12613969 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

Getting "Form too large" from Jetty on otherwise normal shards requests with a large rows value (rows=40000). Is this related to SOLR-612?

The query was: http://x.x.x.x/solr/search?q=*:*&sort=indexed%20desc&fl=indexed&rows=40000, where x.x.x.x is a single shard and /search has the shards ivars mapped to it in solrconfig.

(Sorry for the mess, but that's how it appears)

Form_too_large__javalangIllegalStateException_Form_too_large__at_orgmortbayjettyRequestextractParametersRequestjava1273__at_orgmortbayjettyRequestgetParameterMapRequestjava650__at_orgapachesolrrequestServletSolrParamsinitServletSolrParamsjava29__at_orgapachesolrservletStandardRequestParserparseParamsAndFillStreamsSolrRequestParsersjava392__at_orgapachesolrservletSolrRequestParsersparseSolrRequestParsersjava113__at_orgapachesolrservletSolrDispatchFilterdoFilterSolrDispatchFilterjava240__at_orgmortbayjettyservletServletHandler$CachedChaindoFilterServletHandlerjava1089__at_orgmortbayjettyservletServletHandlerhandleServletHandlerjava365__at_orgmortbayjettysecuritySecurityHandlerhandleSecurityHandlerjava216__at_orgmortbayjettyservletSessionHandlerhandleSessionHandlerjava181__at_orgmortbayjettyhandlerContextHandlerhandleContextHandlerjava712__at_orgmortbayjettywebappWebAppContexthandleWebAppContextjava405__at_orgmortbayjettyhandlerContextHandlerCollectionhandleContextHandlerCollectionjava211__at_orgmortbayjettyhandlerHandlerCollectionhandleHandlerCollectionjava114__at_orgmortbayjettyhandlerHandlerWrapperhandleHandlerWrapperjava139__at_orgmortbayjettyServerhandleServerjava285__at_orgmortbayjettyHttpConnectionhandleRequestHttpConnectionjava502__at_orgmortbayjettyHttpConnection$RequestHandlercontentHttpConnectionjava835__at_orgmortbayjettyHttpParserparseNextHttpParserjava641__at_orgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_orgmortbayjettyHttpConnectionhandleHttpConnectionjava378__at_orgmortbayjettybioSocketConnector$ConnectionrunSocketConnectorjava226__at_orgmortbaythreadBoundedThreadPool$PoolThreadrunBoundedThreadPooljava442_

request: http://x.x.x.x.y/solr/select (ed: this was a different shard than the one I called)

request: http://x.x.x.y/solr/select
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
	at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:371)
	at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:345)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
	at java.lang.Thread.run(Thread.java:619)


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531565 ] 

Stu Hood commented on SOLR-303:
-------------------------------

{quote}
->Based the solution on SOLR-281. Got away with the MultiSearchRequestHandler base class.
{quote}
Does this mean that this patch requires SOLR-281 to be applied first? Also, what revision should it be applied to, or will HEAD work?


{quote}
-> Doing url encoding for the request params in XMLResponseParser
{quote}
Ah yeah, I ran into that one a few days ago as well. Additionally, I had XMLResponseParser strip the "wt" parameter off its queries: 'extractterms' was passing through the user's wt, which caused the XML parsing to fail (obviously =) ).
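
To illustrate the wt problem described above, here is a minimal sketch of removing the response-format parameter before a sub-request is sent to a shard. It is not the code from the patch: the class and method names are hypothetical, a plain Map stands in for Solr's parameter classes, and it assumes the shard answers in XML when no wt is given.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Solr or the patch.
final class ShardParams {
    private ShardParams() {}

    // Returns a copy of the user's parameters without "wt", so that the
    // shard's reply is not forced into a format the XML parser cannot read.
    // The caller's map is left untouched.
    static Map<String, String> withoutWt(Map<String, String> userParams) {
        Map<String, String> shardParams = new LinkedHashMap<>(userParams);
        shardParams.remove("wt");
        return shardParams;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("q", "solr");
        user.put("wt", "json"); // the user wants JSON from the top-level request
        System.out.println(withoutWt(user));
    }
}
```

Copying rather than mutating keeps the user's wt available for writing the final merged response.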


Can't wait to try it out... Thanks a lot!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. All document fields are NOT requested, only document uniqFields and sort fields are requested. MoreLikeThis and Highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields , highlighting and moreLikeThis info from SecondQueryPhase are merged into FirstQueryPhase response.
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES;
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> Would appreciate feedback on my approach. I understand that there would be lot things I might have over-looked. 
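
For readers following the quoted description, STEP 3 and STEP 4 can be sketched roughly as below. This is only an illustration under stated assumptions, not the patch's code: the Hit type, the class name and both method names are invented here, and sorting is by score only (the description lists other sort fields as a TODO).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the merge between the two query phases.
final class TwoPhaseMerge {
    private TwoPhaseMerge() {}

    // One first-phase hit: originating shard, unique key, and score.
    static final class Hit {
        final String shard;
        final String id;
        final float score;

        Hit(String shard, String id, float score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    // STEP 3: merge first-phase hits by descending score, then apply start/rows.
    static List<Hit> mergeTop(List<List<Hit>> perShard, int start, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int from = Math.min(start, all.size());
        int to = (int) Math.min((long) from + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    // STEP 4: group the winning unique keys by shard, so each shard is asked
    // only for the stored fields and highlighting of its own documents.
    static Map<String, List<String>> groupIdsByShard(List<Hit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (Hit h : merged) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<List<Hit>> perShard = new ArrayList<>();
        perShard.add(List.of(new Hit("host1", "a1", 3.0f), new Hit("host1", "a2", 1.0f)));
        perShard.add(List.of(new Hit("host2", "b1", 2.0f)));
        System.out.println(groupIdsByShard(mergeTop(perShard, 0, 2)));
    }
}
```

A shard that contributed no document to the merged page does not appear in the grouping, so it receives no second-phase request.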


[jira] Resolved: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-303.
-------------------------------

    Resolution: Fixed

Closing this issue (finally!).  Specific bugs or improvements can get their own new issues.
Thanks to everyone who contributed to this!

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Jayson Minard (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580403#action_12580403 ] 

Jayson Minard commented on SOLR-303:
------------------------------------

I'll see if I can work up a patch tonight on the extended response...

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "zhang.zuxin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544684 ] 

zhang.zuxin commented on SOLR-303:
----------------------------------

to Sabyasachi Dalal:
I updated Solr trunk to revision 597284, and the patch applied cleanly. But it doesn't work; it behaves as if distributed search isn't supported.
By contrast, it works when I use Sharad Agarwal's patch. I don't know what's wrong. Did you perhaps change something?

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

updated patch:
- refactored some distributed search code to make things easier (added modifyRequest, etc)
- added merging of debugging info (including timing info, via generic recursive merging)
- merge explain info; the internal id is dropped from the explain key for easier merging
- many small changes: don't return scores if they aren't requested (even if needed for shard requests to merge), return maxScore
  if scores are requested, enable escaping for the shards parameter.
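
As a rough picture of what "generic recursive merging" of timing info could look like, the sketch below sums numeric leaves of nested maps and recurses into sub-maps. It is an assumption-laden illustration, not the patch's implementation: the class name is hypothetical, plain Maps stand in for Solr's NamedList, and non-numeric conflicts simply keep the first value seen.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the patch itself works on Solr's own data structures.
final class RecursiveMerge {
    private RecursiveMerge() {}

    // Merges source into target and returns target. Nested maps are merged
    // recursively, numbers under the same key are added, and keys missing
    // from target are copied (deep-copied for maps, so source is not aliased).
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> target, Map<String, Object> source) {
        for (Map.Entry<String, Object> e : source.entrySet()) {
            String key = e.getKey();
            Object existing = target.get(key);
            Object incoming = e.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                merge((Map<String, Object>) existing, (Map<String, Object>) incoming);
            } else if (existing instanceof Number && incoming instanceof Number) {
                target.put(key, ((Number) existing).doubleValue() + ((Number) incoming).doubleValue());
            } else if (existing == null) {
                target.put(key, incoming instanceof Map
                        ? merge(new LinkedHashMap<String, Object>(), (Map<String, Object>) incoming)
                        : incoming);
            }
            // Any other combination keeps the value already in target.
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, Object> shard1 = new LinkedHashMap<>();
        shard1.put("time", 10.0);
        Map<String, Object> shard2 = new LinkedHashMap<>();
        shard2.put("time", 5.0);
        Map<String, Object> total = merge(merge(new LinkedHashMap<>(), shard1), shard2);
        System.out.println(total);
    }
}
```

Because the merge is driven by the shape of the data rather than by known key names, new timing sections need no extra merge code.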

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

New patch:
  - test framework using multiple embedded jetty servers that adds documents to multiple servers, and also to a control server, then executes both distributed and non-distributed queries and compares the results.
  - fixed merging for non-string uniqueKeyFields
  - fixed issue when id field was not selected by client
  - break facet count ties by label
  - added rudimentary duplicate detection in case one accidentally adds the same doc to different shards
  - add code to handle index changes between query phases (docs may no longer exist)
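
The "break facet count ties by label" item can be illustrated with a small sketch. The names below are hypothetical and the code is not taken from the patch; it assumes that per-shard counts for a field are summed, ordered by descending count, and that equal counts fall back to ascending label order so the merged output is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of merging facet counts returned by several shards.
final class FacetMerge {
    private FacetMerge() {}

    // Sums counts per label across shards and returns them ordered by
    // count (highest first), with ties broken by label (ascending).
    static Map<String, Long> mergeCounts(List<Map<String, Long>> perShard) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> counts : perShard) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sorted) {
            ordered.put(e.getKey(), e.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Long> shard1 = new HashMap<>();
        shard1.put("b", 2L);
        shard1.put("a", 1L);
        Map<String, Long> shard2 = new HashMap<>();
        shard2.put("a", 1L);
        shard2.put("c", 5L);
        System.out.println(mergeCounts(List.of(shard1, shard2)));
    }
}
```

Without the tie-break, labels with equal totals could come back in a different order from one request to the next, which makes distributed and non-distributed results hard to compare in tests.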

Given that most of this is new functionality, I think things are in good enough shape to commit now (making it much easier for others to generate patches against it).

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Description: 
Searching over multiple shards and aggregating results.
Motivated by http://wiki.apache.org/solr/DistributedSearch


  was:
Motivated by http://wiki.apache.org/solr/FederatedSearch
"Index view consistency between multiple requests" requirement is relaxed in this implementation.

Does the federated search query side. Update not yet done.

Tries to achieve:-
------------------------
- Client applications are completely agnostic to federated search. The federated search and the merging of results happen entirely behind the scenes, inside the Solr request handler. The response format remains the same after results are merged.
The response from each individual shard is deserialized into a SolrQueryResponse object. The collection of SolrQueryResponse objects is merged to produce a single SolrQueryResponse object. This allows the response writers to be used as they are, or with minimal changes.

- Efficient query processing, with highlighting and fields generated only for the merged documents. The query is executed in 2 phases. The first phase gets the doc unique keys along with the sort criteria. The second phase fetches all requested fields and highlighting information. This saves a lot of CPU when there are many shards and highlighting info is requested.
It should be easy to customize the query execution. For example, the user can choose to execute the query in just 1 phase. (For some queries, when highlighting info is not required and the number of requested fields is small, this can be more efficient.)

- Ability to easily override the default federated capability through appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml.

- Global weight calculation is done by querying the terms' doc frequencies from all shards.

- Federated search works over HTTP transport, so each individual shard's VIP can be queried. Load balancing and fail-over are taken care of by the VIP as usual.

- Sub-searcher response parsing is a plugin interface. Different implementations could be written based on JSON, XML SAX, etc. The current one is based on XML DOM.


HOW:
-------
A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers (referred to as "shards" going forward). It extends RequestHandlerBase. The handleRequestBody method in RequestHandlerBase has been divided into query-building and execute methods. This has been done to calculate global numDocs and docFreqs, and to execute the query efficiently on multiple shards.
All the "search" request handlers are expected to extend the MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
 
The federated search kicks in if "shards" is present in the request parameters. Otherwise the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2 will search the local index and 2 remote indexes. The search responses from all 3 shards are merged and served back to the client.

The search request processing on the set of shards is performed as follows:

STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.

STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. Not all document fields are requested; only the document uniqFields and sort fields are. MoreLikeThis and highlighting information are NOT requested.

STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.

STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.

STEP 5: Responses from all shards from SecondQueryPhase are merged.

STEP 6: Document fields, highlighting and moreLikeThis info from SecondQueryPhase are merged into the FirstQueryPhase response.




TODO:
-Support sort field other than default score
-Support ResponseDocs in writers other than XMLWriter
-Http connection timeouts

OPEN ISSUES:
-Merging of facets by "top n terms of field f" 

Scope for Performance optimization:-
-Search shards in parallel threads
-Http connection Keep-Alive ?
-Cache global numDocs and docFreqs
-Cache Query objects in handlers ??

Would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.


       Priority: Major  (was: Minor)
        Summary: Distributed Search over HTTP  (was: Federated Search over HTTP)

Original description by Sharad, moved to this comment because a JIRA "Description" is sent to the email list *every time* there is an update to the issue.

{quote}
Motivated by http://wiki.apache.org/solr/DistributedSearch
"Index view consistency between multiple requests" requirement is relaxed in this implementation.

Does the federated search query side. Update not yet done.

Tries to achieve:-
------------------------
- Client applications are completely agnostic to federated search. The federated search and the merging of results happen entirely behind the scenes, inside the Solr request handler. The response format remains the same after results are merged.
The response from each individual shard is deserialized into a SolrQueryResponse object. The collection of SolrQueryResponse objects is merged to produce a single SolrQueryResponse object. This allows the response writers to be used as they are, or with minimal changes.

- Efficient query processing, with highlighting and fields generated only for the merged documents. The query is executed in 2 phases. The first phase gets the doc unique keys along with the sort criteria. The second phase fetches all requested fields and highlighting information. This saves a lot of CPU when there are many shards and highlighting info is requested.
It should be easy to customize the query execution. For example, the user can choose to execute the query in just 1 phase. (For some queries, when highlighting info is not required and the number of requested fields is small, this can be more efficient.)

- Ability to easily override the default federated capability through appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml.

- Global weight calculation is done by querying the terms' doc frequencies from all shards.

- Federated search works over HTTP transport, so each individual shard's VIP can be queried. Load balancing and fail-over are taken care of by the VIP as usual.

- Sub-searcher response parsing is a plugin interface. Different implementations could be written based on JSON, XML SAX, etc. The current one is based on XML DOM.


HOW:
-------
A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers (referred to as "shards" going forward). It extends RequestHandlerBase. The handleRequestBody method in RequestHandlerBase has been divided into query-building and execute methods. This has been done to calculate global numDocs and docFreqs, and to execute the query efficiently on multiple shards.
All the "search" request handlers are expected to extend the MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
 
The federated search kicks in if "shards" is present in the request parameters. Otherwise the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2 will search the local index and 2 remote indexes. The search responses from all 3 shards are merged and served back to the client.

The search request processing on the set of shards is performed as follows:

STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.

STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. Not all document fields are requested; only the document uniqFields and sort fields are. MoreLikeThis and highlighting information are NOT requested.

STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.

STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.

STEP 5: Responses from all shards from SecondQueryPhase are merged.

STEP 6: Document fields, highlighting and moreLikeThis info from SecondQueryPhase are merged into the FirstQueryPhase response.




TODO:
-Support sort field other than default score
-Support ResponseDocs in writers other than XMLWriter
-Http connection timeouts

OPEN ISSUES:
-Merging of facets by "top n terms of field f" 

Scope for Performance optimization:-
-Search shards in parallel threads
-Http connection Keep-Alive ?
-Cache global numDocs and docFreqs
-Cache Query objects in handlers ??

Would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.
{quote}
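
As an illustration of STEP 3 and STEP 4 in the quoted description, the sketch below merges first-phase hits from several shards by score, applies "start" and "rows", and then groups the surviving unique keys by shard for the second phase. It is a minimal sketch under assumptions, not the patch's implementation: TwoPhaseMergeSketch, Hit and both method names are hypothetical, only the default score sort is handled, and each shard is assumed to have returned at least its top start+rows hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the merge step; names are illustrative, not from the patch.
final class TwoPhaseMergeSketch {

    /** A first-phase hit: unique key, originating shard and score. */
    static final class Hit {
        final String id;
        final String shard;
        final float score;

        Hit(String id, String shard, float score) {
            this.id = id;
            this.shard = shard;
            this.score = score;
        }
    }

    /** STEP 3: merge per-shard hits by descending score, then apply start/rows. */
    static List<Hit> mergeTopDocs(List<List<Hit>> perShard, int start, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShard) {
            all.addAll(shardHits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    /** STEP 4: group the merged unique keys by the shard that has to be asked for them. */
    static Map<String, List<String>> groupIdsByShard(List<Hit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (Hit hit : merged) {
            byShard.computeIfAbsent(hit.shard, k -> new ArrayList<>()).add(hit.id);
        }
        return byShard;
    }
}
```

Only shards that appear in the grouping need to be contacted in the second phase, which is where the CPU saving for highlighting comes from.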


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Gereon Steffens (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551457 ] 

Gereon Steffens commented on SOLR-303:
--------------------------------------

I started experimenting with this patch and have a couple of issues.

First, the patch did not apply cleanly to the latest trunk (603869), so I reverted to 600419 - no big deal.

I then set up two separate tomcat/solr instances using identical schemas (on ports 8080 and 8090) and tried querying both using solr/search requests, but I can't get any of my queries to work.

For example, there is a document with field "id" = 1527426 in the database on port 8090. "id" is defined as a "sint" field. The 8080 instance has no such id.
When querying "http://localhost:8080/solr/search?q=id:1527426&shards=local,localhost:8090/solr", I get the following in the tomcat logs:

{noformat}
catalina.out on the 8080 instance:

Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.component.ResponseBuilder <init>
INFO: ### *** shards len 2
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent extractTerms
INFO: --------Extract terms starting----------- :
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent extractTerms
INFO: ### *** is shards null false
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent extractTerms
INFO: ### *** SHARDS len 2
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.XMLResponseParser parse
INFO: ->Request http://localhost:8090/solr/select?q=id%3A1527426&shards=local%2Clocalhost%3A8090%2Fsolr&eqt=true&
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent execute
WARNING: Exception while querying shard localhost:8090/solr :java.lang.NullPointerException
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent calcuateGlobalCollectionStat
INFO: --------getGlobalCollectionStat starting----------- :
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.XMLResponseParser parse
INFO: ->Request http://localhost:8090/solr/federated/collectionstats?terms=id%3A%C2%80%C5%B4%E0%BA%82%2C&
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values nd : java.lang.Integer
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values tdf : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.MainQPhaseComponent process
INFO: --------MainQPhaseComponent starting----------- :
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.FedSearchComponent executeOnLocal
INFO: ->Local request params: {fl=id,score,,q=id:1527426,nd=74621,tdf=id:Ŵຂ@1,,fsv=true}
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.search.DocSlice
Dec 13, 2007 10:55:04 AM org.apache.solr.core.SolrCore execute
INFO: null nd=74621&fsv=true&tdf=id:Ŵຂ@1,&q=id:1527426&fl=id,score, 0 1
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.XMLResponseParser parse
INFO: ->Request http://localhost:8090/solr/select?fl=id%2Cscore%2C&q=id%3A1527426&nd=74621&tdf=id%3A%C2%80%C5%B4%E0%BA%82%401%2C&fsv=true&
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.MainQPhaseComponent process
WARNING: Exception while querying shard localhost:8090/solr :java.lang.ClassCastException: java.lang.Integer
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.handler.federated.ResponseDocs
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.handler.federated.ResponseDocs
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent process
INFO: --------AuxiliaryQPhaseComponent starting----------- :
Dec 13, 2007 10:55:04 AM org.apache.solr.core.SolrCore execute
INFO: /search q=id:1527426&shards=local,localhost:8090/solr 0 60
{noformat}

{noformat}
catalina.out on the 8090 instance

Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 10:55:04 AM org.apache.solr.handler.component.ResponseBuilder <init>
INFO: ### *** shards len 2
Dec 13, 2007 10:55:04 AM org.apache.solr.core.SolrCore execute
INFO: /select q=id:1527426&eqt=true&shards=local,localhost:8090/solr 0 3
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values nd : java.lang.Integer
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values tdf : org.apache.solr.common.util.NamedList
Dec 13, 2007 10:55:04 AM org.apache.solr.core.SolrCore execute
INFO: /federated/collectionstats terms=id:Ŵຂ, 0 3
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 10:55:04 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.search.DocSlice
Dec 13, 2007 10:55:04 AM org.apache.solr.core.SolrCore execute
INFO: /select nd=74621&fsv=true&fl=id,score,&q=id:1527426&tdf=id:Ŵຂ@1, 0 1
{noformat}

So the request does reach the 8090 instance, but triggers a ClassCastException on the 8080 instance. The XML output is:
{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">135</int>
  <lst name="params">
    <str name="q">id:1527426</str>
    <str name="shards">local,localhost:8090/solr</str>
  </lst>
</lst>
<result name="response" numFound="0" start="0"/>
  <lst name="responseHeader">
    <lst name="local">
      <int name="status">0</int>
      <int name="QTime">4</int>
      <lst name="params">
      <str name="nd">74621</str>
      <str name="fsv">true</str>
      <str name="tdf">id:Ŵຂ@1,</str>
      <str name="q">id:1527426</str>
      <str name="fl">id,score,</str>
    </lst>
  </lst>
</lst>
</response>
{noformat}

The "reverse" request, "http://localhost:8090/solr/search?q=id:1527426&shards=local,localhost:8080/solr", produces an "HTTP Status 500 - null java.lang.NullPointerException" response. The logs are:

{noformat}
catalina.out on the 8080 instance

Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.component.ResponseBuilder <init>
INFO: ### *** shards len 2
Dec 13, 2007 11:07:33 AM org.apache.solr.core.SolrCore execute
INFO: /select q=id:1527426&eqt=true&shards=local,localhost:8080/solr 0 2
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values nd : java.lang.Integer
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values tdf : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.core.SolrCore execute
INFO: /federated/collectionstats terms=id:Ŵຂ, 0 5
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.search.DocSlice
Dec 13, 2007 11:07:33 AM org.apache.solr.core.SolrCore execute
INFO: /select nd=74621&fsv=true&fl=id,score,&q=id:1527426&tdf=id:Ŵຂ@1, 0 1
{noformat}

{noformat}
catalina.out on the 8090 instance

Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.component.ResponseBuilder <init>
INFO: ### *** shards len 2
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent extractTerms
INFO: --------Extract terms starting----------- :
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent extractTerms
INFO: ### *** is shards null false
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent extractTerms
INFO: ### *** SHARDS len 2
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.XMLResponseParser parse
INFO: ->Request http://localhost:8080/solr/select?q=id%3A1527426&shards=local%2Clocalhost%3A8080%2Fsolr&eqt=true&
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent execute
WARNING: Exception while querying shard localhost:8080/solr :java.lang.NullPointerException
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.GlobalCollectionStatComponent calcuateGlobalCollectionStat
INFO: --------getGlobalCollectionStat starting----------- :
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.XMLResponseParser parse
INFO: ->Request http://localhost:8080/solr/federated/collectionstats?terms=id%3A%C2%80%C5%B4%E0%BA%82%2C&
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values nd : java.lang.Integer
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values tdf : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.MainQPhaseComponent process
INFO: --------MainQPhaseComponent starting----------- :
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.XMLResponseParser parse
INFO: ->Request http://localhost:8080/solr/select?fl=id%2Cscore%2C&q=id%3A1527426&nd=74621&tdf=id%3A%C2%80%C5%B4%E0%BA%82%401%2C&fsv=true&
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.FedSearchComponent executeOnLocal
INFO: ->Local request params: {fl=id,score,,q=id:1527426,nd=74621,tdf=id:Ŵຂ@1,,fsv=true}
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.search.DocSlice
Dec 13, 2007 11:07:33 AM org.apache.solr.core.SolrCore execute
INFO: null nd=74621&fsv=true&tdf=id:Ŵຂ@1,&q=id:1527426&fl=id,score, 0 4
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.handler.federated.ResponseDocs
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.handler.federated.ResponseDocs
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values response : org.apache.solr.handler.federated.ResponseDocs
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.NamedList
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent process
INFO: --------AuxiliaryQPhaseComponent starting----------- :
Dec 13, 2007 11:07:33 AM org.apache.solr.handler.federated.component.FedSearchComponent executeOnLocal
INFO: ->Local request params: {dq=id:"Ŵຂ" ,q=id:1527426}
Dec 13, 2007 11:07:33 AM org.apache.solr.request.SolrQueryResponse add
INFO: adding into values responseHeader : org.apache.solr.common.util.SimpleOrderedMap
Dec 13, 2007 11:07:33 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: "Ŵຂ"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:447)
        at java.lang.Integer.parseInt(Integer.java:497)
        at org.apache.solr.util.NumberUtils.int2sortableStr(NumberUtils.java:36)
        at org.apache.solr.schema.SortableIntField.toInternal(SortableIntField.java:52)
        at org.apache.solr.schema.FieldType$DefaultAnalyzer$1.next(FieldType.java:315)
        at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:437)
        at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:97)
        at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:515)
        at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1227)
        at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:979)
        at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:907)
        at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:896)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:146)
        at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:101)
        at org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent.prepare(AuxiliaryQPhaseComponent.java:71)
        at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:152)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:866)
        at org.apache.solr.handler.federated.component.FedSearchComponent.executeOnLocal(FedSearchComponent.java:87)
        at org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent$1.call(AuxiliaryQPhaseComponent.java:115)
        at org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent$1.call(AuxiliaryQPhaseComponent.java:114)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)

Dec 13, 2007 11:07:33 AM org.apache.solr.core.SolrCore execute
INFO: null q=id:1527426&dq=id:"Ŵຂ"+ 0 2
Dec 13, 2007 11:07:33 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at org.apache.solr.handler.federated.SearchResponseMerger.mergeResponseDocs_NoSort(SearchResponseMerger.java:215)
        at org.apache.solr.handler.federated.SearchResponseMerger.merge(SearchResponseMerger.java:83)
        at org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent.process(AuxiliaryQPhaseComponent.java:156)
        at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:158)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:866)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
        at java.lang.Thread.run(Thread.java:595)

Dec 13, 2007 11:07:33 AM org.apache.solr.core.SolrCore execute
INFO: /search q=id:1527426&shards=local,localhost:8080/solr 0 95
Dec 13, 2007 11:07:33 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at org.apache.solr.handler.federated.SearchResponseMerger.mergeResponseDocs_NoSort(SearchResponseMerger.java:215)
        at org.apache.solr.handler.federated.SearchResponseMerger.merge(SearchResponseMerger.java:83)
        at org.apache.solr.handler.federated.component.AuxiliaryQPhaseComponent.process(AuxiliaryQPhaseComponent.java:156)
        at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:158)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:866)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
        at java.lang.Thread.run(Thread.java:595)

{noformat}
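
A note on the odd characters in the logs: the bytes in "terms=id%3A%C2%80%C5%B4%E0%BA%82" decode to U+0080, U+0174 and U+0E82, which is consistent with 1527426 having been converted to the indexed (sortable) form of the "sint" field. The NumberFormatException would then be that indexed form being fed back through Integer.parseInt as if it were the external form. The sketch below reproduces this under an assumption: the three-character packing (8, 12 and 12 bits after adding Integer.MIN_VALUE) is inferred from the logged bytes, not copied from Solr's NumberUtils, and the class and method names are hypothetical.

```java
// Hypothetical reconstruction; the packing scheme is inferred from the logged bytes,
// not copied from Solr's NumberUtils.
final class SortableIntSketch {

    /** Packs an int into 3 chars whose string order matches the numeric order. */
    static String toSortable(int val) {
        val += Integer.MIN_VALUE;                 // shift so negative values sort first (wraps on purpose)
        char[] out = new char[3];
        out[0] = (char) (val >>> 24);             // top 8 bits
        out[1] = (char) ((val >>> 12) & 0x0fff);  // middle 12 bits
        out[2] = (char) (val & 0x0fff);           // low 12 bits
        return new String(out);
    }

    /** True if the string is an external (human-readable) int. */
    static boolean parsesAsInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String indexed = toSortable(1527426);
        // Print code points rather than the raw string, since the first char is a control char.
        for (char c : indexed.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        System.out.println("parses as int: " + parsesAsInt(indexed));
    }
}
```

If this reading is right, the "dq" parameter built for the auxiliary phase would need to carry the external form of the id (or be parsed without going through the field's toInternal conversion again).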



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
----------------------------------

    Attachment:     (was: fedsearch.patch)

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: shards_qt.patch

Attaching shards_qt.patch, which uses "shards.qt" as "qt" for sub-requests to avoid infinite recursion when setting "shards" as a default in the request handler.
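
The parameter rewrite described above can be sketched roughly as follows. This is only an illustration, not the code in shards_qt.patch: the class and method names are hypothetical, and dropping "qt" when no "shards.qt" is given is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShardSubRequest {
    // Hypothetical helper, not Solr's API: derive the parameters sent to one shard.
    static Map<String, String> paramsFor(Map<String, String> original) {
        Map<String, String> sub = new LinkedHashMap<>(original);
        sub.remove("shards");                      // a sub-request must not fan out again
        String shardsQt = sub.remove("shards.qt"); // handler the shards should use
        if (shardsQt != null) {
            sub.put("qt", shardsQt);
        } else {
            sub.remove("qt");                      // assumption: use the shard's default handler
        }
        return sub;
    }

    public static void main(String[] args) {
        Map<String, String> req = new LinkedHashMap<>();
        req.put("q", "solr");
        req.put("qt", "/distrib");
        req.put("shards", "host1:8983/solr,host2:8983/solr");
        req.put("shards.qt", "/select");
        System.out.println(paramsFor(req));
    }
}
```

Because the sub-request is routed to a handler that has no "shards" default, the shard answers from its local index instead of distributing the query again.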

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

patrick o'leary updated SOLR-303:
---------------------------------

    Attachment: distributed_pjaol.patch

Hey Yonik,
I needed to make a couple of updates to ShardDoc, as the nested outer classes were preventing me from using the patch.
This patch also includes SOLR-457, with a multi-threaded implementation of SolrJ to query the shards.

P

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614397#action_12614397 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

Yonik, sure, but I think we should probably handle the case better than a 500 error. Maybe a Solr warning about per-shard row limits?

Lars, I am having trouble getting that maxFormContentSize property set. I am running Jetty like this:

/usr/local/java/bin/java -Dorg.mortbay.http.HttpRequest.maxFormContentSize=1000000 -Xmx7000m -Xms1024m -jar start.jar

(I've also tried 0 and -1, which per the Jetty docs mean "unlimited.")

But the same distributed query gives the same error. How are you setting that property?



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554446 ] 

Stu Hood commented on SOLR-303:
-------------------------------

Thanks for the new patch Yonik! It doesn't apply cleanly because of the way you generated the test files, but after those have been removed, it looks good. It seems you figured out the sorting issue that I had mentioned: thanks.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553841 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

{quote}I'm not quite sure about GlobalCollectionStat. Is its purpose just to normalize weights from the shards?{quote}

It's to make a distributed search score the same as it would if everything was in a single index.
idf (inverse document frequency) is part of the scoring, so that component essentially does a distributed idf.
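
As a rough illustration of what a distributed idf means, the sketch below sums docFreq and numDocs over the shards and applies the classic Lucene DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)). The class name and the {docFreq, numDocs} array layout are assumptions made for the example, not the component's actual code.

```java
class GlobalIdf {
    // Classic Lucene DefaultSimilarity formula: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Each long[] holds {docFreq, numDocs} reported by one shard (hypothetical layout).
    static double globalIdf(long[][] shardStats) {
        long docFreq = 0;
        long numDocs = 0;
        for (long[] s : shardStats) {
            docFreq += s[0];
            numDocs += s[1];
        }
        return idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        long[][] stats = { {9, 100}, {0, 100} };
        // Per-shard values differ from each other and from the global value.
        System.out.println("shard 1 idf: " + idf(9, 100));
        System.out.println("shard 2 idf: " + idf(0, 100));
        System.out.println("global idf:  " + globalIdf(stats));
    }
}
```

If every shard scores with the global value rather than its local one, a term's weight no longer depends on which shard a document happens to live in.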

I still use the PriorityQueue, but it's been modified since SolrJ returns objects rather than strings.
I'll try to post a draft soon... if you understood the old code, it will be great for you to look at the new stuff to see what I'm missing.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523996 ] 

Sharad Agarwal commented on SOLR-303:
-------------------------------------

Recently I added parallel requests to shards using a thread pool (patch not yet uploaded).
Async IO would be the next step, but I don't want to bring in its complexity so early.
Perhaps we can first benchmark the thread pool/parallel requests implementation and then, based on the numbers, work towards Async IO.
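
A minimal sketch of the thread pool idea, using only java.util.concurrent. This is not the unreleased patch: queryShard is a hypothetical stand-in for the real HTTP call, and the pool is created per request only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelShardQuery {
    // Stand-in for the HTTP request to one shard (hypothetical).
    static String queryShard(String shard, String q) {
        return shard + "?q=" + q;
    }

    // Send the query to every shard in parallel and collect responses in shard order.
    static List<String> queryAll(List<String> shards, String q, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String shard : shards) {
                tasks.add(() -> queryShard(shard, q));
            }
            List<String> responses = new ArrayList<>();
            // invokeAll blocks until all tasks finish and keeps the task order.
            for (Future<String> f : pool.invokeAll(tasks)) {
                responses.add(f.get());
            }
            return responses;
        } catch (Exception e) {
            throw new RuntimeException("shard request failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(queryAll(Arrays.asList("host1:8080/solr", "host2:8080/solr"), "solr", 2));
    }
}
```

A real handler would presumably share one long-lived pool across requests, which is also where a benchmark against Async IO would be measured.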

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. All document fields are NOT requested, only document uniqFields and sort fields are requested. MoreLikeThis and Highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields , highlighting and moreLikeThis info from SecondQueryPhase are merged into FirstQueryPhase response.
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES;
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> Would appreciate feedback on my approach. I understand that there would be lot things I might have over-looked. 
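
STEP 3 and STEP 4 of the description quoted above can be sketched as follows. This is a simplified illustration sorted by score only (the description lists other sort fields as a TODO); the Hit class and method names are hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class FirstPhaseMerge {
    // One first-phase hit: unique key, score, and the shard it came from.
    static final class Hit {
        final String id;
        final float score;
        final String shard;
        Hit(String id, float score, String shard) {
            this.id = id;
            this.score = score;
            this.shard = shard;
        }
    }

    // STEP 3: merge all shard hits and keep the window [start, start + rows).
    static List<Hit> merge(List<List<Hit>> perShard, int start, int rows) {
        Comparator<Hit> byScoreDesc = (a, b) -> {
            int c = Float.compare(b.score, a.score);
            return c != 0 ? c : a.id.compareTo(b.id); // tie-break on id for stable output
        };
        PriorityQueue<Hit> queue = new PriorityQueue<>(11, byScoreDesc);
        for (List<Hit> hits : perShard) {
            queue.addAll(hits);
        }
        List<Hit> page = new ArrayList<>();
        for (int i = 0; i < start + rows && !queue.isEmpty(); i++) {
            Hit h = queue.poll();
            if (i >= start) {
                page.add(h);
            }
        }
        return page;
    }

    // STEP 4: group the merged unique keys by shard for the second-phase requests.
    static Map<String, List<String>> groupByShard(List<Hit> page) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (Hit h : page) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.id);
        }
        return byShard;
    }
}
```

Note that for this merge to be correct each shard has to return its top start + rows hits in the first phase, since any of them may end up in the requested window.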


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
----------------------------------

    Attachment: fedsearch.patch

I have fixed and updated the patch against trunk revision 600419. It is integrated with the re-opened SOLR-281 patch.
I have added the configuration for the three distributed-search components to solrconfig.xml, under the "/search" request handler, so distributed search works with /search requests only.

A couple of issues:
1. The distributed search components need a reference to the SearchHandler, so for now I have hard-coded the "/search" pattern in FedSearchComponent.
2. We need a clean way to load common init params for the distributed search components, such as timeout, thread pool size and search handler pattern.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527647 ] 

Stu Hood commented on SOLR-303:
-------------------------------

I'm also seeing the following issue, but I haven't had time to investigate:

{quote}
WARNING: Exception while querying shard crc10:8080/solr_postfix09092000-09112000 :java.lang.ClassCastException: com.sun.org.apache.xerces.internal.dom.DeferredTextImpl cannot be cast to org.w3c.dom.Element
{quote}
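
A frequent cause of this kind of ClassCastException with a DOM parser is casting every child node to Element when the XML contains whitespace text nodes between elements. Whether that is what happens in the patch is an assumption, since the issue was not investigated here. The sketch below (hypothetical class name) shows the defensive pattern of filtering child nodes by type before casting.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildElements {
    // Return only the Element children, skipping text and comment nodes.
    static List<Element> of(Node parent) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n instanceof Element) {
                out.add((Element) n);
            }
        }
        return out;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<response>\n  <result/>\n  <lst/>\n</response>");
        Element root = doc.getDocumentElement();
        // The raw child list includes whitespace text nodes; the filtered list does not.
        System.out.println("all child nodes: " + root.getChildNodes().getLength());
        System.out.println("element children: " + of(root).size());
    }
}
```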

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553838 ] 

Stu Hood commented on SOLR-303:
-------------------------------

I recognize the advantage of the AuxiliaryQPhase, but I'm not quite sure about GlobalCollectionStat. Is its purpose just to normalize weights from the shards?

I had to make some changes to the MainQPhase parameter building, and to the PriorityQueue that SearchResponseMerger uses to get sorting working properly. Yonik, if you aren't planning on re-writing those from scratch, would you prefer a patch, or an explanation of what I needed to change?

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Issue Comment Edited: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513167 ] 

Sharad Agarwal edited comment on SOLR-303 at 7/17/07 12:27 AM:
---------------------------------------------------------------

> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
>>Do you have plans to remedy that? Or do you think that most people are OK with inconsistencies that could arise?
The thing to note here is that multi-phase execution is currently based on document unique fields, NOT on internal doc ids. So there won't be many inconsistencies between requests, since it does not depend on changing internal doc ids.
One possibility is that a particular document has been deleted by the time the second phase executes, which in my opinion should be OK to live with.
Another possibility is that the document has changed and the original query terms are no longer present in it. This can be solved by ANDing the original query with the uniq field document query.
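
A minimal sketch of that ANDed second-phase query, built as a plain Lucene query string. The class, method names, and quoting rules are assumptions for illustration; the patch may construct Query objects instead.

```java
import java.util.Arrays;
import java.util.List;

class SecondPhaseQuery {
    // Quote a unique key so it is treated as a single term; escape backslashes and quotes.
    static String quote(String id) {
        return "\"" + id.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // (original query) AND uniqueKey:("id1" OR "id2" ...)
    static String build(String originalQuery, String uniqueKeyField, List<String> ids) {
        StringBuilder sb = new StringBuilder();
        sb.append('(').append(originalQuery).append(") AND ")
          .append(uniqueKeyField).append(":(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append(quote(ids.get(i)));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("title:solr", "id", Arrays.asList("doc1", "doc7")));
    }
}
```

A document that no longer matches the original query simply drops out of the second-phase response, so the merger has to tolerate fewer documents coming back than it asked for.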

If people think it is really crucial to have index view consistency, then it should be easy to implement "Consistency via Retry" as mentioned in http://wiki.apache.org/solr/FederatedSearch 

>>It might also be the case that a custom partitioning function could be implemented (such as improving caching by partitioning queries, etc) or it may be more efficient to do the second phase of a query on the same shard copy as the first phase.
>>In that case it might make sense load balancing across shards from Solr. 
For the second phase of a query to execute on the same shard copy, third-party "sticky" load balancers can be used. I believe Apache already does that. All copies of a single partition can sit behind the Apache load balancer (doing the "sticky" routing). The merger just needs to know the load-balancer ip/port for each partition. Then, based on the query, the merger can search only the appropriate partitions.

To improve caching, Solr itself has to do the load balancing. Another option could be to introduce a query result cache at the merger itself.

>>Where are terms extracted from (some queries require index access)? This should be delegated to the shards, no? It can be the same step that gets the docFreqs from the shards (pass the query, *not* the terms). 
Yes, if that's the case, it should be easy to implement as you have suggested.

>>I think we should base the solution on something like https://issues.apache.org/jira/browse/SOLR-281 
cool, I was looking for something like this. This looks like the way to go.

>>Any thoughts on RMI vs HTTP for the searcher-subsearcher interface? 
RMI could be supported as an option by enhancing the ResponseParser (better name ??) interface. The remote search server can directly return the SolrQueryResponse object. I understand that there would be some performance benefit from native Java marshalling/unmarshalling of objects, instead of writing and then parsing the Solr response (the HTTP way). The question we need to answer is: is the effort/complexity worth it?

In our organization we made a conscious decision to go with HTTP. The operations folks like HTTP because it is standard: load balancing, monitoring, etc. Lots of tools are already available for it. With RMI, I am not sure external sticky load balancing is possible; the merger itself would have to build that logic.
Moreover, I think HTTP fits more naturally with Solr's request handler model.






 was:
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
>>Do you have plans to remedy that? Or do you think that most people are OK with inconsistencies that could arise?
The thing to note here is that multi-phase execution is currently based on document unique fields, NOT on internal doc ids. So there won't be many inconsistencies between requests, since it does not depend on changing internal doc ids.
One possibility is that a particular document has been deleted by the time the second phase executes, which in my opinion should be OK to live with.
Another possibility is that the document has changed and the original query terms are no longer present in it. This can be solved by ANDing the original query with the uniq field document query.

If people think it is really crucial to have index view consistency, then it should be easy to implement "Consistency via Retry" as mentioned in http://wiki.apache.org/solr/FederatedSearch 
"Consistency via specifying Index version" would be a little more involved. Session management with "sticky" load balancers could be explored.

>>It might also be the case that a custom partitioning function could be implemented (such as improving caching by partitioning queries, etc) or it may be more efficient to do the second phase of a query on the same shard copy as the first phase.
>>In that case it might make sense load balancing across shards from Solr. 
For the second phase of a query to execute on the same shard copy, third-party "sticky" load balancers can be used. I believe Apache already does that. All copies of a single partition can sit behind the Apache load balancer (doing the "sticky" routing). The merger just needs to know the load-balancer ip/port for each partition. Then, based on the query, the merger can search only the appropriate partitions.

To improve caching, Solr itself has to do the load balancing. Another option could be to introduce a query result cache at the merger itself.

>>Where are terms extracted from (some queries require index access)? This should be delegated to the shards, no? It can be the same step that gets the docFreqs from the shards (pass the query, *not* the terms). 
Yes, if that's the case, it should be easy to implement as you have suggested.

>>I think we should base the solution on something like https://issues.apache.org/jira/browse/SOLR-281 
cool, I was looking for something like this. This looks like the way to go.

>>Any thoughts on RMI vs HTTP for the searcher-subsearcher interface? 
RMI could be supported as an option by enhancing the ResponseParser (better name ??) interface. The remote search server can directly return the SolrQueryResponse object. I understand that there would be some performance benefit from native Java marshalling/unmarshalling of objects, instead of writing and then parsing the Solr response (the HTTP way). The question we need to answer is: is the effort/complexity worth it?

In our organization we made a conscious decision to go with HTTP. The operations folks like HTTP because it is standard: load balancing, monitoring, etc. Lots of tools are already available for it. With RMI, I am not sure external sticky load balancing is possible; the merger itself would have to build that logic.
Moreover, I think HTTP fits more naturally with Solr's request handler model.





> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. All document fields are NOT requested, only document uniqFields and sort fields are requested. MoreLikeThis and Highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields , highlighting and moreLikeThis info from SecondQueryPhase are merged into FirstQueryPhase response.
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES;
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> Would appreciate feedback on my approach. I understand that there would be lot things I might have over-looked. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-303:
----------------------------------

    Comment: was deleted

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

attaching updated patch (distributed.patch) that fixes some sorting issues.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580024#action_12580024 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Committed additional tests... thanks!

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
----------------------------------

    Attachment: fedsearch.patch

Removed the commented-out line from SolrCore.loadSearchComponents and a couple of debug statements.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598548#action_12598548 ] 

Lars Kotthoff commented on SOLR-303:
------------------------------------

On closer inspection of the code, are the fields "sort" and "prefix" of FieldFacet used anywhere at all? They don't seem to be referenced anywhere in the code and just removing them doesn't seem to have any obvious effect.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Ian Holsman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557533#action_12557533 ] 

Ian Holsman commented on SOLR-303:
----------------------------------

Hoss.. 

I'm not sure about n**2. 

I would think it would be n * number of shards.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572633#action_12572633 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

OK, I've committed this!  Thanks everyone!
I'll leave this bug open for now as a place to accumulate patches.

Some things that are missing (but optional and not currently high on my TODO list):
 - field faceting when facet.sort=false
 - distributed idf... this has a performance cost, and should matter little in a well mixed index.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

Small update, mostly to sorting:
- This changes sorting to get values from the Sort comparators (thus supporting custom sorts)
- uses external values that can be supported by XML, also nicer for debugging
- returns sort field values in an array per-field {price=[10,20,30,40,50]}
- merging should be faster... lookup of sort values is by index number instead of searching
  for the field name.
- merging short-circuits comparisons for docs in the same shard
- sorting null values now works & respects sortMissingFirst/Last, etc
- if this is a shard request, don't pre-fetch docs for the highlighter
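
For readers who want a feel for the merge described above, here is a minimal sketch in Python (not Solr's Java, and not taken from the patch). The response layout, the function name merge_shard_docs, and the ascending-only sort are assumptions made for illustration; null handling (sortMissingFirst/Last) is left out.

```python
from functools import cmp_to_key


def merge_shard_docs(shard_responses, sort_field, rows):
    """Merge per-shard results that each shard already sorted ascending.

    Each response is a dict with a hypothetical layout:
      {"ids": ["a", "b"], "sort_values": {"price": [10, 20]}}
    where sort_values[field][i] belongs to ids[i].
    """
    entries = []
    for shard_idx, resp in enumerate(shard_responses):
        values = resp["sort_values"][sort_field]
        for pos, doc_id in enumerate(resp["ids"]):
            # The sort value is looked up by position in the per-field array.
            entries.append((shard_idx, pos, values[pos], doc_id))

    def compare(a, b):
        if a[0] == b[0]:
            # Same shard: that shard already ordered these docs, so the
            # positions decide and the sort values are never compared.
            return a[1] - b[1]
        if a[2] != b[2]:
            return -1 if a[2] < b[2] else 1
        # Equal values from different shards: fall back to shard order.
        return a[0] - b[0]

    entries.sort(key=cmp_to_key(compare))
    return [doc_id for _, _, _, doc_id in entries[:rows]]


shards = [
    {"ids": ["a", "b", "c"], "sort_values": {"price": [10, 30, 50]}},
    {"ids": ["x", "y"], "sort_values": {"price": [20, 40]}},
]
top = merge_shard_docs(shards, "price", rows=4)
```

The same-shard branch is only safe because every shard returns its documents in sort order; a real merger would also have to honor descending sorts and multiple sort fields.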

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611372#action_12611372 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

{quote}
http://localhost:8983/solr/select?shards=[4 shards]&q=*:*&start=5000&rows=1000
Seems to request &rows=6000 from all the shards?
{quote}

It's a feature.

To retrieve documents 5000-6000, one must find the first 6000 documents then take the last 1000.
Since it's possible that all top 6000 documents could come from a single shard, the top 6000 documents must be collected from each and merged.

There are alternatives:
1) Optimistically request less than 6000 documents per shard and re-query if we are wrong
2) Add an optional mode that treats documents across shards in the same position as equal, so if you had 10 shards, you would simply get the top 100 docs starting at 500.  This might be OK for some applications.

In general, search engines are optimized for retrieving the top 10 of something, and bad at retrieving the top 10 starting at a big number.  Limit the depth people can page, or restructure queries to avoid the latter case.
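
To make the arithmetic concrete, here is a small Python sketch of the merge (an illustration only, not the Solr code; the function names and the (score, doc_id) hit layout are assumptions). Every shard is asked for start+rows hits, the lists are merged by score, and the first start hits are then thrown away.

```python
import heapq
from itertools import islice


def rows_per_shard(start, rows):
    # Any single shard might hold every one of the top start+rows documents,
    # so each shard has to be asked for that many.
    return start + rows


def distributed_page(shards, start, rows):
    """Return hits start..start+rows of the globally merged ranking.

    shards: list of per-shard hit lists (hypothetical layout), each a list of
    (score, doc_id) tuples already sorted by descending score.
    """
    need = rows_per_shard(start, rows)
    per_shard_top = [hits[:need] for hits in shards]
    # heapq.merge expects ascending inputs, so order by negated score.
    merged = heapq.merge(*per_shard_top, key=lambda hit: -hit[0])
    return list(islice(merged, start, start + rows))


# Worst case from the comment above: one shard holds all of the top hits.
shard_a = [(100 - i, "a%d" % i) for i in range(10)]
shard_b = [(50 - i, "b%d" % i) for i in range(10)]
page = distributed_page([shard_a, shard_b], start=5, rows=3)
```

In this example the whole page comes from shard_a, which is why asking each shard for only rows hits would return the wrong documents.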

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
----------------------------------

    Comment: was deleted

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Issue Comment Edited: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607106#action_12607106 ] 

bwhitman edited comment on SOLR-303 at 7/7/08 2:57 PM:
------------------------------------------------------------

Putting &debugQuery on a query with shards that returns 0 results will NPE:

(removing NPE code block so it stops wrapping the page)

      was (Author: bwhitman):
    Putting &debugQuery on a query with shards that returns 0 results will NPE:

{code}
INFO: webapp=/solr path=/select params={shards=localhost:8983/solr,localhost:8984/solr,localhost:8985/solr,localhost:8986/solr&debugQuery=true&q=i_tag:894&rows=100} status=500 QTime=8 
Jun 22, 2008 12:45:38 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
	at org.apache.solr.handler.component.DebugComponent.finishStage(DebugComponent.java:133)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:257)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
	at org.mortbay.jetty.Server.handle(Server.java:285)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
	at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

{code}
  
> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605660#action_12605660 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

> But shouldn't there be an option to skip over servers that aren't responding or time out?

That does sound like it would be a useful option (though I think it should be false by default).

FYI, I'm currently looking into Lars' facet changes.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Jayson Minard (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580188#action_12580188 ] 

Jayson Minard commented on SOLR-303:
------------------------------------

Would it be interesting to others to have an extended response format for distributed queries that would bring back the list of shards numbered, and then code each element of the response with the source list of shards that contributed to the element appearing in the results?  For example, which shard was the source of a document?  Or which shards had the facet value present?  And so on.

With really high shard counts, it is more efficient if you can trim follow-on queries and pivots to only the shards that matter.  This information would help that effort.

Regardless, it is useful for debugging.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579931#action_12579931 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

I just committed this bugfix... thanks Jayson!

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

Updated patch:
- facet refinement requests piggyback on the requests to retrieve stored fields where possible.
- fixed bug when requesting scores... don't include scores even if requested if they are not in the given DocList
- fixed HTTP error codes for query parse errors
- added double/long support in sorting since we've upgraded to lucene 2.3, and changed aggregate numFound to handle long
- escape & unescape the comma separated "ids" string using backslash escaping (used to specify docs from each shard to retrieve)
- other misc cleanups
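
As an illustration of the backslash escaping mentioned for the "ids" string, here is a minimal Python sketch (the helper names join_ids and split_ids are hypothetical and this is not the code from the patch). It assumes ids are non-empty strings, so an empty parameter value is read as "no ids".

```python
def join_ids(ids):
    """Join ids with commas, escaping backslashes and commas inside each id."""
    return ",".join(
        doc_id.replace("\\", "\\\\").replace(",", "\\,") for doc_id in ids
    )


def split_ids(joined):
    """Inverse of join_ids: split on unescaped commas and drop the escapes."""
    if not joined:
        return []
    ids, current, escaped = [], [], False
    for ch in joined:
        if escaped:
            # The character after a backslash is always taken literally.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == ",":
            ids.append("".join(current))
            current = []
        else:
            current.append(ch)
    ids.append("".join(current))
    return ids


original = ["doc,1", "a\\b", "plain"]
encoded = join_ids(original)
decoded = split_ids(encoded)
```

Escaping the backslash itself before the comma is what keeps the round trip unambiguous for ids that already contain a backslash.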

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572542#action_12572542 ] 

Ryan McKinley commented on SOLR-303:
------------------------------------


{quote}Given that most of this is new functionality, I think things are in good enough shape to commit now (making it much easier for others to generate patches against it).{quote}

+1 (But have only checked that it does not break anything I'm working with) -- I think this should get committed soon.  Since it is large and mostly discrete from existing functions, it will be much easier to refine with smaller patches.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614181#action_12614181 ] 

Lars Kotthoff commented on SOLR-303:
------------------------------------

With Jetty, the default limit for form submissions is 200000 bytes. I'm not sure why Solr is trying to send such large amounts of data to the shards, though. The only case where I've seen this happen is with faceting: Solr has to request facet counts for specific values from the shards to get exact counts. Maybe it's because of the sorting?

Anyway, you can change the limit by setting the org.mortbay.http.HttpRequest.maxFormContentSize system property.
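
As a minimal sketch of the property approach, the limit could be raised from Java before Jetty starts handling requests. This assumes the property name quoted above is the one your Jetty version actually reads (other versions may use a different key), and the 1000000 value and the FormLimitSketch class are purely illustrative. Passing the same key with -D on the JVM command line would be the more usual route.

```java
final class FormLimitSketch {
    // Property name as quoted in the comment above; treat it as an assumption
    // for any Jetty version other than the one discussed in this thread.
    static final String KEY = "org.mortbay.http.HttpRequest.maxFormContentSize";

    /** Sets the system property to the given size in bytes. */
    static void raiseLimit(int bytes) {
        System.setProperty(KEY, Integer.toString(bytes));
    }

    public static void main(String[] args) {
        raiseLimit(1000000); // illustrative value only
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```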

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated SOLR-303:
--------------------------------

    Attachment: fedsearch.patch

Added support for sorting.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557537#action_12557537 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

> I would think it would be n * number of shards.

That would make the number of terms to transfer over the network and to merge O(n_shards**2)... not great for scalability :-)
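
To make the scaling argument concrete, here is a small arithmetic sketch. It assumes, as in the quoted suggestion, that every shard is asked for n * n_shards terms; the class and method names are hypothetical.

```java
final class FacetTransferSketch {
    /** Total terms sent to the coordinator if each shard returns n * shards terms. */
    static long termsTransferred(long n, long shards) {
        long perShard = n * shards;   // what each shard is asked for
        return perShard * shards;     // summed over all shards: n * shards^2
    }

    public static void main(String[] args) {
        long n = 10;
        for (long shards : new long[] {2, 10, 100}) {
            System.out.println(shards + " shards -> " + termsTransferred(n, shards) + " terms");
        }
    }
}
```

With n = 10, going from 10 to 100 shards raises the total from 1000 to 100000 terms, a hundredfold increase for ten times as many shards.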

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611368#action_12611368 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

Anyone notice something like this:

http://localhost:8983/solr/select?shards={4 shards}&q=*:*&start=5000&rows=1000

Seems to request &rows=6000 from all the shards? (likewise, start=10000&rows=1000 sends rows=11000 to all the shards?) 

The shards all say:
INFO: webapp=/solr path=/select params={fl=id,score&start=0&q=*:*&isShard=true&wt=javabin&fsv=true&rows=6000&version=2.2} hits=6000 status=0 QTime=175 

And the host I called select on says:
INFO: webapp=/solr path=/search params={start=5000&q=*:*&rows=1000} status=0 QTime=1192 

And the QTime goes up the higher &start goes. (QTime for start=5000 was 200, QTime for start=50000 was 4500, start=500000 had 35000!)

Bug or feature?
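
For context, the sketch below shows why a coordinator that merges sorted results would ask each shard for start+rows documents: the requested page could come entirely from a single shard, so no shard can safely skip its own first `start` hits. This is not the patch's code; DeepPagingSketch and its methods are hypothetical, and documents are reduced to bare scores.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DeepPagingSketch {
    /** Rows each shard must return so the global page [start, start+rows) can be built. */
    static int rowsPerShard(int start, int rows) {
        return start + rows;
    }

    /** Merges per-shard score lists (each sorted descending) and cuts out the requested page. */
    static List<Double> page(List<List<Double>> shardScores, int start, int rows) {
        List<Double> all = new ArrayList<Double>();
        for (List<Double> scores : shardScores) {
            // Each shard contributes at most its top (start + rows) hits.
            int limit = Math.min(scores.size(), rowsPerShard(start, rows));
            all.addAll(scores.subList(0, limit));
        }
        Collections.sort(all, Collections.reverseOrder());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return new ArrayList<Double>(all.subList(from, to));
    }
}
```

Under this scheme rowsPerShard(5000, 1000) is 6000, which matches the rows=6000 seen in the shard logs above, and the work per shard grows with start.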



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Gereon Steffens (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556881#action_12556881 ] 

Gereon Steffens commented on SOLR-303:
--------------------------------------

Yonik - thanks, that's what caused it.

Patrick - as far as I can tell, you can ignore the error messages from patch.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567953#action_12567953 ] 

patrick o'leary commented on SOLR-303:
--------------------------------------

It looks pretty good. I really need the ShardDoc classes to be split up into public classes so I can use them.
It would also be fantastic to open up QueryComponent. My component only needs to override a few functions, and it would be so much cleaner to just extend QueryComponent rather than duplicate the code.

Also, during testing it might be worthwhile to cover a few negative edge cases, e.g. duplicate documents in different shards. As systems get larger this becomes a real possibility. Only fixed hash indexing could ensure you don't get duplicates, but if you want an extensible environment that might not be an option.

It took me a while to realize I had duplicated documents during indexing. The duplicates cause NPEs in the query response writers, so the problem is not obvious or easy to figure out.

A solution would be to maintain a map of unique fields while adding the ShardDocs to the priority queue, and continue on duplicates. You might also want to put some logic in there to ensure the same shard's doc is used for each duplicate doc, simply because the scores for identical docs will differ across shards, and the result could otherwise change depending on which shard responds first. This should eliminate that.


So something like
QueryComponent.mergeIds
{code}

Map<Object, String> uniqueDoc = new HashMap<Object, String>();

for (ShardResponse srsp : sreq.responses) {
  SolrDocumentList docs = srsp.rsp.getResults();
  ................
  ................
  // go through every doc in this response, construct a ShardDoc, and
  // put it in the priority queue so it can be ordered.
  for (int i=0; i<docs.size(); i++) {
    SolrDocument doc = docs.get(i);
    ..................
    ..................
    Object uniqueField = doc.getFieldValue(uniqueKeyField.getName());

    if (!uniqueDoc.containsKey(uniqueField)) {
      shardDoc.setId(uniqueField);
      uniqueDoc.put(uniqueField, shardDoc.shard);
    } else {
      numFound--;
      if (uniqueDoc.get(uniqueField).compareTo(shardDoc.shard) > 0) {
        continue;
      }
    }

    ..........................
    queue.insert(shardDoc);
  } // end for-each-doc-in-response
} // end for-each-response
{code}
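
The same idea can be tried in isolation with the self-contained sketch below. It is not the QueryComponent code: Doc is a hypothetical stand-in for ShardDoc, and the tie-break rule (keep the copy from the shard whose name sorts first) is one assumed way to make the choice independent of which shard responds first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DedupSketch {
    /** Hypothetical stand-in for ShardDoc: unique key, originating shard, score. */
    static final class Doc {
        final String id;
        final String shard;
        final float score;

        Doc(String id, String shard, float score) {
            this.id = id;
            this.shard = shard;
            this.score = score;
        }
    }

    /** Keeps one Doc per id; on duplicates, keeps the one whose shard name sorts first. */
    static List<Doc> dedupe(List<Doc> docs) {
        Map<String, Doc> byId = new LinkedHashMap<String, Doc>();
        for (Doc d : docs) {
            Doc seen = byId.get(d.id);
            if (seen == null || d.shard.compareTo(seen.shard) < 0) {
                byId.put(d.id, d); // first sighting, or a copy from a "smaller" shard
            }
        }
        return new ArrayList<Doc>(byId.values());
    }
}
```

A caller would also subtract docs.size() minus the size of the returned list from numFound, and insert only the surviving docs into the priority queue.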

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534213 ] 

Stu Hood commented on SOLR-303:
-------------------------------

I really like where you are headed with the 'componentized' version of the patch: it is much more elegant.

But: I'm still having the problem where multi-valued fields only get one value returned. During AuxiliaryQPhaseComponent.merge(SolrQueryResponse rsp, SolrQueryResponse auxPhaseRes), you check whether the field already exists before adding it, but multi-valued fields can exist multiple times.
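
A minimal sketch of the kind of merge that would keep every value: instead of skipping a field that is already present, append to a per-field list. The class below is hypothetical and uses plain name/value pairs rather than the patch's response classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiValueMergeSketch {
    /** Collects (name, value) pairs so that a field seen more than once keeps all its values. */
    static Map<String, List<Object>> collect(List<String[]> pairs) {
        Map<String, List<Object>> fields = new LinkedHashMap<String, List<Object>>();
        for (String[] pair : pairs) {
            List<Object> values = fields.get(pair[0]);
            if (values == null) {
                values = new ArrayList<Object>();
                fields.put(pair[0], values);
            }
            values.add(pair[1]); // append rather than "skip if the field exists"
        }
        return fields;
    }
}
```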

Also, I'm considering disabling the AuxiliaryQPhase and just letting the MainQPhase fetch the document fields. All of my documents are small ( < 1k on average with 10ish fields), so I think making another call across the network to fetch the remaining fields is probably a waste for our indexes. What do you think?

Thanks!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch


[jira] Assigned: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley reassigned SOLR-303:
---------------------------------

    Assignee: Yonik Seeley

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Sean Timm (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580392#action_12580392 ] 

Sean Timm commented on SOLR-303:
--------------------------------

Jayson--

I agree.  I've been meaning to recommend that be added.  We've found it invaluable in the past (mostly with debugging) when doing federated and distributed search.  I would like to see a "shard" field added which would contain the base URI of the shard where the result originated as provided in the request.  The index of each result is less important to me, but I can see how that would be useful.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614400#action_12614400 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

bq. but I think we should probably handle the case better than a 500 error. maybe a solr warning about per-shard row limits?

That's a jetty limit you hit, the exception was understandable, and an unknown exception like that (from solr's perspective) seems like it should map to a 500 error code.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527570 ] 

Stu Hood commented on SOLR-303:
-------------------------------

Thanks Sharad, the last patch applied cleanly as you said.

I've run into some errors that should be quick fixes for your next revision:

* I had to modify the code not to assume that shard names end in '/solr' so that I could specify an instance name, like: 'blah.com:8080/instance_name'.
* The parameters for your subqueries are not (always?) getting escaped. My document ids contain some colons (':'), so it's throwing a null pointer error during the SecondQueryPhase, and then again in SolrCore execute.
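
One way the escaping could look, as a hedged sketch: backslash-escape the query-syntax special characters in the id, then URL-encode the whole parameter value. EscapeSketch, its methods and the list of special characters are assumptions for illustration, not code from the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class EscapeSketch {
    // Characters assumed to be special in the query syntax (illustrative list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;";

    /** Backslash-escapes special characters and whitespace in a raw term. */
    static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a URL-encoded q parameter that looks up one document by its unique key. */
    static String idQueryParam(String keyField, String id) {
        String query = keyField + ":" + escapeQueryChars(id);
        try {
            return "q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

For an id such as doc:1:2 the escaped term is doc\:1\:2, so the colons inside the id are no longer read as field separators.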


Thanks a lot for your work!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526563 ] 

Stu Hood commented on SOLR-303:
-------------------------------

Sharad, which Solr revision did you apply the latest copy of this patch against? I know that the r573893 commit caused all kinds of havoc in the source tree, but I'd really like to try it out, and I don't mind using an older revision to get it working.

Also, do you have any newer versions of the patch?

Thanks a lot!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571992#action_12571992 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Patrick, I've reproduced your null pointer exception on accidental duplicates (I've been working on tests).  I'll look into a fix along the lines of what you suggested.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556644#action_12556644 ] 

patrick o'leary commented on SOLR-303:
--------------------------------------

Hey Yonik

Are you applying the federated search patch before the distributed search patch?
The patch itself won't apply cleanly against trunk.

Thanks
P

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

OK, this version patches cleanly and includes some distributed faceting code.
- facet.query and facet.field sorted by count are mostly handled
- breaking ties by natural (index) sort order is not yet implemented
- date faceting and unsorted (index order) facet.field are not implemented

Assuming the user asks for the top 10 terms of a field:
1) The first facet queries piggyback on the queries to get the top ids and sort field values.
2) Counts are merged, and new "refinement" requests are sent out for those terms in the top 10 where a count was not received from some shards.  Also, for terms below the top 10, we calculate the maximum count each could have based on the shards we have not heard from, and if that boosts it into the top 10, we include that term for "refinement".
3) Refinement responses are used to adjust the counts, and we are done.

Note that it is theoretically possible to miss terms.  A term could be just below the threshold of each shard (and thus not returned by any shard), but the total count could boost it into the top 10.  This could be rectified by retrieving *all* terms above a specified count, but it could be expensive.  The counts that are currently returned are exact.
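
The refinement planning described above can be sketched roughly as follows. This is an illustration in Python, not code from the patch: the function name, the data layout and the rule used for the "maximum it could have" (a shard that returned a full page of terms may hold an unreported term with a count up to its smallest reported count) are all assumptions made for the example.

```python
from collections import defaultdict

def plan_refinement(shard_facets, limit):
    """shard_facets: {shard: {term: count}}, each shard's top `limit` terms (limit >= 1).
    Returns (known, refine): summed counts per term, and for each term that
    needs refinement the sorted list of shards to ask for its exact count."""
    known = defaultdict(int)
    for counts in shard_facets.values():
        for term, count in counts.items():
            known[term] += count

    # Assumed bound: a shard that filled its page may hide counts up to its
    # smallest reported count; a short page means it reported every term.
    ceiling = {
        shard: (min(counts.values()) if len(counts) >= limit else 0)
        for shard, counts in shard_facets.items()
    }

    ranked = sorted(known, key=lambda t: (-known[t], t))
    top = set(ranked[:limit])
    threshold = known[ranked[limit - 1]] if len(ranked) >= limit else 0

    refine = {}
    for term in ranked:
        # Shards that did not report this term and could still hold some of it.
        missing = sorted(
            s for s, counts in shard_facets.items()
            if term not in counts and ceiling[s] > 0
        )
        if not missing:
            continue
        best_case = known[term] + sum(ceiling[s] for s in missing)
        if term in top or best_case >= threshold:
            refine[term] = missing
    return dict(known), refine
```

With two shards reporting their top 2 terms, a term seen on only one shard is refined on the other when it is already in the merged top 2, or when its best-case total could reach the current cut-off. A term that no shard reported never enters `known`, which is the "missed term" case noted above.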



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523925 ] 

Mike Klaas commented on SOLR-303:
---------------------------------

Great stuff!

I think asynchronous/parallel requests are a central feature of this kind of result aggregator.  In my similar Python implementation, I fire off all the requests and collect the responses in a select() loop.  Threads are possible but get somewhat weighty when you have many shards (I've used up to 90).  An easier alternative to select() is to simply fire off all the requests and then wait for the responses sequentially (assuming Java has an API that allows this).  This is almost as good as the select() loop but without the same complexity.
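
As a rough illustration of the "fire off all the requests, then wait for the responses sequentially" idea, here is a minimal Python sketch. It uses asyncio in place of the select() loop mentioned above, and `query_shard` only simulates a shard with a sleep; the function names and the latency values are assumptions for the example, not part of any patch on this issue.

```python
import asyncio

async def query_shard(shard, latency):
    # Stand-in for one HTTP request; a real client would do network I/O here.
    await asyncio.sleep(latency)
    return f"results from {shard}"

async def scatter_gather(shards):
    # Fire off every request up front so all shards work at the same time.
    tasks = [
        asyncio.create_task(query_shard(name, latency))
        for name, latency in shards
    ]
    # Then wait for the responses sequentially, in the order they were sent.
    return [await task for task in tasks]

def search(shards):
    """shards: list of (name, simulated_latency_seconds) pairs."""
    return asyncio.run(scatter_gather(shards))
```

Because every request is already in flight before the first wait, the total time is governed by the slowest shard rather than by the sum of all shard latencies, even though the responses are collected one after another.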

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch
>
>

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543774 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

I'm really just starting to dig into this again, but here are a couple of thoughts:

It looks like there is a monolithic main federated query component that does all the work... It would be nice if there were a way to turn this around so that a user could write a query component that could participate in a distributed search call.  It seems like query info should be able to be gathered from multiple components and then a single request to a shard could be made.  This entails multiple methods on QueryComponent for use in a distributed request.

Another observation is that the number of "phases" may be unpredictable.  For example when faceting, if one wants "exact" results, more information may be required from certain nodes.  This means that components need a way to say whether they are done, and a way to send different requests to different shards.  Then when responses are received, it should be possible to optionally handle them one-by-one as they come in, or alternatively all at once to merge the results.
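
One way to picture the design suggested above is a coordinator that keeps asking every component for its next per-shard requests, merges them into a single request per shard, and stops once no component asks for anything. The Python sketch below is only an illustration: the class names, the `requests`/`handle` methods and the parameter keys (`ids`, `facets`, `refine`, `complete`) are hypothetical and do not correspond to the Solr API.

```python
class IdsComponent:
    """Asks every shard once for document ids, then reports that it is done."""
    def __init__(self):
        self.sent = False

    def requests(self, shards, state):
        if self.sent:
            return {}  # an empty dict means "done"
        self.sent = True
        return {shard: {"ids": True} for shard in shards}

    def handle(self, shard, response, state):
        state.setdefault("ids", []).extend(response.get("ids", []))


class FacetComponent:
    """Asks all shards for facets, then re-asks only shards still incomplete."""
    def __init__(self):
        self.pending = None

    def requests(self, shards, state):
        if self.pending is None:
            self.pending = set(shards)
            return {shard: {"facets": True} for shard in shards}
        return {shard: {"refine": True} for shard in sorted(self.pending)}

    def handle(self, shard, response, state):
        facets = state.setdefault("facets", {})
        for term, count in response.get("facets", {}).items():
            facets[term] = facets.get(term, 0) + count
        if response.get("complete", True):
            self.pending.discard(shard)


def run_distributed(components, shards, send):
    """`send(shard, params)` is supplied by the caller and returns a response dict."""
    state, rounds = {}, 0
    while True:
        merged = {}  # shard -> parameters combined from all components
        for comp in components:
            for shard, params in comp.requests(shards, state).items():
                merged.setdefault(shard, {}).update(params)
        if not merged:
            return state, rounds
        rounds += 1
        for shard, params in merged.items():
            response = send(shard, params)
            for comp in components:
                comp.handle(shard, response, state)  # handled one-by-one
```

In this sketch the number of rounds is not fixed in advance: it depends on how long any component keeps returning requests, and later rounds may go to only a subset of the shards.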



> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528494 ] 

Hoss Man commented on SOLR-303:
-------------------------------

FWIW: I haven't really been able to follow this issue much (it's way out of my area of expertise), but seeing some comments go by in email, I wanted to mention two things...

> ResponseDocs are based on document unique key while DocList is based on internal doc id.
> The purpose of ResponseDocs is to represent documents lying in remote index while DocList are
> meant for local internal doc id.

One thing to keep in mind is the way MultiReader deals with this in Lucene ... if you know the maxDoc of each of your sub-indexes, then you can compute internal docIds ... that may be one way to preserve the DocList abstraction (and allow for supporting schemas without uniqueKey fields) when dealing with federated search (although it may open up new problems if you need to rely on having some form of an identifier that doesn't change ... I'm not sure if the approach being taken makes multiple requests to the shards).
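
The docId arithmetic alluded to here can be sketched as follows: each sub-index gets a starting offset equal to the sum of the maxDoc values of the sub-indexes before it. This is a Python illustration of the idea only, with hypothetical function names; it is not Lucene's MultiReader code.

```python
from bisect import bisect_right
from itertools import accumulate

def doc_starts(max_docs):
    """Starting offset of each sub-index: [0, maxDoc0, maxDoc0 + maxDoc1, ...]."""
    return [0] + list(accumulate(max_docs))[:-1]

def to_global(starts, sub_index, local_id):
    """Map a (sub-index, local docId) pair to a single global docId."""
    return starts[sub_index] + local_id

def to_local(starts, global_id):
    """Map a global docId back to its (sub-index, local docId) pair."""
    sub_index = bisect_right(starts, global_id) - 1
    return sub_index, global_id - starts[sub_index]
```

For example, with maxDoc values of 100, 50 and 200, the offsets are 0, 100 and 150, so local document 7 in the second sub-index becomes global document 107. The mapping holds only while every maxDoc stays the same, which is the "identifier that doesn't change" concern raised above.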

That said...

Federated Search is a complex enough concept that if it requires additions to the ResponseWriter API to be done efficiently, I don't think that would be the end of the world -- the key thing would be to find ways to minimize the impact on existing clients -- if things work for you now, they should keep working for you; if you want to start using federated search, then it's fair to expect that you may have to change a few things, or deal with a few limitations.  Off the top of my head: one option may be to add a FederatableResponseWriter interface such that if a request is federated, then the writer being used must implement that interface or it's a runtime error.


> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Thomas Peuss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581885#action_12581885 ] 

Thomas Peuss commented on SOLR-303:
-----------------------------------

bq. they are limited in the number of documents they can request during the second phase by the maximum length of the query string.
For Tomcat you can increase the allowed length of the query string by adding for example _maxHttpHeaderSize="65536"_ to the Connector entries in server.xml. This increases the max. allowed GET request size to 64KB (standard is 4KB).

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Kotthoff updated SOLR-303:
-------------------------------

    Attachment: solr-dist-faceting-non-ascii-all.patch

I've had a couple of issues with the current version. First, the facet queries that are sent to the other shards are put in the URL but aren't URL-encoded. As a result, during the refine stage anything non-ASCII comes back as facet counts for "new" values (i.e. the garbled version), which causes NPEs when trying to update the counts.
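
As an illustration of the encoding step described above (a minimal sketch, not the code in the attached patch), each parameter value can be percent-encoded as UTF-8 before it is appended to the shard URL. The class and method names here are hypothetical; only java.net.URLEncoder is a real API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Solr: builds one encoded "name=value" pair
// for a shard request URL.
final class FacetParamEncoder {
    static String encodeParam(String name, String value) {
        try {
            // Percent-encode both sides as UTF-8 so non-ASCII facet values
            // arrive at the remote shard unchanged.
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

For instance, encodeParam("facet.query", "category:café") yields facet.query=category%3Acaf%C3%A9, which the receiving shard decodes back to the original term.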

Furthermore, facet.limit=<negative value> isn't working as expected, i.e. instead of all facets it returns none. Also facet.sort is not automatically enabled for negative values.

I've attached "solr-dist-faceting-non-ascii-all.patch" which fixes the above issues. Somebody who understands what everything is supposed to do should have a look over it though :)
For example I've found two linked hash maps in FacetInfo, topFacets and listFacets, which seem to serve the same purpose. Therefore I replaced them by a single hash map. It seems to work just fine this way.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569712#action_12569712 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

> I really need the ShardDoc's classes to be split up into public classes

ShardDoc is public already... can you elaborate?

> It would also be fantastic to open up QueryComponent, my component only needs to override a few functions

What is yours trying to accomplish?

> A solution would be to maintain map of unique fields as adding the ShardDocs to the priority queue, and continue on duplicates.

Agree.  It should fall into the category of robustness though, rather than a duplicates detection feature (since it will mean that facets will be off, and it will be possible to get fewer docs than requested if duplicates do exist).
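
A rough sketch of the idea quoted above: keep a set of unique keys while feeding hits into the priority queue, and skip any key already seen. The Hit class is a hypothetical stand-in, not Solr's ShardDoc, and sorting is by score only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical illustration of duplicate-tolerant merging; not Solr code.
final class DedupMerge {
    static final class Hit {
        final String id;     // value of the unique key field
        final String shard;  // shard that returned the hit
        final float score;
        Hit(String id, String shard, float score) {
            this.id = id;
            this.shard = shard;
            this.score = score;
        }
    }

    // Merge per-shard hit lists into the global top n by score.
    static List<Hit> mergeTopN(List<List<Hit>> shardResults, int n) {
        Set<String> seen = new HashSet<>();
        // Min-heap on score: the head is the weakest hit currently kept.
        PriorityQueue<Hit> queue =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (List<Hit> hits : shardResults) {
            for (Hit h : hits) {
                if (!seen.add(h.id)) {
                    continue; // unique key already seen on another shard: skip
                }
                queue.add(h);
                if (queue.size() > n) {
                    queue.poll(); // drop the weakest hit to stay at n entries
                }
            }
        }
        List<Hit> merged = new ArrayList<>(queue);
        merged.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return merged;
    }
}
```

In this sketch the first copy of a key wins even if a later copy scores higher, and, as noted above, facet counts and the number of returned docs can still be off when duplicates exist.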

We also need to be robust in the face of a commit on a shard happening between phases of a request (a doc that we request info for may no longer exist, etc).  That would probably cause us to blow up currently.

Hopefully this can be committed after some basic tests are added, and that will make it much easier for others to contribute patches.  In the future maybe we should try a branch for changes this large.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557535#action_12557535 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

> one solution i've seen to mitigate problems like this in the past is to compute a higher "limit" when querying the individual shards

Yep.  Eventually should be configurable too.  We should definitely do some "over requesting" for very small limits.  Expanding the limit too much can be expensive though (CPU cost partially depends on the algorithm).  I think users should even be able to disable refinement queries if they just want an estimate.

Note that it's possible to tell if there even could be stealth terms out there... we maintain the smallest count we get from each shard, so that serves as the largest count any unknown term could have.  Add all those together to see if it's possible an unknown term could make it to the top terms.   This means you could do a request with a smaller limit, and then re-request with a larger limit if necessary.
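
The check described above can be sketched as follows. The names are hypothetical, and the sketch only covers a term that no shard returned at all; ties are treated conservatively as "could still make it".

```java
// Hypothetical sketch of the "stealth term" bound; not Solr code.
final class StealthTermBound {
    // smallestReturnedCountPerShard[i] is the smallest facet count shard i
    // included in its top-N list. A term the shard did not return cannot
    // have a higher count on that shard.
    static long maxUnknownTermCount(long[] smallestReturnedCountPerShard) {
        long sum = 0;
        for (long c : smallestReturnedCountPerShard) {
            sum += c;
        }
        return sum;
    }

    // True if a term seen by no shard could still reach the merged top terms,
    // i.e. a re-request with a larger limit may be worthwhile.
    static boolean largerLimitMayBeNeeded(long[] smallestReturnedCountPerShard,
                                          long lowestMergedTopCount) {
        return maxUnknownTermCount(smallestReturnedCountPerShard) >= lowestMergedTopCount;
    }
}
```

With three shards whose smallest returned counts are 3, 2 and 4, an unknown term can have at most 9 occurrences in total, so it cannot displace a merged term with a count of 10.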

Beyond that, it becomes unclear what the best strategy is.  Worst case scenario: If the top N facets get down to a count of 1, then *any* unknown term could bump another higher.  Requesting all terms with count>=1 from each shard isn't something I want to ponder. 

Anyway, a colleague informs me that this is the way at least one other major search vendor does things (counts are exact for terms shown, but it is theoretically possible to miss a term). 


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557332#action_12557332 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

WRT a switch, I left room for other components to insert stages between the well defined ones.
I'm not sure if this will be useful in the future or not.  Much of that seems like it would depend on the contracts between the components and the ResponseBuilder, and thus how other unknown custom components would be able to change things. That's still very immature, as I've really just been focusing on getting things working.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated SOLR-303:
--------------------------------

    Attachment: fedsearch.patch

To do a quick test of the patch, try adding:
shards=local,localhost:8080
as a request parameter to the search url

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. All document fields are NOT requested, only document uniqFields and sort fields are requested. MoreLikeThis and Highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields , highlighting and moreLikeThis info from SecondQueryPhase are merged into FirstQueryPhase response.
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES;
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> Would appreciate feedback on my approach. I understand that there would be lot things I might have over-looked. 


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

fixed test cases that relied on parsing previous explain format

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534268 ] 

Sharad Agarwal commented on SOLR-303:
-------------------------------------

>>But: I'm still having the problem where multi-valued fields only get one value returned. During AuxiliaryQPhaseComponent.merge(SolrQueryResponse rsp, SolrQueryResponse auxPhaseRes), you check whether the field already exists before adding it, but multi-value fields can exist multiple times.

Yeah, maybe I have missed those scenarios. If you have the fix, please feel free to update the patch.

>>Also, I'm considering disabling the AuxiliaryQPhase and just letting the MainQPhase fetch the document fields. All of my documents are small ( < 1k on average with 10ish fields), so I think making another call across the network to fetch the remaining fields is probably a waste for our indexes. What do you think?
Having AuxiliaryQPhase saves primarily on the following counts:
1) fetching doc fields 
2) generating snippets 
3) more like this query etc
-> for only the merged docs.

From my experience, generating snippets is very CPU intensive. If the number of shards is large and snippets are generated in MainQPhase, a lot of CPU is wasted: the wastage is proportional to (n-1)/n, where n is the number of shards.
So the extra network calls save CPU; there is a trade-off between the two.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

New patch attached... last one had an unfinished change that prevented compilation (using the generic SolrResponse instead of SolrQueryResponse).

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

New patch attached...

I just discovered that refinement queries weren't working because filter.query doesn't accept the new query syntax I was using to avoid having to escape field values: <!field f=myfield>value
(this should probably be committed separately, but it's in this patch for now).

I put in code to over-request the facet.field limit, but then commented it out for now since it too easily covers up bugs: it often prevents any refinement query logic from being exercised.

Also corrected the code that always used the last element as the max possible missing count.  If we requested 10 terms and only got 6, then we know that the max possible missing count is zero.
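
The corrected rule can be sketched like this (hypothetical names, not the code in the patch), assuming each shard returns its facet counts sorted in descending order and that a negative limit means "return all terms":

```java
// Hypothetical sketch of the per-shard "max possible missing count"; not Solr code.
final class MissingCountBound {
    // returnedCounts: the facet counts one shard returned, sorted descending.
    // requestedLimit: the number of terms that was requested from that shard.
    static long maxMissingCount(long[] returnedCounts, int requestedLimit) {
        if (requestedLimit < 0
                || returnedCounts.length < requestedLimit
                || returnedCounts.length == 0) {
            // The shard ran out of terms (or was asked for all of them), so any
            // term it did not return has a count of zero on that shard.
            return 0;
        }
        // The list was cut off at the limit: a term that was not returned can
        // have at most the count of the last returned term.
        return returnedCounts[returnedCounts.length - 1];
    }
}
```

Requesting 10 terms and getting back counts {9, 7, 5, 4, 2, 1} gives a bound of 0, while the same six counts for a request of 6 terms give a bound of 1.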

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated SOLR-303:
--------------------------

    Attachment: fedsearch.stu.patch

Here is another revision of the latest patch (I've still only tried it with r574785: I'm a bit crunched for time).

*Resolved issues:*
* We were forgetting to increment a counter during the last step in SecondQPhaseComponent.process, and so we weren't getting results from all shards.
* SecondQPhaseComponent.merge was throwing away any fields that already existed in a document, and so it was throwing away parts of multi-value fields. Fixing this exposed the first issue listed below.
* MultiSearchRequestHandler was creating non-daemon threads (the default) for the thread pool. This meant that the pool threads stuck around at shutdown and could keep the JVM from exiting. I added a ThreadFactory that creates daemonized threads.
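
For reference, a daemon-thread factory of the kind mentioned in the last item can look roughly like this. It is a minimal sketch using only java.util.concurrent; the class name, the thread-name prefix and the newPool helper are hypothetical and not taken from the patch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the ThreadFactory from the attached patch.
final class DaemonThreadFactory implements ThreadFactory {
    private final AtomicInteger counter = new AtomicInteger();
    private final String prefix;

    DaemonThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
        // Daemon threads do not prevent the JVM from exiting.
        t.setDaemon(true);
        return t;
    }

    // Fixed-size pool whose worker threads are all daemon threads.
    static ExecutorService newPool(int size) {
        return Executors.newFixedThreadPool(size, new DaemonThreadFactory("shard-request"));
    }
}
```

The pool should still be shut down explicitly where possible; daemon threads are only a safety net for the case where that does not happen.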

*Open issues:*
* The 'local' shard is ignoring the 'FL' parameter during the FirstQueryPhase, and returning the entire document. We then try to merge the document into itself in SecondQPhaseComponent.merge, causing a ConcurrentModificationException. For now, I put in a check for "newDoc != oldDoc", but I think we need to figure out why the local query is returning full documents.
* Range queries are broken (probably due to the extract terms phase failing)
* 'start' and 'numfound' are incorrect when returned to the user
** start is getting wiped out somewhere
** numfound is counting all copies of matches for a uniqKey towards the total
* MultiSearchRequestHandler.THREAD_POOL_SIZE and MultiSearchRequestHandler.REQUEST_TIME_OUT_IN_MS should be configuration parameters in solrconfig.xml.

Thanks a lot!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs are calculated by requesting all the shards and adding up numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. All document fields are NOT requested, only document uniqFields and sort fields are requested. MoreLikeThis and Highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and "rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and debug is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped based on shards. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields , highlighting and moreLikeThis info from SecondQueryPhase are merged into FirstQueryPhase response.
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES;
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> Would appreciate feedback on my approach. I understand that there would be lot things I might have over-looked. 
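
The two-phase flow in steps 2 to 4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in fedsearch.patch: `ShardHit`, `mergeTopIds` and `groupByShard` are hypothetical names, and sorting is assumed to be by descending score only (the TODO list notes other sort fields are not yet supported).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoPhaseMergeSketch {
    /** One first-phase hit: only the unique key, the score, and the owning shard (hypothetical type). */
    static final class ShardHit {
        final String shard;
        final String id;
        final float score;

        ShardHit(String shard, String id, float score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    /** Step 3: merge first-phase hits from all shards and keep the [start, start + rows) window. */
    static List<ShardHit> mergeTopIds(List<List<ShardHit>> perShard, int start, int rows) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first; List.sort is stable, so ties keep shard order.
        all.sort(Comparator.comparingDouble((ShardHit h) -> h.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    /** Step 4: group the merged ids by shard so each shard is asked only for its own documents. */
    static Map<String, List<String>> groupByShard(List<ShardHit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (ShardHit h : merged) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<ShardHit> local = Arrays.asList(new ShardHit("local", "a1", 0.9f), new ShardHit("local", "a2", 0.2f));
        List<ShardHit> remote = Arrays.asList(new ShardHit("host1:port1", "b1", 0.5f));
        List<ShardHit> merged = mergeTopIds(Arrays.asList(local, remote), 0, 2);
        System.out.println(groupByShard(merged));
    }
}
```

Only the documents that survive the merge are fetched in the second phase, which is where the CPU saving on fields and highlighting would come from.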

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

OK, here is a *draft* that mostly works for searches and highlighting.

There are stages in the request:
{code}
  public static int STAGE_START           = 0;
  public static int STAGE_PARSE_QUERY     = 1000;
  public static int STAGE_EXECUTE_QUERY   = 2000;
  public static int STAGE_GET_FIELDS      = 3000;
  public static int STAGE_DONE            = Integer.MAX_VALUE;
{code}

When a component wants to send a request, it adds it to the "outgoing" queue.
Other components can inspect and modify these shard requests.
All components get a callback when the shard response is received.

All shard requests have purposes (to aid in both correlation and inspection/modification by other components).
This is what a ShardRequest looks like:
{code}
public class ShardRequest {
  public final static String[] ALL_SHARDS = null;

  public final static int PURPOSE_PRIVATE         = 0x01;
  public final static int PURPOSE_GET_TERM_DFS    = 0x02;
  public final static int PURPOSE_GET_TOP_IDS     = 0x04;
  public final static int PURPOSE_REFINE_TOP_IDS  = 0x08;
  public final static int PURPOSE_GET_FACETS      = 0x10;
  public final static int PURPOSE_REFINE_FACETS   = 0x20;
  public final static int PURPOSE_GET_FIELDS      = 0x40;
  public final static int PURPOSE_GET_HIGHLIGHTS  = 0x80;

  public int purpose;  // the purpose of this request

  public String[] shards;  // the shards this request should be sent to
// TODO: how to request a specific shard address?

  public ModifiableSolrParams params;

  public List<ShardResponse> responses = new ArrayList<ShardResponse>();
}
{code}


Components are responsible for themselves... the highlighting component is responsible for turning itself on/off at the appropriate time... the query component has no knowledge of the highlight component.  This will make it so that custom components can be developed that can work in a distributed environment w/o explicit support for that component baked into the other components.
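
As a rough sketch of how a component could turn itself on or off by inspecting a request's purpose: the purpose values are powers of two, so they can presumably be combined and tested as bit flags. The `Request` class and the `modifyRequest` hook below are hypothetical stand-ins, not names from the draft patch; `hl` is Solr's usual highlighting switch.

```java
import java.util.HashMap;
import java.util.Map;

final class PurposeFlagsSketch {
    // Values copied from the ShardRequest draft.
    static final int PURPOSE_GET_TOP_IDS    = 0x04;
    static final int PURPOSE_GET_FIELDS     = 0x40;
    static final int PURPOSE_GET_HIGHLIGHTS = 0x80;

    /** Hypothetical stand-in for ShardRequest: a purpose mask plus request params. */
    static final class Request {
        int purpose;
        final Map<String, String> params = new HashMap<>();
    }

    /** Hypothetical hook: what a highlighting component might do to an outgoing request. */
    static void modifyRequest(Request sreq) {
        if ((sreq.purpose & PURPOSE_GET_FIELDS) != 0) {
            // Piggyback highlighting on the request that already fetches stored fields.
            sreq.purpose |= PURPOSE_GET_HIGHLIGHTS;
            sreq.params.put("hl", "true");
        } else {
            // Any other request (for example top ids) should not pay for highlighting.
            sreq.params.put("hl", "false");
        }
    }

    public static void main(String[] args) {
        Request fields = new Request();
        fields.purpose = PURPOSE_GET_FIELDS;
        modifyRequest(fields);
        System.out.println(Integer.toHexString(fields.purpose) + " " + fields.params);
    }
}
```

The query component never needs to know about highlighting here; it only labels its request with a purpose.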



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549573 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Yes, I'm suggesting changing the main control loop.
Normal non-distributed requests don't necessarily need stages (but stages could be added to be more consistent with the distributed methods... with stages, I don't think there would be a "prepare" method).
Right now, my private copy of SearchComponent looks like
{code}
public abstract class SearchComponent implements SolrInfoMBean
{
  public abstract void prepare( SolrQueryRequest req, SolrQueryResponse rsp ) throws IOException, ParseException;
  public abstract void process( SolrQueryRequest req, SolrQueryResponse rsp ) throws IOException;

  public int distributedProcess(ResponseBuilder rb) throws IOException {
    return ResponseBuilder.STAGE_END;
  }

  public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
  }
{code}
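
One way a main control loop could drive `distributedProcess`, as a hedged sketch only: each component reports the next stage it needs, and the loop advances to the smallest of those. The `Component` interface is a simplified, hypothetical stand-in for SearchComponent, the "take the minimum" rule is an assumption, and `STAGE_DONE` is taken from the earlier stage list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StageLoopSketch {
    static final int STAGE_START = 0;
    static final int STAGE_EXECUTE_QUERY = 2000;
    static final int STAGE_GET_FIELDS = 3000;
    static final int STAGE_DONE = Integer.MAX_VALUE;

    /** Hypothetical simplification of SearchComponent.distributedProcess. */
    interface Component {
        /** Does any work for currentStage and returns the next stage this component needs. */
        int distributedProcess(int currentStage);
    }

    /** Runs stages until every component reports STAGE_DONE; returns the stages visited. */
    static List<Integer> run(List<Component> components) {
        List<Integer> visited = new ArrayList<>();
        int stage = STAGE_START;
        while (stage != STAGE_DONE) {
            visited.add(stage);
            int next = STAGE_DONE;
            for (Component c : components) {
                next = Math.min(next, c.distributedProcess(stage));
            }
            if (next <= stage) {
                throw new IllegalStateException("component did not advance past stage " + stage);
            }
            stage = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        Component query = s -> s < STAGE_EXECUTE_QUERY ? STAGE_EXECUTE_QUERY
                : (s < STAGE_GET_FIELDS ? STAGE_GET_FIELDS : STAGE_DONE);
        Component highlight = s -> s < STAGE_GET_FIELDS ? STAGE_GET_FIELDS : STAGE_DONE;
        System.out.println(run(Arrays.asList(query, highlight)));
    }
}
```

Leaving gaps between stage numbers (1000, 2000, 3000) would let a custom component slot in its own intermediate stage without touching the others.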


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
----------------------------------

    Attachment: fedsearch.patch

I made a mistake and uploaded the wrong patch file. Now uploading the correct file.

I have fixed and updated the patch with trunk version 600419. It is integrated with the re-opened SOLR-281 patch.
I have added the configuration for the three distributed-search components in the solrconfig.xml, under "/search" request handler. So, the distributed search works with /search request only.

A couple of issues:
1. The dist search components need a reference to the SearchHandler. So for now, I have hard-coded the "/search" pattern in the FedSearchComponent.
2. We need a clean way to load common init params for the dist search components, such as timeout, thread pool size and search handler pattern.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Jayson Minard (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jayson Minard updated SOLR-303:
-------------------------------

    Attachment: distributed_add_tests_for_intended_behavior.patch

A few more tests to show the intended behavior when facets differ between shards, which is likely in the wild (missing from all but valid in schema, missing from some, and invalid field not in schema). The last test is just to ensure error behavior matches non-distributed searches.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556950#action_12556950 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

I'm in the middle of implementing some distributed faceting... but I'll try to get a better patch the next time around.
I think some of Ryan's suggestions are good (a separate patch to move SearchHandler, put solrj in core, implement ResponseWriter support for SolrJ objects).



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed_trunk.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Sean Timm (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605666#action_12605666 ] 

Sean Timm commented on SOLR-303:
--------------------------------

In SOLR-502, there is the notion of partialResults.  It seems that the same flag could be used in this case.  Perhaps a string should also be added indicating why all results were not able to be returned.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-303:
--------------------------

    Fix Version/s: 1.3

marking as intended for 1.3 ... i'm not overly familiar with the state of this issue, but i do know that large chunks of functionality have already been committed, so i want to make sure that before 1.3 is released someone consciously decides between:
   * "DONE" ...resolving this issue
   * "NOT DONE BUT OK" ... leaving the issue unresolved and removing the 1.3 designation
   * "NOT DONE AND NOT OK" ... rolling back any/all committed code that is considered detrimental for the 1.3 release.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Gunnar Wagenknecht (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598551#action_12598551 ] 

Gunnar Wagenknecht commented on SOLR-303:
-----------------------------------------

Hi / Hallo,

Thanks for your mail. Unfortunately, I won't be able to answer it
soon. I'm on vacation till June 2nd without access to my mails.

-Gunnar

-- 
Gunnar Wagenknecht
gunnar@wagenknecht.org
http://wagenknecht.org/


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545143 ] 

Sabyasachi Dalal commented on SOLR-303:
---------------------------------------

Can you please provide some more details about the error? Are you seeing
any exceptions? How are your partitions set up, and what is the
request you are sending?

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614402#action_12614402 ] 

Lars Kotthoff commented on SOLR-303:
------------------------------------

bq. I think we should probably handle the case better than a 500 error. maybe a solr warning about per-shard row limits?

That's specific to the configuration of your container, I think there's nothing that Solr can do about it.

As for the form content size, I haven't actually tried that myself I must admit. I'm running Tomcat and just got that parameter from the Jetty documentation. I'd take a wiredump with something like tcpdump to see what the actual size of the request is. Maybe it's even larger than 1000000 bytes?

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605931#action_12605931 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

I forgot we've already gone a few rounds on charset in POST bodies:
https://issues.apache.org/jira/browse/SOLR-443
http://markmail.org/message/gtzbtwzqa6zranur?q=POST+body+charset#query:POST%20body%20charset+page:1+mid:fkragfatbox5fff5+state:results

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Whitman updated SOLR-303:
-------------------------------

    Attachment: shards.start_rows.patch

Attaching a patch to add optional &shards.start and &shards.rows parameters. If set, they override distributed search's intelligence on setting start and rows per shard. If you set &shards.start=10 and &shards.rows=10, each shard will be queried with &start=10 and &rows=10, and with N shards you'll get back up to N*10 results (set &rows on the main query to get them all).

[Not a java developer, my patch works but may violate good taste/style]
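
A small sketch of the per-shard window choice described here. It assumes that, without the override, each shard is asked for `start=0` and `rows=start+rows` so that the coordinator can merge any page; that default and the `shardWindow` helper are assumptions for illustration, not code from the patch.

```java
final class ShardWindowSketch {
    /**
     * Returns {start, rows} to send to each shard.
     * A null override means the corresponding shards.* parameter was not set.
     */
    static int[] shardWindow(int start, int rows, Integer shardsStart, Integer shardsRows) {
        // Assumed default: every shard returns its top (start + rows) hits from offset 0.
        int perShardStart = (shardsStart != null) ? shardsStart : 0;
        int perShardRows = (shardsRows != null) ? shardsRows : start + rows;
        return new int[] { perShardStart, perShardRows };
    }

    public static void main(String[] args) {
        int[] byDefault = shardWindow(20, 10, null, null);
        int[] overridden = shardWindow(20, 10, 10, 10);
        System.out.println(byDefault[0] + "," + byDefault[1]);
        System.out.println(overridden[0] + "," + overridden[1]);
    }
}
```

With the override, deep paging no longer makes each shard return ever larger result sets, at the cost of the merged ordering no longer being exact.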

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614186#action_12614186 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

bq. I'm not sure why Solr is trying to send such large amounts of data to the shards though

Specifying 40,000 ids to be retrieved I imagine.  The average id length must be over 50 bytes.

Brian: if ordering isn't important for some of these big bulk queries, you might want to consider directly querying the shards.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Henri Biestro (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574984#action_12574984 ] 

Henri Biestro commented on SOLR-303:
------------------------------------

Nothing functional, I just noticed while reading the code that Shard{Doc,Request} are missing the Apache license header.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557324#action_12557324 ] 

patrick o'leary commented on SOLR-303:
--------------------------------------

Small thing but if you update org.apache.solr.handler.component.ResponseBuilder
and set the stages to final, you can use a switch statement in the distributedProcess phase.

{code}
public class ResponseBuilder 
{
  public static final int STAGE_START           = 0;
  public static final int STAGE_PARSE_QUERY     = 1000;
  public static final int STAGE_EXECUTE_QUERY   = 2000;
  public static final int STAGE_GET_FIELDS      = 3000;
  public static final int STAGE_DONE            = Integer.MAX_VALUE;
{code}
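
For illustration, a minimal sketch of the switch this enables. Java requires `case` labels to be compile-time constants, which `static final int` fields with constant initializers are and plain `static int` fields are not. The `describe` method is a hypothetical example, not part of ResponseBuilder.

```java
final class StageSwitchSketch {
    static final int STAGE_START         = 0;
    static final int STAGE_PARSE_QUERY   = 1000;
    static final int STAGE_EXECUTE_QUERY = 2000;
    static final int STAGE_GET_FIELDS    = 3000;
    static final int STAGE_DONE          = Integer.MAX_VALUE;

    /** Hypothetical helper: dispatches on the current stage with a switch. */
    static String describe(int stage) {
        switch (stage) {
            case STAGE_START:
                return "start";
            case STAGE_PARSE_QUERY:
                return "parse query";
            case STAGE_EXECUTE_QUERY:
                return "execute query";
            case STAGE_GET_FIELDS:
                return "get fields";
            case STAGE_DONE:
                return "done";
            default:
                // A custom component may define a stage between the built-in ones.
                return "custom stage " + stage;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(STAGE_EXECUTE_QUERY));
        System.out.println(describe(2500));
    }
}
```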

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Sean Timm (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611758#action_12611758 ] 

Sean Timm commented on SOLR-303:
--------------------------------

Another option is to pass state on the number of documents and positions retrieved from each shard. I have a client layer that can do that, so it works, but it is complicated and maintaining state is messy. The vast majority of requests are first-page requests, so in practice we almost never use that feature; instead we do exactly what is implemented here and request the full document count from each shard.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605868#action_12605868 ] 

Lars Kotthoff commented on SOLR-303:
------------------------------------

Yonik, thanks for taking a look at it.

I've investigated this issue further and I believe I know what the root cause is now. The line
{code:title=o.a.s.client.solrj.impl.CommonsHttpSolrServer.java}
...
post.getParams().setContentCharset("UTF-8");
...
{code}
tells the *sender* to encode the data as UTF-8. The way the *receiver* decodes the data depends on whatever is set as charset in the Content-Type header. This header is currently added automatically by httpclient and, as you can see in the netcat log, is "application/x-www-form-urlencoded", i.e. without a charset. The default charset is ISO-8859-1 (cf. [http://hc.apache.org/httpclient-3.x/charencodings.html]). So the data is *encoded* as UTF-8 but *decoded* as ISO-8859-1, which causes the effect I described earlier.
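The mismatch can be reproduced without Solr or httpclient. The following is only a sketch using java.net.URLEncoder and URLDecoder to stand in for the sender and the receiver; CharsetMismatchSketch and roundTrip are hypothetical names for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class CharsetMismatchSketch {
    /** Form-encodes a value with one charset and decodes it with another. */
    static String roundTrip(String value, String encodeCharset, String decodeCharset) {
        try {
            String wire = URLEncoder.encode(value, encodeCharset);
            return URLDecoder.decode(wire, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset", e);
        }
    }

    public static void main(String[] args) {
        String q = "h\u00E9llo";
        // Matching charsets give back the original value.
        System.out.println(q.equals(roundTrip(q, "UTF-8", "UTF-8")));
        // Encoded as UTF-8 but decoded as ISO-8859-1: the two UTF-8 bytes of
        // U+00E9 (0xC3 0xA9) come back as two separate characters.
        System.out.println(q.equals(roundTrip(q, "UTF-8", "ISO-8859-1")));
    }
}
```

The second round trip returns a six-character string in which the accented letter has been replaced by U+00C3 followed by U+00A9.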

I tried to reproduce this with TestDistributedSearch myself, but for some reason it seems to be fine. Perhaps the Jetty configuration is different to my Tomcat configuration. I didn't find any parameter to tell Tomcat the default encoding if the Content-Type header doesn't specify one though.

The minimal change I had to make to get it working was to add a line that sets the Content-Type header explicitly, i.e.
{code:title=o.a.s.client.solrj.impl.CommonsHttpSolrServer.java}
...
post.getParams().setContentCharset("UTF-8");
post.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
...
{code}
This probably won't work with multi-part requests though. I'm not sure what the right way to handle this would be. The stub Content-Type header is set by httpclient when the method is executed, i.e. there's no way to let httpclient figure out the first part and then append the charset in CommonsHttpSolrServer.

Some other things I've noticed:
* Just before the content charset is set, the parameters of the POST request are populated. If the value for a parameter is null, the code attempts to add a null parameter. This, however, will cause an IllegalArgumentException from httpclient (cf. [http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/methods/PostMethod.html#addParameter(java.lang.String, java.lang.String)]).
* TestDistributedSearch does not exercise the code to refine facet counts. Adding another facet request with facet.limit=1 redresses this.
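One possible guard for the null-parameter case is sketched below. It is not the change made in any of the attached patches: NullParamGuardSketch and withoutNullValues are hypothetical names, and dropping null values (rather than sending them as empty strings) is an assumption about the desired behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NullParamGuardSketch {
    /** Returns a copy of the parameters with null-valued entries dropped, preserving order. */
    static Map<String, String> withoutNullValues(Map<String, String> params) {
        Map<String, String> safe = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getValue() != null) {
                safe.put(e.getKey(), e.getValue());
            }
        }
        return safe;
    }
}
```

The filtered map could then be handed to whatever code adds the parameters to the POST request.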

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Sabyasachi Dalal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549970 ] 

Sabyasachi Dalal commented on SOLR-303:
---------------------------------------

I fixed the issue with the patch and it works with version 594268.
Now I am trying to make it work with the latest trunk, and I am facing a problem. The FedSearchComponent needs a handle to the "handler" in order to execute on the local shard. I am trying to figure out how to pass the handler during component initialization.

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605650#action_12605650 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

When I give the following request:

http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:8984/solr&q=woof

With no server running on 8984 I get an error 500 (naturally).

But shouldn't there be an option to skip over servers that aren't responding or that time out? I'm envisioning a scenario in which this is used to search across possibly redundant uniqueIDs, where a server being down is not cause for an exception.
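As a rough illustration of what such an option could do, the sketch below queries every shard and records failures instead of aborting. It is not Solr code: TolerantFanOutSketch, Result and queryAll are hypothetical names, and the per-shard query is passed in as a plain function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class TolerantFanOutSketch {
    /** Outcome of querying every shard: collected responses plus the shards that failed. */
    static final class Result<R> {
        final List<R> responses = new ArrayList<>();
        final List<String> failedShards = new ArrayList<>();
    }

    /** Queries each shard in turn, skipping shards whose query throws. */
    static <R> Result<R> queryAll(List<String> shards, Function<String, R> queryShard) {
        Result<R> result = new Result<>();
        for (String shard : shards) {
            try {
                result.responses.add(queryShard.apply(shard));
            } catch (RuntimeException e) {
                // Record the failure and keep going instead of failing the whole request.
                result.failedShards.add(shard);
            }
        }
        return result;
    }
}
```

Reporting the failed shards back to the caller would let a client tell a complete result from a partial one.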




[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543931 ] 

Hoss Man commented on SOLR-303:
-------------------------------

Note: there has been discussion recently about the terminology distinction between "federated search" and "distributed search" (which ken recently updated on the wiki) ... this issue is tracking "distributed search" and not "federated search" correct?

if so, the issue summary should be updated

http://wiki.apache.org/solr/FederatedSearch
http://wiki.apache.org/solr/DistributedSearch




[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

patrick o'leary updated SOLR-303:
---------------------------------

    Attachment: distributed_trunk.patch

This might help: I merged the distributed & federated patches with trunk last night and fixed the rejects. It appears to work.
The only things not included are the distributed searcher unit tests from the previous patch. Only the deltas were in the patch, so I had no way to rebuild them.

Hope this helps
P

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605699#action_12605699 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Lars: I'm not yet able to reproduce an issue with SolrJ not encoding the parameters properly.

The following code finds the sample solr document:
{code}
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("echoParams","all");
    params.set("q","+h\u00E9llo");
    QueryRequest req = new QueryRequest(params);
    req.setMethod(SolrRequest.METHOD.POST);
    System.out.println(server.request(req));
{code}

And netcat confirms the encoding looks good, and is in fact using POST
{code}
$ nc -l -p 8983
POST /solr/select HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
Host: localhost:8983
Content-Length: 53
Content-Type: application/x-www-form-urlencoded

echoParams=all&q=%2Bh%C3%A9llo&wt=javabin&version=2.2
{code}

I'll see if I can reproduce anything with TestDistributedSearch

[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534731 ] 

Stu Hood commented on SOLR-303:
-------------------------------

{quote}
yeah, may be I have missed those scenarios. If you have the fix, pl feel free to update the patch.
{quote}
Unfortunately, my fix was more of a workaround: I allow any field that is not the unique key to be added multiple times. But the local shard always returns all the fields of the document, so if the local shard is queried directly, some fields are duplicated. And so I don't query the local shard directly =/

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Dima Brodsky (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559259#action_12559259 ] 

Dima Brodsky commented on SOLR-303:
-----------------------------------

Hey,

Quick question from a solr newbie.  I'd love to be able to play/test out the distributed functionality of this patch.  Are there some user level instructions as to how to configure and run?

Thanks!
ttyl
Dima



[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611376#action_12611376 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

Understood. Can I suggest a third alternative?

Two new params, named &d.rows and &d.start, with the implication that these get sent unchanged to each of the shards. You may get back up to N*d.rows documents, where N is the number of shards. That leaves the paging management up to the client.

Our use case is millions of documents across many shards, and we often do queries that are "get all documents of type X." There may be 5m type X documents. Doing a &rows=5000000 is unpredictable, so we've previously looped, incrementing start by 1000 and getting 1000 rows each time. But with this distributed setup, each successive batch query takes slightly longer, and by the time we've gotten to the 5,001,000 batch, queries are timing out and breaking anyway.
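A back-of-the-envelope sketch of why the batches slow down, assuming the coordinator asks every shard for its top (start + rows) documents, whereas the proposed pass-through would ask each shard for rows only. DeepPagingCostSketch and its method names are made up for this example, and the shard count of 4 below is just an assumed figure.

```java
final class DeepPagingCostSketch {
    /** Documents requested across all shards when every shard must return its top (start + rows). */
    static long mergedPagingDocs(int shards, long start, long rows) {
        return shards * (start + rows);
    }

    /** Documents requested across all shards when start and rows are passed through unchanged. */
    static long passThroughDocs(int shards, long rows) {
        return shards * rows;
    }
}
```

With 4 shards, start=5000000 and rows=1000, the first method gives 4 * 5,001,000 = 20,004,000 documents for a single page, while the pass-through variant gives 4,000.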





[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554519 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

{quote}We should extract out a few simple things and commit them quickly to make this go more smoothly:

   1. move SearchHandler to o.a.s.handler.component - I vote you go ahead and commit that change.
   2. Create a separate issue for adding SolrDocument to XMLWriter
   3. Move solrj into the main source tree. I'm not sure the best way to do this, but I don't think solrj should sit in its own source folder if the core depends on it.
{quote}

Definitely agree on #1 and #2.
For #3, are there SolrJ parts (or future parts) that we wouldn't want automatically bundled with Solr?

{quote}Is there a good reason to use the same handler for distributed search?{quote}

It seems like a single search component should be able to handle distributed search.
If that's the case, what separates a handler that is distributed and one that isn't?
The first thing that occurred to me was to just detect the presence of shards[] after the prepare phase.
There is a side benefit in that a component can control whether a request is distributed or not (all solrconfig could be the same for systems in a cluster, with some sort of external system controlling topology). 

One could have a distributed handler that could delegate or handle non-distributed requests, but it seems to amount to the same thing (a single handler that can do both on the fly).

Saving an if() doesn't seem too compelling (the current code could certainly be refactored to be cleaner anyway).  Are there other benefits to having a separate DistributedSearchHandler though?
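To make the "single handler that can do both" idea concrete, here is a minimal sketch of the check, assuming the shards parameter is a comma-separated list as in the request URLs used elsewhere in this issue. ShardDetectionSketch, parseShards and handle are hypothetical names, and request parameters are modelled as a plain Map rather than Solr's own classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ShardDetectionSketch {
    /** Splits the comma-separated shards parameter; an empty list means a local-only request. */
    static List<String> parseShards(Map<String, String> params) {
        String shards = params.get("shards");
        if (shards == null || shards.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(shards.trim().split("\\s*,\\s*"));
    }

    /** The single if(): the same handler serves both local and distributed requests. */
    static String handle(Map<String, String> params) {
        List<String> shards = parseShards(params);
        return shards.isEmpty() ? "local" : "distributed:" + shards.size();
    }
}
```

Because the decision is made per request, a component that runs earlier could add or remove the shards parameter and thereby control whether the request is distributed.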



[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548044 ] 

Ryan McKinley commented on SOLR-303:
------------------------------------

Are you suggesting changing the main control loop from:
{code}
      for( SearchComponent c : components ) {
        c.process( req, rsp );
      }
{code}

to something that knows "stages"? 

Or are you discussing something that would happen within a single 'c.process( req, rsp );'?

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614602#action_12614602 ] 

Lars Kotthoff commented on SOLR-303:
------------------------------------

Which version of Jetty are you using? The org.mortbay.http.HttpRequest.maxFormContentSize system property seems to be specific to Jetty 5 -- I didn't find any information on how to set the limit with Jetty 6 (or indeed if it exists at all).

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607106#action_12607106 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

Adding &debugQuery=true to a query with shards that returns 0 results throws an NPE:

{code}
INFO: webapp=/solr path=/select params={shards=localhost:8983/solr,localhost:8984/solr,localhost:8985/solr,localhost:8986/solr&debugQuery=true&q=i_tag:894&rows=100} status=500 QTime=8 
Jun 22, 2008 12:45:38 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
	at org.apache.solr.handler.component.DebugComponent.finishStage(DebugComponent.java:133)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:257)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
	at org.mortbay.jetty.Server.handle(Server.java:285)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
	at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

{code}

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554524 ] 

Ryan McKinley commented on SOLR-303:
------------------------------------


{quote}
\\
For #3, are there SolrJ parts (or future parts) that we wouldn't want automatically bundled with Solr?
\\
\\
{quote}

I don't think so. The thing I want to make sure is still possible is that solrj can be distributed independently (without the lucene dependencies).

The existing artifact topology makes sense as is: common, solrj, core.  

Currently we have:
{code}
+ common
  + solrj
  + core
{code}
We need:
{code}
+ common
  + solrj
    + core
{code}
or
{code}
+ common & solrj  
  + core
{code}

This issue is essentially independent of SOLR-303, but we should try to make our source directory structures consistent with standard practice. 

{quote}
\\
Saving an if() doesn't seem too compelling (the current code could certainly be refactored to be cleaner anyway). Are there other benefits to having a separate DistributedSearchHandler though?
\\
\\
{quote}
If there is a good reason to keep it the *same* handler then that is reason enough.

I just looked at it (without really grokking how it works) and it seemed a bit bloated with distribution lifecycle stuff. As long as the non-distributed request cycle isn't tied to the distributed stuff, I'm sure it is fine.

-----

BTW, where does the term "shard" come from?  What specifically does it refer to?

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated SOLR-303:
--------------------------------

    Attachment: fedsearch.patch

Updated to do the following:
1. The federated search query is now executed via separate components:
- GlobalCollectionStatComponent (optional)
- FirstQPhaseComponent
- SecondQPhaseComponent (optional)
The user can use the 'skip' request param to specify which components to skip.

2. Sub-searcher requests are executed in parallel threads using a thread pool.

3. Works against trunk revision 574785.
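
As an illustration of point 2, here is a minimal sketch of fanning sub-searcher requests out over a thread pool. It is not the code from the patch: the class name, the fetch function (a stand-in for the HTTP call to a shard) and the pool size are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelShardSketch {
    // Sends one request per shard on a fixed-size pool and collects the
    // responses in shard order. `fetch` is a hypothetical stand-in for the
    // HTTP request to a sub-searcher.
    public static List<String> queryAll(List<String> shards,
                                        Function<String, String> fetch,
                                        int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String shard : shards) {
                tasks.add(() -> fetch.apply(shard));
            }
            List<String> responses = new ArrayList<>();
            // invokeAll blocks until every task has completed and returns
            // the futures in the same order as the submitted tasks.
            for (Future<String> f : pool.invokeAll(tasks)) {
                // get() rethrows a failed shard request as ExecutionException.
                responses.add(f.get());
            }
            return responses;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> shards = new ArrayList<>();
        shards.add("host1:port1");
        shards.add("host2:port2");
        System.out.println(queryAll(shards, s -> "response from " + s, 2));
    }
}
```

A real implementation would also need per-request timeouts, which the issue description lists as a TODO.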

I am working on further refactoring the code and making it work with SOLR-281, which should make the code really clean with pluggable components.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Sean Timm (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Timm updated SOLR-303:
---------------------------

    Comment: was deleted

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed_trunk.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607097#action_12607097 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

If the user is going to be splitting their index over N shards, it's going to be crucial to have the distributed search (optionally) return the docid->shard map in the response. Is that tricky to add as part of this issue? 


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557531#action_12557531 ] 

Hoss Man commented on SOLR-303:
-------------------------------

bq. OK, this version patches cleanly and includes some distributed faceting code.

I haven't looked at it ... but holy freaking cow that's cool.

bq. Note that it is theoretically possible to miss terms. A term could be just below the threshold of each shard (and thus not returned by any shard), but the total count could boost it in the top. This could be rectified by retrieving all terms above a specified count, but it could be expensive. The counts that are currently returned are exact.

One solution I've seen to mitigate problems like this in the past is to compute a higher "limit" when querying the individual shards. Someone somewhere suggested that n**2 is a good approach (but they may have been talking out of their ass), so if the initial request says facet.limit=5, the individual shards would be queried with facet.limit=25 ... but you'd also still want to use refinement requests.
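
A rough sketch of the over-request idea, in Java. This is not the faceting code from the patch: the class and method names are made up for illustration, per-shard facet counts are modelled as plain maps, and the refinement requests are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetMergeSketch {
    // The n**2 heuristic mentioned above: facet.limit=5 becomes 25 per shard.
    public static int shardLimit(int requestedLimit) {
        return requestedLimit * requestedLimit;
    }

    // Sums the per-shard counts and returns the top `limit` terms,
    // highest total count first.
    public static List<Map.Entry<String, Integer>> mergeTop(
            List<Map<String, Integer>> shardCounts, int limit) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> counts : shardCounts) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(totals.entrySet());
        // Sort by count descending, then by term so ties are deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        // Term "c" is only third on each shard but first overall.
        Map<String, Integer> shard1 = new HashMap<>();
        shard1.put("a", 10); shard1.put("b", 9); shard1.put("c", 8);
        Map<String, Integer> shard2 = new HashMap<>();
        shard2.put("d", 10); shard2.put("e", 9); shard2.put("c", 8);
        List<Map<String, Integer>> shards = new ArrayList<>();
        shards.add(shard1);
        shards.add(shard2);
        System.out.println(shardLimit(5));
        System.out.println(mergeTop(shards, 1));
    }
}
```

In the example data, a per-shard limit of 2 would never return "c" from either shard, which is exactly the missed-term case described in the quoted note; over-requesting makes that less likely but does not rule it out.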



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


Re: [jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by patrick o'leary <po...@aol.com>.

Is this still blocked by SOLR-281?

Stu Hood (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Stu Hood updated SOLR-303:
> --------------------------
>
>     Attachment: fedsearch.stu.patch
>
> I got the rest of the DF issues resolved: please refer to the attached and ignore my earlier comments (some of them were faulty).
>
> Here is a patch that is very similar to your last patch, but with my fixes included. If you `diff fedsearch.stu.patch fedsearch.patch` you should be able to see what I did.
>
> The final (minor) issue I've found, is that when I strip the 'start' parameter in SecondQPhaseComponent.createSecondPhaseParams, it gets stripped from the response that is returned to the user as well (although it is honored in the results).
>
> Thanks again!
>

-- 

Patrick O'Leary


You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles.
 Do you understand this? 
And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
  - Albert Einstein


[jira] Updated: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated SOLR-303:
--------------------------

    Attachment: fedsearch.stu.patch

I got the rest of the DF issues resolved: please refer to the attached and ignore my earlier comments (some of them were faulty).

Here is a patch that is very similar to your last patch, but with my fixes included. If you `diff fedsearch.stu.patch fedsearch.patch` you should be able to see what I did.

The final (minor) issue I've found is that when I strip the 'start' parameter in SecondQPhaseComponent.createSecondPhaseParams, it gets stripped from the response that is returned to the user as well (although it is honored in the results).
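
One possible way around this, sketched under the assumption that the stripping happens on a parameter object shared with the user-facing response (I have not confirmed that this is the cause): build the sub-request parameters from a copy. The class and method names are hypothetical, and a plain map stands in for Solr's parameter classes.

```java
import java.util.HashMap;
import java.util.Map;

public class SubRequestParamsSketch {
    // Returns a copy of the user's params without the given keys. The
    // original map is left untouched so it can still be echoed back in
    // the response header.
    public static Map<String, String> without(Map<String, String> userParams,
                                              String... keys) {
        Map<String, String> copy = new HashMap<>(userParams);
        for (String key : keys) {
            copy.remove(key);
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> userParams = new HashMap<>();
        userParams.put("q", "i_tag:894");
        userParams.put("start", "10");
        Map<String, String> secondPhase = without(userParams, "start");
        System.out.println(userParams.containsKey("start"));
        System.out.println(secondPhase.containsKey("start"));
    }
}
```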

Thanks again!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch
>
>


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528593 ] 

Stu Hood commented on SOLR-303:
-------------------------------

I've been working with the most recent version of the patch some more, and have run into some more issues. Since I'm sure that you have been working on the patch on your own, I don't want you to have to dig through my changes as a diff. Instead I'll just try and point them out for your revision.

We have a few fields that are indexed as strings that contain characters like '@' and ':'. There are still a few places having to do with the 'df' parameter where these need to be escaped/worked around, but here is what I've found so far:
* During the iteration over the document's uniqFields in SecondQPhaseComponent.createSecondPhaseParams
** Surrounded the value in "quotes"
* During the iteration over strTerms in MultiSearchRequestHandler.buildQuery
** Modified the split on '@' to only split on the last '@' in the string.
** Modified the split on ':' to split into a maximum of 2 pieces.
* During the iteration over extractedTerms in GlobalCollectionStatComponent.calcuateGlobalCollectionStat
** Modified the split on ':' to split into a maximum of 2 pieces.
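
To illustrate the two split changes above, here is a minimal sketch. The patch's actual term encoding isn't shown in this thread, so the "field:text@suffix" layout, the TermEntryParser class and its field names are all assumptions made for the example.

```java
// Hypothetical helper, not taken from the patch: assumes each encoded term
// looks like "field:text@suffix", where text may itself contain '@' or ':'.
final class TermEntryParser {

    /** The three pieces of one encoded term. */
    static final class TermEntry {
        final String field;
        final String text;
        final String suffix;

        TermEntry(String field, String text, String suffix) {
            this.field = field;
            this.text = text;
            this.suffix = suffix;
        }
    }

    static TermEntry parse(String encoded) {
        // Split only on the LAST '@', so any '@' inside the term text survives.
        int at = encoded.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("no '@' in: " + encoded);
        }
        String term = encoded.substring(0, at);
        String suffix = encoded.substring(at + 1);

        // Split on ':' into at most 2 pieces, so any ':' inside the text survives.
        String[] parts = term.split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("no ':' in: " + encoded);
        }
        return new TermEntry(parts[0], parts[1], suffix);
    }
}
```

With this, a value such as "email:joe@example.com:8080@17" keeps "joe@example.com:8080" intact as the term text instead of being cut at the first '@' or the second ':'.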


I also ran into some problems in other areas:
* XMLResponseParser.parse(url, params) fails to parse a response if it is indented using the 'indent=on' parameter, which gets passed through to the subqueries
** Stripped out 'indent' during the iteration over the params (but there is probably a better solution to this issue)
* SecondQPhaseComponent.createSecondPhaseParams passes the 'start' parameter through to the subqueries, which leads to a null pointer when we are querying for specific unique ids.
** Stripped out 'start' during the iteration over the params
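
Both workarounds above amount to dropping a parameter while copying the request params for a subquery. A rough sketch of that idea follows, using a plain Map rather than Solr's own parameter classes; SubQueryParams and without(...) are hypothetical names, not part of the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: copies request parameters for a shard subquery,
// leaving out the names that should not be passed through.
final class SubQueryParams {

    static Map<String, String[]> without(Map<String, String[]> original, String... dropped) {
        Set<String> skip = new HashSet<>(Arrays.asList(dropped));
        // LinkedHashMap keeps the original parameter order in the copy.
        Map<String, String[]> copy = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : original.entrySet()) {
            if (!skip.contains(entry.getKey())) {
                copy.put(entry.getKey(), entry.getValue());
            }
        }
        return copy;
    }
}
```

Calling without(params, "indent", "start") returns a copy without those two parameters and leaves the caller's map untouched.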


I'll keep looking for the last few 'df' issues. Thanks a lot for the patch!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593574#action_12593574 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

I just committed shards_qt.patch

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

Now patch attached... this one implements count tiebreaking by index order (to match the non-distributed faceting).
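
As a sketch of what count tiebreaking by index order could look like when sorting merged facet counts: the class below is illustrative only and is not the code in distributed.patch. It also assumes that index order can be approximated by String.compareTo on the term text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: sorts merged facet counts by descending count and
// breaks ties by term order (assumed here to match index order).
final class FacetCountSort {

    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> mergedCounts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(mergedCounts.entrySet());
        entries.sort((a, b) -> {
            // Higher counts first.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            if (byCount != 0) {
                return byCount;
            }
            // Equal counts: fall back to term order so the result is deterministic.
            return a.getKey().compareTo(b.getKey());
        });
        return entries;
    }
}
```

Without the tiebreak, terms with equal counts could come back in a different order than a single non-distributed index would return them.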

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Brian Whitman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614410#action_12614410 ] 

Brian Whitman commented on SOLR-303:
------------------------------------

My ids are 32-character MD5s, and the break happens around 23000 rows. The maxFormContentSize doesn't seem to make any difference whether I set it or not-- with it set at 0, -1, 10000000 or not set at all I can query &rows=22300 but not &rows=22400. Obviously this is an edge case but I'm posting this here for the next person who runs into this... but since I can work around it I'll stop messing with it.



> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554534 ] 

Otis Gospodnetic commented on SOLR-303:
---------------------------------------

A shard is what you call a small(er) index that is a part of a large(r) cluster of indices.  These smaller shards together form one large logical index.

See http://www.scribd.com/doc/312186/THE-GOOGLE-CLUSTER-ARCHITECTURE

I wish Nutch used the same (shard) nomenclature instead of using "segments", so there is no confusion with Lucene index segments.... but that's another issue.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545154 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

It doesn't seem like there is any request handler set up that references the distributed search components.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
------------------------------

    Attachment: distributed.patch

This update adds parallel requests.
 - a singleton communications thread pool (executor) is added... currently static, but it should be *per core* and have a way of shutting down.
 - a singleton HttpClient for use by all SolrServer instances, currently static, probably fine to remain so (unless there needs to be core specific config?)
 - an exception causes everything to be aborted
 - all requests in a phase are sent out in parallel
 - a completion service is used for grabbing completed requests, so the first requests back can start being processed.
 - while receiving responses, if any new requests are put on the outgoing queue, they are immediately sent out before waiting for any further responses.
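
For readers unfamiliar with the pattern, here is a minimal sketch of sending all requests of a phase in parallel and handling them in completion order with java.util.concurrent. It is not the code in the patch: ParallelShardRequests and runAll are hypothetical names, each Callable stands in for one shard request, and the re-sending of newly queued requests is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch: one Callable per shard request, all submitted up front.
final class ParallelShardRequests {

    static <T> List<T> runAll(ExecutorService pool, List<Callable<T>> requests)
            throws InterruptedException, ExecutionException {
        CompletionService<T> completion = new ExecutorCompletionService<>(pool);
        List<Future<T>> inFlight = new ArrayList<>();
        // Send every request in the phase out in parallel.
        for (Callable<T> request : requests) {
            inFlight.add(completion.submit(request));
        }

        List<T> responses = new ArrayList<>();
        boolean succeeded = false;
        try {
            // take() hands back whichever request finishes first, so early
            // responses can be processed while slower shards are still working.
            for (int i = 0; i < inFlight.size(); i++) {
                responses.add(completion.take().get());
            }
            succeeded = true;
            return responses;
        } finally {
            if (!succeeded) {
                // One failure aborts everything that is still outstanding.
                for (Future<T> future : inFlight) {
                    future.cancel(true);
                }
            }
        }
    }
}
```

The returned list is in completion order, not submission order, so a caller that needs to know which shard produced a response has to carry that information inside the response type.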


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Updated: (SOLR-303) Distributed Search over HTTP

Posted by "patrick o'leary (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

patrick o'leary updated SOLR-303:
---------------------------------

    Attachment:     (was: distributed_trunk.patch)

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605653#action_12605653 ] 

Otis Gospodnetic commented on SOLR-303:
---------------------------------------

Ah, yes, I agree with Brian.  I did see this, too, but forgot to report it as a problem that needs a fix.


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551514 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Bear with me... I'm working on this from a bit of a different angle.
- multiple stages, defined by components themselves, and a stage doesn't end until an outgoing request queue is empty.
- making components responsible for turning on/off their own options in the query phases, rather than having the distributed search component have to know all the different options.
- using SolrJ/HttpClient for communication
- organizational: moved SearchHandler into the component package, along with distributed search stuff.  It's all related and allows us to keep things private that should be kept private.

I understand the original author is no longer involved with this issue, so I'm basing things on his code in some places, but not others.  Hopefully I'll have something 


> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527637 ] 

Stu Hood commented on SOLR-303:
-------------------------------

For the second issue above, I did the following:

* Added 'static String escape(string, field, schema)' to QueryParsing, which uses SolrQueryParser's escape method. I run this across all key values as they are iterated at the beginning of 'SecondQPhaseComponent.createSecondPhaseParams'.
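
For anyone who wants to see the idea without the patch, here is a standalone sketch of backslash-escaping query syntax characters in a key value. It is not the method described above: KeyEscaper is a hypothetical name, it ignores the field and schema arguments, and the character list is an assumption modelled on the usual Lucene query syntax.

```java
// Hypothetical standalone version: escapes characters that have special
// meaning in Lucene-style query syntax by prefixing them with a backslash.
final class KeyEscaper {

    // Assumed set of special characters; adjust to the parser actually in use.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String value) {
        StringBuilder escaped = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }
}
```

Note that '@' passes through unchanged while ':' gets a backslash. Whitespace is not handled here, so keys containing spaces would still need quoting.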

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>

[jira] Commented: (SOLR-303) Distributed Search over HTTP

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605670#action_12605670 ] 

Yonik Seeley commented on SOLR-303:
-----------------------------------

Lars: I committed your fix to the facet.limit value sent to shards, and instead of changing ntop when facet.limit<=0, I simply short-circuited checking if refinement is needed at all.

Next up: investigate this URL encoding (or lack of it) in the POST body.

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>             Fix For: 1.3
>
>         Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528260 ] 

Stu Hood commented on SOLR-303:
-------------------------------

Yea, that is a bit of a problem isn't it...

It looks like if you subclassed SolrIndexSearcher and DocList, you could generate fake Lucene document ids that map back to actual unique keys. Unfortunately, SolrIndexSearcher is intimidatingly long, so depending on how people feel about adding to the writers, it might not be necessary to modify it.
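
To make the fake-id idea a bit more concrete, here is a tiny sketch of just the mapping part. It does not touch SolrIndexSearcher or DocList; SyntheticDocIds is a hypothetical class that hands out sequential ints for unique keys and maps them back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mapping between unique keys and made-up int document ids.
// The ids are only meaningful within the lifetime of one instance.
final class SyntheticDocIds {

    private final Map<String, Integer> idByKey = new HashMap<>();
    private final List<String> keyById = new ArrayList<>();

    /** Returns the fake id for a key, assigning the next free one if needed. */
    int idFor(String uniqueKey) {
        Integer existing = idByKey.get(uniqueKey);
        if (existing != null) {
            return existing;
        }
        int id = keyById.size();
        idByKey.put(uniqueKey, id);
        keyById.add(uniqueKey);
        return id;
    }

    /** Maps a fake id back to the unique key it was assigned for. */
    String keyFor(int id) {
        return keyById.get(id);
    }
}
```

A merged result list could then carry these ints where a writer expects Lucene document ids, with keyFor used to recover the real key when the document is written out.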

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> Does the federated search query side. Update not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search and merging of results are totally behind the scene in Solr in request handler . Response format remains the same after merging of results.
> The response from individual shard is deserialized into SolrQueryResponse object. The collection of SolrQueryResponse objects are merged to produce a single SolrQueryResponse object. This enables to use the Response writers as it is; or with minimal change.
> - Efficient query processing with highlighting and fields getting generated only for merged documents. The query is executed in 2 phases. First phase gets the doc unique keys with sort criteria. Second phase brings all requested fields and highlighting information. This saves lot of CPU in case there are good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: user can specify to execute query in just 1 phase itself. (For some queries when highlighting info is not required and number of fields requested are small; this can be more efficient.)
> - Ability to easily overwrite the default Federated capability by appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works on Http transport. So individual shard's VIP can be queried. Load-balancing and Fail-over taken care by VIP as usual.
> -Sub-searcher response parsing as a plugin interface. Different implementation could be written based on JSON, xml SAX etc. Current one based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers, (referred as "shards" going forward). It extends the RequestHandlerBase. handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs; and execute the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameter. Otherwise search is performed as usual on the local index. eg. shards=local,host1:port1,host2:port2 will search on the local index and 2 remote indexes. The search response from all 3 shards are merged and serviced back to the client. 
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built and its terms are extracted. Global numDocs and docFreqs are calculated by querying all the shards and adding up the numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. Not all document fields are requested; only the document uniqFields and sort fields are. MoreLikeThis and highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on the "sort", "start" and "rows" params. The merged doc uniqFields and sort fields are collected. Other information such as facet and debug is also merged.
> STEP 4: (SecondQueryPhase) The merged doc uniqFields and sort fields are grouped by shard. Each shard in the grouping is queried for its merged doc uniqFields (from FirstQueryPhase), along with highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields, highlighting and moreLikeThis info from SecondQueryPhase are merged into the FirstQueryPhase response.
> TODO:
> - Support sort fields other than the default score
> - Support ResponseDocs in writers other than XMLWriter
> - HTTP connection timeouts
> OPEN ISSUES:
> - Merging of facets by "top n terms of field f"
> Scope for performance optimization:
> - Search shards in parallel threads
> - HTTP connection Keep-Alive?
> - Cache global numDocs and docFreqs
> - Cache Query objects in handlers?
> Would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.
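
The merge-related steps quoted above (summing per-shard docFreqs in STEP 1, merging first-phase hits by score with "start" and "rows" in STEP 3, and grouping the merged keys by shard in STEP 4) can be illustrated with a minimal sketch. This is not code from the patch: ShardMergeSketch, Hit and all method names are hypothetical, only the default score sort is handled, and ties are broken by unique key purely to keep the output deterministic. It also assumes each shard has already returned its top start+rows hits, which is what makes paging over the merged list correct.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the patch.
final class ShardMergeSketch {
    private ShardMergeSketch() {}

    /** One first-phase hit: unique key, score, and the shard that returned it. */
    static final class Hit {
        final String id;
        final float score;
        final String shard;

        Hit(String id, float score, String shard) {
            this.id = id;
            this.score = score;
            this.shard = shard;
        }
    }

    /** STEP 1: add up the doc frequency of each term across all shards. */
    static Map<String, Integer> sumDocFreqs(List<Map<String, Integer>> perShard) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> shardFreqs : perShard) {
            for (Map.Entry<String, Integer> e : shardFreqs.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }

    /** STEP 3: merge per-shard hits by descending score, then apply start/rows. */
    static List<Hit> merge(List<List<Hit>> perShard, int start, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first; equal scores fall back to the unique key.
        all.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : a.id.compareTo(b.id);
        });
        int from = Math.min(Math.max(start, 0), all.size());
        int to = Math.min(from + Math.max(rows, 0), all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    /** STEP 4: group the merged unique keys by the shard that owns them. */
    static Map<String, List<String>> groupByShard(List<Hit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (Hit h : merged) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.id);
        }
        return byShard;
    }
}
```

For example, if shard "local" returns d1 (0.9) and d2 (0.4) and shard "host1" returns d3 (0.7) and d4 (0.1), then merge with start=1 and rows=2 yields d3 followed by d2, and groupByShard sends d3 to host1 and d2 to local for the second phase.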


[jira] Commented: (SOLR-303) Federated Search over HTTP

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536064 ] 

Stu Hood commented on SOLR-303:
-------------------------------

I'm still working on wrapping my head around the fedsearch phases, but I noticed the following stacktrace showing up in the logs every now and then:
{noformat}
SEVERE: java.lang.NullPointerException
        at org.apache.solr.handler.federated.component.GlobalCollectionStatComponent.prepare(GlobalCollectionStatComponent.java:81)
        at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:116)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:807)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)
{noformat}

... that is probably caused by the following statements around line 81 in GlobalCollectionStatComponent.prepare. We only enter the if statement if terms is null, and then we dereference it...
{code}
    String terms = req.getParams().get(ResponseBuilder.DOCFREQS);
    if (numDocs != null && terms == null) {
      // the build query has to be over-written to take into
      //account global numDocs and docFreqs

      //extract the numDocs and docFreqs from request params
      Map<Term, Integer> dfMap = new HashMap<Term, Integer>();
      String[] strTerms = terms.split(",");
{code}
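
A minimal sketch of a guard that would avoid this dereference, assuming the condition was meant to be terms != null rather than terms == null. The patch itself does not confirm that reading. DocFreqParams and its methods are hypothetical names, the "field:term:freq" comma-separated format is an assumption made for illustration, and plain String keys stand in for Lucene's Term so the example has no dependencies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parameter handling in prepare(); not the patch's code.
final class DocFreqParams {
    private DocFreqParams() {}

    /** Global stats are usable only when BOTH request parameters are present. */
    static boolean hasGlobalStats(String numDocs, String terms) {
        return numDocs != null && terms != null;
    }

    /**
     * Parses an assumed "field:term:freq,field:term:freq" value.
     * A null or blank value yields an empty map instead of a NullPointerException.
     */
    static Map<String, Integer> parse(String terms) {
        Map<String, Integer> dfMap = new HashMap<>();
        if (terms == null || terms.trim().isEmpty()) {
            return dfMap;
        }
        for (String entry : terms.split(",")) {
            // The frequency is taken to follow the last ':' in the entry.
            int sep = entry.lastIndexOf(':');
            if (sep <= 0 || sep == entry.length() - 1) {
                continue; // malformed entry: no key or no frequency
            }
            try {
                int freq = Integer.parseInt(entry.substring(sep + 1).trim());
                dfMap.put(entry.substring(0, sep).trim(), freq);
            } catch (NumberFormatException e) {
                // Skip entries whose frequency is not a number.
            }
        }
        return dfMap;
    }
}
```

With a guard like this, a request that carries numDocs but no docFreqs parameter would fall through to the ordinary local query path instead of failing inside prepare.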

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
