
Posted to solr-user@lucene.apache.org by Jeff Wartes <jw...@whitepages.com> on 2018/03/27 23:00:20 UTC

Routing a subquery directly to the shard a document came from

I have a large Solr 7.2 index with nested documents and many shards.
For each result (parent doc) in a query, I want to gather a relevance-ranked subset of its child documents. The subquery transformer seemed ideal: https://lucene.apache.org/solr/guide/7_2/transforming-result-documents.html#TransformingResultDocuments-_subquery_
(The [child] transformer allows for a filter, but its results come back in an effectively random order.)

So maybe something like this:
    q=<something>
    fl=id,subquery:[subquery]
    subquery.q=<something>
    subquery.fq={!cache=false} +{!terms f=_root_ v=$row.id}

This actually works fine, but there’s a lot more work going on than necessary. Say we have X shards and get N documents back:

Query HTTP requests = 1 top-level query + X distributed shard requests
Subquery HTTP requests = N rows + N * X distributed shard requests
So with N=10 results and X=50 shards, that is 1 + 50 + 10 + 500 = 561 HTTP requests through the cluster.

Some of that is unavoidable, of course, but it occurs to me that all the child docs are indexed in the same shard (in fact, the same index block) as the parent doc. That means that if you know the parent doc ID (and I do), you can use the document routing to know exactly which shard to send the subquery request to. Each subquery would then fan out to 1 shard instead of 50, which would save 490 of the HTTP requests in the scenario above.
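
To make the arithmetic above concrete, here is a small sketch of the request counts. The function name and the "routed" flag are made up for illustration; it only models the counting described in this message, not anything Solr actually exposes.

```python
def http_requests(n_rows, n_shards, routed=False):
    """Count HTTP requests for one query plus one subquery per result row.

    Hypothetical model: `routed=True` assumes each subquery could be sent
    to the single shard that holds the parent document.
    """
    # 1 top-level request + one distributed request per shard
    top_level = 1 + n_shards
    # each row triggers 1 subquery request, which fans out to 1 or X shards
    fanout = 1 if routed else n_shards
    subquery = n_rows + n_rows * fanout
    return top_level + subquery


unrouted = http_requests(10, 50)             # 1 + 50 + 10 + 500
routed = http_requests(10, 50, routed=True)  # 1 + 50 + 10 + 10
print(unrouted, routed, unrouted - routed)
```

With N=10 and X=50 this gives 561 unrouted and 71 routed, a difference of 490.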

Is there any form of query that allows for explicitly following the document routing rules for a given document ID?

I’m aware of the “distrib=false” and “shards=foo” parameters, but using those would require me to recreate the document routing in the client.
There’s also the “fl=[shard]” transformer, but that would still require me to issue the subqueries from the client.

Re: Routing a subquery directly to the shard a document came from

Posted by Jeff Wartes <jw...@whitepages.com>.

This gets really close:

    q=<something>
    fl=id,subquery:[subquery],[shard]
    subquery.q=<something>
    subquery.fq={!cache=false} +{!terms f=_root_ v=$row.id}
    subquery.shards=$row.[shard]

The issue here is that parameter dereferencing like "$row.id" only happens inside local params, and those are only handled by a query parser. The "shards=" param isn't a query, so it never gets parsed that way, and I have no way to dereference "$row.[shard]".
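
For comparison, the client-side fallback would look roughly like the sketch below: take the [shard] value returned with each parent row and pass it back as "shards=" on a separate child query. This is only a sketch under assumptions. The helper name, the example host and core name, and the example query are all made up, and it assumes the [shard] value can be handed back unchanged as the "shards" parameter.

```python
from urllib.parse import urlencode


def build_subquery_params(row, child_query, rows=10):
    """Build params for one parent row's child query (hypothetical helper)."""
    return {
        "q": child_query,
        # same filter as subquery.fq above, with the parent id filled in
        # by the client instead of via $row.id
        "fq": "{!terms f=_root_ cache=false}" + row["id"],
        # assumption: the [shard] value is usable as-is for shards=
        "shards": row["[shard]"],
        "rows": rows,
    }


# Made-up parent row, shaped like a result from fl=id,[shard]
parent = {
    "id": "parent-1",
    "[shard]": "http://host1:8983/solr/coll_shard3_replica_n1/",
}
params = build_subquery_params(parent, "name:smith")
print(urlencode(params))
```

That still means one extra client round trip per parent row, which is exactly what I was hoping the subquery transformer would avoid.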

