Posted to solr-user@lucene.apache.org by James Nolan <no...@gmail.com> on 2012/06/22 20:04:26 UTC

Editing search documents during "process" stage of custom extension during distributed search

Hello everyone,

I have a set of Solr instances: 15 cores spread across three servers.  They
all share common config files, schemas and extensions.  I have been trying
to speed up de-duplication of multi-value queries by matching returned
results against their term-vector positions.  As I understand it, during a
sharded search the query is called twice for each core: once to get the
total number of rows requested, and then once again to get the specific
documents in each core.

Currently, I extend TermVectorComponent and have it run a set intersection
during the second "process" stage of this query.  I pass the list of valid
document IDs along with the rest of the response and clean out the document
list from the query client side based on the results of my set intersection
in the process stage.
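
To make the client-side cleanup concrete, here is a minimal sketch of that
filtering step.  It assumes document IDs are plain strings; the class and
method names (ValidDocFilter, keepValid) are made up for illustration and
are not part of Solr or SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Solr class: drops every returned document ID
// that is not in the set of valid IDs produced by the set intersection.
final class ValidDocFilter {
    private ValidDocFilter() {}

    // Returns the IDs from returnedIds that also appear in validIds,
    // keeping the original ranking order of the search results.
    static List<String> keepValid(List<String> returnedIds, Set<String> validIds) {
        List<String> kept = new ArrayList<>();
        for (String id : returnedIds) {
            if (validIds.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

Iterating over the returned list (rather than over the valid set) is what
preserves the ranking order of the surviving documents.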

So, for example, I want 100 documents returned.  I run the initial
distributed query, do the set intersection and end up with 10 valid results
at the client.  At this point, I run a new request based on the 10% density
to try to get my 100 valid documents.  Currently, I re-send the query with
rows = 1.1*(1/.1)*(rows needed), which usually means requesting 1100
documents on my second request.
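
The row estimate for that second request can be sketched as below.  This is
only an illustration of the arithmetic: the class and method names are
hypothetical, and the 1.1 safety factor is written as an integer percentage
(110) so the result is not affected by floating-point rounding.

```java
// Hypothetical helper, not a Solr class: estimates how many rows to request
// on the second pass, given the density of valid results in the first pass.
final class RequeryEstimate {
    private RequeryEstimate() {}

    // rowsNeeded:    valid documents the client wants (e.g. 100)
    // rowsRequested: rows asked for in the first pass (e.g. 100)
    // validRows:     rows that survived the set intersection (e.g. 10)
    // safetyPercent: padding factor as a percentage (110 means 1.1x)
    static long rowsToRequest(long rowsNeeded, long rowsRequested,
                              long validRows, long safetyPercent) {
        if (validRows <= 0) {
            throw new IllegalArgumentException("no valid rows; density is undefined");
        }
        // rowsNeeded * (rowsRequested / validRows) * (safetyPercent / 100),
        // rounded up using integer arithmetic only.
        long numerator = rowsNeeded * rowsRequested * safetyPercent;
        long denominator = validRows * 100;
        return (numerator + denominator - 1) / denominator;
    }
}
```

With the numbers from the example (100 needed, 100 requested, 10 valid,
safety 110), rowsToRequest returns 1100.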

There are some problems:
1) I have to run the search twice per client-side query even though I know
I won't use 90% of the results from the first time the shard request
queries the documents.
2) I am still sending 100 documents to the client and discarding 90
of them.

The thing I tried:
Overwrite the response's DocListAndSet and only send back the valid
documents.  When I do this, it doesn't have an effect on the overall
search.  I'm guessing that I'm either not really changing the object or
it's too late to do anything with it.

Help! (If you can)
I want to edit the list of documents sent to the requestHandler from the
initial shard search (the one where each of the 15 cores retrieves 100
documents) so that the second search only uses the valid results.  Is there
any way to do that from my extended TermVectorComponent, or do I need to
extend the QueryComponent to get what I need?  Any other hints are
welcome too.

Thanks in advance!

Jim Nolan