You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Luo, Jeff" <jl...@cas.org> on 2009/09/23 22:26:18 UTC

large OR-boolean query

Hi,

We are experimenting a parallel approach to issue a large OR-Boolean
query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
solr shards.

The way we are trying is to break the large query into smaller ones,
e.g.,  
the example above can be broken into 10 small queries: keywords:(1 OR 2
OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
etc

Now each shard will get 10 requests and the master will merge the
results coming back from each shard, similar to the regular distributed
search. 

There are a few issues with this approach, though:

Issue #1: sorting: 
ShardFieldSortedHitQueue.lessThan() checks if two documents are from the
same shard; if yes, it will just compare the orderInShard. However, in
our approach, two documents from the same shard could be results of two
different small queries, in that case, we can't just compare the
orderInShard. There is a simple solution to this issue: just change the
if statement to :
if (docA.shard == docB.shard && docA.sortFieldValues ==
docB.sortFieldValues)

Can someone make this change for Solr1.4 so that we don't have to
customize it? It does NOT impact the normal case.

Issue #2: number of matching documents found:
Since multiple responses from the same shard could contain duplicated
documents, the sum of number of documents found from each response will
be greater than the real number of documents found. 

Unless we ask each shard to return all the matching records for every
small query and then check duplicates, we can't get the accurate number
of documents found.

Issue #3: faceting counts
Due to the same limitation as in issue #2, the faceting count is always
greater than the correct one.

I don't have any idea how this issue can be resolved.

Re: large OR-boolean query

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Wed, Sep 23, 2009 at 4:26 PM, Luo, Jeff <jl...@cas.org> wrote:
> We are experimenting a parallel approach to issue a large OR-Boolean
> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
> solr shards.
>
> The way we are trying is to break the large query into smaller ones,
> e.g.,
> the example above can be broken into 10 small queries: keywords:(1 OR 2
> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
> etc
>
> Now each shard will get 10 requests and the master will merge the
> results coming back from each shard, similar to the regular distributed
> search.

You're going to end up with a lot of custom code I think.
Where's the bottleneck... searching or faceting?

If faceting is the bottleneck, making an implementation that utilized
multiple threads would be one of the best ways.
If searching, you could develop a custom query type (QParserPlugin)
that handled your type of queries and split them across multiple
threads.

-Yonik
http://www.lucidimagination.com

Re: large OR-boolean query

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:

> Hi,
>
> We are experimenting a parallel approach to issue a large OR-Boolean
> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
> solr shards.
>
> The way we are trying is to break the large query into smaller ones,
> e.g.,
> the example above can be broken into 10 small queries: keywords:(1  
> OR 2
> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
> etc
>
> Now each shard will get 10 requests and the master will merge the
> results coming back from each shard, similar to the regular  
> distributed
> search.


Can you tell us a little bit more about the why/what of this?  Are you  
really searching numbers or are those just for example?  Do you care  
about the score or do you just need to know whether the result is  
there or not?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Thu, Sep 24, 2009 at 6:56 PM, Luo, Jeff <jl...@cas.org> wrote:

> I think the searching is the bottle neck. Solr/Lucene is slow when the
> maxBooleanClauses is bigger enough.
>
>
Your comment reminded me of this post:

http://invertedindex.blogspot.com/2009/07/making-booleanqueries-faster.html

-- 
Regards,
Shalin Shekhar Mangar.

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Thu, Sep 24, 2009 at 9:26 AM, Luo, Jeff <jl...@cas.org> wrote:
> I think the searching is the bottle neck. Solr/Lucene is slow when the
> maxBooleanClauses is bigger enough.

OK, the I'd go with the custom query.  You can reduce the message size
and get gains in query parsing speed too:

{!parallel_or}1,2,3,4,5,6,7,8,9,...2048

Sort all of the terms first before creating the lower level
disjunctions - this can speed up the term seeking (for example seeking
to 1000, then 1001, then 1002 in the same thread will be faster than
1000, 2000, 1.

If you don't need scoring, then it can be made even faster by using bitsets.

-Yonik
http://www.lucidimagination.com

RE: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by "Luo, Jeff" <jl...@cas.org>.

I think the searching is the bottle neck. Solr/Lucene is slow when the
maxBooleanClauses is bigger enough. 

In my previous example, I should say the large query is broken into 100
smaller ones.

Since we still want facet counts with this large query, is there any way
one can accurately aggregate the facet counts coming back from multiple
threads as you suggested?

Thanks a a lot for your reply,

Jeff

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik
Seeley
Sent: Wednesday, September 23, 2009 4:39 PM
To: solr-dev@lucene.apache.org
Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query

On Wed, Sep 23, 2009 at 4:26 PM, Luo, Jeff <jl...@cas.org> wrote:
> We are experimenting a parallel approach to issue a large OR-Boolean
> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
> solr shards.
>
> The way we are trying is to break the large query into smaller ones,
> e.g.,
> the example above can be broken into 10 small queries: keywords:(1 OR
2
> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
> etc
>
> Now each shard will get 10 requests and the master will merge the
> results coming back from each shard, similar to the regular
distributed
> search.

You're going to end up with a lot of custom code I think.
Where's the bottleneck... searching or faceting?

If faceting is the bottleneck, making an implementation that utilized
multiple threads would be one of the best ways.
If searching, you could develop a custom query type (QParserPlugin)
that handled your type of queries and split them across multiple
threads.

-Yonik
http://www.lucidimagination.com

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by Chris Hostetter <ho...@fucit.org>.

: for reasons forthcoming.  The QParser then  just returns a
: ConstantScoreQuery wrapped around a Filter subclass that I wrote which uses
: these Terms.  The Filter subclass does most of the "work".

If there is a lot of overlap in the Terms that are used from query to 
query, you might find it more efficient to construct individual 
TermFilters for each term, and utilize the filterCache to reuse them from 
request to request -- then your plugin (it would probably have to be 
a SearchComponent instead of a QParser) would only need to find the union 
of the individual DocSets

: Correct me if I'm wrong, but it seemed important to have my input terms in
: natural order of a TreeSet in order to take advantage of the seek() approach
: to TermDocs (presuming it is sort of like a database cursor?).

(I believe) You are correct .. seek can only move ahead.

: In any event, we're getting rougly 2-3 second query times, with an
: additional 1-2 seconds parsing input from the request.  so our local client
: app sees about a 6-8 second roundtrip on it's queries, with faceting turned
: on. For such a large query: not bad!

Unless the individual terms tend to be extermely unique, or your are 
opening a new search extremeley frequently, i would suggest you try the 
filterCache and DocSet union based approach....

   DocSet main = BitDocSet();
   for (Term t : myTerms) {
     main = searcher.getDocSet(new TermQuery(t)).union(main);
   }

-Hoss

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by just_adam <ad...@gmail.com>.

Hi - Thought I'd follow up with this thread and share what I did with a bit
more specifics.  

Background

For our use case, scoring the results is not as important.  We mainly want
to just answer whether the document is in the user's input list of
identifiers (large: 1M tokens), and then rely on faceting to help the user
navigate their results.  

We're using a 64bit Linux host with 26GB of RAM.  Our index is ~2.5GB, with
roughly 20 million docs.  Each doc has a unique identifier that we will
allow for this large scale querying by our users.

In addition to this Nabble thread, I also cite a cross reference to
lucidimanination's forum about using Filters on large queries:

http://www.lucidimagination.com/search/document/ea4de429e62aa05f/efficient_filtering_advise#5d103a8412eaa34
http://www.lucidimagination.com/search/document/ea4de429e62aa05f/efficient_filtering_advise#5d103a8412eaa34 

Approach

So, for our implementation, I use a QParserPlugin which basically parses the
input into a list of Terms.  These terms are stored in a java.util.TreeSet
for reasons forthcoming.  The QParser then  just returns a
ConstantScoreQuery wrapped around a Filter subclass that I wrote which uses
these Terms.  The Filter subclass does most of the "work".

When processing the query, this Filter overrides getDocIdList(IndexReader)
method, where it obtains a TermDocs instance on the IndexReader, and employs
the seek() method while iterating over the TreeSet of input Terms.  It uses
an OpenBitSet to hold results.  

Correct me if I'm wrong, but it seemed important to have my input terms in
natural order of a TreeSet in order to take advantage of the seek() approach
to TermDocs (presuming it is sort of like a database cursor?).

I've tried breaking up the processing of getDocIdList() into multiple Java
threads, where each thread is handed a subset of the input Terms and a
separate OpenBitSet that the main thread manages at the end of processing
with the union() call.  Each thread obtains its own TermDocs instance on the
Reader and iterates over its subset of the input terms and stores in the
component bitset.  

The difference between using a single thread and multiple (10) threads does
not seem to matter much in my tests.  Perhaps there's some blocking at the
IndexReader level that I do not understand which forces things to go
serially no matter what.

In any event, we're getting rougly 2-3 second query times, with an
additional 1-2 seconds parsing input from the request.  so our local client
app sees about a 6-8 second roundtrip on it's queries, with faceting turned
on. For such a large query: not bad!

Anyway, I'm posting this in hopes that someone will have any input pro/cons
to share, and some code snippets below should you be interested.  Thanks,

/**
 *  MyQueryFilter.getDocIdSet()
*/

public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  long start = System.currentTimeMillis();
  OpenBitSet[] results = new OpenBitSet[SUBQUERY];
  Thread[] threads = new Thread[SUBQUERY];
  for (int i = 0; i < SUBQUERY; ++i) {
    OpenBitSet result = new OpenBitSet(reader.maxDoc());
    results[i] = result;
    SubQ s = new SubQ(subqueries[i], reader, result);
    Thread t =  new Thread(s);
    threads[i] = t;
    t.start();
  }
  try {
    for (int i = 0; i < SUBQUERY; ++i) {
      threads[i].join();
    }    
  } catch (InterruptedException ie) {
    log.error(this.getClass().toString() + " Thread Error ", ie);
  }
  OpenBitSet result = results[0];
  log.info(this.getClass().toString() 
      + ".getDocIDSet() terms: " + this.all_terms.size()
      + " miliseconds: " + (System.currentTimeMillis() - start)
      + " threads: " + SUBQUERY);
  for (int i = 1; i < SUBQUERY; ++ i) {
    result.union(results[i]);
  }
  return result; 

}

/** 
 * Here's my SubQ class
 */

private static class SubQ implements Runnable {
  TreeSet subterms;
  IndexReader my_reader;
  OpenBitSet my_results;
  private SubQ() {}
  protected SubQ(TreeSet terms, IndexReader reader, OpenBitSet bitset) {
    this();
    this.my_results = bitset;
    this.subterms = terms;
    this.my_reader = reader;
  }
  public void run() {
   TermDocs td = null;
   try {
    td = my_reader.termDocs();
      for (Iterator iter = this.subterms.iterator(); iter.hasNext();) {
        Term term = (Term) iter.next();
        td.seek(term);
        while (td.next()) {
          my_results.set(td.doc());
        }
      }
    } catch (IOException ioe) {
      log.error(this.getClass().toString() + " TermDoc seek Error ", ioe);
    } finally {
      if (td != null) try {
        td.close();
      } catch (IOException ioe) { 
        log.error(this.getClass().toString() + " TermDoc close Error ",
ioe);
      }
    }
  }
}

PS to Jeff - Thanks for the hints

just_adam wrote:
> 
> I'm attempting a very similar large OR boolean query here and wish to
> clarify what behavior that the custom query needs to be overriding.  
> 
> Do you mean to have the custom query actually run a bunch of subqueries as
> Java Threads?  Where in the message traffic between the Searcher
> IndexReader and Query would this type of bifurcation logic take place?
> 
> Thank you,
> Adam
> 
> 
> Yonik Seeley-2 wrote:
>> 
>> 
>> A custom query type that does this big parallel OR, and a
>> QParserPlugin that creates that query.  No changes to any search
>> components should be needed.
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> 
> 
> 

-- 
View this message in context: http://old.nabble.com/large-OR-boolean-query-tp25584607p27315341.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by just_adam <ad...@gmail.com>.

I'm attempting a very similar large OR boolean query here and wish to clarify
what behavior that the custom query needs to be overriding.  

Do you mean to have the custom query actually run a bunch of subqueries as
Java Threads?  Where in the message traffic between the Searcher IndexReader
and Query would this type of bifurcation logic take place?

Thank you,
Adam

Yonik Seeley-2 wrote:
> 
> 
> A custom query type that does this big parallel OR, and a
> QParserPlugin that creates that query.  No changes to any search
> components should be needed.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 

-- 
View this message in context: http://old.nabble.com/large-OR-boolean-query-tp25584607p27026976.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Sep 25, 2009 at 10:52 AM, Luo, Jeff <jl...@cas.org> wrote:
> Based on what is suggested by two other responses to my question, I
> think it is possible that the master can pass the original large query
> to each shard, and each shard will split the large query into 100 lower
> level disjunctive lucene queries, fire them against its Lucene index in
> a parallel way and merge the results. Then each shard shall only return
> 1(instead of 100) result set to the master with disjunctive faceting
> counts. It seems that the faceting problem can be solved in this way. I
> would appreciate it if you could let me know if this approach is
> feasible and correct; what solr plug-ins are needed(my guess is a custom
> parser and query-component?)

A custom query type that does this big parallel OR, and a
QParserPlugin that creates that query.  No changes to any search
components should be needed.

-Yonik
http://www.lucidimagination.com

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by Walter Underwood <wu...@wunderwood.org>.

This would work a lot better if you did the join at index time. For  
each paper, add a field with all the related drug names (or whatever  
you want to search for), then search on that field.

With the current design, it will never be fast and never scale. Each  
lookup has a cost, so expanding a query to a thousand terms will  
always be slow. Distributing the query to multiple shards will only  
make a bad design slightly faster.

This is fundamental to search index design. The schema is flat, fully- 
denormalized, no joins. You tag each document with the terms that you  
will use to find it. Then you search for those terms directly.

wunder

On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote:

> We are searching strings, not numbers. The reason we are doing this  
> kind
> of query is that we have two big indexes, say, a collection of  
> medicine
> drugs and a collection of research papers. I first run a query against
> the drugs index and get 102400 unique drug names back. Then I need to
> find all the research papers where one or more of the 102400 drug  
> names
> are mentioned, hence the large OR query. This is a kind of JOIN query
> between 2 indexes, which an article in the lucid web site comparing
> databases and search engines briefly touched.
>
> I was able to issue 100 parallel small queries against solr shards and
> get the results back successfully (even sorted). My custom code is  
> less
> than 100 lines, mostly in my SearchHandler.handleRequestBody. But I  
> have
> problem summing up the correct facet counts because the faceting  
> counts
> from each shard are not disjunctive.
>
> Based on what is suggested by two other responses to my question, I
> think it is possible that the master can pass the original large query
> to each shard, and each shard will split the large query into 100  
> lower
> level disjunctive lucene queries, fire them against its Lucene index  
> in
> a parallel way and merge the results. Then each shard shall only  
> return
> 1(instead of 100) result set to the master with disjunctive faceting
> counts. It seems that the faceting problem can be solved in this  
> way. I
> would appreciate it if you could let me know if this approach is
> feasible and correct; what solr plug-ins are needed(my guess is a  
> custom
> parser and query-component?)
>
> Thanks,
>
> Jeff
>
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Thursday, September 24, 2009 10:01 AM
> To: solr-dev@lucene.apache.org
> Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query
>
>
> On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:
>
>> Hi,
>>
>> We are experimenting a parallel approach to issue a large OR-Boolean
>> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
>> solr shards.
>>
>> The way we are trying is to break the large query into smaller ones,
>> e.g.,
>> the example above can be broken into 10 small queries: keywords:(1
>> OR 2
>> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
>> etc
>>
>> Now each shard will get 10 requests and the master will merge the
>> results coming back from each shard, similar to the regular
>> distributed
>> search.
>
>
> Can you tell us a little bit more about the why/what of this?  Are you
> really searching numbers or are those just for example?  Do you care
> about the score or do you just need to know whether the result is
> there or not?
>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>

RE: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by "Luo, Jeff" <jl...@cas.org>.

So the QParserPlugin can create multiple threads with each one of them
holding a query? 

I am not familiar with how Solr uses lucene to do the search. 

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik
Seeley
Sent: Friday, September 25, 2009 11:09 AM
To: solr-dev@lucene.apache.org
Subject: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean
query

On Fri, Sep 25, 2009 at 10:52 AM, Luo, Jeff <jl...@cas.org> wrote:
> Based on what is suggested by two other responses to my question, I
> think it is possible that the master can pass the original large query
> to each shard, and each shard will split the large query into 100
lower
> level disjunctive lucene queries, fire them against its Lucene index
in
> a parallel way and merge the results. Then each shard shall only
return
> 1(instead of 100) result set to the master with disjunctive faceting
> counts. It seems that the faceting problem can be solved in this way.
I
> would appreciate it if you could let me know if this approach is
> feasible and correct; what solr plug-ins are needed(my guess is a
custom
> parser and query-component?)

A custom query type that does this big parallel OR, and a
QParserPlugin that creates that query.  No changes to any search
components should be needed.

-Yonik
http://www.lucidimagination.com

Re: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by Lance Norskog <go...@gmail.com>.

It sounds like a fully denormalized schema is:
unique id is a compound key of the drug and the technical paper, with
many search attributes per drug.

If your document has a drug & a paper as a compound unique ID, how
many documents total do you have? A Lucene index works quite well with
hundreds of millions of documents.

What is the average number of search attributes per drug? The index
allocates space only to attributes that are used. If there are 5000
searchable attributes, and each drug has only a 100 unique attributes,
the inverted list of fields contains 5000 attribute entries with an
average of 100 document ids. (This comes from my vague understanding
of space requirements.)

If you experiment with this you may find that the space and
performance characteristics are actually OK.

On Mon, Sep 28, 2009 at 6:42 AM, Luo, Jeff <jl...@cas.org> wrote:
> We DO have a multi-value field Drug-Name on the paper index.
> Unfortunately, we can NOT combine these two distinct indexes into one
> BIG index, as you suggested, to fully de-normalize them. A drug has TOO
> MANY properties which one can query against and a paper might contain
> TOO MANY drugs.
>
> In a trivial case, the user knows the terms(drug names), so they can
> search the documents very fast using those terms. In a non-trivial case,
> the user does NOT know the drug names, so they search by drug
> properties, and the number of drug names resulted from their property
> search could be in the range of thousands, hence my question.
>
> If you have any good design ideas on this requirement, please let me
> know.
>
> Thanks for your time,
>
> Jeff
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Friday, September 25, 2009 11:02 AM
> To: solr-dev@lucene.apache.org
> Subject: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean
> query
>
> This would work a lot better if you did the join at index time. For
> each paper, add a field with all the related drug names (or whatever
> you want to search for), then search on that field.
>
> With the current design, it will never be fast and never scale. Each
> lookup has a cost, so expanding a query to a thousand terms will
> always be slow. Distributing the query to multiple shards will only
> make a bad design slightly faster.
>
> This is fundamental to search index design. The schema is flat, fully-
> denormalized, no joins. You tag each document with the terms that you
> will use to find it. Then you search for those terms directly.
>
> wunder
>
> On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote:
>
>> We are searching strings, not numbers. The reason we are doing this
>> kind
>> of query is that we have two big indexes, say, a collection of
>> medicine
>> drugs and a collection of research papers. I first run a query against
>> the drugs index and get 102400 unique drug names back. Then I need to
>> find all the research papers where one or more of the 102400 drug
>> names
>> are mentioned, hence the large OR query. This is a kind of JOIN query
>> between 2 indexes, which an article in the lucid web site comparing
>> databases and search engines briefly touched.
>>
>> I was able to issue 100 parallel small queries against solr shards and
>> get the results back successfully (even sorted). My custom code is
>> less
>> than 100 lines, mostly in my SearchHandler.handleRequestBody. But I
>> have
>> problem summing up the correct facet counts because the faceting
>> counts
>> from each shard are not disjunctive.
>>
>> Based on what is suggested by two other responses to my question, I
>> think it is possible that the master can pass the original large query
>> to each shard, and each shard will split the large query into 100
>> lower
>> level disjunctive lucene queries, fire them against its Lucene index
>> in
>> a parallel way and merge the results. Then each shard shall only
>> return
>> 1(instead of 100) result set to the master with disjunctive faceting
>> counts. It seems that the faceting problem can be solved in this
>> way. I
>> would appreciate it if you could let me know if this approach is
>> feasible and correct; what solr plug-ins are needed(my guess is a
>> custom
>> parser and query-component?)
>>
>> Thanks,
>>
>> Jeff
>>
>>
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Thursday, September 24, 2009 10:01 AM
>> To: solr-dev@lucene.apache.org
>> Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query
>>
>>
>> On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:
>>
>>> Hi,
>>>
>>> We are experimenting a parallel approach to issue a large OR-Boolean
>>> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
>>> solr shards.
>>>
>>> The way we are trying is to break the large query into smaller ones,
>>> e.g.,
>>> the example above can be broken into 10 small queries: keywords:(1
>>> OR 2
>>> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
>>> etc
>>>
>>> Now each shard will get 10 requests and the master will merge the
>>> results coming back from each shard, similar to the regular
>>> distributed
>>> search.
>>
>>
>> Can you tell us a little bit more about the why/what of this?  Are you
>> really searching numbers or are those just for example?  Do you care
>> about the score or do you just need to know whether the result is
>> there or not?
>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
>



-- 
Lance Norskog
goksron@gmail.com

RE: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by "Luo, Jeff" <jl...@cas.org>.

We DO have a multi-value field Drug-Name on the paper index.
Unfortunately, we can NOT combine these two distinct indexes into one
BIG index, as you suggested, to fully de-normalize them. A drug has TOO
MANY properties which one can query against and a paper might contain
TOO MANY drugs. 

In a trivial case, the user knows the terms(drug names), so they can
search the documents very fast using those terms. In a non-trivial case,
the user does NOT know the drug names, so they search by drug
properties, and the number of drug names resulted from their property
search could be in the range of thousands, hence my question.

If you have any good design ideas on this requirement, please let me
know.

Thanks for your time,

Jeff 

-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Friday, September 25, 2009 11:02 AM
To: solr-dev@lucene.apache.org
Subject: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean
query

This would work a lot better if you did the join at index time. For  
each paper, add a field with all the related drug names (or whatever  
you want to search for), then search on that field.

With the current design, it will never be fast and never scale. Each  
lookup has a cost, so expanding a query to a thousand terms will  
always be slow. Distributing the query to multiple shards will only  
make a bad design slightly faster.

This is fundamental to search index design. The schema is flat, fully- 
denormalized, no joins. You tag each document with the terms that you  
will use to find it. Then you search for those terms directly.

wunder

On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote:

> We are searching strings, not numbers. The reason we are doing this  
> kind
> of query is that we have two big indexes, say, a collection of  
> medicine
> drugs and a collection of research papers. I first run a query against
> the drugs index and get 102400 unique drug names back. Then I need to
> find all the research papers where one or more of the 102400 drug  
> names
> are mentioned, hence the large OR query. This is a kind of JOIN query
> between 2 indexes, which an article in the lucid web site comparing
> databases and search engines briefly touched.
>
> I was able to issue 100 parallel small queries against solr shards and
> get the results back successfully (even sorted). My custom code is  
> less
> than 100 lines, mostly in my SearchHandler.handleRequestBody. But I  
> have
> problem summing up the correct facet counts because the faceting  
> counts
> from each shard are not disjunctive.
>
> Based on what is suggested by two other responses to my question, I
> think it is possible that the master can pass the original large query
> to each shard, and each shard will split the large query into 100  
> lower
> level disjunctive lucene queries, fire them against its Lucene index  
> in
> a parallel way and merge the results. Then each shard shall only  
> return
> 1(instead of 100) result set to the master with disjunctive faceting
> counts. It seems that the faceting problem can be solved in this  
> way. I
> would appreciate it if you could let me know if this approach is
> feasible and correct; what solr plug-ins are needed(my guess is a  
> custom
> parser and query-component?)
>
> Thanks,
>
> Jeff
>
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Thursday, September 24, 2009 10:01 AM
> To: solr-dev@lucene.apache.org
> Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query
>
>
> On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:
>
>> Hi,
>>
>> We are experimenting a parallel approach to issue a large OR-Boolean
>> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
>> solr shards.
>>
>> The way we are trying is to break the large query into smaller ones,
>> e.g.,
>> the example above can be broken into 10 small queries: keywords:(1
>> OR 2
>> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
>> etc
>>
>> Now each shard will get 10 requests and the master will merge the
>> results coming back from each shard, similar to the regular
>> distributed
>> search.
>
>
> Can you tell us a little bit more about the why/what of this?  Are you
> really searching numbers or are those just for example?  Do you care
> about the score or do you just need to know whether the result is
> there or not?
>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>

RE: [PMX:FAKE_SENDER] Re: large OR-boolean query

Posted by "Luo, Jeff" <jl...@cas.org>.

We are searching strings, not numbers. The reason we are doing this kind
of query is that we have two big indexes, say, a collection of medicine
drugs and a collection of research papers. I first run a query against
the drugs index and get 102400 unique drug names back. Then I need to
find all the research papers where one or more of the 102400 drug names
are mentioned, hence the large OR query. This is a kind of JOIN query
between 2 indexes, which an article in the lucid web site comparing
databases and search engines briefly touched.

I was able to issue 100 parallel small queries against solr shards and
get the results back successfully (even sorted). My custom code is less
than 100 lines, mostly in my SearchHandler.handleRequestBody. But I have
problem summing up the correct facet counts because the faceting counts
from each shard are not disjunctive.

Based on what is suggested by two other responses to my question, I
think it is possible that the master can pass the original large query
to each shard, and each shard will split the large query into 100 lower
level disjunctive lucene queries, fire them against its Lucene index in
a parallel way and merge the results. Then each shard shall only return
1(instead of 100) result set to the master with disjunctive faceting
counts. It seems that the faceting problem can be solved in this way. I
would appreciate it if you could let me know if this approach is
feasible and correct; what solr plug-ins are needed(my guess is a custom
parser and query-component?)

Thanks,

Jeff   

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Thursday, September 24, 2009 10:01 AM
To: solr-dev@lucene.apache.org
Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query

On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:

> Hi,
>
> We are experimenting a parallel approach to issue a large OR-Boolean
> query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
> solr shards.
>
> The way we are trying is to break the large query into smaller ones,
> e.g.,
> the example above can be broken into 10 small queries: keywords:(1  
> OR 2
> OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
> etc
>
> Now each shard will get 10 requests and the master will merge the
> results coming back from each shard, similar to the regular  
> distributed
> search.

Can you tell us a little bit more about the why/what of this?  Are you  
really searching numbers or are those just for example?  Do you care  
about the score or do you just need to know whether the result is  
there or not?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search