Posted to solr-dev@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2009/09/28 22:33:50 UTC

Re: large OR-boolean query

It sounds like a fully denormalized schema would be: the unique id is
a compound key of the drug and the technical paper, with many search
attributes per drug.
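
A minimal sketch of that layout, in plain Python rather than Solr. The
sample data, the field names ("id", "drug", "paper", "properties") and
the "|" separator in the compound key are all assumptions made up for
illustration, and the dict-based index only stands in for what
Lucene does internally.

```python
from collections import defaultdict

# Hypothetical sample data; names and properties are illustrative only.
drug_properties = {
    "aspirin": ["analgesic", "antiplatelet"],
    "ibuprofen": ["analgesic", "nsaid"],
}
paper_drugs = {
    "paper-1": ["aspirin"],
    "paper-2": ["aspirin", "ibuprofen"],
}


def denormalize(drug_properties, paper_drugs):
    """Yield one flat document per (drug, paper) pair."""
    for paper, drugs in paper_drugs.items():
        for drug in drugs:
            yield {
                "id": f"{drug}|{paper}",  # compound unique key (assumed format)
                "drug": drug,
                "paper": paper,
                "properties": list(drug_properties.get(drug, [])),
            }


def build_inverted_index(docs):
    """Map each property value to the set of document ids that carry it."""
    index = defaultdict(set)
    for doc in docs:
        for prop in doc["properties"]:
            index[prop].add(doc["id"])
    return index


docs = list(denormalize(drug_properties, paper_drugs))
index = build_inverted_index(docs)


def papers_with_property(prop):
    """Single-term lookup; no expansion into thousands of drug names."""
    return sorted({doc_id.split("|", 1)[1] for doc_id in index.get(prop, ())})
```

With documents shaped like this, a search by drug property hits the
paper side directly, so the intermediate list of drug names never has
to be turned into an OR query.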

If your document has a drug & a paper as a compound unique ID, how
many documents total do you have? A Lucene index works quite well with
hundreds of millions of documents.

What is the average number of search attributes per drug? The index
allocates space only for attributes that are actually used. If there
are 5000 searchable attributes and each drug has only 100 unique
attributes, the inverted index contains 5000 attribute entries, but
each document appears in only about 100 of their posting lists, not
in all 5000. (This comes from my vague understanding of space
requirements.)
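
A back-of-the-envelope version of this estimate, as a sketch only. It
assumes one posting entry per (document, attribute) pair and ignores
compression, term dictionaries and stored fields; the function name and
the one-million-document figure are made up for illustration.

```python
def estimate_postings(num_docs, attrs_per_doc, num_attrs):
    """Rough posting counts for a sparse attribute index.

    Returns (total_postings, avg_postings_per_attribute, dense_cells),
    where dense_cells is what a full docs x attributes table would need.
    """
    total_postings = num_docs * attrs_per_doc
    avg_per_attribute = total_postings / num_attrs
    dense_cells = num_docs * num_attrs
    return total_postings, avg_per_attribute, dense_cells


# Assumed example: 1,000,000 documents, 100 of 5000 attributes used per document.
total, avg, dense = estimate_postings(1_000_000, 100, 5000)
```

Under these assumptions the index holds total postings rather than
dense cells, a factor of num_attrs / attrs_per_doc smaller, and the
average posting list length grows with the number of documents.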

If you experiment with this you may find that the space and
performance characteristics are actually OK.

On Mon, Sep 28, 2009 at 6:42 AM, Luo, Jeff <jl...@cas.org> wrote:
> We DO have a multi-value field Drug-Name on the paper index.
> Unfortunately, we can NOT combine these two distinct indexes into one
> BIG index, as you suggested, to fully de-normalize them. A drug has TOO
> MANY properties which one can query against and a paper might contain
> TOO MANY drugs.
>
> In a trivial case, the user knows the terms (drug names), so they can
> search the documents very fast using those terms. In a non-trivial case,
> the user does NOT know the drug names, so they search by drug
> properties, and the number of drug names resulting from their property
> search could be in the range of thousands, hence my question.
>
> If you have any good design ideas on this requirement, please let me
> know.
>
> Thanks for your time,
>
> Jeff
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Friday, September 25, 2009 11:02 AM
> To: solr-dev@lucene.apache.org
> Subject: [PMX:FAKE_SENDER] Re: [PMX:FAKE_SENDER] Re: large OR-boolean
> query
>
> This would work a lot better if you did the join at index time. For
> each paper, add a field with all the related drug names (or whatever
> you want to search for), then search on that field.
>
> With the current design, it will never be fast and never scale. Each
> lookup has a cost, so expanding a query to a thousand terms will
> always be slow. Distributing the query to multiple shards will only
> make a bad design slightly faster.
>
> This is fundamental to search index design. The schema is flat, fully-
> denormalized, no joins. You tag each document with the terms that you
> will use to find it. Then you search for those terms directly.
>
> wunder
>
> On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote:
>
>> We are searching strings, not numbers. The reason we are doing this
>> kind of query is that we have two big indexes, say, a collection of
>> medicine drugs and a collection of research papers. I first run a
>> query against the drugs index and get 102400 unique drug names back.
>> Then I need to find all the research papers where one or more of the
>> 102400 drug names are mentioned, hence the large OR query. This is a
>> kind of JOIN query between 2 indexes, which an article in the lucid
>> web site comparing databases and search engines briefly touched.
>>
>> I was able to issue 100 parallel small queries against solr shards
>> and get the results back successfully (even sorted). My custom code
>> is less than 100 lines, mostly in my
>> SearchHandler.handleRequestBody. But I have a problem summing up the
>> correct facet counts because the facet counts from each shard are
>> not disjoint.
>>
>> Based on what is suggested by two other responses to my question, I
>> think it is possible that the master can pass the original large
>> query to each shard, and each shard will split the large query into
>> 100 lower level disjunctive lucene queries, fire them against its
>> Lucene index in a parallel way and merge the results. Then each shard
>> shall only return 1 (instead of 100) result set to the master with
>> disjoint facet counts. It seems that the faceting problem can be
>> solved in this way. I would appreciate it if you could let me know if
>> this approach is feasible and correct, and what solr plug-ins are
>> needed (my guess is a custom parser and query-component?)
>>
>> Thanks,
>>
>> Jeff
>>
>>
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Thursday, September 24, 2009 10:01 AM
>> To: solr-dev@lucene.apache.org
>> Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query
>>
>>
>> On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:
>>
>>> Hi,
>>>
>>> We are experimenting with a parallel approach to issuing a large
>>> OR-Boolean query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400),
>>> against several solr shards.
>>>
>>> The way we are trying is to break the large query into smaller
>>> ones, e.g., the example above can be broken into 100 small queries:
>>> keywords:(1 OR 2 OR 3 OR ... OR 1024),
>>> keywords:(1025 OR 1026 OR 1027 OR ... OR 2048), etc.
>>>
>>> Now each shard will get 100 requests and the master will merge the
>>> results coming back from each shard, similar to the regular
>>> distributed search.
>>
>>
>> Can you tell us a little bit more about the why/what of this?  Are you
>> really searching numbers or are those just for example?  Do you care
>> about the score or do you just need to know whether the result is
>> there or not?
>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
>



-- 
Lance Norskog
goksron@gmail.com