Posted to solr-user@lucene.apache.org by srinalluri <na...@yahoo.com> on 2012/08/21 22:21:46 UTC

Solr 3.6.1: query performance is slow when asterisk is in the query

Our environment is Solr 3.6.1. I have the following fieldType. There is a
field called 'body' of this fieldType. When I make a query: q=body:*, it is
taking longer than expected. What changes do I need to make to this
fieldType for better query performance? Some other fieldTypes in our schema
are performing better.

<fieldType name="text_general_html" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Thanks in advance
Srini



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-6-1-query-performance-is-slow-when-asterisk-is-in-the-query-tp4002496.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by david3s <da...@hotmail.com>.
Jack, sorry I forgot to answer you. We tried "[* TO *]" and the response
times are the same as with a plain "*".




Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could also add a bodySize numeric (trie) field, which you can check for 
0 for empty/missing bodies.

And don't forget to check and see whether the "[* TO *]" range query might 
be faster.

-- Jack Krupansky



Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by david3s <da...@hotmail.com>.
Chris,

This is really good stuff. I spoke without really thinking about, or
knowing, the inner workings of the index.

I was wondering whether I could use "copyField", as in my previous example:

<field name="body" type="text" />
<field name="has_body" type="boolean" />

<copyField source="body" dest="has_body"/>

But I guess I would have had to write a custom processor and define a
specific field type.

I guess a more elegant solution will be
CountFieldValuesUpdateProcessorFactory (Thanks again)

And again, thank you very much for being so responsive about this.




Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by Chris Hostetter <ho...@fucit.org>.
: Ok, I'll take your suggestion, but I would still be really happy if the
: wildcard searches behaved a little more intelligently (body:* not looking for
: everything in the body). More like when you do "q=*:*" it doesn't really
: search for everything in every field.

If you can suggest an algorithm for it, then i'll happily implement it.

*:* doesn't have to scan every term in every field because it's easy to 
get a list of all non-deleted documents and do a single pass over it.

But getting a list of all documents that have *any* term in a specific 
field means you *have* to look at the list of *all* terms in that field, 
and for each term you then have to iterate over all the docs containing 
that term -- listing all docs containing a term is easy and fairly fast 
because of the inverted index, but doing that for a large number of terms 
(ie: a "body" field containing huge amounts of arbitrary text) is where 
it gets really slow.

I've added a FAQ about this, mentioning 
CountFieldValuesUpdateProcessorFactory (which was committed just after the 
4.0-BETA, so you'll have to wait for 4.0-final, unfortunately.)

https://wiki.apache.org/solr/FAQ#How_can_I_efficently_search_for_all_documents_that_contain_a_value_in_fieldX_.3F

https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/org/apache/solr/update/processor/CountFieldValuesUpdateProcessorFactory.html




-Hoss
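
For readers on 4.0 or later, the approach described above could be wired up
roughly as follows. This is a minimal sketch, not a tested configuration: the
chain name "count-body" and the field name "body_count" are made up for
illustration, and it assumes the schema defines body_count with an integer
type (for example the "tint" type from the example schema).

```xml
<!-- solrconfig.xml (Solr 4.0+). "count-body" and "body_count" are hypothetical names. -->
<updateRequestProcessorChain name="count-body">
  <!-- copy the incoming body value(s) into a scratch field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">body</str>
    <str name="dest">body_count</str>
  </processor>
  <!-- replace the copied value(s) with the number of values -->
  <processor class="solr.CountFieldValuesUpdateProcessorFactory">
    <str name="fieldName">body_count</str>
  </processor>
  <!-- documents with no body at all get an explicit 0 -->
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">body_count</str>
    <int name="value">0</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With something like this in place, documents that have a body could be
selected with fq=body_count:[1 TO *] instead of body:*. The processor counts
values, so a document whose body is an empty string would presumably still
count as 1.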

Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by lboutros <bo...@gmail.com>.
You could add a default value to your field via the schema:

<field ... default="mynullllvalue"/>

and then your query could be:

-body:mynullllvalue

but I prefer Chris's solution, which is what I usually do.

Ludovic.







-----
Jouve
France.

Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by Erick Erickson <er...@gmail.com>.
Maybe you can spoof this by using an "fq" clause instead,
as in &fq=body:*?

The first one will be slow, but after that it'll use the filterCache.

FWIW,
Erick
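
One way to keep users from paying for that slow first request is to run the
filter as a warming query whenever a new searcher opens, so the filterCache
entry is built in the background. This is a sketch under that assumption, not
something proposed in the thread, and it means the slow scan is repeated after
every commit that opens a new searcher.

```xml
<!-- solrconfig.xml, inside the <query> section: sketch only -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- match everything, but force the body:* filter to be computed and cached -->
      <str name="q">*:*</str>
      <str name="fq">body:*</str>
      <str name="rows">0</str>
    </lst>
  </arr>
</listener>
```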


Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by david3s <da...@hotmail.com>.
Ok, I'll take your suggestion, but I would still be really happy if the
wildcard searches behaved a little more intelligently (body:* not looking for
everything in the body). More like when you do "q=*:*" it doesn't really
search for everything in every field.

Thanks




Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by Michael Della Bitta <mi...@appinions.com>.
The name of the game for performance and functionality in Solr is quite
often *denormalization*, which might run against your RDBMS instincts,
but once you embrace it, you'll find that things go a lot more
smoothly.

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game



Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by david3s <da...@hotmail.com>.
Hello Chris, thanks a lot for your reply. But is there an alternative
solution? Because I see adding "has_body" as data duplication.

Imagine if in a relational DB you had to create extra columns because
you couldn't do something like "where body is not null".

If there's no other alternative I'll have to go with your suggestion, which I
greatly appreciate.




Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by Chris Hostetter <ho...@fucit.org>.
: Our environment is Solr 3.6.1. I have the following fieldType. There is a
: field called 'body' of this fieldType. When I make a query: q=body:*, it is
: taking longer than expected. What changes do I need to make to this
: fieldType for better query performance? Some other fieldTypes in our schema
: are performing better.

this is a *really* terrible practice -- especially for large text fields, 
it requires solr to scan every possible term in that field (ie: every 
word) looking to see which documents contain that term.

if your goal is to query for docs that "have a value in the body field" 
then i highly suggest you add a boolean field called "has_body" and query 
that.  Much faster.


-Hoss
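
A minimal sketch of the "has_body" suggestion, assuming the schema already
defines a "boolean" field type (solr.BoolField, as in the example schema). The
indexed/stored settings shown are illustrative assumptions, not taken from the
original poster's schema.

```xml
<!-- schema.xml: sketch only; attribute values are illustrative -->
<field name="body"     type="text_general_html" indexed="true" stored="true"/>
<!-- set by the indexing client: true when the document has a non-empty body -->
<field name="has_body" type="boolean"           indexed="true" stored="false"/>
```

The indexing client would set has_body to true whenever it sends a non-empty
body, and queries would then use fq=has_body:true instead of body:*. That way
only the one or two terms of the boolean field are consulted, not every word
in the body.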

Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by srinalluri <na...@yahoo.com>.
Thanks Jack for your reply.

I don't have many documents which have a null field value.

I added ReversedWildcardFilterFactory only to test whether it improved
performance, but it didn't help.

What other changes can I make to the fieldType?

thanks
Srini




Re: Solr 3.6.1: query performance is slow when asterisk is in the query

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could try a "[* TO *]" range query. It will also match all documents 
which have a non-null field value.

So: q=body:[* TO *]

Actually, I see that you have reverse wildcard enabled. Try removing that. 
"*" would normally map to PrefixQuery, which is normally more efficient than 
a WildcardQuery.

-- Jack Krupansky
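
The second suggestion above amounts to dropping one line from the index-time
analyzer. A sketch of the result, based on the fieldType in the original post.
The charFilter is listed first here only because char filters are applied to
the raw text before tokenizing, and existing documents would need to be
reindexed for the change to take effect.

```xml
<!-- Index-time analyzer from the original post, minus ReversedWildcardFilterFactory -->
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```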
