Posted to solr-user@lucene.apache.org by "girish.gopal" <gi...@axonnetworks.com> on 2013/03/01 19:24:17 UTC

Email Search Slow

Hello,
I have over 40 million records/documents and I need to retrieve them using
wildcard searches on email and / or firstname and / or lastname. 
The firstname, lastname and blank search (*:*) all return results within 3
seconds. But my email search alone takes 20-25 seconds or more.
I would like to know what the general recommendations are for this field. I
have tried tokenizing (StandardTokenizer) and also the simple TextField for
this.
Thanks.



--
View this message in context: http://lucene.472066.n3.nabble.com/Email-Search-Slow-tp4044064.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Email Search Slow

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/1/2013 11:49 AM, girish.gopal wrote:
> My Specs are:
> Windows Server 2008 64 bit Dual Quad Core CPUs with 64 GB of RAM.
> I have allocated 55GB of memory to Tomcat in its config.

In addition to the advice you've gotten about wildcards, your memory 
allocation needs some tweaking.  It is highly unlikely that Solr needs 
that much RAM.  Depending on the size of your index, I would expect that 
between 4GB and 8GB would be appropriate.  I've got a system handling a 
distributed index that's about 84GB and it's running on an 8GB heap with 
no problems, and the heap could likely be made smaller.  Garbage 
collection pauses can be a major problem even with a heap that's only 
8GB, so you may also need some tuning options for your Java command line.

Lowering your java heap allocation will leave more memory for the OS to 
use for caching your index, which is what is required for good 
performance from Solr.

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
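
As a rough illustration of the point above, here is a minimal sketch of the arithmetic, using the 64 GB machine and the 55 GB and 8 GB heap sizes mentioned in this thread. The function name and the `other_processes_gb` parameter are made up for the example, and the calculation ignores JVM non-heap overhead and whatever the OS itself needs, so treat the numbers as upper bounds rather than measurements.

```python
def page_cache_budget_gb(total_ram_gb, heap_gb, other_processes_gb=0.0):
    """Rough upper bound on RAM left for the OS disk cache.

    Hypothetical helper: it ignores JVM non-heap memory and OS overhead.
    """
    remaining = total_ram_gb - heap_gb - other_processes_gb
    return max(remaining, 0.0)


# Figures taken from the thread: 64 GB of RAM, 55 GB heap vs. an 8 GB heap.
with_55gb_heap = page_cache_budget_gb(64, 55)
with_8gb_heap = page_cache_budget_gb(64, 8)

print(f"55 GB heap leaves about {with_55gb_heap} GB for the OS disk cache")
print(f" 8 GB heap leaves about {with_8gb_heap} GB for the OS disk cache")
```

Whether the smaller heap lets the whole index sit in the OS cache depends on the index size, which has not been stated in this thread.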

Below are the tuning options I use with my 8GB heap, which have pretty 
much eliminated the long GC pauses I was seeing.  None of these options 
have any relation to a specific max heap size.  They probably can use 
some additional tweaking, which I haven't found the time to do yet:

-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 
-XX:NewRatio=3 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled 
-XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts

Thanks,
Shawn


Re: Email Search Slow

Posted by Walter Underwood <wu...@wunderwood.org>.
That is a good start. Use the Analysis page in the admin UI to see what the tokenizer does.

wunder
 
On Mar 1, 2013, at 11:02 AM, girish.gopal wrote:

> Hello Wunder,
> I see your point. Will this help if I search for "giri", "giri@",
> "giri@gmail", "@gmail.com" and other combinations?
> So, if I use a StandardTokenizer, I will get the ALPHANUM tokens without the
> "@" and the ".". My tokens would then be "giri", "gmail", "com", and I
> should do a phrase search on these.
> Would this be a better approach?
> Regards,
> Giri




Re: Email Search Slow

Posted by "girish.gopal" <gi...@axonnetworks.com>.
Hello Wunder,
I see your point. Will this help if I search for "giri", "giri@",
"giri@gmail", "@gmail.com" and other combinations?
So, if I use a StandardTokenizer, I will get the ALPHANUM tokens without the
"@" and the ".". My tokens would then be "giri", "gmail", "com", and I
should do a phrase search on these.
Would this be a better approach?
Regards,
Giri




Re: Email Search Slow

Posted by Walter Underwood <wu...@wunderwood.org>.
Don't use wildcards. A leading wildcard matches against every token in the index. This is the search equivalent of a full table scan in a relational database.
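
To make the cost difference concrete, here is a toy sketch (not Lucene's actual term dictionary) that counts how many terms get examined for a leading-wildcard pattern versus a plain prefix. It assumes the terms are kept in a sorted Python list; the sample addresses and the function names are invented for the example.

```python
import bisect
import fnmatch


def wildcard_scan(sorted_terms, pattern):
    """Leading wildcard: nothing narrows the search, so every term is tested."""
    hits, checked = [], 0
    for term in sorted_terms:
        checked += 1
        if fnmatch.fnmatchcase(term, pattern):
            hits.append(term)
    return hits, checked


def prefix_lookup(sorted_terms, prefix):
    """Prefix query: binary-search to the first candidate, then walk forward
    only while terms still start with the prefix."""
    i = bisect.bisect_left(sorted_terms, prefix)
    hits, checked = [], 0
    while i < len(sorted_terms):
        checked += 1
        if not sorted_terms[i].startswith(prefix):
            break
        hits.append(sorted_terms[i])
        i += 1
    return hits, checked


# Made-up sample terms, one whole address per term.
terms = sorted([
    "alice@gmail.com", "bob@yahoo.com", "fred@yahoo.com",
    "giri@gmail.com", "zed@example.org",
])

print(wildcard_scan(terms, "*@gmail*"))  # tests all 5 terms
print(prefix_lookup(terms, "giri"))      # tests only 2 terms
```

With five terms the difference is trivial; with tens of millions of distinct addresses the scan has to touch every one of them for each new pattern.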

Instead, create a field type that tokenizes e-mail addresses into pieces, then use phrase search against that.

The address "fred@yahoo.com" might be tokenized into "fred", "@", "yahoo", "com".
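
A minimal sketch of that idea outside Solr, assuming a tokenizer that lowercases, keeps "@" as its own token and drops "." (so other punctuation such as "_" or "+" is also dropped here). The regular expression, the function names and the sample addresses are all invented for illustration. A phrase match then means the query's tokens appear at consecutive positions in the address.

```python
import re
from collections import defaultdict

# Assumed tokenization: alphanumeric runs, plus "@" as a token of its own.
TOKEN_RE = re.compile(r"[a-z0-9]+|@")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def build_index(emails):
    """Positional inverted index: token -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, email in enumerate(emails):
        for pos, tok in enumerate(tokenize(email)):
            index[tok][doc_id].append(pos)
    return index


def phrase_search(index, query):
    """Return ids of documents containing the query tokens consecutively."""
    tokens = tokenize(query)
    if not tokens or any(t not in index for t in tokens):
        return []
    # Only documents that contain every token can possibly match.
    candidates = set(index[tokens[0]])
    for t in tokens[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[tokens[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(tokens)):
                hits.append(doc_id)
                break
    return hits


emails = ["fred@yahoo.com", "giri@gmail.com", "yahoo.fan@gmail.com"]
index = build_index(emails)

print(tokenize("fred@yahoo.com"))
print(phrase_search(index, "@gmail.com"))
print(phrase_search(index, "giri@gmail"))
```

Each lookup goes straight to the postings for a handful of exact tokens instead of scanning the term list. The trade-off is that only whole tokens match: a query for "gir" finds nothing here, so partial-token matching would still need a trailing wildcard or a different analysis chain.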

wunder

On Mar 1, 2013, at 10:49 AM, girish.gopal wrote:

> Thanks Jack. The search is slow only when it is issued for the first time.
> For example, querying for *@gmail* takes 20+ seconds the first time; when I
> re-issue the same search, it returns pretty quickly (possibly reading out
> of cache).
> But when I issue a new search such as *@yahoo.*, this too takes 20+ seconds
> before returning results. Basically, I seem to have a problem whenever a new
> search is issued.
> Is this normal?
> 
> My Specs are:
> Windows Server 2008 64 bit Dual Quad Core CPUs with 64 GB of RAM.
> I have allocated 55GB of memory to Tomcat in its config.
> 
> I will check on the Heap.
> Regards,
> Giri






Re: Email Search Slow

Posted by "girish.gopal" <gi...@axonnetworks.com>.
So my changes worked. The StandardTokenizer worked fine, and removing the "*"
at the beginning of the query worked like a charm. I am at 40 million
records now and search results come back in 2-3 seconds.
The heap allocation suggestion by Shawn also did its bit, I guess.
Thanks a bunch, guys.




Re: Email Search Slow

Posted by "girish.gopal" <gi...@axonnetworks.com>.
Here is my config now:
<fieldType name="email" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My initial heap allocation is now set to 4GB, with a max of 8GB, as per
Shawn's recommendation.
Thanks Jack, Walter and Shawn for your suggestions.
I will post the results on this forum for others to refer to.
Regards,
Giri




Re: Email Search Slow

Posted by "girish.gopal" <gi...@axonnetworks.com>.
Jack,
No. It is a simple search. I cannot limit what the search will look like. As
I mentioned to Walter, a search could come in for "*@gmail.com" or "*yahoo*".
Most of the time it is the dreaded and expensive "contains" search.
Regards,
Giri




Re: Email Search Slow

Posted by Jack Krupansky <ja...@basetechnology.com>.
It sounds like you have enough raw memory. How big is the index (GB)?

Are you doing anything like ngrams that generate zillions of terms?

-- Jack Krupansky

-----Original Message----- 
From: girish.gopal
Sent: Friday, March 01, 2013 1:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Email Search Slow

Thanks Jack. The search is slow only when it is issued for the first time.
For example, querying for *@gmail* takes 20+ seconds the first time; when I
re-issue the same search, it returns pretty quickly (possibly reading out
of cache).
But when I issue a new search such as *@yahoo.*, this too takes 20+ seconds
before returning results. Basically, I seem to have a problem whenever a new
search is issued.
Is this normal?

My Specs are:
Windows Server 2008 64 bit Dual Quad Core CPUs with 64 GB of RAM.
I have allocated 55GB of memory to Tomcat in its config.

I will check on the Heap.
Regards,
Giri





Re: Email Search Slow

Posted by "girish.gopal" <gi...@axonnetworks.com>.
Thanks Jack. The search is slow only when it is issued for the first time.
For example, querying for *@gmail* takes 20+ seconds the first time; when I
re-issue the same search, it returns pretty quickly (possibly reading out
of cache).
But when I issue a new search such as *@yahoo.*, this too takes 20+ seconds
before returning results. Basically, I seem to have a problem whenever a new
search is issued.
Is this normal?

My Specs are:
Windows Server 2008 64 bit Dual Quad Core CPUs with 64 GB of RAM.
I have allocated 55GB of memory to Tomcat in its config.

I will check on the Heap.
Regards,
Giri




Re: Email Search Slow

Posted by Jack Krupansky <ja...@basetechnology.com>.
Make sure you have enough heap space for your JVM and that most, if not all, of
your index fits in OS system memory.

After you start Solr and issue a couple of queries, how much JVM heap is 
available?

-- Jack Krupansky

-----Original Message----- 
From: girish.gopal
Sent: Friday, March 01, 2013 1:24 PM
To: solr-user@lucene.apache.org
Subject: Email Search Slow

Hello,
I have over 40 million records/documents and I need to retrieve them using
wildcard searches on email and / or firstname and / or lastname.
The firstname, lastname and blank search (*:*) all return results within 3
seconds. But my email search alone takes 20-25 seconds or more.
I would like to know what the general recommendations are for this field. I
have tried tokenizing (StandardTokenizer) and also the simple TextField for
this.
Thanks.


