Posted to java-user@lucene.apache.org by Yonik Seeley <ys...@yahoo.com> on 2004/08/21 00:04:04 UTC

speeding up queries (MySQL faster)

Hi,

I'm trying to figure out how to speed up queries to a large index.
I'm currently getting 133 req/sec, which isn't bad, but isn't too close
to MySQL, which is getting 500 req/sec on the same hardware with the
same set of documents.

Setup info & Stats:
- 4.3M documents, 12 keyword fields per document, 11 unindexed fields per document.
- Lucene index size on disk = 1.3G
- Hardware: dual Opteron w/ 16GB memory, running 64-bit JVM (Sun 1.5 beta)
- Lucene version 1.4.1
- Hitting multithreaded server w/ 10 clients at once
- This is a read-only index... no updating is done
- Single IndexSearcher that is reused for all requests
 

Q1)  While hitting it with multiple queries at once, Lucene is pegged at
50% CPU usage (meaning it is only using 1 out of 2 CPUs on average).  I took
a thread dump and all of the Lucene threads except one are blocked on
reading a file (see trace below).  I could create two index readers, but
that seems like it might be a waste, and fixing a symptom instead of the
root problem.  Would multiple IndexSearchers or IndexReaders share internal
caches?  Is there a way to cache more info at a higher level such that it
would get rid of this bottleneck?  The JVM isn't taking up much space (125M
or so), and I have 16GB to work with!  The OS (Linux) is obviously caching
the index file, but that doesn't get rid of the synchronization issues and
the overhead of re-reading.
How is caching in Lucene configured?  Does it internally use FieldCache, or
do I have to use that somehow myself?

"tcpConnection-8080-72" daemon prio=1 tid=0x0000002b24412490 nid=0x34a4 waiting for monitor entry [0x0000000045aba000..0x0000000045abb2d0]
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:215)
        - waiting to lock <0x0000002ae153fa00> (a org.apache.lucene.store.FSInputStream)
        at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:176)
        at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:88)
        at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:53)
        at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
        at org.apache.lucene.search.Scorer.score(Scorer.java:37)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)


Even using only 1 CPU though, MySQL is faster. Here is what
the queries look like:

"field1:4 AND field2:188453 AND field3:1"

field1:4      done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1      done alone selects around 1K records
The whole query normally selects less than 50 records
Only the first 10 are returned (or whatever range
the client selects).

The fields are all keywords checked for exact matches (no fulltext
search is done).  Is there anything I can do to speed these queries up,
or is the structure just more suited to MySQL (and not an inverted index)?

How is a query like this carried out?

Any help would be greatly appreciated.  There's not a lot of info
on searching (much more on updating). I'm looking forward
to "Lucene in Action"!  Too bad it's not out till October.

-Yonik


		

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: speeding up queries (MySQL faster)

Posted by Yonik Seeley <ys...@yahoo.com>.

FYI, this optimization resulted in a fantastic
performance boost!  I went from 133 queries/sec to 990
queries per sec!  I'm now more limited by socket
overhead, as I get 1700 queries/sec when I stick the
clients right in the same process as the server.

Oddly enough, the performance increased, but the CPU
utilization decreased to around 55% (in both
configurations above).  I'll have to look into that
later, but any additional performance at this point is
pure gravy.

-Yonik

--- Yonik Seeley <ys...@yahoo.com> wrote:
> Doug wrote:
> > For example, Nutch automatically translates such
> > clauses into QueryFilters.
> 
> Thanks for the excellent pointer Doug!  I'll
> definitely be implementing this optimization.


Re: speeding up queries (MySQL faster)

Posted by Yonik Seeley <ys...@yahoo.com>.

> For example, Nutch automatically translates such
> clauses into QueryFilters.

Thanks for the excellent pointer Doug!  I'll
definitely be implementing this optimization.

If anyone cares, I did a 1 minute hprof test with the
search server in a servlet container.  Here are the
results (sorry about Yahoo's short line length).

-Yonik

resin.hprof.txt: Exclusive Method Times (CPU) (virtual times)
     27390  (37.5%)  java.net.PlainSocketImpl.socketAccept
     14885  (20.4%)  org.apache.lucene.index.SegmentTermDocs.skipTo
      6700   (9.2%)  org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal
      5810   (8.0%)  java.io.UnixFileSystem.list
      4785   (6.5%)  org.apache.lucene.store.InputStream.readByte
      3315   (4.5%)  java.io.RandomAccessFile.readBytes
      1302   (1.8%)  java.net.SocketOutputStream.socketWrite0
      1004   (1.4%)  java.io.RandomAccessFile.seek
       546   (0.7%)  java.lang.String.intern
       336   (0.5%)  com.caucho.vfs.WriteStream.print
       248   (0.3%)  org.apache.lucene.search.TermScorer.next
       236   (0.3%)  org.apache.lucene.queryParser.QueryParser.jj_scan_token
       232   (0.3%)  org.apache.lucene.index.SegmentTermEnum.readTerm
       228   (0.3%)  org.apache.lucene.search.ConjunctionScorer.score
       200   (0.3%)  org.apache.lucene.queryParser.FastCharStream.refill
       196   (0.3%)  org.apache.lucene.store.InputStream.readVInt
       180   (0.2%)  java.security.AccessController.doPrivileged
       172   (0.2%)  org.apache.lucene.search.ConjunctionScorer.doNext
       152   (0.2%)  java.lang.Object.clone
       152   (0.2%)  org.apache.lucene.index.SegmentReader.document
       148   (0.2%)  java.lang.Throwable.fillInStackTrace
       128   (0.2%)  org.apache.lucene.index.SegmentReader.norms
       116   (0.2%)  org.apache.lucene.store.InputStream.readString
       112   (0.2%)  java.lang.StrictMath.log
       108   (0.1%)  java.util.LinkedList.addLast
       100   (0.1%)  java.net.SocketInputStream.socketRead0
        88   (0.1%)  org.apache.lucene.search.ConjunctionScorer.next
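
The skipTo and ConjunctionScorer entries in this profile come from intersecting the sorted document lists of the AND-ed terms. As a rough illustration only (plain Java over int arrays, not Lucene's implementation; the class and method names are made up for this sketch, and the linear scan merely stands in for whatever skipping the real index does), a conjunction can be evaluated by repeatedly skipping each list forward to the current candidate document:

```java
import java.util.ArrayList;
import java.util.List;

class ConjunctionSketch {

    // Returns the first index >= from whose doc id is >= target,
    // or docs.length if there is none. A plain linear scan is used here.
    static int skipTo(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    // Intersects sorted, duplicate-free doc id lists (one list per AND clause).
    static List<Integer> intersect(int[][] postings) {
        List<Integer> hits = new ArrayList<>();
        int n = postings.length;
        if (n == 0) return hits;
        for (int[] p : postings) {
            if (p.length == 0) return hits; // an empty clause matches nothing
        }
        int[] pos = new int[n];          // current position in each list
        int candidate = postings[0][0];  // doc id we are trying to confirm
        int agreed = 0;                  // lists in a row that contain candidate
        int i = 0;
        while (true) {
            pos[i] = skipTo(postings[i], pos[i], candidate);
            if (pos[i] == postings[i].length) return hits; // one list exhausted
            int doc = postings[i][pos[i]];
            if (doc != candidate) {
                // This list skipped past the candidate: adopt its doc instead.
                candidate = doc;
                agreed = 0;
            }
            agreed++;
            if (agreed == n) {
                hits.add(candidate);
                candidate++;
                agreed = 0;
            }
            i = (i + 1) % n;
        }
    }

    public static void main(String[] args) {
        int[][] postings = {
            {1, 3, 5, 7, 9},
            {3, 4, 5, 9},
            {5, 9, 10}
        };
        System.out.println(intersect(postings)); // prints [5, 9]
    }
}
```

In this sketch every clause's list has to be advanced for each candidate, so a clause that matches most of the collection is still walked even when the final result is tiny.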




		

Re: speeding up queries (MySQL faster)

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> Setup info & Stats:
> - 4.3M documents, 12 keyword fields per document, 11
  [ ... ]
> "field1:4 AND field2:188453 AND field3:1"
> 
> field1:4      done alone selects around 4.2M records
> field2:188453 done alone selects around 1.6M records
> field3:1      done alone selects around 1K records
> The whole query normally selects less than 50 records
> Only the first 10 are returned (or whatever range
> the client selects).

The "field1:4" clause is probably dominating the cost of query 
execution.  Clauses which match large portions of the collection are 
slow to evaluate.  If there are not too many different such clauses then 
you can optimize this by re-using a Filter in place of such clauses, 
typically a QueryFilter.

For example, Nutch automatically translates such clauses into 
QueryFilters.  See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup

Note that this only converts clauses whose boost is zero.  Since filters 
do not affect ranking we can only safely convert clauses which do not 
contribute to the score, i.e., those whose boost is zero.  Scores might 
still be different in the filtered results because of 
Similarity.coord().  But, in Nutch, Similarity.coord() is overridden to 
always return 1.0, so that the replacement of clauses with filters does 
not alter the final scores at all.
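
A minimal sketch of the idea, in plain Java rather than the Lucene API: the broad clauses are materialized once as bit sets and cached, and each query then walks only the postings of its selective clause, testing every candidate against the cached bits. The class FilterCacheSketch, its in-memory postings map, and the filterBuilds counter are hypothetical and exist only for illustration; Lucene's QueryFilter likewise caches a BitSet per IndexReader, but its code is not reproduced here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FilterCacheSketch {
    private final Map<String, int[]> postings; // term -> sorted doc ids (toy index)
    private final int maxDoc;                  // number of documents in the toy index
    private final Map<String, BitSet> filterCache = new HashMap<>();
    int filterBuilds = 0;                      // how many bit sets were actually built

    FilterCacheSketch(Map<String, int[]> postings, int maxDoc) {
        this.postings = postings;
        this.maxDoc = maxDoc;
    }

    // Builds the bit set for a broad clause once, then reuses it.
    // The returned BitSet is treated as read-only by callers.
    synchronized BitSet filterFor(String term) {
        BitSet bits = filterCache.get(term);
        if (bits == null) {
            bits = new BitSet(maxDoc);
            for (int doc : postings.getOrDefault(term, new int[0])) {
                bits.set(doc);
            }
            filterCache.put(term, bits);
            filterBuilds++;
        }
        return bits;
    }

    // Walks only the selective term's postings; broad clauses are bit tests.
    List<Integer> search(String selectiveTerm, String... filterTerms) {
        BitSet[] filters = new BitSet[filterTerms.length];
        for (int i = 0; i < filterTerms.length; i++) {
            filters[i] = filterFor(filterTerms[i]);
        }
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings.getOrDefault(selectiveTerm, new int[0])) {
            boolean passes = true;
            for (BitSet f : filters) {
                if (!f.get(doc)) {
                    passes = false;
                    break;
                }
            }
            if (passes) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, int[]> index = new HashMap<>();
        index.put("field1:4", new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        index.put("field2:188453", new int[] {1, 3, 5, 7, 9});
        index.put("field3:1", new int[] {3, 4, 9});
        FilterCacheSketch s = new FilterCacheSketch(index, 10);
        System.out.println(s.search("field3:1", "field1:4", "field2:188453")); // prints [3, 9]
        s.search("field3:1", "field1:4", "field2:188453");
        System.out.println(s.filterBuilds); // prints 2: the second search reused both bit sets
    }
}
```

The sketch returns matching documents only and computes no scores, which mirrors the point above that a filter can replace a clause only when that clause does not contribute to ranking.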

Doug


Re: speeding up queries (MySQL faster)

Posted by Yonik Seeley <ys...@yahoo.com>.

Oops, CPU usage is *not* 50%, but closer to 98%.
This is due to a bug in CPU% on RHEL 3 on
multiprocessor systems (I can run multiple threads in
while(1) loops, and it will still only show 50% CPU
usage for that process).  The aggregated (not
per-process) statistics shown by top are correct, and
they show about 73% user time, 25% system time, and
anywhere between .5% and 2% idle time.

Unfortunately, this means that I won't be getting any
performance improvements from using a second
IndexSearcher, and I'm stuck at being 3 times slower
than MySQL on the same data/queries.

I guess the next step is some profiling... move the
server out of the servlet container and move the
clients in with the server, and then try some hprof
work.

Does anyone have pointers to lucene caching and how to
tune it?

-Yonik 

--- Bernhard Messer <Be...@intrafind.de>
wrote:
> Yonik,
> 
> there is another "synchronized" block in
> CSInputStream which could block 
> your second cpu out.


Re: speeding up queries (MySQL faster)

Posted by Bernhard Messer <Be...@intrafind.de>.

Yonik,

there is another "synchronized" block in CSInputStream which could block 
your second CPU out. Do you think there is a chance to recreate the 
index (maybe a smaller subset) without the compound file option enabled and 
run your test again, so that we can see if this helps?

regards
Bernhard


Re: speeding up queries (MySQL faster)

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Ah, you may be right (no stack trace in email any more).  Somebody
recently identified a few bottlenecks that, if I recall correctly, were
related to synchronized blocks.  I believe Doug committed some
improvements, but I can't remember which version of Lucene that is in. 
It's definitely in 1.4.1.

Otis


> 



Re: speeding up queries (MySQL faster)

Posted by Yonik Seeley <ys...@yahoo.com>.

--- Otis Gospodnetic <ot...@yahoo.com>
wrote:

> The bottleneck seems to be disk IO.

But it's not.  Linux is caching the whole file, and
there really isn't any disk activity at all.  Most of
the threads are blocked on InputStream.refill, not
waiting for the disk, but waiting for their turn into
the synchronized block to read from the disk (which is
why I asked about caching above that level).
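
To illustrate the kind of contention described here: if every reader has to seek and then read on one shared file handle, the two steps must happen under a lock, which serializes all threads. A positional read carries its own offset and needs no shared lock. The sketch below uses java.nio's FileChannel.read(ByteBuffer, long) to show that alternative; it is not what Lucene 1.4.1 does, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicInteger;

class PositionalReadSketch {

    // Reads len bytes starting at an absolute file offset.
    // FileChannel.read(ByteBuffer, long) leaves the channel's own position
    // untouched, so concurrent callers do not need a shared lock.
    static byte[] readAt(FileChannel ch, long offset, int len) {
        ByteBuffer buf = ByteBuffer.allocate(len);
        try {
            while (buf.hasRemaining()) {
                int n = ch.read(buf, offset + buf.position());
                if (n < 0) throw new IOException("unexpected end of file");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.array();
    }

    // Writes a small temp file, then has several threads read different
    // regions through one shared channel. Returns true if every read matched.
    static boolean demo(int threads) {
        try {
            Path p = Files.createTempFile("positional-read", ".bin");
            try {
                byte[] data = new byte[1 << 16];
                for (int i = 0; i < data.length; i++) data[i] = (byte) (i * 31);
                Files.write(p, data);
                AtomicInteger failures = new AtomicInteger();
                try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                    Thread[] workers = new Thread[threads];
                    for (int t = 0; t < threads; t++) {
                        final int id = t;
                        workers[t] = new Thread(() -> {
                            try {
                                for (int r = 0; r < 1000; r++) {
                                    int off = (id * 7919 + r * 61) % (data.length - 64);
                                    byte[] got = readAt(ch, off, 64);
                                    for (int k = 0; k < 64; k++) {
                                        if (got[k] != data[off + k]) failures.incrementAndGet();
                                    }
                                }
                            } catch (RuntimeException e) {
                                failures.incrementAndGet(); // count a failed read as a failure
                            }
                        });
                        workers[t].start();
                    }
                    for (Thread w : workers) w.join();
                }
                return failures.get() == 0;
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(4) ? "all reads matched" : "mismatch detected");
    }
}
```

Whether this actually scales across CPUs depends on the platform and JVM; the sketch only shows that the reads are correct without any application-level synchronization.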

CPU is a constant 50% on a dual CPU system (meaning
100% of 1 cpu).

-Yonik



Re: speeding up queries (MySQL faster)

Posted by Otis Gospodnetic <ot...@yahoo.com>.

The bottleneck seems to be disk IO.
Since this is a read-only index, why not spread some of the frequently
scanned index files over multiple disks, or put the index on SCSI disks
hooked up in a RAID?  Maybe this is already the case, but you didn't
mention it.

Oh, I already answered a similar question once before:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg05103.html

Otis
http://www.simpy.com/ -- Index, Search and Share your bookmarks



