Posted to solr-user@lucene.apache.org by Jon Bodner <jb...@blackboard.com> on 2009/04/24 06:25:47 UTC

Solr Performance bottleneck

Hi all,

I am trying to solve a serious performance problem with our Solr search
index.  We're running under Solr 1.3.  We've sharded our index into 4
shards.  Index data is stored on a network mount that is accessed over Fibre
Channel.  Each document's text is indexed, but not stored.  Each day,
roughly 10K - 20K new documents are added.  After a document is submitted,
it is compared, sentence by sentence, against every document we have indexed
in its category.   It's a requirement that we keep our index as up-to-date
as possible.  We reload our indexes once a minute in order to miss as few
matches as possible.  We are not expecting to find matches, so our document
cache hit rates are abysmal.  We also don't expect many repeated sentences
across documents, so our query cache hit rates are also practically zero.

After running fine for over 9 months, the system broke down this week.  The
queries per second are around 17 to 18, and our paper backlog is well north
of 14,000.  The number of papers in the index has hit 3.7 million, and each
shard is 2.3GB in size (roughly 925K papers in each index).

In order to increase throughput, we tried to stand up additional read-only
Solr instances pointed at the shared indexes, but got I/O errors from the
secondary Solr instances when the reload time came.  We tried switching the
locking mechanism from "single" to "simple", but the I/O errors continued.
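
For reference, the locking setting lives in solrconfig.xml under
<indexDefaults>; a minimal sketch of the change we tried, with the
surrounding defaults elided:

<indexDefaults>
  <!-- "single" assumes one process owns the index for its whole lifetime;
       "simple" coordinates writers through a lock file on disk -->
  <lockType>simple</lockType>
</indexDefaults>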

We're running on 64-bit Linux with a 64-bit JVM (Java 1.6.something), with
4GB of RAM assigned to each Solr instance.

Has anyone else seen a problem like this before?  Can anyone suggest any
solutions?  Will Solr 1.4 help (and is Solr 1.4 ready for production use)?

Any answers would be greatly appreciated.

Thanks,

Jon



Re: Solr Performance bottleneck

Posted by Andrey Klochkov <ak...@griddynamics.com>.
On Tue, Apr 28, 2009 at 3:18 PM, Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:

>
> Hi,
>
> You should probably just look at the index version number to figure out if
> the name changed.  If you are looking at segments.gen, you are looking at a
> file that may not exist in Lucene in the future.  Use IndexReader API
> instead.


Yeah, I use IndexReader.isCurrent() to determine if I should "refresh" Solr
after catching a data grid event. But I have to create that event listener
somehow, and here I have no other way but to hardcode this index file name.
So when some node of the cluster performs a commit, the other nodes listening
for segments.gen changes receive the event and refresh their Solr instances
by calling SolrServer.commit().
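
Roughly, the check looks like this (just a sketch: "core" comes from my own
wiring, and refreshCore() stands for the commit() call shown below):

import org.apache.lucene.index.IndexReader;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

void refreshIfStale(SolrCore core) throws Exception {
  RefCounted<SolrIndexSearcher> holder = core.getSearcher();
  try {
    IndexReader reader = holder.get().getReader();
    if (!reader.isCurrent()) {    // newer segments exist on disk
      refreshCore();              // the commit() call shown below
    }
  } finally {
    holder.decref();              // always release the searcher reference
  }
}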


> By "refreshes" do you mean "reopened a new Searcher"?  Does commit + post
> commit event not work for you?


Currently I use the following code to "refresh" cores:

new EmbeddedSolrServer(cores, coreName).commit()


>
>
> By "kicks Solr" I hope you don't mean a Solr/container restart! :)


:) No, I mean the same "refresh" code, i.e. calling SolrServer.commit().

-- 
Andrew Klochkov

Re: Solr Performance bottleneck

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

You should probably just look at the index version number to figure out if the name changed.  If you are looking at segments.gen, you are looking at a file that may not exist in Lucene in the future.  Use IndexReader API instead.
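
Concretely, something along these lines (a sketch only; "dir" stands for
whatever Directory your index lives in):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

// The version number is bumped by every commit, so polling it through
// the IndexReader API avoids depending on segments.gen by name.
boolean indexChanged(Directory dir, long lastSeenVersion) throws Exception {
  return IndexReader.getCurrentVersion(dir) != lastSeenVersion;
}

Callers would remember the last version they acted on and refresh whenever it
moves.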

By "refreshes" do you mean "reopened a new Searcher"?  Does commit + post commit event not work for you?

By "kicks Solr" I hope you don't mean a Solr/container restart! :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Andrey Klochkov <ak...@griddynamics.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, April 28, 2009 4:57:54 AM
> Subject: Re: Solr Performance bottleneck
> 
> On Mon, Apr 27, 2009 at 10:27 PM, Jon Bodner wrote:
> 
> >
> > Trying to point multiple Solrs  on multiple boxes at a single shared
> > directory is almost certainly doomed to failure; the read-only Solrs won't
> > know when the read/write Solr instance has updated the index.
> >
> 
> I'm solving the same problem while working with an index stored in a data
> grid, and I've created a data-grid listener which watches for "segments.gen"
> file changes and forces Solr to refresh its structures after receiving this
> event. You can do the same job with a file-system index: write some code
> which watches the segments.gen file for changes and kicks Solr when a change
> is detected.
> 
> It would be great to add such a mechanism to Solr itself, I mean some
> abstracted (via an interface) way to plug in sources of index-refresh
> events.
> 
> Also, there's code in SolrCore which checks for index existence by looking
> at the file system; it would be better to abstract that code too. WDYT? I
> can provide patches.
> 
> -- 
> Andrew Klochkov


Re: Solr Performance bottleneck

Posted by Andrey Klochkov <ak...@griddynamics.com>.
On Mon, Apr 27, 2009 at 10:27 PM, Jon Bodner <jb...@blackboard.com> wrote:

>
> Trying to point multiple Solrs  on multiple boxes at a single shared
> directory is almost certainly doomed to failure; the read-only Solrs won't
> know when the read/write Solr instance has updated the index.
>

I'm solving the same problem while working with an index stored in a data
grid, and I've created a data-grid listener which watches for "segments.gen"
file changes and forces Solr to refresh its structures after receiving this
event. You can do the same job with a file-system index: write some code
which watches the segments.gen file for changes and kicks Solr when a change
is detected.
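
A rough sketch of the file-system variant (Java 6 has no file-watch API, so
this just polls lastModified; "cores", "coreName" and "indexDir" are assumed
to come from your own setup):

import java.io.File;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

void watchSegmentsGen(CoreContainer cores, String coreName, File indexDir)
    throws Exception {
  File gen = new File(indexDir, "segments.gen");
  long seen = gen.lastModified();
  while (true) {
    Thread.sleep(1000);                 // poll once a second
    long now = gen.lastModified();
    if (now != seen) {                  // a commit rewrote segments.gen
      seen = now;
      // the same "refresh" trick: an empty commit makes Solr reopen
      new EmbeddedSolrServer(cores, coreName).commit();
    }
  }
}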

It would be great to add such a mechanism to Solr itself, I mean some
abstracted (via an interface) way to plug in sources of index-refresh events.

Also, there's code in SolrCore which checks for index existence by looking at
the file system; it would be better to abstract that code too. WDYT? I can
provide patches.

-- 
Andrew Klochkov

Re: Solr Performance bottleneck

Posted by Walter Underwood <wu...@netflix.com>.
This isn't a new problem; NFS was 100X slower than local disk for me
with Solr 1.1.

Backing up indexes is very tricky. You need to do it while they are
not being updated, or you'll get a corrupt copy. If your indexes
aren't large, you are probably better off backing up the source
documents and building new indexes from scratch.
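
If you do have to back up an index in place, the stock postCommit hook in
solrconfig.xml is one way to get a consistent copy: it fires once the commit
has finished, and the snapshooter script it runs takes cheap hard-link
snapshots. A sketch of the example listener Solr ships with (paths are
installation-specific):

<listener event="postCommit" class="solr.RunExecutableListener">
  <!-- snapshooter hard-links the current segment files, so the snapshot
       is consistent as of the commit and cheap to take -->
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>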

wunder

On 4/27/09 11:27 AM, "Jon Bodner" <jb...@blackboard.com> wrote:

> 
> As a follow-up note, we solved our problem by moving the indexes to local
> store and upgrading to Solr 1.4.  I did a thread dump against our 1.3 Solr
> instance and it was spending lots of time blocking on index section loading.
> The NIO implementation in 1.4 solved that problem and copying to local store
> almost certainly reduced file loading time.
> 
> Trying to point multiple Solrs  on multiple boxes at a single shared
> directory is almost certainly doomed to failure; the read-only Solrs won't
> know when the read/write Solr instance has updated the index.
> 
> We are going to try to move our indexes back to shared disk, as our backup
> solutions are all tied to the shared disk.  Also, if an individual box
> fails, we can bring up a new box and point it at the shared disk.  Are there
> any known problems with NIO and NFS that will cause this to fail?  Can
> anyone suggest a better solution?
> 
> Thanks,
> 
> Jon


Re: Solr Performance bottleneck

Posted by Jon Bodner <jb...@blackboard.com>.
As a follow-up note, we solved our problem by moving the indexes to local
store and upgrading to Solr 1.4.  I did a thread dump against our 1.3 Solr
instance and it was spending lots of time blocking on index section loading. 
The NIO implementation in 1.4 solved that problem and copying to local store
almost certainly reduced file loading time.
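
For anyone hitting the same wall: the blocking showed up in
FSDirectory$FSIndexInput.readInternal (visible in the stack trace elsewhere
in this thread), where every reading thread synchronizes on a single
RandomAccessFile. A sketch of how the Lucene 2.9 line bundled with Solr 1.4
picks the NIO variant (the path is only an example):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// FSDirectory.open returns NIOFSDirectory on most non-Windows JREs; it
// reads with positioned FileChannel calls instead of serializing every
// thread on one shared file handle.
FSDirectory dir = FSDirectory.open(new File("/local/index/shard1"));
IndexReader reader = IndexReader.open(dir, true);   // read-only reader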

Trying to point multiple Solrs  on multiple boxes at a single shared
directory is almost certainly doomed to failure; the read-only Solrs won't
know when the read/write Solr instance has updated the index.

We are going to try to move our indexes back to shared disk, as our backup
solutions are all tied to the shared disk.  Also, if an individual box
fails, we can bring up a new box and point it at the shared disk.  Are there
any known problems with NIO and NFS that will cause this to fail?  Can
anyone suggest a better solution?

Thanks,

Jon



Re: Solr Performance bottleneck

Posted by Jon Bodner <jb...@blackboard.com>.

Grant Ingersoll-6 wrote:
> 
> Hi Jon,
> 
> Can you share the stack traces for the exceptions?  Also, I can't say
> quite why, but an index of only 925K items with no stored fields
> coming to about 2.3GB seems weird to me.  How many unique terms do
> you have?
> 

Individual documents can be large, and we aren't expecting much overlap. 
From the CheckIndex report for the largest segment in one of our shards:
  1 of 7: name=_5zxe docCount=924125
    compound=false
    numFiles=9
    size (MB)=2,340.488
    has deletions [delFileName=_5zxe_2p.del]
    test: open reader.........OK [180 deleted docs]
    test: fields, norms.......OK [5 fields]
    test: terms, freq, prox...OK [7749453 terms; 376806454 terms/docs pairs; 1030915432 tokens]
    test: stored fields.......OK [3695780 total field count; avg 4 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]


Here is the stack trace from the read-only master that we stood up
temporarily:
Apr 23, 2009 6:51:43 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: Input/output error
	at java.io.RandomAccessFile.readBytes(Native Method)
	at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
	at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:550)
	at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
	at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
	at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
	at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
	at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:123)
	at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:154)
	at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
	at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:212)
	at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:678)
	at org.apache.lucene.index.MultiSegmentReader.docFreq(MultiSegmentReader.java:373)
	at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:87)
	at org.apache.lucene.search.Similarity.idf(Similarity.java:457)
	at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:44)
	at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:146)
	at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:187)
	at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:362)
	at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:187)
	at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:362)
	at org.apache.lucene.search.Query.weight(Query.java:95)
	at org.apache.lucene.search.Searcher.createWeight(Searcher.java:171)
	at org.apache.lucene.search.Searcher.search(Searcher.java:118)
	at org.apache.lucene.search.Searcher.search(Searcher.java:97)
	at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:901)
	at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:815)
	at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:700)
	at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:160)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:153)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:274)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
	at java.lang.Thread.run(Thread.java:619)



Grant Ingersoll-6 wrote:
> 
> Also, in Lucene there is a standalone program called CheckIndex; can
> you point that at your index and see what it says?
> 

CheckIndex says that all of the indexes are fine.
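
(For anyone following along: CheckIndex runs from the command line against a
shard's index directory, something like the line below. The jar name and
paths depend on your install; Solr 1.3 bundles a Lucene 2.4 core jar.)

java -ea:org.apache.lucene... -cp lucene-core-2.4.jar \
    org.apache.lucene.index.CheckIndex /path/to/shard/data/index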


Grant Ingersoll-6 wrote:
> 
> What would happen if you didn't access the index over the network and  
> instead used local disk?  Is that an option?  Or am I not  
> understanding your setup correctly?
> 

Unfortunately, that's not an option.  The indexes have to remain on the
network share.  It is over Fibre Channel, so the performance shouldn't be
too bad, and until this week it was fine.


Grant Ingersoll-6 wrote:
> 
> Finally, what changed this week?  And what is a "paper backlog" and  
> how does it factor in?
> 
Sorry, I slipped into internal terminology ;-)

What's odd is that nothing has changed in the past week.  By "paper
backlog", I just meant that there were 14K (now 19K) documents waiting to be
processed.  Normally, there are a couple hundred, maybe a couple thousand if
we receive an unexpectedly large load.

Thanks for your help,

Jon





Re: Solr Performance bottleneck

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Jon,

Can you share the stack traces for the exceptions?  Also, I can't say
quite why, but an index of only 925K items with no stored fields coming
to about 2.3GB seems weird to me.  How many unique terms do you have?

Also, in Lucene there is a standalone program called CheckIndex; can
you point that at your index and see what it says?

What would happen if you didn't access the index over the network and  
instead used local disk?  Is that an option?  Or am I not  
understanding your setup correctly?

Finally, what changed this week?  And what is a "paper backlog" and  
how does it factor in?

Thanks,
Grant

On Apr 24, 2009, at 12:25 AM, Jon Bodner wrote:

>
> Hi all,
>
> I am trying to solve a serious performance problem with our Solr  
> search
> index.  We're running under Solr 1.3.  We've sharded our index into 4
> shards.  Index data is stored on a network mount that is accessed  
> over Fibre
> Channel.  Each document's text is indexed, but not stored.  Each day,
> roughly 10K - 20K new documents are added.  After a document is  
> submitted,
> it is compared, sentence by sentence, against every document we have  
> indexed
> in its category.   It's a requirement that we keep our index as up- 
> to-date
> as possible.  We reload our indexes once a minute in order to miss  
> as few
> matches as possible.  We are not expecting to find matches, so our
> document cache hit rates are abysmal.  We also don't expect many
> repeated sentences across documents, so our query cache hit rates are
> also practically zero.
>
> After running fine for over 9 months, the system broke down this  
> week.  The
> queries per second are around 17 to 18, and our paper backlog is  
> well north
> of 14,000.  The number of papers in the index has hit 3.7 million,  
> and each
> shard is 2.3GB in size (roughly 925K papers in each index).
>
> In order to increase throughput, we tried to stand up additional  
> read-only
> Solr instances pointed at the shared indexes, but got I/O errors  
> from the
> secondary Solr instances when the reload time came.  We tried  
> switching the
> locking mechanism from "single" to "simple", but the I/O errors continued.
>
> We're running on 64-bit Linux with a 64-bit JVM (Java  
> 1.6.something), with
> 4GB of RAM assigned to each Solr instance.
>
> Has anyone else seen a problem like this before?  Can anyone suggest  
> any
> solutions?  Will Solr 1.4 help (and is Solr 1.4 ready for production  
> use)?
>
> Any answers would be greatly appreciated.
>
> Thanks,
>
> Jon
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search