Posted to solr-user@lucene.apache.org by Lanny Ripple <la...@spotright.com> on 2013/05/31 18:43:40 UTC

updating docs in solr cloud hangs

Hi all,

We're using Solr 4.1.0 and a 15 node Solr Cloud (configured for a 2 minute
autoCommit with no searcher being built).  We have a large dataset in
Cassandra and use a Hadoop cluster to read over the dataset, build
documents, and insert them (via CloudSolrServer).  That part works as
expected.  We found that we had not included all the data in the documents
we wanted so we generated updates and sent them to the cloud.  We observed
that 15 to 20 tasks of the Hadoop job would complete fine but then we
started getting task timeouts.  Tasks would be retried and complete, but the
longer the job ran the more tasks would see repeated timeouts (some taking
8 hours to finish).  We finally killed the job after 12 or so hours of
running with only 0.70% progress through the job.

Grabbing thread stack traces showed the trace I've placed at the end of
this post.  Basically the request is waiting (and keeps waiting) for a
response that does not show up within our 1200 second task timeout window.
It sure feels like we're saturating some resource, and even with the cloud
relatively quiet (because every Hadoop task is tied up waiting for a
response) the Solr Cloud can't seem to straighten up and fly right.

We've worked around this by clearing out the index and building the
documents with all data from the start.

Are document updates particularly expensive?  (I realize they are more
expensive than straight inserts but should we expect the behavior we've
been seeing?)


java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:111)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:264)
	at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:98)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247)
	at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
	at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
	at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:256)
	at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:286)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)

Re: updating docs in solr cloud hangs

Posted by Yago Riveiro <ya...@gmail.com>.
Hi,  

I'm experiencing the same issue. I'm indexing a big file with 15M in batches of 100K.

Sometimes the indexing operation hangs and my HTTP client returns a timeout error.

I see that it is more frequent when the collection has more replicas.

Another thing I can see is a lot of POST update operations on Tomcat hanging for a long time; the only way I have to index more documents is restarting the node.

Regards

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)




Re: updating docs in solr cloud hangs

Posted by Yago Riveiro <ya...@gmail.com>.
Hi,

My cluster hung again while running an update process; the HTTP POST request was aborted because of a timeout error. After the hang, I couldn't do more updates without restarting the cluster.

I could see this error in the node's log after killing it. It is as if Solr waits for the update response forever … and no more operations can be handled until this one finishes.

[qtp301150411-1248] ERROR org.apache.solr.core.SolrCore  – org.apache.solr.common.SolrException: interrupted waiting for shard update response
at org.apache.solr.update.SolrCmdDistributor.checkResponses(SolrCmdDistributor.java:429)
at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:99)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:447)
at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1140)
at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.solr.update.SolrCmdDistributor.checkResponses(SolrCmdDistributor.java:408)
... 35 more

--  
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 3, 2013 at 2:18 AM, Erick Erickson wrote:

> Did you take a stack trace of your _server_ and see if the
> fragment I posted is the place a bunch of threads are
> stuck? If so, then it's what I mentioned, and the patch
> I pointed to should fix it up (when it's ready)...
>  
> The fact that it hangs more frequently with replication > 1
> is consistent with the JIRA.
>  
> Shawn:
>  
> Thanks, you beat me to the punch for clarifying "replication"!
>  
> Best
> Erick
>  
> On Sun, Jun 2, 2013 at 12:41 PM, Yago Riveiro <yago.riveiro@gmail.com> wrote:
> > Shawn:
> >  
> > replicationFactor higher than one yes.
> >  
> > --
> > Yago Riveiro
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> >  
> >  
> > On Sunday, June 2, 2013 at 4:07 PM, Shawn Heisey wrote:
> >  
> > > On 6/2/2013 8:28 AM, Yago Riveiro wrote:
> > > > Erick:
> > > >  
> > > > In my case, when the server hangs, no exception is thrown; the logs on both servers stop registering the update INFO messages. If I shut down one node, the log of the surviving node immediately registers some update INFO messages that appear to have been stuck at some point in the update operation.
> > > >
> > > > Another thing I notice is that the cluster hangs more frequently when the collection has replication.
> > >  
> > > Just to clarify, you are talking about a replicationFactor higher than
> > > one, not old-style master-slave replication, correct? I'm pretty sure
> > > that's the case, I'm just trying to keep this topic from getting derailed.
> > >  
> > > Thanks,
> > > Shawn
> > >  
> >  
> >  
>  
>  
>  



Re: updating docs in solr cloud hangs

Posted by Erick Erickson <er...@gmail.com>.
Did you take a stack trace of your _server_ and see if the
fragment I posted is the place a bunch of threads are
stuck? If so, then it's what I mentioned, and the patch
I pointed to should fix it up (when it's ready)...

The fact that it hangs more frequently with replication > 1
is consistent with the JIRA.

Shawn:

Thanks, you beat me to the punch for clarifying "replication"!

Best
Erick

On Sun, Jun 2, 2013 at 12:41 PM, Yago Riveiro <ya...@gmail.com> wrote:
> Shawn:
>
> replicationFactor higher than one yes.
>
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Sunday, June 2, 2013 at 4:07 PM, Shawn Heisey wrote:
>
>> On 6/2/2013 8:28 AM, Yago Riveiro wrote:
>> > Erick:
>> >
>> > In my case, when the server hangs, no exception is thrown; the logs on both servers stop registering the update INFO messages. If I shut down one node, the log of the surviving node immediately registers some update INFO messages that appear to have been stuck at some point in the update operation.
>> >
>> > Another thing I notice is that the cluster hangs more frequently when the collection has replication.
>>
>> Just to clarify, you are talking about a replicationFactor higher than
>> one, not old-style master-slave replication, correct? I'm pretty sure
>> that's the case, I'm just trying to keep this topic from getting derailed.
>>
>> Thanks,
>> Shawn
>>
>>
>
>

Re: updating docs in solr cloud hangs

Posted by Yago Riveiro <ya...@gmail.com>.
Shawn: 

replicationFactor higher than one yes. 

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Sunday, June 2, 2013 at 4:07 PM, Shawn Heisey wrote:

> On 6/2/2013 8:28 AM, Yago Riveiro wrote:
> > Erick:
> > 
> > In my case, when the server hangs, no exception is thrown; the logs on both servers stop registering the update INFO messages. If I shut down one node, the log of the surviving node immediately registers some update INFO messages that appear to have been stuck at some point in the update operation.
> >
> > Another thing I notice is that the cluster hangs more frequently when the collection has replication.
> 
> Just to clarify, you are talking about a replicationFactor higher than
> one, not old-style master-slave replication, correct? I'm pretty sure
> that's the case, I'm just trying to keep this topic from getting derailed.
> 
> Thanks,
> Shawn
> 
> 



Re: updating docs in solr cloud hangs

Posted by Shawn Heisey <so...@elyograg.org>.
On 6/2/2013 8:28 AM, Yago Riveiro wrote:
> Erick:
> 
> In my case, when the server hangs, no exception is thrown; the logs on both servers stop registering the update INFO messages. If I shut down one node, the log of the surviving node immediately registers some update INFO messages that appear to have been stuck at some point in the update operation.
>
> Another thing I notice is that the cluster hangs more frequently when the collection has replication.

Just to clarify, you are talking about a replicationFactor higher than
one, not old-style master-slave replication, correct?  I'm pretty sure
that's the case, I'm just trying to keep this topic from getting derailed.

Thanks,
Shawn


Re: updating docs in solr cloud hangs

Posted by Yago Riveiro <ya...@gmail.com>.
Erick:

In my case, when the server hangs, no exception is thrown; the logs on both servers stop registering the update INFO messages. If I shut down one node, the log of the surviving node immediately registers some update INFO messages that appear to have been stuck at some point in the update operation.

Another thing I notice is that the cluster hangs more frequently when the collection has replication.

--  
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Sunday, June 2, 2013 at 2:28 PM, Erick Erickson wrote:

> Yago:
>  
> Batches of 100k docs at a time are pretty big, you're way past the
> diminishing returns point. I rarely go over 1,000. That said, reducing
> the size might be a work-around, perhaps down to one.
>  
> All:
>  
> Look on your Solr servers (not client) for a stack trace fragment similar to:
>  
> at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:349)
> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:299)
>  
> This has been lurking in the background, and work is being done here:
> https://issues.apache.org/jira/browse/SOLR-4816
> that should address this.
>  
> It'd be great if either or both of you could try this patch and see if
> it cures your problem!
>  
> Of course this may be unrelated to what you're seeing, look at the
> stack trace on your server before jumping in....
>  
> In the meantime, another way around this would be to very
> significantly reduce the number of docs in an update. I _think_ that
> the more docs you have the more likely you are to get into a deadlock
> state.
>  
> FWIW,
> Erick
>  
>  
>  
> On Fri, May 31, 2013 at 1:51 PM, bbarani <bbarani@gmail.com> wrote:
> > As far as I know, a partial update in Solr 4.x doesn’t partially update the
> > Lucene index, but instead removes the document from the index and indexes an
> > updated one. The underlying Lucene always requires deleting the old
> > document and indexing the new one.
> >
> >
> > We usually don't use partial updates when updating a huge number of documents.
> > They are really useful for a small number of documents (mostly during push
> > indexing).
> >  
> >  
> >  
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4067416.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >  
>  
>  
>  



RE: updating docs in solr cloud hangs

Posted by Greg Walters <gw...@sherpaanalytics.com>.
Thanks, Erick, that's exactly the clarification/confirmation I was looking for!

Greg


Re: updating docs in solr cloud hangs

Posted by Erick Erickson <er...@gmail.com>.
Right, it's a little arcane. But the lockup is because the
various leaders send documents to each other and wait
for returns. If there are a _lot_ of incoming packets to
various leaders, it can generate the distributed deadlock.
So the shuffling you refer to is the root of the issue.

If the leaders only receive documents for the shard they're
a leader of, then they won't have to send updates to other
leaders and shouldn't hit this condition.
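
The idea in the paragraph above (have the client group documents by shard so each leader only receives documents for its own shard) can be pictured with a small sketch. This is an illustration only, not the SOLR-4816 patch and not SolrJ code: every name here is hypothetical, and `shardFor` uses `String.hashCode()` as a stand-in, whereas Solr's own routing hashes ids differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ShardRoutingSketch {

    /** Stand-in routing: NOT Solr's algorithm, just a deterministic id -> shard mapping. */
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result in [0, numShards) even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Groups document ids by target shard so each group can go straight to that shard's leader. */
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> byShard = new LinkedHashMap<Integer, List<String>>();
        for (String id : docIds) {
            int shard = shardFor(id, numShards);
            List<String> group = byShard.get(shard);
            if (group == null) {
                group = new ArrayList<String>();
                byShard.put(shard, group);
            }
            group.add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<String>();
        for (int i = 0; i < 10; i++) {
            ids.add("doc-" + i);
        }
        // Each entry would become one update request sent only to that shard's leader,
        // so no leader has to forward documents to another leader.
        for (Map.Entry<Integer, List<String>> e : groupByShard(ids, 3).entrySet()) {
            System.out.println("shard " + e.getKey() + " <- " + e.getValue());
        }
    }
}
```

The point of grouping before sending is that the forwarding step between leaders, which is where the waiting happens, never gets triggered for these requests.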

But you're right, this situation was encountered the first time
by SolrJ clients sending lots and lots of parallel requests;
I don't remember whether it was just one client with lots of
threads or many clients. If you're not using SolrJ, then
the patch won't do you much good since it's client-side only.

As far as being a true fix or not, you can look at it as
kicking the can down the road. This patch has several
advantages:
1> It should pave the way for, and move towards,
    linear scalability as far as scaling up to many
    many nodes when indexing from SolrJ.
2> It should improve throughput in the normal case as well.
3> Along the way it _should_ significantly lower (perhaps
    remove entirely) the chance that this deadlock will occur,
    again when indexing from SolrJ.

If you had a bunch of clients, say, posting CSV files
to SolrCloud, I'd guess you'd find this happening again.

So it's an improvement not a perfect cure. But if you think
it'd help....

Best,
Erick


On Thu, Aug 22, 2013 at 3:23 PM, allrightname <al...@gmail.com> wrote:

> Erick,
>
> I've read over SOLR-4816 after finding your comment about the server-side
> stack traces showing threads locked up over semaphores and I'm curious how
> that issue cures the problem on the server-side as the patch only includes
> client-side changes. Do the servers get so tied up shuffling documents
> around when they're not sent to the master that they get blocked as
> described? If they do get blocked due to shuffling documents around is a
> client-side fix for this not more of a workaround than a true fix?
>
> I'm entirely willing to apply this patch to all of the code I've got that
> talks to my solr servers and try it out but I'm reluctant to because this
> looks like a client-side fix to a server-side issue.
>
> Thanks,
> Greg
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4086160.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: updating docs in solr cloud hangs

Posted by allrightname <al...@gmail.com>.
Erick,

I've read over SOLR-4816 after finding your comment about the server-side
stack traces showing threads locked up over semaphores and I'm curious how
that issue cures the problem on the server-side as the patch only includes
client-side changes. Do the servers get so tied up shuffling documents
around when they're not sent to the master that they get blocked as
described? If they do get blocked due to shuffling documents around is a
client-side fix for this not more of a workaround than a true fix?

I'm entirely willing to apply this patch to all of the code I've got that
talks to my solr servers and try it out but I'm reluctant to because this
looks like a client-side fix to a server-side issue.

Thanks,
Greg



--
View this message in context: http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4086160.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: updating docs in solr cloud hangs

Posted by Erick Erickson <er...@gmail.com>.
Yago:

Batches of 100k docs at a time are pretty big, you're way past the
diminishing returns point. I rarely go over 1,000. That said, reducing
the size might be a work-around, perhaps down to one.

All:

Look on your Solr servers (not client) for a stack trace fragment similar to:

at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:349)
at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:299)

This has been lurking in the background, and work is being done here:
https://issues.apache.org/jira/browse/SOLR-4816
that should address this.

It'd be great if either or both of you could try this patch and see if
it cures your problem!

Of course this may be unrelated to what you're seeing, look at the
stack trace on your server before jumping in....
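
One way to do that check is to take a thread dump on a Solr server (for example with `jstack`) and count the threads whose stack contains the `AdjustableSemaphore.acquire` frame. A minimal sketch, assuming the dump is plain text with one thread per blank-line-separated block; the class and method names are made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class ThreadDumpScan {

    /** The frame to look for, taken from the stack trace fragment above. */
    static final String FRAME = "org.apache.solr.util.AdjustableSemaphore.acquire";

    /** Counts blank-line-separated thread blocks that contain the given frame. */
    static int countThreadsWithFrame(List<String> dumpLines, String frame) {
        int count = 0;
        boolean seenInBlock = false;
        for (String line : dumpLines) {
            if (line.trim().isEmpty()) {
                seenInBlock = false;   // blank line: the next thread block starts
            } else if (!seenInBlock && line.contains(frame)) {
                seenInBlock = true;    // count each thread at most once
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ThreadDumpScan <thread-dump.txt>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(countThreadsWithFrame(lines, FRAME)
                + " thread(s) with a frame in " + FRAME);
    }
}
```

A large count that stays the same across two or three dumps taken a minute apart would suggest the threads are stuck rather than just busy.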

In the meantime, another way around this would be to very
significantly reduce the number of docs in an update. I _think_ that
the more docs you have the more likely you are to get into a deadlock
state.
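
Reducing the number of docs per update can be done on the client by slicing the input before sending. A minimal sketch, not SolrJ code: `BatchSender` is a hypothetical callback standing in for whatever actually ships a batch (with SolrJ that would presumably be the `add(Collection<SolrInputDocument>)` call on the server object), and 1,000 is simply the figure mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchingSketch {

    /** Hypothetical callback: stands in for the call that ships one batch to Solr. */
    interface BatchSender<T> {
        void send(List<T> batch);
    }

    /** Sends docs in consecutive batches of at most batchSize; returns the number of batches sent. */
    static <T> int sendInBatches(List<T> docs, int batchSize, BatchSender<T> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int batches = 0;
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so the sender never holds a view of the caller's list.
            sender.send(new ArrayList<T>(docs.subList(start, end)));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<String>();
        for (int i = 0; i < 2500; i++) {
            docs.add("doc-" + i);
        }
        // A real sender would call the Solr client here and handle its checked exceptions.
        int sent = sendInBatches(docs, 1000,
                batch -> System.out.println("sending " + batch.size() + " docs"));
        System.out.println(sent + " batches");
    }
}
```

Keeping the batch size a parameter makes it easy to try 1,000, 100, or even 1 while chasing the hang.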

FWIW,
Erick



On Fri, May 31, 2013 at 1:51 PM, bbarani <bb...@gmail.com> wrote:
> As far as I know, a partial update in Solr 4.x doesn’t partially update the
> Lucene index, but instead removes the document from the index and indexes an
> updated one. The underlying Lucene always requires deleting the old
> document and indexing the new one.
>
>
> We usually don't use partial updates when updating a huge number of documents.
> They are really useful for a small number of documents (mostly during push
> indexing).
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4067416.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: updating docs in solr cloud hangs

Posted by bbarani <bb...@gmail.com>.
As far as I know, a partial update in Solr 4.x doesn’t partially update the
Lucene index, but instead removes the document from the index and indexes an
updated one. The underlying Lucene always requires deleting the old
document and indexing the new one.
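
The behaviour described above can be pictured with a toy in-memory model: a "partial" update reads the stored document, merges in the changed fields, deletes the old document, and indexes the merged one as a whole. A minimal sketch with made-up names; it models the description, not Lucene's or Solr's actual code.

```java
import java.util.HashMap;
import java.util.Map;

final class PartialUpdateSketch {

    /** Toy "index": document id -> stored fields. */
    private final Map<String, Map<String, Object>> index = new HashMap<String, Map<String, Object>>();

    int adds = 0;     // how many whole documents were indexed
    int deletes = 0;  // how many whole documents were removed

    /** Indexes a whole document, replacing any previous one with the same id. */
    void add(String id, Map<String, Object> fields) {
        index.put(id, new HashMap<String, Object>(fields));
        adds++;
    }

    /** "Partial" update: still costs one delete plus one full add. */
    void partialUpdate(String id, Map<String, Object> changes) {
        Map<String, Object> old = index.remove(id);   // delete the old document
        if (old == null) {
            throw new IllegalArgumentException("no such document: " + id);
        }
        deletes++;
        Map<String, Object> merged = new HashMap<String, Object>(old);
        merged.putAll(changes);                       // apply only the changed fields
        add(id, merged);                              // index the merged document in full
    }

    Map<String, Object> get(String id) {
        return index.get(id);
    }

    public static void main(String[] args) {
        PartialUpdateSketch idx = new PartialUpdateSketch();
        Map<String, Object> doc = new HashMap<String, Object>();
        doc.put("name", "example");
        doc.put("city", "old");
        idx.add("1", doc);

        Map<String, Object> changes = new HashMap<String, Object>();
        changes.put("city", "new");
        idx.partialUpdate("1", changes);

        System.out.println(idx.get("1") + " adds=" + idx.adds + " deletes=" + idx.deletes);
    }
}
```

Seen this way, updating one field of every document costs about as much indexing work as rebuilding every document.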


We usually don't use partial updates when updating a huge number of documents.
They are really useful for a small number of documents (mostly during push
indexing).



--
View this message in context: http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4067416.html
Sent from the Solr - User mailing list archive at Nabble.com.