Posted to solr-user@lucene.apache.org by Justin Babuscio <jb...@linchpinsoftware.com> on 2013/05/22 17:08:31 UTC

Large-scale Solr publish - hanging at blockUntilFinished indefinitely - stuck on SocketInputStream.socketRead0

*Problem:*

We periodically rebuild our Solr index from scratch.  We have built a
custom publisher that horizontally scales to increase write throughput.  On
a given rebuild, we will have ~60 JVMs, each running 5 threads that
actively publish to all Solr masters.

For each thread, we instantiate one StreamingUpdateSolrServer
(queueSize: 100, threadCount: 2) per master, i.e., 20 server objects per
thread.
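
For clarity, a minimal sketch of how each publisher thread builds its
server pool (the host names and helper name are placeholders, not our
real setup):

    import java.net.MalformedURLException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

    // Hypothetical helper: one StreamingUpdateSolrServer per master, per
    // publisher thread.  queueSize=100 and threadCount=2 match the
    // settings described above.
    static List<StreamingUpdateSolrServer> buildServers()
            throws MalformedURLException {
        List<StreamingUpdateSolrServer> servers =
            new ArrayList<StreamingUpdateSolrServer>();
        for (int i = 1; i <= 20; i++) {
            String url = "http://solr-master-" + i + ":8983/solr";
            servers.add(new StreamingUpdateSolrServer(url, 100, 2));
        }
        return servers;
    }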

At the end of a publish cycle (we publish in smaller chunks of ~5MM
records), we execute server.blockUntilFinished() on each of the 20 servers
on each thread (100 total per JVM).  Before we applied a recent change,
this would always execute to completion.  There were a few hang-ups on
publishes, but we consistently re-published our entire corpus in 6-7 hours.
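
The end-of-chunk drain is roughly this loop over those server objects,
and it is the call that now hangs:

    // Wait for every runner thread on every server to finish its queue.
    for (StreamingUpdateSolrServer server : servers) {
        server.blockUntilFinished();
    }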

The *problem* is that blockUntilFinished now hangs indefinitely.  From the
java thread dumps, it appears that the loop in StreamingUpdateSolrServer
thinks a runner thread is still active, so it blocks (as expected).  The
other notable detail in the thread dump is that the active runner thread
looks exactly like this:


*Hung Runner Thread:*
"pool-1-thread-8" prio=3 tid=0x00000001084c0000 nid=0xfe runnable
[0xffffffff5c7fe000]
java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 - locked <0xfffffffe81dbcbe0> (a java.io.BufferedInputStream)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
 at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
 at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
 at
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
 at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
 at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
 at
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:154)


Although the runner thread is reading from the socket, there is absolutely
no activity on the Solr client.  Other than the blockUntilFinished thread,
the client is basically sleeping.

*Recent Change:*

We increased maxFieldLength from 10000 (the default) to 2147483647
(Integer.MAX_VALUE).
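
For reference, the edit was in solrconfig.xml; in Solr 3.x this setting
lives under <indexDefaults> and, as we understand it, looks like this:

    <!-- was the default of 10000 -->
    <maxFieldLength>2147483647</maxFieldLength>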

Given that this change is server-side, I don't know how it would impact
adding a new document.  I can see how it would increase commit times and
index size, but I don't see the relationship to hanging client adds.


*Ingest Workflow:*

1) Pull artifacts from relational database (PDF/TXT/Java bean)
2) Extract all searchable text fields -- this is where we use Tika,
independent of Solr
3) Using the SolrJ client, we publish an object that is serialized to XML
and written to the master (see the sketch after this list)
4) execute "blockUntilFinished" for all 20 servers on each thread.

5) Autocommit is set on the servers at 30 minutes or 50k documents.
During a republish, the 50k threshold is met first.
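
Step 3, roughly, for a single record (the field names and the shardFor()
router are illustrative, not our actual code):

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrInputDocument;

    // Build one document from the extracted text and queue it on the
    // server that owns this record's shard; the runner threads send it.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", recordId);
    doc.addField("text", extractedText);  // text extracted by Tika in step 2
    try {
        servers.get(shardFor(recordId)).add(doc);
    } catch (SolrServerException e) {
        // server-side error surfaced by the client
    } catch (IOException e) {
        // network-level failure
    }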

*Environment:*

Solr v3.5.0
20 masters
2 slaves/master = 40 slaves


*Corpus:*

We have ~100MM records, ranging in size from 50MB PDFs to 1KB TXT files.
Our schema has an unusually large number of fields (200).  Our index size
averages about 30GB/shard, totaling 600GB.


*Related Bugs:*

My symptoms most closely resemble this bug, but we are not executing any
deletes, so I have low confidence that it is 100% related:
https://issues.apache.org/jira/browse/SOLR-1990


Although we have similar stack traces, we are only ADDING docs.


Thanks in advance for any input/help!

-- 
Justin Babuscio

Re: Large-scale Solr publish - hanging at blockUntilFinished indefinitely - stuck on SocketInputStream.socketRead0

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/22/2013 11:25 AM, Justin Babuscio wrote:
> On your overflow theory, why would this impact the client?  Is it possible
> that a write attempt to Solr would block indefinitely while the Solr server
> is running wild or in a bad state due to the overflow?

That's the general notion.  I could be completely wrong about this, but 
as that limit is the only thing you changed, it was the idea that came 
to mind first.

One other thing I thought of, though this would be a band-aid, not a 
real solution - if there's a definable maximum amount of time that an 
individual update request should take to complete (1 minute? 5 minutes?), 
then you might be able to use the setSoTimeout call on your server 
object.  In the 3.5.0 source code, this method is inherited, so it might 
not actually work correctly, but I'm hopeful.
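
Something like this, assuming the inherited methods behave the same as 
they do on CommonsHttpSolrServer (the values are illustrative):

    // soTimeout bounds each blocking socket read, which is exactly where
    // your runner thread is stuck; connectionTimeout bounds connect time.
    server.setSoTimeout(300000);         // 5 minutes per read
    server.setConnectionTimeout(15000);  // 15 seconds to connect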

If the problem is stuck update requests (and not a bug in 
blockUntilFinished), setting the SoTimeout (assuming it works) might 
unplug the works.  The stuck requests might fail, but your SolrJ log 
might contain enough info to help you track that down.  I don't think 
your application would ever be notified about such failures, but they 
should be logged.

Good luck with the upgrade plan.  Would you be able to upgrade the 
dependent jars for the existing SolrJ without an extensive approval 
process?  I won't be surprised if the answer is no.

On SOLR-1990, I don't think that's it, because unless 
blockUntilFinished() itself is broken, calling it more often than 
strictly necessary shouldn't be an issue.

Do you see any problems in the server log?

Thanks,
Shawn


Re: Large-scale Solr publish - hanging at blockUntilFinished indefinitely - stuck on SocketInputStream.socketRead0

Posted by Justin Babuscio <jb...@linchpinsoftware.com>.
Shawn,

Thank you!

Just some quick responses:

On your overflow theory, why would this impact the client?  Is it possible
that a write attempt to Solr would block indefinitely while the Solr server
is running wild or in a bad state due to the overflow?


We attempted to set the BinaryRequestWriter, but per
https://issues.apache.org/jira/browse/SOLR-1565, v3.5 still uses the
default XML writer.
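
What we attempted, for reference:

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;

    // Attempted switch to the javabin format; per SOLR-1565 the 3.5
    // client still sends XML despite this call.
    server.setRequestWriter(new BinaryRequestWriter());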


On upgrading to 3.6.2 or 4.x, we have an organizational challenge that
requires approval of the software/upgrade.  I am promoting/supporting this
idea but cannot execute in the short-term.

For the mass publish, we originally used CommonsHttpSolrServer (what we
use for live production updates), but we found the performance trade-off
was quite large.  I really like your idea about KISS on threading: since
I'm already introducing complexity with all the multi-threading, why
stress the older 3.x software?  We may need to trade off time for this.
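
If we do toggle back, it would look something like this (the URL is a
placeholder; the constructor throws MalformedURLException):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    // No internal queue or runner threads; add() blocks per request and
    // throws on errors instead of swallowing them.
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://solr-master-1:8983/solr");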



My first steps will be to adjust maxFieldLength and to toggle the
configuration to use CommonsHttpSolrServer.  I will follow up with any
discoveries.

Thanks again,
Justin





On Wed, May 22, 2013 at 11:46 AM, Shawn Heisey <so...@elyograg.org> wrote:

> On 5/22/2013 9:08 AM, Justin Babuscio wrote:
>
>> We periodically rebuild our Solr index from scratch.  We have built a
>> custom publisher that horizontally scales to increase write throughput.
>> On a given rebuild, we will have ~60 JVMs, each running 5 threads that
>> actively publish to all Solr masters.
>>
>> For each thread, we instantiate one StreamingUpdateSolrServer
>> (queueSize: 100, threadCount: 2) per master, i.e., 20 server objects per
>> thread.
>>
>
> Looking over all your details, you might want to try first reducing the
> maxFieldLength to slightly below Integer.MAX_VALUE.  Try setting it to 2
> billion, or even something more modest, in the millions.  It's
> theoretically possible that the other value might be leading to an overflow
> somewhere.  I've been looking for evidence of this, but nothing's turned up yet.
>
> There MIGHT be bugs in the Apache Commons libraries that SolrJ uses. The
> next thing I would try is upgrading those component jars in your
> application's classpath - httpclient, commons-io, commons-codec, etc.
>
> Upgrading to a newer SolrJ version is also a good idea.  Your notes imply
> that you are using the default XML request writer in SolrJ.  If that's
> true, you should be able to use a 4.3 SolrJ even with an older Solr
> version, which would give you a server object that's based on
> HttpComponents 4.x, where your current objects are based on HttpClient 3.x.
>  You would need to make adjustments in your source code.  If you're not
> using the default XML request writer, you can get a similar change by using
> SolrJ 3.6.2.
>
> IMHO you should switch to HttpSolrServer (CommonsHttpSolrServer in SolrJ
> 3.5 and earlier).  StreamingUpdateSolrServer (and its replacement in 3.6
> and later, named ConcurrentUpdateSolrServer) has one glaring problem - it
> never informs the calling application about any errors that it encounters
> during indexing.  It lies to you, and tells you that everything has
> succeeded even when it hasn't.
>
> The one advantage that SUSS/CUSS has over its Http sibling is that it is
> multi-threaded, so it can send updates concurrently.  You seem to know
> enough about how it works, so I'll just say that you don't need additional
> complexity that is not under your control and refuses to throw exceptions
> when an error occurs.  You already have a large-scale concurrent and
> multi-threaded indexing setup, so SolrJ's additional thread handling
> doesn't really buy you much.
>
> Thanks,
> Shawn
>
>


-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com

Re: Large-scale Solr publish - hanging at blockUntilFinished indefinitely - stuck on SocketInputStream.socketRead0

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/22/2013 9:08 AM, Justin Babuscio wrote:
> We periodically rebuild our Solr index from scratch.  We have built a
> custom publisher that horizontally scales to increase write throughput.
> On a given rebuild, we will have ~60 JVMs, each running 5 threads that
> actively publish to all Solr masters.
>
> For each thread, we instantiate one StreamingUpdateSolrServer
> (queueSize: 100, threadCount: 2) per master, i.e., 20 server objects per
> thread.

Looking over all your details, you might want to try first reducing the 
maxFieldLength to slightly below Integer.MAX_VALUE.  Try setting it to 2 
billion, or even something more modest, in the millions.  It's 
theoretically possible that the other value might be leading to an 
overflow somewhere.  I've been looking for evidence of this, but 
nothing's turned up yet.

There MIGHT be bugs in the Apache Commons libraries that SolrJ uses. 
The next thing I would try is upgrading those component jars in your 
application's classpath - httpclient, commons-io, commons-codec, etc.

Upgrading to a newer SolrJ version is also a good idea.  Your notes 
imply that you are using the default XML request writer in SolrJ.  If 
that's true, you should be able to use a 4.3 SolrJ even with an older 
Solr version, which would give you a server object that's based on 
HttpComponents 4.x, where your current objects are based on HttpClient 
3.x.  You would need to make adjustments in your source code.  If you're 
not using the default XML request writer, you can get a similar change 
by using SolrJ 3.6.2.

IMHO you should switch to HttpSolrServer (CommonsHttpSolrServer in SolrJ 
3.5 and earlier).  StreamingUpdateSolrServer (and its replacement in 3.6 
and later, named ConcurrentUpdateSolrServer) has one glaring problem - 
it never informs the calling application about any errors that it 
encounters during indexing.  It lies to you, and tells you that 
everything has succeeded even when it hasn't.

The one advantage that SUSS/CUSS has over its Http sibling is that it is 
multi-threaded, so it can send updates concurrently.  You seem to know 
enough about how it works, so I'll just say that you don't need 
additional complexity that is not under your control and refuses to 
throw exceptions when an error occurs.  You already have a large-scale 
concurrent and multi-threaded indexing setup, so SolrJ's additional 
thread handling doesn't really buy you much.

Thanks,
Shawn