Posted to dev@lucene.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2015/04/29 00:15:07 UTC

[jira] [Created] (SOLR-7483) Investigate ways to deal with the tlog growing indefinitely while it's being replayed

Timothy Potter created SOLR-7483:
------------------------------------

             Summary: Investigate ways to deal with the tlog growing indefinitely while it's being replayed
                 Key: SOLR-7483
                 URL: https://issues.apache.org/jira/browse/SOLR-7483
             Project: Solr
          Issue Type: Task
          Components: SolrCloud
            Reporter: Timothy Potter


While trying to track down the data-loss issue I found while testing SOLR-7332, one of my replicas was forced into recovery by the leader due to a network error (I'm over-stressing Solr as part of this test) ... 

In the leader log:
{code}
INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; I/O exception (java.net.SocketException) caught when processing request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985: Broken pipe
INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; Retrying request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985
ERROR - 2015-04-28 21:36:55.091; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.solr.update.StreamingSolrClients$1; error
org.apache.http.NoHttpResponseException: ec2-54-242-70-241.compute-1.amazonaws.com:8985 failed to respond
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

In the logs on the replica, I see a bunch of failed checksum messages, like:
{code}
WARN  - 2015-04-28 21:38:43.345; [   ] org.apache.solr.handler.IndexFetcher; File _xv.si did not match. expected checksum is 617655777 and actual is checksum 1090588695. expected length is 419 and actual length is 419
WARN  - 2015-04-28 21:38:43.349; [   ] org.apache.solr.handler.IndexFetcher; File _xv.fnm did not match. expected checksum is 1992662616 and actual is checksum 1632122630. expected length is 1756 and actual length is 1756
WARN  - 2015-04-28 21:38:43.353; [   ] org.apache.solr.handler.IndexFetcher; File _xv.nvm did not match. expected checksum is 384078655 and actual is checksum 3108095639. expected length is 92 and actual length is 92
{code}

This tells me the replica tried a snapshot pull of the index from the leader ...
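For reference, the checksum IndexFetcher is comparing is (as far as I can tell) the per-file checksum Lucene records in each file's footer. A minimal sketch, not Solr's IndexFetcher code, that reads that footer checksum and length for a local segment file; the index directory and file name are passed as (hypothetical) arguments:

{code}
import java.nio.file.Paths;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class FooterChecksum {
  public static void main(String[] args) throws Exception {
    // e.g. args[0] = <core>/data/index, args[1] = _xv.si
    try (Directory dir = FSDirectory.open(Paths.get(args[0]));
         IndexInput in = dir.openInput(args[1], IOContext.READONCE)) {
      long length = in.length();                       // compare to "expected length"
      long checksum = CodecUtil.retrieveChecksum(in);  // checksum stored in the file's footer
      System.out.println(args[1] + " length=" + length + " checksum=" + checksum);
    }
  }
}
{code}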

Also, I see the replica started to replay the tlog (presumably the snapshot pull succeeded; my logging is set to WARN, so I'm not seeing the full story in the logs):

{code}
WARN  - 2015-04-28 21:38:45.656; [   ] org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog/tlog.0000000000000000046 refcount=2} active=true starting pos=56770101
{code}

The problem is that the tlog continues to grow while this "replay" is happening ... when I first looked at the tlog it was 769m; a few minutes later it's at 2.2g and still growing, i.e. the leader is still pounding the replica with updates faster than the replay can apply them.
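To put numbers on that growth, here's a minimal sketch of an ad-hoc helper (not part of Solr) that just polls the total size of the replica's tlog directory every 10 seconds, using the path from the replay log line above:

{code}
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TlogSizeWatcher {
  public static void main(String[] args) throws Exception {
    Path tlogDir = Paths.get(args.length > 0 ? args[0]
        : "/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog");
    while (true) {
      long total = 0;
      try (DirectoryStream<Path> files = Files.newDirectoryStream(tlogDir, "tlog.*")) {
        for (Path f : files) {
          total += Files.size(f);   // sum the sizes of all tlog files
        }
      }
      System.out.printf("%tT tlog bytes=%,d%n", System.currentTimeMillis(), total);
      Thread.sleep(10_000);         // poll every 10 seconds
    }
  }
}
{code}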

The good thing, of course, is that the updates are being persisted to durable storage on the replica, so it's better than if the replica were just marked down. Maybe there isn't much we can do about this, but I wanted to capture a description of this event in a JIRA so we can investigate it further.




