Posted to issues@lucene.apache.org by "Erick Erickson (Jira)" <ji...@apache.org> on 2019/10/29 14:27:00 UTC

[jira] [Commented] (SOLR-7483) Investigate ways to deal with the tlog growing indefinitely while it's being replayed

    [ https://issues.apache.org/jira/browse/SOLR-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962052#comment-16962052 ] 

Erick Erickson commented on SOLR-7483:
--------------------------------------

I think this can be closed? It sounds like some other JIRAs we've had where tlog replay was inefficient...

> Investigate ways to deal with the tlog growing indefinitely while it's being replayed
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-7483
>                 URL: https://issues.apache.org/jira/browse/SOLR-7483
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Priority: Major
>
> While tracking down the data-loss issue I found during testing of SOLR-7332, one of my replicas was forced into recovery by the leader after a network error (I'm deliberately over-stressing Solr as part of this test) ... 
> In the leader log:
> {code}
> INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; I/O exception (java.net.SocketException) caught when processing request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985: Broken pipe
> INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; Retrying request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985
> ERROR - 2015-04-28 21:36:55.091; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.solr.update.StreamingSolrClients$1; error
> org.apache.http.NoHttpResponseException: ec2-54-242-70-241.compute-1.amazonaws.com:8985 failed to respond
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
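> (An aside on the "Retrying request" lines above: they come from HttpClient's built-in retry handling, not Solr code. A minimal sketch, assuming HttpClient 4.x and with illustrative parameter values, of how such a retry policy is configured:)
> {code}
> // Sketch only, not Solr's actual configuration: an HttpClient 4.x retry
> // policy of the kind that produces "Retrying request" log lines.
> import org.apache.http.impl.client.CloseableHttpClient;
> import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
> import org.apache.http.impl.client.HttpClients;
>
> public class RetryExample {
>   public static CloseableHttpClient newClient() {
>     return HttpClients.custom()
>         // retry up to 3 times; 'true' also retries requests that were
>         // already sent (which is how a broken pipe gets retried)
>         .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true))
>         .build();
>   }
> }
> {code}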
> In the logs on the replica, I see a bunch of failed checksum messages, like:
> {code}
> WARN  - 2015-04-28 21:38:43.345; [   ] org.apache.solr.handler.IndexFetcher; File _xv.si did not match. expected checksum is 617655777 and actual is checksum 1090588695. expected length is 419 and actual length is 419
> WARN  - 2015-04-28 21:38:43.349; [   ] org.apache.solr.handler.IndexFetcher; File _xv.fnm did not match. expected checksum is 1992662616 and actual is checksum 1632122630. expected length is 1756 and actual length is 1756
> WARN  - 2015-04-28 21:38:43.353; [   ] org.apache.solr.handler.IndexFetcher; File _xv.nvm did not match. expected checksum is 384078655 and actual is checksum 3108095639. expected length is 92 and actual length is 92
> {code}
> This tells me the replica attempted a snapshot pull of the index from the leader ...
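> As a hedged sketch of what that checksum comparison looks like (not IndexFetcher's actual code; the expected checksum here stands in for the value the leader advertises for each file):
> {code}
> // Illustrative sketch, not Solr's IndexFetcher: verify a fetched segment
> // file against the checksum the leader advertised for it. Lucene stores a
> // CRC32 checksum in every index file's footer, readable via CodecUtil.
> import java.nio.file.Paths;
> import org.apache.lucene.codecs.CodecUtil;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.IOContext;
> import org.apache.lucene.store.IndexInput;
>
> public class ChecksumCheck {
>   // expectedChecksum is a stand-in for the value sent by the leader
>   static boolean matches(String fileName, long expectedChecksum) throws Exception {
>     try (Directory dir = FSDirectory.open(Paths.get("data/index"));
>          IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
>       long actual = CodecUtil.retrieveChecksum(in);
>       // same length but different checksum (as in the warnings above) means
>       // a same-named file with different contents, forcing a re-fetch
>       return actual == expectedChecksum;
>     }
>   }
> }
> {code}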
> Also, I see the replica started to replay the tlog (presumably the snapshot pull succeeded; my logging is set to WARN, so I'm not seeing the full story in the logs):
> {code}
> WARN  - 2015-04-28 21:38:45.656; [   ] org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog/tlog.0000000000000000046 refcount=2} active=true starting pos=56770101
> {code}
> The problem is that the tlog continues to grow while this "replay" is happening. When I first looked at the tlog it was 769 MB; a few minutes later it was at 2.2 GB and still growing, i.e. the leader is still pounding the replica with updates faster than it can apply them.
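> A hedged sketch of that race (pseudo-Java; the tlog/apply helpers are hypothetical, not Solr's UpdateLog API):
> {code}
> // Illustrative pseudo-code for the race described above; tlog.readAt(),
> // apply(), and sizeInBytes() are hypothetical helpers, not Solr's code.
> long replayPos = 56770101L;            // "starting pos" from the log line above
> while (replayPos < tlog.length()) {    // length() keeps moving forward...
>   Update u = tlog.readAt(replayPos);   // read the next buffered update
>   apply(u);                            // apply it to the local index
>   replayPos += u.sizeInBytes();
>   // ...because updates arriving from the leader are appended to the same
>   // tlog concurrently; if appends outpace replay, the file grows unboundedly
> }
> {code}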
> The good thing, of course, is that the updates are being persisted to durable storage on the replica, so this is better than if the replica were simply marked down. Maybe there isn't much we can do about this, but I wanted to capture a description of this event in a JIRA so we can investigate further.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
