Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2015/06/12 18:35:01 UTC
[jira] [Commented] (SOLR-6875) No data integrity between replicas

    [ https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583657#comment-14583657 ] 

Erick Erickson commented on SOLR-6875:
--------------------------------------

Do any of the logs on the leaders mention "leader initiated recovery"? And how fast are you sending documents to Solr? I've seen situations where flooding Solr with "too many" updates causes some wonky behavior. There are some inefficiencies in how leaders talk to replicas; see Tim Potter's blog post here: http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
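
A quick way to answer the first question is to scan the leader logs for the relevant messages. The sketch below is only illustrative: the first phrase is the one asked about above, and the other two are copied from the leader log excerpt in the issue description. The exact wording may differ between Solr versions, and the function and variable names are made up for this example.

```python
import re

# Phrases to look for. The first is the one asked about in the comment; the
# other two are copied from the leader log excerpt in the issue description.
# Assumption: other Solr versions may word these messages differently.
MARKERS = (
    "leader initiated recovery",
    "on behalf of un-reachable replica",
    "setting up to try to start recovery on replica",
)

# Matches the "LEVEL - timestamp;" prefix used in the quoted log lines.
PREFIX = re.compile(r"^(ERROR|WARN|INFO)\s+-\s+(\S+ \S+);")


def find_recovery_events(lines):
    """Return (timestamp, marker) pairs for log lines that contain a marker."""
    events = []
    for line in lines:
        lowered = line.lower()
        for marker in MARKERS:
            if marker in lowered:
                match = PREFIX.match(line)
                timestamp = match.group(2) if match else None
                events.append((timestamp, marker))
                break  # count each line once
    return events


# Three lines taken from the leader log in the issue description.
sample = [
    "WARN  - 2014-12-20 09:54:38.813; org.apache.solr.cloud.ZkController; "
    "Leader is publishing core=crm-prod coreNodeName =10.128.209.232:8081_solr_crm-prod "
    "state=down on behalf of un-reachable replica http://10.128.209.232:8081/solr/crm-prod/; "
    "forcePublishState? false",
    "java.net.SocketException: Connection reset",
    "ERROR - 2014-12-20 09:54:38.818; org.apache.solr.update.processor.DistributedUpdateProcessor; "
    "Setting up to try to start recovery on replica http://10.128.209.232:8081/solr/crm-prod/ "
    "after: java.net.SocketException: Connection reset",
]
events = find_recovery_events(sample)
for timestamp, marker in events:
    print(timestamp, "|", marker)
```

To run this against a real log, pass an open file handle instead of `sample`; the function only needs an iterable of lines.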

The symptoms I saw were two-fold:
1> The leader forced the follower into recovery. No errors were reported on the follower, just a timeout on the leader.
2> A bazillion updates were coming in as fast as possible, and there were a lot of threads outstanding on the leader from ConcurrentUpdateSolrServer.

I'm not saying this is your problem, but if you see something like this, it'd be good to know while tracking this down. If you don't have followers going down, then this isn't the issue.
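
One way to test whether the indexing rate plays a role is to cap it on the client side and see whether followers still go into recovery. The following is a minimal sketch, not a recommendation of specific values: `send_batch` is a hypothetical callable standing in for whatever client call the indexing job actually uses, and `batch_size` and `max_docs_per_sec` are arbitrary illustrative parameters.

```python
import time


def send_throttled(docs, send_batch, batch_size=500, max_docs_per_sec=2000.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Send docs in batches, pausing so the average rate stays under the cap.

    send_batch is a hypothetical callable that posts one list of documents to
    Solr; it stands in for whatever client the indexing job really uses.
    sleep and clock are injectable so the pacing can be tested without waiting.
    Returns the number of documents sent.
    """
    sent = 0
    start = clock()
    batch = []

    def flush():
        nonlocal sent, batch
        if not batch:
            return
        send_batch(batch)
        sent += len(batch)
        batch = []
        # Earliest time at which `sent` documents are allowed to have gone out.
        earliest = start + sent / max_docs_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)

    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            flush()
    flush()  # send whatever is left over
    return sent
```

If followers stop being forced into recovery at a lower rate, that would point toward the overload scenario described above; if they still go down, the cause is more likely elsewhere.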

> No data integrity between replicas
> ----------------------------------
>
>                 Key: SOLR-6875
>                 URL: https://issues.apache.org/jira/browse/SOLR-6875
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10.2
>         Environment: One replica is @ Linux solr1.devops.wegohealth.com 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> Another replica is @ Linux solr2.devops.wegohealth.com 3.16.0-23-generic #30-Ubuntu SMP Thu Oct 16 13:17:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> Solr is running with the following options:
> * -Xms12G
> * -Xmx16G
> * -XX:+UseConcMarkSweepGC
> * -XX:+UseLargePages
> * -XX:+CMSParallelRemarkEnabled
> * -XX:+ParallelRefProcEnabled
> * -XX:+UseLargePages
> * -XX:+AggressiveOpts
> * -XX:CMSInitiatingOccupancyFraction=75
>            Reporter: Alexander S.
>         Attachments: replica1.png, replica2.png
>
>
> Setup: SolrCloud with 2 shards, each with 2 replicas, 4 nodes in total.
> Indexing is stopped. One replica of a shard (Solr1) shows 45 574 039 docs, and the other (Solr1.1) shows 45 574 038 docs.
> Solr1 is the leader. These errors appeared in its logs:
> {code}
> ERROR - 2014-12-20 09:54:38.783; org.apache.solr.update.StreamingSolrServers$1; error
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.787; org.apache.solr.update.processor.DistributedUpdateProcessor; Error sending update
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.813; org.apache.solr.cloud.ZkController; Leader is publishing core=crm-prod coreNodeName =10.128.209.232:8081_solr_crm-prod state=down on behalf of un-reachable replica http://10.128.209.232:8081/solr/crm-prod/; forcePublishState? false
> ERROR - 2014-12-20 09:54:38.818; org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to try to start recovery on replica http://10.128.209.232:8081/solr/crm-prod/ after: java.net.SocketException: Connection reset
> {code}
> On Solr1.1:
> {code}
> WARN  - 2014-12-20 09:54:38.854; org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for core=crm-prod coreNodeName=10.128.209.232:8081_solr_crm-prod
> {code}
> Index optimization was running at that time.
> It was not a system crash: the server is up and was running smoothly with plenty of available resources (lots of CPU, available RAM, and a very fast SSD RAID). So whatever happened, Solr should recover properly, as MySQL does, for example.


