Posted to dev@lucene.apache.org by "Alexander S. (JIRA)" <ji...@apache.org> on 2015/06/12 07:25:01 UTC

[jira] [Comment Edited] (SOLR-6875) No data integrity between replicas

    [ https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582960#comment-14582960 ] 

Alexander S. edited comment on SOLR-6875 at 6/12/15 5:24 AM:
-------------------------------------------------------------

Got another error today on a 4-shard setup where each shard has 2 replicas (8 nodes in total).

On shard 4/replica 1 I see the following error: [^replica1.png]
On shard 4/replica 2, the following: [^replica2.png]

Here's the backtrace for the error in the first screenshot:
{code}
java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:196)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
	at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{code}

After all this, replica 1 shows:
{quote}
numDocs: 28 215 608
{quote}

And replica 2 shows:
{quote}
numDocs: 28 215 609
{quote}
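
For anyone who wants to reproduce this comparison without the admin UI, here is a minimal sketch that asks each core for its own count. It relies on the `distrib=false` request parameter so that the query is answered only by the core it is sent to, and it treats `numFound` for `*:*` as equivalent to the numDocs figure above. The function names are made up for illustration, and any core URL passed to `fetch_count` is a placeholder for one replica of the shard.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query that is answered by this core only."""
    params = {"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def parse_num_found(body):
    """Extract numFound from the JSON body of a Solr select response."""
    return json.loads(body)["response"]["numFound"]


def fetch_count(core_url, timeout=10):
    """Query one core over HTTP and return its local document count."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return parse_num_found(resp.read().decode("utf-8"))


def mismatched(counts):
    """Return True when the replicas of one shard disagree on the count."""
    return len(set(counts.values())) > 1


# The two counts reported above, keyed by a label of our own choosing.
counts = {"shard4/replica1": 28215608, "shard4/replica2": 28215609}
print("MISMATCH" if mismatched(counts) else "in sync")
```

In practice the `counts` dictionary would be filled with `fetch_count(url)` for each replica URL, ideally after a commit and with indexing stopped, so that both cores have had a chance to settle.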

Everything worked well for a few months until yesterday, when we started to reindex some data (about 1.7 million records).

Our Solr setup uses large pages and there are enough resources. Here's how we run the instances:
{code}
exec chpst -u solr java -Xms6G -Xmx8G -XX:+UseConcMarkSweepGC -XX:+UseLargePages -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=75 -DzkHost=zoo5.devops:2181,zoo4.devops:2181,zoo1.devops:2181,zoo2.devops:2181,zoo3.devops:2181 -Dcollection.configName=Carmen -Dbootstrap_confdir=./solr/conf -Dbootstrap_conf=true -DnumShards=4 -jar start.jar etc/jetty.xml
{code}

The server has 16 CPU cores and an SSD RAID 10, and the load average is usually between 2 and 3. The charts also don't show anything suspicious in the server load; it is very stable.

So it seems like something went wrong during recovery after the network error. I'm not sure how to debug this further or what those warnings in the log mean, for example the last 2 messages in the first screenshot, from DistributedUpdateProcessor and CoreAdminHandler.
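
One possible next debugging step is to find out which document the two replicas disagree on. The sketch below is only an illustration under several assumptions: that the uniqueKey field is called `id`, that the Solr version supports `cursorMark` deep paging (available from 4.7 on), and that the ids of one core fit in memory, which for roughly 28 million documents may need a few GB. `iter_ids` and `diff_ids` are hypothetical helper names, not part of Solr or SolrJ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get_json(url, timeout=30):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_ids(core_url, id_field="id", rows=10000, get_json=http_get_json):
    """Yield every value of id_field held by a single core, page by page."""
    cursor = "*"  # "*" asks Solr for the first page
    while True:
        params = {
            "q": "*:*",
            "fl": id_field,
            "sort": id_field + " asc",  # cursorMark needs a sort on the uniqueKey
            "rows": rows,
            "wt": "json",
            "distrib": "false",         # answer from this core only
            "cursorMark": cursor,
        }
        data = get_json(core_url.rstrip("/") + "/select?" + urlencode(params))
        for doc in data["response"]["docs"]:
            yield doc[id_field]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # the cursor stopped moving: no more pages
            return
        cursor = next_cursor


def diff_ids(ids_a, ids_b):
    """Return (ids only in a, ids only in b), each sorted."""
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a)
```

Calling `diff_ids(iter_ids(replica1_url), iter_ids(replica2_url))` with the two core URLs of shard 4 should then list the ids that exist on only one side. The `get_json` parameter exists so the paging logic can be exercised with canned responses before pointing it at a live node.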


> No data integrity between replicas
> ----------------------------------
>
>                 Key: SOLR-6875
>                 URL: https://issues.apache.org/jira/browse/SOLR-6875
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10.2
>         Environment: One replica is @ Linux solr1.devops.wegohealth.com 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> Another replica is @ Linux solr2.devops.wegohealth.com 3.16.0-23-generic #30-Ubuntu SMP Thu Oct 16 13:17:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> Solr is running with the following options:
> * -Xms12G
> * -Xmx16G
> * -XX:+UseConcMarkSweepGC
> * -XX:+UseLargePages
> * -XX:+CMSParallelRemarkEnabled
> * -XX:+ParallelRefProcEnabled
> * -XX:+UseLargePages
> * -XX:+AggressiveOpts
> * -XX:CMSInitiatingOccupancyFraction=75
>            Reporter: Alexander S.
>         Attachments: replica1.png, replica2.png
>
>
> Setup: SolrCloud with 2 shards, each with 2 replicas, 4 nodes in total.
> Indexing is stopped; one replica of a shard (Solr1) shows 45 574 039 docs, and the other (Solr1.1) shows 45 574 038 docs.
> Solr1 is the leader; these errors appeared in the logs:
> {code}
> ERROR - 2014-12-20 09:54:38.783; org.apache.solr.update.StreamingSolrServers$1; error
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.787; org.apache.solr.update.processor.DistributedUpdateProcessor; Error sending update
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.813; org.apache.solr.cloud.ZkController; Leader is publishing core=crm-prod coreNodeName =10.128.209.232:8081_solr_crm-prod state=down on behalf of un-reachable replica http://10.128.209.232:8081/solr/crm-prod/; forcePublishState? false
> ERROR - 2014-12-20 09:54:38.818; org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to try to start recovery on replica http://10.128.209.232:8081/solr/crm-prod/ after: java.net.SocketException: Connection reset
> {code}
> On Solr1.1:
> {code}
> WARN  - 2014-12-20 09:54:38.854; org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for core=crm-prod coreNodeName=10.128.209.232:8081_solr_crm-prod
> {code}
> Index optimization was running at that time.
> It was not a system crash: the server is up and was running smoothly with plenty of available resources (lots of CPU, available RAM, and a very fast SSD RAID). So whatever happened, Solr should recover properly, as e.g. MySQL does.
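
To see how the events on the leader and the replica line up in time, the log excerpts quoted above can be reduced to a short timeline. The following is a rough sketch that assumes the log line layout seen in those excerpts (`LEVEL - timestamp; logger; message`); the regular expression and the `parse_events` helper are illustrative and would need adjusting for a different log4j pattern.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# WARN  - 2014-12-20 09:54:38.854; org.apache.solr.cloud.RecoveryStrategy; ...
LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+-\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}); "
    r"(?P<logger>[\w.$]+); "
    r"(?P<msg>.*)$"
)


def parse_events(lines):
    """Return (timestamp, level, short logger name, message) tuples, sorted by time."""
    events = []
    for line in lines:
        # Drop a leading "> " quote prefix and surrounding whitespace, if any.
        m = LINE.match(line.lstrip("> ").strip())
        if m is None:
            continue  # exception lines and stack frames carry no timestamp
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        short = m.group("logger").rsplit(".", 1)[-1]
        events.append((ts, m.group("level"), short, m.group("msg")))
    return sorted(events)


# A few lines copied from the excerpts above (leader first, then Solr1.1).
SAMPLE = [
    "ERROR - 2014-12-20 09:54:38.783; org.apache.solr.update.StreamingSolrServers$1; error",
    "java.net.SocketException: Connection reset",
    "        at java.net.SocketInputStream.read(SocketInputStream.java:196)",
    "WARN  - 2014-12-20 09:54:38.787; org.apache.solr.update.processor.DistributedUpdateProcessor; Error sending update",
    "WARN  - 2014-12-20 09:54:38.854; org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for core=crm-prod coreNodeName=10.128.209.232:8081_solr_crm-prod",
]

events = parse_events(SAMPLE)
start = events[0][0]
for ts, level, logger, msg in events:
    offset_ms = (ts - start) // timedelta(milliseconds=1)
    print(f"+{offset_ms:>4} ms  {level:<5} {logger}: {msg}")
```

Merging the full logs of both nodes this way only gives a meaningful ordering if the clocks of the two servers are synchronized closely enough, which is an assumption here.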


