Posted to dev@lucene.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2013/01/04 16:48:13 UTC

[jira] [Updated] (SOLR-4038) SolrCloud indexing blocks if shard is marked as DOWN

     [ https://issues.apache.org/jira/browse/SOLR-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated SOLR-4038:
--------------------------------

    Summary: SolrCloud indexing blocks if shard is marked as DOWN  (was: SolrCloud indexing blocks if node is recovering)

I checked whether I could still reproduce this behaviour two months and many commits later, and it is still present. However, indexing is not blocked while the shards on the node are recovering, but while they are marked as DOWN.

We ungracefully killed one machine holding two shards while indexing from Hadoop with CloudSolrServer; the shards were then marked as GONE. After a while we started the node again and it was marked as DOWN. In this state the cluster did not accept incoming updates. Only after a minute or so did the node finally start recovery on both shards, and indexing resumed.
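For reference, a minimal SolrJ sketch of this indexing path; the ZooKeeper hosts are taken from the logs below, while the collection and field names are placeholders. While any shard of the target collection is marked DOWN, the add() calls block instead of completing against the live replicas:

{code}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWhileShardDown {
  public static void main(String[] args) throws Exception {
    // ZooKeeper ensemble as seen in the logs; the collection name is a placeholder.
    CloudSolrServer server = new CloudSolrServer("rot1.zkserver:2181,rot2.zkserver:2181");
    server.setDefaultCollection("collection1");

    for (int i = 0; i < 1000000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("text_t", "document body " + i);
      // While a shard of the collection is marked DOWN this call blocks;
      // it only returns once the shard begins recovery and goes ACTIVE again.
      server.add(doc);
    }
    server.commit();
    server.shutdown();
  }
}
{code}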
                
> SolrCloud indexing blocks if shard is marked as DOWN
> ----------------------------------------------------
>
>                 Key: SOLR-4038
>                 URL: https://issues.apache.org/jira/browse/SOLR-4038
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Markus Jelsma
>             Fix For: 4.2, 5.0
>
>
> See: http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html
> While indexing (without CloudSolrServer at that time) one node died with an OOME, perhaps because of the linked issue SOLR-4032. The OOME stack traces vary, but here are some ZK-related log entries that appeared between them:
> {code}
> 2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
> 2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e
> 2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 8.0 seconds before trying to recover again (3)
> 2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : numShards not found on descriptor - reading it from system property
> 2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting Replication Recovery. core=shard_e
> 2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)
> 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_e
> 2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
> 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. core=shard_e
> 2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index
> 2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; maxCacheMB=48.0 maxMergeSizeMB=4.0)
> 2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket connection and attempting reconnect
> 2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : org.apache.solr.common.SolrException: Ping query caused exception: Java heap space
> .....
> .....
> 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : Watcher org.apache.solr.common.cloud.ConnectionManager@305e7020 name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
> 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : zkClient has disconnected
> 2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:java.lang.OutOfMemoryError: Java heap space
> .....
> .....
> 2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
> 2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e
> 2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 16.0 seconds before trying to recover again (4)
> 2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121102141424591
> 2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index
> 2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _773.tvf completely. Downloaded 246
> 415360!=562327645
> .....
> .....
> {code} 
> At this point indexing has already been blocked. Some nodes do not write anything to the logs and the two surrounding nodes are still busy with replication. Most nodes show an increased number of state changes:
> {code}
> 2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)
> {code}
> For about 10 minutes little happens: the surrounding nodes keep attempting replication and the other nodes mostly log state changes, before documents are indexed again. At that point I stopped the node.
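>
> To confirm the shard state during the blocked window, something like the following SolrJ sketch can be used (assuming the 4.x cloud classes ClusterState/Slice/Replica; the collection name is a placeholder):
> {code}
> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> import org.apache.solr.common.cloud.ClusterState;
> import org.apache.solr.common.cloud.Replica;
> import org.apache.solr.common.cloud.Slice;
> import org.apache.solr.common.cloud.ZkStateReader;
>
> public class PrintShardStates {
>   public static void main(String[] args) throws Exception {
>     CloudSolrServer server = new CloudSolrServer("rot1.zkserver:2181,rot2.zkserver:2181");
>     server.connect();
>     ClusterState state = server.getZkStateReader().getClusterState();
>     // "collection1" is a placeholder for the real collection name.
>     for (Slice slice : state.getSlices("collection1")) {
>       for (Replica replica : slice.getReplicas()) {
>         // Prints e.g. "shard_e core_node1 down" while the restarted node has
>         // registered itself but has not yet begun recovery.
>         System.out.println(slice.getName() + " " + replica.getName() + " "
>             + replica.getStr(ZkStateReader.STATE_PROP));
>       }
>     }
>     server.shutdown();
>   }
> }
> {code}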

