Posted to solr-user@lucene.apache.org by forest_soup <ta...@gmail.com> on 2016/07/25 02:04:48 UTC

Downgraded Raid5 cause endless recovery and hang.

We have a 5-node SolrCloud. When a Solr node's disk had an issue and its RAID5
array became degraded, a recovery was triggered on that node. But the recovery
hangs, and the node disappears from the live_nodes list.

Could anyone help explain why this happens? Thanks!

The only meaningful call stacks are:
"zkCallback-4-thread-50-processing-n:sgdsolar17.swg.usma.ibm.com:8983_solr-EventThread"
#7791 daemon prio=5 os_prio=0 tid=0x00007f7e26467800 nid=0x4df7 waiting on
condition [0x00007f7e01adf000]
java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f8315800070> (a
java.util.concurrent.FutureTask)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at
org.apache.solr.update.DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:349)
	- locked <0x00007f7fd0cefd28> (a java.lang.Object)
	at
org.apache.solr.core.CoreContainer.cancelCoreRecoveries(CoreContainer.java:617)
	at org.apache.solr.cloud.ZkController$1.command(ZkController.java:295)
	at
org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:158)
	at
org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
	at
org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:132)
	at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

"updateExecutor-2-thread-620-processing-n:sgdsolar17.swg.usma.ibm.com:8983_solr
x:collection36_shard1_replica2 s:shard1 c:collection36 r:core_node1" #7779
prio=5 os_prio=0 tid=0x00007f7e8827e000 nid=0x4dea waiting on condition
[0x00007f7ed0f9f000]
java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f7fd562e860> (a
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
	at org.apache.solr.update.VersionInfo.blockUpdates(VersionInfo.java:118)
	at
org.apache.solr.update.UpdateLog.dropBufferedUpdates(UpdateLog.java:1140)
	at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:467)
	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
	at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
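For what it's worth, the two traces look like a classic wait chain. A minimal sketch of that pattern (hypothetical Python model, not Solr code; the names only mirror the Solr classes in the traces):

```python
# Hypothetical sketch of the hang pattern suggested by the two stack traces.
# The ZK reconnect callback (first trace) calls cancelRecovery and blocks on
# the recovery FutureTask; the recovery thread (second trace) is parked on a
# write lock that is never released while updates stall on the degraded array.
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

update_lock = threading.Lock()  # stands in for VersionInfo's update write lock

def recovery_task():
    # Like RecoveryStrategy.doRecovery -> UpdateLog.dropBufferedUpdates
    # -> VersionInfo.blockUpdates: parks until the lock becomes free.
    with update_lock:
        return "recovered"

executor = ThreadPoolExecutor(max_workers=1)
update_lock.acquire()  # simulates in-flight updates pinned by slow I/O
future = executor.submit(recovery_task)

def cancel_recovery(timeout):
    # Like DefaultSolrCoreState.cancelRecovery waiting on the FutureTask;
    # the real call has no timeout, so the ZK callback thread waits forever.
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return "hung"

first = cancel_recovery(0.5)  # callback thread stuck -> ZK session work stalls
update_lock.release()         # only when the slow I/O completes...
second = future.result()      # ...can recovery finish and the wait end
executor.shutdown()
print(first, second)
```

If that model matches what happened, the ZooKeeper event thread never gets back to re-registering the ephemeral node, which would explain the node vanishing from live_nodes.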


Re: Downgraded Raid5 cause endless recovery and hang.

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/24/2016 8:04 PM, forest_soup wrote:
> We have a 5 node solrcloud. When a solr node's disk had issue and
> Raid5 downgraded, a recovery on the node was triggered. But there's a
> hanging happens. The node disappears in the live_nodes list. 

In my opinion, RAID5 (and RAID6) are bad ways to handle storage.  Cost
per usable gigabyte is the only real advantage, but the performance
problems are not worth that advantage.  If you care more about capacity
than performance, then it might be OK.

Under normal circumstances (no failed disk), if you're writing to the
array at all, all I/O (both read and write) is slow.  RAID5 can have
awesome read performance, but *only* if the array is healthy and there is
no writing happening at the same time.
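To put rough numbers on the write slowdown, here's a back-of-envelope model (illustrative figures I'm assuming, not a benchmark):

```python
# Simplified model of the RAID5 small-write penalty. Each small random
# write costs 4 disk I/Os: read old data, read old parity, write new
# data, write new parity. So the array sustains roughly 1/4 of raw IOPS.
def raid5_small_write_iops(disks, iops_per_disk):
    return disks * iops_per_disk // 4

raw = 5 * 200                           # 5 disks at an assumed ~200 IOPS each
raid5 = raid5_small_write_iops(5, 200)  # ~250 write IOPS for the whole array
print(raw, raid5)
```

That is why a write-heavy workload like Solr indexing feels the pain even when the array is healthy.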

If you lose a disk, the parity reads required to reconstruct the missing
data cause REALLY bad performance.
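The cost of degraded reads falls straight out of how parity works. A small sketch, assuming byte-wise XOR parity over equal-sized blocks:

```python
# Illustrative sketch of degraded-mode RAID5 reads: a block on the failed
# disk can only be served by reading every surviving block in the stripe
# and XOR-ing them, so one logical read becomes N-1 physical reads.
from functools import reduce

def xor_parity(blocks):
    # RAID5 parity is the byte-wise XOR of all blocks in a stripe.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"disk0", b"disk1", b"disk2", b"disk3"]  # 4 data blocks in a stripe
p = xor_parity(data)                             # the parity block

# Disk 2 fails: rebuild its block from the three survivors plus parity.
rebuilt = xor_parity([data[0], data[1], data[3], p])
print(rebuilt)
```

Every read that would have hit the dead disk turns into that full-stripe reconstruction, which is where the latency goes.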

When you replace the failed disk and it is rebuilding, performance is
even worse.  The additional load is often enough to cause a second disk
to fail, which for RAID5 means the entire array is lost.

These I/O performance issues cause really big problems for Solr and
ZooKeeper.  It's no surprise to me that a degraded RAID5 array produces
the kind of hang you describe.

Thanks,
Shawn