Posted to commits@cassandra.apache.org by "Philip Thompson (JIRA)" <ji...@apache.org> on 2015/03/19 15:52:39 UTC

[jira] [Resolved] (CASSANDRA-8620) Bootstrap session hanging indefinitely

     [ https://issues.apache.org/jira/browse/CASSANDRA-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Thompson resolved CASSANDRA-8620.
----------------------------------------
    Resolution: Cannot Reproduce

> Bootstrap session hanging indefinitely
> --------------------------------------
>
>                 Key: CASSANDRA-8620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8620
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 7, Oracle JDK 1.7.0_51, AWS + GCE
>            Reporter: Adam Horwich
>
> Hi! We have been running a relatively small 2.1.2 cluster across two DCs for a few months, with ~100GB load per node and RF=3, and over the last few weeks we have been trying to scale up capacity.
> We've recently been seeing scenarios in which the Bootstrap or Unbootstrap streaming process hangs indefinitely for one or more sessions on the receiver, without any stack trace or exception. This does not happen every time, and we do not get into this state with the same sender every time. When the receiver is in a hung state, the following can be found in the thread dump:
> The Stream-IN thread for one or more sessions is blocked in the following state: 
> Thread 24942: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Compiled frame)
>  - java.util.concurrent.ArrayBlockingQueue.take() @bci=20, line=374 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedInputStream.read() @bci=31, line=89 (Compiled frame)
>  - java.io.DataInputStream.readUnsignedShort() @bci=4, line=337 (Compiled frame)
>  - org.apache.cassandra.utils.BytesReadTracker.readUnsignedShort() @bci=4, line=140 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readShortLength(java.io.DataInput) @bci=1, line=317 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(java.io.DataInput) @bci=2, line=327 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=5, line=397 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=2, line=381 (Compiled frame)
>  - org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(java.io.DataInput, org.apache.cassandra.db.ColumnSerializer$Flag, int, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=10, line=75 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=25, line=52 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=1, line=46 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.tryToComputeNext() @bci=9, line=143 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.hasNext() @bci=61, line=138 (Compiled frame)
>  - org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(org.apache.cassandra.db.DecoratedKey, org.apache.cassandra.config.CFMetaData, java.io.DataInput, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=320, line=283 (Compiled frame)
>  - org.apache.cassandra.streaming.StreamReader.writeRow(org.apache.cassandra.io.sstable.SSTableWriter, java.io.DataInput, org.apache.cassandra.db.ColumnFamilyStore) @bci=26, line=157 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedStreamReader.read(java.nio.channels.ReadableByteChannel) @bci=258, line=89 (Compiled frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=69, line=48 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=4, line=38 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.StreamMessage.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=37, line=55 (Interpreted frame)
>  - org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run() @bci=24, line=245 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
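> To illustrate the blocking pattern above, here is a minimal sketch (not the actual Cassandra source; the class and field names are illustrative): read() pulls compressed chunks from a bounded queue that a separate network-reader thread is supposed to fill, and take() has no timeout, so if the sender stops delivering data the Stream-IN thread parks indefinitely, exactly as in the stack trace.
> import java.io.IOException;
> import java.io.InputStream;
> import java.nio.ByteBuffer;
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.BlockingQueue;
>
> class BlockingChunkInputStream extends InputStream {
>     // Filled by a separate network-reader thread; names are illustrative.
>     private final BlockingQueue<ByteBuffer> dataBuffer = new ArrayBlockingQueue<>(16);
>     private ByteBuffer current;
>
>     @Override
>     public int read() throws IOException {
>         if (current == null || !current.hasRemaining()) {
>             try {
>                 // No timeout: if no further chunk ever arrives, this parks
>                 // the calling (Stream-IN) thread forever.
>                 current = dataBuffer.take();
>             } catch (InterruptedException e) {
>                 Thread.currentThread().interrupt();
>                 throw new IOException(e);
>             }
>         }
>         return current.get() & 0xFF;
>     }
> }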
> Debug logging shows that the receiver is still reading the file it is receiving from the sender and has not yet sent an ACK.
> The receiver is waiting for more data to finish writing its row, and the sender is not sending any more data. On both nodes a large amount of data (~5MB) is stuck in the TCP queues: the Recv-Q on the receiver and the Send-Q on the sender.
> We've been trying to diagnose this issue internally, but it's difficult to create a reliably reproducible scenario. We have so far found that restarting all nodes in the cluster and ensuring that a cleanup has been performed helps mitigate the problem (though a cleanup without restart can still result in a hung state). However, it's unclear to me why either of these would affect the streaming process in the way we've observed. One theory is that the calculated section sizes are inaccurate.
> We tried setting a timeout on the dataBuffer read (moving take() to poll()), which forced a retry of the file, but the same transfer failed again on each attempt until retries were exhausted.
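> For reference, that experiment looked roughly like the sketch below (illustrative names, using the same hypothetical dataBuffer as in the earlier sketch; this is not a patch we are proposing): replacing the unbounded take() with poll(timeout) turns a stalled stream into an IOException, which fails the file and triggers a retry instead of hanging.
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.TimeUnit;
>
> class ChunkPoller {
>     static ByteBuffer nextChunk(BlockingQueue<ByteBuffer> dataBuffer) throws IOException {
>         try {
>             // poll() instead of take(): give up after 60 seconds so the
>             // session can fail the file and retry rather than hang forever.
>             ByteBuffer chunk = dataBuffer.poll(60, TimeUnit.SECONDS);
>             if (chunk == null)
>                 throw new IOException("timed out waiting for the next compressed chunk");
>             return chunk;
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>             throw new IOException(e);
>         }
>     }
> }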
> From a heap dump taken in the deadlocked state we confirmed the following:
> BytesReadTracker.bytesRead = 100477411
> Length of the section being read (which happens to be the last section) = 100477411
> We also wrote all the data in the buffer to a text file and observed that the data in the buffer ended halfway through a row.
> The stream reader checks the number of bytes read after importing each row (StreamReader line 97) to detect the end of the section; however, it does not expect the data to end halfway through a row. When that happens, the thread blocks in OnDiskAtom.deserializeFromSSTable, in any of the various deserialization calls.
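> To make that loop structure concrete, here is an illustrative sketch of the shape described above (not the actual StreamReader code; the Tracker and RowSink interfaces are stand-ins): the end-of-section check runs only between rows, so if the stream dries up partway through a row the call blocks deep inside the deserialization and the check is never reached.
> import java.io.DataInput;
> import java.io.IOException;
>
> class SectionLoopSketch {
>     interface Tracker { long bytesRead(); }                                // stand-in for BytesReadTracker
>     interface RowSink { void writeRow(DataInput in) throws IOException; }  // stand-in for StreamReader.writeRow
>
>     static void drainSection(DataInput in, long sectionLength, Tracker tracker, RowSink sink) throws IOException {
>         // End-of-section is detected only here, between rows.
>         while (tracker.bytesRead() < sectionLength) {
>             // If the data ends halfway through a row, this call blocks inside
>             // the cell/atom deserialization and control never returns here.
>             sink.writeRow(in);
>         }
>     }
> }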



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)