Posted to commits@cassandra.apache.org by "Peter Schuller (JIRA)" <ji...@apache.org> on 2010/10/09 15:21:33 UTC
[jira] Commented: (CASSANDRA-1597) cassandra start-up seek-bound in uninterruptable sleep

    [ https://issues.apache.org/jira/browse/CASSANDRA-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919495#action_12919495 ] 

Peter Schuller commented on CASSANDRA-1597:
-------------------------------------------

My apologies for the unnecessary traffic. The explanation was right there in my own ramblings above (about rm). It was actually in uninterruptible sleep due to the removal of orphan files:

0/10/09 11:38:34 INFO db.ColumnFamilyStore: Removing orphan /var/lib/cassandra/data/sporebench/sporebench-tmp-c-10204-Index.db
Service killed by signal 9

In other words, to avoid this, run Cassandra on a file system where removing large files is cheaper. Presumably xfs and ext4 should both be good.
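
For reference, one way to confirm that a process really is in uninterruptible sleep is to read its state from /proc. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function names are made up for illustration and are not part of Cassandra. A state of "D" means uninterruptible sleep, which is why a pending SIGKILL is not acted on until the blocking operation returns.

```python
def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from the contents of /proc/<pid>/stat.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ')', so split on the last ')' and take the
    first field after it.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Read the current state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


def is_uninterruptible(pid: int) -> bool:
    """True if the process is in state 'D' at the moment of the read."""
    return process_state(pid) == "D"
```

This is only a snapshot of one instant. To tell "constantly in D state" apart from "briefly in D state", the read would have to be repeated over time.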

> cassandra start-up seek-bound in uninterruptable sleep
> ------------------------------------------------------
>
>                 Key: CASSANDRA-1597
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1597
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Priority: Minor
>
> (Cassandra trunk from a few months ago, Linux 2.6.32)
> I just restarted a Cassandra node with a large (a few TB) column family. I was expecting index sampling to take a while, but in this case I'm seeing something unexpected. First, the log has:
> 10/10/09 11:15:49 INFO config.DatabaseDescriptor: Auto DiskAccessMode determined to be mmap
> 10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for /var/lib/cassandra/data/system/Schema-c-1-<>
> 10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for /var/lib/cassandra/data/system/Migrations-c-1-<>
> 10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for /var/lib/cassandra/data/system/LocationInfo-c-1-<>
> 10/10/09 11:15:50 INFO config.DatabaseDescriptor: Loading schema version d8ad1485-a6e4-11df-beb1-d1273d6ae3d0
> And it is currently stuck with iostat looking like this:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00    67.00  146.00    2.00   584.00   276.00    11.62     0.99    6.68   6.68  98.80
> I re-checked the code and the index sampling really should not be seek bound since it is sequential. At first I figured that perhaps some aspect of the I/O pattern is causing the kernel (and I guess underlying RAID+drives) to not do read-ahead. This could explain why lots of small reads, even though they are sequential in nature, could end up being seek bound.
> mmap() has the problem that you cannot strace to see what is doing I/O, so I wanted to confirm with jstack. At this point I noticed that jstack would not attach at all (it timed out).
> Further, a SIGKILL failed to kill Java: iostat kept showing the same seek-bound pattern, and the process stayed in uninterruptible sleep (presumably constantly, or else the SIGKILL should have worked).
> Somewhere along the way, strace began failing to attach as well, but as far as I remember I did not see any I/O calls when it did attach properly. That is consistent with whatever I/O is happening being triggered by access to mmap():ed memory, or else by some other long-running syscall.
> A very speculative hypothesis is that the initial mmap() call can be long, blocking, and seek-bound because inodes are being consulted on disk (compare with the cost of rm-ing large files on ext3fs). If so, it could take quite a long time. On the other hand, if this were the behavior of mmap() on ext3fs, it seems unlikely that it would not be widely known already. It also seems unlikely a priori, since I don't see a good reason why mmap() would need to scan inodes.
> I am currently waiting it out to see if/when it eventually gets killed.
> I will then try restarting the node in standard I/O mode and see if that works. If it does, I'll retry in mmap mode to see whether I can trigger it again.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.