
Posted to user@cassandra.apache.org by John Pyeatt <jo...@singlewire.com> on 2013/12/02 23:59:38 UTC

Stack trace from a node during a repair

We are running a 6-node AWS EC2 (m1.large) cluster of cassandra 1.2.9
across three availability zones with Ec2Snitch and NetworkTopologyStrategy.

One of our nodes was apparently sharing a physical box with another
customer who was really hogging the IO. So we needed to bring the node up
on a new ec2 instance.

We decommissioned the offending node, killed the instance and brought a new
instance into the cluster. Everything went fine up to that point.

After it came up, I ran nodetool repair -pr on each of the nodes in the
cluster, one node at a time. During the repair on the new node, the gossip
service shut down three times. A copy of the stack trace we received is at
the bottom of this email.

It says it couldn't create a backups directory. I have no idea why this
would be: the /data-1 partition is 400 GB in size and currently 1% utilized.
Does anyone have any idea what could be causing this?

My /etc/security/limits.conf file currently has:
# resource settings added based on
# http://www.datastax.com/docs/1.2/install/recommended_settings
* soft nofile 65536
* hard nofile 65536
root soft nofile 65536
root hard nofile 65536
* soft memlock unlimited
* hard memlock unlimited
root soft memlock unlimited
root hard memlock unlimited
* soft as unlimited
* hard as unlimited
root soft as unlimited
root hard as unlimited



ERROR 2013-12-02 21:02:25,711 [Thread-3050] CassandraDaemon Exception in thread Thread[Thread-3050,5,main]
FSWriteError in /data-1/cassandra/data/SinglewireSupport/Binaries/backups
        at org.apache.cassandra.db.Directories.getOrCreate(Directories.java:483)
        at org.apache.cassandra.db.Directories.getBackupsDirectory(Directories.java:242)
        at org.apache.cassandra.db.DataTracker.maybeIncrementallyBackup(DataTracker.java:165)
        at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:237)
        at org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:911)
        at org.apache.cassandra.streaming.StreamInSession.closeIfFinished(StreamInSession.java:186)
        at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:138)
        at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:238)
        at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:178)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
Caused by: java.io.IOException: Unable to create directory /data-1/cassandra/data/SinglewireSupport/Binaries/backups


-- 
John Pyeatt
Singlewire Software, LLC
www.singlewire.com
------------------
608.661.1184
john.pyeatt@singlewire.com

Re: Stack trace from a node during a repair

Posted by John Pyeatt <jo...@singlewire.com>.

This is running the Amazon Linux OS, which I believe is essentially CentOS 6.

java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

Installed cassandra 1.2.9 from
http://archive.apache.org/dist/cassandra/1.2.9/apache-cassandra-1.2.9-bin.tar.gz
then ran tar -xzf to a directory, changed the .yaml file a bit (using
vnodes, set concurrent_writes to 16), and changed cassandra-env.sh to set
MAX_HEAP_SIZE to 3G and HEAP_NEWSIZE=200M.







Re: Stack trace from a node during a repair

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Dec 3, 2013 at 6:19 AM, John Pyeatt <jo...@singlewire.com> wrote:

> Then my issue must be the 0.000001% because
>
> 1) I'm running the repair as root.
>

Huh? Repair doesn't care what user your shell is. It is a process built
into cassandra and has the permissions that cassandra does?

> 2) The directory exists and the permissions are appropriate. root:root 755
>

Why are you running Cassandra as root?

> 3) The three times it occurred during the repair it always complained
> about backups directories. But there are dozens of other backups
> directories that were created during the repair that caused no exceptions.
>

Cassandra doesn't have a lot of chances to mess up while creating
directories. This appears to be one of them.

> The biggest issue with this is that it shuts down gossip.
>

That sounds like rather a serious issue, and hints towards a potential
common cause : too many open files?

To rule out other potential causes of issue :

- what o/s?
- what JVM?
- how have you installed cassandra?
- what version of cassandra?

=Rob

Re: Stack trace from a node during a repair

Posted by John Pyeatt <jo...@singlewire.com>.

Both cassandra and nodetool are running as root.

Also, here is the output of ulimit -a:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 59450
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
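
Note that ulimit -a reports the limits of the shell it is run in, which are
not necessarily the limits the already-running Cassandra JVM was started
with. A minimal sketch, assuming a Linux host with /proc mounted and Python
available (neither is confirmed in this thread), that reads the effective
"Max open files" limit from /proc/<pid>/limits and counts the descriptors
the process currently holds. The function names and the pid argument are
illustrative only:

```python
import os


def parse_open_files_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because a limit may be 'unlimited'.
    Returns None if the line is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    return None


def open_fd_count(pid):
    """Count the file descriptors currently open in process `pid` (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def report(pid):
    """Print the descriptor usage of `pid` next to its effective limit."""
    with open("/proc/%d/limits" % pid) as handle:
        limit = parse_open_files_limit(handle.read())
    print("open fds: %d, (soft, hard) limit: %s" % (open_fd_count(pid), limit))
```

Calling report() with the pid of the Cassandra JVM while a repair is
streaming would show whether the descriptor count is anywhere near the
limit. Reading another process's /proc entries generally requires running
as that process's user or as root.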





Re: Stack trace from a node during a repair

Posted by Hannu Kröger <hk...@gmail.com>.

Hi,

Are you running nodetool or cassandra as root? I think it doesn't really
matter what user is running the nodetool. Those directories should be
writable by the user who is running the actual cassandra process.

Hannu
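
A minimal sketch of that check, assuming Python is available on the node
and the script is run as the same user as the Cassandra process. The
function name is hypothetical, and the path to inspect would be the parent
of the directory that failed to be created:

```python
import os
import stat


def describe_directory(path):
    """Summarise ownership and permissions of `path` for the calling user."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        # Permission bits only, e.g. '0o755'
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # Creating a subdirectory needs both write and search (execute)
        # permission on the parent directory.
        "writable_by_caller": os.access(path, os.W_OK | os.X_OK),
    }
```

os.access() evaluates the real user ID of the caller, so the result is only
meaningful when the script runs as the user that owns the Cassandra
process.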



Re: Stack trace from a node during a repair

Posted by John Pyeatt <jo...@singlewire.com>.

Then my issue must be the 0.000001%, because:

1) I'm running the repair as root.
2) The directory exists and the permissions are appropriate. root:root 755
3) The three times it occurred during the repair it always complained about
backups directories. But there are dozens of other backups directories that
were created during the repair that caused no exceptions.

The biggest issue with this is that it shuts down gossip.



Re: Stack trace from a node during a repair

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Dec 2, 2013 at 2:59 PM, John Pyeatt <jo...@singlewire.com> wrote:

> Caused by: java.io.IOException: Unable to create directory
> /data-1/cassandra/data/SinglewireSupport/Binaries/backups
>

This is an exception directly from a core java method. The cause is
99.99999% likely to be permissions.

=Rob
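
The Java exception only says that the directory could not be created, not
which OS error was behind it. A minimal sketch, assuming Python is available
on the node and is run as the same user as Cassandra, that attempts the same
directory creation and reports the errno name, which distinguishes a
permissions failure (EACCES) from, for example, a read-only or full
filesystem (EROFS, ENOSPC). The function name is hypothetical:

```python
import errno
import os


def try_create_directory(path):
    """Attempt to create `path` (and missing parents); describe the outcome."""
    if os.path.isdir(path):
        return "exists"
    try:
        os.makedirs(path)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. EACCES.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return "failed: %s (%s)" % (name, os.strerror(exc.errno))
    return "created"
```

Note that a successful call really does create the directory, so it is best
pointed at a scratch path under the same parent rather than at the backups
directory itself.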