
Posted to user@cassandra.apache.org by David McNelis <dm...@gmail.com> on 2013/08/07 15:14:27 UTC

System hints compaction stuck

Morning folks,

For the last couple of days all of my nodes (17, all running 1.2.8) have
been stuck at various percentages of completion for compacting
system.hints.  I've tried restarting the nodes (including a full rolling
restart of the cluster) to no avail.

When I turn on Debugging I am seeing this error on all of the nodes
constantly:

DEBUG 09:03:21,999 Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)


When I turn on tracing, I see that shortly after this error there is a
message similar to:
TRACE 09:03:22,000 ClientState removed for socket addr /10.55.56.211:35431

The IP in this message is sometimes a client machine, sometimes another
cassandra node with no processes other than C* running on it (which I think
rules out an issue with a particular client library doing something funny
with Thrift).
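
To see which peers are behind these disconnects, the TRACE lines can be tallied by address. This is a minimal Python sketch, assuming every such line looks like the single example quoted above (IPv4 only). The pattern and the function name count_client_state_peers are made up for illustration and are not part of Cassandra.

```python
import re
from collections import Counter

# Inferred from the one TRACE line quoted in this thread, e.g.
# "TRACE 09:03:22,000 ClientState removed for socket addr /10.55.56.211:35431"
# Adjust the pattern if the real log lines differ.
PEER_RE = re.compile(r"ClientState removed for socket addr\s*/\s*([0-9.]+):(\d+)")


def count_client_state_peers(lines):
    """Return a Counter mapping peer IP -> number of 'ClientState removed' lines."""
    peers = Counter()
    for line in lines:
        match = PEER_RE.search(line)
        if match:
            peers[match.group(1)] += 1
    return peers
```

Feeding it the lines of a log file (for example via open(path)) gives a quick view of whether the connections come from a few hosts or from everywhere.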

While I wouldn't expect a Thrift issue to cause problems with compaction,
I'm out of other ideas at the moment.  Anyone have any thoughts they could
share?

Thanks,
David

Re: System hints compaction stuck

Posted by David McNelis <dm...@gmail.com>.

FWIW, this is similar to another stuck-compaction issue that was on the list
several days ago. Once I cleared out the hints, either by removing the files
while the node was down or by running a scrub on system.hints during node
startup, the compactions cleared, and the nodes are
starting to catch up on tasks that had been blocked.
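
Before removing anything by hand, it may help to list what is actually sitting in the hints directory. This is a minimal Python sketch that only lists files and deletes nothing. It assumes the on-disk layout is <data_dir>/system/hints, where data_dir is whatever data_file_directories points to in your cassandra.yaml. Both the layout and the function name list_hint_files are assumptions for illustration.

```python
from pathlib import Path


def list_hint_files(data_dir):
    """List files under <data_dir>/system/hints without deleting anything.

    The directory layout is an assumption; check it against your own node.
    """
    hints_dir = Path(data_dir) / "system" / "hints"
    if not hints_dir.is_dir():
        return []
    return sorted(p for p in hints_dir.iterdir() if p.is_file())
```

As noted above, the files were only removed while the node was down.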

Nate, there are definitely a number of things that could be hitting
port 9160... but I was seeing the transport frame size error even between
nodes (and there was nothing running on any node other than C*)... after
switching back to sync I no longer get that error.


On Wed, Aug 7, 2013 at 2:58 PM, Nate McCall <zz...@gmail.com> wrote:

> Is there anything else on the network that could be attempting to
> connect to 9160?
>
> That is the exact error you would get when someone initiates a
> connection and sends a null byte. You can reproduce it thusly:
> echo -n 'm' | nc localhost 9160

Re: System hints compaction stuck

Posted by Nate McCall <zz...@gmail.com>.

Is there anything else on the network that could be attempting to
connect to 9160?

That is the exact error you would get when someone initiates a
connection and sends a null byte. You can reproduce it thusly:
echo -n 'm' | nc localhost 9160
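
The same probe can be sketched in Python for machines without nc. This is a rough equivalent of the one-liner above: it opens a TCP connection, sends one byte that is not a framed Thrift message, and closes. The function name send_stray_byte and its defaults are made up for illustration. Whether the server logs exactly the same exception is taken from the message above and has not been verified here.

```python
import socket


def send_stray_byte(host="localhost", port=9160, payload=b"m", timeout=5.0):
    """Open a TCP connection, send `payload`, then close the connection.

    Returns the number of bytes sent. No Thrift framing is attempted.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
    return len(payload)
```

Run it against a test node only, then check that node's debug log for the TTransportException.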


On Wed, Aug 7, 2013 at 11:11 AM, David McNelis <dm...@gmail.com> wrote:
> Nate,
>
> We had a node that was flaking on us last week and had a lot of handoffs
> fail to that node.  We ended up decommissioning that node entirely.  I can't
> find the actual error we were getting at the time (logs have been rotated
> out), but currently we're not seeing any errors there.
>
> We haven't had any schema updates recently and we are using the sync rpc
> server.  We had hsha turned on for a while, but we were getting a bunch of
> transport frame size errors.

Re: System hints compaction stuck

Posted by David McNelis <dm...@gmail.com>.

Nate,

We had a node that was flaking on us last week and had a lot of handoffs
fail to that node.  We ended up decommissioning that node entirely.  I
can't find the actual error we were getting at the time (logs have been
rotated out), but currently we're not seeing any errors there.

We haven't had any schema updates recently and we are using the sync rpc
server.  We had hsha turned on for a while, but we were getting a bunch of
transport frame size errors.
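
When comparing nodes, it can be handy to confirm which RPC server each one is actually configured with. This is a minimal Python sketch that reads the uncommented rpc_server_type line out of the text of a cassandra.yaml file. It uses plain line matching rather than a YAML parser, and the function name read_rpc_server_type is made up for illustration.

```python
import re


def read_rpc_server_type(yaml_text, default=None):
    """Return the first uncommented rpc_server_type value, or `default`."""
    for line in yaml_text.splitlines():
        # Lines starting with '#' do not match, so commented-out settings are skipped.
        match = re.match(r"^\s*rpc_server_type\s*:\s*(\S+)", line)
        if match:
            return match.group(1).strip("'\"")
    return default
```

Pass it the contents of the file, for example read_rpc_server_type(open(path).read()), where path is wherever your cassandra.yaml lives.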


On Wed, Aug 7, 2013 at 1:55 PM, Nate McCall <zz...@gmail.com> wrote:

> Thrift and ClientState are both unrelated to hints.
>
> What do you see in the logs after "Started hinted handoff for
> host:..." from HintedHandoffManager?
>
> It should either have an error message or something along the lines of
> "Finished hinted handoff of:..."
>
> Were there any schema updates that preceded this happening?
>
> As for the thrift stuff, which rpc_server_type are you using?

Re: System hints compaction stuck

Posted by Nate McCall <zz...@gmail.com>.

Thrift and ClientState are both unrelated to hints.

What do you see in the logs after "Started hinted handoff for
host:..." from HintedHandoffManager?

It should either have an error message or something along the lines of
"Finished hinted handoff of:..."
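
A quick way to answer that question across a large log is to count the two messages and collect any errors that appear while a handoff is still open. This is a minimal Python sketch that matches only the two prefixes quoted above. The rest of each log line's format is not assumed, and the function name summarize_handoffs is made up for illustration.

```python
STARTED = "Started hinted handoff for"
FINISHED = "Finished hinted handoff of"


def summarize_handoffs(lines):
    """Count started/finished handoff messages in an iterable of log lines.

    ERROR lines are collected only while more handoffs have started than finished.
    """
    started = 0
    finished = 0
    errors = []
    for line in lines:
        if STARTED in line:
            started += 1
        elif FINISHED in line:
            finished += 1
        elif started > finished and "ERROR" in line:
            errors.append(line.rstrip("\n"))
    return {"started": started, "finished": finished, "errors": errors}
```

If started stays well ahead of finished and the errors list is empty, the handoffs are probably stalling silently rather than failing.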

Were there any schema updates that preceded this happening?

As for the thrift stuff, which rpc_server_type are you using?



On Wed, Aug 7, 2013 at 6:14 AM, David McNelis <dm...@gmail.com> wrote:
> Morning folks,
>
> For the last couple of days all of my nodes (17, all running 1.2.8) have
> been stuck at various percentages of completion for compacting system.hints.
> I've tried restarting the nodes (including a full rolling restart of the
> cluster) to no avail.