Posted to user@cassandra.apache.org by David CHARBONNIER <Da...@rgsystem.com> on 2015/06/12 15:08:24 UTC

Connection reset during repair service

Hi,

We're using Cassandra 2.0.8.39 through Datastax Enterprise 4.5.1 and we're experiencing issues with OPSCenter (version 5.1.3) Repair Service.
When Repair Service is running, we can see repair timing out on a few ranges in OPSCenter's event log viewer. See screenshot attached.

On our Cassandra nodes, we can see a lot of these messages in the cassandra/system.log log file while a timeout shows up in OPSCenter:

                ERROR [Native-Transport-Requests:3372] 2015-06-12 02:22:33,231 ErrorMessage.java (line 222) Unexpected exception during request
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(Unknown Source)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.read(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

You'll find attached an extract of the system.log file with some more information.

Do you have any idea of what's happening?

We suspect the timeouts happen because some of our tables contain many tombstones and a tombstone warning is sometimes triggered. We have edited the configuration to allow this, so queries keep running (with a warning) until they encounter 1,000,000 tombstones.
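
For reference, the two thresholds involved live in cassandra.yaml; the values below are the 2.0 defaults, and the 1,000,000 figure above would correspond to a raised tombstone_failure_threshold:

        tombstone_warn_threshold: 1000        # logs a WARN like the one shown below
        tombstone_failure_threshold: 100000   # queries are aborted past this many tombstones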

During compactions, we also see warning messages telling us that we have a lot of tombstones:

                WARN [CompactionExecutor:1584] 2015-06-11 19:22:24,904 SliceQueryFilter.java (line 225) Read 8640 live and 8904 tombstoned cells in rgsupv.event_data (see tombstone_warn_threshold). 10000 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}

Do you think it's related to our first problem?

Our cluster is configured as follows:

-          8 nodes with Debian 7.8 x64

-          16 GB of memory and 4 CPUs

-          2 HDDs: 1 for the system and the other for the data directory

Best regards,


David CHARBONNIER

Sysadmin

T : +33 411 934 200

david.charbonnier@rgsystem.com


ZAC Aéroport

125 Impasse Adam Smith

34470 Pérols - France

www.rgsystem.com







Re: Connection reset during repair service

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Regarding the DataStax repair service, I saw the same error over here.

Here is the DataStax answer, FWIW:

"The repair service timeout message is telling you that the service has not
received a response from the nodetool repair process running on Cassandra
within the configured (default) 3600 seconds. When this happens, the
Opscenter repair service stops monitoring the progress and places the sub
range repair request to the back of a queue to be re-run at a later time.
Is not necessarily indicative of a repair failure but it does suggest that
the repair process is taking longer than expected for some reason,
typically due to a hang, network issues, or wide rows on the table being
repaired.

As a possible workaround you can increase the timeout value in opscenter by
increasing the timeout period in the opscenterd.conf or <cluster_name>.conf
(cluster takes precedence ) but if there is an underlying issue with
repairs completing on Cassandra this will not help.

single_repair_timeout = 3600

(see:
http://docs.datastax.com/en/opscenter/4.1/opsc/online_help/services/repairServiceAdvancedConfiguration.html
)."





Re: Connection reset during repair service

Posted by Sebastian Estevez <se...@datastax.com>.
Do you do a ton of random updates and deletes? That would not be a good
workload for DTCS.

Where are all your tombstones coming from?

Re: Connection reset during repair service

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi David, Edouard,

Depending on your data model on event_data, you might want to consider
upgrading to use DTCS (C* 2.0.11+).

Basically, if those tombstones are due to a constant TTL and this is a
time series, it could be a real improvement.

See:
https://labs.spotify.com/2014/12/18/date-tiered-compaction/
http://www.datastax.com/dev/blog/datetieredcompactionstrategy
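
If event_data is indeed an append-only time series with a constant TTL, the switch is a small ALTER; a sketch only, where the TTL value is illustrative and must match your real one:

        ALTER TABLE rgsupv.event_data
          WITH compaction = { 'class' : 'DateTieredCompactionStrategy' }
          AND default_time_to_live = 604800;   -- illustrative: 7 days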

I am not sure this is related to your problem but having 8904 tombstones
read at once is pretty bad. Also you might want to paginate queries a bit
since it looks like you retrieve a lot of data at once.
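
For the pagination part, the native protocol lets the driver page for you. A rough sketch with the DataStax Java driver, where the contact point and the query are only placeholders:

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.ResultSet;
        import com.datastax.driver.core.Row;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;
        import com.datastax.driver.core.Statement;

        public class PagedEventRead {
            public static void main(String[] args) {
                // placeholder contact point: use one of your own nodes
                Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
                Session session = cluster.connect();

                // placeholder query: restrict it to a partition in real code
                Statement stmt = new SimpleStatement("SELECT * FROM rgsupv.event_data");
                stmt.setFetchSize(500);   // rows per page, instead of everything at once

                ResultSet rs = session.execute(stmt);
                for (Row row : rs) {
                    // process the row; the driver fetches further pages as you iterate
                }

                cluster.close();
            }
        }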

Meanwhile, if you are using STCS you can consider performing major
compaction on a regular basis (taking into consideration major compaction
downsides).
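
On a given node that would be something like the following, with the keyspace and table taken from the warning above:

        nodetool compact rgsupv event_data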

C*heers,

Alain




