Posted to user@cassandra.apache.org by Paco Trujillo <F....@genetwister.nl> on 2016/04/04 14:33:52 UTC

all the hosts are not reachable when running massive deletes

Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running "massive deletes" on one of the nodes (via the cql command line). At the beginning everything is fine, but after a while we start getting constant NoHostAvailableExceptions from the datastax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])
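The hint in parentheses ("you may want to increase the driver number of per-host connections") refers to the Java driver's connection pool. If the pool itself were the bottleneck (rather than the cluster, as the rest of the thread suggests), it could be widened when building the Cluster. A minimal sketch for the 2.x Java driver, with purely illustrative values:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class PoolTuning {
    public static Cluster build(String contactPoint) {
        // Widen the per-host connection pool so requests spend less time
        // waiting for a free connection (values are illustrative only).
        PoolingOptions pooling = new PoolingOptions();
        pooling.setCoreConnectionsPerHost(HostDistance.LOCAL, 4);
        pooling.setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

        return Cluster.builder()
                .addContactPoint(contactPoint)
                .withPoolingOptions(pooling)
                .build();
    }
}

That said, a starved pool is usually a symptom: if the nodes themselves are slow, more connections only queue up more work.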


All the nodes are running:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal that all the hosts become unreachable.

We have a replication factor of 3, and for the deletes I am not setting any consistency level (so the default ONE is used).

I checked the nodes with a lot of CPU (near 96%): GC activity stays around 1.6% (using only 3 GB of the 10 GB assigned to the heap). But looking at the thread pool stats, the pending column of the mutation stage grows without stopping. Could that be the problem?

I cannot find the reason for the timeouts. I have already increased the timeouts, but I do not think that is a solution because the timeouts point to another kind of problem. Does anyone have a tip to help determine where the problem is?
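The "mutation stage pending" column comes from the thread pool stats (nodetool tpstats). The same counter can be watched continuously over JMX while the deletes run; a minimal sketch, assuming the Cassandra 2.0-era MBean name org.apache.cassandra.request:type=MutationStage with its PendingTasks attribute (worth confirming with jconsole on your version) and the default JMX port 7199:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MutationStageWatcher {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // MBean/attribute names assumed from Cassandra 2.0.x; verify with jconsole.
            ObjectName mutationStage =
                    new ObjectName("org.apache.cassandra.request:type=MutationStage");
            for (int i = 0; i < 60; i++) {
                Object pending = mbs.getAttribute(mutationStage, "PendingTasks");
                System.out.println(host + " MutationStage pending: " + pending);
                Thread.sleep(1000); // one sample per second
            }
        } finally {
            jmxc.close();
        }
    }
}

A steadily growing pending count while CPU sits near 96% is consistent with writes (deletes are writes) arriving faster than the node can apply them.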

Thanks in advance

RE: all the hosts are not reachable when running massive deletes

Posted by Paco Trujillo <F....@genetwister.nl>.
Hi Alain


-          Overusing the cluster was one thing I was thinking about, and I have requested two new nodes (it was something already planned anyway). But the pattern of high CPU load is only visible on one or two of the nodes; the rest are working correctly. That made me think that adding two new nodes may not help.


-          Running the deletes at a slower, constant pace sounds good and I will definitely try that. However, I see similar errors during the weekly repair, even without the deletes running.



-          Our cluster is an in-house one; each machine is used only as a Cassandra node.



-          Logs are quite normal, even when the timeouts start to appear on the client.



-          Upgrading Cassandra is a good point, but I am afraid that if I start the upgrade right now the timeout problems will appear again. Are compactions executed during an upgrade? If not, I think it is safe to upgrade the cluster.

Thanks for your comments

From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
Sent: Monday, 4 April 2016 18:35
To: user@cassandra.apache.org
Subject: Re: all the hosts are not reachable when running massive deletes

Hola Paco,

the mutation stages pending column grows without stop, could be that the problem

CPU (near 96%)

Yes, basically I think you are over using this cluster.

but two of them have high cpu load, especially the 232 because I am running a lot of deletes using cqlsh in that node.

Solutions would be to run the deletes at a slower & constant pace against all the nodes (using a balancing policy), or to add capacity if all the nodes are facing the issue and you can't slow the deletes down. You should also have a look at iowait and steal, to see whether the CPUs are really 100% busy or are masking another issue (disk not answering fast enough, or a hardware / shared-instance problem). I had some noisy neighbours at some point while using Cassandra on AWS.
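One way to get that slower, constant pace (a sketch only; the rate is arbitrary and the table is the snpsearch cf shown further down in the thread) is to issue the deletes from the application through the driver, so the load balancing policy spreads coordination across all nodes, and to pace them with a rate limiter. Guava, a dependency of the 2.x Java driver, provides one:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.RateLimiter;

import java.util.List;

public class ThrottledDeletes {
    // keys: one {idline1, idline2, partid, id} tuple per row to delete
    public static void run(Session session, List<Object[]> keys) {
        PreparedStatement delete = session.prepare(
                "DELETE FROM snpaware.snpsearch WHERE idline1 = ? AND idline2 = ? AND partid = ? AND id = ?");
        RateLimiter limiter = RateLimiter.create(200.0); // cap at ~200 deletes per second
        for (Object[] key : keys) {
            limiter.acquire();                 // block until the next permit is available
            session.execute(delete.bind(key)); // coordinator chosen by the load balancing policy
        }
    }
}

Running the deletes synchronously like this also provides natural back-pressure: a slow coordinator slows the feed instead of piling up pending mutations.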

 I cannot find the reason that originates the timeouts.

I don't find it that strange while some or all of the nodes are being overused.

I already have increased the timeouts, but It do not think that is a solution because the timeouts indicated another type of error

Any relevant logs in Cassandra nodes (other than dropped mutations INFO)?

7 nodes version 2.0.17

Note: Be aware that this Cassandra version is quite old and no longer supported, so you might be facing issues that have already been solved. I know that upgrading is not straightforward, but 2.0 --> 2.1 brings an amazing set of optimisations and some fixes too. You should try it out :-).

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2016-04-04 14:33 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:
Hi everyone

We are having problems with our cluster (7 nodes version 2.0.17) when running “massive deletes” on one of the nodes (via cql command line). At the beginning everything is fine, but after a while we start getting constant NoHostAvailableException using the datastax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])


All the nodes are running:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have high cpu load, especially the 232 because I am running a lot of deletes using cqlsh in that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think is normal that all the host are not accesible.

We have a replication factor of 3 and for the deletes I am not using any consistency (so it is using the default ONE).

I check the nodes which a lot of CPU (near 96%) and th gc activity remains on 1.6% (using only 3 GB from the 10 which have assigned). But looking at the thread pool stats, the mutation stages pending column grows without stop, could be that the problem?

I cannot find the reason that originates the timeouts. I already have increased the timeouts, but It do not think that is a solution because the timeouts indicated another type of error. Anyone have a tip to try to determine where is the problem?

Thanks in advance


Re: all the hosts are not reachable when running massive deletes

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hola Paco,


> the mutation stages pending column grows without stop, could be that the
> problem



> CPU (near 96%)
>

Yes, basically I think you are over using this cluster.

but two of them have high cpu load, especially the 232 because I am running
> a lot of deletes using cqlsh in that node.
>

Solutions would be to run delete at a slower & constant path, against all
the nodes, using a balancing policy or adding capacity if all the nodes are
facing the issue and you can't slow deletes. You should also have a look at
iowait and steal, see if CPU are really used 100% or masking an other
issue. (disk not answering fast enough or hardware / shared instance
issue). I had some noisy neighbours at some point while using Cassandra on
AWS.

 I cannot find the reason that originates the timeouts.


I don't see it that weird while being overusing some/all the nodes.

I already have increased the timeouts, but It do not think that is a
> solution because the timeouts indicated another type of error


Any relevant logs in Cassandra nodes (other than dropped mutations INFO)?

7 nodes version 2.0.17


Note: Be aware that this Cassandra version is quite old and no longer
supported. Plus you might face issues that were solved already. I know that
upgrading is not straight forward, but 2.0 --> 2.1 brings an amazing set of
optimisations and some fixes too. You should try it out :-).

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2016-04-04 14:33 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:

> Hi everyone
>
>
>
> We are having problems with our cluster (7 nodes version 2.0.17) when
> running “massive deletes” on one of the nodes (via cql command line). At
> the beginning everything is fine, but after a while we start getting
> constant NoHostAvailableException using the datastax driver:
>
>
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried: /172.31.7.243:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.245:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.246:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /
> 172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3
> hosts, use getErrors() for more details])
>
>
>
>
>
> All the nodes are running:
>
>
>
> UN  172.31.7.244  152.21 GB  256     14.5%
> 58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
>
> UN  172.31.7.245  168.4 GB   256     14.5%
> bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
>
> UN  172.31.7.246  177.71 GB  256     13.7%
> 8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
>
> UN  172.31.7.247  158.57 GB  256     14.1%
> 94022081-a563-4042-81ab-75ffe4d13194  RAC1
>
> UN  172.31.7.243  176.83 GB  256     14.6%
> 0dda3410-db58-42f2-9351-068bdf68f530  RAC1
>
> UN  172.31.7.233  159 GB     256     13.6%
> 01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
>
> UN  172.31.7.232  166.05 GB  256     15.0%
> 4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
>
>
> but two of them have high cpu load, especially the 232 because I am
> running a lot of deletes using cqlsh in that node.
>
>
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster I
> do not think is normal that all the host are not accesible.
>
>
>
> We have a replication factor of 3 and for the deletes I am not using any
> consistency (so it is using the default ONE).
>
>
>
> I check the nodes which a lot of CPU (near 96%) and th gc activity remains
> on 1.6% (using only 3 GB from the 10 which have assigned). But looking at
> the thread pool stats, the mutation stages pending column grows without
> stop, could be that the problem?
>
>
>
> I cannot find the reason that originates the timeouts. I already have
> increased the timeouts, but It do not think that is a solution because the
> timeouts indicated another type of error. Anyone have a tip to try to
> determine where is the problem?
>
>
>
> Thanks in advance
>

RE: all the hosts are not reachable when running massive deletes

Posted by Paco Trujillo <F....@genetwister.nl>.
Thanks Alain for all your answers:


-          In a few days I am going to set up a maintenance window so I can run the repairs again and see what happens. I will definitely run 'iostat -mx 5 100' at that time, and also use the tool you pointed to, to see what is consuming so much CPU.

-          About the client configuration: we had QUORUM because we were planning to add another data center last year (running at one of our clients' locations), but in the end we postponed that. The configuration is still the same :), thanks for pointing it out. We used the downgrading policy because of the timeouts and the network problems we had in the past. In fact I have not seen the downgrading occur in the logs for some months, so it is probably good to remove it from the configuration as well.

-          The secondary index on the cf is definitely a bad decision, taken at the beginning when I was getting familiar with Cassandra. The problem is that the cf holds a lot of data at the moment and remodelling it will take some time, so we decided to postpone that. There are some queries which use this index; using materialized views on this cf and the ones related to it would solve the problem, but for that I need to upgrade the cluster ☺

-          Good that you mention that LCS would not be a good idea, because I was planning to take a snapshot of that cf, restore the data in our test cluster, and see whether LCS compaction would help. It was more a decision based on “I have to try something” than on arguments ☺



From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
Sent: Friday, 8 April 2016 12:46
To: user@cassandra.apache.org
Subject: Re: all the hosts are not reachable when running massive deletes

It looks like a complex issue that might be worth a close look at your data model, configuration and machines.

It is hard to help you from the mailing list. Still, here are some thoughts; some might be irrelevant or wrong, but some others might point you to your issue. Hope we get lucky there :-):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,00    0,00    0,40    0,03    0,00   98,57

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,20     0,00     0,00     8,00     0,00    0,00   0,00   0,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdd               0,00     0,20    0,00    0,40     0,00     0,00    12,00     0,00    2,50   2,50   0,10

CPU:


-          General use: 1 – 4 %

-          Worst case: 98%. This is when the problem comes: running massive deletes (even on a different machine than the one receiving the deletes) or running a repair.

First, the cluster is definitely not overloaded overall. You are having an issue with some nodes from time to time. This looks like an imbalanced cluster. It can be due to some wide rows or a bad partition key. Make sure writes are well balanced at all times with the partition key you are using, and try to spot warnings about large row compactions in the logs. Yet, I don't think this is what you are facing, as you should then have 2 or 3 nodes going crazy at the same time because of RF (2 or 3).

Also, can we have an 'iostat -mx 5 100' from when a node goes mad?
Another good troubleshooting tool is https://github.com/aragozin/jvm-tools/blob/master/sjk-core/COMMANDS.md#ttop-command. It would be interesting to see which Cassandra threads are consuming the CPU power. This is definitely something I would try on a highly loaded node at a bad time.


About the client, some comments, clearly unrelated to your issue, but probably worth mentioning:

.setConsistencyLevel(ConsistencyLevel.QUORUM))
 [...]
 .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))

I advise people never to do this. Basically, the QUORUM consistency level means: even in the worst case, I want at least (RF / 2) + 1 replicas to acknowledge the read / write for it to be considered valid; if not, fail the operation. For example, with RF = 3, QUORUM needs 2 of the 3 replicas. If used for both writes & reads, this provides strong and 'immediate' consistency (no locks though, so except for some races). Data is always sent to all the nodes in charge of the token (generally 2 or 3 nodes, depending on RF).

Then you say: if I can't have QUORUM, go for ONE. Meaning you prefer availability over consistency. But then, why not use ONE as the consistency level from the start? I would either go for CL ONE or remove the 'DowngradingConsistencyRetryPolicy'.

Also, I would go with 'LOCAL_ONE/LOCAL_QUORUM'. Using the LOCAL_ variants is not an issue when you have only one DC, as you do, but it avoids some surprises when adding a new DC. If you don't change it now, keep it in mind for the day you add a new DC.

Yet, this client does a probably well balanced use of the cluster.
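To make that concrete, here is a minimal sketch (assumed, not the actual client code from this thread) of the builder with the downgrading retry policy dropped and LOCAL_QUORUM set as the default, using the 2.x Java driver:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ClusterFactory {
    public static Cluster build(String[] contactPoints) {
        return Cluster.builder()
                .addContactPoints(contactPoints)
                // DC-aware + token-aware: requests go to local replicas and keep
                // behaving sensibly the day a second DC is added.
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                // Pick the consistency you actually need and let failures surface,
                // rather than silently downgrading on retry.
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                // No DowngradingConsistencyRetryPolicy: the driver's default retry
                // policy is used instead.
                .build();
    }
}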

About your table:

I think the problem is related with one specific column family:

CREATE TABLE snpaware.snpsearch ...

The first thing is that this table uses a secondary index. I must say I never used them, because they never worked very well and I did not want to operate this kind of table; I preferred maintaining my own indexes in the past. In the future I might rather use Materialised Views (C* 3.0+), though I am not sure how performant they are yet.

From what I heard, indexes are quite efficient on low cardinality. Is that your case?

Also, secondary indexes are held locally, with no distribution. That would fit with your 'one node at a time, rotating' issue. And when data is deleted, the index needs to be updated as well, so the delete operation is probably quite heavy.

Plus you say:

Which holds a lot of data. It is normally a cf which needs to be read but sometimes updated and deleted and I think the problem is there.

And I believe you're right. Could you work around this index somehow?
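One common workaround is to maintain a query table alongside the base table, so lookups by morphid read a single partition instead of hitting the local index on every node. A sketch only: snpsearch_by_morphid is a hypothetical table, not part of the existing schema, and in real code the statements would be prepared once.

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.UUID;

public class SnpSearchWriter {
    // Hypothetical lookup table:
    // CREATE TABLE snpaware.snpsearch_by_morphid (
    //     morphid bigint, idline1 bigint, idline2 bigint, partid int, id uuid,
    //     PRIMARY KEY (morphid, idline1, idline2, partid, id));
    public static void insert(Session session, long morphid, long idline1,
                              long idline2, int partid, UUID id) {
        PreparedStatement base = session.prepare(
                "INSERT INTO snpaware.snpsearch (idline1, idline2, partid, id, morphid) VALUES (?, ?, ?, ?, ?)");
        PreparedStatement byMorph = session.prepare(
                "INSERT INTO snpaware.snpsearch_by_morphid (morphid, idline1, idline2, partid, id) VALUES (?, ?, ?, ?, ?)");
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(base.bind(idline1, idline2, partid, id, morphid));
        batch.add(byMorph.bind(morphid, idline1, idline2, partid, id));
        session.execute(batch); // logged batch keeps the two tables in step
    }
}

Deletes would then remove the row from both tables the same way, which is cheaper and more predictable than updating a node-local secondary index.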

I wanted to change the compaction strategy but that means that a compaction will be executed and then timeouts will appear and I can not do that on the live cluster right now.

Not sure which new strategy you wanted to use, but LCS could make things a lot worse, as LCS uses far more resources than STCS at compaction time. Plus, at the start, all your existing data would have to go through a heavy rewrite process.

Honestly, from what I know now, I would blame the index, but again, I may be wrong.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-07 9:18 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:
Well, then you could trying to replace this node as soon as you have more nodes available. I would use this procedure as I believe it is the most efficient one: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html.

It is not always the same node; it is always one node out of the seven in the cluster that has the high load, but not always the same one.

Regarding the hardware questions (data from one of the nodes; all of them have the same configuration):

Disk:


-          We use SSD disks

-          Output from iostat -mx 5 100:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,00    0,00    0,40    0,03    0,00   98,57

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,20     0,00     0,00     8,00     0,00    0,00   0,00   0,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdd               0,00     0,20    0,00    0,40     0,00     0,00    12,00     0,00    2,50   2,50   0,10


-          Logs: I do not see anything in the messages log except this:

Apr  3 03:07:01 GT-cassandra7 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1504" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Apr  3 18:24:55 GT-cassandra7 ntpd[1847]: 0.0.0.0 06a8 08 no_sys_peer
Apr  4 06:56:18 GT-cassandra7 ntpd[1847]: 0.0.0.0 06b8 08 no_sys_peer

CPU:


-          General use: 1 – 4 %

-          Worst case: 98%. This is when the problem comes: running massive deletes (even on a different machine than the one receiving the deletes) or running a repair.

RAM:


-          We are using CMS.

-          Each node has 16 GB, and we dedicate to Cassandra:

o   MAX_HEAP_SIZE="10G"

o   HEAP_NEWSIZE="800M"


Regarding the rest of the questions you mention:


-          Clients: we use the datastax java driver with this configuration:
// Get contact points
String[] contactPoints = this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");
cluster = com.datastax.driver.core.Cluster.builder()
        .addContactPoints(contactPoints)
        //.addContactPoint(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL))
        .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME),
                this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
        // default consistency level for every request
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM))
        //.withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy(CASSANDRA_PRIMARY_CLUSTER)))
        // token-aware routing on top of plain round-robin (no DC awareness)
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        //.withLoadBalancingPolicy(new TokenAwarePolicy((LoadBalancingPolicy) new RoundRobinBalancingPolicy()))
        // on timeout or unavailable, retry at a lower consistency level and log it
        .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
        .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
        .build();

So requests should be evenly distributed.


-          The deletes are contained in a cql file, and I am using cqlsh to execute them. I will try to run the deletes in small batches and against separate nodes, but the same problem appears when running repairs.

I think the problem is related with one specific column family:

CREATE TABLE snpaware.snpsearch (
    idline1 bigint,
    idline2 bigint,
    partid int,
    id uuid,
    alleles int,
    coverage int,
    distancetonext int,
    distancetonextbyline int,
    distancetoprev int,
    distancetoprevbyline int,
    frequency double,
    idindividual bigint,
    idindividualmorph bigint,
    idreferencebuild bigint,
    isinexon boolean,
    isinorf boolean,
    max_length int,
    morphid bigint,
    position int,
    qualityflag int,
    ranking int,
    referencebuildlength int,
    snpsearchid uuid,
    synonymous boolean,
    PRIMARY KEY ((idline1, idline2, partid), id)
) WITH CLUSTERING ORDER BY (id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = 'KEYS_ONLY'
    AND comment = 'Table with the snp between lines'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND index_interval = 128
    AND memtable_flush_period_in_ms = 0
    AND populate_io_cache_on_flush = false
    AND read_repair_chance = 0.1
    AND replicate_on_write = true
    AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX snpsearch_morphid ON snpaware.snpsearch (morphid);

This cf holds a lot of data. It is normally a cf that needs to be read, but sometimes it is updated and deleted, and I think the problem is there. I wanted to change the compaction strategy, but that means a compaction will be executed and then the timeouts will appear, so I cannot do that on the live cluster right now.

I will try to bring a snapshot of the cf to a test cluster and test the repair there (I cannot snapshot the data from the live cluster completely, because it does not fit in our test cluster). Following your recommendation I will postpone the upgrade of the cluster (although the partial repair in version 2.1 looks like a good fit for my situation, to decrease the pressure on the nodes when running compactions).

Anyway, I have ordered two new nodes, because maybe that will help. The problem is that adding a new node means running cleanup on all nodes; does the cleanup imply a compaction? If the answer is yes, then the timeouts will appear again.


From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
Sent: Tuesday, 5 April 2016 15:11

To: user@cassandra.apache.org
Subject: Re: all the hosts are not reachable when running massive deletes

 Over use the cluster was one thing which I was thinking about, and I have requested two new nodes (anyway it was something already planned). But the pattern of nodes with high CPU load is only visible in 1 or two of the nodes, the rest are working correctly. That made me think that adding two new nodes maybe will not help.

Well, then you could try replacing this node as soon as you have more nodes available. I would use this procedure, as I believe it is the most efficient one: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html.

Yet I believe it might not be a hardware or cluster throughput issue, and if it is a hardware issue you probably want to dig into it, as this machine is yours and not a virtual one. You might want to reuse it anyway.

Some questions about the machine and their usage.

Disk:
What disk hardware and configuration do you use.
iostat -mx 5 100 gives you? How is iowait?
Any error in the system / kernel logs?

CPU
How much used are the CPUs in general / worst cases?
What is the load average / max and how many cores have the cpu?

RAM
You are using a 10 GB heap and CMS, right? You seem to say that GC activity looks OK, can you confirm?
How much total RAM are the machines using?

The point here is to see if we can spot the bottleneck. If there is none, Cassandra is probably badly configured at some point.

when running “massive deletes” on one of the nodes

 Run the deletes at slower at constant path sounds good and definitely I will try that.

Are clients and queries well configured to use all the nodes evenly? Are deletes well balanced also? If not, balancing the usage of the nodes will probably alleviate things.

The update of Cassandra is a good point but I am afraid that if I start the updates right now the timeouts problems will appear again. During an update compactions are executed? If it is not I think is safe to update the cluster.

Indeed, I do not recommend upgrading right now. Yet I would do it asap (= as soon as the cluster is healthy and the clients are compatible with the new version). You should always start operations from a healthy cluster or you might end up in a worse situation. Compactions will run normally during the upgrade. Make sure not to run any streaming process (repairs / bootstrap / node removal) during the upgrade and while you have not yet run "nodetool upgradesstables". There is a lot of information out there about upgrades.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-05 10:32 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:
Hi daemeon

We have checked the network and it is OK; in fact the nodes are connected to each other over a dedicated network.

From: daemeon reiydelle [mailto:daemeonr@gmail.com]
Sent: Monday, 4 April 2016 18:42
To: user@cassandra.apache.org
Subject: Re: all the hosts are not reachable when running massive deletes


Network issues. Could be jumbo frames not consistent or other.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Apr 4, 2016 5:34 AM, "Paco Trujillo" <F....@genetwister.nl> wrote:
Hi everyone

We are having problems with our cluster (7 nodes version 2.0.17) when running “massive deletes” on one of the nodes (via cql command line). At the beginning everything is fine, but after a while we start getting constant NoHostAvailableException using the datastax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])


All the nodes are running:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have high cpu load, especially the 232 because I am running a lot of deletes using cqlsh in that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think is normal that all the host are not accesible.

We have a replication factor of 3 and for the deletes I am not using any consistency (so it is using the default ONE).

I check the nodes which a lot of CPU (near 96%) and th gc activity remains on 1.6% (using only 3 GB from the 10 which have assigned). But looking at the thread pool stats, the mutation stages pending column grows without stop, could be that the problem?

I cannot find the reason that originates the timeouts. I already have increased the timeouts, but It do not think that is a solution because the timeouts indicated another type of error. Anyone have a tip to try to determine where is the problem?

Thanks in advance



Re: all the hosts are not reachable when running massive deletes

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
It looks like a complex issue that might worth having a close look at your
data model, configurations and machines.

It is hard to help you from the mailing list. Yet here are some thoughts,
some might be irrelevant or wrong, but some other might point you to your
issue, hope we will get lucky there :-):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>            1,00    0,00    0,40    0,03    0,00   98,57
>
>
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
> avgqu-sz   await  svctm  %util
>
> sda               0,00     0,00    0,00    0,20     0,00     0,00
> 8,00     0,00    0,00   0,00   0,00
>
> sdb               0,00     0,00    0,00    0,00     0,00     0,00
> 0,00     0,00    0,00   0,00   0,00
>
> sdc               0,00     0,00    0,00    0,00     0,00     0,00
> 0,00     0,00    0,00   0,00   0,00
>
> sdd               0,00     0,20    0,00    0,40     0,00     0,00
> 12,00     0,00    2,50   2,50   0,10
>
>
>
CPU:
>
>
>
> -          General use: 1 – 4 %
>
> -          Worst case: 98% .It is when the problem comes, running massive
> deletes(even in a different machine which is receiving the deletes) or
> running a repair.
>

First, the cluster is definitely not overloaded. You are having an issue
with some nodes from time to time. This looks like an imbalanced cluster.
It can be due to some wide rows or bad partition key. Make sure writes are
well balanced at any time with the partition you are using and try to spot
some warnings about large row compactions in the logs. Yet, I don't think
this is what you face as you then should have 2 or 3 nodes going crazy at
the same time because of RF (2 or 3).

Also, can we have an 'iostat -mx 5 100' on when a node goes mad?
An other good troubleshooting tool would be using
https://github.com/aragozin/jvm-tools/blob/master/sjk-core/COMMANDS.md#ttop-command.
It would be interesting to see what Cassandra threads are consuming the CPU
power. This is definitely something I would try on a high load node/time.


About the client, some comments, clearly unrelated to your issue, but
probably worth it to be told:

.setConsistencyLevel(ConsistencyLevel.QUORUM))

 [...]

 .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPol
> icy.INSTANCE))


I advice people to never do this. Basically, consistency level means: even
in the worst case, I want to make sure that at least (RF / 2) + 1 got the
read / write to consider it valid, if not drop the operation. If used for
both writes & reads, this provide you a strong and 'immediate' consistency
(no locks though, so excepted for some races). Data will always be sent to
all the nodes in charge of the token (generally 2 or 3 nodes, depending on
RF).

Then you say, if I can't have quorum, then go for one. Meaning you prefer
availability, rather than consistency. Then, why not use one from the start
as the consistency level? I would go for CL ONE or remove the '
DowngradingConsistencyRetryPolicy'.

Also, I would go with 'LOCAL_ONE/QUORUM', using Local is not an issue when
using only one DC as you do, but avoid some surprises when adding a new DC.
If you don't change it, keep it in mind for the day you add a new DC.

Yet, this client does a probably well balanced use of the cluster.

About your table:

I think the problem is related with one specific column family:
>
>
>
> CREATE TABLE snpaware.snpsearch ...
>

First thing is that this table is using a secondary index. I must say I
never used them, because it never worked very well and I did not want to
operate this kind of tables. I preferred maintaining my own indexes in the
past. In the future I might rather use Materialised View (C* 3.0+). Though,
I am not sure how performant they are yet.

From what I heard, indexes are quite efficient on low cardinality. Is that
your case?

Also indexes are hold locally, no distribution. That would fit with your
'one node at the time and rotating' issue. Also, when deleting data from
there, index need to be updated. Delete operation is probably quite heavy.

Plus you say:

Which holds a lot of data. It is normally a cf which needs to be read but
> sometimes updated and deleted and I think the problem is there.


And I believe you're right. Could you work around this index somehow?

I wanted to change the compaction strategy but that means that a compaction
> will be executed and then timeouts will appear and I can not do that on the
> live cluster right now.


Not sure what new strategy you wanted to use, but LCS could make things a
lot worse, as LCS uses far more resources than STCS at compaction time.
Plus, at start, all your data would have to go through an heavy process.

Honestly, from what I know now, I would blame, the index, but again, I can
be wrong.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-07 9:18 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:

> Well, then you could trying to replace this node as soon as you have more
> nodes available. I would use this procedure as I believe it is the most
> efficient one:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
> .
>
>
>
> It is not always the same node, it is always one node from the seven in
> the cluster which has the high load but not always the same.
>
>
>
> Respect to the question of the hardware ( from one of the nodes, all of
> them have the same configuration)
>
>
>
> Disk:
>
>
>
> -          We use sdd disks
>
> -          Output from iostat -mx 5 100:
>
>
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>            1,00    0,00    0,40    0,03    0,00   98,57
>
>
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
> avgqu-sz   await  svctm  %util
>
> sda               0,00     0,00    0,00    0,20     0,00     0,00
> 8,00     0,00    0,00   0,00   0,00
>
> sdb               0,00     0,00    0,00    0,00     0,00     0,00
> 0,00     0,00    0,00   0,00   0,00
>
> sdc               0,00     0,00    0,00    0,00     0,00     0,00
> 0,00     0,00    0,00   0,00   0,00
>
> sdd               0,00     0,20    0,00    0,40     0,00     0,00
> 12,00     0,00    2,50   2,50   0,10
>
>
>
> -          Logs, I do not see nothing on the messages log except this:
>
>
>
> Apr  3 03:07:01 GT-cassandra7 rsyslogd: [origin software="rsyslogd"
> swVersion="5.8.10" x-pid="1504" x-info="http://www.rsyslog.com"] rsyslogd
> was HUPed
>
> Apr  3 18:24:55 GT-cassandra7 ntpd[1847]: 0.0.0.0 06a8 08 no_sys_peer
>
> Apr  4 06:56:18 GT-cassandra7 ntpd[1847]: 0.0.0.0 06b8 08 no_sys_peer
>
>
>
> CPU:
>
>
>
> -          General use: 1 – 4 %
>
> -          Worst case: 98% .It is when the problem comes, running massive
> deletes(even in a different machine which is receiving the deletes) or
> running a repair.
>
>
>
> RAM:
>
>
>
> -          We are using CMS.
>
> -          Each node have 16GB, and we dedicate to Cassandra
>
> o   MAX_HEAP_SIZE="10G"
>
> o   HEAP_NEWSIZE="800M"
>
>
>
>
>
> Regarding to the rest of questions you mention:
>
>
>
> -          Clients: we use the datastax java driver with this
> configuration:
>
> //Get contact points
>
>                   String[]
> contactPoints=this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");
>
>           cluster = com.datastax.driver.core.Cluster.builder()
>
>                   .addContactPoints(contactPoints)
>
>
> //.addContactPoint(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL))
>
>
> .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME),
>
>
>
>     this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
>
>                                   .withQueryOptions(new QueryOptions()
>
>
> .setConsistencyLevel(ConsistencyLevel.QUORUM))
>
>                                   //.withLoadBalancingPolicy(new
> TokenAwarePolicy(new DCAwareRoundRobinPolicy(CASSANDRA_PRIMARY_CLUSTER)))
>
>                                   .withLoadBalancingPolicy(new
> TokenAwarePolicy(new RoundRobinPolicy()))
>
>                                   //.withLoadBalancingPolicy(new
> TokenAwarePolicy((LoadBalancingPolicy) new RoundRobinBalancingPolicy()))
>
>                                   .withRetryPolicy(new
> LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
>
>
> .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
>
>                                   .build();
>
>
>
> So request should be evenly distributed.
>
>
>
> -          Deletes are contained in a cql file, and I am using cqlsh to
> execute them. I will try to run the deletes in small batches and separate
> nodes, but same problem appear when running repairs.
>
>
>
> I think the problem is related with one specific column family:
>
>
>
> CREATE TABLE snpaware.snpsearch (
>
>     idline1 bigint,
>
>     idline2 bigint,
>
>     partid int,
>
>     id uuid,
>
>     alleles int,
>
>     coverage int,
>
>     distancetonext int,
>
>     distancetonextbyline int,
>
>     distancetoprev int,
>
>     distancetoprevbyline int,
>
>     frequency double,
>
>     idindividual bigint,
>
>     idindividualmorph bigint,
>
>     idreferencebuild bigint,
>
>     isinexon boolean,
>
>     isinorf boolean,
>
>     max_length int,
>
>     morphid bigint,
>
>     position int,
>
>     qualityflag int,
>
>     ranking int,
>
>     referencebuildlength int,
>
>     snpsearchid uuid,
>
>     synonymous boolean,
>
>     PRIMARY KEY ((idline1, idline2, partid), id)
>
> ) WITH CLUSTERING ORDER BY (id ASC)
>
>     AND bloom_filter_fp_chance = 0.01
>
>     AND caching = 'KEYS_ONLY'
>
>     AND comment = 'Table with the snp between lines'
>
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
>
>    AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>
>     AND dclocal_read_repair_chance = 0.0
>
>     AND default_time_to_live = 0
>
>     AND gc_grace_seconds = 864000
>
>     AND index_interval = 128
>
>     AND memtable_flush_period_in_ms = 0
>
>     AND populate_io_cache_on_flush = false
>
>     AND read_repair_chance = 0.1
>
>     AND replicate_on_write = true
>
>     AND speculative_retry = '99.0PERCENTILE';
>
> CREATE INDEX snpsearch_morphid ON snpaware.snpsearch (morphid);
>
>
>
> Which holds a lot of data. It is normaly a cf which needs to be readed but
> sometimes updated and deleted and I think the problem is there. I wanted to
> change the compaction strategy but that means that a compaction will be
> executed and then timeouts will appear and I can not do that on the live
> cluster right now.
>
>
>
> I will try bring a snapshot of the cf to a test cluster and test the
> repair there (I can not snaphost the data from the live cluster completely
> because it does not fit in our test cluster). Following your recommendation
> I will postpone the upgrade of the cluster (but the partial repair in
> version 2.1 looks a good fit for my situation to decrease the pressure on
> the nodes when running compactions).
>
>
>
> Anyway I have ordered two new nodes, because maybe that will help. The
> problem is that adding a new node will need to run clean up in all nodes,
> the clean up implies a compaction? If the answer to this is yes, then the
> timeouts will appear again.
>
>
>
>
>
> *From:* Alain RODRIGUEZ [mailto:arodrime@gmail.com]
> *Sent:* dinsdag 5 april 2016 15:11
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: all the nost are not reacheable when running massive
> deletes
>
>
>
>  Over use the cluster was one thing which I was thinking about, and I
> have requested two new nodes (anyway it was something already planned). But
> the pattern of nodes with high CPU load is only visible in 1 or two of the
> nodes, the rest are working correctly. That made me think that adding two
> new nodes maybe will not help.
>
>
>
> Well, then you could trying to replace this node as soon as you have more
> nodes available. I would use this procedure as I believe it is the most
> efficient one:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
> .
>
>
>
> Yet I believe it might not be a hardware or cluster throughput issue, and
> if it is a hardware issues you probably want to dig it as this machine is
> yours and not a virtual one. You might want to reuse it anyway.
>
>
>
> Some questions about the machine and their usage.
>
>
>
> Disk:
>
> What disk hardware and configuration do you use.
>
> iostat -mx 5 100 gives you? How is iowait?
>
> Any error in the system / kernel logs?
>
>
>
> CPU
>
> How much used are the CPUs in general / worst cases?
>
> What is the load average / max and how many cores have the cpu?
>
>
>
> RAM
>
> You are using 10GB heap and CMS right? You seems to say that GC activity
> looks ok, can you confirm?
>
> How much total RAM are the machines using?
>
>
>
> The point here is to see if we can spot the bottleneck. If there is none,
> Cassandra is probably badly configured at some point.
>
>
>
> when running “massive deletes” on one of the nodes
>
>
>
>  Run the deletes at slower at constant path sounds good and definitely I
> will try that.
>
>
>
> Are clients and queries well configured to use all the nodes evenly? Are
> deletes well balanced also? If not, balancing the usage of the nodes will
> probably alleviate things.
>
>
>
> The update of Cassandra is a good point but I am afraid that if I start
> the updates right now the timeouts problems will appear again. During an
> update compactions are executed? If it is not I think is safe to update the
> cluster.
>
>
>
> I do not recommend you to upgrade right now indeed. Yet I would do it asap
> (= as soon as the cluster is ready and clients are compatible with the new
> version). You should always start operations with an healthy cluster or you
> might end in a worst situation. Compactions will run normally. Make sure
> not to run any streaming process (repairs / bootstrap / node removal)
> during the upgrade and while you have not yet run "nodetool
> upgradesstable". There is a lot of informations out there about upgrades.
>
>
>
> C*heers,
>
> -----------------------
>
> Alain Rodriguez - alain@thelastpickle.com
>
> France
>
>
>
> The Last Pickle - Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
> 2016-04-05 10:32 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:
>
> Hi daemeon
>
>
>
> We have check network and it is ok, in fact the nodes are connecting
> between themselves with a dedicated network.
>
>
>
> *From:* daemeon reiydelle [mailto:daemeonr@gmail.com]
> *Sent:* maandag 4 april 2016 18:42
> *To:* user@cassandra.apache.org
> *Subject:* Re: all the nost are not reacheable when running massive
> deletes
>
>
>
> Network issues. Could be jumbo frames not consistent or other.
>
> sent from my mobile
>
> sent from my mobile
> Daemeon C.M. Reiydelle
> USA 415.501.0198
> London +44.0.20.8144.9872
>
> On Apr 4, 2016 5:34 AM, "Paco Trujillo" <F....@genetwister.nl> wrote:
>
> Hi everyone
>
>
>
> We are having problems with our cluster (7 nodes version 2.0.17) when
> running “massive deletes” on one of the nodes (via cql command line). At
> the beginning everything is fine, but after a while we start getting
> constant NoHostAvailableException using the datastax driver:
>
>
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried: /172.31.7.243:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.245:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.246:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /
> 172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3
> hosts, use getErrors() for more details])
>
>
>
>
>
> All the nodes are running:
>
>
>
> UN  172.31.7.244  152.21 GB  256     14.5%
> 58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
>
> UN  172.31.7.245  168.4 GB   256     14.5%
> bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
>
> UN  172.31.7.246  177.71 GB  256     13.7%
> 8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
>
> UN  172.31.7.247  158.57 GB  256     14.1%
> 94022081-a563-4042-81ab-75ffe4d13194  RAC1
>
> UN  172.31.7.243  176.83 GB  256     14.6%
> 0dda3410-db58-42f2-9351-068bdf68f530  RAC1
>
> UN  172.31.7.233  159 GB     256     13.6%
> 01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
>
> UN  172.31.7.232  166.05 GB  256     15.0%
> 4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
>
>
> but two of them have high cpu load, especially the 232 because I am
> running a lot of deletes using cqlsh in that node.
>
>
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster I
> do not think is normal that all the host are not accesible.
>
>
>
> We have a replication factor of 3 and for the deletes I am not using any
> consistency (so it is using the default ONE).
>
>
>
> I check the nodes which a lot of CPU (near 96%) and th gc activity remains
> on 1.6% (using only 3 GB from the 10 which have assigned). But looking at
> the thread pool stats, the mutation stages pending column grows without
> stop, could be that the problem?
>
>
>
> I cannot find the reason that originates the timeouts. I already have
> increased the timeouts, but It do not think that is a solution because the
> timeouts indicated another type of error. Anyone have a tip to try to
> determine where is the problem?
>
>
>
> Thanks in advance
>
>
>

RE: all the hosts are not reachable when running massive deletes

Posted by Paco Trujillo <F....@genetwister.nl>.
Well, then you could trying to replace this node as soon as you have more nodes available. I would use this procedure as I believe it is the most efficient one: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html.

It is not always the same node, it is always one node from the seven in the cluster which has the high load but not always the same.

Respect to the question of the hardware ( from one of the nodes, all of them have the same configuration)

Disk:


-          We use sdd disks

-          Output from iostat -mx 5 100:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,00    0,00    0,40    0,03    0,00   98,57

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,20     0,00     0,00     8,00     0,00    0,00   0,00   0,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdd               0,00     0,20    0,00    0,40     0,00     0,00    12,00     0,00    2,50   2,50   0,10


-          Logs, I do not see nothing on the messages log except this:

Apr  3 03:07:01 GT-cassandra7 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1504" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Apr  3 18:24:55 GT-cassandra7 ntpd[1847]: 0.0.0.0 06a8 08 no_sys_peer
Apr  4 06:56:18 GT-cassandra7 ntpd[1847]: 0.0.0.0 06b8 08 no_sys_peer

CPU:


-          General use: 1 – 4 %

-          Worst case: 98% .It is when the problem comes, running massive deletes(even in a different machine which is receiving the deletes) or running a repair.

RAM:


-          We are using CMS.

-          Each node have 16GB, and we dedicate to Cassandra

o   MAX_HEAP_SIZE="10G"

o   HEAP_NEWSIZE="800M"


Regarding to the rest of questions you mention:


-          Clients: we use the datastax java driver with this configuration:
//Get contact points
                  String[] contactPoints=this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");
          cluster = com.datastax.driver.core.Cluster.builder()
                  .addContactPoints(contactPoints)
                      //.addContactPoint(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL))
                      .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME),
                                  this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
                                  .withQueryOptions(new QueryOptions()
                                  .setConsistencyLevel(ConsistencyLevel.QUORUM))
                                  //.withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy(CASSANDRA_PRIMARY_CLUSTER)))
                                  .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                                  //.withLoadBalancingPolicy(new TokenAwarePolicy((LoadBalancingPolicy) new RoundRobinBalancingPolicy()))
                                  .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
                                  .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
                                  .build();

So request should be evenly distributed.


-          Deletes are contained in a cql file, and I am using cqlsh to execute them. I will try to run the deletes in small batches and separate nodes, but same problem appear when running repairs.

I think the problem is related with one specific column family:

CREATE TABLE snpaware.snpsearch (
    idline1 bigint,
    idline2 bigint,
    partid int,
    id uuid,
    alleles int,
    coverage int,
    distancetonext int,
    distancetonextbyline int,
    distancetoprev int,
    distancetoprevbyline int,
    frequency double,
    idindividual bigint,
    idindividualmorph bigint,
    idreferencebuild bigint,
    isinexon boolean,
    isinorf boolean,
    max_length int,
    morphid bigint,
    position int,
    qualityflag int,
    ranking int,
    referencebuildlength int,
    snpsearchid uuid,
    synonymous boolean,
    PRIMARY KEY ((idline1, idline2, partid), id)
) WITH CLUSTERING ORDER BY (id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = 'KEYS_ONLY'
    AND comment = 'Table with the snp between lines'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
   AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND index_interval = 128
    AND memtable_flush_period_in_ms = 0
    AND populate_io_cache_on_flush = false
    AND read_repair_chance = 0.1
    AND replicate_on_write = true
    AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX snpsearch_morphid ON snpaware.snpsearch (morphid);

Which holds a lot of data. It is normaly a cf which needs to be readed but sometimes updated and deleted and I think the problem is there. I wanted to change the compaction strategy but that means that a compaction will be executed and then timeouts will appear and I can not do that on the live cluster right now.

I will try bring a snapshot of the cf to a test cluster and test the repair there (I can not snaphost the data from the live cluster completely because it does not fit in our test cluster). Following your recommendation I will postpone the upgrade of the cluster (but the partial repair in version 2.1 looks a good fit for my situation to decrease the pressure on the nodes when running compactions).

Anyway I have ordered two new nodes, because maybe that will help. The problem is that adding a new node will need to run clean up in all nodes, the clean up implies a compaction? If the answer to this is yes, then the timeouts will appear again.


From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
Sent: dinsdag 5 april 2016 15:11
To: user@cassandra.apache.org
Subject: Re: all the nost are not reacheable when running massive deletes

 Over use the cluster was one thing which I was thinking about, and I have requested two new nodes (anyway it was something already planned). But the pattern of nodes with high CPU load is only visible in 1 or two of the nodes, the rest are working correctly. That made me think that adding two new nodes maybe will not help.

Well, then you could try to replace this node as soon as you have more nodes available. I would use this procedure as I believe it is the most efficient one: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html.

Yet I believe it might not be a hardware or cluster throughput issue, and if it is a hardware issue you probably want to dig into it, since this machine is yours and not a virtual one. You might want to reuse it anyway.

Some questions about the machines and their usage.

Disk:
What disk hardware and configuration do you use?
What does iostat -mx 5 100 give you? How is iowait?
Any error in the system / kernel logs?

CPU
How heavily used are the CPUs in general / in the worst cases?
What is the load average / max, and how many cores do the CPUs have?

RAM
You are using a 10 GB heap and CMS, right? You seem to say that GC activity looks OK, can you confirm?
How much total RAM are the machines using?

The point here is to see if we can spot the bottleneck. If there is none, Cassandra is probably badly configured at some point.

when running “massive deletes” on one of the nodes

 Run the deletes at slower at constant path sounds good and definitely I will try that.

Are clients and queries well configured to use all the nodes evenly? Are deletes well balanced also? If not, balancing the usage of the nodes will probably alleviate things.

The update of Cassandra is a good point but I am afraid that if I start the updates right now the timeouts problems will appear again. During an update compactions are executed? If it is not I think is safe to update the cluster.

Indeed, I do not recommend upgrading right now. Yet I would do it asap (= as soon as the cluster is healthy and the clients are compatible with the new version). You should always start operations with a healthy cluster or you might end up in a worse situation. Compactions will run normally during the upgrade. Make sure not to run any streaming process (repairs / bootstrap / node removal) during the upgrade and while you have not yet run "nodetool upgradesstables". There is a lot of information out there about upgrades.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-05 10:32 GMT+02:00 Paco Trujillo <F....@genetwister.nl>:
Hi daemeon

We have checked the network and it is OK; in fact the nodes are connected to each other over a dedicated network.

From: daemeon reiydelle [mailto:daemeonr@gmail.com]
Sent: maandag 4 april 2016 18:42
To: user@cassandra.apache.org
Subject: Re: all the nost are not reacheable when running massive deletes


Network issues. Could be inconsistent jumbo frames or something else.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872



