Posted to user@cassandra.apache.org by Max Campos <mc...@core43.com> on 2017/12/14 08:18:15 UTC

Lots of simultaneous connections?

Hi -

We’re finally putting our new application under load, and we’re starting to get this error message from the Python driver when under heavy load:

('Unable to connect to any servers', {'x.y.z.205': OperationTimedOut('errors=None, last_host=None',), 'x.y.z.204': OperationTimedOut('errors=None, last_host=None',), 'x.y.z.206': OperationTimedOut('errors=None, last_host=None',)})' (22.7s)

Our cluster is running 3.0.6, has 3 nodes and we use RF=3, CL=QUORUM reads/writes.  We have a few thousand machines which are each making 1-10 connections to C* at once, but each of these connections only reads/writes a few records, waits several minutes, and then writes a few records — so while netstat reports ~5K connections per node, they’re generally idle.  Peak read/sec today was ~1500 per node, peak writes/sec was ~300 per node.  Read/write latencies peaked at 2.5ms.
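(For context, the clients use the DataStax Python driver; a minimal sketch of the setup is below. Contact points, keyspace and table names are placeholders, and connect_timeout is simply raised above the driver's default to show the knob that bounds the "Unable to connect" window.)

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Placeholder contact points standing in for x.y.z.204-206.
    cluster = Cluster(
        contact_points=["10.0.0.204", "10.0.0.205", "10.0.0.206"],
        connect_timeout=10,  # seconds allowed for the initial handshake to a node
    )
    session = cluster.connect("qa_results")  # placeholder keyspace

    # All reads/writes go through QUORUM, e.g.:
    test_id, step_no = "test-0001", 1  # placeholder values
    stmt = SimpleStatement(
        "INSERT INTO step_start (test_id, step, started_at) "
        "VALUES (%s, %s, toTimestamp(now()))",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(stmt, (test_id, step_no))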

Some questions:
1) Is anyone else out there making this many simultaneous connections?  Any idea what a reasonable number of connections is, what is too many, etc?

2) Any thoughts on which JMX metrics I should look at to better understand what exactly is exploding?  Is there a “number of active connections” metric?  (A rough way to check this is sketched after question 3 below.)  We currently look at:
- client reads/writes per sec
- read/write latency
- compaction tasks
- repair tasks
- disk used by node
- disk used by table
- avg partition size per table

3) Any other advice?  
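(Partially answering my own question 2: there is a connected-clients gauge exposed over JMX (org.apache.cassandra.metrics:type=Client,name=connectedNativeClients, if I have the bean name right). As a cruder check, the established sockets on the native port can be counted directly on a node; a sketch assuming Linux, the default port 9042 and the ss utility:)

    import subprocess

    # Count established native-protocol (9042) connections on this node.
    out = subprocess.run(
        ["ss", "-tn", "state", "established", "( sport = :9042 )"],
        capture_output=True, text=True, check=True,
    ).stdout
    # First line is the column header; each remaining line is one socket.
    print("established client connections:", max(len(out.splitlines()) - 1, 0))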

I think I’ll try doing an explicit disconnect during the waiting period of our application’s execution, so as to get the C* connection count down.  Hopefully that will solve the timeout problem.
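Roughly what I have in mind, as a sketch (keyspace, table and the run_test_script helper are made up for illustration; cluster.shutdown() closes the driver's pooled connections to every node):

    from cassandra.cluster import Cluster

    CONTACT_POINTS = ["10.0.0.204", "10.0.0.205", "10.0.0.206"]  # placeholders

    def run_step(test_id, step_no, step_cmd):
        # Connect only long enough to record that the step started.
        cluster = Cluster(contact_points=CONTACT_POINTS)
        session = cluster.connect("qa_results")  # placeholder keyspace
        session.execute(
            "INSERT INTO step_start (test_id, step) VALUES (%s, %s)",
            (test_id, step_no),
        )
        cluster.shutdown()  # hold no connections during the long-running part

        exit_status = run_test_script(step_cmd)  # hypothetical helper; 2 min to 20 h, no C* traffic

        # Reconnect briefly to record the result.
        cluster = Cluster(contact_points=CONTACT_POINTS)
        session = cluster.connect("qa_results")
        session.execute(
            "INSERT INTO step_result (test_id, step, exit_status) VALUES (%s, %s, %s)",
            (test_id, step_no, exit_status),
        )
        cluster.shutdown()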

Thanks for your help.

- Max
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Re: Lots of simultaneous connections?

Posted by kurt greaves <ku...@instaclustr.com>.
Yep. With those kinds of numbers you're likely overwhelming the cluster
with connections. You'd be better off if you can configure it to have
either 1 connection per machine, or at least 1 connection per test.
Creating lots of connections is definitely not a good idea, and remember
that each connection will actually create at least an individual
connection to every node in the cluster (might be more), so things are
going to blow out massively if you have a lot of clients.
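In Python-driver terms that might look like keeping one Cluster/Session per machine for the life of the process and passing it to every test step (a sketch; names are illustrative). The Session is thread-safe, and since each Session keeps a small pool to every node, the per-node connection count then scales with the number of machines rather than with the number of running steps:

    from cassandra.cluster import Cluster

    # One Cluster/Session for the whole process (placeholder contact points/keyspace).
    _cluster = Cluster(contact_points=["10.0.0.204", "10.0.0.205", "10.0.0.206"])
    _session = _cluster.connect("qa_results")

    def record_step_start(test_id, step_no):
        # Every step reuses the shared session instead of opening its own connections.
        _session.execute(
            "INSERT INTO step_start (test_id, step) VALUES (%s, %s)",
            (test_id, step_no),
        )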

Re: Lots of simultaneous connections?

Posted by Max Campos <mc...@core43.com>.
Hi Kurt, thanks for your reply — I really appreciate your (and everyone else’s!) continued assistance to people in the C* user community.

All of these clients & servers are on the same (internal) network, so there is no firewall between the clients & servers.  

Our C* application is a QA test results system.  We have thousands of machines in-house which we use to test the software (not C* related) which we sell, and we’re using C* to capture the results of those tests.

So the flow is:
On each machine (~2500):
… we run tests (~5-20 per machine)
… each test has ~8 steps
… each step makes a connection to the DB, logs the start time to C*, invokes the test script that runs the step as a subprocess (2 mins to 20 hours — no C* usage during this part), and then captures the result to C* (end time, exit status, etc.).

Today we’re not disconnecting from C* during the “run the step” part, and we’re getting OperationTimedOut errors as we scale up the number of tests running against our C* application.  My theory is that we’re overwhelming C* with the sheer number of (mostly idle) connections to our 3-node cluster.
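A rough sanity check of that theory (my arithmetic, not measured): each open Session holds about one connection to every node, so the ~5K established sockets seen per node correspond to roughly 5K open sessions cluster-wide, i.e.:

    sessions_cluster_wide ≈ 5000              (one socket per session per node)
    sessions_per_machine  ≈ 5000 / 2500 ≈ 2   concurrently open sessions per machine

which lines up with the 1-10 connections per machine mentioned in my first mail.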

I’m hoping someone has seen this sort of problem and can say “Yeah, that’s too many connections — I’m sure that’s your problem.”  or “We regularly make 12M connections per C* node — you’re screwed up in some other way — have you checked file descriptor limits?  What’s your Java __whatever__ setting?”  etc.

thanks Kurt.  :-)

- Max

> On Dec 14, 2017, at 6:19 am, kurt greaves <kurt@instaclustr.com <ma...@instaclustr.com>> wrote:
> 
> I see timeouts and I immediately blame firewalls. Have you triple checked them?
> Is this only occurring to a subset of clients?
> 
> Also, 3.0.6 is pretty dated and has many bugs; you should definitely upgrade to the latest 3.0 (don't forget to read NEWS.txt).


Re: Lots of simultaneous connections?

Posted by kurt greaves <ku...@instaclustr.com>.
I see timeouts and I immediately blame firewalls. Have you triple checked
them?
Is this only occurring to a subset of clients?

Also, 3.0.6 is pretty dated and has many bugs; you should definitely
upgrade to the latest 3.0 (don't forget to read NEWS.txt).

RE: [EXTERNAL] Lots of simultaneous connections?

Posted by "Durity, Sean R" <SE...@homedepot.com>.
Have you determined if a specific query is the one getting timed out? It is possible that the query/data model does not scale well, especially if you are trying to do something like a full table scan.

It is also possible that your OS settings will limit the number of connections to the host. Do you see any TIME_WAIT connections in netstat? I would agree that 5,000 connections per host seems on the high side. Each one requires resources, like memory, so reducing connections is a good idea.
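A quick way to check both of those on a node, as a sketch (assumes Linux, the default 9042 native port, and that the JVM's main class is CassandraDaemon):

    import subprocess
    from pathlib import Path

    # TIME_WAIT sockets left behind by short-lived client connections.
    out = subprocess.run(
        ["ss", "-tn", "state", "time-wait", "( sport = :9042 )"],
        capture_output=True, text=True, check=True,
    ).stdout
    print("TIME_WAIT on 9042:", max(len(out.splitlines()) - 1, 0))

    # Open file descriptors vs. the per-process limit for the Cassandra JVM.
    pid = subprocess.run(["pgrep", "-f", "CassandraDaemon"],
                         capture_output=True, text=True).stdout.split()[0]
    print("open fds:", len(list(Path(f"/proc/{pid}/fd").iterdir())))
    print([l for l in Path(f"/proc/{pid}/limits").read_text().splitlines()
           if "open files" in l][0])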


Sean Durity


