Posted to user@cassandra.apache.org by Henrik Schröder <sk...@gmail.com> on 2012/06/14 16:38:53 UTC

Random slow connects.

Hi everyone,

We have a problem with our Cassandra cluster: sometimes it takes several
seconds to open a new Thrift connection to the server. We had this issue
when we ran on Windows, and we have it now that we run on Ubuntu. We had it
with our old networking setup, and we have it with our new networking setup,
where we're running over a dedicated gigabit network. Normally establishing
a new connection is instant, but once in a while it seems like the server is
not accepting any new connections until three seconds have passed.

We're of course running a connection-pooling client which mitigates this,
since once a connection is established, it's rock solid.

We tried switching the rpc_server_type to hsha, but that seems to have made
the problem worse; we're seeing more connection timeouts since the change.

For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu. Our
connection pool is configured to abort a connection attempt after two
seconds, and each connection lives for six hours before it's recycled.
Under current load we do about 500 writes/s and 100 reads/s. We have 20
clients, but each has a very small connection pool of maybe up to 5
simultaneous connections against each Cassandra server. We see these
connection issues maybe once a day, but always at random intervals.

We've tried to get more information through DataStax OpsCenter, the JMX
console, and our own application monitoring and logging, but we can't see
anything out of the ordinary. Sometimes, seemingly at random, it's just
really slow to connect. We're all out of ideas. Does anyone here have
suggestions on where to look and what to do next?


/Henrik
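
A rough back-of-envelope estimate, not part of the original post: assuming
every pool is full (20 clients with up to 5 connections each per server, so
an upper bound) and connections are only opened when one is recycled after
six hours, the number of new connections per server per day can be sketched
as follows. The variable names are illustrative.

```python
# Upper-bound estimate of new connections per Cassandra server per day,
# assuming every client pool is full and connections are only opened on recycle.
clients = 20                 # from the post
max_conns_per_client = 5     # "maybe up to 5" per server, so an upper bound
recycle_hours = 6            # each connection lives six hours

conns_per_server = clients * max_conns_per_client
recycles_per_day = 24 // recycle_hours
new_conns_per_day = conns_per_server * recycles_per_day

print(conns_per_server, new_conns_per_day)
```

Under those assumptions, one slow connect per day would correspond to roughly
one in a few hundred connection attempts per server, which may help explain
why it is hard to reproduce on demand.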

Re: How to speed up data loading

Posted by Tupshin Harper <tu...@tupshin.com>.

Any chance your server has been running for the last two weeks with the
leap second bug?
http://www.datastax.com/dev/blog/linux-cassandra-and-saturdays-leap-second-problem

-Tupshin
On Jul 12, 2012 1:43 PM, "Leonid Ilyevsky" <li...@mooncapital.com>
wrote:

> I am loading a large set of data into a CF with a composite key. The load
> is going pretty slowly, hundreds or even thousands of times slower than it
> would in an RDBMS.
>
> I have a choice of how granular my physical key (the first component of
> the primary key) is, so I can balance between smaller rows with many
> keys and wide rows with fewer keys. What are the guidelines about this?
> How does the width of the physical row affect the speed of the load?
>
> I see that Cassandra is doing a lot of processing behind the scenes; even
> when I kill the client, the server is still consuming a lot of CPU for a
> long time.
>
> What else should I look at? Anything in the configuration?

How to speed up data loading

Posted by Leonid Ilyevsky <li...@mooncapital.com>.

I am loading a large set of data into a CF with a composite key. The load is going pretty slowly, hundreds or even thousands of times slower than it would in an RDBMS.

I have a choice of how granular my physical key (the first component of the primary key) is, so I can balance between smaller rows with many keys and wide rows with fewer keys. What are the guidelines about this? How does the width of the physical row affect the speed of the load?

I see that Cassandra is doing a lot of processing behind the scenes; even when I kill the client, the server is still consuming a lot of CPU for a long time.

What else should I look at? Anything in the configuration?


Re: Random slow connects.

Posted by aaron morton <aa...@thelastpickle.com>.

You could also try adding some logging in the client to track down exactly where the delay is: whether it is in waiting for the socket to open on the server or, say, in managing the connection on the client side.

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
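
As an illustration of the kind of client-side logging suggested above, here
is a minimal sketch that times only the raw TCP connect, so that time spent
in pool bookkeeping is excluded. The function name `timed_connect` and the
`slow_threshold` parameter are hypothetical and not part of any Cassandra
client library.

```python
import logging
import socket
import time

log = logging.getLogger("connect_timing")

def timed_connect(host, port, timeout=2.0, slow_threshold=0.5):
    """Open a TCP connection and return (socket, seconds spent in connect).

    Only the raw socket connect is timed, so client-side pool management
    is not included in the measurement.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Covers refused connections as well as timeouts.
        log.warning("connect to %s:%d failed after %.3fs: %s",
                    host, port, time.monotonic() - start, exc)
        raise
    elapsed = time.monotonic() - start
    if elapsed >= slow_threshold:
        log.warning("slow connect to %s:%d: %.3fs", host, port, elapsed)
    return sock, elapsed
```

A call such as `timed_connect("10.0.0.1", 9160)` (a placeholder address, with
9160 being Cassandra's default Thrift rpc_port) could be logged next to the
pool's own acquire timing; if the pool is slow while the raw connect is fast,
the delay is more likely on the client side.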

On 15/06/2012, at 4:51 AM, Tyler Hobbs wrote:

> As a random guess, you might want to check your open file descriptor limit on the C* servers.  Use "cat /proc/<pid>/limits", where <pid> is the pid of the Cassandra process; it's the most reliable way to check this.

Re: Random slow connects.

Posted by Tyler Hobbs <ty...@datastax.com>.

As a random guess, you might want to check your open file descriptor limit
on the C* servers.  Use "cat /proc/<pid>/limits", where <pid> is the pid of
the Cassandra process; it's the most reliable way to check this.
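
The same check can be scripted. This is a minimal, Linux-only sketch that
reads the "Max open files" line from /proc/<pid>/limits and counts the
entries in /proc/<pid>/fd; the function names are illustrative, and reading
another user's process may require elevated permissions.

```python
import os

def open_files_limit(pid):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Line format: "Max open files  <soft>  <hard>  files"
                fields = line.split()
                return fields[3], fields[4]
    raise LookupError("no 'Max open files' line found")

def open_fd_count(pid):
    """Count the file descriptors the process currently holds."""
    return len(os.listdir(f"/proc/{pid}/fd"))

if __name__ == "__main__":
    pid = os.getpid()  # placeholder: substitute the Cassandra process pid
    soft, hard = open_files_limit(pid)
    print(f"pid {pid}: {open_fd_count(pid)} fds open, "
          f"soft limit {soft}, hard limit {hard}")
```

If the open descriptor count is close to the soft limit, new sockets may fail
or stall, which would be worth ruling out before looking further.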

On Thu, Jun 14, 2012 at 10:43 AM, Henrik Schröder <sk...@gmail.com> wrote:

> Hi Mina,
>
> The delay is not constant. In the absolute majority of cases, connecting
> is almost instant, but occasionally, connecting to a server takes a few
> seconds.
>
> We can't even reproduce it reliably. We can see in our server logs that
> sometimes, maybe a few times a day, maybe once every few days, a Cassandra
> server will be slow in accepting connections, and after a little while
> everything will be OK again. It's not a network saturation error, and it's
> not a CPU saturation error. Not even GC pauses.
>
> Has anyone else noticed something similar? Or is this simply a result of
> us running a tight connection pool which recycles connections every few
> hours and only waits a few seconds for a connection before timing out?
>
>
> /Henrik


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Random slow connects.

Posted by Henrik Schröder <sk...@gmail.com>.

Hi Mina,

The delay is not constant. In the absolute majority of cases, connecting is
almost instant, but occasionally, connecting to a server takes a few
seconds.

We can't even reproduce it reliably. We can see in our server logs that
sometimes, maybe a few times a day, maybe once every few days, a Cassandra
server will be slow in accepting connections, and after a little while
everything will be OK again. It's not a network saturation error, and it's
not a CPU saturation error. Not even GC pauses.

Has anyone else noticed something similar? Or is this simply a result of us
running a tight connection pool which recycles connections every few hours
and only waits a few seconds for a connection before timing out?


/Henrik

On Thu, Jun 14, 2012 at 4:54 PM, Mina Naguib
<mi...@bloomdigital.com> wrote:

> Have you ruled out potential non-Cassandra causes?
>
> A constant 3 seconds sounds like it could be a timeout/retry somewhere.
> Do you contact Cassandra via a hostname or an IP address? If via hostname,
> rule out DNS.
>
> Either way, I'd fire up tcpdump on both the client and the server, and
> observe the TCP handshake. Specifically, see whether the SYN packet is sent
> and received, whether the SYN-ACK is sent back right away and received, and
> likewise the final ACK.
>
> If that looks good, then TCP-wise you're in good shape and the problem is
> in a higher layer (Thrift). If not, see where the delay/drop/retry
> happens. If it's in the first packet, it may be a networking/routing
> issue. If in the second, it may be capacity at the server (investigate
> with lsof/netstat/JMX), etc.

Re: Random slow connects.

Posted by Mina Naguib <mi...@bloomdigital.com>.

Have you ruled out potential non-Cassandra causes?

A constant 3 seconds sounds like it could be a timeout/retry somewhere. Do you contact Cassandra via a hostname or an IP address? If via hostname, rule out DNS.

Either way, I'd fire up tcpdump on both the client and the server, and observe the TCP handshake. Specifically, see whether the SYN packet is sent and received, whether the SYN-ACK is sent back right away and received, and likewise the final ACK.

If that looks good, then TCP-wise you're in good shape and the problem is in a higher layer (Thrift). If not, see where the delay/drop/retry happens. If it's in the first packet, it may be a networking/routing issue. If in the second, it may be capacity at the server (investigate with lsof/netstat/JMX), etc.
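
Before reaching for tcpdump, the DNS question can be checked from the client
with a small sketch like the one below, which times name resolution
separately from the TCP connect. It cannot show individual SYN or SYN-ACK
packets (a packet capture is still needed for that), and the function name
`time_resolve_and_connect` is hypothetical.

```python
import socket
import time

def time_resolve_and_connect(host, port, timeout=5.0):
    """Return (dns_seconds, connect_seconds) for one connection attempt."""
    t0 = time.monotonic()
    # Resolve first so that DNS latency is measured on its own.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    t1 = time.monotonic()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)  # returns once the TCP handshake completes
        t2 = time.monotonic()
    return t1 - t0, t2 - t1
```

If the first number is occasionally large, the delay is in name resolution;
if the second is, the handshake itself is slow and a capture on both ends
should show which packet is delayed or retried.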