Posted to user@cassandra.apache.org by Luís Ferreira <za...@gmail.com> on 2012/05/05 21:14:22 UTC

Timeout Exception in get_slice

Hi, 

I'm doing get_slice on huge rows (3 million columns) and even though I am doing it iteratively I keep getting TimeoutExceptions. I've tried to change the number of columns fetched but it did not work. 

I have a 5-machine cluster, each node with 4 GB of RAM, of which 3 GB are dedicated to Cassandra's heap, but the nodes still consume all of their memory and show huge I/O wait due to the amount of reads.

I am running tests with 100 clients, all performing multiple operations (mostly get_slice, get and multi_get), but the timeouts only occur in the get_slice calls.

Does this have anything to do with Cassandra's ability (or lack thereof) to keep the rows in memory? Or am I doing anything wrong? Any tips?

Regards,
Luís Ferreira




Re: Timeout Exception in get_slice

Posted by Luís Ferreira <za...@gmail.com>.
The multi get batches range from 100 to 200.

The tests I'm running need to do get_slices and the multigets on those results. I can't turn either of them off.

I was only setting 16 threads for reading, but I'll boost it up to 32 and see what happens.

On May 9, 2012, at 11:03 AM, aaron morton wrote:

> How big are the multi get batches?
> 
> How do the wide row get_slice calls behave when the multi gets are not running?
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com

Regards,
Luís Ferreira




Re: Timeout Exception in get_slice

Posted by aaron morton <aa...@thelastpickle.com>.
How big are the multi get batches?

How do the wide row get_slice calls behave when the multi gets are not running?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 9/05/2012, at 1:47 AM, Luís Ferreira wrote:

> Maybe one of the problems is how I am paging. I read the columns in a row, and the rows themselves, in batches: I set the count attribute in the SliceRange and advance the start column on each call (or, for rows, the start key in the KeyRange). According to your blog post, reading millions of rows/columns by start key has a lot of latency, but how else can I read an entire row that does not fit into memory?
> 
> I'll have to run some tests again and check the tpstats. Still, do you think that adding more machines to the cluster will help a lot? I ask because I started with a 3-node cluster and have scaled to a 5-node cluster with little improvement.
> 
> Thanks anyway.
> 
> Regards,
> Luís Ferreira
> 
> 
> 


Re: Timeout Exception in get_slice

Posted by Luís Ferreira <za...@gmail.com>.
Maybe one of the problems is how I am paging. I read the columns in a row, and the rows themselves, in batches: I set the count attribute in the SliceRange and advance the start column on each call (or, for rows, the start key in the KeyRange). According to your blog post, reading millions of rows/columns by start key has a lot of latency, but how else can I read an entire row that does not fit into memory?
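
To make the paging pattern concrete, here is a minimal sketch of it in Python. It does not talk to a cluster: `fetch_slice` is a hypothetical stand-in for a get_slice call with a SliceRange (start, count), and the row is just an in-memory dict with made-up column names. It assumes the start column of a SliceRange is inclusive, which is why every page after the first drops its first column.

```python
from bisect import bisect_left

# Stand-in for one wide row: column names in sorted order, the way a row's
# columns are ordered on the server. Illustrative data only.
ROW = {"col%07d" % i: i for i in range(2500)}
SORTED_NAMES = sorted(ROW)


def fetch_slice(start, count):
    """Hypothetical stand-in for get_slice with SliceRange(start, count).

    Returns up to `count` (name, value) pairs whose name is >= start,
    i.e. the start column is inclusive.
    """
    first = bisect_left(SORTED_NAMES, start)
    return [(name, ROW[name]) for name in SORTED_NAMES[first:first + count]]


def iterate_row(page_size):
    """Yield every column of the row, fetching one page at a time."""
    if page_size < 2:
        # With an inclusive start, a page of 1 would never make progress.
        raise ValueError("page_size must be at least 2")
    start = ""          # empty start means "from the beginning of the row"
    first_page = True
    while True:
        page = fetch_slice(start, page_size)
        # Pages after the first begin with the column already returned.
        columns = page if first_page else page[1:]
        for column in columns:
            yield column
        if len(page) < page_size:
            break       # a short page means the row is exhausted
        start = page[-1][0]
        first_page = False


columns = list(iterate_row(page_size=1000))
print(len(columns))
```

Only one page is held in memory at a time, so the row never has to fit in memory on the client; the cost is one round trip per page.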

I'll have to run some tests again and check the tpstats. Still, do you think that adding more machines to the cluster will help a lot? I ask because I started with a 3-node cluster and have scaled to a 5-node cluster with little improvement.

Thanks anyway.

On May 8, 2012, at 9:54 AM, aaron morton wrote:

> If I was rebuilding my power after spending the first thousand years of the Third Age as a shapeless evil I would cast my Eye of Fire in the direction of the filthy little multi_gets. 
> 
> A node can fail to respond to a query within rpc_timeout for two reasons: either the command did not run, or the command started but did not complete. The former is much more likely. If it is happening, you will see large pending counts and dropped messages in nodetool tpstats, and you will also see log entries about dropped messages.
> 
> When you send a multi_get, each row you request becomes a message in the read thread pool. If you request 100 rows, you will put 100 messages in the pool, which by default has 32 threads. If some clients are sending large multi gets (or batch mutations), you can overload nodes and starve other clients.
> 
> For background, there are some metrics for selecting from 10 million columns here: http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
> 
> Hope that helps. 
> 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com

Regards,
Luís Ferreira




Re: Timeout Exception in get_slice

Posted by aaron morton <aa...@thelastpickle.com>.
If I was rebuilding my power after spending the first thousand years of the Third Age as a shapeless evil I would cast my Eye of Fire in the direction of the filthy little multi_gets. 

A node can fail to respond to a query within rpc_timeout for two reasons: either the command did not run, or the command started but did not complete. The former is much more likely. If it is happening, you will see large pending counts and dropped messages in nodetool tpstats, and you will also see log entries about dropped messages.

When you send a multi_get, each row you request becomes a message in the read thread pool. If you request 100 rows, you will put 100 messages in the pool, which by default has 32 threads. If some clients are sending large multi gets (or batch mutations), you can overload nodes and starve other clients.
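
As a rough worked example of that arithmetic, the sketch below takes the figures from this thread (100 clients, 100 rows per multi_get, 32 read threads) and counts how many read messages would be waiting in the worst case. It is a deliberately crude upper bound: it assumes every client fires at the same instant and that every read lands on one node, ignoring how keys are spread over the cluster. The `chunked` helper is a hypothetical illustration of splitting a key list into smaller batches on the client; whether smaller batches help in this case is not something the thread has tested.

```python
import math


def queued_reads(clients, rows_per_multiget, read_threads=32):
    """Worst-case read messages for one simultaneous burst of multi_gets.

    Returns (total messages, messages left waiting once every thread is
    busy, number of thread-pool "waves" needed to drain them).
    """
    messages = clients * rows_per_multiget
    pending = max(0, messages - read_threads)
    waves = math.ceil(messages / read_threads)
    return messages, pending, waves


def chunked(keys, size):
    """Split a list of row keys into batches of at most `size` keys."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


messages, pending, waves = queued_reads(clients=100, rows_per_multiget=100)
print(messages, pending, waves)

# A 250-key request split into batches of at most 100 keys.
batches = chunked(list(range(250)), 100)
print([len(batch) for batch in batches])
```

A get_slice that arrives behind a queue like this has to wait its turn, which is how it can time out without ever having started.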

For background, there are some metrics for selecting from 10 million columns here: http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/

Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/05/2012, at 7:14 AM, Luís Ferreira wrote:

> Hi, 
> 
> I'm doing get_slice on huge rows (3 million columns) and even though I am doing it iteratively I keep getting TimeoutExceptions. I've tried to change the number of columns fetched but it did not work. 
> 
> I have a 5-machine cluster, each node with 4 GB of RAM, of which 3 GB are dedicated to Cassandra's heap, but the nodes still consume all of their memory and show huge I/O wait due to the amount of reads.
> 
> I am running tests with 100 clients, all performing multiple operations (mostly get_slice, get and multi_get), but the timeouts only occur in the get_slice calls.
> 
> Does this have anything to do with Cassandra's ability (or lack thereof) to keep the rows in memory? Or am I doing anything wrong? Any tips?
> 
> Regards,
> Luís Ferreira
> 
> 
> 
>