Posted to user@cassandra.apache.org by Maxim Potekhin <po...@bnl.gov> on 2012/01/04 22:13:50 UTC

Should I throttle deletes?

Now that my cluster appears to run smoothly and after a few successful
repairs and compacts, I'm back in the business of deletion of portions
of data based on its date of insertion. For reasons too lengthy to be
explained here, I don't want to use TTL.

I use a batch mutator in Pycassa to delete ~1M rows based on
a longish list of keys I'm extracting from an auxiliary CF (with no
problem of any sort).

Now, it appears that such a head-on delete puts a temporary
but large load on the cluster. I have SSDs and they go to 100%
utilization, and the CPU spikes to significant loads.

Does anyone do throttling on such a mass-delete procedure?
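For what it's worth, the throttling can live entirely on the client side. Below is a minimal Python sketch; `delete_keys_throttled` and `remove_batch` are hypothetical names, and in a real run `remove_batch` would wrap Pycassa's batch mutator (`cf.batch()`, `mutator.remove(key)`, `mutator.send()`), assuming the standard Pycassa API:

```python
import time

def delete_keys_throttled(keys, remove_batch, batch_size=100, pause=0.1):
    """Delete keys in small batches, sleeping between batches so the
    cluster never sees the whole mutation load at once.

    remove_batch is a callable taking a list of keys; against a live
    cluster it might look like (assuming the usual Pycassa API):

        def remove_batch(chunk):
            b = cf.batch(queue_size=len(chunk))
            for key in chunk:
                b.remove(key)
            b.send()
    """
    sent = 0
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        remove_batch(chunk)
        sent += len(chunk)
        time.sleep(pause)  # throttle: give flushing/compaction room to breathe
    return sent
```

Tuning `batch_size` down and `pause` up trades wall-clock time for a flatter load curve on the nodes.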

Thanks in advance,

Maxim


Re: Should I throttle deletes?

Posted by Maxim Potekhin <po...@bnl.gov>.
Thanks, that's quite helpful. I'm wondering though if multiplying the 
number of clients will end up doing the same thing.

On 1/5/2012 3:29 PM, Philippe wrote:
>
>     Then I do have a question, what do people generally use as the
>     batch size?
>
> I used to do batches from 500 to 2000 like you do.
> After investigating issues such as the one you've encountered I've 
> moved to batches of 20 for writes and 256 for reads. Everything is a 
> lot smoother: no more timeouts.
>
> The downside though is that I have to run more client threads in 
> parallel to maximize throughput.
>
> Cheers


Re: Should I throttle deletes?

Posted by Maxim Potekhin <po...@bnl.gov>.
Thanks, this makes sense. I'll try that.

Maxim

On 1/6/2012 10:51 AM, Vitalii Tymchyshyn wrote:
> Do you mean on writes? Yes, your timeout must be set so that a write 
> batch can complete before it elapses. But this will lower the write 
> load, so reads should not time out.
>
> Best regards, Vitalii Tymchyshyn
>
> 06.01.12 17:37, Philippe wrote:
>>
>> But you will then get timeouts.
>>
>> On 6 Jan 2012 15:17, "Vitalii Tymchyshyn" <tivv00@gmail.com 
>> <ma...@gmail.com>> wrote:
>>
>>     05.01.12 22:29, Philippe wrote:
>>>
>>>         Then I do have a question, what do people generally use as
>>>         the batch size?
>>>
>>>     I used to do batches from 500 to 2000 like you do.
>>>     After investigating issues such as the one you've encountered
>>>     I've moved to batches of 20 for writes and 256 for reads.
>>>     Everything is a lot smoother: no more timeouts.
>>>
>>     I'd rather reduce the mutation thread pool with the
>>     concurrent_writes setting. This will lower server load no matter
>>     how many clients are sending batches, while you still get good
>>     batching.
>>
>>     Best regards, Vitalii Tymchyshyn
>>
>


Re: Should I throttle deletes?

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.
Do you mean on writes? Yes, your timeout must be set so that a write 
batch can complete before it elapses. But this will lower the write 
load, so reads should not time out.

Best regards, Vitalii Tymchyshyn

06.01.12 17:37, Philippe wrote:
>
> But you will then get timeouts.
>
> On 6 Jan 2012 15:17, "Vitalii Tymchyshyn" <tivv00@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     05.01.12 22:29, Philippe wrote:
>>
>>         Then I do have a question, what do people generally use as
>>         the batch size?
>>
>>     I used to do batches from 500 to 2000 like you do.
>>     After investigating issues such as the one you've encountered
>>     I've moved to batches of 20 for writes and 256 for reads.
>>     Everything is a lot smoother: no more timeouts.
>>
>     I'd rather reduce the mutation thread pool with the
>     concurrent_writes setting. This will lower server load no matter
>     how many clients are sending batches, while you still get good
>     batching.
>
>     Best regards, Vitalii Tymchyshyn
>


Re: Should I throttle deletes?

Posted by Philippe <wa...@gmail.com>.
But you will then get timeouts.
On 6 Jan 2012 15:17, "Vitalii Tymchyshyn" <ti...@gmail.com> wrote:

> 05.01.12 22:29, Philippe wrote:
>>
>>     Then I do have a question, what do people generally use as the
>>     batch size?
>>
>> I used to do batches from 500 to 2000 like you do.
>> After investigating issues such as the one you've encountered I've
>> moved to batches of 20 for writes and 256 for reads. Everything is a
>> lot smoother: no more timeouts.
>>
> I'd rather reduce the mutation thread pool with the concurrent_writes
> setting. This will lower server load no matter how many clients are
> sending batches, while you still get good batching.
>
> Best regards, Vitalii Tymchyshyn
>

Re: Should I throttle deletes?

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.
05.01.12 22:29, Philippe wrote:
>
>     Then I do have a question, what do people generally use as the
>     batch size?
>
> I used to do batches from 500 to 2000 like you do.
> After investigating issues such as the one you've encountered I've 
> moved to batches of 20 for writes and 256 for reads. Everything is a 
> lot smoother: no more timeouts.
>
I'd rather reduce the mutation thread pool with the concurrent_writes 
setting. This will lower server load no matter how many clients are 
sending batches, while you still get good batching.
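For reference, concurrent_writes (and its read-side counterpart) is set in cassandra.yaml; the values below are only illustrative, not a recommendation:

```yaml
# cassandra.yaml (fragment) -- illustrative values
concurrent_reads: 32    # size of the read stage thread pool
concurrent_writes: 32   # size of the mutation stage thread pool;
                        # lowering this throttles how many batch
                        # mutations a node applies at once
```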

Best regards, Vitalii Tymchyshyn

Re: Should I throttle deletes?

Posted by Philippe <wa...@gmail.com>.
>
> Then I do have a question, what do people generally use as the batch size?
>
I used to do batches from 500 to 2000 like you do.
After investigating issues such as the one you've encountered I've moved to
batches of 20 for writes and 256 for reads. Everything is a lot smoother:
no more timeouts.

The downside though is that I have to run more client threads in parallel
to maximize throughput.
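That pattern (small batches, many client threads) can be sketched in Python as below; `delete_in_parallel` and `remove_batch` are hypothetical names, and in a real run each worker would draw its own connection from a Pycassa ConnectionPool:

```python
from concurrent.futures import ThreadPoolExecutor

def delete_in_parallel(keys, remove_batch, batch_size=20, workers=8):
    """Send small delete batches from several client threads at once.

    Small batches keep each server-side mutation burst short (fewer
    timeouts); the extra threads win back the throughput lost to the
    smaller batch size.
    """
    chunks = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

    def send(chunk):
        remove_batch(chunk)  # e.g. a per-thread pycassa batch mutator
        return len(chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() blocks until every batch has been sent
        return sum(pool.map(send, chunks))
```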

Cheers

Re: Should I throttle deletes?

Posted by Maxim Potekhin <po...@bnl.gov>.
Hello Aaron,

On 1/5/2012 4:25 AM, aaron morton wrote:
>> I use a batch mutator in Pycassa to delete ~1M rows based on
>> a longish list of keys I'm extracting from an auxiliary CF (with no
>> problem of any sort).
> What is the size of the deletion batches?

2000 mutations.


>
>> Now, it appears that such a head-on delete puts a temporary
>> but large load on the cluster. I have SSDs and they go to 100%
>> utilization, and the CPU spikes to significant loads.
> Does the load spike during the deletion or after it ?

During.


> Do any of the thread pools back up in nodetool tpstats during the load?

Haven't checked, thank you for the lead.

> I can think of a few general issues you may want to avoid:
>
> * Each row in a batch mutation is handled by a task in a thread pool 
> on the nodes. So if you send a batch to delete 1,000 rows it will put 
> 1,000 tasks in the Mutation stage. This will reduce the query throughput.

Aah. I didn't know that. I was under the impression that batching saves 
the communication overhead, and that's it.

Then I do have a question, what do people generally use as the batch size?

Thanks

Maxim



Re: Should I throttle deletes?

Posted by aaron morton <aa...@thelastpickle.com>.
> I use a batch mutator in Pycassa to delete ~1M rows based on
> a longish list of keys I'm extracting from an auxiliary CF (with no
> problem of any sort).
What is the size of the deletion batches?

> Now, it appears that such a head-on delete puts a temporary
> but large load on the cluster. I have SSDs and they go to 100%
> utilization, and the CPU spikes to significant loads.
Does the load spike during the deletion or after it?
Do any of the thread pools back up in nodetool tpstats during the load?

I can think of a few general issues you may want to avoid:

* Each row in a batch mutation is handled by a task in a thread pool on the nodes. So if you send a batch to delete 1,000 rows it will put 1,000 tasks in the Mutation stage. This will reduce the query throughput.
* Lots of deletes in a row will add overhead to reads on the row. 

You may want to check for excessive memtable flushing, but if you have 
default automatic memory management running, lots of deletes should not 
result in extra flushing.

Hope that helps
Aaron

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/01/2012, at 10:13 AM, Maxim Potekhin wrote:

> Now that my cluster appears to run smoothly and after a few successful
> repairs and compacts, I'm back in the business of deletion of portions
> of data based on its date of insertion. For reasons too lengthy to be
> explained here, I don't want to use TTL.
> 
> I use a batch mutator in Pycassa to delete ~1M rows based on
> a longish list of keys I'm extracting from an auxiliary CF (with no
> problem of any sort).
> 
> Now, it appears that such a head-on delete puts a temporary
> but large load on the cluster. I have SSDs and they go to 100%
> utilization, and the CPU spikes to significant loads.
> 
> Does anyone do throttling on such a mass-delete procedure?
> 
> Thanks in advance,
> 
> Maxim
>