Posted to user@cassandra.apache.org by "Petter. Andreas" <a....@seeburger.de> on 2015/07/23 12:55:14 UTC

Slow performance because of used-up "Waste" in AtomicBTreeColumns

Hello everyone,

we are experiencing performance problems with Cassandra: overload effects (dropped mutations and nodes dropping out) under the following workload:

create table test (year bigint, spread bigint, time bigint, batchid bigint, value set<text>, primary key ((year, spread), time, batchid))
We insert data using an UPDATE statement (the "+" operator to merge the sets). Data _is_being_ordered_ before the mutation is executed on the session. The number of inserts ranges from 400k to a few million.
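For concreteness, the merge-on-write statement has roughly this shape (the literal values and set elements here are illustrative only; the real values come from the job):

```sql
-- Appends the given elements to the existing set rather than replacing it.
-- Note that all rows of one (year, spread) partition land in the same
-- memtable partition, so they contend on the same AtomicBTreeColumns.
UPDATE test SET value = value + {'a', 'b'}
WHERE year = 2015 AND spread = 7 AND time = 1437652514 AND batchid = 42;
```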

Originally we were using scalding/summingbird and thought the problem was in our Cassandra storage code. To test that, I wrote a simple Cascading Hadoop job (not using BulkOutputFormat, but the DataStax driver). I was a little surprised to still see Cassandra _overload_ (3 reducers/Hadoop writers and 3 co-located Cassandra nodes, as well as a setup with 4/4 nodes). The internal reason seems to be that many worker threads go into state BLOCKED in AtomicBTreeColumns.addAllWithSizeDelta, because something called "waste" is used up and Cassandra switches to pessimistic locking.

However, I rewrote the job using plain Hadoop mapred (without Cascading) but using the same storage abstraction for writing, and Cassandra _did_not_overload_: the job has the great write performance I'm used to (and threads do not go into state BLOCKED). We're totally lost and puzzled.

So I have a few questions:
1. What is this "waste" used for? Is it a way of braking (throttling) or load shedding? Why is locking being used in AtomicBTreeColumns?
2. Is it OK to order columns before inserts are performed?
3. What could be the reason that "waste" is being used up in the cascading job but not in the plain Hadoop job (sort order?)?
4. Is there any way to avoid using up "waste" (other than adding nodes, which does not seem to be the answer, since the plain Hadoop job runs Cassandra-"friendly")?

thanks in advance,
regards,
Andi

SEEBURGER AG
Edisonstr. 1
D-75015 Bretten
Tel.: 07252 / 96 - 0
Fax: 07252 / 96 - 2222
Internet: http://www.seeburger.de
e-mail: info@seeburger.de

Sitz der Gesellschaft/Registered Office: Bretten
Vorstand/SEEBURGER Executive Board: Bernd Seeburger, Axel Haas, Michael Kleeberg, Friedemann Heinz, Dr. Martin Kuntz, Matthias Feßenbecker
Vorsitzende des Aufsichtsrats/Chairperson of the SEEBURGER Supervisory Board: Prof. Dr. Simone Zeuchner
Registergericht/Commercial Register: HRB 240708 Mannheim




This email is intended only for the recipient(s) to whom it is addressed. This email may contain confidential material that may be protected by professional secrecy. Any fact or opinion contained, or expression of the material herein, does not necessarily reflect that of SEEBURGER AG. If you are not the addressee or if you have received this email in error, any use, publication or distribution including forwarding, copying or printing is strictly prohibited. Neither SEEBURGER AG, nor the sender (Petter. Andreas) accept liability for viruses; it is your responsibility to check this email and its attachments for viruses.


Re: Slow performance because of used-up "Waste" in AtomicBTreeColumns

Posted by Graham Sanderson <gr...@vast.com>.
Multiple writes to a single partition key are guaranteed to be atomic, so there has to be some form of concurrency protection.

First rule of thumb: don't write at insanely high rates to the same partition key concurrently. (You can probably avoid this, but hints as currently implemented suffer from it because the partition key is the node id - that will be fixed in 3.0; OpsCenter also does fast burst inserts of per-node data.)

The general strategy taken is optimistic concurrency: each thread makes its own sub-copy of the tree from the root down to the inserted data, sharing existing nodes where possible, and then tries to CAS the new tree into place. The problem under very high concurrency is that a huge amount of work is done and memory allocated (if you are doing lots of writes to the same partition, the whole memtable may be one AtomicBTreeColumns), only to have the CAS fail and the thread have to start over.
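As a toy sketch of that optimistic path (all names here are mine, not Cassandra's actual classes; a real BTree would share unchanged subtrees instead of copying the whole map), each writer builds a new immutable version of the partition state and CASes it in, retrying on failure:

```java
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: optimistic copy-on-write with a CAS retry loop.
class OptimisticPartition {
    // Immutable snapshot of one partition's columns.
    static final class Snapshot {
        final TreeMap<String, String> columns;
        Snapshot(TreeMap<String, String> c) { columns = c; }
    }

    private final AtomicReference<Snapshot> ref =
        new AtomicReference<>(new Snapshot(new TreeMap<>()));

    long wastedBytes = 0; // rough stand-in for the "waste" estimate

    void put(String name, String value) {
        while (true) {
            Snapshot cur = ref.get();
            // Copy-on-write: allocate a fresh version with the update applied.
            TreeMap<String, String> copy = new TreeMap<>(cur.columns);
            copy.put(name, value);
            if (ref.compareAndSet(cur, new Snapshot(copy))) return;
            // CAS lost: the copy we just built is garbage ("waste"); retry.
            wastedBytes += 16L * copy.size();
        }
    }

    int size() { return ref.get().columns.size(); }
}
```

Under heavy contention on one partition, many threads repeatedly build such copies and lose the race, which is exactly the wasted allocation being measured.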

Anyway, under this kind of load the failing CAS was giving effectively zero concurrency, but extremely high CPU usage (wastage) while allocating tens of gigabytes of garbage per second, leading to GC problems as well. So in 2.1 the AtomicBTreeColumns (which holds the state for one partition in the memtable) was altered to estimate the amount of memory it was wasting over time, and to flip to pessimistic locking once a threshold was exceeded. For simplicity, the decision was made not to flip back: if you are writing data that fast, the memtable, and hence the AtomicBTreeColumns, won't last long anyway.
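A minimal sketch of that one-way fallback (the class name, threshold value, and accounting here are illustrative, not Cassandra's actual code):

```java
// Hypothetical sketch of the 2.1-style behavior: once the estimated
// wasted allocation crosses a threshold, switch (one-way) to a lock.
class AdaptivePartition {
    private static final long WASTE_THRESHOLD = 1 << 20; // 1 MiB, illustrative
    private final Object lock = new Object();
    private volatile boolean pessimistic = false;
    private long wasted = 0;

    // Called after each failed CAS with the size of the discarded copy.
    void recordWaste(long bytes) {
        wasted += bytes;
        if (!pessimistic && wasted > WASTE_THRESHOLD) {
            pessimistic = true; // never flips back; the memtable is short-lived
        }
    }

    boolean isPessimistic() { return pessimistic; }

    void write(Runnable update) {
        if (pessimistic) {
            // Writers now queue up here, which is why threads show as BLOCKED.
            synchronized (lock) { update.run(); }
        } else {
            update.run(); // optimistic CAS path
        }
    }
}
```

Once the flip happens, contended writers serialize on the lock instead of burning CPU and memory on doomed copies, which matches the BLOCKED threads seen in addAllWithSizeDelta.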

There is a DEBUG-level log message in Memtable that alerts you when this is happening.

So the short answer is: don't do it. Maybe the trigger is a bit too sensitive for your needs, but it'd be interesting to know how many inserts per second you are doing when going FAST, and then to consider whether that sounds like a lot given that they are sorted by partition_key.

The longer-term answer, which Benedict suggested, is to have lazy writes under contention, which would be applied by the next uncontended write or repaired on read (or at flush). This was also a reason not to add a flag to turn the new behavior on/off, along with the fact that in testing we didn't manage to make it perform worse, but did get it to perform very much better. It also has no effect on uncontended writes.

> On Jul 23, 2015, at 5:55 AM, Petter. Andreas <a....@seeburger.de> wrote: