Posted to commits@cassandra.apache.org by "Jon Haddad (Jira)" <ji...@apache.org> on 2019/12/23 23:12:00 UTC

[jira] [Commented] (CASSANDRA-15464) Inserts to set slow due to AtomicBTreePartition for ComplexColumnData.dataSize

    [ https://issues.apache.org/jira/browse/CASSANDRA-15464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002533#comment-17002533 ] 

Jon Haddad commented on CASSANDRA-15464:
----------------------------------------

I've pushed a new workload to tlp-stress that hammers on the set type, which should help with this issue: https://github.com/thelastpickle/tlp-stress/issues/122. It will be in the next release.


> Inserts to set<text> slow due to AtomicBTreePartition for ComplexColumnData.dataSize
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15464
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15464
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Eric Jacobsen
>            Priority: Normal
>
> Concurrent inserts to set<text> can cause client timeouts and excessive CPU due to the compare-and-swap loop in AtomicBTreePartition, which recomputes ComplexColumnData.dataSize on every attempt. As the set grows longer, each attempt takes longer, so the probability that a compare succeeds before a concurrent writer invalidates it decreases.
> The problem we saw in production was with insertions into a set<text> whose length ranged from hundreds to thousands of elements. Given the semantics of what we store in the set, we had not anticipated lengths of more than about 10. (Almost all rows have length <= 6; the largest observed was 7032. Total number of rows < 4000. 3 machines were used.)
> The bad behavior we saw was all machines going to 100% CPU on all cores while clients timed out. Our immediate mitigation in production was adding more machines (from 3 to 6). The stack included partitions.AtomicBTreePartition.addAllWithSizeDelta … ComplexColumnData.dataSize.
> The AtomicBTreePartition code uses a compare-and-swap approach, but the time between compares depends on the length of the set. When the set is long and updates are concurrent, each loop iteration is unlikely to make forward progress, and threads can spin for extended periods.
> Here is one example call stack:
> {noformat}
> "SharedPool-Worker-40" #167 daemon prio=10 os_prio=0 tid=0x00007f9bb4032800 nid=0x2ee5 runnable [0x00007f9b067f4000]
> java.lang.Thread.State: RUNNABLE
> at org.apache.cassandra.db.rows.ComplexColumnData.dataSize(ComplexColumnData.java:114)
> at org.apache.cassandra.db.rows.BTreeRow.dataSize(BTreeRow.java:373)
> at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:292)
> at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:235)
> at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:159)
> at org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:73)
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:181)
> at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:155)
> at org.apache.cassandra.db.Memtable.put(Memtable.java:254)
> at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1204)
> at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
> at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:384)
> at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205)
> at org.apache.cassandra.hints.Hint.applyFuture(Hint.java:99)
> at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:95)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
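The retry pattern described above can be illustrated with a small standalone sketch (not Cassandra code; all names here are hypothetical): each CAS attempt pays an O(n) size recomputation over the stored collection, so the work wasted on every failed compare grows with the set length, which is what makes long sets so much more contention-prone than short ones.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a CAS-guarded update whose per-attempt cost grows
// with the size of the stored collection, mimicking how AtomicBTreePartition
// re-derives ComplexColumnData.dataSize on every retry of its CAS loop.
public class CasCostSketch {
    static final AtomicReference<String[]> SET = new AtomicReference<>(new String[0]);

    // Add one element. Every attempt performs an O(n) "dataSize" scan before
    // the compareAndSet, so a lost race throws away work proportional to the
    // current set length and retries the whole scan.
    static void addElement(String e) {
        while (true) {
            String[] cur = SET.get();
            long size = 0;
            for (String s : cur) size += s.length();   // O(n) scan per attempt
            String[] next = Arrays.copyOf(cur, cur.length + 1);
            next[cur.length] = e;
            if (SET.compareAndSet(cur, next)) return;  // lost race: loop again
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) addElement("v" + i);
        System.out.println(SET.get().length);  // prints 1000
    }
}
```

With several threads calling addElement concurrently, the expected number of wasted O(n) scans per success rises with both thread count and set length, matching the livelock-like CPU burn reported above.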
> In a test program that reproduces the problem, we raise the number of concurrent users and lower the think time between queries. Updates to elements of short sets complete without errors, but with long sets clients time out with errors, there are periods where all cores sit at 99.x% CPU, and jstack shows the time going to ComplexColumnData.dataSize.
> Here is the schema. Our long-term application solution was to make the set elements part of the primary key and avoid using set<text>, thus guaranteeing the code does not go through ComplexColumnData.dataSize.
> {noformat}
> CREATE TABLE x.x (
>  x int PRIMARY KEY,
>  y set<text> ) ...
> {noformat}
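The workaround described above might look like the following (a sketch with hypothetical table and column names, not the reporter's actual schema): each former set element becomes its own row via a clustering column, so concurrent inserts append independent rows instead of repeatedly rewriting one multi-cell set value under AtomicBTreePartition.

```sql
-- Hypothetical reworked schema: y_elem replaces the set<text> column.
-- Element uniqueness is enforced by the primary key, and per-element
-- inserts no longer touch ComplexColumnData.dataSize.
CREATE TABLE x.x2 (
    x int,
    y_elem text,
    PRIMARY KEY (x, y_elem)
);

-- Reading all "set" elements for a partition:
-- SELECT y_elem FROM x.x2 WHERE x = ?;
```

The trade-off is that reading the whole "set" becomes a multi-row partition scan rather than a single-cell read, which is usually acceptable when write contention is the bottleneck.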



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org