Posted to user@cassandra.apache.org by Mohammed Guller <mo...@glassbeam.com> on 2014/12/11 21:43:01 UTC

batch_size_warn_threshold_in_kb

Hi -
The cassandra.yaml file has a property called batch_size_warn_threshold_in_kb.
Its default value is 5kb, and according to the comments in the yaml file, it is used to log a WARN message for any batch whose size exceeds this value in kilobytes. The comments caution against increasing this threshold, as doing so can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed

Re: batch_size_warn_threshold_in_kb

Posted by Shane Hansen <sh...@gmail.com>.
I don't know why 5kb was chosen.

The general trend is that larger batches will put more stress on the
coordinator node. The precise point at which
things fall over will vary.

On Thu, Dec 11, 2014 at 1:43 PM, Mohammed Guller <mo...@glassbeam.com>
wrote:

>   Hi –
>
> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
> *
>
> The default size is 5kb and according to the comments in the yaml file, it
> is used to log WARN on any batch size exceeding this value in kilobytes. It
> says caution should be taken on increasing the size of this threshold as it
> can lead to node instability.
>
>
>
> Does anybody know the significance of this magic number 5kb? Why would a
> higher number (say 10kb) lead to node instability?
>
>
>
> Mohammed
>

Re: batch_size_warn_threshold_in_kb

Posted by Ryan Svihla <rs...@datastax.com>.
Any insert, update, or delete counts as a mutation.
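
For illustration, a minimal sketch (hypothetical keyspace, table, and column
names, using the DataStax Java driver from Scala): each statement added to a
batch is one mutation, and it is the combined serialized size of all of them
that the batch_size_warn_threshold_in_kb check measures on the server.

  import com.datastax.driver.core.{BatchStatement, Cluster, SimpleStatement}

  val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()

  val batch = new BatchStatement()  // LOGGED by default
  batch.add(new SimpleStatement("INSERT INTO ks.events (id, body) VALUES (?, ?)", "a1", "payload"))
  batch.add(new SimpleStatement("UPDATE ks.events SET body = ? WHERE id = ?", "new payload", "a2"))
  batch.add(new SimpleStatement("DELETE FROM ks.events WHERE id = ?", "a3"))
  // Three mutations in one batch; the server warns if their total size exceeds the threshold.
  session.execute(batch)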

On Fri, Dec 12, 2014 at 1:31 AM, Jens Rantil <je...@tink.se> wrote:
>
> Maybe slightly off-topic, but what is a mutation? Is it equivalent to a
> CQL row? Or maybe a column in a row? Does include tombstones within the
> selected range?
>
> Thanks,
> Jens
>
>
>
> On Thu, Dec 11, 2014 at 9:56 PM, Ryan Svihla <rs...@datastax.com> wrote:
>
>> Nothing magic, just put in there based on experience. You can find the
>> story behind the original recommendation here
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>
>> Key reasoning for the desire comes from Patrick McFadden:
>>
>> "Yes that was in bytes. Just in my own experience, I don't recommend more
>> than ~100 mutations per batch. Doing some quick math I came up with 5k as
>> 100 x 50 byte mutations.
>>
>> Totally up for debate."
>>
>> It's totally changeable, however, it's there in no small part because so
>> many people confuse the BATCH keyword as a performance optimization, this
>> helps flag those cases of misuse.
>>
>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <mo...@glassbeam.com>
>> wrote:
>>>
>>>   Hi –
>>>
>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>> *
>>>
>>> The default size is 5kb and according to the comments in the yaml file,
>>> it is used to log WARN on any batch size exceeding this value in kilobytes.
>>> It says caution should be taken on increasing the size of this threshold as
>>> it can lead to node instability.
>>>
>>>
>>>
>>> Does anybody know the significance of this magic number 5kb? Why would a
>>> higher number (say 10kb) lead to node instability?
>>>
>>>
>>>
>>> Mohammed
>>>
>>
>>
>


Re: batch_size_warn_threshold_in_kb

Posted by Jens Rantil <je...@tink.se>.
Maybe slightly off-topic, but what is a mutation? Is it equivalent to a CQL row? Or maybe a column in a row? Does it include tombstones within the selected range?

Thanks,
Jens

On Thu, Dec 11, 2014 at 9:56 PM, Ryan Svihla <rs...@datastax.com> wrote:

> Nothing magic, just put in there based on experience. You can find the
> story behind the original recommendation here
> https://issues.apache.org/jira/browse/CASSANDRA-6487
> Key reasoning for the desire comes from Patrick McFadden:
> "Yes that was in bytes. Just in my own experience, I don't recommend more
> than ~100 mutations per batch. Doing some quick math I came up with 5k as
> 100 x 50 byte mutations.
> Totally up for debate."
> It's totally changeable, however, it's there in no small part because so
> many people confuse the BATCH keyword as a performance optimization, this
> helps flag those cases of misuse.
> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <mo...@glassbeam.com>
> wrote:
>>
>>   Hi –
>>
>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>> *
>>
>> The default size is 5kb and according to the comments in the yaml file, it
>> is used to log WARN on any batch size exceeding this value in kilobytes. It
>> says caution should be taken on increasing the size of this threshold as it
>> can lead to node instability.
>>
>>
>>
>> Does anybody know the significance of this magic number 5kb? Why would a
>> higher number (say 10kb) lead to node instability?
>>
>>
>>
>> Mohammed
>>

Re: batch_size_warn_threshold_in_kb

Posted by Ryan Svihla <rs...@datastax.com>.
It's a rough observation and estimate, nothing more. In other words, some
clusters can handle more and some can't; it depends on how many writes per
second you're doing, cluster sizing, how far over that 5kb limit you are,
heap size, disk IO, CPU speed, and many more factors. This is why it's just
a warning and not an error, and why it's changeable.

There is no one perfect answer here, but I can safely say that in practice,
with today's hardware, I've not seen many clusters work well with batches
larger than 5kb.
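
If you want to keep batches under the threshold from the client side, one
rough approach is to split statements into sub-batches by approximate byte
size.  The sketch below is only an illustration: the 5 * 1024 figure and the
query-string-length estimate are assumptions, and the server's own accounting
of mutation size differs.

  import com.datastax.driver.core.{BatchStatement, SimpleStatement}
  import scala.collection.mutable.ListBuffer

  // Very rough per-statement size estimate: query string bytes only (ignores bound values).
  def approxBytes(st: SimpleStatement): Int =
    st.getQueryString.getBytes("UTF-8").length

  def splitIntoBatches(stmts: Seq[SimpleStatement],
                       limitBytes: Int = 5 * 1024): Seq[BatchStatement] = {
    val batches = ListBuffer(new BatchStatement(BatchStatement.Type.UNLOGGED))
    var running = 0
    stmts.foreach { st =>
      val size = approxBytes(st)
      // Start a new batch once adding this statement would cross the limit.
      if (running + size > limitBytes && batches.last.size() > 0) {
        batches += new BatchStatement(BatchStatement.Type.UNLOGGED)
        running = 0
      }
      batches.last.add(st)
      running += size
    }
    batches.toList
  }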


On Fri, Dec 12, 2014 at 1:12 AM, Mohammed Guller <mo...@glassbeam.com>
wrote:
>
>  Ryan,
>
> Thanks for the quick response.
>
>
>
> I did see that jira before posting my question on this list. However, I
> didn’t see any information about why 5kb+ data will cause instability. 5kb
> or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
> then with just 5 mutations, you will hit that threshold.
>
>
>
> In addition, Patrick is saying that he does not recommend more than 100
> mutations per batch. So why not warn users just on the # of mutations in a
> batch?
>
>
>
> Mohammed
>
>
>
> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
> *Sent:* Thursday, December 11, 2014 12:56 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: batch_size_warn_threshold_in_kb
>
>
>
> Nothing magic, just put in there based on experience. You can find the
> story behind the original recommendation here
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-6487
>
>
>
> Key reasoning for the desire comes from Patrick McFadden:
>
>
> "Yes that was in bytes. Just in my own experience, I don't recommend more
> than ~100 mutations per batch. Doing some quick math I came up with 5k as
> 100 x 50 byte mutations.
>
> Totally up for debate."
>
>
>
> It's totally changeable, however, it's there in no small part because so
> many people confuse the BATCH keyword as a performance optimization, this
> helps flag those cases of misuse.
>
>
>
> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <mo...@glassbeam.com>
> wrote:
>
> Hi –
>
> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
> *
>
> The default size is 5kb and according to the comments in the yaml file, it
> is used to log WARN on any batch size exceeding this value in kilobytes. It
> says caution should be taken on increasing the size of this threshold as it
> can lead to node instability.
>
>
>
> Does anybody know the significance of this magic number 5kb? Why would a
> higher number (say 10kb) lead to node instability?
>
>
>
> Mohammed
>
>
>
>
>
>
>



Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
Here is my test code; this was written as disposable code, so it's not
especially well documented, and it includes some chunks copied from
elsewhere in our stack, but hopefully it's readable.
https://gist.github.com/MightyE/1c98912fca104f6138fc

Here are some test runs after I reduced RF to 1 to introduce the effects of
proxying.  In the interest of total run time, I'm only running 10,000
records per run this time (but still 25 runs).  This is actually a bigger
percentage difference between single vs. batch than the results I got
yesterday (a 500% difference between strategies with RF=3, an 800% difference
between strategies with RF=1).


==== Execution Results for 25 runs of 10000 records =============
25 runs of 10,000 records (3 protos, 5 agents, ~15 per bucket) as *single statements*
Total Run Time
        futures test1 ((aid, bckt), proto, end) reverse order         = 8,336,113,981
        scatter test5 ((aid, bckt, end))                              = 8,434,901,305
        scatter test2 ((aid, bckt), end)                              = 8,464,319,637
        futures test3 ((aid, bckt), end, proto) reverse order         = 8,638,385,133
        futures test2 ((aid, bckt), end)                              = 8,684,263,854
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  = 8,708,544,870
        futures test4 ((aid, bckt), proto, end) no explicit ordering  = 8,861,009,724
        futures test5 ((aid, bckt, end))                              = 8,868,274,674
        scatter test3 ((aid, bckt), end, proto) reverse order         = 8,956,848,500
        scatter test1 ((aid, bckt), proto, end) reverse order         = 9,124,160,168
        parallel test2 ((aid, bckt), end)                             = 123,400,905,337
        parallel test4 ((aid, bckt), proto, end) no explicit ordering = 127,282,210,321
        parallel test1 ((aid, bckt), proto, end) reverse order        = 128,039,113,464
        parallel test5 ((aid, bckt, end))                             = 130,788,491,325
        parallel test3 ((aid, bckt), end, proto) reverse order        = 130,795,365,099
Fastest Run
        futures test1 ((aid, bckt), proto, end) reverse order         = 249,455,814
        futures test3 ((aid, bckt), end, proto) reverse order         = 252,083,763
        scatter test2 ((aid, bckt), end)                              = 267,123,409
        scatter test3 ((aid, bckt), end, proto) reverse order         = 268,881,765
        futures test2 ((aid, bckt), end)                              = 269,801,288
        scatter test1 ((aid, bckt), proto, end) reverse order         = 271,331,470
        futures test4 ((aid, bckt), proto, end) no explicit ordering  = 277,473,727
        scatter test5 ((aid, bckt, end))                              = 279,758,721
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  = 289,355,345
        futures test5 ((aid, bckt, end))                              = 292,192,770
        parallel test4 ((aid, bckt), proto, end) no explicit ordering = 4,457,096,575
        parallel test2 ((aid, bckt), end)                             = 4,481,373,458
        parallel test5 ((aid, bckt, end))                             = 4,488,146,940
        parallel test3 ((aid, bckt), end, proto) reverse order        = 4,514,067,019
        parallel test1 ((aid, bckt), proto, end) reverse order        = 4,559,735,697
Slowest Run
        scatter test2 ((aid, bckt), end)                              = 448,123,797
        futures test2 ((aid, bckt), end)                              = 451,813,924
        futures test1 ((aid, bckt), proto, end) reverse order         = 470,783,312
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  = 475,451,703
        futures test4 ((aid, bckt), proto, end) no explicit ordering  = 475,549,535
        scatter test5 ((aid, bckt, end))                              = 488,438,970
        futures test5 ((aid, bckt, end))                              = 512,183,494
        futures test3 ((aid, bckt), end, proto) reverse order         = 550,359,667
        scatter test3 ((aid, bckt), end, proto) reverse order         = 552,363,684
        scatter test1 ((aid, bckt), proto, end) reverse order         = 891,138,017
        parallel test2 ((aid, bckt), end)                             = 5,934,352,510
        parallel test4 ((aid, bckt), proto, end) no explicit ordering = 5,956,012,555
        parallel test3 ((aid, bckt), end, proto) reverse order        = 6,425,883,376
        parallel test1 ((aid, bckt), proto, end) reverse order        = 7,171,289,389
        parallel test5 ((aid, bckt, end))                             = 7,435,430,002

==== Execution Results for 25 runs of 10000 records =============
25 runs of 10,000 records (3 protos, 5 agents, ~15 per bucket) in *batches of 100*
Total Run Time
        futures test1 ((aid, bckt), proto, end) reverse order         = 1,052,817,240
        futures test3 ((aid, bckt), end, proto) reverse order         = 1,172,170,041
        scatter test3 ((aid, bckt), end, proto) reverse order         = 1,288,773,642
        futures test2 ((aid, bckt), end)                              = 1,349,688,669
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  = 1,540,364,041
        scatter test5 ((aid, bckt, end))                              = 1,726,854,978
        futures test5 ((aid, bckt, end))                              = 1,937,696,565
        scatter test2 ((aid, bckt), end)                              = 1,977,232,999
        futures test4 ((aid, bckt), proto, end) no explicit ordering  = 2,025,161,088
        scatter test1 ((aid, bckt), proto, end) reverse order         = 2,219,169,805
        parallel test2 ((aid, bckt), end)                             = 4,046,119,649
        parallel test3 ((aid, bckt), end, proto) reverse order        = 4,125,620,234
        parallel test4 ((aid, bckt), proto, end) no explicit ordering = 4,216,984,290
        parallel test1 ((aid, bckt), proto, end) reverse order        = 5,168,760,200
        parallel test5 ((aid, bckt, end))                             = 6,523,329,362
Fastest Run
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  = 28,186,746
        scatter test1 ((aid, bckt), proto, end) reverse order         = 28,669,691
        scatter test3 ((aid, bckt), end, proto) reverse order         = 28,759,003
        futures test1 ((aid, bckt), proto, end) reverse order         = 28,779,728
        futures test3 ((aid, bckt), end, proto) reverse order         = 28,965,161
        scatter test2 ((aid, bckt), end)                              = 29,538,146
        futures test2 ((aid, bckt), end)                              = 29,698,726
        futures test4 ((aid, bckt), proto, end) no explicit ordering  = 29,985,012
        futures test5 ((aid, bckt, end))                              = 45,124,539
        scatter test5 ((aid, bckt, end))                              = 45,457,721
        parallel test1 ((aid, bckt), proto, end) reverse order        = 117,402,203
        parallel test3 ((aid, bckt), end, proto) reverse order        = 118,801,014
        parallel test4 ((aid, bckt), proto, end) no explicit ordering = 120,824,212
        parallel test2 ((aid, bckt), end)                             = 123,429,148
        parallel test5 ((aid, bckt, end))                             = 150,506,937
Slowest Run
        futures test1 ((aid, bckt), proto, end) reverse order         = 123,977,889
        scatter test5 ((aid, bckt, end))                              = 132,461,653
        scatter test3 ((aid, bckt), end, proto) reverse order         = 162,124,982
        futures test5 ((aid, bckt, end))                              = 169,519,567
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  = 179,664,998
        futures test3 ((aid, bckt), end, proto) reverse order         = 194,697,734
        futures test2 ((aid, bckt), end)                              = 196,393,591
        parallel test3 ((aid, bckt), end, proto) reverse order        = 283,316,486
        parallel test4 ((aid, bckt), proto, end) no explicit ordering = 297,212,641
        parallel test2 ((aid, bckt), end)                             = 313,276,348
        futures test4 ((aid, bckt), proto, end) no explicit ordering  = 575,294,961
        parallel test1 ((aid, bckt), proto, end) reverse order        = 630,158,596
        scatter test1 ((aid, bckt), proto, end) reverse order         = 694,183,510
        parallel test5 ((aid, bckt, end))                             = 740,323,929
        scatter test2 ((aid, bckt), end)                              = 833,222,720



On Sat, Dec 13, 2014 at 9:44 AM, Eric Stevens <mi...@gmail.com> wrote:

> You can see what the partition key strategies are for each of the tables,
> test5 shows the least improvement.  The set (aid, end) should be unique,
> and bckt is derived from end.  Some of these layouts result in clustering
> on the same partition keys, that's actually tunable with the "~15 per
> bucket" reported (exact number of entries per bucket will vary but should
> have a mean of 15 in that run - it's an input parameter to my tests).
>  "test5" obviously ends up being exclusively unique partitions for each
> record.
>
> Your points about:
> 1) Failed batches having a higher cost than failed single statements
> 2) In my test, every node was a replica for all data.
>
> These are both very good points.
>
> For #1, since the worst case scenario is nearly twice fast in batches as
> its single statement equivalent, in terms of impact on the client, you'd
> have to be retrying half your batches before you broke even there (but of
> course those retries are not free to the cluster, so you probably make the
> performance tipping point approach a lot faster).  This alone may be cause
> to justify avoiding batches, or at least severely limiting their size (hey,
> that's what this discussion is about!).
>
> For #2, that's certainly a good point, for this test cluster, I should at
> least re-run with RF=1 so that proxying times start to matter.  If you're
> not using a token aware client or not using a token aware policy for
> whatever reason, this should even out though, no?  Each node will end up
> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
> batched or single statements.  The DS driver is very careful to caution
> that the topology map it maintains makes no guarantees on freshness, so you
> may see a significant performance penalty in your client when the topology
> changes if you're depending on token aware routing as part of your
> performance requirements.
>
>
> I'm curious what your thoughts are on grouping statements by primary
> replica according to the routing policy, and executing unlogged batches
> that way (so that for token aware routing, all statements are executed on a
> replica, for others it'd make no difference).  Retries are still more
> expensive, but token aware proxying avoidance is still had.  It's pretty
> easy to do in Scala:
>
>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
> session: Session): Map[Host, Seq[Statement]] = {
>     val meta = session.getCluster.getMetadata
>     statements.groupBy { st =>
>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>     }
>   }
>   val result =
> Future.traverse(groupByFirstReplica(statements).values).map(st =>
> newBatch(st).executeAsync())
>
>
> Let me get together my test code, it depends on some existing utilities we
> use elsewhere, such as implicit conversions between Google and Scala native
> futures.  I'll try to put this together in a format that's runnable for you
> in a Scala REPL console without having to resolve our internal
> dependencies.  This may not be today though.
>
> Also, @Ryan, I don't think that shuffling would make a difference for my
> above tests since as Jon observed, all my nodes were already replicas there.
>
>
> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com> wrote:
>
>> Also..what happens when you turn on shuffle with token aware?
>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>
>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>>
>>> To add to Ryan's (extremely valid!) point, your test works because the
>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>> Batching works great at RF=N=3 because it always gets to write to local and
>>> talk to exactly 2 other servers on every request.  Consider what happens
>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>> overhead on the server side.
>>>
>>> To save network overhead, Cassandra 2.1 added support for response
>>> grouping (see
>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>> which massively helps performance.  It provides the benefit of batches but
>>> without the coordinator overhead.
>>>
>>> Can you post your benchmark code?
>>>
>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>
>>>> There are cases where it can.  For instance, if you batch multiple
>>>> mutations to the same partition (and talk to a replica for that partition)
>>>> they can reduce network overhead because they're effectively a single
>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>> most people aren't!) you end up putting additional pressure on the
>>>> coordinator because now it has to talk to several other servers.  If you
>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>> a coordinator that's
>>>>
>>>> 1) talking to every machine in the cluster and
>>>> b) waiting on a response from a significant portion of them
>>>>
>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>> disk, can affect the performance of the entire batch.
>>>>
>>>>
>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>> jack@basetechnology.com> wrote:
>>>>
>>>>>   Jonathan and Ryan,
>>>>>
>>>>> Jonathan says “It is absolutely not going to help you if you're trying
>>>>> to lump queries together to reduce network & server overhead - in fact
>>>>> it'll do the opposite”, but I would note that the CQL3 spec says “The
>>>>> BATCH statement ... serves several purposes: 1. It saves network
>>>>> round-trips between the client and the server (and sometimes between the
>>>>> server coordinator and the replicas) when batching multiple updates.” Is
>>>>> the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>>
>>>>> See:
>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>
>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>> change to make it accurate.
>>>>>
>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>> can save network exchanges between the client/server and server
>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>> successful, as described in Using and misusing batches section. For
>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>> loading without the Batch keyword."”
>>>>>
>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>> then let the driver determine what degree of batching and asynchronous
>>>>> operation is appropriate.
>>>>>
>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>> based on overall cluster load.
>>>>>
>>>>> I would also note that the example in the spec has multiple inserts
>>>>> with different partition key values, which flies in the face of the
>>>>> admonition to to refrain from using server-side distribution of requests.
>>>>>
>>>>> At a minimum the CQL spec should make a more clear statement of intent
>>>>> and non-intent for BATCH.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>
>>>>> The really important thing to really take away from Ryan's original
>>>>> post is that batches are not there for performance.  The only case I
>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>> this is when you've got multiple tables that are serving as different views
>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>> failures).
>>>>>
>>>>> tl;dr: you probably don't want batch, you most likely want many async
>>>>> calls
>>>>>
>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>> mohammed@glassbeam.com> wrote:
>>>>>
>>>>>>  Ryan,
>>>>>>
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I did see that jira before posting my question on this list. However,
>>>>>> I didn’t see any information about why 5kb+ data will cause instability.
>>>>>> 5kb or even 50kb seems too small. For example, if each mutation is 1000+
>>>>>> bytes, then with just 5 mutations, you will hit that threshold.
>>>>>>
>>>>>>
>>>>>>
>>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>>> in a batch?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mohammed
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>> the story behind the original recommendation here
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>
>>>>>>
>>>>>>
>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>
>>>>>>
>>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>>>>> more than ~100 mutations per batch. Doing some quick math I came up with 5k
>>>>>> as 100 x 50 byte mutations.
>>>>>>
>>>>>> Totally up for debate."
>>>>>>
>>>>>>
>>>>>>
>>>>>> It's totally changeable, however, it's there in no small part because
>>>>>> so many people confuse the BATCH keyword as a performance optimization,
>>>>>> this helps flag those cases of misuse.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>
>>>>>> Hi –
>>>>>>
>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>> *
>>>>>>
>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>> threshold as it can lead to node instability.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mohammed
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
>>
>>
>

Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
And by the way Jon and Ryan, I want to thank you for engaging in the
conversation.  I hope I'm not coming across as argumentative or combative
or anything like that.  But I would definitely love to reconcile my
measurements with recommended practices so that I can make good decisions
about how to model my application.

On Sat, Dec 13, 2014 at 10:58 AM, Eric Stevens <mi...@gmail.com> wrote:

> Isn't the net effect of coordination overhead incurred by batches
> basically the same as the overhead incurred by RoundRobin or other
> non-token-aware request routing?  As the cluster size increases, each node
> would coordinate the same percentage of writes in batches under token
> awareness as they would under a more naive single statement routing
> strategy.  If write volume per time unit is the same in both approaches,
> each node ends up coordinating the majority of writes under either strategy
> as the cluster grows.
>
> GC pressure in the cluster is a concern of course, as you observe.  But
> delta performance is *substantial* from what I can see.  As in the case
> where you're bumping up against retries, this will cause you to fall over
> much more rapidly as you approach your tipping point, but in a healthy
> cluster, it's the same write volume, just a longer tenancy in eden.  If
> reasonable sized batches are causing survivors, you're not far off from
> falling over anyway.
>
> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> One thing to keep in mind is the overhead of a batch goes up as the
>> number of servers increases.  Talking to 3 is going to have a much
>> different performance profile than talking to 20.  Keep in mind that the
>> coordinator is going to be talking to every server in the cluster with a
>> big batch.  The amount of local writes will decrease as it owns a smaller
>> portion of the ring.  All you've done is add an extra network hop between
>> your client and where the data should actually be.  You also start to have
>> an impact on GC in a very negative way.
>>
>> Your point is valid about topology changes, but that's a relatively rare
>> occurrence, and the driver is notified pretty quickly, so I wouldn't
>> optimize for that case.
>>
>> Can you post your test code in a gist or something?  I can't really talk
>> about your benchmark without seeing it and you're basing your stance on the
>> premise that it is correct, which it may not be.
>>
>>
>>
>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com> wrote:
>>
>>> You can see what the partition key strategies are for each of the
>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>> unique, and bckt is derived from end.  Some of these layouts result in
>>> clustering on the same partition keys, that's actually tunable with the
>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>> should have a mean of 15 in that run - it's an input parameter to my
>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>> each record.
>>>
>>> Your points about:
>>> 1) Failed batches having a higher cost than failed single statements
>>> 2) In my test, every node was a replica for all data.
>>>
>>> These are both very good points.
>>>
>>> For #1, since the worst case scenario is nearly twice fast in batches as
>>> its single statement equivalent, in terms of impact on the client, you'd
>>> have to be retrying half your batches before you broke even there (but of
>>> course those retries are not free to the cluster, so you probably make the
>>> performance tipping point approach a lot faster).  This alone may be cause
>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>> that's what this discussion is about!).
>>>
>>> For #2, that's certainly a good point, for this test cluster, I should
>>> at least re-run with RF=1 so that proxying times start to matter.  If
>>> you're not using a token aware client or not using a token aware policy for
>>> whatever reason, this should even out though, no?  Each node will end up
>>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>>> batched or single statements.  The DS driver is very careful to caution
>>> that the topology map it maintains makes no guarantees on freshness, so you
>>> may see a significant performance penalty in your client when the topology
>>> changes if you're depending on token aware routing as part of your
>>> performance requirements.
>>>
>>>
>>> I'm curious what your thoughts are on grouping statements by primary
>>> replica according to the routing policy, and executing unlogged batches
>>> that way (so that for token aware routing, all statements are executed on a
>>> replica, for others it'd make no difference).  Retries are still more
>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>> easy to do in Scala:
>>>
>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>> session: Session): Map[Host, Seq[Statement]] = {
>>>     val meta = session.getCluster.getMetadata
>>>     statements.groupBy { st =>
>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>     }
>>>   }
>>>   val result =
>>> Future.traverse(groupByFirstReplica(statements).values).map(st =>
>>> newBatch(st).executeAsync())
>>>
>>>
>>> Let me get together my test code, it depends on some existing utilities
>>> we use elsewhere, such as implicit conversions between Google and Scala
>>> native futures.  I'll try to put this together in a format that's runnable
>>> for you in a Scala REPL console without having to resolve our internal
>>> dependencies.  This may not be today though.
>>>
>>> Also, @Ryan, I don't think that shuffling would make a difference for my
>>> above tests since as Jon observed, all my nodes were already replicas there.
>>>
>>>
>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>>> wrote:
>>>
>>>> Also..what happens when you turn on shuffle with token aware?
>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>
>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>> wrote:
>>>>>
>>>>> To add to Ryan's (extremely valid!) point, your test works because the
>>>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>> overhead on the server side.
>>>>>
>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>> grouping (see
>>>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>>>> which massively helps performance.  It provides the benefit of batches but
>>>>> without the coordinator overhead.
>>>>>
>>>>> Can you post your benchmark code?
>>>>>
>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>> they can reduce network overhead because they're effectively a single
>>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>>> a coordinator that's
>>>>>>
>>>>>> 1) talking to every machine in the cluster and
>>>>>> b) waiting on a response from a significant portion of them
>>>>>>
>>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>>> disk, can affect the performance of the entire batch.
>>>>>>
>>>>>>
>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>> jack@basetechnology.com> wrote:
>>>>>>
>>>>>>>   Jonathan and Ryan,
>>>>>>>
>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>> between the server coordinator and the replicas) when batching multiple
>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict with your
>>>>>>> statement.
>>>>>>>
>>>>>>> See:
>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>
>>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>>> change to make it accurate.
>>>>>>>
>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>>>> can save network exchanges between the client/server and server
>>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>>> successful, as described in Using and misusing batches section. For
>>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>>> loading without the Batch keyword."”
>>>>>>>
>>>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>> operation is appropriate.
>>>>>>>
>>>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>>>> based on overall cluster load.
>>>>>>>
>>>>>>> I would also note that the example in the spec has multiple inserts
>>>>>>> with different partition key values, which flies in the face of the
>>>>>>> admonition to to refrain from using server-side distribution of requests.
>>>>>>>
>>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>>> intent and non-intent for BATCH.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>
>>>>>>> The really important thing to really take away from Ryan's original
>>>>>>> post is that batches are not there for performance.  The only case I
>>>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>>>> this is when you've got multiple tables that are serving as different views
>>>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>>> failures).
>>>>>>>
>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>> async calls
>>>>>>>
>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>
>>>>>>>>  Ryan,
>>>>>>>>
>>>>>>>> Thanks for the quick response.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>>> threshold.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>>>>> in a batch?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mohammed
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>>>> the story behind the original recommendation here
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>
>>>>>>>>
>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>> recommend more than ~100 mutations per batch. Doing some quick math I came
>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>
>>>>>>>> Totally up for debate."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>> Hi –
>>>>>>>>
>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>> *
>>>>>>>>
>>>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mohammed
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
>>>
>

Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
> You are, of course, free to use batches in your application

I'm not looking to justify the use of batches, I'm looking for the path
forward that will give us the Best Results™ both near and long term, for
some definition of Best (which would be a balance of client throughput and
cluster pressure).  If individual writes are best for us, that's what I
want to do.  If batches are best for us, that's what I want to do.

I'm just struggling that I'm not able to reproduce your advice
experimentally, and it's not just a few percent difference; it's a 5x to 8x
difference.  It's really difficult for me to adopt advice blindly when it
differs from my own observations by such a substantial amount.  That means
something is wrong either with my observations or with the advice, and I
would really like to know which.  I'm not trying to be argumentative or
push for a particular approach; I'm trying to resolve an inconsistency.


RE your questions: I'm sorry this turns into a wall of text; simple
questions about parallelism and distributed systems can rarely be
adequately answered in just a few words.  I'm trying to be open and
transparent about my testing approach because I want to find out where the
disconnect is here.  At the same time I'm trying to bridge the knowledge
gap, since I'm working with a parallelism toolset with which you're not
familiar, and that could obviously have a substantial impact on the
results.  Hopefully someone else in the community familiar with Scala will
notice this and provide feedback in case I'm making a fundamental mistake.


1) My original runs were in EC2 being driven by a different server than the
Cassandra cluster, but in the same AZ as one of the Cassandra
servers (typical 3-AZ setup for Cassandra).  All four instances (3x C*, 1x
test driver) were i2.2xl, so they have gigabit network between them.


2) The system was under some moderate other load, this is our test cluster
that takes a steady stream of simulated data to provide other developers
with something to work against.  That load is quite constant and doesn't
work these servers particularly hard - only a few thousand records per
second typically.  Load averages between 1 and 3 most of the time.

Unfortunately I haven't been successful in getting cassandra-stress talking
to this cluster because of our ssl configuration (it doesn't seem to
actually pay attention to the -ts and -tspw command line flags).  I can find
out if our ops guys would be ok with turning off ssl for a while, but that
would break our other applications using the same cluster and may block our
other engineers as a result, so it has farther-reaching implications than
just being something I can happily turn on or off at whim.

I'm curious how you would expect the performance of my stress tool to
differ when the cluster was being overworked - could you explain what you
anticipate the change in results to look like?  I.e. would single-writes
remain about constant for performance while batches would degrade in
performance?


3) Well, I specifically attempt to control for this by testing three
different concurrency models, which I named "parallel," "scatter," and
"traverse" (just aliases to make it easier to control the driver).  You can
see the code for the different approaches here - they are pretty similar to
each other, but appreciating the differences probably requires some
knowledge of how concurrency works in Scala:
https://gist.github.com/MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L181-L203

I know you're not a Scala guy, so I'll explain roughly what they do, but
the point is that I'm trying hard to control for just having chosen a bad
concurrency model:

scatter -> Take all of the Statements and call executeAsync() on them as
*fast* as the Session will let me.  This is the Unintelligent Brute Force
approach, and it's definitely not how I would model a typical production
application, as it doesn't attempt to respond to system pressure at all, and
it's trying to gobble up as many resources as it can.  Use the Scala
Futures system to combine the set of async calls into a single Future
that completes when all the futures returned from executeAsync() have
completed.

traverse -> Give all of the Statements to the Scala Futures system and tell
it to call executeAsync() on them all at the rate that it thinks is
appropriate.  This would be much closer to my recommendation on how to
model a production application, because in a real application, there's more
than a single class of work to be done, and the Futures system schedules
both this work and other work intelligently and configurably.  It gives us
a single awaitable Future that completes when it has finished all of its
work and all of the async calls have been completed.  You guys are using
Netty for your native protocol, and Netty offers true event driven
concurrency which gets along famously well with Scala's Futures system.

parallel -> Use a Scala Parallel collection to perform parallel synchronous
execution.  This is very much like a typical Java multithreaded approach.
There is a thread pool (defaults to 32 threads), and we fill up that pool
with workers, each of which will take one element from the work queue, work
on it until it completes, before taking another one from the queue.  The
whole collection completes when the last worker finishes its last task, and
the parallel operation blocks until then.
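
Roughly, the three strategies look like the following condensed sketch (not
the actual test suite - see the gist above for that; the asScala adapter is
assumed here, standing in for the internal Google-to-Scala future
conversions mentioned earlier):

  import com.datastax.driver.core.{ResultSet, ResultSetFuture, Session, Statement}
  import com.google.common.util.concurrent.{FutureCallback, Futures}
  import scala.concurrent.{Await, Future, Promise}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global

  // Adapter: the driver's ListenableFuture -> Scala Future.
  def asScala(f: ResultSetFuture): Future[ResultSet] = {
    val p = Promise[ResultSet]()
    Futures.addCallback(f, new FutureCallback[ResultSet] {
      def onSuccess(r: ResultSet): Unit = p.success(r)
      def onFailure(t: Throwable): Unit = p.failure(t)
    })
    p.future
  }

  // "scatter": issue every executeAsync immediately, then await the combined Future.
  def scatter(stmts: Seq[Statement])(implicit s: Session): Unit =
    Await.result(Future.sequence(stmts.map(st => asScala(s.executeAsync(st)))), Duration.Inf)

  // "traverse": hand the statements to the Futures system and let it schedule the calls.
  def traverse(stmts: Seq[Statement])(implicit s: Session): Unit =
    Await.result(Future.traverse(stmts)(st => asScala(s.executeAsync(st))), Duration.Inf)

  // "parallel": a bounded pool of workers, each executing one statement synchronously.
  def parallel(stmts: Seq[Statement])(implicit s: Session): Unit =
    stmts.par.foreach(st => s.execute(st))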

So:
3a) This is the approach used by the "scatter" strategy.  I'm sure it puts
a lot of pressure on the cluster (it's designed to put as much pressure as
the client driver will let it), but I didn't see any GC in the logs for
either batch or single statements.
3b) I await the result of an entire run, not of individual calls.  In the
earliest tests I posted here, that was the completion of 50,000 records for
each run (so either 50,000 simultaneous statement.execAsync calls, or 500
batch.execAsync calls based on a batch size of 100).  The cluster may have
a chance to calm down between runs while some straggler finishes up, but
inside of a run both the "scatter" and "traverse" approaches should be
keeping it pretty busy (the parallel approach only allows for 32 executors
at a time by default, so I'm pretty sure that I'd need a whole lot more
workers before I was able to keep a cluster busy).

4)
> you're still only using 3 servers.

It's true, I don't have access to a larger test cluster.  I'm reluctant to
try to purposefully stress out our customer facing cluster.  If you have
access to a larger test cluster, I'd be *very happy* to bundle up a
standalone executable of this test suite so that you can try it out there
for yourself, and I'd love to see such results.

For what it's worth, in our testing on our larger (~50 node) production
cluster, which still uses Thrift MutationBatches, we found that batch sizes
in the neighborhood of 1500 business objects (about 10-15 mutations each)
gave us the best throughput.  Thrift is not as asynchronous as the native
protocol, so I would expect larger batches to offer better performance
there than over the native protocol, as each round trip blocks a connection
in the pool from doing any other work.

> The horror of using batches increases linearly as you add servers

I'm curious if you could explain this statement in more detail; you've said
something like it a couple of times, but I'm still not following it.

You talked about coordination overhead, but practical experience suggests
this is not likely to produce a linear degradation as cluster size
increases, since the only cost you described that was related to cluster
size was the coordination overhead itself.  Writes need to be coordinated
no matter what in a typical setup, and this overhead should not be worse
than token-naive writing.  Token awareness is awesome and we should try to
benefit from it, but it's hard to understand how token naïveté degrades
linearly, especially for writes, and the practical lower bound on
performance as cluster size approaches infinity should be the case where
all operations need to be coordinated.

In fact even the DataStax driver is not perfect in its token awareness.  If
you're not using prepared statements you automatically fall back to the
TokenAwarePolicy's child policy, which would have to be a token naive
policy (try new com.datastax.driver.core.SimpleStatement("some CQL",
someValues*).getRoutingKey() - you'll get back null, which triggers
TokenAwarePolicy to defer to its child policy).
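
A quick way to see that fallback for yourself (an illustrative snippet
assuming driver 2.1 and a made-up table, not code from the test suite):

  import com.datastax.driver.core.SimpleStatement

  // A non-prepared statement carries no routing key, so TokenAwarePolicy has
  // nothing to route on and falls through to its child policy.
  val st = new SimpleStatement("INSERT INTO ks.tbl (k, v) VALUES (?, ?)", "some-key", "some-value")
  assert(st.getRoutingKey == null)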

I even ran a test yesterday which grouped statements by a common replica
and ran batches where all statements in the batch shared that replica
(RF=1, completely eliminating coordination overhead); the results were not
statistically different from token-naive batches.

You also talked about possible GC pressure, but that wouldn't be related to
cluster size, only to batch size.

5) I'm summing the time between the first call to .executeAsync and the
time when the last Future completes within a given test run.
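
As a sketch, the measurement is essentially wall-clock time around the whole
run (the names here are illustrative, not the actual suite code):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global

  // Wall-clock time (in nanoseconds here) from issuing the first statement
  // to completion of the last Future within a run.
  def timeRun(run: () => Seq[Future[Any]]): Long = {
    val start = System.nanoTime()
    Await.result(Future.sequence(run()), Duration.Inf)
    System.nanoTime() - start
  }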

On Mon, Dec 15, 2014 at 7:56 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> You are, of course, free to use batches in your application.  Keep in mind
> however, that both my and Ryan's advice is coming from debugging issues in
> production.  I don't know why your Scala script is performing better on
> batches than async.  It could be:
>
> 1) network.  are you running the test script on your laptop and connecting
> to cluster over WAN?  If so, I would not be shocked if batch was faster
> since your latency is going to be crazy high.
>
> 2) is the system under any other load?  I'd love to see the results of the
> tests while cassandra stress was running.  This is a step closer to
> production where you have to worry about such things
>
> 3) The logic for doing async queries may be incorrect.
> a) Are you just throwing all the queries at once against the cluster?  If
> so, I'd love to see what's happening with GC.  Typically in a real workload
> you'd be
> b) Are you keeping the servers busy?  If you're calling wait() on a group
> of futures, you're now blocking requests from being submitted and limiting
> the throughput.
>
> 4) you're still only using 3 servers.  The horror of using batches
> increases linearly as you add servers.
>
> 5) What exactly are you summing in the end?  The total real time taken, or
> an aggregation of the async query times?  If it's the async query times
> that's going to be pretty misleading (and incorrect).  Again, my Scala is
> terrible so I could be reading it wrong.
>
> Sorry I don't have more time to debug the script.  Any of the above ideas
> apply?
>
> Jon
>
> On Mon Dec 15 2014 at 1:11:43 PM Eric Stevens <mi...@gmail.com> wrote:
>
>> > Unfortunately my Scala isn't the best so I'm going to have to take a
>> little bit to wade through the code.
>>
>> I think the important thing to take from this code is that:
>>
>> 1) execution order is randomized for each run, and new data is randomly
>> generated for each run to eliminate biases.
>> 2) we write to five different key layouts in an attempt to eliminate bias
>> from some poorly chosen scheme, we test both clustering and non-clustering
>> approaches
>> 3) We can fork *just* on batch-vs-single strategy (see
>> https://gist.github.com/MightyE/1c98912fca104f6138fc/
>> a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L167-L180 )
>> thanks to the DS driver having a common executable ancestor between them
>> (an extremely nice feature)
>> 4) We test three different parallelism strategies to eliminate bias from
>> a poorly chosen concurrency model (see https://gist.github.com/
>> MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d6939
>> 1ee285c080#file-testsuite-L181-L203 )
>> 5) The code path is identical wherever possible between strategies.
>> 6) Principally this just sets up an Iterable of Statement (sometimes
>> members are batches, sometimes members are single statements), and times
>> how long they take to execute and complete with different concurrency
>> models.
>>
>> *RE: Cassandra-Stress*
>> > It may be useful to run cassandra-stress (it doesn't seem to have a
>> mode for batches) to get a baseline on non-batches.  I'm curious to know if
>> you get different numbers than the scala profiler.
>>
>> We always use SSL for everything, and I've struggled to get
>> cassandra-stress to talk to our SSL cluster.  Just so I don't keep spinning
>> my wheels on a temporary effort, I used CCM to stand up a 2.0.11 cluster
>> locally, and ran both tools against here.  I'm dubious about what you can
>> infer from such a test because it's not apples to apples (they write
>> different data).
>>
>> Nevertheless, here is the output of "ccm stress" against my local machine
>> - I inserted 113,825 records in 62 seconds, and used this data size to
>> drive my tool:
>>
>> Created keyspaces. Sleeping 3s for propagation.
>> total   interval_op_rate  interval_key_rate  latency  95th   99.9th
>>  elapsed_time
>> 11271   1127              1127               8.9      144.7  401.1   10
>> 27998   1672              1672               9.5      140.5  399.4   20
>> 42189   1419              1419               9.3      148.0  494.5   31
>> 59335   1714              1714               9.3      147.0  493.2   41
>> 84957   2562              2562               6.1      137.1  493.3   51
>> 113825  2886              2886               5.1      131.5  493.3   62
>>
>>
>> After a ccm clear && ccm start , here's my tool this same local cluster
>> (note that I'm actually writing a total of 5x the records because I write
>> the same data to each of 5 tables).  My little local cluster just about
>> brought down my machine under this test (especially the second one).
>>
>> ==== Execution Results for 1 runs of 113825 records =============
>> 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) as single
>> statements
>> Total Run Time
>> traverse test2 ((aid, bckt), end)                             =
>> 25,488,179,000
>> traverse test4 ((aid, bckt), proto, end) no explicit ordering =
>> 25,497,183,000
>> traverse test5 ((aid, bckt, end))                             =
>> 25,529,444,000
>> traverse test3 ((aid, bckt), end, proto) reverse order        =
>> 31,495,348,000
>> traverse test1 ((aid, bckt), proto, end) reverse order        =
>> 33,686,013,000
>>
>> ==== Execution Results for 1 runs of 113825 records =============
>> 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
>> of 10
>> Total Run Time
>> traverse test3 ((aid, bckt), end, proto) reverse order        =
>> 11,030,788,000
>> traverse test1 ((aid, bckt), proto, end) reverse order        =
>> 13,345,962,000
>> traverse test2 ((aid, bckt), end)                             =
>> 15,110,208,000
>> traverse test4 ((aid, bckt), proto, end) no explicit ordering =
>> 16,398,982,000
>> traverse test5 ((aid, bckt, end))                             =
>> 22,166,119,000
>>
>> For giggles I added token aware batching (grouping statements within a
>> single batch by meta.getReplicas(statement.getKeyspace,
>> statement.getRoutingKey).iterator().next - see https://gist.github.com/
>> MightyE/1c98912fca104f6138fc#file-testsuite-L176-L189 ), here's that
>> run; comparable results with before, and easily inside one sigma of
>> non-token-aware batching, so not a statistically significant difference.
>>
>> ==== Execution Results for 1 runs of 113825 records =============
>> 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
>> of 10
>> Total Run Time
>> traverse test2 ((aid, bckt), end)                             =
>> 11,429,008,000
>> traverse test1 ((aid, bckt), proto, end) reverse order        =
>> 12,593,034,000
>> traverse test4 ((aid, bckt), proto, end) no explicit ordering =
>> 13,111,244,000
>> traverse test3 ((aid, bckt), end, proto) reverse order        =
>> 25,163,064,000
>> traverse test5 ((aid, bckt, end))                             =
>> 30,233,744,000
>>
>>
>>
>> On Sat, Dec 13, 2014 at 11:07 AM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>>
>>>
>>> On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <mi...@gmail.com>
>>> wrote:
>>>
>>>> Isn't the net effect of coordination overhead incurred by batches
>>>> basically the same as the overhead incurred by RoundRobin or other
>>>> non-token-aware request routing?  As the cluster size increases, each node
>>>> would coordinate the same percentage of writes in batches under token
>>>> awareness as they would under a more naive single statement routing
>>>> strategy.  If write volume per time unit is the same in both approaches,
>>>> each node ends up coordinating the majority of writes under either strategy
>>>> as the cluster grows.
>>>>
>>>
>>> If you're not token aware, there's extra coordinator overhead, yes.  If
>>> you are token aware, not the case.  I'm operating under the assumption that
>>> you'd want to be token aware, since I don't see a point in not doing so :)
>>>
>>> Unfortunately my Scala isn't the best so I'm going to have to take a
>>> little bit to wade through the code.
>>>
>>> It may be useful to run cassandra-stress (it doesn't seem to have a mode
>>> for batches) to get a baseline on non-batches.  I'm curious to know if you
>>> get different numbers than the scala profiler.
>>>
>>>
>>>
>>>>
>>>> GC pressure in the cluster is a concern of course, as you observe.  But
>>>> delta performance is *substantial* from what I can see.  As in the
>>>> case where you're bumping up against retries, this will cause you to fall
>>>> over much more rapidly as you approach your tipping point, but in a healthy
>>>> cluster, it's the same write volume, just a longer tenancy in eden.  If
>>>> reasonable sized batches are causing survivors, you're not far off from
>>>> falling over anyway.
>>>>
>>>> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>> wrote:
>>>>
>>>>> One thing to keep in mind is the overhead of a batch goes up as the
>>>>> number of servers increases.  Talking to 3 is going to have a much
>>>>> different performance profile than talking to 20.  Keep in mind that the
>>>>> coordinator is going to be talking to every server in the cluster with a
>>>>> big batch.  The amount of local writes will decrease as it owns a smaller
>>>>> portion of the ring.  All you've done is add an extra network hop between
>>>>> your client and where the data should actually be.  You also start to have
>>>>> an impact on GC in a very negative way.
>>>>>
>>>>> Your point is valid about topology changes, but that's a relatively
>>>>> rare occurrence, and the driver is notified pretty quickly, so I wouldn't
>>>>> optimize for that case.
>>>>>
>>>>> Can you post your test code in a gist or something?  I can't really
>>>>> talk about your benchmark without seeing it and you're basing your stance
>>>>> on the premise that it is correct, which it may not be.
>>>>>
>>>>>
>>>>>
>>>>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You can see what the partition key strategies are for each of the
>>>>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>>>>> unique, and bckt is derived from end.  Some of these layouts result in
>>>>>> clustering on the same partition keys, that's actually tunable with the
>>>>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>>>>> should have a mean of 15 in that run - it's an input parameter to my
>>>>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>>>>> each record.
>>>>>>
>>>>>> Your points about:
>>>>>> 1) Failed batches having a higher cost than failed single statements
>>>>>> 2) In my test, every node was a replica for all data.
>>>>>>
>>>>>> These are both very good points.
>>>>>>
>>>>>> For #1, since the worst case scenario is nearly twice as fast in batches
>>>>>> as its single statement equivalent, in terms of impact on the client, you'd
>>>>>> have to be retrying half your batches before you broke even there (but of
>>>>>> course those retries are not free to the cluster, so you probably make the
>>>>>> performance tipping point approach a lot faster).  This alone may be cause
>>>>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>>>>> that's what this discussion is about!).
>>>>>>
>>>>>> For #2, that's certainly a good point, for this test cluster, I
>>>>>> should at least re-run with RF=1 so that proxying times start to matter.
>>>>>> If you're not using a token aware client or not using a token aware policy
>>>>>> for whatever reason, this should even out though, no?  Each node will end
>>>>>> up coordinating 1/(nodecount-rf+1) mutations, regardless of whether they
>>>>>> are batched or single statements.  The DS driver is very careful to caution
>>>>>> that the topology map it maintains makes no guarantees on freshness, so you
>>>>>> may see a significant performance penalty in your client when the topology
>>>>>> changes if you're depending on token aware routing as part of your
>>>>>> performance requirements.
>>>>>>
>>>>>>
>>>>>> I'm curious what your thoughts are on grouping statements by primary
>>>>>> replica according to the routing policy, and executing unlogged batches
>>>>>> that way (so that for token aware routing, all statements are executed on a
>>>>>> replica, for others it'd make no difference).  Retries are still more
>>>>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>>>>> easy to do in Scala:
>>>>>>
>>>>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>>>>> session: Session): Map[Host, Seq[Statement]] = {
>>>>>>     val meta = session.getCluster.getMetadata
>>>>>>     statements.groupBy { st =>
>>>>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().
>>>>>> next
>>>>>>     }
>>>>>>   }
>>>>>>   val result = Future.traverse(groupByFirstReplica(statements).values).map(st
>>>>>> => newBatch(st).executeAsync())
>>>>>>
>>>>>>
>>>>>> Let me get together my test code, it depends on some existing
>>>>>> utilities we use elsewhere, such as implicit conversions between Google and
>>>>>> Scala native futures.  I'll try to put this together in a format that's
>>>>>> runnable for you in a Scala REPL console without having to resolve our
>>>>>> internal dependencies.  This may not be today though.
>>>>>>
>>>>>> Also, @Ryan, I don't think that shuffling would make a difference for
>>>>>> my above tests since as Jon observed, all my nodes were already replicas
>>>>>> there.
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Also..what happens when you turn on shuffle with token aware?
>>>>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/
>>>>>>> driver/core/policies/TokenAwarePolicy.html
>>>>>>>
>>>>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> To add to Ryan's (extremely valid!) point, your test works because
>>>>>>>> the coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>>>>> overhead on the server side.
>>>>>>>>
>>>>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>>>>> grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-
>>>>>>>> over-50-faster) which massively helps performance.  It provides
>>>>>>>> the benefit of batches but without the coordinator overhead.
>>>>>>>>
>>>>>>>> Can you post your benchmark code?
>>>>>>>>
>>>>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>>>>> they can reduce network overhead because they're effectively a single
>>>>>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>>>>>> a coordinator that's
>>>>>>>>>
>>>>>>>>> 1) talking to every machine in the cluster and
>>>>>>>>> b) waiting on a response from a significant portion of them
>>>>>>>>>
>>>>>>>>> before it can respond success or fail.  Any delay, from GC to a
>>>>>>>>> bad disk, can affect the performance of the entire batch.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>>>>> jack@basetechnology.com> wrote:
>>>>>>>>>
>>>>>>>>>>   Jonathan and Ryan,
>>>>>>>>>>
>>>>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>>>>> between the server coordinator and the replicas) when batching multiple
>>>>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict with your
>>>>>>>>>> statement.
>>>>>>>>>>
>>>>>>>>>> See:
>>>>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>>>>
>>>>>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>>>>>> change to make it accurate.
>>>>>>>>>>
>>>>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple
>>>>>>>>>> statements can save network exchanges between the client/server and server
>>>>>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>>>>>> successful, as described in Using and misusing batches section. For
>>>>>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>>>>>> loading without the Batch keyword."”
>>>>>>>>>>
>>>>>>>>>> Maybe what we really need is a “client/driver-side batch”, which
>>>>>>>>>> is simply a way to collect “batches” of operations in the client/driver and
>>>>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>>>>> operation is appropriate.
>>>>>>>>>>
>>>>>>>>>> It might also be nice to have an inquiry for the cluster as to
>>>>>>>>>> what batch size is most optimal for the cluster, like number of mutations
>>>>>>>>>> in a batch and number of simultaneous connections, and to have that be
>>>>>>>>>> dynamic based on overall cluster load.
>>>>>>>>>>
>>>>>>>>>> I would also note that the example in the spec has multiple
>>>>>>>>>> inserts with different partition key values, which flies in the face of the
>>>>>>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>>>>>>
>>>>>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>>>>>> intent and non-intent for BATCH.
>>>>>>>>>>
>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>
>>>>>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla
>>>>>>>>>> <rs...@datastax.com>
>>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>>
>>>>>>>>>> The really important thing to really take away from Ryan's
>>>>>>>>>> original post is that batches are not there for performance.  The only case
>>>>>>>>>> I consider batches to be useful for is when you absolutely need to know
>>>>>>>>>> that several tables all get a mutation (via logged batches).  The use case
>>>>>>>>>> for this is when you've got multiple tables that are serving as different
>>>>>>>>>> views for data.  It is absolutely not going to help you if you're trying to
>>>>>>>>>> lump queries together to reduce network & server overhead - in fact it'll
>>>>>>>>>> do the opposite.  If you're trying to do that, instead perform many async
>>>>>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>>>>>> failures).
>>>>>>>>>>
>>>>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>>>>> async calls
>>>>>>>>>>
>>>>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>  Ryan,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the quick response.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>>>>>> threshold.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In addition, Patrick is saying that he does not recommend more
>>>>>>>>>>> than 100 mutations per batch. So why not warn users just on the # of
>>>>>>>>>>> mutations in a batch?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Mohammed
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Nothing magic, just put in there based on experience. You can
>>>>>>>>>>> find the story behind the original recommendation here
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>>>>> recommend more than ~100 mutations per batch. Doing some quick math I came
>>>>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>>>>
>>>>>>>>>>> Totally up for debate."
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi –
>>>>>>>>>>>
>>>>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>>>>> *
>>>>>>>>>>>
>>>>>>>>>>> The default size is 5kb and according to the comments in the
>>>>>>>>>>> yaml file, it is used to log WARN on any batch size exceeding this value in
>>>>>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Mohammed
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>>>>>>
>>>>>>>>>>> Ryan Svihla
>>>>>>>>>>>
>>>>>>>>>>> Solution Architect
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>>>>>>>> linkedin.png]
>>>>>>>>>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>>>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>>>>>>>> is the database technology and transactional backbone of choice for the
>>>>>>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>>
>>>>>>> Ryan Svihla
>>>>>>>
>>>>>>> Solution Architect
>>>>>>>
>>>>>>> [image: twitter.png] <https://twitter.com/foundev> [image:
>>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>>
>>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>>>> is the database technology and transactional backbone of choice for the
>>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>

Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
You are, of course, free to use batches in your application.  Keep in mind
however, that both my and Ryan's advice is coming from debugging issues in
production.  I don't know why your Scala script is performing better on
batches than async.  It could be:

1) network.  are you running the test script on your laptop and connecting
to cluster over WAN?  If so, I would not be shocked if batch was faster
since your latency is going to be crazy high.

2) is the system under any other load?  I'd love to see the results of the
tests while cassandra stress was running.  This is a step closer to
production where you have to worry about such things

3) The logic for doing async queries may be incorrect.
a) Are you just throwing all the queries at once against the cluster?  If
so, I'd love to see what's happening with GC.  Typically in a real workload
you'd be
b) Are you keeping the servers busy?  If you're calling wait() on a group
of futures, you're now blocking requests from being submitted and limiting
the throughput.

4) you're still only using 3 servers.  The horror of using batches
increases linearly as you add servers.

5) What exactly are you summing in the end?  The total real time taken, or
an aggregation of the async query times?  If it's the async query times
that's going to be pretty misleading (and incorrect).  Again, my Scala is
terrible so I could be reading it wrong.

Sorry I don't have more time to debug the script.  Any of the above ideas
apply?

Jon

On Mon Dec 15 2014 at 1:11:43 PM Eric Stevens <mi...@gmail.com> wrote:

> > Unfortunately my Scala isn't the best so I'm going to have to take a
> little bit to wade through the code.
>
> I think the important thing to take from this code is that:
>
> 1) execution order is randomized for each run, and new data is randomly
> generated for each run to eliminate biases.
> 2) we write to five different key layouts in an attempt to eliminate bias
> from some poorly chosen scheme, we test both clustering and non-clustering
> approaches
> 3) We can fork *just* on batch-vs-single strategy (see
> https://gist.github.com/MightyE/1c98912fca104f6138fc/
> a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L167-L180 )
> thanks to the DS driver having a common executable ancestor between them
> (an extremely nice feature)
> 4) We test three different parallelism strategies to eliminate bias from a
> poorly chosen concurrency model (see https://gist.github.com/
> MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d6939
> 1ee285c080#file-testsuite-L181-L203 )
> 5) The code path is identical wherever possible between strategies.
> 6) Principally this just sets up an Iterable of Statement (sometimes
> members are batches, sometimes members are single statements), and times
> how long they take to execute and complete with different concurrency
> models.
>
> *RE: Cassandra-Stress*
> > It may be useful to run cassandra-stress (it doesn't seem to have a mode
> for batches) to get a baseline on non-batches.  I'm curious to know if you
> get different numbers than the scala profiler.
>
> We always use SSL for everything, and I've struggled to get
> cassandra-stress to talk to our SSL cluster.  Just so I don't keep spinning
> my wheels on a temporary effort, I used CCM to stand up a 2.0.11 cluster
> locally, and ran both tools against here.  I'm dubious about what you can
> infer from such a test because it's not apples to apples (they write
> different data).
>
> Nevertheless, here is the output of "ccm stress" against my local machine
> - I inserted 113,825 records in 62 seconds, and used this data size to
> drive my tool:
>
> Created keyspaces. Sleeping 3s for propagation.
> total   interval_op_rate  interval_key_rate  latency  95th   99.9th
>  elapsed_time
> 11271   1127              1127               8.9      144.7  401.1   10
> 27998   1672              1672               9.5      140.5  399.4   20
> 42189   1419              1419               9.3      148.0  494.5   31
> 59335   1714              1714               9.3      147.0  493.2   41
> 84957   2562              2562               6.1      137.1  493.3   51
> 113825  2886              2886               5.1      131.5  493.3   62
>
>
> After a ccm clear && ccm start , here's my tool this same local cluster
> (note that I'm actually writing a total of 5x the records because I write
> the same data to each of 5 tables).  My little local cluster just about
> brought down my machine under this test (especially the second one).
>
> ==== Execution Results for 1 runs of 113825 records =============
> 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) as single
> statements
> Total Run Time
> traverse test2 ((aid, bckt), end)                             =
> 25,488,179,000
> traverse test4 ((aid, bckt), proto, end) no explicit ordering =
> 25,497,183,000
> traverse test5 ((aid, bckt, end))                             =
> 25,529,444,000
> traverse test3 ((aid, bckt), end, proto) reverse order        =
> 31,495,348,000
> traverse test1 ((aid, bckt), proto, end) reverse order        =
> 33,686,013,000
>
> ==== Execution Results for 1 runs of 113825 records =============
> 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
> of 10
> Total Run Time
> traverse test3 ((aid, bckt), end, proto) reverse order        =
> 11,030,788,000
> traverse test1 ((aid, bckt), proto, end) reverse order        =
> 13,345,962,000
> traverse test2 ((aid, bckt), end)                             =
> 15,110,208,000
> traverse test4 ((aid, bckt), proto, end) no explicit ordering =
> 16,398,982,000
> traverse test5 ((aid, bckt, end))                             =
> 22,166,119,000
>
> For giggles I added token aware batching (grouping statements within a
> single batch by meta.getReplicas(statement.getKeyspace,
> statement.getRoutingKey).iterator().next - see https://gist.github.com/
> MightyE/1c98912fca104f6138fc#file-testsuite-L176-L189 ), here's that run;
> comparable results with before, and easily inside one sigma of
> non-token-aware batching, so not a statistically significant difference.
>
> ==== Execution Results for 1 runs of 113825 records =============
> 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
> of 10
> Total Run Time
> traverse test2 ((aid, bckt), end)                             =
> 11,429,008,000
> traverse test1 ((aid, bckt), proto, end) reverse order        =
> 12,593,034,000
> traverse test4 ((aid, bckt), proto, end) no explicit ordering =
> 13,111,244,000
> traverse test3 ((aid, bckt), end, proto) reverse order        =
> 25,163,064,000
> traverse test5 ((aid, bckt, end))                             =
> 30,233,744,000
>
>
>
> On Sat, Dec 13, 2014 at 11:07 AM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>>
>>
>> On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <mi...@gmail.com> wrote:
>>
>>> Isn't the net effect of coordination overhead incurred by batches
>>> basically the same as the overhead incurred by RoundRobin or other
>>> non-token-aware request routing?  As the cluster size increases, each node
>>> would coordinate the same percentage of writes in batches under token
>>> awareness as they would under a more naive single statement routing
>>> strategy.  If write volume per time unit is the same in both approaches,
>>> each node ends up coordinating the majority of writes under either strategy
>>> as the cluster grows.
>>>
>>
>> If you're not token aware, there's extra coordinator overhead, yes.  If
>> you are token aware, not the case.  I'm operating under the assumption that
>> you'd want to be token aware, since I don't see a point in not doing so :)
>>
>> Unfortunately my Scala isn't the best so I'm going to have to take a
>> little bit to wade through the code.
>>
>> It may be useful to run cassandra-stress (it doesn't seem to have a mode
>> for batches) to get a baseline on non-batches.  I'm curious to know if you
>> get different numbers than the scala profiler.
>>
>>
>>
>>>
>>> GC pressure in the cluster is a concern of course, as you observe.  But
>>> delta performance is *substantial* from what I can see.  As in the case
>>> where you're bumping up against retries, this will cause you to fall over
>>> much more rapidly as you approach your tipping point, but in a healthy
>>> cluster, it's the same write volume, just a longer tenancy in eden.  If
>>> reasonable sized batches are causing survivors, you're not far off from
>>> falling over anyway.
>>>
>>> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>
>>>> One thing to keep in mind is the overhead of a batch goes up as the
>>>> number of servers increases.  Talking to 3 is going to have a much
>>>> different performance profile than talking to 20.  Keep in mind that the
>>>> coordinator is going to be talking to every server in the cluster with a
>>>> big batch.  The amount of local writes will decrease as it owns a smaller
>>>> portion of the ring.  All you've done is add an extra network hop between
>>>> your client and where the data should actually be.  You also start to have
>>>> an impact on GC in a very negative way.
>>>>
>>>> Your point is valid about topology changes, but that's a relatively
>>>> rare occurrence, and the driver is notified pretty quickly, so I wouldn't
>>>> optimize for that case.
>>>>
>>>> Can you post your test code in a gist or something?  I can't really
>>>> talk about your benchmark without seeing it and you're basing your stance
>>>> on the premise that it is correct, which it may not be.
>>>>
>>>>
>>>>
>>>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com>
>>>> wrote:
>>>>
>>>>> You can see what the partition key strategies are for each of the
>>>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>>>> unique, and bckt is derived from end.  Some of these layouts result in
>>>>> clustering on the same partition keys, that's actually tunable with the
>>>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>>>> should have a mean of 15 in that run - it's an input parameter to my
>>>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>>>> each record.
>>>>>
>>>>> Your points about:
>>>>> 1) Failed batches having a higher cost than failed single statements
>>>>> 2) In my test, every node was a replica for all data.
>>>>>
>>>>> These are both very good points.
>>>>>
>>>>> For #1, since the worst case scenario is nearly twice as fast in batches
>>>>> as its single statement equivalent, in terms of impact on the client, you'd
>>>>> have to be retrying half your batches before you broke even there (but of
>>>>> course those retries are not free to the cluster, so you probably make the
>>>>> performance tipping point approach a lot faster).  This alone may be cause
>>>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>>>> that's what this discussion is about!).
>>>>>
>>>>> For #2, that's certainly a good point, for this test cluster, I should
>>>>> at least re-run with RF=1 so that proxying times start to matter.  If
>>>>> you're not using a token aware client or not using a token aware policy for
>>>>> whatever reason, this should even out though, no?  Each node will end up
>>>>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>>>>> batched or single statements.  The DS driver is very careful to caution
>>>>> that the topology map it maintains makes no guarantees on freshness, so you
>>>>> may see a significant performance penalty in your client when the topology
>>>>> changes if you're depending on token aware routing as part of your
>>>>> performance requirements.
>>>>>
>>>>>
>>>>> I'm curious what your thoughts are on grouping statements by primary
>>>>> replica according to the routing policy, and executing unlogged batches
>>>>> that way (so that for token aware routing, all statements are executed on a
>>>>> replica, for others it'd make no difference).  Retries are still more
>>>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>>>> easy to do in Scala:
>>>>>
>>>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>>>> session: Session): Map[Host, Seq[Statement]] = {
>>>>>     val meta = session.getCluster.getMetadata
>>>>>     statements.groupBy { st =>
>>>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().
>>>>> next
>>>>>     }
>>>>>   }
>>>>>   val result = Future.traverse(groupByFirstReplica(statements).values).map(st
>>>>> => newBatch(st).executeAsync())
>>>>>
>>>>>
>>>>> Let me get together my test code, it depends on some existing
>>>>> utilities we use elsewhere, such as implicit conversions between Google and
>>>>> Scala native futures.  I'll try to put this together in a format that's
>>>>> runnable for you in a Scala REPL console without having to resolve our
>>>>> internal dependencies.  This may not be today though.
>>>>>
>>>>> Also, @Ryan, I don't think that shuffling would make a difference for
>>>>> my above tests since as Jon observed, all my nodes were already replicas
>>>>> there.
>>>>>
>>>>>
>>>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>>>>> wrote:
>>>>>
>>>>>> Also..what happens when you turn on shuffle with token aware?
>>>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/
>>>>>> driver/core/policies/TokenAwarePolicy.html
>>>>>>
>>>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> To add to Ryan's (extremely valid!) point, your test works because
>>>>>>> the coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>>>> overhead on the server side.
>>>>>>>
>>>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>>>> grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-
>>>>>>> over-50-faster) which massively helps performance.  It provides the
>>>>>>> benefit of batches but without the coordinator overhead.
>>>>>>>
>>>>>>> Can you post your benchmark code?
>>>>>>>
>>>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>>>> they can reduce network overhead because they're effectively a single
>>>>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>>>>> a coordinator that's
>>>>>>>>
>>>>>>>> 1) talking to every machine in the cluster and
>>>>>>>> b) waiting on a response from a significant portion of them
>>>>>>>>
>>>>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>>>>> disk, can affect the performance of the entire batch.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>>>> jack@basetechnology.com> wrote:
>>>>>>>>
>>>>>>>>>   Jonathan and Ryan,
>>>>>>>>>
>>>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>>>> between the server coordinator and the replicas) when batching multiple
>>>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict with your
>>>>>>>>> statement.
>>>>>>>>>
>>>>>>>>> See:
>>>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>>>
>>>>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>>>>> change to make it accurate.
>>>>>>>>>
>>>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple
>>>>>>>>> statements can save network exchanges between the client/server and server
>>>>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>>>>> successful, as described in Using and misusing batches section. For
>>>>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>>>>> loading without the Batch keyword."”
>>>>>>>>>
>>>>>>>>> Maybe what we really need is a “client/driver-side batch”, which
>>>>>>>>> is simply a way to collect “batches” of operations in the client/driver and
>>>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>>>> operation is appropriate.
>>>>>>>>>
>>>>>>>>> It might also be nice to have an inquiry for the cluster as to
>>>>>>>>> what batch size is most optimal for the cluster, like number of mutations
>>>>>>>>> in a batch and number of simultaneous connections, and to have that be
>>>>>>>>> dynamic based on overall cluster load.
>>>>>>>>>
>>>>>>>>> I would also note that the example in the spec has multiple
>>>>>>>>> inserts with different partition key values, which flies in the face of the
>>>>>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>>>>>
>>>>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>>>>> intent and non-intent for BATCH.
>>>>>>>>>
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla
>>>>>>>>> <rs...@datastax.com>
>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>
>>>>>>>>> The really important thing to really take away from Ryan's
>>>>>>>>> original post is that batches are not there for performance.  The only case
>>>>>>>>> I consider batches to be useful for is when you absolutely need to know
>>>>>>>>> that several tables all get a mutation (via logged batches).  The use case
>>>>>>>>> for this is when you've got multiple tables that are serving as different
>>>>>>>>> views for data.  It is absolutely not going to help you if you're trying to
>>>>>>>>> lump queries together to reduce network & server overhead - in fact it'll
>>>>>>>>> do the opposite.  If you're trying to do that, instead perform many async
>>>>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>>>>> failures).
>>>>>>>>>
>>>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>>>> async calls
>>>>>>>>>
>>>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>>
>>>>>>>>>>  Ryan,
>>>>>>>>>>
>>>>>>>>>> Thanks for the quick response.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>>>>> threshold.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In addition, Patrick is saying that he does not recommend more
>>>>>>>>>> than 100 mutations per batch. So why not warn users just on the # of
>>>>>>>>>> mutations in a batch?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mohammed
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nothing magic, just put in there based on experience. You can
>>>>>>>>>> find the story behind the original recommendation here
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>>>> recommend more than ~100 mutations per batch. Doing some quick math I came
>>>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>>>
>>>>>>>>>> Totally up for debate."
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi –
>>>>>>>>>>
>>>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>>>> *
>>>>>>>>>>
>>>>>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mohammed
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>>>>>
>>>>>>>>>> Ryan Svihla
>>>>>>>>>>
>>>>>>>>>> Solution Architect
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>>>>>>> linkedin.png]
>>>>>>>>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>>>>>>> is the database technology and transactional backbone of choice for the
>>>>>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>
>>>>>> Ryan Svihla
>>>>>>
>>>>>> Solution Architect
>>>>>>
>>>>>> [image: twitter.png] <https://twitter.com/foundev> [image:
>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>
>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>>> is the database technology and transactional backbone of choice for the
>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>>
>>>>>>
>>>>>
>>>
>

Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
> Unfortunately my Scala isn't the best so I'm going to have to take a
little bit to wade through the code.

I think the important thing to take from this code is that:

1) execution order is randomized for each run, and new data is randomly
generated for each run to eliminate biases.
2) we write to five different key layouts in an attempt to eliminate bias
from some poorly chosen scheme, we test both clustering and non-clustering
approaches
3) We can fork *just* on batch-vs-single strategy (see
https://gist.github.com/MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L167-L180
) thanks to the DS driver having a common executable ancestor between them
(an extremely nice feature)
4) We test three different parallelism strategies to eliminate bias from a
poorly chosen concurrency model (see
https://gist.github.com/MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L181-L203
)
5) The code path is identical wherever possible between strategies.
6) Principally this just sets up an Iterable of Statement (sometimes
members are batches, sometimes members are single statements), and times
how long they take to execute and complete with different concurrency
models.

*RE: Cassandra-Stress*
> It may be useful to run cassandra-stress (it doesn't seem to have a mode
for batches) to get a baseline on non-batches.  I'm curious to know if you
get different numbers than the scala profiler.

We always use SSL for everything, and I've struggled to get
cassandra-stress to talk to our SSL cluster.  Just so I don't keep spinning
my wheels on a temporary effort, I used CCM to stand up a 2.0.11 cluster
locally, and ran both tools against here.  I'm dubious about what you can
infer from such a test because it's not apples to apples (they write
different data).

Nevertheless, here is the output of "ccm stress" against my local machine -
I inserted 113,825 records in 62 seconds, and used this data size to drive
my tool:

Created keyspaces. Sleeping 3s for propagation.
total   interval_op_rate  interval_key_rate  latency  95th   99.9th
 elapsed_time
11271   1127              1127               8.9      144.7  401.1   10
27998   1672              1672               9.5      140.5  399.4   20
42189   1419              1419               9.3      148.0  494.5   31
59335   1714              1714               9.3      147.0  493.2   41
84957   2562              2562               6.1      137.1  493.3   51
113825  2886              2886               5.1      131.5  493.3   62


After a ccm clear && ccm start , here's my tool this same local cluster
(note that I'm actually writing a total of 5x the records because I write
the same data to each of 5 tables).  My little local cluster just about
brought down my machine under this test (especially the second one).

==== Execution Results for 1 runs of 113825 records =============
1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) as single
statements
Total Run Time
traverse test2 ((aid, bckt), end)                             =
25,488,179,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering =
25,497,183,000
traverse test5 ((aid, bckt, end))                             =
25,529,444,000
traverse test3 ((aid, bckt), end, proto) reverse order        =
31,495,348,000
traverse test1 ((aid, bckt), proto, end) reverse order        =
33,686,013,000

==== Execution Results for 1 runs of 113825 records =============
1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
of 10
Total Run Time
traverse test3 ((aid, bckt), end, proto) reverse order        =
11,030,788,000
traverse test1 ((aid, bckt), proto, end) reverse order        =
13,345,962,000
traverse test2 ((aid, bckt), end)                             =
15,110,208,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering =
16,398,982,000
traverse test5 ((aid, bckt, end))                             =
22,166,119,000

For giggles I added token aware batching (grouping statements within a
single batch by meta.getReplicas(statement.getKeyspace,
statement.getRoutingKey).iterator().next - see
https://gist.github.com/MightyE/1c98912fca104f6138fc#file-testsuite-L176-L189
), here's that run; comparable results with before, and easily inside one
sigma of non-token-aware batching, so not a statistically significant
difference.

==== Execution Results for 1 runs of 113825 records =============
1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
of 10
Total Run Time
traverse test2 ((aid, bckt), end)                             =
11,429,008,000
traverse test1 ((aid, bckt), proto, end) reverse order        =
12,593,034,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering =
13,111,244,000
traverse test3 ((aid, bckt), end, proto) reverse order        =
25,163,064,000
traverse test5 ((aid, bckt, end))                             =
30,233,744,000



On Sat, Dec 13, 2014 at 11:07 AM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

>
>
> On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <mi...@gmail.com> wrote:
>
>> Isn't the net effect of coordination overhead incurred by batches
>> basically the same as the overhead incurred by RoundRobin or other
>> non-token-aware request routing?  As the cluster size increases, each node
>> would coordinate the same percentage of writes in batches under token
>> awareness as they would under a more naive single statement routing
>> strategy.  If write volume per time unit is the same in both approaches,
>> each node ends up coordinating the majority of writes under either strategy
>> as the cluster grows.
>>
>
> If you're not token aware, there's extra coordinator overhead, yes.  If
> you are token aware, not the case.  I'm operating under the assumption that
> you'd want to be token aware, since I don't see a point in not doing so :)
>
> Unfortunately my Scala isn't the best so I'm going to have to take a
> little bit to wade through the code.
>
> It may be useful to run cassandra-stress (it doesn't seem to have a mode
> for batches) to get a baseline on non-batches.  I'm curious to know if you
> get different numbers than the scala profiler.
>
>
>
>>
>> GC pressure in the cluster is a concern of course, as you observe.  But
>> delta performance is *substantial* from what I can see.  As in the case
>> where you're bumping up against retries, this will cause you to fall over
>> much more rapidly as you approach your tipping point, but in a healthy
>> cluster, it's the same write volume, just a longer tenancy in eden.  If
>> reasonable sized batches are causing survivors, you're not far off from
>> falling over anyway.
>>
>> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> One thing to keep in mind is the overhead of a batch goes up as the
>>> number of servers increases.  Talking to 3 is going to have a much
>>> different performance profile than talking to 20.  Keep in mind that the
>>> coordinator is going to be talking to every server in the cluster with a
>>> big batch.  The amount of local writes will decrease as it owns a smaller
>>> portion of the ring.  All you've done is add an extra network hop between
>>> your client and where the data should actually be.  You also start to have
>>> an impact on GC in a very negative way.
>>>
>>> Your point is valid about topology changes, but that's a relatively rare
>>> occurrence, and the driver is notified pretty quickly, so I wouldn't
>>> optimize for that case.
>>>
>>> Can you post your test code in a gist or something?  I can't really talk
>>> about your benchmark without seeing it and you're basing your stance on the
>>> premise that it is correct, which it may not be.
>>>
>>>
>>>
>>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com> wrote:
>>>
>>>> You can see what the partition key strategies are for each of the
>>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>>> unique, and bckt is derived from end.  Some of these layouts result in
>>>> clustering on the same partition keys, that's actually tunable with the
>>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>>> should have a mean of 15 in that run - it's an input parameter to my
>>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>>> each record.
>>>>
>>>> Your points about:
>>>> 1) Failed batches having a higher cost than failed single statements
>>>> 2) In my test, every node was a replica for all data.
>>>>
>>>> These are both very good points.
>>>>
>>>> For #1, since the worst case scenario is nearly twice as fast in batches
>>>> as its single statement equivalent, in terms of impact on the client, you'd
>>>> have to be retrying half your batches before you broke even there (but of
>>>> course those retries are not free to the cluster, so you probably make the
>>>> performance tipping point approach a lot faster).  This alone may be cause
>>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>>> that's what this discussion is about!).
>>>>
>>>> For #2, that's certainly a good point, for this test cluster, I should
>>>> at least re-run with RF=1 so that proxying times start to matter.  If
>>>> you're not using a token aware client or not using a token aware policy for
>>>> whatever reason, this should even out though, no?  Each node will end up
>>>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>>>> batched or single statements.  The DS driver is very careful to caution
>>>> that the topology map it maintains makes no guarantees on freshness, so you
>>>> may see a significant performance penalty in your client when the topology
>>>> changes if you're depending on token aware routing as part of your
>>>> performance requirements.
>>>>
>>>>
>>>> I'm curious what your thoughts are on grouping statements by primary
>>>> replica according to the routing policy, and executing unlogged batches
>>>> that way (so that for token aware routing, all statements are executed on a
>>>> replica, for others it'd make no difference).  Retries are still more
>>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>>> easy to do in Scala:
>>>>
>>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>>> session: Session): Map[Host, Seq[Statement]] = {
>>>>     val meta = session.getCluster.getMetadata
>>>>     statements.groupBy { st =>
>>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>>     }
>>>>   }
>>>>   val result =
>>>> Future.traverse(groupByFirstReplica(statements).values).map(st =>
>>>> newBatch(st).executeAsync())
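
A hedged footnote on the snippet above, not part of the original: getReplicas
can return an empty set when the driver has no token metadata for the key, and
calling .next on an empty iterator throws NoSuchElementException. A small
defensive variant under the same assumptions (same Session/Statement types,
same undefined newBatch helper elsewhere):

import com.datastax.driver.core.{Host, Session, Statement}

// Same idea as above, but statements with no known replica fall into a None
// bucket instead of failing on an empty iterator.
def groupByFirstReplica(statements: Iterable[Statement])
                       (implicit session: Session): Map[Option[Host], Iterable[Statement]] = {
  val meta = session.getCluster.getMetadata
  statements.groupBy { st =>
    val replicas = meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator()
    if (replicas.hasNext) Some(replicas.next()) else None
  }
}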
>>>>
>>>>
>>>> Let me get together my test code, it depends on some existing utilities
>>>> we use elsewhere, such as implicit conversions between Google and Scala
>>>> native futures.  I'll try to put this together in a format that's runnable
>>>> for you in a Scala REPL console without having to resolve our internal
>>>> dependencies.  This may not be today though.
>>>>
>>>> Also, @Ryan, I don't think that shuffling would make a difference for
>>>> my above tests since as Jon observed, all my nodes were already replicas
>>>> there.
>>>>
>>>>
>>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>>>> wrote:
>>>>
>>>>> Also..what happens when you turn on shuffle with token aware?
>>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>>
>>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>> wrote:
>>>>>>
>>>>>> To add to Ryan's (extremely valid!) point, your test works because
>>>>>> the coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>>> overhead on the server side.
>>>>>>
>>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>>> grouping (see
>>>>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>>>>> which massively helps performance.  It provides the benefit of batches but
>>>>>> without the coordinator overhead.
>>>>>>
>>>>>> Can you post your benchmark code?
>>>>>>
>>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>>> they can reduce network overhead because they're effectively a single
>>>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>>>> a coordinator that's
>>>>>>>
>>>>>>> 1) talking to every machine in the cluster and
>>>>>>> 2) waiting on a response from a significant portion of them
>>>>>>>
>>>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>>>> disk, can affect the performance of the entire batch.
>>>>>>>
>>>>>>>
>>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>>> jack@basetechnology.com> wrote:
>>>>>>>
>>>>>>>>   Jonathan and Ryan,
>>>>>>>>
>>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>>> between the server coordinator and the replicas) when batching multiple
>>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict with your
>>>>>>>> statement.
>>>>>>>>
>>>>>>>> See:
>>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>>
>>>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>>>> change to make it accurate.
>>>>>>>>
>>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>>>>> can save network exchanges between the client/server and server
>>>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>>>> successful, as described in Using and misusing batches section. For
>>>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>>>> loading without the Batch keyword."”
>>>>>>>>
>>>>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>>> operation is appropriate.
>>>>>>>>
>>>>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>>>>> based on overall cluster load.
>>>>>>>>
>>>>>>>> I would also note that the example in the spec has multiple inserts
>>>>>>>> with different partition key values, which flies in the face of the
>>>>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>>>>
>>>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>>>> intent and non-intent for BATCH.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla
>>>>>>>> <rs...@datastax.com>
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>> The really important thing to really take away from Ryan's original
>>>>>>>> post is that batches are not there for performance.  The only case I
>>>>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>>>>> this is when you've got multiple tables that are serving as different views
>>>>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>>>> failures).
>>>>>>>>
>>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>>> async calls
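
As an aside, not from the thread itself: the one use endorsed above, keeping
several tables that are views of the same data in step, looks roughly like
this with a LOGGED batch. Table and column names are invented, and `session`
is assumed to be an open Session.

import com.datastax.driver.core.{BatchStatement, SimpleStatement}

// The batch log ensures both inserts eventually apply together; this buys
// consistency between the two views, not performance.
val batch = new BatchStatement(BatchStatement.Type.LOGGED)
batch.add(new SimpleStatement(
  "INSERT INTO users_by_id (id, email) VALUES (?, ?)", "u1", "a@example.com"))
batch.add(new SimpleStatement(
  "INSERT INTO users_by_email (email, id) VALUES (?, ?)", "a@example.com", "u1"))
session.execute(batch)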
>>>>>>>>
>>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>>>  Ryan,
>>>>>>>>>
>>>>>>>>> Thanks for the quick response.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>>>> threshold.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In addition, Patrick is saying that he does not recommend more
>>>>>>>>> than 100 mutations per batch. So why not warn users just on the # of
>>>>>>>>> mutations in a batch?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>>>>> the story behind the original recommendation here
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>>> recommend more than ~100 mutations per batch. Doing some quick math I came
>>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>>
>>>>>>>>> Totally up for debate."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi –
>>>>>>>>>
>>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>>> *
>>>>>>>>>
>>>>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Ryan Svihla
>>>>>>>>>
>>>>>>>>> Solution Architect
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Ryan Svihla
>>>>>
>>>>> Solution Architect
>>>>>
>>>>>
>>>>
>>

Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Not a problem - it's good to hash this stuff out and understand the
technical reasons why something works or doesn't work.


On Sat Dec 13 2014 at 10:07:10 AM Jonathan Haddad <jo...@jonhaddad.com> wrote:

> On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <mi...@gmail.com> wrote:
>
>> Isn't the net effect of coordination overhead incurred by batches
>> basically the same as the overhead incurred by RoundRobin or other
>> non-token-aware request routing?  As the cluster size increases, each node
>> would coordinate the same percentage of writes in batches under token
>> awareness as they would under a more naive single statement routing
>> strategy.  If write volume per time unit is the same in both approaches,
>> each node ends up coordinating the majority of writes under either strategy
>> as the cluster grows.
>>
>
> If you're not token aware, there's extra coordinator overhead, yes.  If
> you are token aware, not the case.  I'm operating under the assumption that
> you'd want to be token aware, since I don't see a point in not doing so :)
>
> Unfortunately my Scala isn't the best so I'm going to have to take a
> little bit to wade through the code.
>
> It may be useful to run cassandra-stress (it doesn't seem to have a mode
> for batches) to get a baseline on non-batches.  I'm curious to know if you
> get different numbers than the scala profiler.
>
>
>
>>
>> GC pressure in the cluster is a concern of course, as you observe.  But
>> delta performance is *substantial* from what I can see.  As in the case
>> where you're bumping up against retries, this will cause you to fall over
>> much more rapidly as you approach your tipping point, but in a healthy
>> cluster, it's the same write volume, just a longer tenancy in eden.  If
>> reasonably sized batches are causing survivors, you're not far off from
>> falling over anyway.
>>
>> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> One thing to keep in mind is the overhead of a batch goes up as the
>>> number of servers increases.  Talking to 3 is going to have a much
>>> different performance profile than talking to 20.  Keep in mind that the
>>> coordinator is going to be talking to every server in the cluster with a
>>> big batch.  The amount of local writes will decrease as it owns a smaller
>>> portion of the ring.  All you've done is add an extra network hop between
>>> your client and where the data should actually be.  You also start to have
>>> an impact on GC in a very negative way.
>>>
>>> Your point is valid about topology changes, but that's a relatively rare
>>> occurrence, and the driver is notified pretty quickly, so I wouldn't
>>> optimize for that case.
>>>
>>> Can you post your test code in a gist or something?  I can't really talk
>>> about your benchmark without seeing it and you're basing your stance on the
>>> premise that it is correct, which it may not be.
>>>
>>>
>>>
>>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com> wrote:
>>>
>>>> You can see what the partition key strategies are for each of the
>>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>>> unique, and bckt is derived from end.  Some of these layouts result in
>>>> clustering on the same partition keys, that's actually tunable with the
>>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>>> should have a mean of 15 in that run - it's an input parameter to my
>>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>>> each record.
>>>>
>>>> Your points about:
>>>> 1) Failed batches having a higher cost than failed single statements
>>>> 2) In my test, every node was a replica for all data.
>>>>
>>>> These are both very good points.
>>>>
>>>> For #1, since the worst case scenario is nearly twice as fast in batches
>>>> as its single statement equivalent, in terms of impact on the client, you'd
>>>> have to be retrying half your batches before you broke even there (but of
>>>> course those retries are not free to the cluster, so you probably make the
>>>> performance tipping point approach a lot faster).  This alone may be cause
>>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>>> that's what this discussion is about!).
>>>>
>>>> For #2, that's certainly a good point, for this test cluster, I should
>>>> at least re-run with RF=1 so that proxying times start to matter.  If
>>>> you're not using a token aware client or not using a token aware policy for
>>>> whatever reason, this should even out though, no?  Each node will end up
>>>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>>>> batched or single statements.  The DS driver is very careful to caution
>>>> that the topology map it maintains makes no guarantees on freshness, so you
>>>> may see a significant performance penalty in your client when the topology
>>>> changes if you're depending on token aware routing as part of your
>>>> performance requirements.
>>>>
>>>>
>>>> I'm curious what your thoughts are on grouping statements by primary
>>>> replica according to the routing policy, and executing unlogged batches
>>>> that way (so that for token aware routing, all statements are executed on a
>>>> replica, for others it'd make no difference).  Retries are still more
>>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>>> easy to do in Scala:
>>>>
>>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>>> session: Session): Map[Host, Seq[Statement]] = {
>>>>     val meta = session.getCluster.getMetadata
>>>>     statements.groupBy { st =>
>>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>>     }
>>>>   }
>>>>   val result =
>>>>     Future.traverse(groupByFirstReplica(statements).values).map(st =>
>>>>       newBatch(st).executeAsync())
>>>>
>>>>
>>>> Let me get together my test code, it depends on some existing utilities
>>>> we use elsewhere, such as implicit conversions between Google and Scala
>>>> native futures.  I'll try to put this together in a format that's runnable
>>>> for you in a Scala REPL console without having to resolve our internal
>>>> dependencies.  This may not be today though.
>>>>
>>>> Also, @Ryan, I don't think that shuffling would make a difference for
>>>> my above tests since as Jon observed, all my nodes were already replicas
>>>> there.
>>>>
>>>>
>>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>>>> wrote:
>>>>
>>>>> Also..what happens when you turn on shuffle with token aware?
>>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>>
>>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>> wrote:
>>>>>>
>>>>>> To add to Ryan's (extremely valid!) point, your test works because
>>>>>> the coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>>> overhead on the server side.
>>>>>>
>>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>>> grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>>>>> which massively helps performance.  It provides the
>>>>>> benefit of batches but without the coordinator overhead.
>>>>>>
>>>>>> Can you post your benchmark code?
>>>>>>
>>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>>> they can reduce network overhead because they're effectively a single
>>>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>>>> a coordinator that's
>>>>>>>
>>>>>>> 1) talking to every machine in the cluster and
>>>>>>> 2) waiting on a response from a significant portion of them
>>>>>>>
>>>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>>>> disk, can affect the performance of the entire batch.
>>>>>>>
>>>>>>>
>>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>>> jack@basetechnology.com> wrote:
>>>>>>>
>>>>>>>>   Jonathan and Ryan,
>>>>>>>>
>>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>>> between the server coordinator and the replicas) when batching multiple
>>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict with your
>>>>>>>> statement.
>>>>>>>>
>>>>>>>> See:
>>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>>
>>>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>>>> change to make it accurate.
>>>>>>>>
>>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>>>>> can save network exchanges between the client/server and server
>>>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>>>> successful, as described in Using and misusing batches section. For
>>>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>>>> loading without the Batch keyword."”
>>>>>>>>
>>>>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>>> operation is appropriate.
>>>>>>>>
>>>>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>>>>> based on overall cluster load.
>>>>>>>>
>>>>>>>> I would also note that the example in the spec has multiple inserts
>>>>>>>> with different partition key values, which flies in the face of the
>>>>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>>>>
>>>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>>>> intent and non-intent for BATCH.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla
>>>>>>>> <rs...@datastax.com>
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>> The really important thing to really take away from Ryan's original
>>>>>>>> post is that batches are not there for performance.  The only case I
>>>>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>>>>> this is when you've got multiple tables that are serving as different views
>>>>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>>>> failures).
>>>>>>>>
>>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>>> async calls
>>>>>>>>
>>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>>>  Ryan,
>>>>>>>>>
>>>>>>>>> Thanks for the quick response.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>>>> threshold.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In addition, Patrick is saying that he does not recommend more
>>>>>>>>> than 100 mutations per batch. So why not warn users just on the # of
>>>>>>>>> mutations in a batch?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>>>>> the story behind the original recommendation here
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>>> recommend more than ~100 mutations per batch. Doing some quick math I came
>>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>>
>>>>>>>>> Totally up for debate."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi –
>>>>>>>>>
>>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>>> *
>>>>>>>>>
>>>>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Ryan Svihla
>>>>>>>>>
>>>>>>>>> Solution Architect
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Ryan Svihla
>>>>>
>>>>> Solution Architect
>>>>>
>>>>>
>>>>
>>

Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <mi...@gmail.com> wrote:

> Isn't the net effect of coordination overhead incurred by batches
> basically the same as the overhead incurred by RoundRobin or other
> non-token-aware request routing?  As the cluster size increases, each node
> would coordinate the same percentage of writes in batches under token
> awareness as they would under a more naive single statement routing
> strategy.  If write volume per time unit is the same in both approaches,
> each node ends up coordinating the majority of writes under either strategy
> as the cluster grows.
>

If you're not token aware, there's extra coordinator overhead, yes.  If you
are token aware, not the case.  I'm operating under the assumption that
you'd want to be token aware, since I don't see a point in not doing so :)
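
For concreteness, a minimal sketch of what token awareness looks like on the
client with the DataStax Java driver 2.1 (the TokenAwarePolicy linked elsewhere
in this thread); the contact point is a placeholder, and the boolean is the
replica-shuffling option Ryan asked about. If the two-argument constructor
isn't available in your driver version, the single-argument form gives token
awareness without shuffling.

import com.datastax.driver.core.Cluster
import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

// Token-aware routing wrapped around DC-aware round robin; `true` shuffles
// among replicas instead of always picking the first one.
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1")
  .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy(), true))
  .build()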

Unfortunately my Scala isn't the best so I'm going to have to take a little
bit to wade through the code.

It may be useful to run cassandra-stress (it doesn't seem to have a mode
for batches) to get a baseline on non-batches.  I'm curious to know if you
get different numbers than the scala profiler.
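
For reference, a non-batch write baseline with the 2.1 stress tool would look
something like the line below; the flags vary between versions and the node
address is a placeholder.

  cassandra-stress write n=1000000 -rate threads=50 -node 127.0.0.1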



>
> GC pressure in the cluster is a concern of course, as you observe.  But
> delta performance is *substantial* from what I can see.  As in the case
> where you're bumping up against retries, this will cause you to fall over
> much more rapidly as you approach your tipping point, but in a healthy
> cluster, it's the same write volume, just a longer tenancy in eden.  If
> reasonably sized batches are causing survivors, you're not far off from
> falling over anyway.
>
> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> One thing to keep in mind is the overhead of a batch goes up as the
>> number of servers increases.  Talking to 3 is going to have a much
>> different performance profile than talking to 20.  Keep in mind that the
>> coordinator is going to be talking to every server in the cluster with a
>> big batch.  The amount of local writes will decrease as it owns a smaller
>> portion of the ring.  All you've done is add an extra network hop between
>> your client and where the data should actually be.  You also start to have
>> an impact on GC in a very negative way.
>>
>> Your point is valid about topology changes, but that's a relatively rare
>> occurrence, and the driver is notified pretty quickly, so I wouldn't
>> optimize for that case.
>>
>> Can you post your test code in a gist or something?  I can't really talk
>> about your benchmark without seeing it and you're basing your stance on the
>> premise that it is correct, which it may not be.
>>
>>
>>
>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com> wrote:
>>
>>> You can see what the partition key strategies are for each of the
>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>> unique, and bckt is derived from end.  Some of these layouts result in
>>> clustering on the same partition keys, that's actually tunable with the
>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>> should have a mean of 15 in that run - it's an input parameter to my
>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>> each record.
>>>
>>> Your points about:
>>> 1) Failed batches having a higher cost than failed single statements
>>> 2) In my test, every node was a replica for all data.
>>>
>>> These are both very good points.
>>>
>>> For #1, since the worst case scenario is nearly twice as fast in batches as
>>> its single statement equivalent, in terms of impact on the client, you'd
>>> have to be retrying half your batches before you broke even there (but of
>>> course those retries are not free to the cluster, so you probably make the
>>> performance tipping point approach a lot faster).  This alone may be cause
>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>> that's what this discussion is about!).
>>>
>>> For #2, that's certainly a good point, for this test cluster, I should
>>> at least re-run with RF=1 so that proxying times start to matter.  If
>>> you're not using a token aware client or not using a token aware policy for
>>> whatever reason, this should even out though, no?  Each node will end up
>>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>>> batched or single statements.  The DS driver is very careful to caution
>>> that the topology map it maintains makes no guarantees on freshness, so you
>>> may see a significant performance penalty in your client when the topology
>>> changes if you're depending on token aware routing as part of your
>>> performance requirements.
>>>
>>>
>>> I'm curious what your thoughts are on grouping statements by primary
>>> replica according to the routing policy, and executing unlogged batches
>>> that way (so that for token aware routing, all statements are executed on a
>>> replica, for others it'd make no difference).  Retries are still more
>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>> easy to do in Scala:
>>>
>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>> session: Session): Map[Host, Seq[Statement]] = {
>>>     val meta = session.getCluster.getMetadata
>>>     statements.groupBy { st =>
>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>     }
>>>   }
>>>   val result =
>>> Future.traverse(groupByFirstReplica(statements).values).map(st =>
>>> newBatch(st).executeAsync())
>>>
>>>
>>> Let me get together my test code, it depends on some existing utilities
>>> we use elsewhere, such as implicit conversions between Google and Scala
>>> native futures.  I'll try to put this together in a format that's runnable
>>> for you in a Scala REPL console without having to resolve our internal
>>> dependencies.  This may not be today though.
>>>
>>> Also, @Ryan, I don't think that shuffling would make a difference for my
>>> above tests since as Jon observed, all my nodes were already replicas there.
>>>
>>>
>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>>> wrote:
>>>
>>>> Also..what happens when you turn on shuffle with token aware?
>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>
>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>>> wrote:
>>>>>
>>>>> To add to Ryan's (extremely valid!) point, your test works because the
>>>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>> overhead on the server side.
>>>>>
>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>> grouping (see
>>>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>>>> which massively helps performance.  It provides the benefit of batches but
>>>>> without the coordinator overhead.
>>>>>
>>>>> Can you post your benchmark code?
>>>>>
>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>> they can reduce network overhead because they're effectively a single
>>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>>> a coordinator that's
>>>>>>
>>>>>> 1) talking to every machine in the cluster and
>>>>>> 2) waiting on a response from a significant portion of them
>>>>>>
>>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>>> disk, can affect the performance of the entire batch.
>>>>>>
>>>>>>
>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>> jack@basetechnology.com> wrote:
>>>>>>
>>>>>>>   Jonathan and Ryan,
>>>>>>>
>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>> between the server coordinator and the replicas) when batching multiple
>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict with your
>>>>>>> statement.
>>>>>>>
>>>>>>> See:
>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>
>>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>>> change to make it accurate.
>>>>>>>
>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>>>> can save network exchanges between the client/server and server
>>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>>> successful, as described in Using and misusing batches section. For
>>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>>> loading without the Batch keyword."”
>>>>>>>
>>>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>> operation is appropriate.
>>>>>>>
>>>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>>>> based on overall cluster load.
>>>>>>>
>>>>>>> I would also note that the example in the spec has multiple inserts
>>>>>>> with different partition key values, which flies in the face of the
>>>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>>>
>>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>>> intent and non-intent for BATCH.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>
>>>>>>> The really important thing to really take away from Ryan's original
>>>>>>> post is that batches are not there for performance.  The only case I
>>>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>>>> this is when you've got multiple tables that are serving as different views
>>>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>>> failures).
>>>>>>>
>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>> async calls
>>>>>>>
>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>
>>>>>>>>  Ryan,
>>>>>>>>
>>>>>>>> Thanks for the quick response.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>>> threshold.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>>>>> in a batch?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mohammed
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>>>> the story behind the original recommendation here
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>
>>>>>>>>
>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>> recommend more than ~100 mutations per batch. Doing some quick math I came
>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>
>>>>>>>> Totally up for debate."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>> Hi –
>>>>>>>>
>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>> *
>>>>>>>>
>>>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mohammed
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Ryan Svihla
>>>>>>>>
>>>>>>>> Solution Architect
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>>
>>>> Ryan Svihla
>>>>
>>>> Solution Architect
>>>>
>>>>
>>>
>

Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
Isn't the net effect of coordination overhead incurred by batches basically
the same as the overhead incurred by RoundRobin or other non-token-aware
request routing?  As the cluster size increases, each node would coordinate
the same percentage of writes in batches under token awareness as they
would under a more naive single statement routing strategy.  If write
volume per time unit is the same in both approaches, each node ends up
coordinating the majority of writes under either strategy as the cluster
grows.
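
(To put a rough number on that: with 20 nodes and RF=3, a randomly routed
single statement lands on a replica only about 3 times in 20, so roughly 85% of
writes pay an extra coordinator hop either way; the batch case concentrates
that fan-out on a single coordinator per batch instead of spreading it across
requests.)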

GC pressure in the cluster is a concern of course, as you observe.  But
delta performance is *substantial* from what I can see.  As in the case
where you're bumping up against retries, this will cause you to fall over
much more rapidly as you approach your tipping point, but in a healthy
cluster, it's the same write volume, just a longer tenancy in eden.  If
reasonably sized batches are causing survivors, you're not far off from
falling over anyway.

On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> One thing to keep in mind is the overhead of a batch goes up as the number
> of servers increases.  Talking to 3 is going to have a much different
> performance profile than talking to 20.  Keep in mind that the coordinator
> is going to be talking to every server in the cluster with a big batch.
> The amount of local writes will decrease as it owns a smaller portion of
> the ring.  All you've done is add an extra network hop between your client
> and where the data should actually be.  You also start to have an impact on
> GC in a very negative way.
>
> Your point is valid about topology changes, but that's a relatively rare
> occurrence, and the driver is notified pretty quickly, so I wouldn't
> optimize for that case.
>
> Can you post your test code in a gist or something?  I can't really talk
> about your benchmark without seeing it and you're basing your stance on the
> premise that it is correct, which it may not be.
>
>
>
> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com> wrote:
>
>> You can see what the partition key strategies are for each of the
>> tables, test5 shows the least improvement.  The set (aid, end) should be
>> unique, and bckt is derived from end.  Some of these layouts result in
>> clustering on the same partition keys, that's actually tunable with the
>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>> should have a mean of 15 in that run - it's an input parameter to my
>> tests).  "test5" obviously ends up being exclusively unique partitions for
>> each record.
>>
>> Your points about:
>> 1) Failed batches having a higher cost than failed single statements
>> 2) In my test, every node was a replica for all data.
>>
>> These are both very good points.
>>
>> For #1, since the worst case scenario is nearly twice as fast in batches as
>> its single statement equivalent, in terms of impact on the client, you'd
>> have to be retrying half your batches before you broke even there (but of
>> course those retries are not free to the cluster, so you probably make the
>> performance tipping point approach a lot faster).  This alone may be cause
>> to justify avoiding batches, or at least severely limiting their size (hey,
>> that's what this discussion is about!).
>>
>> For #2, that's certainly a good point, for this test cluster, I should at
>> least re-run with RF=1 so that proxying times start to matter.  If you're
>> not using a token aware client or not using a token aware policy for
>> whatever reason, this should even out though, no?  Each node will end up
>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>> batched or single statements.  The DS driver is very careful to caution
>> that the topology map it maintains makes no guarantees on freshness, so you
>> may see a significant performance penalty in your client when the topology
>> changes if you're depending on token aware routing as part of your
>> performance requirements.
>>
>>
>> I'm curious what your thoughts are on grouping statements by primary
>> replica according to the routing policy, and executing unlogged batches
>> that way (so that for token aware routing, all statements are executed on a
>> replica, for others it'd make no difference).  Retries are still more
>> expensive, but token aware proxying avoidance is still had.  It's pretty
>> easy to do in Scala:
>>
>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>> session: Session): Map[Host, Seq[Statement]] = {
>>     val meta = session.getCluster.getMetadata
>>     statements.groupBy { st =>
>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>     }
>>   }
>>   val result =
>> Future.traverse(groupByFirstReplica(statements).values).map(st =>
>> newBatch(st).executeAsync())
>>
>>
>> Let me get together my test code, it depends on some existing utilities
>> we use elsewhere, such as implicit conversions between Google and Scala
>> native futures.  I'll try to put this together in a format that's runnable
>> for you in a Scala REPL console without having to resolve our internal
>> dependencies.  This may not be today though.
>>
>> Also, @Ryan, I don't think that shuffling would make a difference for my
>> above tests since as Jon observed, all my nodes were already replicas there.
>>
>>
>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com>
>> wrote:
>>
>>> Also..what happens when you turn on shuffle with token aware?
>>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>
>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>>
>>>> To add to Ryan's (extremely valid!) point, your test works because the
>>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>> Batching works great at RF=N=3 because it always gets to write to local and
>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>> overhead on the server side.
>>>>
>>>> To save network overhead, Cassandra 2.1 added support for response
>>>> grouping (see
>>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>>> which massively helps performance.  It provides the benefit of batches but
>>>> without the coordinator overhead.
>>>>
>>>> Can you post your benchmark code?
>>>>
>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>>> wrote:
>>>>
>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>> they can reduce network overhead because they're effectively a single
>>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>>> most people aren't!) you end up putting additional pressure on the
>>>>> coordinator because now it has to talk to several other servers.  If you
>>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>>> a coordinator that's
>>>>>
>>>>> 1) talking to every machine in the cluster and
>>>>> 2) waiting on a response from a significant portion of them
>>>>>
>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>> disk, can affect the performance of the entire batch.
>>>>>
>>>>>
>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>> jack@basetechnology.com> wrote:
>>>>>
>>>>>>   Jonathan and Ryan,
>>>>>>
>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>> trying to lump queries together to reduce network & server overhead - in
>>>>>> fact it'll do the opposite”, but I would note that the CQL3 spec says “
>>>>>> The BATCH statement ... serves several purposes: 1. It saves network
>>>>>> round-trips between the client and the server (and sometimes between the
>>>>>> server coordinator and the replicas) when batching multiple updates.” Is
>>>>>> the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>>>
>>>>>> See:
>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>
>>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>>> change to make it accurate.
>>>>>>
>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>>> can save network exchanges between the client/server and server
>>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>>> successful, as described in Using and misusing batches section. For
>>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>>> loading without the Batch keyword."”
>>>>>>
>>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>> operation is appropriate.
>>>>>>
>>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>>> based on overall cluster load.
>>>>>>
>>>>>> I would also note that the example in the spec has multiple inserts
>>>>>> with different partition key values, which flies in the face of the
>>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>>
>>>>>> At a minimum the CQL spec should make a more clear statement of
>>>>>> intent and non-intent for BATCH.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>
>>>>>> The really important thing to really take away from Ryan's original
>>>>>> post is that batches are not there for performance.  The only case I
>>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>>> this is when you've got multiple tables that are serving as different views
>>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>>> failures).
>>>>>>
>>>>>> tl;dr: you probably don't want batch, you most likely want many async
>>>>>> calls
>>>>>>
>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>
>>>>>>>  Ryan,
>>>>>>>
>>>>>>> Thanks for the quick response.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I did see that jira before posting my question on this list.
>>>>>>> However, I didn’t see any information about why 5kb+ data will cause
>>>>>>> instability. 5kb or even 50kb seems too small. For example, if each
>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you will hit that
>>>>>>> threshold.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>>>> in a batch?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mohammed
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>> *To:* user@cassandra.apache.org
>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>>> the story behind the original recommendation here
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>
>>>>>>>
>>>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>>>>>> more than ~100 mutations per batch. Doing some quick math I came up with 5k
>>>>>>> as 100 x 50 byte mutations.
>>>>>>>
>>>>>>> Totally up for debate."
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It's totally changeable, however, it's there in no small part
>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>
>>>>>>> Hi –
>>>>>>>
>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>> *
>>>>>>>
>>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>>> threshold as it can lead to node instability.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mohammed
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>>
>>>>>>> Ryan Svihla
>>>>>>>
>>>>>>> Solution Architect
>>>>>>>
>>>>>>>
>>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>>>> is the database technology and transactional backbone of choice for the
>>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>
>>> --
>>>
>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>
>>> Ryan Svihla
>>>
>>> Solution Architect
>>>
>>> [image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>
>>> DataStax is the fastest, most scalable distributed database technology,
>>> delivering Apache Cassandra to the world’s most innovative enterprises.
>>> Datastax is built to be agile, always-on, and predictably scalable to any
>>> size. With more than 500 customers in 45 countries, DataStax is the
>>> database technology and transactional backbone of choice for the worlds
>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>
>>>
>>

Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
One thing to keep in mind is the overhead of a batch goes up as the number
of servers increases.  Talking to 3 is going to have a much different
performance profile than talking to 20.  Keep in mind that the coordinator
is going to be talking to every server in the cluster with a big batch.
The amount of local writes will decrease as it owns a smaller portion of
the ring.  All you've done is add an extra network hop between your client
and where the data should actually be.  You also start to have an impact on
GC in a very negative way.
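
(Rough, purely illustrative numbers: at RF=3 on a 3-node cluster the
coordinator is a replica for every mutation, so a 100-partition batch is
mostly local writes plus two peers.  At RF=3 on a 30-node cluster it is a
replica for only ~10% of those mutations, so the same batch fans out to
essentially all 29 other nodes and its response is gated on the slowest of
them.)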

Your point is valid about topology changes, but that's a relatively rare
occurrence, and the driver is notified pretty quickly, so I wouldn't
optimize for that case.

Can you post your test code in a gist or something?  I can't really talk
about your benchmark without seeing it and you're basing your stance on the
premise that it is correct, which it may not be.



On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mi...@gmail.com> wrote:

> You can see what the partition key strategies are for each of the tables,
> test5 shows the least improvement.  The set (aid, end) should be unique,
> and bckt is derived from end.  Some of these layouts result in clustering
> on the same partition keys, that's actually tunable with the "~15 per
> bucket" reported (exact number of entries per bucket will vary but should
> have a mean of 15 in that run - it's an input parameter to my tests).
>  "test5" obviously ends up being exclusively unique partitions for each
> record.
>
> Your points about:
> 1) Failed batches having a higher cost than failed single statements
> 2) In my test, every node was a replica for all data.
>
> These are both very good points.
>
> For #1, since the worst case scenario is nearly twice as fast in batches as
> its single statement equivalent, in terms of impact on the client, you'd
> have to be retrying half your batches before you broke even there (but of
> course those retries are not free to the cluster, so you probably make the
> performance tipping point approach a lot faster).  This alone may be cause
> to justify avoiding batches, or at least severely limiting their size (hey,
> that's what this discussion is about!).
>
> For #2, that's certainly a good point, for this test cluster, I should at
> least re-run with RF=1 so that proxying times start to matter.  If you're
> not using a token aware client or not using a token aware policy for
> whatever reason, this should even out though, no?  Each node will end up
> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
> batched or single statements.  The DS driver is very careful to caution
> that the topology map it maintains makes no guarantees on freshness, so you
> may see a significant performance penalty in your client when the topology
> changes if you're depending on token aware routing as part of your
> performance requirements.
>
>
> I'm curious what your thoughts are on grouping statements by primary
> replica according to the routing policy, and executing unlogged batches
> that way (so that for token aware routing, all statements are executed on a
> replica, for others it'd make no difference).  Retries are still more
> expensive, but token aware proxying avoidance is still had.  It's pretty
> easy to do in Scala:
>
>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
> session: Session): Map[Host, Seq[Statement]] = {
>     val meta = session.getCluster.getMetadata
>     statements.groupBy { st =>
>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>     }
>   }
>   val result =
> Future.traverse(groupByFirstReplica(statements).values).map(st =>
> newBatch(st).executeAsync())
>
>
> Let me get together my test code, it depends on some existing utilities we
> use elsewhere, such as implicit conversions between Google and Scala native
> futures.  I'll try to put this together in a format that's runnable for you
> in a Scala REPL console without having to resolve our internal
> dependencies.  This may not be today though.
>
> Also, @Ryan, I don't think that shuffling would make a difference for my
> above tests since as Jon observed, all my nodes were already replicas there.
>
>
> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com> wrote:
>
>> Also..what happens when you turn on shuffle with token aware?
>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>
>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>>
>>> To add to Ryan's (extremely valid!) point, your test works because the
>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>> Batching works great at RF=N=3 because it always gets to write to local and
>>> talk to exactly 2 other servers on every request.  Consider what happens
>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>> overhead on the server side.
>>>
>>> To save network overhead, Cassandra 2.1 added support for response
>>> grouping (see
>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>> which massively helps performance.  It provides the benefit of batches but
>>> without the coordinator overhead.
>>>
>>> Can you post your benchmark code?
>>>
>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>
>>>> There are cases where it can.  For instance, if you batch multiple
>>>> mutations to the same partition (and talk to a replica for that partition)
>>>> they can reduce network overhead because they're effectively a single
>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>> most people aren't!) you end up putting additional pressure on the
>>>> coordinator because now it has to talk to several other servers.  If you
>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>> a coordinator that's
>>>>
>>>> 1) talking to every machine in the cluster and
>>>> b) waiting on a response from a significant portion of them
>>>>
>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>> disk, can affect the performance of the entire batch.
>>>>
>>>>
>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>> jack@basetechnology.com> wrote:
>>>>
>>>>>   Jonathan and Ryan,
>>>>>
>>>>> Jonathan says “It is absolutely not going to help you if you're trying
>>>>> to lump queries together to reduce network & server overhead - in fact
>>>>> it'll do the opposite”, but I would note that the CQL3 spec says “The
>>>>> BATCH statement ... serves several purposes: 1. It saves network
>>>>> round-trips between the client and the server (and sometimes between the
>>>>> server coordinator and the replicas) when batching multiple updates.” Is
>>>>> the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>>
>>>>> See:
>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>
>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>> change to make it accurate.
>>>>>
>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>> can save network exchanges between the client/server and server
>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>> successful, as described in Using and misusing batches section. For
>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>> loading without the Batch keyword."”
>>>>>
>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>> then let the driver determine what degree of batching and asynchronous
>>>>> operation is appropriate.
>>>>>
>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>> based on overall cluster load.
>>>>>
>>>>> I would also note that the example in the spec has multiple inserts
>>>>> with different partition key values, which flies in the face of the
>>>>> admonition to refrain from using server-side distribution of requests.
>>>>>
>>>>> At a minimum the CQL spec should make a more clear statement of intent
>>>>> and non-intent for BATCH.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>
>>>>> The really important thing to really take away from Ryan's original
>>>>> post is that batches are not there for performance.  The only case I
>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>> this is when you've got multiple tables that are serving as different views
>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>> failures).
>>>>>
>>>>> tl;dr: you probably don't want batch, you most likely want many async
>>>>> calls
>>>>>
>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>> mohammed@glassbeam.com> wrote:
>>>>>
>>>>>>  Ryan,
>>>>>>
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I did see that jira before posting my question on this list. However,
>>>>>> I didn’t see any information about why 5kb+ data will cause instability.
>>>>>> 5kb or even 50kb seems too small. For example, if each mutation is 1000+
>>>>>> bytes, then with just 5 mutations, you will hit that threshold.
>>>>>>
>>>>>>
>>>>>>
>>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>>> in a batch?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mohammed
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>> the story behind the original recommendation here
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>
>>>>>>
>>>>>>
>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>
>>>>>>
>>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>>>>> more than ~100 mutations per batch. Doing some quick math I came up with 5k
>>>>>> as 100 x 50 byte mutations.
>>>>>>
>>>>>> Totally up for debate."
>>>>>>
>>>>>>
>>>>>>
>>>>>> It's totally changeable, however, it's there in no small part because
>>>>>> so many people confuse the BATCH keyword as a performance optimization,
>>>>>> this helps flag those cases of misuse.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>
>>>>>> Hi –
>>>>>>
>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>> *
>>>>>>
>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>> threshold as it can lead to node instability.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mohammed
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>
>>>>>> Ryan Svihla
>>>>>>
>>>>>> Solution Architect
>>>>>>
>>>>>>
>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>
>>>>>>
>>>>>>
>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>>> is the database technology and transactional backbone of choice for the
>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
>> --
>>
>> [image: datastax_logo.png] <http://www.datastax.com/>
>>
>> Ryan Svihla
>>
>> Solution Architect
>>
>> [image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>
>> DataStax is the fastest, most scalable distributed database technology,
>> delivering Apache Cassandra to the world’s most innovative enterprises.
>> Datastax is built to be agile, always-on, and predictably scalable to any
>> size. With more than 500 customers in 45 countries, DataStax is the
>> database technology and transactional backbone of choice for the worlds
>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>
>>
>

Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
You can see what the partition key strategies are for each of the tables,
test5 shows the least improvement.  The set (aid, end) should be unique,
and bckt is derived from end.  Some of these layouts result in clustering
on the same partition keys, that's actually tunable with the "~15 per
bucket" reported (exact number of entries per bucket will vary but should
have a mean of 15 in that run - it's an input parameter to my tests).
 "test5" obviously ends up being exclusively unique partitions for each
record.

Your points about:
1) Failed batches having a higher cost than failed single statements
2) In my test, every node was a replica for all data.

These are both very good points.

For #1, since the worst case scenario is nearly twice as fast in batches as
its single statement equivalent, in terms of impact on the client, you'd
have to be retrying half your batches before you broke even there (but of
course those retries are not free to the cluster, so you probably make the
performance tipping point approach a lot faster).  This alone may be cause
to justify avoiding batches, or at least severely limiting their size (hey,
that's what this discussion is about!).

For #2, that's certainly a good point; for this test cluster, I should at
least re-run with RF=1 so that proxying times start to matter.  If you're
not using a token aware client or not using a token aware policy for
whatever reason, this should even out though, no?  Each node will end up
coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
batched or single statements.  The DS driver is very careful to caution
that the topology map it maintains makes no guarantees on freshness, so you
may see a significant performance penalty in your client when the topology
changes if you're depending on token aware routing as part of your
performance requirements.


I'm curious what your thoughts are on grouping statements by primary
replica according to the routing policy, and executing unlogged batches
that way (so that for token aware routing, all statements are executed on a
replica, for others it'd make no difference).  Retries are still more
expensive, but you still avoid the extra proxying hop.  It's pretty easy
to do in Scala:

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global
  import com.datastax.driver.core.{Host, Session, Statement}

  // Group statements by the first replica the driver would route them to.
  def groupByFirstReplica(statements: Iterable[Statement])(implicit
      session: Session): Map[Host, Iterable[Statement]] = {
    val meta = session.getCluster.getMetadata
    statements.groupBy { st =>
      meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
    }
  }

  // One unlogged batch per replica group, executed concurrently.
  val result = Future.traverse(groupByFirstReplica(statements).values) { st =>
    newBatch(st).executeAsync()
  }


Let me get together my test code; it depends on some existing utilities we
use elsewhere, such as implicit conversions between Google and Scala native
futures.  I'll try to put this together in a format that's runnable for you
in a Scala REPL console without having to resolve our internal
dependencies.  This may not be today though.
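
In the meantime, here's a minimal sketch of the helpers that snippet
assumes, so it can stand on its own; newBatch and the Guava-to-Scala
future adapter are the names my code relies on, but the bodies below are
illustrative guesses rather than our actual utilities:

  import com.datastax.driver.core.{BatchStatement, ResultSet, Session, Statement}
  import com.google.common.util.concurrent.{FutureCallback, Futures, ListenableFuture}
  import scala.concurrent.{Future, Promise}

  // Wrap a group of statements in a single UNLOGGED batch.
  def newBatch(statements: Iterable[Statement]): BatchStatement = {
    val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
    statements.foreach(s => batch.add(s))
    batch
  }

  // Adapt the driver's Guava ListenableFuture to a Scala Future.
  def toScalaFuture[T](lf: ListenableFuture[T]): Future[T] = {
    val p = Promise[T]()
    Futures.addCallback(lf, new FutureCallback[T] {
      def onSuccess(result: T): Unit = p.success(result)
      def onFailure(t: Throwable): Unit = p.failure(t)
    })
    p.future
  }

  // Lets a statement (or batch) be executed through an implicit session,
  // which is what newBatch(st).executeAsync() above leans on.
  implicit class AsyncStatement(st: Statement) {
    def executeAsync()(implicit session: Session): Future[ResultSet] =
      toScalaFuture(session.executeAsync(st))
  }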

Also, @Ryan, I don't think that shuffling would make a difference for my
above tests since, as Jon observed, all my nodes were already replicas there.


On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rs...@datastax.com> wrote:

> Also..what happens when you turn on shuffle with token aware?
> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>
> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>>
>> To add to Ryan's (extremely valid!) point, your test works because the
>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>> Batching works great at RF=N=3 because it always gets to write to local and
>> talk to exactly 2 other servers on every request.  Consider what happens
>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>> overhead on the server side.
>>
>> To save network overhead, Cassandra 2.1 added support for response
>> grouping (see
>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which
>> massively helps performance.  It provides the benefit of batches but
>> without the coordinator overhead.
>>
>> Can you post your benchmark code?
>>
>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> There are cases where it can.  For instance, if you batch multiple
>>> mutations to the same partition (and talk to a replica for that partition)
>>> they can reduce network overhead because they're effectively a single
>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>> most people aren't!) you end up putting additional pressure on the
>>> coordinator because now it has to talk to several other servers.  If you
>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>> a coordinator that's
>>>
>>> 1) talking to every machine in the cluster and
>>> b) waiting on a response from a significant portion of them
>>>
>>> before it can respond success or fail.  Any delay, from GC to a bad
>>> disk, can affect the performance of the entire batch.
>>>
>>>
>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <ja...@basetechnology.com>
>>> wrote:
>>>
>>>>   Jonathan and Ryan,
>>>>
>>>> Jonathan says “It is absolutely not going to help you if you're trying
>>>> to lump queries together to reduce network & server overhead - in fact
>>>> it'll do the opposite”, but I would note that the CQL3 spec says “The
>>>> BATCH statement ... serves several purposes: 1. It saves network
>>>> round-trips between the client and the server (and sometimes between the
>>>> server coordinator and the replicas) when batching multiple updates.” Is
>>>> the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>
>>>> See:
>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>
>>>> I see the spec as gospel – if it’s not accurate, let’s propose a change
>>>> to make it accurate.
>>>>
>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements can
>>>> save network exchanges between the client/server and server
>>>> coordinator/replicas. However, because of the distributed nature of
>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>> optimize performance. Using batches to optimize performance is usually not
>>>> successful, as described in Using and misusing batches section. For
>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>> loading without the Batch keyword."”
>>>>
>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>> simply a way to collect “batches” of operations in the client/driver and
>>>> then let the driver determine what degree of batching and asynchronous
>>>> operation is appropriate.
>>>>
>>>> It might also be nice to have an inquiry for the cluster as to what
>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>> based on overall cluster load.
>>>>
>>>> I would also note that the example in the spec has multiple inserts
>>>> with different partition key values, which flies in the face of the
>>>> admonition to refrain from using server-side distribution of requests.
>>>>
>>>> At a minimum the CQL spec should make a more clear statement of intent
>>>> and non-intent for BATCH.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>
>>>> The really important thing to really take away from Ryan's original
>>>> post is that batches are not there for performance.  The only case I
>>>> consider batches to be useful for is when you absolutely need to know that
>>>> several tables all get a mutation (via logged batches).  The use case for
>>>> this is when you've got multiple tables that are serving as different views
>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>> queries together to reduce network & server overhead - in fact it'll do the
>>>> opposite.  If you're trying to do that, instead perform many async
>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>> failures).
>>>>
>>>> tl;dr: you probably don't want batch, you most likely want many async
>>>> calls
>>>>
>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>> mohammed@glassbeam.com> wrote:
>>>>
>>>>>  Ryan,
>>>>>
>>>>> Thanks for the quick response.
>>>>>
>>>>>
>>>>>
>>>>> I did see that jira before posting my question on this list. However,
>>>>> I didn’t see any information about why 5kb+ data will cause instability.
>>>>> 5kb or even 50kb seems too small. For example, if each mutation is 1000+
>>>>> bytes, then with just 5 mutations, you will hit that threshold.
>>>>>
>>>>>
>>>>>
>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>> in a batch?
>>>>>
>>>>>
>>>>>
>>>>> Mohammed
>>>>>
>>>>>
>>>>>
>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>> *To:* user@cassandra.apache.org
>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>
>>>>>
>>>>>
>>>>> Nothing magic, just put in there based on experience. You can find the
>>>>> story behind the original recommendation here
>>>>>
>>>>>
>>>>>
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>
>>>>>
>>>>>
>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>
>>>>>
>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>>>> more than ~100 mutations per batch. Doing some quick math I came up with 5k
>>>>> as 100 x 50 byte mutations.
>>>>>
>>>>> Totally up for debate."
>>>>>
>>>>>
>>>>>
>>>>> It's totally changeable, however, it's there in no small part because
>>>>> so many people confuse the BATCH keyword as a performance optimization,
>>>>> this helps flag those cases of misuse.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>> mohammed@glassbeam.com> wrote:
>>>>>
>>>>> Hi –
>>>>>
>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>> *
>>>>>
>>>>> The default size is 5kb and according to the comments in the yaml
>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>> threshold as it can lead to node instability.
>>>>>
>>>>>
>>>>>
>>>>> Does anybody know the significance of this magic number 5kb? Why would
>>>>> a higher number (say 10kb) lead to node instability?
>>>>>
>>>>>
>>>>>
>>>>> Mohammed
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>
>>>>> Ryan Svihla
>>>>>
>>>>> Solution Architect
>>>>>
>>>>>
>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>
>>>>>
>>>>>
>>>>> DataStax is the fastest, most scalable distributed database
>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>> is the database technology and transactional backbone of choice for the
>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>
>>>>>
>>>>>
>>>>
>
> --
>
> [image: datastax_logo.png] <http://www.datastax.com/>
>
> Ryan Svihla
>
> Solution Architect
>
> [image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>
> DataStax is the fastest, most scalable distributed database technology,
> delivering Apache Cassandra to the world’s most innovative enterprises.
> Datastax is built to be agile, always-on, and predictably scalable to any
> size. With more than 500 customers in 45 countries, DataStax is the
> database technology and transactional backbone of choice for the worlds
> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>
>

Re: batch_size_warn_threshold_in_kb

Posted by Ryan Svihla <rs...@datastax.com>.
Also, what happens when you turn on shuffle with token-aware routing?
http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
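
For reference, a minimal sketch of wiring that up (assuming your driver
version exposes the two-argument TokenAwarePolicy constructor from that
javadoc; the contact point and keyspace are placeholders):

  import com.datastax.driver.core.Cluster
  import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

  // Token-aware routing with replica shuffling enabled, wrapped around the
  // DC-aware round robin child policy.
  val cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy(), true))
    .build()
  val session = cluster.connect("my_keyspace")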

On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
>
> To add to Ryan's (extremely valid!) point, your test works because the
> coordinator is always a replica.  Try again using 20 (or 50) nodes.
> Batching works great at RF=N=3 because it always gets to write to local and
> talk to exactly 2 other servers on every request.  Consider what happens
> when the coordinator needs to talk to 100 servers.  It's unnecessary
> overhead on the server side.
>
> To save network overhead, Cassandra 2.1 added support for response
> grouping (see
> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which
> massively helps performance.  It provides the benefit of batches but
> without the coordinator overhead.
>
> Can you post your benchmark code?
>
> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> There are cases where it can.  For instance, if you batch multiple
>> mutations to the same partition (and talk to a replica for that partition)
>> they can reduce network overhead because they're effectively a single
>> mutation in the eye of the cluster.  However, if you're not doing that (and
>> most people aren't!) you end up putting additional pressure on the
>> coordinator because now it has to talk to several other servers.  If you
>> have 100 servers, and perform a mutation on 100 partitions, you could have
>> a coordinator that's
>>
>> 1) talking to every machine in the cluster and
>> b) waiting on a response from a significant portion of them
>>
>> before it can respond success or fail.  Any delay, from GC to a bad disk,
>> can affect the performance of the entire batch.
>>
>>
>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <ja...@basetechnology.com>
>> wrote:
>>
>>>   Jonathan and Ryan,
>>>
>>> Jonathan says “It is absolutely not going to help you if you're trying
>>> to lump queries together to reduce network & server overhead - in fact
>>> it'll do the opposite”, but I would note that the CQL3 spec says “The
>>> BATCH statement ... serves several purposes: 1. It saves network
>>> round-trips between the client and the server (and sometimes between the
>>> server coordinator and the replicas) when batching multiple updates.” Is
>>> the spec inaccurate? I mean, it seems in conflict with your statement.
>>>
>>> See:
>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>
>>> I see the spec as gospel – if it’s not accurate, let’s propose a change
>>> to make it accurate.
>>>
>>> The DataStax CQL doc is more nuanced: “Batching multiple statements can
>>> save network exchanges between the client/server and server
>>> coordinator/replicas. However, because of the distributed nature of
>>> Cassandra, spread requests across nearby nodes as much as possible to
>>> optimize performance. Using batches to optimize performance is usually not
>>> successful, as described in Using and misusing batches section. For
>>> information about the fastest way to load data, see "Cassandra: Batch
>>> loading without the Batch keyword."”
>>>
>>> Maybe what we really need is a “client/driver-side batch”, which is
>>> simply a way to collect “batches” of operations in the client/driver and
>>> then let the driver determine what degree of batching and asynchronous
>>> operation is appropriate.
>>>
>>> It might also be nice to have an inquiry for the cluster as to what
>>> batch size is most optimal for the cluster, like number of mutations in a
>>> batch and number of simultaneous connections, and to have that be dynamic
>>> based on overall cluster load.
>>>
>>> I would also note that the example in the spec has multiple inserts with
>>> different partition key values, which flies in the face of the admonition
>>> to refrain from using server-side distribution of requests.
>>>
>>> At a minimum the CQL spec should make a more clear statement of intent
>>> and non-intent for BATCH.
>>>
>>> -- Jack Krupansky
>>>
>>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>
>>> The really important thing to really take away from Ryan's original post
>>> is that batches are not there for performance.  The only case I consider
>>> batches to be useful for is when you absolutely need to know that several
>>> tables all get a mutation (via logged batches).  The use case for this is
>>> when you've got multiple tables that are serving as different views for
>>> data.  It is absolutely not going to help you if you're trying to lump
>>> queries together to reduce network & server overhead - in fact it'll do the
>>> opposite.  If you're trying to do that, instead perform many async
>>> queries.  The overhead of batches in cassandra is significant and you're
>>> going to hit a lot of problems if you use them excessively (timeouts /
>>> failures).
>>>
>>> tl;dr: you probably don't want batch, you most likely want many async
>>> calls
>>>
>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>> mohammed@glassbeam.com> wrote:
>>>
>>>>  Ryan,
>>>>
>>>> Thanks for the quick response.
>>>>
>>>>
>>>>
>>>> I did see that jira before posting my question on this list. However, I
>>>> didn’t see any information about why 5kb+ data will cause instability. 5kb
>>>> or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
>>>> then with just 5 mutations, you will hit that threshold.
>>>>
>>>>
>>>>
>>>> In addition, Patrick is saying that he does not recommend more than 100
>>>> mutations per batch. So why not warn users just on the # of mutations in a
>>>> batch?
>>>>
>>>>
>>>>
>>>> Mohammed
>>>>
>>>>
>>>>
>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>
>>>>
>>>>
>>>> Nothing magic, just put in there based on experience. You can find the
>>>> story behind the original recommendation here
>>>>
>>>>
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>
>>>>
>>>>
>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>
>>>>
>>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>>> more than ~100 mutations per batch. Doing some quick math I came up with 5k
>>>> as 100 x 50 byte mutations.
>>>>
>>>> Totally up for debate."
>>>>
>>>>
>>>>
>>>> It's totally changeable, however, it's there in no small part because
>>>> so many people confuse the BATCH keyword as a performance optimization,
>>>> this helps flag those cases of misuse.
>>>>
>>>>
>>>>
>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>> mohammed@glassbeam.com> wrote:
>>>>
>>>> Hi –
>>>>
>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>> *
>>>>
>>>> The default size is 5kb and according to the comments in the yaml file,
>>>> it is used to log WARN on any batch size exceeding this value in kilobytes.
>>>> It says caution should be taken on increasing the size of this threshold as
>>>> it can lead to node instability.
>>>>
>>>>
>>>>
>>>> Does anybody know the significance of this magic number 5kb? Why would
>>>> a higher number (say 10kb) lead to node instability?
>>>>
>>>>
>>>>
>>>> Mohammed
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>
>>>> Ryan Svihla
>>>>
>>>> Solution Architect
>>>>
>>>>
>>>> [image: twitter.png] <https://twitter.com/foundev>[image: linkedin.png]
>>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>
>>>>
>>>>
>>>> DataStax is the fastest, most scalable distributed database technology,
>>>> delivering Apache Cassandra to the world’s most innovative enterprises.
>>>> Datastax is built to be agile, always-on, and predictably scalable to any
>>>> size. With more than 500 customers in 45 countries, DataStax is the
>>>> database technology and transactional backbone of choice for the worlds
>>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>
>>>>
>>>>
>>>

-- 

[image: datastax_logo.png] <http://www.datastax.com/>

Ryan Svihla

Solution Architect

[image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
<http://www.linkedin.com/pub/ryan-svihla/12/621/727/>

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.

Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
To add to Ryan's (extremely valid!) point, your test works because the
coordinator is always a replica.  Try again using 20 (or 50) nodes.
Batching works great at RF=N=3 because it always gets to write to local and
talk to exactly 2 other servers on every request.  Consider what happens
when the coordinator needs to talk to 100 servers.  It's unnecessary
overhead on the server side.

To save network overhead, Cassandra 2.1 added support for response grouping
(see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
which massively helps performance.  It provides the benefit of batches but
without the coordinator overhead.

Can you post your benchmark code?

On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jo...@jonhaddad.com> wrote:

> There are cases where it can.  For instance, if you batch multiple
> mutations to the same partition (and talk to a replica for that partition)
> they can reduce network overhead because they're effectively a single
> mutation in the eye of the cluster.  However, if you're not doing that (and
> most people aren't!) you end up putting additional pressure on the
> coordinator because now it has to talk to several other servers.  If you
> have 100 servers, and perform a mutation on 100 partitions, you could have
> a coordinator that's
>
> 1) talking to every machine in the cluster and
> b) waiting on a response from a significant portion of them
>
> before it can respond success or fail.  Any delay, from GC to a bad disk,
> can affect the performance of the entire batch.
>
>
> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <ja...@basetechnology.com>
> wrote:
>
>>   Jonathan and Ryan,
>>
>> Jonathan says “It is absolutely not going to help you if you're trying to
>> lump queries together to reduce network & server overhead - in fact it'll
>> do the opposite”, but I would note that the CQL3 spec says “The BATCH statement
>> ... serves several purposes: 1. It saves network round-trips between the
>> client and the server (and sometimes between the server coordinator and the
>> replicas) when batching multiple updates.” Is the spec inaccurate? I mean,
>> it seems in conflict with your statement.
>>
>> See:
>> https://cassandra.apache.org/doc/cql3/CQL.html
>>
>> I see the spec as gospel – if it’s not accurate, let’s propose a change
>> to make it accurate.
>>
>> The DataStax CQL doc is more nuanced: “Batching multiple statements can
>> save network exchanges between the client/server and server
>> coordinator/replicas. However, because of the distributed nature of
>> Cassandra, spread requests across nearby nodes as much as possible to
>> optimize performance. Using batches to optimize performance is usually not
>> successful, as described in Using and misusing batches section. For
>> information about the fastest way to load data, see "Cassandra: Batch
>> loading without the Batch keyword."”
>>
>> Maybe what we really need is a “client/driver-side batch”, which is
>> simply a way to collect “batches” of operations in the client/driver and
>> then let the driver determine what degree of batching and asynchronous
>> operation is appropriate.
>>
>> It might also be nice to have an inquiry for the cluster as to what batch
>> size is most optimal for the cluster, like number of mutations in a batch
>> and number of simultaneous connections, and to have that be dynamic based
>> on overall cluster load.
>>
>> I would also note that the example in the spec has multiple inserts with
>> different partition key values, which flies in the face of the admonition
>> to refrain from using server-side distribution of requests.
>>
>> At a minimum the CQL spec should make a more clear statement of intent
>> and non-intent for BATCH.
>>
>> -- Jack Krupansky
>>
>>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
>> *Sent:* Friday, December 12, 2014 12:58 PM
>> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>
>> The really important thing to really take away from Ryan's original post
>> is that batches are not there for performance.  The only case I consider
>> batches to be useful for is when you absolutely need to know that several
>> tables all get a mutation (via logged batches).  The use case for this is
>> when you've got multiple tables that are serving as different views for
>> data.  It is absolutely not going to help you if you're trying to lump
>> queries together to reduce network & server overhead - in fact it'll do the
>> opposite.  If you're trying to do that, instead perform many async
>> queries.  The overhead of batches in cassandra is significant and you're
>> going to hit a lot of problems if you use them excessively (timeouts /
>> failures).
>>
>> tl;dr: you probably don't want batch, you most likely want many async
>> calls
>>
>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <mo...@glassbeam.com>
>> wrote:
>>
>>>  Ryan,
>>>
>>> Thanks for the quick response.
>>>
>>>
>>>
>>> I did see that jira before posting my question on this list. However, I
>>> didn’t see any information about why 5kb+ data will cause instability. 5kb
>>> or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
>>> then with just 5 mutations, you will hit that threshold.
>>>
>>>
>>>
>>> In addition, Patrick is saying that he does not recommend more than 100
>>> mutations per batch. So why not warn users just on the # of mutations in a
>>> batch?
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>
>>>
>>>
>>> Nothing magic, just put in there based on experience. You can find the
>>> story behind the original recommendation here
>>>
>>>
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>
>>>
>>>
>>> Key reasoning for the desire comes from Patrick McFadden:
>>>
>>>
>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>> more than ~100 mutations per batch. Doing some quick math I came up with 5k
>>> as 100 x 50 byte mutations.
>>>
>>> Totally up for debate."
>>>
>>>
>>>
>>> It's totally changeable, however, it's there in no small part because so
>>> many people confuse the BATCH keyword as a performance optimization, this
>>> helps flag those cases of misuse.
>>>
>>>
>>>
>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <mo...@glassbeam.com>
>>> wrote:
>>>
>>> Hi –
>>>
>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>> *
>>>
>>> The default size is 5kb and according to the comments in the yaml file,
>>> it is used to log WARN on any batch size exceeding this value in kilobytes.
>>> It says caution should be taken on increasing the size of this threshold as
>>> it can lead to node instability.
>>>
>>>
>>>
>>> Does anybody know the significance of this magic number 5kb? Why would a
>>> higher number (say 10kb) lead to node instability?
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>>
>>> --
>>>
>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>
>>> Ryan Svihla
>>>
>>> Solution Architect
>>>
>>>
>>> [image: twitter.png] <https://twitter.com/foundev>[image: linkedin.png]
>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>
>>>
>>>
>>> DataStax is the fastest, most scalable distributed database technology,
>>> delivering Apache Cassandra to the world’s most innovative enterprises.
>>> Datastax is built to be agile, always-on, and predictably scalable to any
>>> size. With more than 500 customers in 45 countries, DataStax is the
>>> database technology and transactional backbone of choice for the worlds
>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>
>>>
>>>
>>

Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
There are cases where it can.  For instance, if you batch multiple
mutations to the same partition (and talk to a replica for that partition)
they can reduce network overhead because they're effectively a single
mutation in the eye of the cluster.  However, if you're not doing that (and
most people aren't!) you end up putting additional pressure on the
coordinator because now it has to talk to several other servers.  If you
have 100 servers, and perform a mutation on 100 partitions, you could have
a coordinator that's

1) talking to every machine in the cluster and
2) waiting on a response from a significant portion of them

before it can respond success or fail.  Any delay, from GC to a bad disk,
can affect the performance of the entire batch.
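
To make the same-partition case concrete, here's a minimal sketch (the
keyspace, table, and column names are made up purely for illustration):

  import com.datastax.driver.core.{BatchStatement, Cluster}

  // Hypothetical schema:
  //   CREATE TABLE ks.events (user_id text, ts timeuuid, payload text,
  //                           PRIMARY KEY (user_id, ts));
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("ks")
  val insert = session.prepare(
    "INSERT INTO events (user_id, ts, payload) VALUES (?, now(), ?)")

  // Every statement shares the partition key user_id, so the whole unlogged
  // batch becomes a single mutation on that partition's replicas; with a
  // token-aware policy the driver will also send it to one of those replicas.
  val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
  Seq("login", "click", "logout").foreach { payload =>
    batch.add(insert.bind("user-42", payload))
  }
  session.execute(batch)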

On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <ja...@basetechnology.com>
wrote:

>   Jonathan and Ryan,
>
> Jonathan says “It is absolutely not going to help you if you're trying to
> lump queries together to reduce network & server overhead - in fact it'll
> do the opposite”, but I would note that the CQL3 spec says “The BATCH statement
> ... serves several purposes: 1. It saves network round-trips between the
> client and the server (and sometimes between the server coordinator and the
> replicas) when batching multiple updates.” Is the spec inaccurate? I mean,
> it seems in conflict with your statement.
>
> See:
> https://cassandra.apache.org/doc/cql3/CQL.html
>
> I see the spec as gospel – if it’s not accurate, let’s propose a change to
> make it accurate.
>
> The DataStax CQL doc is more nuanced: “Batching multiple statements can
> save network exchanges between the client/server and server
> coordinator/replicas. However, because of the distributed nature of
> Cassandra, spread requests across nearby nodes as much as possible to
> optimize performance. Using batches to optimize performance is usually not
> successful, as described in Using and misusing batches section. For
> information about the fastest way to load data, see "Cassandra: Batch
> loading without the Batch keyword."”
>
> Maybe what we really need is a “client/driver-side batch”, which is simply
> a way to collect “batches” of operations in the client/driver and then let
> the driver determine what degree of batching and asynchronous operation is
> appropriate.
>
> It might also be nice to have an inquiry for the cluster as to what batch
> size is most optimal for the cluster, like number of mutations in a batch
> and number of simultaneous connections, and to have that be dynamic based
> on overall cluster load.
>
> I would also note that the example in the spec has multiple inserts with
> different partition key values, which flies in the face of the admonition
> to refrain from using server-side distribution of requests.
>
> At a minimum the CQL spec should make a more clear statement of intent and
> non-intent for BATCH.
>
> -- Jack Krupansky
>
>  *From:* Jonathan Haddad <jo...@jonhaddad.com>
> *Sent:* Friday, December 12, 2014 12:58 PM
> *To:* user@cassandra.apache.org ; Ryan Svihla <rs...@datastax.com>
> *Subject:* Re: batch_size_warn_threshold_in_kb
>
> The really important thing to really take away from Ryan's original post
> is that batches are not there for performance.  The only case I consider
> batches to be useful for is when you absolutely need to know that several
> tables all get a mutation (via logged batches).  The use case for this is
> when you've got multiple tables that are serving as different views for
> data.  It is absolutely not going to help you if you're trying to lump
> queries together to reduce network & server overhead - in fact it'll do the
> opposite.  If you're trying to do that, instead perform many async
> queries.  The overhead of batches in cassandra is significant and you're
> going to hit a lot of problems if you use them excessively (timeouts /
> failures).
>
> tl;dr: you probably don't want batch, you most likely want many async calls
>
> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <mo...@glassbeam.com>
> wrote:
>
>>  Ryan,
>>
>> Thanks for the quick response.
>>
>>
>>
>> I did see that jira before posting my question on this list. However, I
>> didn’t see any information about why 5kb+ data will cause instability. 5kb
>> or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
>> then with just 5 mutations, you will hit that threshold.
>>
>>
>>
>> In addition, Patrick is saying that he does not recommend more than 100
>> mutations per batch. So why not warn users just on the # of mutations in a
>> batch?
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>> *Sent:* Thursday, December 11, 2014 12:56 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>
>>
>>
>> Nothing magic, just put in there based on experience. You can find the
>> story behind the original recommendation here
>>
>>
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>
>>
>>
>> Key reasoning for the desire comes from Patrick McFadden:
>>
>>
>> "Yes that was in bytes. Just in my own experience, I don't recommend more
>> than ~100 mutations per batch. Doing some quick math I came up with 5k as
>> 100 x 50 byte mutations.
>>
>> Totally up for debate."
>>
>>
>>
>> It's totally changeable, however, it's there in no small part because so
>> many people confuse the BATCH keyword as a performance optimization, this
>> helps flag those cases of misuse.
>>
>>
>>
>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <mo...@glassbeam.com>
>> wrote:
>>
>> Hi –
>>
>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>> *
>>
>> The default size is 5kb and according to the comments in the yaml file,
>> it is used to log WARN on any batch size exceeding this value in kilobytes.
>> It says caution should be taken on increasing the size of this threshold as
>> it can lead to node instability.
>>
>>
>>
>> Does anybody know the significance of this magic number 5kb? Why would a
>> higher number (say 10kb) lead to node instability?
>>
>>
>>
>> Mohammed
>>
>>
>>
>>
>> --
>>
>> [image: datastax_logo.png] <http://www.datastax.com/>
>>
>> Ryan Svihla
>>
>> Solution Architect
>>
>>
>> [image: twitter.png] <https://twitter.com/foundev>[image: linkedin.png]
>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>
>>
>>
>> DataStax is the fastest, most scalable distributed database technology,
>> delivering Apache Cassandra to the world’s most innovative enterprises.
>> Datastax is built to be agile, always-on, and predictably scalable to any
>> size. With more than 500 customers in 45 countries, DataStax is the
>> database technology and transactional backbone of choice for the worlds
>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>
>>
>>
>

Re: batch_size_warn_threshold_in_kb

Posted by Jack Krupansky <ja...@basetechnology.com>.
Jonathan and Ryan,

Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement.

See:
https://cassandra.apache.org/doc/cql3/CQL.html

I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate.

The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see "Cassandra: Batch loading without the Batch keyword."”

Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate.
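
To make that concrete, here is a rough sketch of what such a client/driver-side "batch" could look like from Scala (the class and its methods are invented for illustration; this is not an existing driver API): it merely collects statements and then hands them to the driver as individual async requests, leaving routing and pacing decisions to the driver.

  import com.datastax.driver.core.{ResultSetFuture, Session, Statement}
  import scala.collection.mutable.ListBuffer

  // Invented helper, not a driver API: a "batch" that never leaves the client.
  class ClientSideBatch(session: Session) {
    private val statements = ListBuffer.empty[Statement]
    def add(st: Statement): ClientSideBatch = { statements += st; this }
    // Submit everything as independent async requests and return the futures.
    def execute(): Seq[ResultSetFuture] =
      statements.map(st => session.executeAsync(st)).toSeq
  }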

It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load.

I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.

At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH.

-- Jack Krupansky



Re: batch_size_warn_threshold_in_kb

Posted by Ryan Svihla <rs...@datastax.com>.
Also, to piggyback on what Jon is saying: what is the cost of retrying a
failed batch versus retrying a single failed record in your logical batch?
I'd suggest that in the example you provide there is a 100x difference in
the retry cost of a single record.
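
(For illustration only, a minimal sketch with the DataStax Java Driver 2.x
that ignores backoff and idempotency concerns: with individual writes you
retry only what failed, while with a batch one failure forces the whole
thing to be replayed.)

    import com.datastax.driver.core.{BatchStatement, Session, Statement}
    import scala.util.Try

    // Per-statement writes: a failure costs one retried statement.
    def writeIndividually(session: Session, stmts: Seq[Statement]): Unit = {
      val failed = stmts.filter(s => Try(session.execute(s)).isFailure)
      failed.foreach(s => session.execute(s)) // retry only what actually failed
    }

    // Batched writes: one failure forces the whole batch (e.g. 100 mutations)
    // to be re-sent, even if most of it may already have been applied.
    def writeAsBatch(session: Session, stmts: Seq[Statement]): Unit = {
      val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
      stmts.foreach(s => batch.add(s))
      if (Try(session.execute(batch)).isFailure) session.execute(batch)
    }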

Regardless, I'm glad you're testing this out, always good to push back on
theory discussions with numbers.



Re: batch_size_warn_threshold_in_kb

Posted by Ryan Svihla <rs...@datastax.com>.
Are the batches to the same partition key (which results in a single
mutation, and obviously eliminates the primary problem)? Is your client
network and/or CPU bound?

Remember, the coordinator node is _just_ doing what your client is doing
with executeAsync, only now it's also dealing with the heap pressure of
compaction and flush writers, while your client is busy writing code.

Not trying to be argumentative, but I talk to the driver writers almost
daily, and I've moved a lot of customers off batches; every single one of
them sped things up substantially. That experience, plus the theory, leads
me to believe there is a bottleneck on your client. Final point: the more
you grow your cluster, the more the cost of losing token awareness for all
writes in the batch grows.
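
As a sketch of the "same partition key" case (DataStax Java Driver 2.x from
Scala; the Event type and partitionKeyOf are stand-ins for however your data
model exposes its partition key):

    import com.datastax.driver.core.{BatchStatement, BoundStatement, Session}

    case class Event(accountId: String, bucket: Int, payload: String)

    // Assumption: the partition key is (aid, bckt), as in Eric's tables.
    def partitionKeyOf(e: Event): (String, Int) = (e.accountId, e.bucket)

    def writeGrouped(session: Session, bind: Event => BoundStatement, events: Seq[Event]): Unit = {
      // One unlogged batch per partition key: all statements in a batch target
      // a single partition, so the batch is applied as one mutation and
      // token-aware routing can still send it straight to a replica.
      events.groupBy(partitionKeyOf).values.foreach { group =>
        val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
        group.foreach(e => batch.add(bind(e)))
        session.execute(batch)
      }
    }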




Re: batch_size_warn_threshold_in_kb

Posted by Eric Stevens <mi...@gmail.com>.
Jon,

> The really important thing to really take away from Ryan's original post
is that batches are not there for performance.
> tl;dr: you probably don't want batch, you most likely want many async
calls

My own rudimentary testing does not bear this out - at least not if you
mean to say that batches don't offer a performance advantage (vs this just
being a happy side effect).  Unlogged batches provide a substantial
improvement in performance for burst writes in my findings.

My test setup:

   - Amazon i2.8xl instances in 3 AZ's using EC2Snitch
   - Cluster size of 3, RF=3
   - DataStax Java Driver, with token aware routing, using Prepared
   Statements, vs Unlogged Batches of Prepared Statements.
   - Test client on separate machine in same AZ as one of the server nodes
   - Data Size: 50,000 records
   - Test Runs: 25 (unique data generated before each run)
   - Data written to 5 tables, one table at a time (all 500k records go to
   each table)
   - Timing begins when first record is written to a table and ends when
   the last async call completes for that table.  Timing is measured
   independently for each strategy, table, and run.
   - To eliminate bias, order between tables is randomized on each run, and
   order between single vs batched execution is randomized on each run.
   - Asynchronicity is tested using three different typical Scala
   parallelism strategies (a sketch of each appears after this list).
      - "traverse" = Futures.traverse(statements).map(_.executeAsync()) -
      let the Futures system schedule the parallelism it thinks is appropriate
      - "scatter" = Futures.sequence(statements.map(_.executeAsync())) -
      Create as many async calls as possible at a time, then let the Futures
      system gather together the results
      - "parallel" = statements.par.map(_.execute()) - using a parallel
      collection to initiate as many blocking calls as possible within the
      default thread pool.
   - I kept an eye on compaction throughout, and we never went above 2
   pending compaction tasks
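
For reference, a rough sketch of those three strategies (assuming the
DataStax Java Driver 2.x from Scala 2.11; toScala is an assumed adapter from
the driver's ResultSetFuture, a Guava ListenableFuture, to a Scala Future):

    import com.datastax.driver.core.{ResultSet, ResultSetFuture, Session, Statement}
    import com.google.common.util.concurrent.{FutureCallback, Futures}
    import scala.concurrent.{Await, Future, Promise}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // Adapt the driver's Guava future to a Scala Future.
    def toScala(f: ResultSetFuture): Future[ResultSet] = {
      val p = Promise[ResultSet]()
      Futures.addCallback(f, new FutureCallback[ResultSet] {
        def onSuccess(r: ResultSet): Unit = p.success(r)
        def onFailure(t: Throwable): Unit = p.failure(t)
      })
      p.future
    }

    // "scatter": start every executeAsync up front, then gather the results.
    def scatter(session: Session, stmts: Seq[Statement]): Unit =
      Await.result(Future.sequence(stmts.map(s => toScala(session.executeAsync(s)))), 10.minutes)

    // "traverse": let Future.traverse schedule the async calls.
    def traverse(session: Session, stmts: Seq[Statement]): Unit =
      Await.result(Future.traverse(stmts)(s => toScala(session.executeAsync(s))), 10.minutes)

    // "parallel": blocking execute() calls from a parallel collection.
    def parallel(session: Session, stmts: Seq[Statement]): Unit =
      stmts.par.foreach(s => session.execute(s))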

I know this test is fairly contrived, but it's difficult to dismiss
throughput differences of this magnitude over several million data points.
Times are in nanoseconds.

==== Execution Results for 25 runs of 50000 records =============
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) as single statements using strategy scatter
Total Run Time
        test3 ((aid, bckt), end, proto) reverse order        = 51,391,100,107
        test1 ((aid, bckt), proto, end) reverse order        = 52,206,907,605
        test4 ((aid, bckt), proto, end) no explicit ordering = 53,903,886,095
        test2 ((aid, bckt), end)                             = 54,613,620,320
        test5 ((aid, bckt, end))                             = 55,820,739,557

==== Execution Results for 25 runs of 50000 records =============
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches of 100 using strategy scatter
Total Run Time
        test3 ((aid, bckt), end, proto) reverse order        = 9,199,579,182
        test4 ((aid, bckt), proto, end) no explicit ordering = 11,661,638,491
        test2 ((aid, bckt), end)                             = 12,059,853,548
        test1 ((aid, bckt), proto, end) reverse order        = 12,957,113,345
        test5 ((aid, bckt, end))                             = 31,166,071,275

==== Execution Results for 25 runs of 50000 records =============
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) as single statements using strategy traverse
Total Run Time
        test1 ((aid, bckt), proto, end) reverse order        = 52,368,815,408
        test2 ((aid, bckt), end)                             = 52,676,830,110
        test4 ((aid, bckt), proto, end) no explicit ordering = 54,096,838,258
        test5 ((aid, bckt, end))                             = 54,657,464,976
        test3 ((aid, bckt), end, proto) reverse order        = 55,668,202,827

==== Execution Results for 25 runs of 50000 records =============
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches of 100 using strategy traverse
Total Run Time
        test3 ((aid, bckt), end, proto) reverse order        = 9,633,141,094
        test4 ((aid, bckt), proto, end) no explicit ordering = 12,519,381,544
        test2 ((aid, bckt), end)                             = 12,653,843,637
        test1 ((aid, bckt), proto, end) reverse order        = 17,644,182,274
        test5 ((aid, bckt, end))                             = 27,902,501,534

==== Execution Results for 25 runs of 50000 records =============
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) as single statements using strategy parallel
Total Run Time
        test1 ((aid, bckt), proto, end) reverse order        = 360,523,086,443
        test3 ((aid, bckt), end, proto) reverse order        = 364,375,212,413
        test4 ((aid, bckt), proto, end) no explicit ordering = 370,989,615,452
        test2 ((aid, bckt), end)                             = 378,368,728,469
        test5 ((aid, bckt, end))                             = 380,737,675,612

==== Execution Results for 25 runs of 50000 records =============
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches of 100 using strategy parallel
Total Run Time
        test3 ((aid, bckt), end, proto) reverse order        = 20,971,045,814
        test1 ((aid, bckt), proto, end) reverse order        = 21,379,583,690
        test4 ((aid, bckt), proto, end) no explicit ordering = 21,505,965,087
        test2 ((aid, bckt), end)                             = 24,433,580,144
        test5 ((aid, bckt, end))                             = 37,346,062,553




Re: batch_size_warn_threshold_in_kb

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
The really important thing to take away from Ryan's original post is that
batches are not there for performance.  The only case where I consider
batches useful is when you absolutely need to know that several tables all
get a mutation (via logged batches).  The use case for this is when you've
got multiple tables that are serving as different views of the same data.
It is absolutely not going to help you if you're trying to lump queries
together to reduce network & server overhead - in fact it'll do the
opposite.  If you're trying to do that, perform many async queries instead.
The overhead of batches in Cassandra is significant, and you're going to
hit a lot of problems (timeouts / failures) if you use them excessively.

tl;dr: you probably don't want batch, you most likely want many async calls
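
To make the contrast concrete, a minimal sketch (DataStax Java Driver 2.x
from Scala; contact point, keyspace, table, and column names are made up for
illustration):

    import com.datastax.driver.core.{BatchStatement, Cluster}

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("demo_ks")
    val insert  = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")
    val rows    = (1 to 1000).map(i => (s"id-$i", s"payload-$i"))

    // Anti-pattern: lumping unrelated partitions into one batch. The coordinator
    // has to hold the whole batch and fan it out to the replicas of every
    // partition it touches.
    val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
    rows.foreach { case (id, payload) => batch.add(insert.bind(id, payload)) }
    session.execute(batch)

    // The alternative: many async calls. Each write can be routed token-aware
    // straight to a replica, and failures can be retried one at a time.
    val futures = rows.map { case (id, payload) => session.executeAsync(insert.bind(id, payload)) }
    futures.foreach(_.getUninterruptibly())

    cluster.close()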


RE: batch_size_warn_threshold_in_kb

Posted by Mohammed Guller <mo...@glassbeam.com>.
Ryan,
Thanks for the quick response.

I did see that JIRA before posting my question on this list. However, I didn’t see any information about why 5kb+ of data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?

Mohammed





Re: batch_size_warn_threshold_in_kb

Posted by Ryan Svihla <rs...@datastax.com>.
Nothing magic, just put in there based on experience. You can find the
story behind the original recommendation here

https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more
than ~100 mutations per batch. Doing some quick math I came up with 5k as
100 x 50 byte mutations.

Totally up for debate."

It's totally changeable; however, it's there in no small part because so
many people mistake the BATCH keyword for a performance optimization, and
this helps flag those cases of misuse.


