Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2015/01/12 18:23:34 UTC

[jira] [Commented] (CASSANDRA-8597) Stress: make simple things simple

    [ https://issues.apache.org/jira/browse/CASSANDRA-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273799#comment-14273799 ] 

Benedict commented on CASSANDRA-8597:
-------------------------------------

# See CASSANDRA-7980.
# This is a known problem with heavily skewed distributions, and it is challenging to resolve, largely because when deciding whether to descend from an upper tier (a tier being a clustering column prefix) we don't know how many values we will generate for any lower tier; this would be even worse with 7980. I've given this a little thought in the past, but since stress hasn't been considered a major priority I have left it on the back burner. One possibility is that, instead of generating the number of values on a per-tier basis, we _could_ generate a total number of values for all tiers, then generate a distribution for the ratio of adoption for each tier, and for each part of the tier. This is pretty difficult to conceptualise, though, and to implement. There are some other possibilities, but they don't avoid similar problems. For instance, we could visit all of the lower tiers with the defined select chance, but since the upper tier may be filtered out with a higher chance than it deserves, these rows would be visited with much lower likelihood. TL;DR: this is a complex ticket of its own, and requires a mini-research project to improve.
# I'm not sure what's meant here. It's a deterministic workload if you use -pop seq=1..N, except for thread interleavings and ancillary chances like "select". Do you mean a deterministic non-uniform distribution? Deterministic select behaviour?
# With 7980, we can simulate a workload very similar to a time-series one, by generating giant partitions with a temporal component and visiting their contents in ascending order. _Exactly_ simulating one requires some thought as to how best to define, model and deliver it, though. The TODO in generator.Dates helps, but is probably not the best avenue; permitting expressions for ranges based on the partition seed might be a better route. I have idly wondered whether, generally, we should permit some arbitrary JavaScript with a couple of predefined inputs to generate values, or the value ranges, since this would be the most elegant and general way of supporting this. Again, not trivial, though.
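
To make the "total first, then ratios" idea from point 2 concrete, here is a minimal sketch in Python. It is not how stress works today and not a proposed patch; the names split_total and allocate are hypothetical, and a fixed fan-out per tier is a simplifying assumption (in stress the fan-out would come from each column's clustering distribution).

```python
import random

def split_total(total, parts, rng):
    """Split `total` into `parts` non-negative integers that sum to `total`.

    The random integer weights stand in for the "ratio of adoption"
    of each part (an illustrative assumption, not stress behaviour).
    """
    weights = [rng.randint(1, 1000) for _ in range(parts)]
    scale = sum(weights)
    # Integer floor division keeps the sum of shares <= total exactly.
    shares = [total * w // scale for w in weights]
    # Hand out the rounding remainder one row at a time.
    for i in range(total - sum(shares)):
        shares[i % parts] += 1
    return shares

def allocate(total, fanout, rng):
    """Distribute `total` rows over nested tiers.

    fanout[i] is the (hypothetical, fixed) number of distinct values
    at tier i. Leaves of the returned nested list are row counts.
    """
    if not fanout:
        return total
    shares = split_total(total, fanout[0], rng)
    return [allocate(share, fanout[1:], rng) for share in shares]

def leaf_sum(tree):
    """Sum the row counts at the leaves of an allocation."""
    return tree if isinstance(tree, int) else sum(leaf_sum(t) for t in tree)

rng = random.Random(42)
tree = allocate(1_000_000, [10, 10], rng)
print(leaf_sum(tree))
```

Because the total is fixed before any tier is visited, the size of the partition is known up front, however skewed the ratios turn out to be. How a select chance would then be applied fairly across tiers is not addressed by this sketch.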

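As an illustration of "ranges based on the partition seed" from point 4 of the comment above, the following Python sketch derives a partition's time range purely from its seed and emits timestamps in ascending order. Everything here is hypothetical: the function name, the base date, the seed-to-offset rule and the fixed interval are assumptions chosen for the example, not anything stress provides.

```python
from datetime import datetime, timedelta, timezone

# Arbitrary base date, assumed only for this example.
BASE = datetime(2015, 1, 1, tzinfo=timezone.utc)

def series_for_partition(seed, rows, interval=timedelta(seconds=10)):
    """Yield `rows` ascending timestamps for one partition.

    The start of the range depends only on `seed`, so the same
    partition always regenerates the same series.
    """
    start = BASE + timedelta(days=seed % 365)
    for i in range(rows):
        yield start + i * interval

for ts in series_for_partition(seed=3, rows=3):
    print(ts.isoformat())
```

Since the range is a pure function of the seed, re-visiting a partition regenerates identical values, which is what a deterministic time-series workload would need.
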
> Stress: make simple things simple
> ---------------------------------
>
>                 Key: CASSANDRA-8597
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8597
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: T Jake Luciani
>             Fix For: 2.1.3
>
>
> Some of the trouble people have with stress is a documentation problem, but some is functional.
> Comments from [~iamaleksey]:
> # 3 clustering columns, make a million cells in a single partition: should be simple, but it's not. You have to tweak 'clustering' on the three columns just right to make stress work at all. With some values it just gets stuck forever computing batches
> # for others, it generates huge, megabyte-size batches, utterly disrespecting the 'select' clause in 'insert'
> # I want a sequential generator too, to be able to predict deterministic result sets. uniform() only gets you so far
> # impossible to simulate a time series workload
> /cc [~jshook] [~aweisberg] [~benedict]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)