Posted to dev@cassandra.apache.org by "Saleil Bhat (BLOOMBERG/ 731 LEX)" <sb...@bloomberg.net> on 2018/12/12 22:27:09 UTC

cassandra-stress HexStrings generator

Hi, 

I have a question about the behavior of the HexStrings value generator in the cassandra-stress tool, particularly concerning its population/identity distribution.  


Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML profile, the population field in a columnspec “represents the total unique population distribution of that column across rows.”


I interpreted this to mean that if I specify some distribution 'F' for a column, then the probability of occurrence for each potential value of that column is given by 'F'. 

So, for example, if I provided the following columnspec for a text column: 
  name: fake_column
  size: fixed(32)
  population: gaussian(1..100)
and then generated a large amount of data according to this specification, 
I would expect there to be 100 distinct values for ‘fake_column’, and that a histogram of the frequency of occurrence of each value would be roughly bell-shaped. 



However, the current implementation of the HexStrings generator deviates from this expectation. In the current implementation, each CHARACTER in the string is drawn from F, rather than the string as a whole. Therefore, if you plot the histogram of frequency of occurrence for each character, you get a bell-shaped curve, but the distribution of the occurrences of whole strings (the actual columns) is something else. 
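To make the difference concrete, here is a small toy simulation in Python. It is only a model of the behavior described above, not the cassandra-stress code: the clamped normal draw standing in for gaussian(1..100), the mapping of a draw to a hex digit, and the names draw_population, per_character and per_value are all assumptions made for illustration.

```python
import random

HEX = "0123456789abcdef"
LOW, HIGH = 1, 100   # population: gaussian(1..100)
SIZE = 32            # size: fixed(32)


def draw_population(rng):
    """Toy stand-in for gaussian(1..100): a normal draw clamped to [LOW, HIGH]."""
    mean = (LOW + HIGH) / 2
    stdev = (HIGH - LOW) / 6
    return min(HIGH, max(LOW, round(rng.gauss(mean, stdev))))


def per_character(rng):
    """Observed behavior as described: one population draw per character."""
    return "".join(HEX[draw_population(rng) % 16] for _ in range(SIZE))


def per_value(rng):
    """Expected behavior: one population draw selects the whole value.

    Formatting the draw as zero-padded hex is a hypothetical mapping,
    chosen only so that each draw corresponds to exactly one string.
    """
    return format(draw_population(rng), "0{}x".format(SIZE))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10000
    print("distinct, per-character:", len({per_character(rng) for _ in range(n)}))
    print("distinct, per-value:    ", len({per_value(rng) for _ in range(n)}))
```

Under these assumptions the per-value variant can never produce more than 100 distinct strings, while the per-character variant produces far more, because each of the 32 positions is sampled independently.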


My question is, is this the desired behavior for string columns? Was my expectation/interpretation incorrect? If so, can anyone give some insight as to why strings are designed to behave this way and what the use case is for this behavior? 

Thanks, 
-Saleil 


Re: cassandra-stress HexStrings generator

Posted by Benedict Elliott Smith <be...@apache.org>.
Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s been a long time so I cannot remember much for certain).  

It should be implemented like the Strings generator.  It looks like both HexStrings and HexBytes are incorrect, and have been for a long time.
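As a rough illustration of drawing once per value, the Python sketch below takes a single seed (which would come from the population distribution) and derives every character of the string from it. This is an assumption about the general shape of such a fix, not the actual Strings generator code, and hex_string_for_seed is a hypothetical name.

```python
import random

HEX = "0123456789abcdef"


def hex_string_for_seed(seed, size):
    """Derive a whole hex string deterministically from one seed.

    The seed is assumed to be a single draw from the column's population
    distribution, so the frequency of each string follows that distribution.
    """
    rng = random.Random(seed)  # per-value PRNG, seeded by the population draw
    return "".join(rng.choice(HEX) for _ in range(size))


if __name__ == "__main__":
    # The same seed always yields the same string.
    print(hex_string_for_seed(42, 32) == hex_string_for_seed(42, 32))
```

With this shape, a population of gaussian(1..100) can yield at most 100 distinct strings, and each string occurs as often as its seed is drawn.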



