Posted to user@cassandra.apache.org by Anurag Khandelwal <an...@berkeley.edu> on 2016/01/05 21:16:33 UTC

Cassandra Performance on a Single Machine

Hi,

I’ve been benchmarking Cassandra to get an idea of how the performance scales with more data on a single machine. I just wanted to get some feedback as to whether these are the numbers I should expect.

The benchmarks are quite simple — I measure the latency and throughput for two kinds of queries:

1. get() queries - These fetch an entire row for a given primary key.
2. search() queries - These fetch all the primary keys for rows where a particular column matches a particular value (e.g., “name” is “John Smith”). 

Indexes are constructed for all columns that are queried.
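
For reference, the index and the two query shapes look roughly like the following when issued through, e.g., the DataStax Python driver (a minimal sketch; the keyspace, table, and column names are placeholders, not the actual schema):

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('bench')  # placeholder keyspace

    # Secondary index on each queried column, e.g. "name"
    session.execute("CREATE INDEX IF NOT EXISTS records_name_idx ON records (name)")

    # get(): fetch the entire row for a given primary key
    get_stmt = session.prepare("SELECT * FROM records WHERE id = ?")
    rows = list(session.execute(get_stmt, [12345]))

    # search(): fetch the primary keys of all rows where "name" matches a value
    search_stmt = session.prepare("SELECT id FROM records WHERE name = ?")
    ids = [r.id for r in session.execute(search_stmt, ['John Smith'])]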

Dataset

The dataset used comprises records that average ~1.5KB each when represented as CSV; there are 105 attributes in each record.

Queries

For get() queries, randomly generated primary keys are used.

For search() queries, column values are selected such that their total number of occurrences in the dataset is between 1 and 4000. For example, a query for “name” = “John Smith” would only be performed if the number of rows containing that value lies between 1 and 4000.
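
One way to pick such values is to count occurrences in the CSV up front; a rough sketch of that selection step (the file name and column index are placeholders, and this is not necessarily the exact selection code used):

    import csv
    from collections import Counter

    counts = Counter()
    with open('dataset.csv', newline='') as f:
        for row in csv.reader(f):
            counts[row[3]] += 1  # occurrences of the "name" column (index is a placeholder)

    # Only values appearing between 1 and 4000 times are used as search() arguments
    eligible_values = [v for v, c in counts.items() if 1 <= c <= 4000]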

The results for the benchmarks are provided below:

Latency Measurements

The latency measurements are an average of 10000 queries.
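
Concretely, each latency figure is the average over a loop of roughly this shape (a simplified stand-in for the actual harness, reusing the prepared statements from the sketch above):

    import random
    import time

    def avg_latency_ms(session, stmt, params_list, n=10000):
        """Average latency in milliseconds over n sequential queries."""
        start = time.time()
        for _ in range(n):
            session.execute(stmt, random.choice(params_list))
        return (time.time() - start) * 1000.0 / n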





Throughput Measurements

The throughput measurements were repeated for 1-16 client threads, and the numbers reported for each input size are for the configuration (i.e., # client threads) with the highest throughput.
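
The throughput runs are essentially the same loop executed from multiple client threads sharing one driver session (the Python driver's Session object is thread-safe); a simplified sketch rather than the exact code used:

    import random
    import time
    from threading import Thread

    def throughput_qps(session, stmt, params_list, num_threads=16, duration_s=60):
        """Aggregate queries/sec across num_threads client threads."""
        counts = [0] * num_threads

        def worker(i):
            deadline = time.time() + duration_s
            while time.time() < deadline:
                session.execute(stmt, random.choice(params_list))
                counts[i] += 1

        threads = [Thread(target=worker, args=(i,)) for i in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(counts) / duration_s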





Any feedback here would be greatly appreciated!

Thanks!
Anurag

Re: Cassandra Performance on a Single Machine

Posted by Anurag Khandelwal <an...@berkeley.edu>.
Hi Jack,

> So, your 1GB input size means roughly 716 thousand rows of data and 128GB means roughly 92 million rows, correct?

Yes, that's correct.
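
(That follows directly from the record-size arithmetic: 1 GB / ~1.5 KB per row ≈ 716,000 rows, and 128 GB / ~1.5 KB per row ≈ 92 million rows, treating 1 GB as 2^30 bytes.)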

> Are your gets and searches returning single rows, or a significant number of rows?

Like I mentioned in my first email, get always returns a single row, and search returns a variable number of rows; the number of rows returned varies from 1 to 4000.

> -- Jack Krupansky
> 
>> On Thu, Jan 14, 2016 at 4:43 PM, Anurag Khandelwal <an...@berkeley.edu> wrote:
>> To clarify: Input size is the size of the dataset as a CSV file, before loading it into Cassandra; for each input size, the number of columns is fixed but the number of rows is different. By 1.5KB record, I meant that each row, when represented as a CSV entry, occupies 1500 bytes. I've used the terms "row" and "record" interchangeably, which might have been the source of some confusion.
>> 
>> I'll run the stress tool and report the results as well; the hardware is whatever AWS provides for a c3.8xlarge EC2 instance.
>> 
>> Anurag
>> 
>>> On Jan 14, 2016, at 1:33 PM, Jack Krupansky <ja...@gmail.com> wrote:
>>> 
>>> What exactly is "input size" here (1GB to 128GB)? I mean, the test spec "The dataset used comprises of ~1.5KB records...  there are 105 attributes in each record." Does each test run have exactly the same number of rows and columns and you're just making each column bigger, or what?
>>> 
>>> Cassandra doesn't have "records", so are you really saying that you show 1,500 rows? Is it one row per partition or do you have clustering?
>>> 
>>> What are you actually trying to measure? (Some more context would help.)
>>> 
>>> In any case, a latency of 200ms (5 per second) for your search query seems rather low, but we need some clarity on input size.
>>> 
>>> If you just run the cassandra stress tool on your hardware, what kinds of numbers do you get? That should be the starting point for any benchmarking - how does your hardware perform processing basic requests, before you layer your own data modeling on top of that.
>>> 
>>> -- Jack Krupansky
>>> 
>>>> On Thu, Jan 14, 2016 at 4:02 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
>>>> I think you actually get a really useful metric by benchmarking 1 machine.  You understand your cluster's theoretical maximum performance, which would be Nodes * number of queries.  Yes, adding in replication and CL is important, but 1 machine lets you isolate certain performance metrics. 
>>>> 
>>>>> On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rw...@fold3.com> wrote:
>>>>> I disagree. I think that you can extrapolate very little information about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
>>>>> 
>>>>>> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <an...@berkeley.edu> wrote:
>>>>>> 
>>>>>> Hi John,
>>>>>> 
>>>>>> Thanks for responding!
>>>>>> 
>>>>>> The aim of this benchmark was not to benchmark Cassandra as an end-to-end distributed system, but to understand a breakdown of the performance. For instance, if we understand the performance characteristics that we can expect from a single-machine Cassandra instance with RF=Consistency=1, we can have a good estimate of what the distributed performance with higher replication factors and consistency is going to look like. Even in the ideal case, the performance improvement would scale at most linearly with more machines and replicas.
>>>>>> 
>>>>>> That being said, I still want to understand whether this is the performance I should expect for the setup I described; if the performance for the current setup can be improved, then clearly the performance for a production setup (with multiple nodes, replicas) would also improve. Does that make sense?
>>>>>> 
>>>>>> Thanks!
>>>>>> Anurag
>>>>>> 
>>>>>>> On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com> wrote:
>>>>>>> 
>>>>>>> Anurag,
>>>>>>> 
>>>>>>> Unless you are planning on continuing to use only one machine with RF=1, benchmarking a single system using RF=Consistency=1 is mostly a waste of time. If you are going to use RF=1 and a single host, then why use Cassandra at all? Plain old relational DBs should do the job just fine.
>>>>>>> Cassandra is designed to be distributed. You won't get the full impact of how it scales and the limits on scaling unless you benchmark a distributed system. For example the scaling impact of secondary indexes will not be visible on a single node.
>>>>>>> 
>>>>>>> John
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I’ve been benchmarking Cassandra to get an idea of how the performance scales with more data on a single machine. I just wanted to get some feedback as to whether these are the numbers I should expect.
>>>>>>>> 
>>>>>>>> The benchmarks are quite simple — I measure the latency and throughput for two kinds of queries:
>>>>>>>> 
>>>>>>>> 1. get() queries - These fetch an entire row for a given primary key.
>>>>>>>> 2. search() queries - These fetch all the primary keys for rows where a particular column matches a particular value (e.g., “name” is “John Smith”). 
>>>>>>>> 
>>>>>>>> Indexes are constructed for all columns that are queried.
>>>>>>>> 
>>>>>>>> Dataset
>>>>>>>> 
>>>>>>>> The dataset used comprises records that average ~1.5KB each when represented as CSV; there are 105 attributes in each record.
>>>>>>>> 
>>>>>>>> Queries
>>>>>>>> 
>>>>>>>> For get() queries, randomly generated primary keys are used.
>>>>>>>> 
>>>>>>>> For search() queries, column values are selected such that their total number of occurrences in the dataset is between 1 and 4000. For example, a query for “name” = “John Smith” would only be performed if the number of rows containing that value lies between 1 and 4000.
>>>>>>>> 
>>>>>>>> The results for the benchmarks are provided below:
>>>>>>>> 
>>>>>>>> Latency Measurements
>>>>>>>> 
>>>>>>>> The latency measurements are an average of 10000 queries.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Throughput Measurements
>>>>>>>> 
>>>>>>>> The throughput measurements were repeated for 1-16 client threads, and the numbers reported for each input size are for the configuration (i.e., # client threads) with the highest throughput.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Any feedback here would be greatly appreciated!
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> Anurag
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> John H. Schulz
>>>>>>> Principal Consultant
>>>>>>> Pythian - Love your data
>>>>>>> 
>>>>>>> schulz@pythian.com |  Linkedin www.linkedin.com/pub/john-schulz/13/ab2/930/
>>>>>>> Mobile: 248-376-3380
>>>>>>> www.pythian.com
>>>>>>> 
>>>>>>> --
>>>>>>> 
> 

Re: Cassandra Performance on a Single Machine

Posted by Jack Krupansky <ja...@gmail.com>.
Thanks for that clarification.

So, your 1GB input size means roughly 716 thousand rows of data and 128GB
means roughly 92 million rows, correct?

FWIW, a best practice recommendation is that you avoid using secondary
indexes in favor of using "query tables" - store the same data in multiple
tables but with a primary key that includes the data column you wish to query
by. In general, avoid using secondary indexes with either very high or very
low cardinality.
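
To illustrate the query-table pattern for the search-by-name case (a hypothetical sketch; the table and column names are made up, and "session" here is a connected Python driver session):

    # Instead of a secondary index on records(name), keep a second table keyed by name.
    session.execute("""
        CREATE TABLE IF NOT EXISTS records_by_name (
            name text,
            id   bigint,
            PRIMARY KEY (name, id)
        )
    """)

    # Writes go to both tables (or a logged BATCH); reads by name hit the query table.
    ids = [r.id for r in session.execute(
        "SELECT id FROM records_by_name WHERE name = %s", ['John Smith'])]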

Are your gets and searches returning single rows, or a significant number
of rows?


-- Jack Krupansky

On Thu, Jan 14, 2016 at 4:43 PM, Anurag Khandelwal <an...@berkeley.edu>
wrote:

> To clarify: Input size is the size of the dataset as a CSV file, before
> loading it into Cassandra; for each input size, the number of columns is
> fixed but the number of rows is different. By 1.5KB record, I meant that
> each row, when represented as a CSV entry, occupies 1500 bytes. I've used
> the terms "row" and "record" interchangeably, which might have been the
> source of some confusion.
>
> I'll run the stress tool and report the results as well; the hardware is
> whatever AWS provides for a c3.8xlarge EC2 instance.
>
> Anurag
>
> On Jan 14, 2016, at 1:33 PM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
> What exactly is "input size" here (1GB to 128GB)? I mean, the test spec "The
> dataset used comprises of ~1.5KB records...  there are 105 attributes in
> each record." Does each test run have exactly the same number of rows and
> columns and you're just making each column bigger, or what?
>
> Cassandra doesn't have "records", so are you really saying that you show
> 1,500 rows? Is it one row per partition or do you have clustering?
>
> What are you actually trying to measure? (Some more context would help.)
>
> In any case, a latency of 200ms (5 per second) for your search query seems
> rather low, but we need some clarity on input size.
>
> If you just run the cassandra stress tool on your hardware, what kinds of
> numbers do you get? That should be the starting point for any benchmarking
> - how does your hardware perform processing basic requests, before you
> layer your own data modeling on top of that.
>
> -- Jack Krupansky
>
> On Thu, Jan 14, 2016 at 4:02 PM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> I think you actually get a really useful metric by benchmarking 1
>> machine.  You understand your cluster's theoretical maximum performance,
>> which would be Nodes * number of queries.  Yes, adding in replication and
>> CL is important, but 1 machine lets you isolate certain performance
>> metrics.
>>
>> On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rw...@fold3.com> wrote:
>>
>>> I disagree. I think that you can extrapolate very little information
>>> about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
>>>
>>> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <an...@berkeley.edu>
>>> wrote:
>>>
>>> Hi John,
>>>
>>> Thanks for responding!
>>>
>>> The aim of this benchmark was not to benchmark Cassandra as an
>>> end-to-end distributed system, but to understand a breakdown of the
>>> performance. For instance, if we understand the performance characteristics
>>> that we can expect from a single-machine Cassandra instance with
>>> RF=Consistency=1, we can have a good estimate of what the distributed
>>> performance with higher replication factors and consistency is going to
>>> look like. Even in the ideal case, the performance improvement would scale
>>> at most linearly with more machines and replicas.
>>>
>>> That being said, I still want to understand whether this is the
>>> performance I should expect for the setup I described; if the performance
>>> for the current setup can be improved, then clearly the performance for a
>>> production setup (with multiple nodes, replicas) would also improve. Does
>>> that make sense?
>>>
>>> Thanks!
>>> Anurag
>>>
>>> On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com> wrote:
>>>
>>> Anurag,
>>>
>>> Unless you are planning on continuing to use only one machine with RF=1,
>>> benchmarking a single system using RF=Consistency=1 is mostly a waste of
>>> time. If you are going to use RF=1 and a single host, then why use Cassandra
>>> at all? Plain old relational DBs should do the job just fine.
>>>
>>> Cassandra is designed to be distributed. You won't get the full impact
>>> of how it scales and the limits on scaling unless you benchmark a
>>> distributed system. For example the scaling impact of secondary indexes
>>> will not be visible on a single node.
>>>
>>> John
>>>
>>>
>>>
>>>
>>> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I’ve been benchmarking Cassandra to get an idea of how the performance
>>>> scales with more data on a single machine. I just wanted to get some
>>> feedback as to whether these are the numbers I should expect.
>>>>
>>>> The benchmarks are quite simple — I measure the latency and throughput
>>>> for two kinds of queries:
>>>>
>>>> 1. get() queries - These fetch an entire row for a given primary key.
>>>> 2. search() queries - These fetch all the primary keys for rows where a
>>>> particular column matches a particular value (e.g., “name” is “John
>>>> Smith”).
>>>>
>>>> Indexes are constructed for all columns that are queried.
>>>>
>>>> *Dataset*
>>>>
>>>> The dataset used comprises records that average ~1.5KB each when
>>>> represented as CSV; there are 105 attributes in each record.
>>>>
>>>> *Queries*
>>>>
>>>> For get() queries, randomly generated primary keys are used.
>>>>
>>>> For search() queries, column values are selected such that their total
>>>> number of occurrences in the dataset is between 1 and 4000. For example, a
>>>> query for “name” = “John Smith” would only be performed if the number of
>>>> rows containing that value lies between 1 and 4000.
>>>>
>>>> The results for the benchmarks are provided below:
>>>>
>>>> *Latency Measurements*
>>>>
>>>> The latency measurements are an average of 10000 queries.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Throughput Measurements*
>>>>
>>>> The throughput measurements were repeated for 1-16 client threads, and
>>>> the numbers reported for each input size are for the configuration (i.e., #
>>>> client threads) with the highest throughput.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Any feedback here would be greatly appreciated!
>>>>
>>>> Thanks!
>>>> Anurag
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> John H. Schulz
>>>
>>> Principal Consultant
>>>
>>> Pythian - Love your data
>>>
>>>
>>> schulz@pythian.com |  Linkedin
>>> www.linkedin.com/pub/john-schulz/13/ab2/930/
>>>
>>> Mobile: 248-376-3380
>>>
>>> *www.pythian.com <http://www.pythian.com/>*
>>>
>>> --
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>

Re: Cassandra Performance on a Single Machine

Posted by Anurag Khandelwal <an...@berkeley.edu>.
To clarify: Input size is the size of the dataset as a CSV file, before loading it into Cassandra; for each input size, the number of columns is fixed but the number of rows is different. By 1.5KB record, I meant that each row, when represented as a CSV entry, occupies 1500 bytes. I've used the terms "row" and "record" interchangeably, which might have been the source of some confusion.

I'll run the stress tool and report the results as well; the hardware is whatever AWS provides for a c3.8xlarge EC2 instance.

Anurag

> On Jan 14, 2016, at 1:33 PM, Jack Krupansky <ja...@gmail.com> wrote:
> 
> What exactly is "input size" here (1GB to 128GB)? I mean, the test spec "The dataset used comprises of ~1.5KB records...  there are 105 attributes in each record." Does each test run have exactly the same number of rows and columns and you're just making each column bigger, or what?
> 
> Cassandra doesn't have "records", so are you really saying that you show 1,500 rows? Is it one row per partition or do you have clustering?
> 
> What are you actually trying to measure? (Some more context would help.)
> 
> In any case, a latency of 200ms (5 per second) for your search query seems rather low, but we need some clarity on input size.
> 
> If you just run the cassandra stress tool on your hardware, what kinds of numbers do you get? That should be the starting point for any benchmarking - how does your hardware perform processing basic requests, before you layer your own data modeling on top of that.
> 
> -- Jack Krupansky
> 
>> On Thu, Jan 14, 2016 at 4:02 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
>> I think you actually get a really useful metric by benchmarking 1 machine.  You understand your cluster's theoretical maximum performance, which would be Nodes * number of queries.  Yes, adding in replication and CL is important, but 1 machine lets you isolate certain performance metrics. 
>> 
>>> On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rw...@fold3.com> wrote:
>>> I disagree. I think that you can extrapolate very little information about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
>>> 
>>>> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <an...@berkeley.edu> wrote:
>>>> 
>>>> Hi John,
>>>> 
>>>> Thanks for responding!
>>>> 
>>>> The aim of this benchmark was not to benchmark Cassandra as an end-to-end distributed system, but to understand a breakdown of the performance. For instance, if we understand the performance characteristics that we can expect from a single-machine Cassandra instance with RF=Consistency=1, we can have a good estimate of what the distributed performance with higher replication factors and consistency is going to look like. Even in the ideal case, the performance improvement would scale at most linearly with more machines and replicas.
>>>> 
>>>> That being said, I still want to understand whether this is the performance I should expect for the setup I described; if the performance for the current setup can be improved, then clearly the performance for a production setup (with multiple nodes, replicas) would also improve. Does that make sense?
>>>> 
>>>> Thanks!
>>>> Anurag
>>>> 
>>>>> On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com> wrote:
>>>>> 
>>>>> Anurag,
>>>>> 
>>>>> Unless you are planning on continuing to use only one machine with RF=1, benchmarking a single system using RF=Consistency=1 is mostly a waste of time. If you are going to use RF=1 and a single host, then why use Cassandra at all? Plain old relational DBs should do the job just fine.
>>>>> Cassandra is designed to be distributed. You won't get the full impact of how it scales and the limits on scaling unless you benchmark a distributed system. For example the scaling impact of secondary indexes will not be visible on a single node.
>>>>> 
>>>>> John
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I’ve been benchmarking Cassandra to get an idea of how the performance scales with more data on a single machine. I just wanted to get some feedback as to whether these are the numbers I should expect.
>>>>>> 
>>>>>> The benchmarks are quite simple — I measure the latency and throughput for two kinds of queries:
>>>>>> 
>>>>>> 1. get() queries - These fetch an entire row for a given primary key.
>>>>>> 2. search() queries - These fetch all the primary keys for rows where a particular column matches a particular value (e.g., “name” is “John Smith”). 
>>>>>> 
>>>>>> Indexes are constructed for all columns that are queried.
>>>>>> 
>>>>>> Dataset
>>>>>> 
>>>>>> The dataset used comprises records that average ~1.5KB each when represented as CSV; there are 105 attributes in each record.
>>>>>> 
>>>>>> Queries
>>>>>> 
>>>>>> For get() queries, randomly generated primary keys are used.
>>>>>> 
>>>>>> For search() queries, column values are selected such that their total number of occurrences in the dataset is between 1 and 4000. For example, a query for “name” = “John Smith” would only be performed if the number of rows containing that value lies between 1 and 4000.
>>>>>> 
>>>>>> The results for the benchmarks are provided below:
>>>>>> 
>>>>>> Latency Measurements
>>>>>> 
>>>>>> The latency measurements are an average of 10000 queries.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Throughput Measurements
>>>>>> 
>>>>>> The throughput measurements were repeated for 1-16 client threads, and the numbers reported for each input size are for the configuration (i.e., # client threads) with the highest throughput.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Any feedback here would be greatly appreciated!
>>>>>> 
>>>>>> Thanks!
>>>>>> Anurag
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> John H. Schulz
>>>>> Principal Consultant
>>>>> Pythian - Love your data
>>>>> 
>>>>> schulz@pythian.com |  Linkedin www.linkedin.com/pub/john-schulz/13/ab2/930/
>>>>> Mobile: 248-376-3380
>>>>> www.pythian.com
>>>>> 
>>>>> --
>>>>> 
> 

Re: Cassandra Performance on a Single Machine

Posted by Jack Krupansky <ja...@gmail.com>.
What exactly is "input size" here (1GB to 128GB)? I mean, the test spec "The
dataset used comprises of ~1.5KB records...  there are 105 attributes in
each record." Does each test run have exactly the same number of rows and
columns and you're just making each column bigger, or what?

Cassandra doesn't have "records", so are you really saying that you show
1,500 rows? Is it one row per partition or do you have clustering?

What are you actually trying to measure? (Some more context would help.)

In any case, a latency of 200ms (5 per second) for your search query seems
rather low, but we need some clarity on input size.

If you just run the cassandra stress tool on your hardware, what kinds of
numbers do you get? That should be the starting point for any benchmarking
- how does your hardware perform processing basic requests, before you
layer your own data modeling on top of that.
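
For a first pass, something along these lines would do, though the exact invocation depends on the Cassandra version and where the tool lives in your install:

    cassandra-stress write n=1000000 -rate threads=16
    cassandra-stress read n=1000000 -rate threads=16

That gives a baseline for plain reads and writes on the box before any custom schema or secondary indexes are layered on top.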

-- Jack Krupansky

On Thu, Jan 14, 2016 at 4:02 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> I think you actually get a really useful metric by benchmarking 1
> machine.  You understand your cluster's theoretical maximum performance,
> which would be Nodes * number of queries.  Yes, adding in replication and
> CL is important, but 1 machine lets you isolate certain performance
> metrics.
>
> On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rw...@fold3.com> wrote:
>
>> I disagree. I think that you can extrapolate very little information
>> about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
>>
>> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <an...@berkeley.edu>
>> wrote:
>>
>> Hi John,
>>
>> Thanks for responding!
>>
>> The aim of this benchmark was not to benchmark Cassandra as an end-to-end
>> distributed system, but to understand a breakdown of the performance. For
>> instance, if we understand the performance characteristics that we can
>> expect from a single-machine Cassandra instance with RF=Consistency=1, we
>> can have a good estimate of what the distributed performance with higher
>> replication factors and consistency is going to look like. Even in the
>> ideal case, the performance improvement would scale at most linearly with
>> more machines and replicas.
>>
>> That being said, I still want to understand whether this is the
>> performance I should expect for the setup I described; if the performance
>> for the current setup can be improved, then clearly the performance for a
>> production setup (with multiple nodes, replicas) would also improve. Does
>> that make sense?
>>
>> Thanks!
>> Anurag
>>
>> On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com> wrote:
>>
>> Anurag,
>>
>> Unless you are planning on continuing to use only one machine with RF=1,
>> benchmarking a single system using RF=Consistency=1 is mostly a waste of
>> time. If you are going to use RF=1 and a single host, then why use Cassandra
>> at all? Plain old relational DBs should do the job just fine.
>>
>> Cassandra is designed to be distributed. You won't get the full impact of
>> how it scales and the limits on scaling unless you benchmark a distributed
>> system. For example the scaling impact of secondary indexes will not be
>> visible on a single node.
>>
>> John
>>
>>
>>
>>
>> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> I’ve been benchmarking Cassandra to get an idea of how the performance
>>> scales with more data on a single machine. I just wanted to get some
>>> feedback as to whether these are the numbers I should expect.
>>>
>>> The benchmarks are quite simple — I measure the latency and throughput
>>> for two kinds of queries:
>>>
>>> 1. get() queries - These fetch an entire row for a given primary key.
>>> 2. search() queries - These fetch all the primary keys for rows where a
>>> particular column matches a particular value (e.g., “name” is “John
>>> Smith”).
>>>
>>> Indexes are constructed for all columns that are queried.
>>>
>>> *Dataset*
>>>
>>> The dataset used comprises records that average ~1.5KB each when
>>> represented as CSV; there are 105 attributes in each record.
>>>
>>> *Queries*
>>>
>>> For get() queries, randomly generated primary keys are used.
>>>
>>> For search() queries, column values are selected such that their total
>>> number of occurrences in the dataset is between 1 and 4000. For example, a
>>> query for “name” = “John Smith” would only be performed if the number of
>>> rows containing that value lies between 1 and 4000.
>>>
>>> The results for the benchmarks are provided below:
>>>
>>> *Latency Measurements*
>>>
>>> The latency measurements are an average of 10000 queries.
>>>
>>>
>>>
>>>
>>>
>>> *Throughput Measurements*
>>>
>>> The throughput measurements were repeated for 1-16 client threads, and
>>> the numbers reported for each input size are for the configuration (i.e., #
>>> client threads) with the highest throughput.
>>>
>>>
>>>
>>>
>>>
>>> Any feedback here would be greatly appreciated!
>>>
>>> Thanks!
>>> Anurag
>>>
>>>
>>
>>
>> --
>>
>> John H. Schulz
>>
>> Principal Consultant
>>
>> Pythian - Love your data
>>
>>
>> schulz@pythian.com |  Linkedin
>> www.linkedin.com/pub/john-schulz/13/ab2/930/
>>
>> Mobile: 248-376-3380
>>
>> *www.pythian.com <http://www.pythian.com/>*
>>
>> --
>>
>>
>>
>>
>>
>>
>>

Re: Cassandra Performance on a Single Machine

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
I think you actually get a really useful metric by benchmarking 1 machine.
You understand your cluster's theoretical maximum performance, which would
be Nodes * number of queries.  Yes, adding in replication and CL is
important, but 1 machine lets you isolate certain performance metrics.
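
To make that concrete: if a single node sustains T queries/sec on this workload with RF=1, an N-node cluster can serve at most roughly N * T queries/sec; replication and stricter consistency levels add per-query work, so real numbers will land below that ceiling.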

On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rw...@fold3.com> wrote:

> I disagree. I think that you can extrapolate very little information about
> RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
>
> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <an...@berkeley.edu>
> wrote:
>
> Hi John,
>
> Thanks for responding!
>
> The aim of this benchmark was not to benchmark Cassandra as an end-to-end
> distributed system, but to understand a breakdown of the performance. For
> instance, if we understand the performance characteristics that we can
> expect from a single-machine Cassandra instance with RF=Consistency=1, we
> can have a good estimate of what the distributed performance with higher
> replication factors and consistency is going to look like. Even in the
> ideal case, the performance improvement would scale at most linearly with
> more machines and replicas.
>
> That being said, I still want to understand whether this is the
> performance I should expect for the setup I described; if the performance
> for the current setup can be improved, then clearly the performance for a
> production setup (with multiple nodes, replicas) would also improve. Does
> that make sense?
>
> Thanks!
> Anurag
>
> On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com> wrote:
>
> Anurag,
>
> Unless you are planning on continuing to use only one machine with RF=1,
> benchmarking a single system using RF=Consistency=1 is mostly a waste of
> time. If you are going to use RF=1 and a single host, then why use Cassandra
> at all? Plain old relational DBs should do the job just fine.
>
> Cassandra is designed to be distributed. You won't get the full impact of
> how it scales and the limits on scaling unless you benchmark a distributed
> system. For example the scaling impact of secondary indexes will not be
> visible on a single node.
>
> John
>
>
>
>
> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu>
> wrote:
>
>> Hi,
>>
>> I’ve been benchmarking Cassandra to get an idea of how the performance
>> scales with more data on a single machine. I just wanted to get some
>> feedback as to whether these are the numbers I should expect.
>>
>> The benchmarks are quite simple — I measure the latency and throughput
>> for two kinds of queries:
>>
>> 1. get() queries - These fetch an entire row for a given primary key.
>> 2. search() queries - These fetch all the primary keys for rows where a
>> particular column matches a particular value (e.g., “name” is “John
>> Smith”).
>>
>> Indexes are constructed for all columns that are queried.
>>
>> *Dataset*
>>
>> The dataset used comprises records that average ~1.5KB each when
>> represented as CSV; there are 105 attributes in each record.
>>
>> *Queries*
>>
>> For get() queries, randomly generated primary keys are used.
>>
>> For search() queries, column values are selected such that their total
>> number of occurrences in the dataset is between 1 and 4000. For example, a
>> query for “name” = “John Smith” would only be performed if the number of
>> rows containing that value lies between 1 and 4000.
>>
>> The results for the benchmarks are provided below:
>>
>> *Latency Measurements*
>>
>> The latency measurements are an average of 10000 queries.
>>
>>
>>
>>
>>
>> *Throughput Measurements*
>>
>> The throughput measurements were repeated for 1-16 client threads, and
>> the numbers reported for each input size are for the configuration (i.e., #
>> client threads) with the highest throughput.
>>
>>
>>
>>
>>
>> Any feedback here would be greatly appreciated!
>>
>> Thanks!
>> Anurag
>>
>>
>
>
> --
>
> John H. Schulz
>
> Principal Consultant
>
> Pythian - Love your data
>
>
> schulz@pythian.com |  Linkedin
> www.linkedin.com/pub/john-schulz/13/ab2/930/
>
> Mobile: 248-376-3380
>
> *www.pythian.com <http://www.pythian.com/>*
>
> --
>
>
>
>
>
>
>

Re: Cassandra Performance on a Single Machine

Posted by Robert Wille <rw...@fold3.com>.
I disagree. I think that you can extrapolate very little information about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.

On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <an...@berkeley.edu>> wrote:

Hi John,

Thanks for responding!

The aim of this benchmark was not to benchmark Cassandra as an end-to-end distributed system, but to understand a breakdown of the performance. For instance, if we understand the performance characteristics that we can expect from a single-machine Cassandra instance with RF=Consistency=1, we can have a good estimate of what the distributed performance with higher replication factors and consistency is going to look like. Even in the ideal case, the performance improvement would scale at most linearly with more machines and replicas.

That being said, I still want to understand whether this is the performance I should expect for the setup I described; if the performance for the current setup can be improved, then clearly the performance for a production setup (with multiple nodes, replicas) would also improve. Does that make sense?

Thanks!
Anurag

On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com>> wrote:

Anurag,

Unless you are planning on continuing to use only one machine with RF=1, benchmarking a single system using RF=Consistency=1 is mostly a waste of time. If you are going to use RF=1 and a single host, then why use Cassandra at all? Plain old relational DBs should do the job just fine.

Cassandra is designed to be distributed. You won't get the full impact of how it scales and the limits on scaling unless you benchmark a distributed system. For example the scaling impact of secondary indexes will not be visible on a single node.

John



On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu>> wrote:
Hi,

I’ve been benchmarking Cassandra to get an idea of how the performance scales with more data on a single machine. I just wanted to get some feedback as to whether these are the numbers I should expect.

The benchmarks are quite simple — I measure the latency and throughput for two kinds of queries:

1. get() queries - These fetch an entire row for a given primary key.
2. search() queries - These fetch all the primary keys for rows where a particular column matches a particular value (e.g., “name” is “John Smith”).

Indexes are constructed for all columns that are queried.

Dataset

The dataset used comprises records that average ~1.5KB each when represented as CSV; there are 105 attributes in each record.

Queries

For get() queries, randomly generated primary keys are used.

For search() queries, column values are selected such that their total number of occurrences in the dataset is between 1 and 4000. For example, a query for “name” = “John Smith” would only be performed if the number of rows containing that value lies between 1 and 4000.

The results for the benchmarks are provided below:

Latency Measurements

The latency measurements are an average of 10000 queries.





Throughput Measurements

The throughput measurements were repeated for 1-16 client threads, and the numbers reported for each input size are for the configuration (i.e., # client threads) with the highest throughput.





Any feedback here would be greatly appreciated!

Thanks!
Anurag




--

John H. Schulz

Principal Consultant

Pythian - Love your data


schulz@pythian.com<ma...@pythian.com> |  Linkedin www.linkedin.com/pub/john-schulz/13/ab2/930/<http://www.linkedin.com/pub/john-schulz/13/ab2/930/>

Mobile: 248-376-3380

www.pythian.com<http://www.pythian.com/>


--






Re: Cassandra Performance on a Single Machine

Posted by Anurag Khandelwal <an...@berkeley.edu>.
Hi John,

Thanks for responding!

The aim of this benchmark was not to benchmark Cassandra as an end-to-end distributed system, but to understand a breakdown of the performance. For instance, if we understand the performance characteristics that we can expect from a single-machine Cassandra instance with RF=Consistency=1, we can have a good estimate of what the distributed performance with higher replication factors and consistency is going to look like. Even in the ideal case, the performance improvement would scale at most linearly with more machines and replicas.

That being said, I still want to understand whether this is the performance I should expect for the setup I described; if the performance for the current setup can be improved, then clearly the performance for a production setup (with multiple nodes, replicas) would also improve. Does that make sense?

Thanks!
Anurag

> On Jan 6, 2016, at 9:31 AM, John Schulz <sc...@pythian.com> wrote:
> 
> Anurag,
> 
> Unless you are planning on continuing to use only one machine with RF=1, benchmarking a single system using RF=Consistency=1 is mostly a waste of time. If you are going to use RF=1 and a single host, then why use Cassandra at all? Plain old relational DBs should do the job just fine.
> 
> Cassandra is designed to be distributed. You won't get the full impact of how it scales and the limits on scaling unless you benchmark a distributed system. For example the scaling impact of secondary indexes will not be visible on a single node.
> 
> John
> 
> 
> 
> 
> 
> 
> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <anuragk@berkeley.edu <ma...@berkeley.edu>> wrote:
> Hi,
> 
> I’ve been benchmarking Cassandra to get an idea of how the performance scales with more data on a single machine. I just wanted to get some feedback as to whether these are the numbers I should expect.
> 
> The benchmarks are quite simple — I measure the latency and throughput for two kinds of queries:
> 
> 1. get() queries - These fetch an entire row for a given primary key.
> 2. search() queries - These fetch all the primary keys for rows where a particular column matches a particular value (e.g., “name” is “John Smith”). 
> 
> Indexes are constructed for all columns that are queried.
> 
> Dataset
> 
> The dataset used comprises records that average ~1.5KB each when represented as CSV; there are 105 attributes in each record.
> 
> Queries
> 
> For get() queries, randomly generated primary keys are used.
> 
> For search() queries, column values are selected such that their total number of occurrences in the dataset is between 1 and 4000. For example, a query for “name” = “John Smith” would only be performed if the number of rows containing that value lies between 1 and 4000.
> 
> The results for the benchmarks are provided below:
> 
> Latency Measurements
> 
> The latency measurements are an average of 10000 queries.
> 
> 
> 
> 
> 
> Throughput Measurements
> 
> The throughput measurements were repeated for 1-16 client threads, and the numbers reported for each input size are for the configuration (i.e., # client threads) with the highest throughput.
> 
> 
> 
> 
> 
> Any feedback here would be greatly appreciated!
> 
> Thanks!
> Anurag
> 
> 
> 
> 
> -- 
> John H. Schulz
> Principal Consultant
> Pythian - Love your data
> 
> schulz@pythian.com <ma...@pythian.com> |  Linkedin www.linkedin.com/pub/john-schulz/13/ab2/930/ <http://www.linkedin.com/pub/john-schulz/13/ab2/930/>
> Mobile: 248-376-3380
> www.pythian.com <http://www.pythian.com/>
> --
> 
> 
> 
> 
> 


Re: Cassandra Performance on a Single Machine

Posted by John Schulz <sc...@pythian.com>.
Anurag,

Unless you are planning on continuing to use only one machine with RF=1,
benchmarking a single system using RF=Consistency=1 is mostly a waste of
time. If you are going to use RF=1 and a single host, then why use Cassandra
at all? Plain old relational DBs should do the job just fine.

Cassandra is designed to be distributed. You won't get the full impact of
how it scales and the limits on scaling unless you benchmark a distributed
system. For example the scaling impact of secondary indexes will not be
visible on a single node.

John




On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <an...@berkeley.edu>
wrote:

> Hi,
>
> I’ve been benchmarking Cassandra to get an idea of how the performance
> scales with more data on a single machine. I just wanted to get some
> feedback as to whether these are the numbers I should expect.
>
> The benchmarks are quite simple — I measure the latency and throughput for
> two kinds of queries:
>
> 1. get() queries - These fetch an entire row for a given primary key.
> 2. search() queries - These fetch all the primary keys for rows where a
> particular column matches a particular value (e.g., “name” is “John
> Smith”).
>
> Indexes are constructed for all columns that are queried.
>
> *Dataset*
>
> The dataset used comprises records that average ~1.5KB each when
> represented as CSV; there are 105 attributes in each record.
>
> *Queries*
>
> For get() queries, randomly generated primary keys are used.
>
> For search() queries, column values are selected such that their total
> number of occurrences in the dataset is between 1 and 4000. For example, a
> query for “name” = “John Smith” would only be performed if the number of
> rows containing that value lies between 1 and 4000.
>
> The results for the benchmarks are provided below:
>
> *Latency Measurements*
>
> The latency measurements are an average of 10000 queries.
>
>
>
>
>
> *Throughput Measurements*
>
> The throughput measurements were repeated for 1-16 client threads, and the
> numbers reported for each input size are for the configuration (i.e., #
> client threads) with the highest throughput.
>
>
>
>
>
> Any feedback here would be greatly appreciated!
>
> Thanks!
> Anurag
>
>


-- 

John H. Schulz

Principal Consultant

Pythian - Love your data


schulz@pythian.com |  Linkedin www.linkedin.com/pub/john-schulz/13/ab2/930/

Mobile: 248-376-3380

*www.pythian.com <http://www.pythian.com/>*

-- 


--