Posted to user@cassandra.apache.org by Gustavo Gustavo <do...@gmail.com> on 2012/01/22 14:10:35 UTC

Cassandra x MySQL Sharded - Insert Comparison

Hello,

I've set up a testing environment for Cassandra and MySQL to compare the two
on *performance only*. I must admit that I was expecting Cassandra to beat
MySQL, but I haven't seen that happen so far.
My application/use case is INSERT intensive: I'm not updating anything, just
inserting all the time.
To compare them I created virtual machines running Ubuntu 11.10 and installed
the latest version of each datastore. Each VM has 1GB of RAM. I've used
VMs as a way to give both datastores an equal sandbox.
MySQL is set up as a sharded pair of databases, which means that each record
is inserted into a specific instance based on key % 2. The engine is
MyISAM (InnoDB was really slow and not really needed for my case). The test
table has a compound primary key (integer and datetime columns).
Let's name the "nodes" MySQL1 and MySQL2.
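
A minimal sketch of the key % 2 routing described above, in Python. It
assumes the sharding key is the integer VehicleID and that remainder 0 maps
to MySQL1; neither detail is stated in the post, and shard_for and route are
hypothetical names. The last lines count how many keys would change shard
if a third instance were added under the same modulo rule.

```python
from collections import Counter

# Assumption: remainder 0 -> MySQL1, remainder 1 -> MySQL2.
SHARDS = ["MySQL1", "MySQL2"]


def shard_for(vehicle_id, shard_count):
    """Index of the shard that owns this key under plain modulo routing."""
    return vehicle_id % shard_count


def route(vehicle_id):
    """Name of the MySQL instance that receives inserts for this vehicle."""
    return SHARDS[shard_for(vehicle_id, len(SHARDS))]


# 1000 consecutive vehicle IDs split evenly across the two instances.
load = Counter(route(v) for v in range(1000))
print(dict(load))

# Growing from 2 to 3 shards with the same rule relocates most keys.
keys = range(6000)
moved = sum(1 for v in keys if shard_for(v, 2) != shard_for(v, 3))
print(f"{moved} of {len(keys)} keys change shard when going from 2 to 3")
```

With consecutive integer IDs the split is even, but changing the shard count
remaps most keys, so resharding means moving data by hand.
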
Cassandra is set up with 4 nodes, with tokens assigned so that records are
distributed evenly across the 4 nodes (nodetool ring reports 25% for each
node), replication factor 1 and RandomPartitioner; the other settings are
left at their defaults. Let's name the nodes Cassandra1, Cassandra2,
Cassandra3 and Cassandra4.
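
For reference, evenly spaced tokens for RandomPartitioner can be computed as
in the sketch below. It assumes the usual convention for clusters of this
era: a token ring of size 2**127 and one manually assigned token per node.
The function names are illustrative, and the exact values used in this test
cluster are not given in the post.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space


def balanced_tokens(node_count):
    """Evenly spaced tokens, one per node, starting at 0."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def ownership(tokens):
    """Fraction of the ring owned by each node.

    A node owns the range from the previous node's token (exclusive)
    up to its own token (inclusive), wrapping around the ring.
    """
    return [
        ((tokens[i] - tokens[i - 1]) % RING_SIZE) / RING_SIZE
        for i in range(len(tokens))
    ]


tokens = balanced_tokens(4)
for name, token, share in zip(
    ["Cassandra1", "Cassandra2", "Cassandra3", "Cassandra4"],
    tokens,
    ownership(tokens),
):
    print(f"{name}: token={token} owns={share:.0%}")
```

Each node owning a quarter of the ring matches the 25% per node reported by
nodetool ring.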

I'm using 2 physical machines (Windows7) to host the 4 (Cassandra) or 2
(MySQL) virtual machines, this way:
Machine1: MySQL1, Cassandra1, Cassandra3
Machine2: MySQL2, Cassandra2, Cassandra4
Each machine has enough CPU and RAM to host either the Cassandra cluster or
the MySQL "cluster", one at a time.

The client test application runs on a third physical machine, with 8
threads doing inserts. The test application is written in C# (Windows7)
using the Aquiles high-level client.

My use case is a vehicle tracking system. So, let's suppose that every
minute the vehicle sends its position together with some other GPS data
and vehicle status information. Each column in my Cassandra cluster is
named by the DateTime (long value) of a position for a specific vehicle, and
the value is all the other data serialized to a binary format. Therefore, my
CF really grows in number of columns. All data is inserted into a single
CF/table named Positions. The key in Cassandra is the VehicleID, and in
MySQL it is VehicleID + PositionDateTime (MySQL creates an index on this
automatically). It's important to note that MySQL threw tons of connection
exceptions; each failed insert was retried until it went through.
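
The two layouts described above can be sketched with plain Python
dictionaries standing in for the datastores. The payload fields (latitude,
longitude, speed, status) and the millisecond encoding of the column name
are assumptions for illustration; the post only says that the column name is
a long DateTime and the value is a binary blob.

```python
import struct
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical payload: latitude, longitude (doubles), speed (float), status (byte).
PAYLOAD = struct.Struct(">ddfB")

# Cassandra-style wide row: row key -> {column name -> binary value}.
positions_cf = defaultdict(dict)

# MySQL-style table: compound primary key (VehicleID, PositionDateTime) -> row.
positions_table = {}


def column_name(ts):
    """Position time as a long: milliseconds since the Unix epoch (assumed)."""
    return int(ts.timestamp() * 1000)


def insert_position(vehicle_id, ts, lat, lon, speed, status):
    blob = PAYLOAD.pack(lat, lon, speed, status)
    # One more column in the vehicle's row.
    positions_cf[vehicle_id][column_name(ts)] = blob
    # One more row in the table.
    positions_table[(vehicle_id, ts)] = blob


t1 = datetime(2012, 1, 21, 11, 45, tzinfo=timezone.utc)
t2 = datetime(2012, 1, 21, 11, 46, tzinfo=timezone.utc)
insert_position(42, t1, -23.5, -46.6, 12.5, 1)
insert_position(42, t2, -23.6, -46.7, 13.0, 1)

print(len(positions_cf), "row,", len(positions_cf[42]), "columns")
print(len(positions_table), "table rows")
```

In the Cassandra layout the number of rows equals the number of vehicles and
each row widens over time; in the MySQL layout every position is a new row.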

My test case was to insert 1k positions per day for each of 1k vehicles over
10 days, which gives 10,000,000 inserts.

The final throughput that my application achieved for this scenario was:

Cassandra x 4
2012-01-21 11:45:38,044 #6         [Logger.Log] INFO  - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
2012-01-21 11:45:38,082 #6         [Logger.Log] INFO  - >> Total Time: 2:37:03,359
2012-01-21 11:45:38,085 #6         [Logger.Log] INFO  - >> Throughput: 1061 inserts/s

And for MySQL x 2
2012-01-21 14:26:25,197 #6         [Logger.Log] INFO  - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
2012-01-21 14:26:25,250 #6         [Logger.Log] INFO  - >> Total Time: 2:06:25,914
2012-01-21 14:26:25,263 #6         [Logger.Log] INFO  - >> Throughput: 1318 inserts/s
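
As a cross-check, the reported throughput follows directly from the total
times in the logs above. The sketch below recomputes it; parse_elapsed is a
hypothetical helper, and the only inputs are the figures from the two runs.

```python
def parse_elapsed(text):
    """Convert 'H:MM:SS,mmm' (comma as decimal separator) to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds.replace(",", "."))


TOTAL_INSERTS = 10_000_000
runs = {"Cassandra x 4": "2:37:03,359", "MySQL x 2": "2:06:25,914"}

rates = {name: TOTAL_INSERTS / parse_elapsed(t) for name, t in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} inserts/s")

ratio = rates["MySQL x 2"] / rates["Cassandra x 4"]
print(f"MySQL / Cassandra throughput ratio: {ratio:.2f}")
```

In this run the sharded MySQL setup was roughly 24% faster than the 4-node
Cassandra cluster.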

Is there something that I'm missing here? Is this expected? Or is the problem
somewhere else, and hard to pin down from this description?

Cheers,
Gustavo

Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Maxim Potekhin <po...@bnl.gov>.

Hello,
I have some experience in benchmarking Cassandra against Oracle and in 
running on a VM cluster.

While the VM solution will work for many applications, it simply won't 
cut it for all. In particular, I observed a large difference in insert 
performance when I moved from VMs to real hardware. This can be due to a 
bazillion factors, including the high core count on my "real" machines 
and vastly better I/O. The CPU is crucial for inserts in Cassandra, and 
it may not be for an RDBMS.

Another factor is the potential bottleneck in the client. There are 
cases when you won't have enough muscle to handle the data in the client 
itself.

None of this is definitive; I'm just throwing in a bit of my 
experience from the past 12 months. Right now I'm able to sink data at 
insane speeds, far beyond those of Oracle.

Maxim



Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Gustavo Gustavo <do...@gmail.com>.

I was able to make Cassandra beat MySQL MyISAM (~10k inserts/s against 6k
inserts/s) using two physical machines (laptops), one as the client and the
other as the server, with 50 inserting threads.
I don't know exactly why yet, but the high-level C# client that I was using
(Aquiles) was taking a lot of CPU. I switched to fluent-cassandra and
things started to go pretty fast. I suspect this was the real problem.
Yep, dual boot is a good idea. I'll give it a try and see if I can push
both datastores further. But I think the client won't have enough CPU to
handle much more than 50 threads.
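
One rough way to tell whether the client itself is CPU-bound is to compare
the CPU time the process consumes with the wall-clock time of a run. The
sketch below is in Python rather than C#, and busy and idle are stand-ins
for a CPU-heavy client library and a client that mostly waits on the
network; it illustrates the measurement only, not how Aquiles or
fluent-cassandra behave.

```python
import time


def cpu_fraction(work):
    """CPU seconds used by this process divided by wall-clock seconds for work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


def busy():
    # Stand-in for a client that burns CPU on serialization or bookkeeping.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def idle():
    # Stand-in for a client that mostly waits on the server.
    time.sleep(0.2)


print(f"busy client: {cpu_fraction(busy):.2f}")
print(f"idle client: {cpu_fraction(idle):.2f}")
```

A fraction near 1.0 per core in use suggests that the client, not the
datastore, is the limiting factor.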

/Gustavo

>>  On Jan 22, 2012, at 8:51 AM, Edward Capriolo wrote:
>>
>> In some sense one-for-one performance "almost" does not matter. Though I
>> bet you can get Cassandra to do better (I remember old-school YCSB white
>> paper benchmarks against a sharded MySQL).
>>
>> One of the main bullet points of Cassandra is that if you want to grow
>> from 4 nodes, to 8 nodes, to 14 nodes, and so on, Cassandra is elastic and
>> supports adding and removing nodes online. A do-it-yourself hash-mod
>> algorithm like this really has no upgrade path.
>>
>> Edward
>>
>> On Sun, Jan 22, 2012 at 9:26 AM, Chris Gerken <chrisgerken@mindspring.com
>> > wrote:
>>
>>> Howdy Gustavo,
>>>
>>>  One thing that jumped out at me is your having put two cassandra
>>> images on the same box.  There may be enough CPU and memory for the two
>>> images combined but you may be seeing some other resource not being shared
>>> so nicely - network card bandwidth, for example.
>>>
>>>  More generally, the real question is what the bottleneck is (for both
>>> db's, actually).  Start with Cassandra running in that configuration and
>>> start with one client thread sending one request a second.  Look at the
>>> CPU, network and memory metrics for all boxes (including the client).
>>>  Nothing should be even close to maxing out at that throughput.  Now
>>> incrementally increase one of the test parameters (number of clients or
>>> number of inserts per second) just a bit (say from one transaction to 5)
>>> and note the above metrics.  Keep slowly increasing the test parameters,
>>> one at a time, until one of the metrics maxes out.  That's the bottleneck
>>> you're wondering about.  Fix that and the db (be it Cassandra or MySQL)
>>> will move ahead of the other performance-wise.  Turn your attention to the
>>> other db and repeat.
>>>
>>>   - Chris Gerken
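
The ramp-up procedure described above can be sketched as a small load
driver. This is an illustration under assumptions: fake_insert stands in for
a real insert call, only the thread count is varied, and the 10% gain
threshold is arbitrary. While it runs, CPU, network and memory on every box
would still need to be watched separately.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure(insert_fn, threads, total_inserts):
    """Run total_inserts calls of insert_fn on `threads` workers; return inserts/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(insert_fn, range(total_inserts)))
    return total_inserts / (time.perf_counter() - start)


def ramp(insert_fn, thread_counts, total_inserts):
    """Measure throughput at each thread count, smallest first."""
    return [(n, measure(insert_fn, n, total_inserts)) for n in thread_counts]


def first_plateau(results, min_gain=0.10):
    """Thread count after which the next step gained less than min_gain, or None."""
    for (n_prev, rate_prev), (_, rate_next) in zip(results, results[1:]):
        if rate_next < rate_prev * (1 + min_gain):
            return n_prev
    return None


def fake_insert(i):
    # Hypothetical stand-in for one insert: about 1 ms of round-trip latency.
    time.sleep(0.001)


results = ramp(fake_insert, [1, 2, 4, 8], 200)
for threads, rate in results:
    print(f"{threads} threads: {rate:.0f} inserts/s")
print("plateau after:", first_plateau(results))
```

The thread count at which throughput stops growing is where to look for the
saturated resource.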

Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Maxim Potekhin <po...@bnl.gov>.

a) I hate to break it to you, but 6GB x 4 cores != 'high-end machine'. 
It's pretty much middle of the road consumer level these days.

b) Hosting the client and Cassandra on the same node is a Bad Idea. It 
will depend on what exactly the client will do, but in my experience it 
won't work too well in general.

c) Have you considered dual boot, so you can have a "good operating 
system" (as per Cassandra folks) in addition to Windows?

Maxim



Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Gustavo Gustavo <do...@gmail.com>.

Ok guys, thank you for the valuable hints you gave me.
For sure, things will perform much better on real hardware. But my objective
isn't really to find the maximum throughput of each datastore. It is more
a question of which one performs better under equal conditions.
So here is what I'll do: I'm going to use a high-end machine (6GB RAM, 4
cores) and run Cassandra, MySQL and the client test application on the same
machine. Unfortunately, I'll have to use Windows 7 as the host for the
datastores.
From your experience, do you think Cassandra can beat an RDBMS at inserts,
even on a single node? I've seen that InnoDB (the engine most comparable to
other relational databases' engines) is pretty slow. But when it comes to
MyISAM, things are much faster.

/Gustavo

2012/1/22 Chris Gerken <ch...@mindspring.com>

> Edward (and Maxim),
>
> I agree.  I was just recalling previous performance bake-offs (for other
> technologies, long time ago, galaxy far far away) in which the customer had
> put together a mockup of the high throughput expected in production and
> wanted to make a decision against that one set of numbers.  We always found
> that both/all competing products could be made to run faster due to
> unexpected factors in the non-production test build.  For our side, we
> always started simple and built up the throughput until we found a
> bottleneck.  We fixed the bottleneck. Rinse and repeat.
>
> Chris Gerken
>
> chrisgerken@mindspring.com
> 512.587.5261
> http://www.linkedin.com/in/chgerken
>
>
>
> On Jan 22, 2012, at 8:51 AM, Edward Capriolo wrote:
>
> In some sense 1 for one performance "almost" does not matter. Thou I bet
> you can get Cassandra better (I remember old school ycsb white paper
> benches against a sharded mysql).
>
> One of the main bullet points of Cassandra is if you want to grow from 4
> nodes, to 8 nodes, to 14 nodes, and so on, Cassandra is elastic and
> supports online adding and removing of nodes. A do-it-yourself hash mod
> this algorithm really has no upgrade path
>
> Edward
>
> On Sun, Jan 22, 2012 at 9:26 AM, Chris Gerken <ch...@mindspring.com>wrote:
>
>> Howdy Gustavo,
>>
>> One thing that jumped out at me is your having put two cassandra images
>> on the same box.  There may be enough CPU and memory for the two images
>> combined but you may be seeing some other resource not being shared so
>> nicely - network card bandwidth, for example.
>>
>> More generally, the real question is what the bottleneck is (for both
>> db's, actually).  Start with Cassandra running in that configuration and
>> start with one client thread sending one request a second.  Look at the
>> CPU, network and memory metrics for all boxes (including the client).
>>  Nothing should be even close to maxing out that that throughout.  Now
>> incrementally increase one of the test parameters (number of clients or
>> number of inserts per second) just a bit (say from one transaction to 5)
>> and note the above metrics.  Keep slowly increasing the test parameters,
>> one at a time, until one of the metrics maxes out.  That's the bottleneck
>> you're wondering about.  Fix that and the db, be it Cassandra or MySQL)
>> will move ahead of the other performance-wise.  Turn your attention to the
>> other db and repeat.
>>
>>  - Chris Gerken
>>
>> On Jan 22, 2012, at 7:10 AM, Gustavo Gustavo wrote:
>>
>> Hello,
>>
>> I've set up a testing evironment for Cassandra and MySQL, to compare
>> both, regarding *performance only*. And I must admit that I was expecting
>> Cassandra to beat MySQL. But I've not seen this happening up to now.
>> My application/use case is INSERT intensive, since I'm not updating
>> anything, just inserting all the time.
>> To compare both I created virtual machines with Ubuntu 11.10, and
>> installed the latest versions of each datastore. Each VM has 1GB of RAM.
>> I've used VMs as a way to give both datastores an equal sandbox.
>> MySQL is set up to work as sharded, with 2 databases, that means that
>> records are inserted to a specific instance based on key % 2. The engine is
>> MyISAM (InnoDB was really slow and not really needed to my case). There's a
>> primary compound key (integer and datetime columns) in this test table.
>> Let's name the "nodes" MySQL1 and MySQL2.
>> Cassandra is set up to work with 4 nodes, with keys (tokens) set up to
>> distribute records evenly across the 4 nodes (nodetool ring reports 25% to
>> each node), replication factor 1 and RandomPartitioner, the other configs
>> are left to default. Let's name the nodes Cassandra1, Cassandra2,
>> Cassandra3 and Cassandra4.
>>
>> I'm using 2 physical machines (Windows7) to host the 4 (Cassandra) or 2
>> (MySQL) virtual machines, this way:
>> Machine1: MySQL1, Cassandra1, Cassandra3
>> Machine2: MySQL2, Cassandra2, Cassandra4
>> The machines have CPU and RAM enough to host Cassandra Cluster or MySQL
>> "Cluster" at a time.
>>
>> The client test applicatin is running in a third physical machine, with 8
>> threads doing inserts. The test application is written in C# (Windows7)
>> using Aquiles high-level client.
>>
>> My use case is a vehicle tracking system. So, let's suppose, from minute
>> to minute, the vehicle sends its position together with some other GPS data
>> and vehicle status information. The columns in my Cassandra cluster are
>> just the DateTime (long value) of a position for a specific vehicle, and
>> the value is all the other data serialized to binary format. Therefore, my
>> CF really grows in columns number. So all data is inserted only to one
>> CF/Table named Positions. The key to Cassandra is the VehicleID and to
>> MySQL VehicleID + PositionDateTime (MySQL creates an index to this
>> automatically). Important to note that MySQL threw tons of connection
>> exceptions, even though, the insert was retried until it got through MySQL.
>>
>> My test case was to insert 1k positions for 1k vehicles to 10 days -
>> which gives 10.000.000 of inserts.
>>
>> The final thoughtput that my application had for this scenario was:
>>
>> Cassandra x 4
>> 2012-01-21 11:45:38,044 #6         [Logger.Log] INFO  - >> Inserted
>> 10000 positions for 1000 vehicles (10000000 inserts):
>> 2012-01-21 11:45:38,082 #6         [Logger.Log] INFO  - >> Total Time:
>> 2:37:03,359
>> 2012-01-21 11:45:38,085 #6         [Logger.Log] INFO  - >> Throughput:
>> 1061 inserts/s
>>
>> And for MySQL x 2
>> 2012-01-21 14:26:25,197 #6         [Logger.Log] INFO  - >> Inserted
>> 10000 positions for 1000 vehicles (10000000 inserts):
>> 2012-01-21 14:26:25,250 #6         [Logger.Log] INFO  - >> Total Time:
>> 2:06:25,914
>> 2012-01-21 14:26:25,263 #6         [Logger.Log] INFO  - >> Throughput:
>> 1318 inserts/s
>>
>> Is there something that I'm missing here? Is this expected? Or the
>> problem is somewhere else and that's hard to say looking at this
>> description?
>>
>> Cheers,
>> Gustavo
>>
>>
>>
>
>

Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Chris Gerken <ch...@mindspring.com>.

Edward (and Maxim),

I agree.  I was just recalling previous performance bake-offs (for other technologies, long time ago, galaxy far far away) in which the customer had put together a mockup of the high throughput expected in production and wanted to make a decision against that one set of numbers.  We always found that both/all competing products could be made to run faster due to unexpected factors in the non-production test build.  For our side, we always started simple and built up the throughput until we found a bottleneck.  We fixed the bottleneck. Rinse and repeat.

Chris Gerken


Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Edward Capriolo <ed...@gmail.com>.

In some sense, one-for-one performance "almost" does not matter. Though I
bet you can get Cassandra to do better (I remember old-school YCSB white
paper benchmarks against a sharded MySQL).

One of the main bullet points of Cassandra is that if you want to grow from
4 nodes to 8 nodes, to 14 nodes, and so on, Cassandra is elastic and
supports adding and removing nodes online. A do-it-yourself "hash mod N"
sharding scheme really has no upgrade path.
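
To make the "no upgrade path" point concrete, here is a minimal Python sketch (not from the thread) that counts how many keys change owner when a shard is added. It compares a key % N scheme with a simplified token ring. The function names, the 10,000 sample keys, and the evenly spaced tokens are assumptions chosen for illustration; the ring is a toy model of MD5-based token ownership, not Cassandra's actual implementation.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space assumed for this toy ring


def token_for(key):
    """Map a key to a position on the ring using an MD5 digest."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def ring_owner(sorted_tokens, key):
    """Return the token of the node that owns the key.

    The owner is the first node token >= the key's token, wrapping
    around to the first node when the key falls past the last token.
    """
    idx = bisect.bisect_left(sorted_tokens, token_for(key))
    return sorted_tokens[idx % len(sorted_tokens)]


def moved_fraction_modulo(keys, old_n, new_n):
    """Fraction of keys whose shard changes when going from old_n to new_n shards."""
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)


def moved_fraction_ring(keys, old_tokens, new_tokens):
    """Fraction of keys whose owning node changes between two ring layouts."""
    old_sorted = sorted(old_tokens)
    new_sorted = sorted(new_tokens)
    moved = sum(
        1 for k in keys
        if ring_owner(old_sorted, k) != ring_owner(new_sorted, k)
    )
    return moved / len(keys)


keys = range(10_000)  # hypothetical vehicle IDs

# Growing from 2 to 3 shards with key % N.
modulo_moved = moved_fraction_modulo(keys, 2, 3)

# Growing from 2 to 3 nodes on the ring by adding one token.
old_ring = [0, RING_SIZE // 2]
new_ring = [0, RING_SIZE // 4, RING_SIZE // 2]
ring_moved = moved_fraction_ring(keys, old_ring, new_ring)

print(f"key % N: {modulo_moved:.0%} of keys move")
print(f"token ring: {ring_moved:.0%} of keys move")
```

With key % N, roughly two thirds of the keys land on a different shard after adding the third one, so most of the data has to be rewritten. On the ring, only the keys that fall into the new node's token range move.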

Edward


Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by Chris Gerken <ch...@mindspring.com>.

Howdy Gustavo,

One thing that jumped out at me is your having put two Cassandra images on the same box.  There may be enough CPU and memory for the two images combined, but you may be seeing some other resource not being shared so nicely - network card bandwidth, for example.

More generally, the real question is what the bottleneck is (for both db's, actually).  Start with Cassandra running in that configuration and start with one client thread sending one request a second.  Look at the CPU, network and memory metrics for all boxes (including the client).  Nothing should be even close to maxing out at that throughput.  Now incrementally increase one of the test parameters (number of clients or number of inserts per second) just a bit (say from one transaction to 5) and note the above metrics.  Keep slowly increasing the test parameters, one at a time, until one of the metrics maxes out.  That's the bottleneck you're wondering about.  Fix that and the db (be it Cassandra or MySQL) will move ahead of the other performance-wise.  Turn your attention to the other db and repeat.
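
The ramp-up procedure described above could be sketched roughly as follows. This is a minimal Python illustration, not the C# test application from the thread: `insert_fn` is a hypothetical placeholder for whatever call performs one insert against the database, and the 10% gain threshold in `find_knee` is an arbitrary assumption. The sketch only measures client-side throughput; CPU, network and memory metrics on each box still have to be watched separately.

```python
import threading
import time


def measure_throughput(insert_fn, num_threads, inserts_per_thread):
    """Run insert_fn from num_threads threads and return inserts per second."""
    def worker(thread_id):
        for i in range(inserts_per_thread):
            insert_fn(thread_id, i)

    threads = [
        threading.Thread(target=worker, args=(t,)) for t in range(num_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Guard against a zero interval on very fast runs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (num_threads * inserts_per_thread) / elapsed


def ramp(insert_fn, thread_counts, inserts_per_thread):
    """Measure throughput at each thread count, changing one parameter at a time."""
    return [
        (n, measure_throughput(insert_fn, n, inserts_per_thread))
        for n in thread_counts
    ]


def find_knee(results, min_gain=1.10):
    """Return the thread count after which throughput stops improving.

    A step counts as an improvement only if throughput grows by at least
    min_gain (10% by default, an assumed threshold).
    """
    for (n_prev, tp_prev), (_, tp_next) in zip(results, results[1:]):
        if tp_next < tp_prev * min_gain:
            return n_prev
    return results[-1][0]
```

A run would look like `find_knee(ramp(my_insert, [1, 2, 4, 8, 16], 1000))`, where `my_insert` is your own function. The thread count it returns is where to start looking at the system metrics for whichever resource is saturated.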

- Chris Gerken


Re: Cassandra x MySQL Sharded - Insert Comparison

Posted by David Allsopp <dn...@gmail.com>.

When I did some performance testing on Cassandra 0.7.6, I was getting
10,000 - 20,000 inserts per second on a *single* Cassandra node, on real
hardware (a consumer desktop PC with 4 GB RAM). Cassandra has got
substantially faster since then. I was inserting 1KB columns each on a new
row, if I remember right, using multiple clients on localhost (each in its
own process).

So unless your values are much larger than 1KB, you should be able to get
*much* greater write throughput out of your 4-node cluster.

Several folks have pointed out issues - you need a physical machine per
node, and you need multiple clients in order to exploit the concurrency in
Cassandra (even when testing a single node). Ideally more RAM per node
would be good too.

On 22 January 2012 13:10, Gustavo Gustavo <do...@gmail.com> wrote:

>
>
> Cassandra x 4
> 2012-01-21 11:45:38,044 #6         [Logger.Log] INFO  - >> Inserted 10000
> positions for 1000 vehicles (10000000 inserts):
> 2012-01-21 11:45:38,082 #6         [Logger.Log] INFO  - >> Total Time:
> 2:37:03,359
> 2012-01-21 11:45:38,085 #6         [Logger.Log] INFO  - >> Throughput:
> 1061 inserts/s
>
> And for MySQL x 2
> 2012-01-21 14:26:25,197 #6         [Logger.Log] INFO  - >> Inserted 10000
> positions for 1000 vehicles (10000000 inserts):
> 2012-01-21 14:26:25,250 #6         [Logger.Log] INFO  - >> Total Time:
> 2:06:25,914
> 2012-01-21 14:26:25,263 #6         [Logger.Log] INFO  - >> Throughput:
> 1318 inserts/s
>
> Is there something that I'm missing here? Is this expected? Or the problem
> is somewhere else and that's hard to say looking at this description?
>
> Cheers,
> Gustavo
>
>
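
As a cross-check of the figures in the quoted log, the reported throughput can be recomputed from the total times. The short Python sketch below does that; the helper names are illustrative, and the only inputs are the 10,000,000 inserts and the two "Total Time" values from the log.

```python
def parse_duration(text):
    """Parse 'H:MM:SS,mmm' (comma as decimal separator) into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds.replace(",", "."))


def throughput(inserts, duration_text):
    """Return inserts per second for a run of the given duration."""
    return inserts / parse_duration(duration_text)


cassandra = throughput(10_000_000, "2:37:03,359")
mysql = throughput(10_000_000, "2:06:25,914")

print(f"Cassandra x 4: {cassandra:.0f} inserts/s")
print(f"MySQL x 2:     {mysql:.0f} inserts/s")
print(f"MySQL / Cassandra: {mysql / cassandra:.2f}x")
```

The results match the logged 1061 and 1318 inserts/s, which puts the MySQL run about 1.24 times ahead of the Cassandra run. Both figures are far below the 10,000 - 20,000 inserts per second mentioned above for a single node.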