Posted to user@cassandra.apache.org by Yi Yang <yy...@me.com> on 2011/08/16 02:44:52 UTC

Cassandra for numerical data set

Dear all,

I'd like to share my use case and discuss it with you all.

I'm currently working on my second Cassandra project, and I've run into a somewhat unusual use case: storing a traditional, relational data set in Cassandra. The data set consists entirely of int and float values, with no strings or other data types, and the column names are much longer than the values themselves. The row key is a version 3 (MD5-based) UUID derived from some other data.
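(For illustration only, not the actual code used here: a name-based, MD5-hashed version 3 UUID row key can be produced with the standard Java library. The source string below is made up.)

    import java.nio.charset.Charset;
    import java.util.UUID;

    public class RowKeyExample {
        public static void main(String[] args) {
            // Hypothetical identifying data for one record; in practice this
            // would be whatever "other data" the row key is derived from.
            String sourceData = "record-identifier-fields";

            // UUID.nameUUIDFromBytes() builds a name-based (version 3,
            // MD5-hashed) UUID from the given bytes.
            UUID rowKey = UUID.nameUUIDFromBytes(
                    sourceData.getBytes(Charset.forName("UTF-8")));

            System.out.println(rowKey);
        }
    }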

1)
I applied some workarounds to save disk space, but the data still takes approximately 12-15x more disk space than MySQL. I looked into the SSTable internals, optimized by choosing a better data serializer, and hashed each column name down to one byte. That brought the current database to roughly 6x the disk footprint of MySQL, which I think might be acceptable.
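(A hypothetical sketch of the column-name idea, not the scheme actually used here: one simple way to shrink a long column name to a single byte is to keep the low byte of its hash. With more than 256 distinct names, or an unlucky hash, a precomputed dictionary with a collision check is needed instead.)

    public class ColumnNameShortener {
        // Map a long, descriptive column name to a single byte.
        // Only 256 byte values exist, so collisions must be checked up
        // front; a precomputed name -> byte dictionary is the safer option.
        public static byte shorten(String columnName) {
            return (byte) (columnName.hashCode() & 0xFF);
        }

        public static void main(String[] args) {
            String longName = "some_very_descriptive_measurement_name";
            System.out.printf("%s -> 0x%02x%n", longName, shorten(longName) & 0xFF);
        }
    }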

I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days. I'll keep you updated on my testing, but I'd like to hear your ideas on saving disk space.

2)
I'm doing batch writes to the database (pulling data from multiple sources and combining them). I'd like to know whether there are better ways to improve write efficiency, since sequential writes are currently about the same speed as MySQL. It seems the commitlog requires more disk I/O than my test machine can afford.

3)
In my case, every row is read randomly with equal probability, and I have around 0.5M rows in total. Can you offer some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.

Thanks for your help.


Best,

Steve



Re: Cassandra for numerical data set

Posted by aaron morton <aa...@thelastpickle.com>.
> Is that because Cassandra really does use that much disk space?
The general design approach is / has been that storage space is cheap and plentiful. 

> My target is simply to get the 1.3 TB compressed down to 700 GB so that it fits on a single server, while keeping the same level of performance.

Not sure it's going to be possible to get the same performance from one machine as you would from several. 

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com



Re: Cassandra for numerical data set

Posted by Yi Yang <yy...@me.com>.
BTW,
If I'm going to insert a super column family (SCF) row with ~400 columns and ~50 subcolumns under each column, how often should I issue a mutation: once per column, or once per row?




Re: Cassandra for numerical data set

Posted by Yi Yang <yy...@me.com>.
Thanks Aaron.

>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources and combining them). I'd like to know whether there are better ways to improve write efficiency, since sequential writes are currently about the same speed as MySQL. It seems the commitlog requires more disk I/O than my test machine can afford.
> Have a look at http://www.datastax.com/dev/blog/bulk-loading
This looks like a great tool for me. I'll try it, since it should require much less bandwidth and disk I/O.

> 
>> 3)
>> In my case, every row is read randomly with equal probability, and I have around 0.5M rows in total. Can you offer some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.
> Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families; see the CLI help for create column family and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
The dataset is about 150 GB stored as CSV, and an estimated 1.3 TB stored as SSTables, so I don't think it can fit into memory. I'll experiment with the caching strategy, but I expect it will only improve my case a little.

I'm now looking into native compression for SSTables. I just applied the CASSANDRA-47 patch and found a huge performance penalty in my use case, and I haven't figured out the reason yet. I suppose CASSANDRA-647 will handle it better, though I see there are a number of tickets working on similar issues, including CASSANDRA-1608. Is that because Cassandra really does use that much disk space?

My target is simply to get the 1.3 TB compressed down to 700 GB so that it fits on a single server, while keeping the same level of performance.

Best,
Steve




Re: Cassandra for numerical data set

Posted by aaron morton <aa...@thelastpickle.com>.
> 
> 2)
> I'm doing batch writes to the database (pulling data from multiple sources and combining them). I'd like to know whether there are better ways to improve write efficiency, since sequential writes are currently about the same speed as MySQL. It seems the commitlog requires more disk I/O than my test machine can afford.
Have a look at http://www.datastax.com/dev/blog/bulk-loading
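(Roughly, and with all names below made up, this is the shape of the API described in that post for 0.8-era Cassandra: an SSTableSimpleUnsortedWriter writes SSTables straight to disk, bypassing the commitlog and memtables, and sstableloader then streams the resulting files into the cluster.)

    import java.io.File;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class BulkWriteSketch {
        public static void main(String[] args) throws Exception {
            // Output directory; sstableloader expects it to be named after
            // the keyspace. Keyspace and column family names are hypothetical.
            File dir = new File("MyKeyspace");
            if (!dir.exists()) dir.mkdir();

            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    dir, "MyKeyspace", "NumericData",
                    AsciiType.instance,  // column name comparator
                    null,                // no subcomparator (not a super CF)
                    64);                 // in-memory buffer size, in MB

            long timestamp = System.currentTimeMillis() * 1000; // microseconds

            // One row with one column; a real import would loop over the
            // source data, calling newRow()/addColumn() for each record.
            writer.newRow(ByteBufferUtil.bytes("some-row-key"));
            writer.addColumn(ByteBufferUtil.bytes("c1"), ByteBufferUtil.bytes(42), timestamp);

            writer.close(); // flush remaining buffered rows to SSTables on disk
        }
    }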

> 3)
> In my case, every row is read randomly with equal probability, and I have around 0.5M rows in total. Can you offer some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.
Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families; see the CLI help for create column family and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
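(For reference, in the CLI of that era the cache sizes and the cache provider are per column family attributes set via create/update column family. The column family name and the numbers below are only placeholders, and the alternative provider is only available if the version in use ships it.)

    update column family NumericData
        with keys_cached = 500000
        and rows_cached = 50000
        and row_cache_provider = 'SerializingCacheProvider';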

Hope that helps. 

 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
