Posted to user@cassandra.apache.org by M Vieira <mv...@gmail.com> on 2011/09/29 19:20:34 UTC

Very large rows VS small rows

What would be the best approach?
A) millions of ~2KB rows, where each row could have ~6 columns
B) hundreds of ~100GB rows, where each row could have ~1 million columns

Considerations:
Most entries will be searched for (read+write) at least once a day, but no
more than 3 times a day.
Cheap hardware across the cluster: 3 nodes, each with 16GB of RAM (heap =
8GB).
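
To make the two options concrete, this is roughly what I have in mind,
sketched with the pycassa Thrift client (the client choice, keyspace name,
column family names and keys are all placeholders I made up, and the
keyspace and column families would need to exist already):

    # Rough sketch only: pycassa assumed, every name below is invented.
    import pycassa

    pool = pycassa.ConnectionPool('DemoKS', ['localhost:9160'])

    # Model A: millions of narrow rows (~2KB), about 6 columns each.
    entries = pycassa.ColumnFamily(pool, 'Entries')
    entries.insert('entry:12345', {'title': 't', 'body': 'b',
                                   'owner': 'u', 'created': '2011-09-29',
                                   'status': 'ok', 'tags': 'x'})
    row = entries.get('entry:12345')  # the whole row is tiny

    # Model B: hundreds of very wide rows (~100GB), about 1 million columns each.
    buckets = pycassa.ColumnFamily(pool, 'Buckets')
    buckets.insert('bucket:42', {'item:000001': 'v1', 'item:000002': 'v2'})
    part = buckets.get('bucket:42', column_count=100)  # never fetch all of it at once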

Any input would be appreciated
M.

Re: Very large rows VS small rows

Posted by M Vieira <mv...@gmail.com>.
Thank you very much!
Just read some material on the wiki, such as the pages on limitations and
secondary indexes.
Adding to what you said, searching within large rows, by which I mean rows
with millions of columns, seems to be more like searching a plain hash than
a btree.
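
For what it's worth, the kind of in-row read I was thinking about looks
roughly like this (pycassa again, all names invented). Since columns within
a row are kept in comparator order, a bounded slice only touches the
requested range of column names:

    # Sketch: slicing a range of columns out of one wide row.
    import pycassa

    pool = pycassa.ConnectionPool('DemoKS', ['localhost:9160'])
    buckets = pycassa.ColumnFamily(pool, 'Buckets')

    # Columns in a row are stored sorted by the comparator, so this slice
    # reads only the 'item:000100' .. 'item:000199' range, not all of them.
    chunk = buckets.get('bucket:42',
                        column_start='item:000100',
                        column_finish='item:000199',
                        column_count=100)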

So model A it is!

Once again thank you for responding.

Re: Very large rows VS small rows

Posted by Jeremiah Jordan <je...@morningstar.com>.
So I need to read what I write before hitting send.  That should have been
"If A works for YOUR use case" and "Wide rows DON'T spread across nodes
well".

On 09/29/2011 02:34 PM, Jeremiah Jordan wrote:
> If A works for our use case, it is a much better option.  A given row
> has to be read in full to return data from it.  There used to be a
> limitation that a whole row had to fit in memory; there is now code to
> page through the data, so while that is no longer a hard limit, rows
> that don't fit in memory are still very slow to use.  Also wide rows
> spread across nodes.  You should also consider more nodes in your
> cluster.  From our experience, nodes perform better when they are only
> managing a few hundred GB each.  Pretty sure that 10TB+ of data
> (hundreds of rows * 100GB) will not perform very well on a 3-node
> cluster, especially if you plan to have RF=3, making it 10TB+ per node.
>
> -Jeremiah
>
> On 09/29/2011 12:20 PM, M Vieira wrote:
>> What would be the best approach?
>> A) millions of ~2KB rows, where each row could have ~6 columns
>> B) hundreds of ~100GB rows, where each row could have ~1 million columns
>>
>> Considerations:
>> Most entries will be searched for (read+write) at least once a day,
>> but no more than 3 times a day.
>> Cheap hardware across the cluster: 3 nodes, each with 16GB of RAM
>> (heap = 8GB).
>>
>> Any input would be appreciated
>> M.

Re: Very large rows VS small rows

Posted by Jeremiah Jordan <je...@morningstar.com>.
If A works for our use case, it is a much better option.  A given row
has to be read in full to return data from it.  There used to be a
limitation that a whole row had to fit in memory; there is now code to
page through the data, so while that is no longer a hard limit, rows
that don't fit in memory are still very slow to use.  Also wide rows
spread across nodes.  You should also consider more nodes in your
cluster.  From our experience, nodes perform better when they are only
managing a few hundred GB each.  Pretty sure that 10TB+ of data
(hundreds of rows * 100GB) will not perform very well on a 3-node
cluster, especially if you plan to have RF=3, making it 10TB+ per node.
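
For the paging point, something like pycassa's xget() (the client choice
and all names here are my own assumptions, not anything from this thread)
streams a wide row back in bounded slices instead of asking for the whole
thing at once. That is what makes a 100GB row usable at all, but every
slice still goes to the same replicas that own that row key:

    # Sketch: paging through one very wide row, all names invented.
    import pycassa

    pool = pycassa.ConnectionPool('DemoKS', ['localhost:9160'])
    buckets = pycassa.ColumnFamily(pool, 'Buckets')

    total_bytes = 0
    for name, value in buckets.xget('bucket:42'):
        # xget() issues a series of bounded slice queries under the hood,
        # so the client never holds the full ~100GB row in memory.
        total_bytes += len(value)

    # Back-of-the-envelope for the sizing point above:
    #   hundreds of rows * ~100GB each  ->  10TB+ of raw data
    #   RF=3 on a 3-node cluster        ->  each node stores a full copy,
    #                                       i.e. 10TB+ per node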

-Jeremiah

On 09/29/2011 12:20 PM, M Vieira wrote:
> What would be the best approach?
> A) millions of ~2KB rows, where each row could have ~6 columns
> B) hundreds of ~100GB rows, where each row could have ~1 million columns
>
> Considerations:
> Most entries will be searched for (read+write) at least once a day,
> but no more than 3 times a day.
> Cheap hardware across the cluster: 3 nodes, each with 16GB of RAM
> (heap = 8GB).
>
> Any input would be appreciated
> M.