Posted to user@cassandra.apache.org by Morgan Segalis <ms...@gmail.com> on 2012/04/26 15:30:48 UTC

Data model question, storing Queue Message

Hi everyone!

I'm fairly new to Cassandra and not yet familiar with the column-oriented NoSQL model.
I have worked on it for a while, but I can't seem to find the best model for what I'm looking for.

I have Erlang software that lets users connect and communicate with each other. When a user (A) sends
a message to a disconnected user (B), it stores the message in the database and waits for user (B) to connect and retrieve
the message queue, then deletes it.

Here are some key points:
- Users are identified by integer IDs
- Each message is unique by the combination of: Sender ID - Receiver ID - Message ID - time

I have a message queue, and here are the operations I need to do as fast as possible:

- Store from 1 to X messages per registered user
- Get the number of stored messages per user (can be an incremental variable updated at each store; this is retrieved often)
- Retrieve all messages for a user at once.
- Delete all messages for a user at once.
- Delete all messages that are older than Y months (from all users).

I really don't think that storage will be an issue: I have 2 TB per node, and messages are limited to 1 KB.
I'm really looking for speed rather than storage optimization.

My configuration is 2 dedicated servers, each with:
- 4 x Intel i7 2.66 GHz
- 64-bit
- 24 GB RAM
- 2 TB

Thank you all.

Re: Data model question, storing Queue Message

Posted by aaron morton <aa...@thelastpickle.com>.

> Isn't Kafka too young for production use?
The best way to advance the project is to use it and contribute your experience and time.

By the way, checking out Kafka is a great idea. There are people around having Fun Times with Kafka in production.

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


Re: Data model question, storing Queue Message

Posted by Morgan Segalis <ms...@gmail.com>.

Isn't Kafka too young for production use?

Clearly it would fit my needs much better, but I can't afford an early-stage project that isn't ready for production. Is it?


Re: Data model question, storing Queue Message

Posted by samal <sa...@gmail.com>.

On Mon, Apr 30, 2012 at 5:52 PM, Morgan Segalis <ms...@gmail.com> wrote:

> Hi Samal,
>
> Thanks for pointing out the TTL feature, I wasn't aware of its existence.
>
> Per-day partitioning will give narrower rows than per-month partitioning
> (about 30 times narrower, give or take ;-) )
> Each day should see something like 100,000 messages stored; most of them
> will be retrieved, and therefore deleted, before the TTL does its work.
>

The TTL is the latest point at which a column can exist in Cassandra; after that it is deleted.
Deleting a column before its TTL expires is fine.
Have you considered Kafka? http://incubator.apache.org/kafka/





Re: Data model question, storing Queue Message

Posted by Morgan Segalis <ms...@gmail.com>.

Hi Samal,

Thanks for pointing out the TTL feature, I wasn't aware of its existence.

Per-day partitioning will give narrower rows than per-month partitioning (about 30 times narrower, give or take ;-) )
Each day should see something like 100,000 messages stored; most of them will be retrieved, and therefore deleted, before the TTL does its work.
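
As a rough back-of-the-envelope check of these figures, here is a small Python sketch. It assumes the worst case, where every message is the full 1 KB (read here as 1024 bytes) and nothing is deleted before expiry; replication and Cassandra's own storage overhead are ignored. The function names are only illustrative.

```python
MESSAGES_PER_DAY = 100_000   # figure quoted in this message
MAX_MESSAGE_BYTES = 1024     # "1 KB limited", assumed to mean 1024 bytes
DAYS_PER_MONTH = 30          # the "about 30 times" approximation

def worst_case_bytes(days):
    """Upper bound on message payload written over `days` days, across all users."""
    return MESSAGES_PER_DAY * MAX_MESSAGE_BYTES * days

def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20

if __name__ == "__main__":
    daily = worst_case_bytes(1)
    monthly = worst_case_bytes(DAYS_PER_MONTH)
    print(f"per day:   {to_mib(daily):.1f} MiB")
    print(f"per month: {to_mib(monthly):.1f} MiB")
    print(f"ratio:     {monthly // daily}x")
```

These totals are for the whole cluster, not for a single row; how wide one row gets depends on how the messages are spread over receivers.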


Re: Data model question, storing Queue Message

Posted by samal <sa...@gmail.com>.

On Mon, Apr 30, 2012 at 4:25 PM, Morgan Segalis <ms...@gmail.com> wrote:

> Hi Aaron,
>
> Thank you for your answer, I was beginning to think that my question would
> never be answered ;-)
>
> Actually, this is what I was going for, with one exception: instead of
> partitioning rows per month, I thought about partitioning per day. That way,
> I launch the cleaning tool every day, and it deletes the day from X
> months earlier.
>

Use the TTL feature on columns: Cassandra will remove a column once its TTL
has expired (no need for a manual job).
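
To make the TTL semantics concrete, here is a toy Python model of a single row whose columns carry an expiry time. It is only a sketch of the behaviour described above, not Cassandra client code: `TtlRow` and `months_to_ttl_seconds` are hypothetical names, and the 30-day month is an assumption. In Cassandra the TTL is given in seconds when the column is written.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60

def months_to_ttl_seconds(months, days_per_month=30):
    """Approximate a retention period of `months` as a TTL in seconds."""
    return months * days_per_month * SECONDS_PER_DAY

class TtlRow:
    """Toy model of one row: each column stores its value and an expiry time."""

    def __init__(self):
        self._columns = {}  # column name -> (value, expires_at)

    def insert(self, name, value, ttl_seconds, now=None):
        """Write a column that stops being visible `ttl_seconds` after `now`."""
        now = time.time() if now is None else now
        self._columns[name] = (value, now + ttl_seconds)

    def delete(self, name):
        """Explicit delete; allowed at any time before the TTL expires."""
        self._columns.pop(name, None)

    def live_columns(self, now=None):
        """Return only the columns whose TTL has not yet expired."""
        now = time.time() if now is None else now
        return {n: v for n, (v, expires_at) in self._columns.items() if expires_at > now}
```

In a real cluster an expired column is not erased on the spot: it is treated like a deleted column and its space is reclaimed later by compaction.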

> I guess that would reduce the workload drastically. Does it have any
> downside compared to month partitioning?
>

A key belongs to a particular node, so whether you partition by day or by
month matters, depending on the size of your data. Otherwise you can end up
with a fat row, which will cause problems for the system.




Re: Data model question, storing Queue Message

Posted by Morgan Segalis <ms...@gmail.com>.

Hi Aaron,

Thank you for your answer, I was beginning to think that my question would never be answered ;-)

Actually, this is what I was going for, with one exception: instead of partitioning rows per month, I thought about partitioning per day. That way, I launch the cleaning tool every day, and it deletes the day from X months earlier. I guess that would reduce the workload drastically. Does it have any downside compared to month partitioning?

At one point I was going to do something like the twissandra example: one CF for the users' queues, and another CF with a row per day storing the ID of every message of that day. That way, if I want to delete them, I only look into this row and use the IDs to delete the messages in the users' queue CF… Is that a good way to do it? Or should I stick with the first implementation?
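
The two-CF idea can be sketched in Python with plain dictionaries standing in for the column families. All names here (`user_queue`, `day_index`, `store`, `purge_day`) and the "YYYYMMDD" day key are assumptions made for illustration; this is not Cassandra client code.

```python
from collections import defaultdict

# Stand-ins for the two column families.
user_queue = defaultdict(dict)  # receiver_id -> {message_id: body}
day_index = defaultdict(list)   # "YYYYMMDD" -> [(receiver_id, message_id), ...]

def store(receiver_id, message_id, body, day):
    """Write the message to the receiver's queue and record it in that day's index row."""
    user_queue[receiver_id][message_id] = body
    day_index[day].append((receiver_id, message_id))

def purge_day(day):
    """Delete every message indexed under `day`; return how many were still queued."""
    removed = 0
    for receiver_id, message_id in day_index.pop(day, []):
        # Messages already retrieved (and deleted) by their receiver are simply skipped.
        if user_queue.get(receiver_id, {}).pop(message_id, None) is not None:
            removed += 1
    return removed
```

The cost of this design is a second write per message and one delete per indexed ID at purge time, compared with dropping a whole row in the partitioned layout.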

Best regards,

Morgan.


Re: Data model question, storing Queue Message

Posted by aaron morton <aa...@thelastpickle.com>.

A message queue is often not a great use case for Cassandra. For information on how to handle high-delete workloads, see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra

It's hard to create a model without some idea of the data load, but I would suggest you start with:

CF: UserMessages
Key: ReceiverID
Columns : column name = TimeUUID ; column value = message ID and Body

That will order the messages by time. 
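
A minimal Python sketch of this layout, using only the standard library. The dictionary stands in for the UserMessages column family, and `store_message`, `read_queue` and `delete_queue` are hypothetical helper names rather than any client API. Sorting on the timestamp embedded in each version-1 UUID imitates the time ordering that TimeUUID column names give.

```python
import uuid
from collections import defaultdict

# Stand-in for the UserMessages column family: row key -> {column name: value}.
user_messages = defaultdict(dict)

def store_message(receiver_id, message_id, body):
    """Add one column to the receiver's row; the column name is a time-based UUID."""
    column_name = uuid.uuid1()
    user_messages[receiver_id][column_name] = (message_id, body)
    return column_name

def read_queue(receiver_id):
    """Return the receiver's messages ordered by the timestamp inside each UUID."""
    row = user_messages.get(receiver_id, {})
    return [row[name] for name in sorted(row, key=lambda u: u.time)]

def delete_queue(receiver_id):
    """Drop the whole row, i.e. every queued message for this receiver."""
    user_messages.pop(receiver_id, None)
```

With the receiver ID as the row key, "retrieve all messages for a user" and "delete all messages for a user" each touch a single row.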

Depending on load (and to support deleting a previous month's messages), you may want to partition the rows by month:

CF: UserMessagesMonth
Key: ReceiverID+YYYYMM
Columns : column name = TimeUUID ; column value = message ID and Body

Everything is the same as before, but now a user has one row per month, which you can delete as a whole. This also helps avoid very big rows.
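
The month-partitioned key can be sketched in Python as follows. The literal "+" separator, the function names, and the reading of "older than Y months" as "keep the current month plus the previous `keep_months` months" are all assumptions made for this example.

```python
from datetime import date

def row_key(receiver_id, day):
    """Build the ReceiverID+YYYYMM row key for the month containing `day`."""
    return f"{receiver_id}+{day:%Y%m}"

def months_back(day, n):
    """Return the first day of the month that lies `n` months before `day`'s month."""
    index = day.year * 12 + (day.month - 1) - n
    return date(index // 12, index % 12 + 1, 1)

def expired_row_keys(receiver_id, today, keep_months, oldest):
    """List one receiver's row keys for months past retention, newest first.

    `oldest` is the earliest date for which the receiver may have a row.
    """
    keys = []
    month = months_back(today, keep_months + 1)
    first = date(oldest.year, oldest.month, 1)
    while month >= first:
        keys.append(row_key(receiver_id, month))
        month = months_back(month, 1)
    return keys
```

For example, `row_key(42, date(2012, 4, 26))` gives `"42+201204"`. Purging old messages for all users still requires knowing which receiver IDs to build keys for; that bookkeeping is outside this sketch.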

> I really don't think that storage will be an issue: I have 2 TB per node, and messages are limited to 1 KB.
I would suggest you keep the per-node load to 300 to 400 GB. It can take a long time to compact, repair and move the data when it gets above 400 GB.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
