Posted to user@cassandra.apache.org by Sébastien Pierre <se...@gmail.com> on 2010/01/20 21:31:07 UTC

Cassandra to store logs as a list

Hi there !

I only looked briefly at Cassandra, and I would like to know how good it
would be at storing logs. I've been using Redis and its LIST structure to
store JSON-encoded log info, in the following fashion:

redis["site:0"] = ["{'visitor':1,'referer':'http://
...'}", "{'visitor':1,'referer':'http://...'}]

The problem is that the volume of logs is quite big, and would quickly
exhaust the memory on the server and kill performance -- which is why I'm
looking at Cassandra. Hence this question:
would it be possible to store multiple (ordered) values for the same key in
Cassandra ?

Thanks !

 -- Sébastien

Re: Cassandra to store logs as a list

Posted by Sébastien Pierre <se...@gmail.com>.
Haha, thanks :)

And just out of curiosity, what would be the write performance of a single
Cassandra node ? Would it be 1,000+ writes/second, or more like 10,000+ ?

 -- Sébastien

2010/1/20 Ville Lautanala <la...@gmail.com>

> Yes. TimeUUIDs use 60 bits for the time, with 100-nanosecond precision, but
> in addition they have other parts to prevent collisions. Assuming that
> everyone plays fair (i.e. you are not actively trying to create collisions),
> you are orders of magnitude more likely to be hit by a meteorite than to
> have two keys collide.
>

Re: Cassandra to store logs as a list

Posted by Ville Lautanala <la...@gmail.com>.
Yes. TimeUUIDs use 60 bits for the time, with 100-nanosecond precision, but
in addition they have other parts to prevent collisions. Assuming that
everyone plays fair (i.e. you are not actively trying to create collisions),
you are orders of magnitude more likely to be hit by a meteorite than to
have two keys collide.
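
For illustration, here is a minimal sketch (plain Python, standard library
only) of the pieces a version-1 "time" UUID is built from; the attributes
shown come from Python's uuid module, not from Cassandra itself:

    import uuid

    # uuid1() builds a version-1 UUID: a 60-bit timestamp counted in
    # 100 ns intervals since 1582-10-15, plus a clock sequence and a
    # node identifier that guard against collisions at the same instant.
    u = uuid.uuid1()

    print(u.time)       # 60-bit, 100 ns resolution timestamp
    print(u.clock_seq)  # collision guard within a single node
    print(u.node)       # usually derived from the MAC address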


-Ville

On Jan 20, 2010, at 23:03, Sébastien Pierre wrote:

> Hi Brandon,
> 
> And would TimeUUIDType allow nanosecond-level precision ? There's going to be a lot of data logged, and I would like to avoid having events overwritten.
> 
>  -- Sébastien
> 
> 2010/1/20 Brandon Williams <dr...@gmail.com>
> 2010/1/20 Sébastien Pierre <se...@gmail.com>
> 
> Hi there !
> 
> I only looked briefly at Cassandra, and I would like to know how good it would be at storing logs. I've been using Redis and its LIST structure to store JSON-encoded log info, in the following fashion:
> 
> redis["site:0"] = ["{'visitor':1,'referer':'http://...'}", "{'visitor':1,'referer':'http://...'}]
> 
> The problem is that the volume of logs is quite big, and would quickly exhaust the memory on the server and kill performance -- which is why I'm looking at Cassandra. Hence this question:
> would it be possible to store multiple (ordered) values for the same key in Cassandra ?
> 
> You could handle this equivalently in Cassandra by making the row name 'site:0', using TimeUUIDType for the column names, and JSON-serialized data as the values.
> 
> -Brandon
> 


Re: Cassandra to store logs as a list

Posted by Sébastien Pierre <se...@gmail.com>.
Ahhh, OK !

I got a little confused by the terminology, but your explanation really
made it clear, thanks a lot ! I don't think there will be more than a
million columns per row, as it's already aggregated by campaign and day.

I'll let you know how this works for me :)

 -- Sébastien


2010/1/20 Brandon Williams <dr...@gmail.com>

> 2010/1/20 Sébastien Pierre <se...@gmail.com>
>
>> Hmmm, the only thing that is still not clear is how I would store a lot of
>> values for the same key. With redis, I was using keys like
>> "campaign:<campaign_id>:<YYYY><MM><DD>" to store a *list* of
>> JSON-serialized log info, and the list could scale to literally millions
>> of entries. From my understanding, Cassandra can only store one value per
>> (column key, field) pair, can't it ?
>
>
> Each row in Cassandra can have an arbitrary number of columns, each
> consisting of a name and a value (and a timestamp). The columns are sorted
> by name based on the type used, which is why I recommended TimeUUIDType so
> you would get time-based sorting.
>
> So your row keys would be like "campaign:<campaign_id>:<YYYY><MM><DD>",
> your column names a TimeUUIDType, and your values the JSON data.
>
> Millions of columns in a row is OK; I would start being cautious beyond
> perhaps 100M, though.
>
> -Brandon
>

Re: Cassandra to store logs as a list

Posted by Brandon Williams <dr...@gmail.com>.
2010/1/20 Sébastien Pierre <se...@gmail.com>

> Hmmm, the only thing that is still not clear is how I would store a lot of
> values for the same key. With redis, I was using keys like
> "campaign:<campaign_id>:<YYYY><MM><DD>" to store a *list* of
> JSON-serialized log info, and the list could scale to literally millions
> of entries. From my understanding, Cassandra can only store one value per
> (column key, field) pair, can't it ?


Each row in Cassandra can have an arbitrary number of columns, each
consisting of a name and a value (and a timestamp). The columns are sorted
by name based on the type used, which is why I recommended TimeUUIDType so
you would get time-based sorting.

So your row keys would be like "campaign:<campaign_id>:<YYYY><MM><DD>", your
column names a TimeUUIDType, and your values the JSON data.

Millions of columns in a row is OK; I would start being cautious beyond
perhaps 100M, though.
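
As a rough sketch of that layout in Python (the log_cf object and its
insert() call stand in for whatever Cassandra client you use; they are not
a specific API):

    import json
    import time
    import uuid

    def log_event(log_cf, campaign_id, event):
        # Row key: one campaign's events for one day.
        key = 'campaign:%s:%s' % (campaign_id, time.strftime('%Y%m%d'))
        # Column name: a version-1 (time-based) UUID, so the
        # TimeUUIDType comparator keeps columns in chronological order.
        column_name = uuid.uuid1()
        # Column value: the JSON-serialized log entry.
        log_cf.insert(key, {column_name: json.dumps(event)})

    # e.g. log_event(log_cf, 42, {'visitor': 1, 'referer': 'http://...'})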

-Brandon

Re: Cassandra to store logs as a list

Posted by Sébastien Pierre <se...@gmail.com>.
Hmmm, the only thing that is still not clear is how I would store a lot of
values for the same key. With redis, I was using keys like
"campaign:<campaign_id>:<YYYY><MM><DD>" to store a *list* of JSON-serialized
log info, and the list could scale to literally millions of entries. From my
understanding, Cassandra can only store one value per (column key, field)
pair, can't it ?

 -- Sébastien

2010/1/20 Brandon Williams <dr...@gmail.com>

> 2010/1/20 Sébastien Pierre <se...@gmail.com>
>
>> Hi Mark,
>>
>> The most common query would be basically "get all the logs for this
>> particular day (and campaign)" or "get all the logs since this particular
>> time stamp (and campaign)", where everything would be aggregated by
>> "campaign id" (it's for an ad server).
>>
>> In this case, would using a key like the following improve balancing:
>> "campaign:<HEX_PADDED_CAMPAIGN_ID>:<NANOTIMESTAMP>" ? Also, if I add a
>> prefix (like "campaign:<HEX_PADDED_CAMPAIGN_ID>:"), would the key have to
>> be UTF8Type instead of TimeUUIDType ?
>>
>
> If this is your only query, then you don't need an OPP and don't have to
> worry about balancing with the RandomPartitioner. I would make the keys
> something between "campaign_id:<year>" and
> "campaign_id:<year>:<month>:<day>:<hour>", depending on how much volume you
> expect, so as not to overload a row.
>
> -Brandon
>

Re: Cassandra to store logs as a list

Posted by Brandon Williams <dr...@gmail.com>.
2010/1/20 Sébastien Pierre <se...@gmail.com>

> Hi Mark,
>
> The most common query would be basically "get all the logs for this
> particular day (and campaign)" or "get all the logs since this particular
> time stamp (and campaign)", where everything would be aggregated by
> "campaign id" (it's for an ad server).
>
> In this case, would using a key like the following improve balancing:
> "campaign:<HEX_PADDED_CAMPAIGN_ID>:<NANOTIMESTAMP>" ? Also, if I add a
> prefix (like "campaign:<HEX_PADDED_CAMPAIGN_ID>:"), would the key have to
> be UTF8Type instead of TimeUUIDType ?
>

If this is your only query, then you don't need an OPP and don't have to
worry about balancing with the RandomPartitioner. I would make the keys
something between "campaign_id:<year>" and
"campaign_id:<year>:<month>:<day>:<hour>", depending on how much volume you
expect, so as not to overload a row.
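
For example, a small key-building helper that lets you pick the bucket
granularity (the names here are illustrative, not from any particular
client library):

    import time

    def row_key(campaign_id, fmt='%Y:%m:%d'):
        # fmt ranges from '%Y' (one row per campaign per year) down to
        # '%Y:%m:%d:%H' (one row per campaign per hour); finer buckets
        # keep any single row from growing too large.
        return '%s:%s' % (campaign_id, time.strftime(fmt, time.gmtime()))

    # e.g. row_key(42)                -> '42:2010:01:20'    (daily)
    #      row_key(42, '%Y:%m:%d:%H') -> '42:2010:01:20:21' (hourly)

A time-range query then becomes a column slice (between two TimeUUIDs) over
the row or rows that cover the requested interval.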

-Brandon

Re: Cassandra to store logs as a list

Posted by Sébastien Pierre <se...@gmail.com>.
Hi Mark,

The most common query would be basically "get all the logs for this
particular day (and campaign)" or "get all the logs since this particular
time stamp (and campaign)", where everything would be aggregated by
"campaign id" (it's for an ad server).

In this case, would using a key like the following improve balancing:
"campaign:<HEX_PADDED_CAMPAIGN_ID>:<NANOTIMESTAMP>" ? Also, if I add a
prefix (like "campaign:<HEX_PADDED_CAMPAIGN_ID>:"), would the key have to
be UTF8Type instead of TimeUUIDType ?

 -- Sébastien


2010/1/20 Mark Robson <ma...@gmail.com>

> I think you really want to be using the OrderPreservingPartitioner and
> using time-based keys.
>
> It depends exactly how you're querying it. All querying use-cases need to
> be taken into account when deciding how to structure your data.
>
> If you use a time-based key with OPP, the data typically becomes very
> unbalanced, because the balancing algorithm (such as it is) depends on the
> keys continuing to have a distribution similar to the one they had when the
> nodes were bootstrapped.
>
> One solution would be to put some other field at the beginning of the key,
> such as an account id, customer id, site id, etc., if you have enough of
> these to spread the data out evenly (zero-padded hex, of course).
>
> Mark
>

Re: Cassandra to store logs as a list

Posted by Mark Robson <ma...@gmail.com>.
I think you really want to be using the OrderPreservingPartitioner and using
time-based keys.

It depends exactly how you're querying it. All querying use-cases need to be
taken into account when deciding how to structure your data.

If you use a time-based key with OPP, the data typically becomes very
unbalanced, because the balancing algorithm (such as it is) depends on the
keys continuing to have a distribution similar to the one they had when the
nodes were bootstrapped.

One solution would be to put some other field at the beginning of the key,
such as an account id, customer id, site id, etc., if you have enough of
these to spread the data out evenly (zero-padded hex, of course).

Mark

Re: Cassandra to store logs as a list

Posted by Sébastien Pierre <se...@gmail.com>.
Hi Brandon,

And would TimeUUIDType allow nanosecond-level precision ? There's going to
be a lot of data logged, and I would like to avoid having events
overwritten.

 -- Sébastien

2010/1/20 Brandon Williams <dr...@gmail.com>

> 2010/1/20 Sébastien Pierre <se...@gmail.com>
>
> Hi there !
>>
>> I only looked briefly at Cassandra, and I would like to know how good it
>> would be at storing logs. I've been using Redis and its LIST structure to
>> store JSON-encoded log info, in the following fashion:
>>
>> redis["site:0"] = ["{'visitor':1,'referer':'http://
>> ...'}", "{'visitor':1,'referer':'http://...'}]
>>
>> The problem is that the volume of logs is quite big, and would quickly
>> exhaust the memory on the server and kill performance -- which is why I'm
>> looking at Cassandra. Hence this question:
>> would it be possible to store multiple (ordered) values for the same key
>> in Cassandra ?
>>
>
> You could handle this equivalently in Cassandra by making the row name
> 'site:0', using TimeUUIDType for the column names, and JSON-serialized data
> as the values.
>
> -Brandon
>

Re: Cassandra to store logs as a list

Posted by Brandon Williams <dr...@gmail.com>.
2010/1/20 Sébastien Pierre <se...@gmail.com>

> Hi there !
>
> I only looked briefly at Cassandra, and I would like to know how good it
> would be at storing logs. I've been using Redis and its LIST structure to
> store JSON-encoded log info, in the following fashion:
>
> redis["site:0"] = ["{'visitor':1,'referer':'http://
> ...'}", "{'visitor':1,'referer':'http://...'}]
>
> The problem is that the volume of logs is quite big, and would quickly
> exhaust the memory on the server and kill performance -- which is why I'm
> looking at Cassandra. Hence this question:
> would it be possible to store multiple (ordered) values for the same key in
> Cassandra ?
>

You could handle this equivalently in Cassandra by making the row name
'site:0', using TimeUUIDType for the column names, and JSON-serialized data
as the values.
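
Something along these lines, sketched in Python (the cf object and its
insert() call are placeholders for your client library, not a specific API):

    import json
    import uuid

    def append_log(cf, entry):
        # One column per log entry: the column name is a version-1
        # (time-based) UUID so the TimeUUIDType comparator keeps entries
        # in chronological order; the value is the JSON blob that was
        # going into the Redis list.
        cf.insert('site:0', {uuid.uuid1(): json.dumps(entry)})

    # e.g. append_log(cf, {'visitor': 1, 'referer': 'http://...'})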

-Brandon