Posted to user@cassandra.apache.org by Dave Viner <da...@gmail.com> on 2011/01/05 02:34:52 UTC

anyone using Cassandra as an analytics/data warehouse?

Does anyone use Cassandra to power an analytics or data warehouse
implementation?

As a concrete example, one could imagine Cassandra storing data for
something that reports on page-views on a website.  The basic notions might
be simple (url as row-key and columns as timeuuids of viewers).  But how
would one store mappings like ip-geolocation to the set of pages viewed?  Or
hour-of-day to pages viewed?

Also, how would one do a query like
- "tell me how many page views occurred between 12/01/2010 and 12/31/2010"?
- "tell me how many page views occurred between 12/01/2010 and 12/31/2010
from the US"?
- "tell me how many page views occurred between 12/01/2010 and 12/31/2010
from the US in the 9th hour of the day (in gmt)"?

Time slicing and dimension slicing seem like they might be very challenging
(especially since the windows of time would not be known in advance).

Thanks
Dave Viner

Re: anyone using Cassandra as an analytics/data warehouse?

Posted by Jake Luciani <ja...@gmail.com>.
Some relevant information here:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

On Tue, Jan 4, 2011 at 10:09 PM, Dave Viner <da...@gmail.com> wrote:

> Hi Peter,
>
> Thanks.  These are great ideas.  One comment, though: I'm actually not as
> worried about the performance of logging data into the system as I am about
> querying data out of it.
>
> Most traditional data warehouses have a cube or a star schema or something
> similar.  I'm trying to imagine how one might use Cassandra in situations
> where that sort of design has historically been applied.
>
> But, I want to make sure I understand your suggestion A.
>
> Is it something like this?
>
> "a Column Family with the row key being the Unix time divided by 60x60 and
> a column key of... pretty much anything unique"
>     LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
> where 'hour-day-in-epoch-seconds' is something like the first second of the
> given hour of the day, so 01/04/2011 19:00:00 (in epoch
> seconds: 1294167600); 'timeuuid' is a TimeUUID from cassandra, and '1' is
> the value of the entry.
>
> Then "look at the current row every hour to actually compile the numbers,
> and store the count in the same Column Family"
>     LogCF[hour-day-in-epoch-seconds][total] = x
> where 'x' is the count of timeuuid columns in the row?
>
>
> Is that what you're envisioning in Option A?
>
> Thanks
> Dave Viner
>

Re: anyone using Cassandra as an analytics/data warehouse?

Posted by Peter Harrison <ch...@gmail.com>.
On Wed, Jan 5, 2011 at 4:09 PM, Dave Viner <da...@gmail.com> wrote:

>
> "a Column Family with the row key being the Unix time divided by 60x60 and
> a column key of... pretty much anything unique"
>     LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
> where 'hour-day-in-epoch-seconds' is something like the first second of the
> given hour of the day, so 01/04/2011 19:00:00 (in epoch
> seconds: 1294167600); 'timeuuid' is a TimeUUID from cassandra, and '1' is
> the value of the entry.
>
> Then "look at the current row every hour to actually compile the numbers,
> and store the count in the same Column Family"
>     LogCF[hour-day-in-epoch-seconds][total] = x
> where 'x' is the count of timeuuid columns in the row?
>

This looks correct.



In terms of the query, you need to work out the row keys in order to query
the right rows. Obviously the longer the period, the longer the query, but
the good news is that query time doesn't increase with total data volume.
When I first used this approach I thought it felt a little too brute force.
There are ways to make it faster - such as storing totals at coarser time
resolutions (daily, monthly) - but unless you expect queries over years
rather than a few weeks it's probably not worth the added complexity.
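As a concrete sketch (Python with the pycassa client; the keyspace, column
family, and the 'epoch:CC' per-country row keys are just the illustrative
names from this thread, not anything standard), the first and third use cases
come down to:

    import pycassa

    pool = pycassa.ConnectionPool('Analytics', ['localhost:9160'])
    log_cf = pycassa.ColumnFamily(pool, 'LogCF')

    def views_between(start, end, country=None, hour_of_day=None):
        # start/end: epoch seconds aligned to hour boundaries; end exclusive.
        total = 0
        for hour_start in range(start, end, 3600):
            # Third use case: keep only the requested GMT hour of the day.
            if hour_of_day is not None and (hour_start // 3600) % 24 != hour_of_day:
                continue
            # Second use case: per-country rows carry the country code in the key.
            row_key = '%d:%s' % (hour_start, country) if country else str(hour_start)
            try:
                total += int(log_cf.get(row_key, columns=['total'])['total'])
            except pycassa.NotFoundException:
                pass  # no traffic recorded for that hour
        return total

For example, views_between(1291161600, 1293840000) covers December 2010 GMT.
The loop does one read per hour - 744 of them for that month - which is the
brute force I was referring to.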

Re: anyone using Cassandra as an analytics/data warehouse?

Posted by Dave Viner <da...@gmail.com>.
Hi Peter,

Thanks.  These are great ideas.  One comment, though: I'm actually not as
worried about the performance of logging data into the system as I am about
querying data out of it.

Most traditional data warehouses have a cube or a star schema or something
similar.  I'm trying to imagine how one might use Cassandra in situations
where that sort of design has historically been applied.

But, I want to make sure I understand your suggestion A.

Is it something like this?

"a Column Family with the row key being the Unix time divided by 60x60 and a
column key of... pretty much anything unique"
    LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
where 'hour-day-in-epoch-seconds' is something like the first second of the
given hour of the day, so 01/04/2011 19:00:00 (in epoch
seconds: 1294167600); 'timeuuid' is a TimeUUID from cassandra, and '1' is
the value of the entry.

Then "look at the current row every hour to actually compile the numbers,
and store the count in the same Column Family"
    LogCF[hour-day-in-epoch-seconds][total] = x
where 'x' is the count of timeuuid columns in the row?
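In code, I'd imagine the write path looking something like this sketch
(assuming the pycassa client; the keyspace and column family names are made
up, and the comparator would need to be something like BytesType so the
TimeUUID columns and a literal 'total' column can share a row):

    import time
    import uuid

    import pycassa

    pool = pycassa.ConnectionPool('Analytics', ['localhost:9160'])
    log_cf = pycassa.ColumnFamily(pool, 'LogCF')

    def record_page_view():
        # Row key: the first second of the current hour, in epoch seconds.
        hour_key = str(int(time.time()) // 3600 * 3600)
        # Column name: a TimeUUID (version 1 UUID); the value is just a marker.
        log_cf.insert(hour_key, {uuid.uuid1().bytes: '1'})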


Is that what you're envisioning in Option A?

Thanks
Dave Viner



On Tue, Jan 4, 2011 at 6:38 PM, Peter Harrison <ch...@gmail.com> wrote:

> Okay, here are two ways to handle this; they are quite different from each
> other.
>
>
> A)
>
> This approach does not depend on counters. You simply have a Column Family
> with the row key being the Unix time divided by 60x60 and a column key of...
> pretty much anything unique. Then have another process look at the current
> row every hour to actually compile the numbers, and store the count in the
> same Column Family. This will solve the first and third use cases, as it is
> just a matter of looking at the right rows. The second case will require a
> similar index, but one with the country code appended to the row key.
>
> The downside here is that you are storing lots of data on individual
> requests and retaining it. If you don't want the detailed data you might add
> a second process to purge the detail every hour.
>
> B)
>
> There is a "counter" feature added to the latest versions of Cassandra. I
> have not used them, but they should be able to be used to achieve the same
> effect without a second process cleaning up every hour. Also means it is
> more of a real time system so you can see how many requests in the hour you
> are currently in.
>
>
>
> Basically you have to design your approach based on the query you will be
> doing. Don't get too hung up on traditional data structures and queries as
> they have little relationship to a Cassandra approach.
>

Re: anyone using Cassandra as an analytics/data warehouse?

Posted by Peter Harrison <ch...@gmail.com>.
Okay, here are two ways to handle this; they are quite different from each
other.


A)

This approach does not depend on counters. You simply have a Column Family
with the row key being the Unix time divided by 60x60 and a column key of...
pretty much anything unique. Then have another process look at the current
row every hour to actually compile the numbers, and store the count in the
same Column Family. This will solve the first and third use cases, as it is
just a matter of looking at the right rows. The second case will require a
similar index, but one with the country code appended to the row key.

The downside here is that you are storing lots of data on individual
requests and retaining it. If you don't want the detailed data you might add
a second process to purge the detail every hour.
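A rough sketch of that hourly compile step (again assuming pycassa, with
purely illustrative names; the per-country variant just appends the country
code to the row key):

    import pycassa

    pool = pycassa.ConnectionPool('Analytics', ['localhost:9160'])
    log_cf = pycassa.ColumnFamily(pool, 'LogCF')

    def compile_hour(hour_start, country=None):
        # hour_start: epoch seconds of the first second of the hour just ended.
        row_key = '%d:%s' % (hour_start, country) if country else str(hour_start)
        # Count the event columns in the row, then store the result back in the
        # same column family under a well-known column name. (On a re-run the
        # 'total' column itself gets counted too; a real version would exclude
        # it from the count or subtract one.)
        total = log_cf.get_count(row_key)
        log_cf.insert(row_key, {'total': str(total)})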

B)

There is a "counter" feature added to the latest versions of Cassandra. I
have not used them, but they should be able to be used to achieve the same
effect without a second process cleaning up every hour. Also means it is
more of a real time system so you can see how many requests in the hour you
are currently in.
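Roughly, assuming a column family defined with counter columns and a client
that supports increments (pycassa's add(), for instance - I haven't verified
this against the counter builds myself), the write path collapses to:

    import time

    import pycassa

    pool = pycassa.ConnectionPool('Analytics', ['localhost:9160'])
    counter_cf = pycassa.ColumnFamily(pool, 'CounterCF')  # counter columns

    def record_page_view(country):
        hour_start = int(time.time()) // 3600 * 3600
        # One increment for the hour bucket, one for the per-country bucket.
        counter_cf.add(str(hour_start), 'views', 1)
        counter_cf.add('%d:%s' % (hour_start, country), 'views', 1)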



Basically you have to design your approach based on the query you will be
doing. Don't get too hung up on traditional data structures and queries as
they have little relationship to a Cassandra approach.


On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner <da...@gmail.com> wrote:

> Does anyone use Cassandra to power an analytics or data warehouse
> implementation?
>
> As a concrete example, one could imagine Cassandra storing data for
> something that reports on page-views on a website.  The basic notions might
> be simple (url as row-key and columns as timeuuids of viewers).  But how
> would one store mappings like ip-geolocation to the set of pages viewed?  Or
> hour-of-day to pages viewed?
>
> Also, how would one do a query like
> - "tell me how many page views occurred between 12/01/2010 and 12/31/2010"?
> - "tell me how many page views occurred between 12/01/2010 and 12/31/2010
> from the US"?
> - "tell me how many page views occurred between 12/01/2010 and 12/31/2010
> from the US in the 9th hour of the day (in gmt)"?
>
> Time slicing and dimension slicing seem like they might be very challenging
> (especially since the windows of time would not be known in advance).
>
> Thanks
> Dave Viner
>