Posted to user@cassandra.apache.org by Dave Viner <da...@gmail.com> on 2011/02/24 00:21:07 UTC

cassandra as user-profile data store

Hi all,

I'm wondering if anyone has used cassandra as a datastore for a user-profile
service.  I'm thinking of applications like behavioral targeting, where
there are lots & lots of users (10s to 100s of millions), and lots & lots of
data about them intermixed in, say, weblogs (probably TBs worth).  The idea
would be to use Cassandra as a datastore for distributed parallel processing
of the TBs of files (say on hadoop).  Then the resulting user-profiles would
be query-able quickly.

Anyone know of that sort of application of Cassandra?  I'm trying to puzzle
out just what the column family might look like.  Seems like a mix of
time-oriented information (user x visits site y at time z), location
information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
120.10923), and derived information (because user x visited site y 15 times
within a 10 day window, user x must be interested in buying a car).
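
A derived rule like the last one could be sketched roughly as below. This is
only an illustration in plain Python: the function name visited_often and its
parameters are made up for the example, and it checks one user's visit times
for one site in memory rather than reading anything from Cassandra.

```python
from datetime import datetime, timedelta


def visited_often(visit_times, min_visits=15, window=timedelta(days=10)):
    """Return True if some span of length `window` holds at least `min_visits` visits.

    `visit_times` is an iterable of datetime objects for one user and one site.
    """
    times = sorted(visit_times)
    start = 0
    for end, t in enumerate(times):
        # Move the left edge forward until the span [times[start], t] fits the window.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= min_visits:
            return True
    return False


# Example: 15 visits, one every 12 hours, span 7 days in total.
first = datetime(2011, 2, 1)
visits = [first + timedelta(hours=12 * i) for i in range(15)]
print(visited_often(visits))
```

The two-pointer scan keeps the check linear after sorting, which matters if
the same rule is evaluated per user inside a batch job.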

I don't have specifics as yet... just some general thoughts.  But this feels
like a Cassandra type problem.  (User profile can have lots of columns per
user, but the exact columns might differ from user to user... very scalable,
etc)

Thanks
Dave Viner

CassandraForums.com

Posted by kh jo <jo...@yahoo.com>.

Hi Guys,
For all of those who prefer forums over mailing lists, I set up a forum for Cassandra. Please have a look:

http://www.cassandraforums.com/

thanks
Jo

Re: cassandra as user-profile data store

Posted by Tyler Hobbs <ty...@datastax.com>.

>
> I'm wondering if anyone has used cassandra as a datastore for a
> user-profile service.  I'm thinking of applications like behavioral
> targeting, where there are lots & lots of users (10s to 100s of millions),
> and lots & lots of data about them intermixed in, say, weblogs (probably TBs
> worth).  The idea would be to use Cassandra as a datastore for distributed
> parallel processing of the TBs of files (say on hadoop).  Then the resulting
> user-profiles would be query-able quickly.
>

Just to be clear, you're primarily interested in storing the processed data
(which you give examples of below) in Cassandra?


> Anyone know of that sort of application of Cassandra?  I'm trying to puzzle
> out just what the column family might look like.  Seems like a mix of
> time-oriented information (user x visits site y at time z), location
> information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
> 120.10923), and derived information (because user x visited site y 15 times
> within a 10 day window, user x must be interested in buying a car).
>

For the time-oriented data, you generally want to dedicate one row as a
timeline per user, using timestamps as column names.  I wouldn't expect any
of these to create extremely large rows, but if that's a possibility, you
should consider splitting the timelines into one row per year (or a smaller
time period) if needed.  If you have any need for an aggregate timeline with
a higher volume of data, different strategies apply.
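
As a rough sketch of that layout (not pycassa code; a plain dict stands in for
the column family, and the "user_id:year" key format, timeline_cf, and the
function names are assumptions made up for this example), one row per user per
year with timestamps as column names might look like this:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# In-memory stand-in for a column family: row key -> {column name -> value}.
timeline_cf = defaultdict(dict)


def timeline_row_key(user_id, ts):
    """Hypothetical key scheme: one timeline row per user per calendar year."""
    return f"{user_id}:{ts.year}"


def record_visit(user_id, ts, site):
    """Store a visit under a column named by its time in microseconds since the epoch."""
    column = (ts - EPOCH) // timedelta(microseconds=1)
    timeline_cf[timeline_row_key(user_id, ts)][column] = site


def visits_in_year(user_id, year):
    """Return the sites a user visited in `year`, oldest first."""
    row = timeline_cf.get(f"{user_id}:{year}", {})
    # Sorting by column name mimics columns being kept in timestamp order.
    return [site for _, site in sorted(row.items())]


record_visit("user-x", datetime(2011, 2, 23, 23, 21, tzinfo=timezone.utc), "site-y")
record_visit("user-x", datetime(2010, 12, 31, 8, 0, tzinfo=timezone.utc), "site-z")
print(visits_in_year("user-x", 2011))
```

A smaller bucket (month or day) only changes timeline_row_key; reads that span
several buckets then need one lookup per bucket.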

How you store the location data depends on what you want to do with it.  If
you're only interested in going from user -> locations, not from location ->
users, then a couple of possibilities come to mind.  You might want a
timeline of locations that a user has appeared from, or you might want a
counter for each location a user has appeared from.  What would you like to
do with these?
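
To make the two options concrete, here is a small in-memory sketch. The names
(location_timeline, location_counts, record_location and so on) are invented
for the example and plain Python containers stand in for column families; it
only shows the shape of the data, not how a real cluster would store it.

```python
from collections import Counter, defaultdict

# Option 1: a per-user timeline, timestamp -> (latitude, longitude).
location_timeline = defaultdict(dict)
# Option 2: a per-user counter, (latitude, longitude) -> number of appearances.
location_counts = defaultdict(Counter)


def record_location(user_id, ts, lat, lon):
    """Record one appearance of a user at a location in both layouts."""
    location_timeline[user_id][ts] = (lat, lon)
    location_counts[user_id][(lat, lon)] += 1


def locations_over_time(user_id):
    """Return the user's locations ordered by timestamp, oldest first."""
    row = location_timeline.get(user_id, {})
    return [loc for _, loc in sorted(row.items())]


def most_common_location(user_id):
    """Return the location the user appeared from most often, or None."""
    counts = location_counts.get(user_id)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


record_location("user-x", 1298503267, 31.20309, 120.10923)
print(most_common_location("user-x"))
```

The timeline answers "where was this user, and when?", while the counters
answer "where does this user usually appear from?" without scanning history.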

As for the derived information, I think you would need to decide a little
more concretely exactly what data you'll have and what you want to be
able to do with it.


> I don't have specifics as yet... just some general thoughts.
>

Let me know what specifics you can come up with and I'll try to give you
some more specific answers.  The devil is in the details when it comes to
data modeling in Cassandra!

-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

Re: cassandra as user-profile data store

Posted by Dave Gardner <da...@visualdna.com>.

Dave

We are in production with 0.6. We started with this and haven't had time to
figure out how to upgrade smoothly. It's on the horizon though; there are
loads of features in 0.7 that we could really do with.

In terms of strategy, we don't currently follow Tyler's suggestions. I can't
see any reason why we _wouldn't_ want to do this. However, when we first
implemented Cassandra, the big issue was building a data store that could
handle a lot of updates to profiles and serve low-latency reads on demand,
both with a large number of users. Right now we use a bunch of different
systems to generate the profiles, including Amazon EMR (via Hive). All of
this is subject to change soon though!

We do use Hadoop a lot to carry out analysis on the profiles.

It would be great to hear updates as and when you implement your system. If
you're ever in London, you could even present them at the Cassandra meetup!
http://meetup.com/Cassandra-London

Dave


On 1 March 2011 17:16, Dave Viner <da...@gmail.com> wrote:

> Hi Dave,
>
> Glad to hear others are using it in this fashion!
>
> Are you using Tyler's suggested strategy for user-profile data - one CF
> that stores the "timeline", with rows of user-ids, and TimeUUID columns for
> each data-collection-time.  Then some post-processing with Hadoop over the
> timelines for each user to build a "Profile"?
>
> Are you on 0.7 or 0.6.x?
>
> Dave Viner

Re: cassandra as user-profile data store

Posted by Dave Viner <da...@gmail.com>.

Hi Dave,

Glad to hear others are using it in this fashion!

Are you using Tyler's suggested strategy for user-profile data - one CF that
stores the "timeline", with rows of user-ids, and TimeUUID columns for each
data-collection-time.  Then some post-processing with Hadoop over the
timelines for each user to build a "Profile"?
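
For reference, a TimeUUID is a version 1 (time-based) UUID, so the collection
time can be recovered from the column name itself. A minimal sketch using only
Python's standard uuid module (the helper name timeuuid_to_datetime is made up
for this example):

```python
import uuid
from datetime import datetime, timedelta, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u):
    """Return the UTC time embedded in a version 1 UUID, to microsecond precision."""
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    micros = (u.time - UUID_EPOCH_OFFSET) // 10
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=micros)


column_name = uuid.uuid1()  # generated at data-collection time
print(timeuuid_to_datetime(column_name))
```

Unlike a bare timestamp, two events collected in the same instant still get
distinct column names, so neither overwrites the other.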

Are you on 0.7 or 0.6.x?

Dave Viner


On Tue, Mar 1, 2011 at 1:31 AM, Dave Gardner <da...@visualdna.com> wrote:

> Dave
>
> Tyler's answer already covers CFs etc.
>
> We are using Cassandra to store user profile data for exactly the sort of
> use case you describe. We don't yet store _all_ the data in Cassandra;
> currently we are focusing on the stuff we need available for real-time
> access. We use Hadoop to analyse the profiles from within Cassandra.
>
> Dave

Re: cassandra as user-profile data store

Posted by Dave Gardner <da...@visualdna.com>.

Dave

Tyler's answer already covers CFs etc.

We are using Cassandra to store user profile data for exactly the sort of
use case you describe. We don't yet store _all_ the data in Cassandra;
currently we are focusing on the stuff we need available for real-time
access. We use Hadoop to analyse the profiles from within Cassandra.

Dave
