Posted to user@cassandra.apache.org by Alain RODRIGUEZ <ar...@gmail.com> on 2012/01/18 18:23:24 UTC

How to store unique visitors in cassandra

I'm wondering how to model my CFs to store the number of unique visitors
in a time period so that I can query it quickly.

I thought of sharding them by day (row = 20120118, column = visitor_id,
value = '') and performing a getcount. This would work to get unique visitors
per day, per week or per month, but it wouldn't work if I want to get unique
visitors between 2 specific dates, because 2 rows can share the same
visitors (same columns). I can have 1500 unique visitors today and 1000 unique
visitors yesterday, but only 2000 unique visitors when aggregating these days.

I could get all the columns for these 2 rows and perform an intersect in
my client language, but performance won't be good with big data.
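
To make the double-counting problem concrete, here is a minimal Python
sketch. The visitor ids are made up and plain in-memory sets stand in for
the per-day rows, so nothing here is Cassandra API; it only shows why
adding per-day counts overstates the number of unique visitors.

```python
# Hypothetical per-day rows: row key = day, columns = visitor ids.
rows = {
    "20120117": {"alice", "bob", "carol"},
    "20120118": {"bob", "carol", "dave", "erin"},
}

def naive_total(rows, days):
    """Sum of the per-day counts (what adding per-row counts would give)."""
    return sum(len(rows[d]) for d in days)

def unique_total(rows, days):
    """True unique visitors: size of the union of the per-day column sets."""
    seen = set()
    for d in days:
        seen |= rows[d]
    return len(seen)

days = ["20120117", "20120118"]
print(naive_total(rows, days))   # 7
print(unique_total(rows, days))  # 5
```

The union is exact, but it needs every column of every row in the range on
the client, which is the part that does not scale.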

Has anyone already thought about this modeling problem?

Thanks for your help ;)

Alain

Re: How to store unique visitors in cassandra

Posted by Tyler Hobbs <ty...@datastax.com>.

On Thu, Jan 19, 2012 at 8:25 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

>
> I'm still in the dark about how to get the number of unique visitors
> between 2 dates (randomly chosen, because chosen by user) efficiently.
>
> I could easily count them per hour, day, week, month... But it's a bit
> harder to give this statistic between 2 unknown dates as explained at the
> start of this thread.
>
> Am I missing something in these slides?
>

Sometimes you will be fetching slices of multiple rows.

Basically, here's the procedure, given a start time t1 and an end time t2:
1. Determine all buckets (row keys) that hold data between t1 and t2.
Usually this means finding the bucket that t1 falls in, the bucket that t2
falls in, and then all buckets in between.
2. Use t1 as the column slice start, t2 as the column slice end, and
multiget all of the buckets that you just calculated.
3. Merge the results by concatenating the rows in order.

Note that the only rows where you will end up getting a partial slice are
the first and last row.  For all of the rows in between, you will end up
fetching the entire row.  This is fine, because t1 will be less than all of
the columns in those rows, and t2 will be greater than all of the columns
in those rows.
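
The three steps above can be sketched in Python as follows. This is only an
illustration under assumptions that are not in the thread: one bucket per
day, column names that are timestamps, and a plain dict standing in for the
column family. The function names (day_buckets, slice_row, range_query) are
hypothetical, not a client library's API.

```python
from datetime import datetime, timedelta

def day_buckets(t1, t2):
    """Step 1: row keys (one per day, 'YYYYMMDD') covering [t1, t2]."""
    day = t1.date()
    keys = []
    while day <= t2.date():
        keys.append(day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return keys

def slice_row(row, start, end):
    """Step 2, for one row: columns with start <= name <= end, in order."""
    return [(ts, v) for ts, v in sorted(row.items()) if start <= ts <= end]

def range_query(store, t1, t2):
    """Steps 2-3: slice every bucket with the same bounds, then concatenate."""
    result = []
    for key in day_buckets(t1, t2):
        result.extend(slice_row(store.get(key, {}), t1, t2))
    return result

# Hypothetical data: store[row_key][column_timestamp] = visitor_id
store = {
    "20120117": {datetime(2012, 1, 17, 9): "alice",
                 datetime(2012, 1, 17, 22): "bob"},
    "20120118": {datetime(2012, 1, 18, 8): "carol"},
    "20120119": {datetime(2012, 1, 19, 7): "bob",
                 datetime(2012, 1, 19, 20): "dave"},
}

t1, t2 = datetime(2012, 1, 17, 12), datetime(2012, 1, 19, 12)
hits = range_query(store, t1, t2)
print([v for _, v in hits])        # ['bob', 'carol', 'bob']
print(len({v for _, v in hits}))   # 2
```

Note that the merged slice can still contain the same visitor more than
once, so counting unique visitors this way still means de-duplicating on the
client.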

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: How to store unique visitors in cassandra

Posted by Alain RODRIGUEZ <ar...@gmail.com>.

Thanks Aaron, I had already looked at these slides, and I just went through
them again.

I'm still in the dark about how to get the number of unique visitors
between 2 dates (randomly chosen, because chosen by user) efficiently.

I could easily count them per hour, day, week, month... But it's a bit
harder to give this statistic between 2 unknown dates as explained at the
start of this thread.

Am I missing something in these slides?

2012/1/19 aaron morton <aa...@thelastpickle.com>

> Some tips here from Matt Dennis on how to model time series data
> http://www.slideshare.net/mattdennis/cassandra-nyc-2011-data-modeling
>
> Cheers
>  -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>

Re: How to store unique visitors in cassandra

Posted by aaron morton <aa...@thelastpickle.com>.

Some tips here from Matt Dennis on how to model time series data 
http://www.slideshare.net/mattdennis/cassandra-nyc-2011-data-modeling

Cheers
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 19/01/2012, at 10:30 PM, Alain RODRIGUEZ wrote:

> Hi, thanks for your answer, but I don't want to add another layer on top of Cassandra. I have also built my whole application without Countandra, and I would like to continue this way.
> 
> Furthermore, there is a Cassandra modeling problem here that I would like to solve, not just hide.
> 
> Alain
> 

Re: How to store unique visitors in cassandra

Posted by Alain RODRIGUEZ <ar...@gmail.com>.

Hi Laing, I think you answered the wrong mail =).

This thread is about modeling UV (unique visitors) over custom ranges.

But I am happy that you agree on my last message about the Datacenter
switch.

C*heers

2015-03-31 16:29 GMT+02:00 Laing, Michael <mi...@nytimes.com>:

> We use Alain's solution as well to make major operational revisions.
>
> We have a "red team" and a "blue team" in each AWS region, so we just add
> and drop datacenters to get where we want to be.
>
> Pretty simple.
>
> ml
>

Re: How to store unique visitors in cassandra

Posted by "Laing, Michael" <mi...@nytimes.com>.

We use Alain's solution as well to make major operational revisions.

We have a "red team" and a "blue team" in each AWS region, so we just add
and drop datacenters to get where we want to be.

Pretty simple.

ml

On Tue, Mar 31, 2015 at 8:16 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> People keep asking me whether we finally found a solution (even though this
> thread is 3+ years old), so I will update it with our findings.
>
> We finally achieved this, thanks to our big data and reporting stacks,
> by storing blobs corresponding to HLL (HyperLogLog) structures. HLL is an
> algorithm used by Google, Twitter and many more to solve count-distinct
> problems. Structures built through this algorithm can be "summed" (merged)
> and give a good approximation of the UV number.
>
> The precision you reach depends on the size of the structure you choose
> (so the precision is predictable). You can reach a fairly acceptable
> approximation with small data structures.
>
> So we basically store one HLL per hour and just "sum" the HLLs for all the
> hours between the 2 ends of the range (you can do it at day level or any
> other level, depending on your needs).
>
> I hope this will help some of you; we finally had this (good) idea after
> more than 3 years. We had actually been using HLL for a long time, but the
> idea of storing HLL structures instead of counts allows us to query custom
> ranges (at the price of more intelligence in the reporting stack, which must
> read and smartly sum the HLLs stored as blobs). We have been happy with it
> ever since.
>
> C*heers,
>
> Alain
>

Re: How to store unique visitors in cassandra

Posted by Jim Ancona <ji...@anconafamily.com>.

Very interesting. I had saved your email from three years ago in hopes of
an elegant answer. Thanks for sharing!

Jim

On Tue, Mar 31, 2015 at 8:16 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> People keep asking me whether we finally found a solution (even though this
> thread is 3+ years old), so I will update it with our findings.
>
> We finally achieved this, thanks to our big data and reporting stacks,
> by storing blobs corresponding to HLL (HyperLogLog) structures. HLL is an
> algorithm used by Google, Twitter and many more to solve count-distinct
> problems. Structures built through this algorithm can be "summed" (merged)
> and give a good approximation of the UV number.
>
> The precision you reach depends on the size of the structure you choose
> (so the precision is predictable). You can reach a fairly acceptable
> approximation with small data structures.
>
> So we basically store one HLL per hour and just "sum" the HLLs for all the
> hours between the 2 ends of the range (you can do it at day level or any
> other level, depending on your needs).
>
> I hope this will help some of you; we finally had this (good) idea after
> more than 3 years. We had actually been using HLL for a long time, but the
> idea of storing HLL structures instead of counts allows us to query custom
> ranges (at the price of more intelligence in the reporting stack, which must
> read and smartly sum the HLLs stored as blobs). We have been happy with it
> ever since.
>
> C*heers,
>
> Alain
>

Re: How to store unique visitors in cassandra

Posted by Alain RODRIGUEZ <ar...@gmail.com>.

People keep asking me whether we finally found a solution (even though this
thread is 3+ years old), so I will update it with our findings.

We finally achieved this, thanks to our big data and reporting stacks,
by storing blobs corresponding to HLL (HyperLogLog) structures. HLL is an
algorithm used by Google, Twitter and many more to solve count-distinct
problems. Structures built through this algorithm can be "summed" (merged)
and give a good approximation of the UV number.

The precision you reach depends on the size of the structure you choose
(so the precision is predictable). You can reach a fairly acceptable
approximation with small data structures.
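
As a rough illustration of that trade-off: the usual figure quoted for
HyperLogLog is a relative standard error of about 1.04 / sqrt(m), where m is
the number of registers. That formula and the parameter name p below (m =
2**p) come from the general HLL literature, not from this thread, so treat
the snippet as a sketch for getting a feel for the numbers.

```python
import math

def hll_relative_error(p):
    """Approximate relative standard error for a sketch with 2**p registers."""
    m = 2 ** p
    return 1.04 / math.sqrt(m)

# Print register count and approximate error (in percent) for a few sizes.
for p in (10, 12, 14):
    print(p, 2 ** p, f"{hll_relative_error(p) * 100:.2f}%")
```

Each register only needs a few bits, so even the larger sizes stay in the
kilobyte range per stored structure.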

So we basically store one HLL per hour and just "sum" the HLLs for all the
hours between the 2 ends of the range (you can do it at day level or any
other level, depending on your needs).

I hope this will help some of you; we finally had this (good) idea after more
than 3 years. We had actually been using HLL for a long time, but the idea of
storing HLL structures instead of counts allows us to query custom ranges (at
the price of more intelligence in the reporting stack, which must read and
smartly sum the HLLs stored as blobs). We have been happy with it ever since.
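
For readers who want to see what "summing" HLLs means, here is a minimal,
self-contained Python sketch. It is not our production code: the HLL class,
the record helper, the hour keys and the choice of sha1 as the hash are all
assumptions made for the example, and a dict stands in for the table of
blobs. Merging two structures is a register-wise maximum, which gives the
same structure as if every visitor had been added to a single HLL.

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch with 2**p one-byte registers."""

    def __init__(self, p=12, registers=None):
        self.p = p
        self.m = 1 << p
        if registers is None:
            self.registers = bytearray(self.m)
        else:
            self.registers = bytearray(registers)

    def add(self, item):
        # 64-bit hash; sha1 is used only because it ships with Python.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Union of two sketches: register-wise maximum."""
        assert self.p == other.p
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return HLL(self.p, merged)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
        est = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * m and zeros:              # small-range correction
            est = m * math.log(m / zeros)
        return int(round(est))

    def to_blob(self):
        return bytes(self.registers)              # what would go in a blob column

hourly = {}  # hypothetical stand-in for (hour -> HLL) storage

def record(hour, visitor_id):
    hourly.setdefault(hour, HLL()).add(visitor_id)

for i in range(1500):
    record("2012011810", f"visitor-{i}")
for i in range(1000, 2000):
    record("2012011811", f"visitor-{i}")

# Round-trip through blobs, then merge every hour in the requested range.
blobs = {hour: sketch.to_blob() for hour, sketch in hourly.items()}
total = HLL()
for hour in ("2012011810", "2012011811"):
    total = total.merge(HLL(registers=blobs[hour]))

print(total.count())  # an estimate near 2000, not 1500 + 1000
```

The reporting side only has to read the blobs for the hours (or days) in the
requested range and merge them, so the cost depends on the number of buckets
rather than on the number of visitors.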

C*heers,

Alain

2012-01-19 22:21 GMT+01:00 Milind Parikh <mi...@gmail.com>:

> You might want to look at the code in countandra.org, regardless of
> whether you use it. It uses a model of dynamic composite keys (although
> static composite keys would have worked as well). For the actual query, only
> one row is hit. This of course only works because the data model is attuned
> to the query.
>
> Regards
> Milind
>
> /***********************
> sent from my android...please pardon occasional typos as I respond @ the
> speed of thought
> ************************/
>

Re: How to store unique visitors in cassandra

Posted by Milind Parikh <mi...@gmail.com>.

You might want to look at the code in countandra.org, regardless of whether
you use it. It uses a model of dynamic composite keys (although static
composite keys would have worked as well). For the actual query, only one
row is hit. This of course only works because the data model is attuned to
the query.

Regards
Milind

/***********************
sent from my android...please pardon occasional typos as I respond @ the
speed of thought
************************/

On Jan 19, 2012 1:31 AM, "Alain RODRIGUEZ" <ar...@gmail.com> wrote:

Hi, thanks for your answer, but I don't want to add another layer on top of
Cassandra. I have also built my whole application without Countandra, and I
would like to continue this way.

Furthermore, there is a Cassandra modeling problem here that I would like to
solve, not just hide.

Alain

2012/1/18 Lucas de Souza Santos <lu...@gmail.com>
>
> Why not http://www.countandra.org/
>
>
> ...

Re: How to store unique visitors in cassandra

Posted by Alain RODRIGUEZ <ar...@gmail.com>.

Hi, thanks for your answer, but I don't want to add another layer on top of
Cassandra. I have also built my whole application without Countandra, and I
would like to continue this way.

Furthermore, there is a Cassandra modeling problem here that I would like to
solve, not just hide.

Alain

2012/1/18 Lucas de Souza Santos <lu...@gmail.com>

> Why not http://www.countandra.org/
>
>
> Lucas de Souza Santos (ldss)
>
>
>

Re: How to store unique visitors in cassandra

Posted by Lucas de Souza Santos <lu...@gmail.com>.

Why not http://www.countandra.org/


Lucas de Souza Santos (ldss)


On Wed, Jan 18, 2012 at 3:23 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> I'm wondering how to model my CFs to store the number of unique
> visitors in a time period so that I can query it quickly.
>
> I thought of sharding them by day (row = 20120118, column = visitor_id,
> value = '') and performing a getcount. This would work to get unique visitors
> per day, per week or per month, but it wouldn't work if I want to get unique
> visitors between 2 specific dates, because 2 rows can share the same
> visitors (same columns). I can have 1500 unique visitors today and 1000 unique
> visitors yesterday, but only 2000 unique visitors when aggregating these days.
>
> I could get all the columns for these 2 rows and perform an intersect in
> my client language, but performance won't be good with big data.
>
> Has anyone already thought about this modeling problem?
>
> Thanks for your help ;)
>
> Alain
>