Posted to user@cassandra.apache.org by Ian Holsman <ha...@holsman.net> on 2011/06/09 21:41:24 UTC

need some help with counters

Hi.

I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was wondering if anyone can help me with my problem.

I want to keep some page-view stats on a URL at different levels of granularity (page views per hour, page views per day, page views per year etc etc).


so my thinking was to create a counter with a key based on Year-Month-Day-Hour, and simply increment the counter as I go along.
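For illustration, the bucketing described here might look like the following sketch (pure Python, with made-up names; not actual Cassandra client calls):

```python
from datetime import datetime

def bucket_keys(ts):
    """Derive counter row keys at several granularities from one event
    timestamp. The key formats here are invented for this sketch."""
    return {
        "hour": ts.strftime("%Y-%m-%d-%H"),
        "day": ts.strftime("%Y-%m-%d"),
        "year": ts.strftime("%Y"),
    }

# Each page view would increment the counters at all three keys.
```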

this works well and I'm getting my metrics beautifully put into the right places.

the only problem I have is that I only need the last 48-hours worth of metrics at the hour level.

how do I get rid of the old counters? 
do I need to write an archiver that will go through each URL (could be millions) and just delete them?

I'm sure other people have encountered this, and I was wondering how they approached it.
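One way such an archiver sweep could work, sketched in plain Python over an in-memory stand-in for the counter store (hypothetical; a real job would iterate rows via the client API):

```python
from datetime import datetime, timedelta

def prune_hourly(counters, now, keep_hours=48):
    """Drop hourly buckets older than the retention window. Keys are
    'YYYY-MM-DD-HH' strings, so lexicographic order matches time order
    and a plain string comparison against the cutoff is enough."""
    cutoff = (now - timedelta(hours=keep_hours)).strftime("%Y-%m-%d-%H")
    return {k: v for k, v in counters.items() if k >= cutoff}
```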

TIA
Ian

Re: need some help with counters

Posted by Yang <te...@gmail.com>.
something like this:
https://issues.apache.org/jira/browse/CASSANDRA-2103

but this turned out not to be feasible

On Thu, Jun 9, 2011 at 12:41 PM, Ian Holsman <ha...@holsman.net> wrote:

> Hi.
>
> I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was
> wondering if anyone can help me with my problem.
>
> I want to keep some page-view stats on a URL at different levels of
> granularity (page views per hour, page views per day, page views per year
> etc etc).
>
>
> so my thinking was to create something a counter with a key based on
> Year-Month-Day-Hour, and simply increment the counter as I go along.
>
> this work's well and I'm getting my metrics beautifully put into the right
> places.
>
> the only problem I have is that I only need the last 48-hours worth of
> metrics at the hour level.
>
> how do I get rid of the old counters?
> do I need to write a archiver that will go through each url (could be
> millions) and just delete them?
>
> I'm sure other people have encountered this, and was wondering how they
> approached it.
>
> TIA
> Ian

Re: need some help with counters

Posted by Colin <co...@gmail.com>.
Hey guy, have you tried amazon turk?

--
Colin Clark
+1 315 886 3422 cell
+1 701 212 4314 office
http://cloudeventprocessing.com
http://blog.cloudeventprocessing.com
@EventCloudPro

*Sent from Star Trek like flat panel device, which although larger than my Star Trek like communicator device, may have typos and exhibit improper grammar due to haste and less than perfect use of the virtual keyboard*
 

On Jun 9, 2011, at 3:41 PM, Ian Holsman <ha...@holsman.net> wrote:

> Hi.
> 
> I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was wondering if anyone can help me with my problem.
> 
> I want to keep some page-view stats on a URL at different levels of granularity (page views per hour, page views per day, page views per year etc etc).
> 
> 
> so my thinking was to create something a counter with a key based on Year-Month-Day-Hour, and simply increment the counter as I go along. 
> 
> this work's well and I'm getting my metrics beautifully put into the right places.
> 
> the only problem I have is that I only need the last 48-hours worth of metrics at the hour level.
> 
> how do I get rid of the old counters? 
> do I need to write a archiver that will go through each url (could be millions) and just delete them?
> 
> I'm sure other people have encountered this, and was wondering how they approached it.
> 
> TIA
> Ian

Re: need some help with counters

Posted by Ian Holsman <ha...@holsman.net>.
On Jun 13, 2011, at 5:10 AM, aaron morton wrote:

>> I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. 
> 
> AFAIK thats not a great application for counters. You would need range support in the secondary indexes so you could get the first X rows ordered by a column value. 
> 
> To be honest, depending on scale, I'd consider a sorted set in redis for that. 

It does.
Thanks Aaron.

> 
> Hope that helps. 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 11 Jun 2011, at 00:36, Ian Holsman wrote:
> 
>> 
>> On Jun 9, 2011, at 10:04 PM, aaron morton wrote:
>> 
>>> I may be missing something but could you use a column for each of the last 48 hours all in the same row for a url ?
>>> 
>>> e.g. 
>>> {
>>> 	"/url.com/hourly" : {
>>> 		"20110609T01:00:00" : 456,
>>> 		"20110609T02:00:00" : 4567,
>>> 	}
>>> }
>> 
>> yes.. that would work better... I was storing all the different times in the same row.
>> {
>> 	"/url.com" : {
>> 	 "H-20110609T01:00:00" : 456,
>> 	 "H-0110609T02:00:00" : 4567,
>> 	 "D-0110609" : 5678,
>> 	}
>> }
>> 
>> I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. 
>> 
>>> 
>>> Increment the current hour only. Delete the older columns either when a read detects there are old values or as a maintenance job. Or as part of writing values for the first 5 minutes of any hour. 
>> 
>> yes.. I thought of that. The problem with doing it on read is there may be a case where a old URL never gets read.. so it will just sit there taking up space.. the maintenance job is the route I went down.
>> 
>>> 
>>> The row will get spread out over a lot of sstables which may reduce read speed. If this is a problem consider a separate CF with more aggressive GC and compaction settings. 
>> 
>> Thanks!
>>> 
>>> Cheers
>>> 
>>> 
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> 
>>> On 10 Jun 2011, at 09:28, Ian Holsman wrote:
>>> 
>>>> So would doing something like storing it in reverse (so I know what to delete) work? Or is storing a million columns in a supercolumn impossible. 
>>>> 
>>>> I could always use a logfile and run the archiver off that as a worst case I guess. 
>>>> Would doing so many deletes screw up the db/cause other problems?
>>>> 
>>>> ---
>>>> Ian Holsman - 703 879-3128
>>>> 
>>>> I saw the angel in the marble and carved until I set him free -- Michelangelo
>>>> 
>>>> On 09/06/2011, at 4:22 PM, Ryan King <ry...@twitter.com> wrote:
>>>> 
>>>>> On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman <ha...@holsman.net> wrote:
>>>>>> Hi Ryan.
>>>>>> you wouldn't have your version of cassandra up on github would you??
>>>>> 
>>>>> No, and the patch isn't in our version yet either. We're still working on it.
>>>>> 
>>>>> -ryan
>>> 
>> 
> 


Re: need some help with counters

Posted by aaron morton <aa...@thelastpickle.com>.
> I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. 

AFAIK that's not a great application for counters. You would need range support in the secondary indexes so you could get the first X rows ordered by a column value. 

To be honest, depending on scale, I'd consider a sorted set in redis for that. 
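A toy stand-in for what a sorted set buys you, using only stdlib Python rather than Redis, might look like this:

```python
import heapq

def top_n(hour_counts, n=5):
    """Return the n (url, count) pairs with the highest counts for one
    hour bucket -- roughly what ZREVRANGE on a Redis sorted set gives
    you, computed here from a plain dict for illustration."""
    return heapq.nlargest(n, hour_counts.items(), key=lambda kv: kv[1])
```

The difference at scale is that a Redis sorted set (ZINCRBY) maintains this ordering incrementally on every write, rather than recomputing it per query.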

Hope that helps. 
  
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 11 Jun 2011, at 00:36, Ian Holsman wrote:

> 
> On Jun 9, 2011, at 10:04 PM, aaron morton wrote:
> 
>> I may be missing something but could you use a column for each of the last 48 hours all in the same row for a url ?
>> 
>> e.g. 
>> {
>> 	"/url.com/hourly" : {
>> 		"20110609T01:00:00" : 456,
>> 		"20110609T02:00:00" : 4567,
>> 	}
>> }
> 
> yes.. that would work better... I was storing all the different times in the same row.
> {
> 	"/url.com" : {
> 	 "H-20110609T01:00:00" : 456,
> 	 "H-0110609T02:00:00" : 4567,
> 	 "D-0110609" : 5678,
> 	}
> }
> 
> I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. 
> 
>> 
>> Increment the current hour only. Delete the older columns either when a read detects there are old values or as a maintenance job. Or as part of writing values for the first 5 minutes of any hour. 
> 
> yes.. I thought of that. The problem with doing it on read is there may be a case where a old URL never gets read.. so it will just sit there taking up space.. the maintenance job is the route I went down.
> 
>> 
>> The row will get spread out over a lot of sstables which may reduce read speed. If this is a problem consider a separate CF with more aggressive GC and compaction settings. 
> 
> Thanks!
>> 
>> Cheers
>> 
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 10 Jun 2011, at 09:28, Ian Holsman wrote:
>> 
>>> So would doing something like storing it in reverse (so I know what to delete) work? Or is storing a million columns in a supercolumn impossible. 
>>> 
>>> I could always use a logfile and run the archiver off that as a worst case I guess. 
>>> Would doing so many deletes screw up the db/cause other problems?
>>> 
>>> ---
>>> Ian Holsman - 703 879-3128
>>> 
>>> I saw the angel in the marble and carved until I set him free -- Michelangelo
>>> 
>>> On 09/06/2011, at 4:22 PM, Ryan King <ry...@twitter.com> wrote:
>>> 
>>>> On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman <ha...@holsman.net> wrote:
>>>>> Hi Ryan.
>>>>> you wouldn't have your version of cassandra up on github would you??
>>>> 
>>>> No, and the patch isn't in our version yet either. We're still working on it.
>>>> 
>>>> -ryan
>> 
> 


Re: need some help with counters

Posted by Ian Holsman <ha...@holsman.net>.
On Jun 9, 2011, at 10:04 PM, aaron morton wrote:

> I may be missing something but could you use a column for each of the last 48 hours all in the same row for a url ?
> 
> e.g. 
> {
> 	"/url.com/hourly" : {
> 		"20110609T01:00:00" : 456,
> 		"20110609T02:00:00" : 4567,
> 	}
> }

yes.. that would work better... I was storing all the different times in the same row.
{
	"/url.com" : {
	 "H-20110609T01:00:00" : 456,
	 "H-20110609T02:00:00" : 4567,
	 "D-20110609" : 5678,
	}
}

I am wondering how to index on the most recent hour as well. (ie show me top 5 URLs type query).. 

> 
> Increment the current hour only. Delete the older columns either when a read detects there are old values or as a maintenance job. Or as part of writing values for the first 5 minutes of any hour. 

yes.. I thought of that. The problem with doing it on read is there may be a case where an old URL never gets read.. so it will just sit there taking up space.. the maintenance job is the route I went down.

> 
> The row will get spread out over a lot of sstables which may reduce read speed. If this is a problem consider a separate CF with more aggressive GC and compaction settings. 

Thanks!
> 
> Cheers
> 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 10 Jun 2011, at 09:28, Ian Holsman wrote:
> 
>> So would doing something like storing it in reverse (so I know what to delete) work? Or is storing a million columns in a supercolumn impossible. 
>> 
>> I could always use a logfile and run the archiver off that as a worst case I guess. 
>> Would doing so many deletes screw up the db/cause other problems?
>> 
>> ---
>> Ian Holsman - 703 879-3128
>> 
>> I saw the angel in the marble and carved until I set him free -- Michelangelo
>> 
>> On 09/06/2011, at 4:22 PM, Ryan King <ry...@twitter.com> wrote:
>> 
>>> On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman <ha...@holsman.net> wrote:
>>>> Hi Ryan.
>>>> you wouldn't have your version of cassandra up on github would you??
>>> 
>>> No, and the patch isn't in our version yet either. We're still working on it.
>>> 
>>> -ryan
> 


Re: need some help with counters

Posted by aaron morton <aa...@thelastpickle.com>.
I may be missing something but could you use a column for each of the last 48 hours all in the same row for a url ?

e.g. 
{
	"/url.com/hourly" : {
		"20110609T01:00:00" : 456,
		"20110609T02:00:00" : 4567,
	}
}

Increment the current hour only. Delete the older columns either when a read detects there are old values or as a maintenance job. Or as part of writing values for the first 5 minutes of any hour. 
 
The row will get spread out over a lot of sstables which may reduce read speed. If this is a problem consider a separate CF with more aggressive GC and compaction settings. 
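The prune-on-write variant suggested above can be sketched like this (a plain dict stands in for the per-URL row; column names follow the format in the example, and this is illustrative rather than an actual client call):

```python
from datetime import datetime, timedelta

HOURS_KEPT = 48

def increment_hour(row, now, amount=1):
    """Increment the current hour's column and drop columns that have
    fallen out of the 48-hour window (prune-on-write)."""
    col = now.strftime("%Y%m%dT%H:00:00")
    row[col] = row.get(col, 0) + amount
    # Column names sort lexicographically in time order, so a string
    # comparison against the cutoff identifies expired columns.
    cutoff = (now - timedelta(hours=HOURS_KEPT)).strftime("%Y%m%dT%H:00:00")
    for old in [c for c in row if c < cutoff]:
        del row[old]
```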
 
Cheers


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Jun 2011, at 09:28, Ian Holsman wrote:

> So would doing something like storing it in reverse (so I know what to delete) work? Or is storing a million columns in a supercolumn impossible. 
> 
> I could always use a logfile and run the archiver off that as a worst case I guess. 
> Would doing so many deletes screw up the db/cause other problems?
> 
> ---
> Ian Holsman - 703 879-3128
> 
> I saw the angel in the marble and carved until I set him free -- Michelangelo
> 
> On 09/06/2011, at 4:22 PM, Ryan King <ry...@twitter.com> wrote:
> 
>> On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman <ha...@holsman.net> wrote:
>>> Hi Ryan.
>>> you wouldn't have your version of cassandra up on github would you??
>> 
>> No, and the patch isn't in our version yet either. We're still working on it.
>> 
>> -ryan


Re: need some help with counters

Posted by Ian Holsman <ha...@holsman.net>.
So would doing something like storing it in reverse (so I know what to delete) work? Or is storing a million columns in a supercolumn impossible. 

I could always use a logfile and run the archiver off that as a worst case I guess. 
Would doing so many deletes screw up the db/cause other problems?

---
Ian Holsman - 703 879-3128

I saw the angel in the marble and carved until I set him free -- Michelangelo

On 09/06/2011, at 4:22 PM, Ryan King <ry...@twitter.com> wrote:

> On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman <ha...@holsman.net> wrote:
>> Hi Ryan.
>> you wouldn't have your version of cassandra up on github would you??
> 
> No, and the patch isn't in our version yet either. We're still working on it.
> 
> -ryan

Re: need some help with counters

Posted by Ryan King <ry...@twitter.com>.
On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman <ha...@holsman.net> wrote:
> Hi Ryan.
> you wouldn't have your version of cassandra up on github would you??

No, and the patch isn't in our version yet either. We're still working on it.

-ryan

Re: need some help with counters

Posted by Ian Holsman <ha...@holsman.net>.
Hi Ryan.
you wouldn't have your version of cassandra up on github would you??

Colin.. always a pleasure.

On Jun 9, 2011, at 3:44 PM, Ryan King wrote:

> On Thu, Jun 9, 2011 at 12:41 PM, Ian Holsman <ha...@holsman.net> wrote:
>> Hi.
>> 
>> I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was wondering if anyone can help me with my problem.
>> 
>> I want to keep some page-view stats on a URL at different levels of granularity (page views per hour, page views per day, page views per year etc etc).
>> 
>> 
>> so my thinking was to create something a counter with a key based on Year-Month-Day-Hour, and simply increment the counter as I go along.
>> 
>> this work's well and I'm getting my metrics beautifully put into the right places.
>> 
>> the only problem I have is that I only need the last 48-hours worth of metrics at the hour level.
>> 
>> how do I get rid of the old counters?
>> do I need to write a archiver that will go through each url (could be millions) and just delete them?
>> 
>> I'm sure other people have encountered this, and was wondering how they approached it.
> 
> Here's how we are going to do it at twitter:
> https://issues.apache.org/jira/browse/CASSANDRA-2735
> 
> -ryan


Re: need some help with counters

Posted by Ryan King <ry...@twitter.com>.
On Thu, Jun 9, 2011 at 12:41 PM, Ian Holsman <ha...@holsman.net> wrote:
> Hi.
>
> I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was wondering if anyone can help me with my problem.
>
> I want to keep some page-view stats on a URL at different levels of granularity (page views per hour, page views per day, page views per year etc etc).
>
>
> so my thinking was to create something a counter with a key based on Year-Month-Day-Hour, and simply increment the counter as I go along.
>
> this work's well and I'm getting my metrics beautifully put into the right places.
>
> the only problem I have is that I only need the last 48-hours worth of metrics at the hour level.
>
> how do I get rid of the old counters?
> do I need to write a archiver that will go through each url (could be millions) and just delete them?
>
> I'm sure other people have encountered this, and was wondering how they approached it.

Here's how we are going to do it at twitter:
https://issues.apache.org/jira/browse/CASSANDRA-2735

-ryan