You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Utku Can Topçu <ut...@topcu.gen.tr> on 2012/06/13 18:28:28 UTC

Distinct Counter Proposal for Cassandra

Hi All,

Let's assume we have a use case where we need to count the number of
columns for a given key. Let's say the key is the URL and the column-name
is the IP address or any cardinality identifier.

The straight forward implementation seems to be simple, just inserting the
IP Adresses as columns under the key defined by the URL and using get_count
to count them back. However the problem here is in case of large rows
(where too many IP addresses are in); the get_count method has to
de-serialize the whole row and calculate the count. As also defined in the
user guides, it's not an O(1) operation and it's quite costly.

However, this problem seems to have better solutions if you don't have a
strict requirement for the count to be exact. There are streaming
algorithms that will provide good cardinality estimations within a
predefined failure rate, I think the most popular one seems to be the
(Hyper)LogLog algorithm, also there's an optimal one developed recently,
please check http://dl.acm.org/citation.cfm?doid=1807085.1807094

If you want to take a look at the Java implementation for LogLog,
Clearspring has both LogLog and space optimized HyperLogLog available at
https://github.com/clearspring/stream-lib

I don't see a reason why this can't be implemented in Cassandra. The
distributed nature of all these algorithms can easily be adapted to
Cassandra's model. I think most of us would love to see come cardinality
estimating columns in Cassandra.

Regards,
Utku

Re: Distinct Counter Proposal for Cassandra

Posted by Tim Wintle <ti...@gmail.com>.

Would it be possible to support this in a more general case by providing a
distributed |= operator over arbitrary byte strings (like the + operator on
counter columns), which would allow distributed bloom filters as well?

Tim Wintle

On Fri, Jun 29, 2012 at 6:31 AM, Chris Burroughs
<ch...@gmail.com>wrote:

> Well I obviously think it would be handy.  If this get's proposed and
> end's up using stream-lib don't be shy about asking for help.
>
> On a more general note, it would be great to see the special case
> Counter code become more general atomic operation code.
>
> On 06/13/2012 01:15 PM, Utku Can Topçu wrote:
> > Hi Yuki,
> >
> > I think I should have used the word discussion instead of proposal for
> the
> > mailing subject. I have quite some of a design in my mind but I think
> it's
> > not yet ripe enough to formalize. I'll try to simplify it and open a Jira
> > ticket.
> > But first I'm wondering if there would be any excitement in the community
> > for such a feature.
> >
> > Regards,
> > Utku
> >
> > On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mo...@gmail.com>
> wrote:
> >
> >> You can open JIRA ticket at
> >> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
> >>
> >> Just for the input:
> >>
> >> I had once implemented HyperLogLog counter to use internally in
> Cassandra,
> >> but it turned out I didn't need it so I just put it to gist. You can
> find
> >> it here: https://gist.github.com/2597943
> >>
> >> The above implementation and most of the other ones (including
> stream-lib)
> >> implement the optimized version of the algorithm which counts up to
> 10^9,
> >> so may need some work.
> >>
> >> Other alternative is self-learning bitmap (
> >> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
> >> understanding, is more memory efficient when counting small values.
> >>
> >> Yuki
> >>
> >> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
> >>
> >> Hi All,
> >>
> >> Let's assume we have a use case where we need to count the number of
> >> columns for a given key. Let's say the key is the URL and the
> column-name
> >> is the IP address or any cardinality identifier.
> >>
> >> The straight forward implementation seems to be simple, just inserting
> the
> >> IP Adresses as columns under the key defined by the URL and using
> get_count
> >> to count them back. However the problem here is in case of large rows
> >> (where too many IP addresses are in); the get_count method has to
> >> de-serialize the whole row and calculate the count. As also defined in
> the
> >> user guides, it's not an O(1) operation and it's quite costly.
> >>
> >> However, this problem seems to have better solutions if you don't have a
> >> strict requirement for the count to be exact. There are streaming
> >> algorithms that will provide good cardinality estimations within a
> >> predefined failure rate, I think the most popular one seems to be the
> >> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
> >> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
> >>
> >> If you want to take a look at the Java implementation for LogLog,
> >> Clearspring has both LogLog and space optimized HyperLogLog available at
> >> https://github.com/clearspring/stream-lib
> >>
> >> I don't see a reason why this can't be implemented in Cassandra. The
> >> distributed nature of all these algorithms can easily be adapted to
> >> Cassandra's model. I think most of us would love to see come cardinality
> >> estimating columns in Cassandra.
> >>
> >> Regards,
> >> Utku
> >>
> >>
> >>
> >
>
>

Re: Distinct Counter Proposal for Cassandra

Posted by Chris Burroughs <ch...@gmail.com>.

Well I obviously think it would be handy.  If this get's proposed and
end's up using stream-lib don't be shy about asking for help.

On a more general note, it would be great to see the special case
Counter code become more general atomic operation code.

On 06/13/2012 01:15 PM, Utku Can Topçu wrote:
> Hi Yuki,
> 
> I think I should have used the word discussion instead of proposal for the
> mailing subject. I have quite some of a design in my mind but I think it's
> not yet ripe enough to formalize. I'll try to simplify it and open a Jira
> ticket.
> But first I'm wondering if there would be any excitement in the community
> for such a feature.
> 
> Regards,
> Utku
> 
> On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mo...@gmail.com> wrote:
> 
>> You can open JIRA ticket at
>> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
>>
>> Just for the input:
>>
>> I had once implemented HyperLogLog counter to use internally in Cassandra,
>> but it turned out I didn't need it so I just put it to gist. You can find
>> it here: https://gist.github.com/2597943
>>
>> The above implementation and most of the other ones (including stream-lib)
>> implement the optimized version of the algorithm which counts up to 10^9,
>> so may need some work.
>>
>> Other alternative is self-learning bitmap (
>> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
>> understanding, is more memory efficient when counting small values.
>>
>> Yuki
>>
>> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
>>
>> Hi All,
>>
>> Let's assume we have a use case where we need to count the number of
>> columns for a given key. Let's say the key is the URL and the column-name
>> is the IP address or any cardinality identifier.
>>
>> The straight forward implementation seems to be simple, just inserting the
>> IP Adresses as columns under the key defined by the URL and using get_count
>> to count them back. However the problem here is in case of large rows
>> (where too many IP addresses are in); the get_count method has to
>> de-serialize the whole row and calculate the count. As also defined in the
>> user guides, it's not an O(1) operation and it's quite costly.
>>
>> However, this problem seems to have better solutions if you don't have a
>> strict requirement for the count to be exact. There are streaming
>> algorithms that will provide good cardinality estimations within a
>> predefined failure rate, I think the most popular one seems to be the
>> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
>> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
>>
>> If you want to take a look at the Java implementation for LogLog,
>> Clearspring has both LogLog and space optimized HyperLogLog available at
>> https://github.com/clearspring/stream-lib
>>
>> I don't see a reason why this can't be implemented in Cassandra. The
>> distributed nature of all these algorithms can easily be adapted to
>> Cassandra's model. I think most of us would love to see come cardinality
>> estimating columns in Cassandra.
>>
>> Regards,
>> Utku
>>
>>
>>
>

Re: Distinct Counter Proposal for Cassandra

Posted by Utku Can Topçu <ut...@topcu.gen.tr>.

Hi Yuki,

I think I should have used the word discussion instead of proposal for the
mailing subject. I have quite some of a design in my mind but I think it's
not yet ripe enough to formalize. I'll try to simplify it and open a Jira
ticket.
But first I'm wondering if there would be any excitement in the community
for such a feature.

Regards,
Utku

On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mo...@gmail.com> wrote:

> You can open JIRA ticket at
> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
>
> Just for the input:
>
> I had once implemented HyperLogLog counter to use internally in Cassandra,
> but it turned out I didn't need it so I just put it to gist. You can find
> it here: https://gist.github.com/2597943
>
> The above implementation and most of the other ones (including stream-lib)
> implement the optimized version of the algorithm which counts up to 10^9,
> so may need some work.
>
> Other alternative is self-learning bitmap (
> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
> understanding, is more memory efficient when counting small values.
>
> Yuki
>
> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
>
> Hi All,
>
> Let's assume we have a use case where we need to count the number of
> columns for a given key. Let's say the key is the URL and the column-name
> is the IP address or any cardinality identifier.
>
> The straight forward implementation seems to be simple, just inserting the
> IP Adresses as columns under the key defined by the URL and using get_count
> to count them back. However the problem here is in case of large rows
> (where too many IP addresses are in); the get_count method has to
> de-serialize the whole row and calculate the count. As also defined in the
> user guides, it's not an O(1) operation and it's quite costly.
>
> However, this problem seems to have better solutions if you don't have a
> strict requirement for the count to be exact. There are streaming
> algorithms that will provide good cardinality estimations within a
> predefined failure rate, I think the most popular one seems to be the
> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
>
> If you want to take a look at the Java implementation for LogLog,
> Clearspring has both LogLog and space optimized HyperLogLog available at
> https://github.com/clearspring/stream-lib
>
> I don't see a reason why this can't be implemented in Cassandra. The
> distributed nature of all these algorithms can easily be adapted to
> Cassandra's model. I think most of us would love to see come cardinality
> estimating columns in Cassandra.
>
> Regards,
> Utku
>
>
>

Re: Distinct Counter Proposal for Cassandra

Posted by Chris Burroughs <ch...@gmail.com>.

On 06/13/2012 01:00 PM, Yuki Morishita wrote:
> The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work.
> 
> Other alternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values.

The closest we could get to a one-size fits all would probably be an
adaptive counting scheme that uses linear counting (or self-learning
bitmap, didn't know about that one!) for small expected cardinalities
and a LogLog variant for higher ones.  It's more choices to make, but
choosing between "not too big" and "really really big" doesn't seem like
an unreasonable burden to me.

Re: Distinct Counter Proposal for Cassandra

Posted by Yuki Morishita <mo...@gmail.com>.

You can open JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA with your proposal.

Just for the input:

I had once implemented HyperLogLog counter to use internally in Cassandra, but it turned out I didn't need it so I just put it to gist. You can find it here: https://gist.github.com/2597943

The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work.

Other alternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values.

Yuki

On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:

> Hi All,
>  
> Let's assume we have a use case where we need to count the number of columns for a given key. Let's say the key is the URL and the column-name is the IP address or any cardinality identifier.
>  
> The straight forward implementation seems to be simple, just inserting the IP Adresses as columns under the key defined by the URL and using get_count to count them back. However the problem here is in case of large rows (where too many IP addresses are in); the get_count method has to de-serialize the whole row and calculate the count. As also defined in the user guides, it's not an O(1) operation and it's quite costly.
>  
> However, this problem seems to have better solutions if you don't have a strict requirement for the count to be exact. There are streaming algorithms that will provide good cardinality estimations within a predefined failure rate, I think the most popular one seems to be the (Hyper)LogLog algorithm, also there's an optimal one developed recently, please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
>  
> If you want to take a look at the Java implementation for LogLog, Clearspring has both LogLog and space optimized HyperLogLog available at https://github.com/clearspring/stream-lib
>  
> I don't see a reason why this can't be implemented in Cassandra. The distributed nature of all these algorithms can easily be adapted to Cassandra's model. I think most of us would love to see come cardinality estimating columns in Cassandra.
>  
> Regards,
> Utku