Posted to user@storm.apache.org by Sa Li <sa...@gmail.com> on 2014/08/20 23:44:42 UTC

distinct counting

Hi, all

I know Storm does a good job on counting and other aggregate jobs. I wonder
whether anyone has done distinct counting in Storm, and how you would set up
the sliding time window?

thanks


Alec

Re: distinct counting

Posted by Gna Phetsarath <gn...@velos.io>.
Alec:

You can create a bolt that talks to Redis, which provides a HyperLogLog (HLL)
counter and other counters for you:

http://redis.io/commands#hyperloglog

For windowing/aggregation, you can use keys of the form
key_root+hourly_timestamp, merge across the keys bounded by the window's
timestamps, and then expire old keys as needed.

Good luck!

-Gna
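
The key_root+hourly_timestamp scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory dict of Python sets as a stand-in for Redis, so counts are exact rather than HyperLogLog estimates. The names `BucketStore`, `bucket_key`, `count_window`, and `expire_before` are hypothetical and do not come from the thread. With Redis, the three steps would presumably map to the PFADD, PFMERGE/PFCOUNT, and EXPIRE commands.

```python
from collections import defaultdict

HOUR = 3600  # bucket width in seconds (assumption: hourly buckets)


def bucket_key(key_root, event_ts):
    """Build a key such as 'visitors:391234' from an epoch timestamp in seconds."""
    return f"{key_root}:{int(event_ts) // HOUR}"


class BucketStore:
    """In-memory stand-in for Redis: one distinct counter per hourly key."""

    def __init__(self):
        # key -> set of members (exact sets here, not HyperLogLog sketches)
        self.buckets = defaultdict(set)

    def add(self, key_root, event_ts, member):
        # With Redis this step would be a PFADD on the bucket key.
        self.buckets[bucket_key(key_root, event_ts)].add(member)

    def count_window(self, key_root, start_ts, end_ts):
        # Merge every hourly bucket that overlaps [start_ts, end_ts].
        # With Redis this step would be PFMERGE / PFCOUNT over the same keys.
        merged = set()
        for hour in range(int(start_ts) // HOUR, int(end_ts) // HOUR + 1):
            merged |= self.buckets.get(f"{key_root}:{hour}", set())
        return len(merged)

    def expire_before(self, key_root, cutoff_ts):
        # Drop buckets older than the cutoff hour.
        # With Redis this would be handled by EXPIRE on each bucket key.
        cutoff_hour = int(cutoff_ts) // HOUR
        for key in list(self.buckets):
            root, _, hour = key.rpartition(":")
            if root == key_root and int(hour) < cutoff_hour:
                del self.buckets[key]
```

Because the bucket is derived from the timestamp carried by the event itself, the window follows event time rather than arrival time. The sliding step is one bucket wide, so a finer step needs smaller buckets.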


On Thu, Aug 21, 2014 at 2:18 PM, Nima Movafaghrad <
nima.movafaghrad@oracle.com> wrote:

> Alec,
>
> You can use something like HyperLogLog or Bloom filters to do unique and/or
> distinct counting. Just create a bolt that does that.
>
> Nima




Re: distinct counting

Posted by Sa Li <sa...@gmail.com>.
hi, Miloš 

Thanks for the reply. Redis HLL is definitely worth a try; has anyone done that before? My understanding is that the Redis data structure is key-value, where the offset represents the user ID, for example. My concern is whether Redis adds another bottleneck to the whole system.

In addition, I think TridentReach does unique counting as well. My question is how to use an external timestamp to define time windows, since I have not seen any sample code for timestamps.


thanks 

Alec

On Aug 21, 2014, at 4:36 PM, Miloš Solujić <mi...@gmail.com> wrote:

> Alec,
> 
> For this one, I'd recommend Redis HLL, as Gna explained earlier.


Re: distinct counting

Posted by Miloš Solujić <mi...@gmail.com>.
Alec,

For this one, I'd recommend Redis HLL, as Gna explained earlier.

On 21 Aug 2014 23:31, "Sa Li" <sa...@gmail.com> wrote:

> Thanks for all the replies.
>
> I have considered integrating the java-hll package (
> https://github.com/aggregateknowledge/java-hll), which uses the murmur3
> hash function from Google, but I am getting a lot of exceptions when
> including it. I wonder whether this hash is compatible with Storm's
> distributed mechanism (I might be naive).
>
> Another option I am considering is TridentReach, which counts the unique
> people exposed to a URL. I am thinking of combining TridentReach with
> KafkaSpout. My question: should I create a fixed-size HashMap containing
> each URL and an array of its visitors? The fixed size of the hash map
> would then represent the size of the sliding window. Is this correct?
>
>
> thanks
>
> Alec

Re: distinct counting

Posted by Sa Li <sa...@gmail.com>.
Thanks for all the replies.

I have considered integrating the java-hll package (https://github.com/aggregateknowledge/java-hll), which uses the murmur3 hash function from Google, but I am getting a lot of exceptions when including it. I wonder whether this hash is compatible with Storm's distributed mechanism (I might be naive).

Another option I am considering is TridentReach, which counts the unique people exposed to a URL. I am thinking of combining TridentReach with KafkaSpout. My question: should I create a fixed-size HashMap containing each URL and an array of its visitors? The fixed size of the hash map would then represent the size of the sliding window. Is this correct?


thanks

Alec

On Aug 21, 2014, at 11:18 AM, Nima Movafaghrad <ni...@oracle.com> wrote:

> Alec,
>  
> You can use something like HyperLogLog or Bloom filters to do unique and/or distinct counting. Just create a bolt that does that.
>  
> Nima


RE: distinct counting

Posted by Nima Movafaghrad <ni...@oracle.com>.
Alec,

You can use something like HyperLogLog or Bloom filters to do unique and/or distinct counting. Just create a bolt that does that.

Nima
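
To make the HyperLogLog suggestion above more concrete, here is a minimal sketch of the data structure a bolt could keep in memory. It is illustrative only and is not the implementation used by Redis or java-hll. The class name `HyperLogLog`, the default of 2**12 registers, and the use of SHA-1 as the hash are all assumptions made for this example; a production sketch would more likely use a faster hash such as murmur3. Any deterministic hash gives the same registers on every worker, which is what lets sketches built on separate machines be merged.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash64(self, item):
        # Deterministic 64-bit hash (SHA-1 prefix, an assumption for this
        # sketch), so sketches built on different workers agree.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of two sketches of the same size: register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:            # small-range (linear counting) correction
            return m * math.log(m / zeros)
        return raw
```

Adding the same item twice leaves the registers unchanged, so duplicates are ignored. With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, and the memory used does not grow with the number of distinct items.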

 
