You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Radek Maciaszek <ra...@gmail.com> on 2011/01/14 12:14:37 UTC

Unique users analysis

Hi,

I am working on some large scale unique users analysis (think hundreds of
millions of records per day). Since number of all records per month goes
into many billions I am hoping that there may be some alternative to running
"SELECT DISTINCT user_unique_id..." such as sampling data or perhaps
deriving monthly numbers based on daily numbers of unique users with a use
of statistical linear regression or a similar technique.

I tried to use sampling but that does not seem to give me the results I
would expect. That is, for example if I sample every 1/100 record I would
expect to see about 1% of the total number of unique users from a given
period of time but instead I am seeing bigger numbers, for example something
like 5%. I am not sure if the law of large number applies to unique users
analysis.

Moreover it seems that "select distinct" does not scale linearly (which is
understandable) and so doing monthly unique users analysis takes way longer
than 30 x time to perform daily unique users analysis and requires a lot of
memory.

I was wondering if anyone had a similar challenge?

Many thanks,
Radek

Re: Unique users analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hello,

In case anyone will be looking for a similar solution in future I put a
short blog post on this subject:
http://www.dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/

Best,
Radek

On 14 January 2011 12:50, Radek Maciaszek <ra...@maciaszek.co.uk> wrote:

> Hi Itai,
>
> I did not think about sampling users instead of sampling records, but it
> makes a much more sense indeed.
> As it happens my ID is also hexadecimal and so I did exactly what you
> suggested. In the results my error is less than 1% comparing to observed
> values!
>
> Many thanks!!
> Radek
>
>
> On 14 January 2011 11:32, Itai Hochman <it...@outbrain.com> wrote:
>
>> We had a similar challenge and we dealt with it  by sampling based on the
>> user id.
>>
>> We have a unique id which is a random hexadecimal format- for instance
>> A12890900.
>>
>> The query is running  only on users that have an id that ends with 00.
>>
>> At the end we multiply by 256 and get a pretty close number to the real
>> number.
>>
>>
>>
>> Itai
>>
>>
>>
>> On 01/14/2011 01:14 PM, Radek Maciaszek wrote:
>>
>>  Hi,
>>>
>>> I am working on some large scale unique users analysis (think hundreds of
>>> millions of records per day). Since number of all records per month goes
>>> into many billions I am hoping that there may be some alternative to running
>>> "SELECT DISTINCT user_unique_id..." such as sampling data or perhaps
>>> deriving monthly numbers based on daily numbers of unique users with a use
>>> of statistical linear regression or a similar technique.
>>>
>>> I tried to use sampling but that does not seem to give me the results I
>>> would expect. That is, for example if I sample every 1/100 record I would
>>> expect to see about 1% of the total number of unique users from a given
>>> period of time but instead I am seeing bigger numbers, for example something
>>> like 5%. I am not sure if the law of large number applies to unique users
>>> analysis.
>>>
>>> Moreover it seems that "select distinct" does not scale linearly (which
>>> is understandable) and so doing monthly unique users analysis takes way
>>> longer than 30 x time to perform daily unique users analysis and requires a
>>> lot of memory.
>>>
>>> I was wondering if anyone had a similar challenge?
>>>
>>> Many thanks,
>>> Radek
>>>
>>
>>
>
>
>

Re: Unique users analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hi Itai,

I did not think about sampling users instead of sampling records, but it
makes a much more sense indeed.
As it happens my ID is also hexadecimal and so I did exactly what you
suggested. In the results my error is less than 1% comparing to observed
values!

Many thanks!!
Radek

On 14 January 2011 11:32, Itai Hochman <it...@outbrain.com> wrote:

> We had a similar challenge and we dealt with it  by sampling based on the
> user id.
>
> We have a unique id which is a random hexadecimal format- for instance
> A12890900.
>
> The query is running  only on users that have an id that ends with 00.
>
> At the end we multiply by 256 and get a pretty close number to the real
> number.
>
>
>
> Itai
>
>
>
> On 01/14/2011 01:14 PM, Radek Maciaszek wrote:
>
>  Hi,
>>
>> I am working on some large scale unique users analysis (think hundreds of
>> millions of records per day). Since number of all records per month goes
>> into many billions I am hoping that there may be some alternative to running
>> "SELECT DISTINCT user_unique_id..." such as sampling data or perhaps
>> deriving monthly numbers based on daily numbers of unique users with a use
>> of statistical linear regression or a similar technique.
>>
>> I tried to use sampling but that does not seem to give me the results I
>> would expect. That is, for example if I sample every 1/100 record I would
>> expect to see about 1% of the total number of unique users from a given
>> period of time but instead I am seeing bigger numbers, for example something
>> like 5%. I am not sure if the law of large number applies to unique users
>> analysis.
>>
>> Moreover it seems that "select distinct" does not scale linearly (which is
>> understandable) and so doing monthly unique users analysis takes way longer
>> than 30 x time to perform daily unique users analysis and requires a lot of
>> memory.
>>
>> I was wondering if anyone had a similar challenge?
>>
>> Many thanks,
>> Radek
>>
>
>

Re: Unique users analysis

Posted by Itai Hochman <it...@outbrain.com>.
We had a similar challenge and we dealt with it  by sampling based on 
the user id.

We have a unique id which is a random hexadecimal format- for instance 
A12890900.

The query is running  only on users that have an id that ends with 00.

At the end we multiply by 256 and get a pretty close number to the real 
number.



Itai


On 01/14/2011 01:14 PM, Radek Maciaszek wrote:

> Hi,
>
> I am working on some large scale unique users analysis (think hundreds 
> of millions of records per day). Since number of all records per month 
> goes into many billions I am hoping that there may be some alternative 
> to running "SELECT DISTINCT user_unique_id..." such as sampling data 
> or perhaps deriving monthly numbers based on daily numbers of unique 
> users with a use of statistical linear regression or a similar technique.
>
> I tried to use sampling but that does not seem to give me the results 
> I would expect. That is, for example if I sample every 1/100 record I 
> would expect to see about 1% of the total number of unique users from 
> a given period of time but instead I am seeing bigger numbers, for 
> example something like 5%. I am not sure if the law of large number 
> applies to unique users analysis.
>
> Moreover it seems that "select distinct" does not scale linearly 
> (which is understandable) and so doing monthly unique users analysis 
> takes way longer than 30 x time to perform daily unique users analysis 
> and requires a lot of memory.
>
> I was wondering if anyone had a similar challenge?
>
> Many thanks,
> Radek