You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Tom Hubina <to...@z2live.com> on 2012/10/11 00:05:22 UTC

Rolling MAU computation

I'm trying to compute the number of active users in the previous 30 days
for each day over a date range. I can't think of any way to do it directly
within Hive so I'm wondering if you guys have any ideas.

Basically the algorithm is something like:

For each day in date range:
   SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day <
30;

Thanks for your help!

Tom

Re: Rolling MAU computation

Posted by Vijay <te...@gmail.com>.
For both simplicity and efficiency, I'd recommend making the
mausummary table partitioned on date and generate the MAU data each
day. There is no reason to generate MAU data for a given day more than
once (unless you find some problems with the source data or
something).

On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <ig...@decide.com> wrote:
> You just need to put the join condition in the WHERE clause. That way Hive
> will do a cartesian product followed by a filter.
>
> On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <to...@z2live.com> wrote:
>>
>> I think I see what you're saying about the temp table with start/end dates
>> (30x expansion makes sense) and it sounds like it should work. I just need
>> to figure out a good way to generate the table. Thanks!
>>
>> Tom
>>
>> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <ig...@decide.com> wrote:
>>>
>>> If you have a lot of data, you might have to write a custom reducer (in
>>> python) to keep track of the moving date window.
>>>
>>> If you don't have that much data, you might want to use a temp table
>>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To
>>> create such a table, you can self-join a table of unique dates using the
>>> above condition. Then, you would join your data with that table on
>>> start_date and group by end_date counting distinct user_ids. Hope I got that
>>> right :)
>>>
>>> The latter approach will essentially multiply the number of rows by 30.
>>>
>>> igor
>>> decide.com
>>>
>>>
>>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <to...@z2live.com> wrote:
>>>>
>>>> I'm trying to compute the number of active users in the previous 30 days
>>>> for each day over a date range. I can't think of any way to do it directly
>>>> within Hive so I'm wondering if you guys have any ideas.
>>>>
>>>> Basically the algorithm is something like:
>>>>
>>>> For each day in date range:
>>>>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day -
>>>> logins.day < 30;
>>>>
>>>> Thanks for your help!
>>>>
>>>> Tom
>>>>
>>>
>>
>

Re: Rolling MAU computation

Posted by Igor Tatarinov <ig...@decide.com>.
You just need to put the join condition in the WHERE clause. That way Hive
will do a cartesian product followed by a filter.

On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <to...@z2live.com> wrote:

> I think I see what you're saying about the temp table with start/end dates
> (30x expansion makes sense) and it sounds like it should work. I just need
> to figure out a good way to generate the table. Thanks!
>
> Tom
>
> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <ig...@decide.com> wrote:
>
>> If you have a lot of data, you might have to write a custom reducer (in
>> python) to keep track of the moving date window.
>>
>> If you don't have that much data, you might want to use a temp table
>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To
>> create such a table, you can self-join a table of unique dates using the
>> above condition. Then, you would join your data with that table on
>> start_date and group by end_date counting distinct user_ids. Hope I got
>> that right :)
>>
>> The latter approach will essentially multiply the number of rows by 30.
>>
>> igor
>> decide.com
>>
>>
>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <to...@z2live.com> wrote:
>>
>>> I'm trying to compute the number of active users in the previous 30 days
>>> for each day over a date range. I can't think of any way to do it directly
>>> within Hive so I'm wondering if you guys have any ideas.
>>>
>>> Basically the algorithm is something like:
>>>
>>> For each day in date range:
>>>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day -
>>> logins.day < 30;
>>>
>>>  Thanks for your help!
>>>
>>> Tom
>>>
>>>
>>
>

Re: Rolling MAU computation

Posted by Tom Hubina <to...@z2live.com>.
I think I see what you're saying about the temp table with start/end dates
(30x expansion makes sense) and it sounds like it should work. I just need
to figure out a good way to generate the table. Thanks!

Tom

On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <ig...@decide.com> wrote:

> If you have a lot of data, you might have to write a custom reducer (in
> python) to keep track of the moving date window.
>
> If you don't have that much data, you might want to use a temp table
> <start_date, end_date> such that datediff(end_date, start_date) < 30. To
> create such a table, you can self-join a table of unique dates using the
> above condition. Then, you would join your data with that table on
> start_date and group by end_date counting distinct user_ids. Hope I got
> that right :)
>
> The latter approach will essentially multiply the number of rows by 30.
>
> igor
> decide.com
>
>
> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <to...@z2live.com> wrote:
>
>> I'm trying to compute the number of active users in the previous 30 days
>> for each day over a date range. I can't think of any way to do it directly
>> within Hive so I'm wondering if you guys have any ideas.
>>
>> Basically the algorithm is something like:
>>
>> For each day in date range:
>>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
>> < 30;
>>
>>  Thanks for your help!
>>
>> Tom
>>
>>
>

Re: Rolling MAU computation

Posted by Igor Tatarinov <ig...@decide.com>.
If you have a lot of data, you might have to write a custom reducer (in
python) to keep track of the moving date window.

If you don't have that much data, you might want to use a temp table
<start_date, end_date> such that datediff(end_date, start_date) < 30. To
create such a table, you can self-join a table of unique dates using the
above condition. Then, you would join your data with that table on
start_date and group by end_date counting distinct user_ids. Hope I got
that right :)

The latter approach will essentially multiply the number of rows by 30.

igor
decide.com


On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <to...@z2live.com> wrote:

> I'm trying to compute the number of active users in the previous 30 days
> for each day over a date range. I can't think of any way to do it directly
> within Hive so I'm wondering if you guys have any ideas.
>
> Basically the algorithm is something like:
>
> For each day in date range:
>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
> < 30;
>
> Thanks for your help!
>
> Tom
>
>

Re: Rolling MAU computation

Posted by Tom Hubina <to...@z2live.com>.
An example would be awesome .. I've never used a map side join (though I'm
searching on that now .. )

Tom

On Wed, Oct 10, 2012 at 3:59 PM, Roberto Sanabria
<ro...@stumbleupon.com>wrote:

> I've done this with a map side join using a table that stores days of the
> week. I use that to drive the day im calculating the count for. Let me know
> if you need an example.
>
> Cheers,
> R
>
>
> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <to...@z2live.com> wrote:
>
>> I'm trying to compute the number of active users in the previous 30 days
>> for each day over a date range. I can't think of any way to do it directly
>> within Hive so I'm wondering if you guys have any ideas.
>>
>> Basically the algorithm is something like:
>>
>> For each day in date range:
>>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
>> < 30;
>>
>>  Thanks for your help!
>>
>> Tom
>>
>>
>

Re: Rolling MAU computation

Posted by Roberto Sanabria <ro...@stumbleupon.com>.
I've done this with a map side join using a table that stores days of the
week. I use that to drive the day im calculating the count for. Let me know
if you need an example.

Cheers,
R

On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <to...@z2live.com> wrote:

> I'm trying to compute the number of active users in the previous 30 days
> for each day over a date range. I can't think of any way to do it directly
> within Hive so I'm wondering if you guys have any ideas.
>
> Basically the algorithm is something like:
>
> For each day in date range:
>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
> < 30;
>
> Thanks for your help!
>
> Tom
>
>

Re: Rolling MAU computation

Posted by Tom Hubina <to...@z2live.com>.
The problem is that "day" is the value in the for loop.

I've tried doing a join with a table that contains the set of days, but the
problem is that you can't do a join on a range ... Hive only support
equality in the join. For example:

INSERT OVERWRITE TABLE mausummary SELECT day, COUNT(DISTINCT(userid))
    FROM days
    JOIN logins ON date_add(logins.t, 30) >= days.day AND logins.t <=
days.day
    GROUP BY day;

fails because of the range in the join.

Tom


On Wed, Oct 10, 2012 at 8:50 PM, MiaoMiao <li...@gmail.com> wrote:

> How about
> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
> < 30 GROUP BY day;
>
> On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <to...@z2live.com> wrote:
> > I'm trying to compute the number of active users in the previous 30 days
> for
> > each day over a date range. I can't think of any way to do it directly
> > within Hive so I'm wondering if you guys have any ideas.
> >
> > Basically the algorithm is something like:
> >
> > For each day in date range:
> >    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day -
> logins.day <
> > 30;
> >
> > Thanks for your help!
> >
> > Tom
> >
>

Re: Rolling MAU computation

Posted by MiaoMiao <li...@gmail.com>.
How about
SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
< 30 GROUP BY day;

On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <to...@z2live.com> wrote:
> I'm trying to compute the number of active users in the previous 30 days for
> each day over a date range. I can't think of any way to do it directly
> within Hive so I'm wondering if you guys have any ideas.
>
> Basically the algorithm is something like:
>
> For each day in date range:
>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day <
> 30;
>
> Thanks for your help!
>
> Tom
>