Posted to user@hive.apache.org by Vinay Seth Mohta <vi...@gmail.com> on 2010/02/03 20:10:33 UTC

Implementing cohort analysis

Hi,

I've been thinking of using Hadoop/Hive to do cohort analysis on a
large data set.  The general structure of the problem is:
- define a cohort of users using some criteria (e.g. users who visited
on 2010-01-10, users who visited an experimental landing page,
etc.)
- track their behavior over time (e.g. users who visited landing page
v2 had twice as many sessions in the following 10 days as users who
visited landing page v1)

While I've seen many folks include bullets in presentations that
indicate that they use Hadoop/Hive for cohort analysis, I haven't seen
any good examples of how they implement it (at least not via Google).

The two approaches that I see are:

1) use Hadoop streaming:
    - first, run a map-only job to select the cohort and create an
output file with all user ids that you want to track
    - second, run a map/reduce job that clusters by user, where the
mapper filters the data to only the user ids identified in the previous
job (e.g. via a hash lookup) and the reducer computes behavior for this
subset of users

2) use Hive with a WHERE clause to limit the data to the cohort, then
use a UDAF so that you get all rows for each user and implement the
desired functionality in the UDAF's iterator.
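
To make approach 1 concrete, here is a rough sketch of the two jobs,
simulated in-process in plain Python rather than as real streaming
scripts. The log layout (user id, visit date, page), the sample rows and
the function names are all made up for illustration. In an actual
Hadoop streaming run the mapper and reducer would read stdin and write
stdout, and the cohort id file would have to be made available to every
mapper.

```python
from collections import defaultdict
from datetime import date

# Hypothetical log records: (user_id, visit_date, page). Invented sample data.
LOG = [
    ("u1", date(2010, 1, 10), "landing_v1"),
    ("u1", date(2010, 1, 12), "home"),
    ("u2", date(2010, 1, 10), "landing_v2"),
    ("u2", date(2010, 1, 11), "home"),
    ("u2", date(2010, 1, 15), "home"),
    ("u3", date(2010, 1, 11), "landing_v2"),
    ("u3", date(2010, 1, 12), "home"),
]


def cohort_filter_job(records, predicate):
    """Job 1 (map-only): collect the ids of users matching the cohort criterion."""
    return {user for user, day, page in records if predicate(user, day, page)}


def behavior_job(records, cohort_ids):
    """Job 2: the mapper keeps cohort rows, the reducer counts visits per user."""
    # Map phase: filter with a hash (set) lookup and emit (user_id, 1).
    mapped = [(user, 1) for user, _day, _page in records if user in cohort_ids]
    # Shuffle + reduce phase: group by user id and sum the counts.
    counts = defaultdict(int)
    for user, one in mapped:
        counts[user] += one
    return dict(counts)


# Cohort: users who visited on 2010-01-10.
cohort = cohort_filter_job(LOG, lambda user, day, page: day == date(2010, 1, 10))
visits = behavior_job(LOG, cohort)
print(sorted(cohort))
print(visits)
```

The reducer here only counts visits; any per-user behavior metric could
go in its place.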

Are there other / better ways to do cohort analysis?  Anyone have
examples they're willing to post or point me to?

Thanks in advance,
Vinay

Re: Implementing cohort analysis

Posted by zaki rahaman <za...@gmail.com>.

Why not create a table with the user ids (or whatever other cohort keys
you want) and then join it on the appropriate keys against your other
data sources to produce the subsets for each cohort? That seems a lot
less complicated than the approaches you describe.
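
A rough sketch of that idea follows. It uses Python's built-in sqlite3
module as a stand-in for Hive, so the SQL is only an approximation of
what the Hive queries would look like and the syntax may need adjusting.
The table names (visits, cohort), the columns and the sample rows are
invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (user_id TEXT, visit_date TEXT, page TEXT);
    CREATE TABLE cohort (user_id TEXT, cohort_name TEXT);
""")

# Invented sample data: each row is one visit.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("u1", "2010-01-10", "landing_v1"),
        ("u1", "2010-01-12", "home"),
        ("u2", "2010-01-10", "landing_v2"),
        ("u2", "2010-01-11", "home"),
        ("u2", "2010-01-15", "home"),
        ("u3", "2010-01-11", "landing_v2"),
        ("u3", "2010-01-12", "home"),
    ],
)

# Step 1: materialize the cohort table (one row per user and landing page seen).
conn.execute("""
    INSERT INTO cohort
    SELECT DISTINCT user_id, page
    FROM visits
    WHERE page IN ('landing_v1', 'landing_v2')
""")

# Step 2: join the cohort table against the visit data and aggregate per cohort.
rows = conn.execute("""
    SELECT c.cohort_name,
           COUNT(*)                  AS n_visits,
           COUNT(DISTINCT v.user_id) AS n_users
    FROM cohort c
    JOIN visits v ON v.user_id = c.user_id
    WHERE v.page NOT IN ('landing_v1', 'landing_v2')
    GROUP BY c.cohort_name
    ORDER BY c.cohort_name
""").fetchall()

for cohort_name, n_visits, n_users in rows:
    print(cohort_name, n_visits, n_users)
```

Once the cohort table exists, it can be reused for as many follow-up
queries as needed without re-running the filter.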




-- 
Zaki Rahaman