Posted to user@hbase.apache.org by Sam Wu <sw...@gmail.com> on 2013/11/14 00:26:59 UTC

hbase suitable for churn analysis ?

Hi all,

I am thinking about using Random Forest to do churn analysis, with HBase as the NoSQL data store.
Currently, all the user history (basically many types of event data) resides in S3 & Redshift (we have one table per date, per event).
Events include startTime, endTime, and other pertinent information.

We are thinking about converting all the event tables into one fat table (with other helper parameter tables), with one row per user, using HBase.

Each row will have the user ID as the key, with column family/qualifier pairs, e.g. column families d1, d2, ..., d30 (days in the system), and the qualifier as the event type. Since initially we are more interested in new-user retention, 30 days might be a good starting point.

We can label a record as churned if there is no activity for 10 continuous days.
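The 10-silent-days rule described here can be sketched as a small labeling function. This is only an illustration of the rule; the field names are made up, not part of any actual schema:

```python
def label_churn(active_days, window=10):
    """Label a user as churned if they have `window` consecutive inactive days.

    `active_days` is one boolean per day in the system
    (True = some activity that day).
    """
    streak = 0
    for active in active_days:
        if active:
            streak = 0          # any activity resets the inactive streak
        else:
            streak += 1
            if streak >= window:
                return True     # 10 silent days in a row -> churned
    return False
```

For example, a user active for the first 5 days and then silent for 10 would be labeled churned, while a user who logs in every other day would not.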

If the schema looks good, we will ingest the data from S3 into HBase, then run Random Forest to classify new profile data.
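A minimal sketch of the classification step, using scikit-learn's RandomForestClassifier on synthetic data. The feature matrix here is made up (random daily activity counts); in practice each row would be built from a user's HBase row (one value per day/event type), and the label from the 10-silent-days rule:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_users, n_days = 200, 30

# Synthetic features: activity count per user per day.
X = rng.poisson(2.0, size=(n_users, n_days)).astype(float)
# Force a handful of "churners": completely silent for the last 10 days.
X[:20, -10:] = 0.0
y = (X[:, -10:].sum(axis=1) == 0).astype(int)  # 1 = churned

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
churn_prob = clf.predict_proba(X)[:, 1]  # churn probability per user
```

Users with a high `churn_prob` would be the ones targeted for retention actions.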

Is this type of data a good candidate for HBase?
Opinions are highly appreciated.


BR

Sam

Re: hbase suitable for churn analysis ?

Posted by sam wu <sw...@gmail.com>.
Thanks for the great info




Re: hbase suitable for churn analysis ?

Posted by James Taylor <jt...@salesforce.com>.
We ingest logs using Pig to write Phoenix-compliant HFiles, load those into
HBase and then use Phoenix (https://github.com/forcedotcom/phoenix) to
query directly over the HBase data through SQL.

Regards,
James



Re: hbase suitable for churn analysis ?

Posted by sam wu <sw...@gmail.com>.
We ingest data from logs (one file/table per event, per date) into HBase offline on a daily basis, so we can get the no_day info.
My thoughts on churn analysis are based on two types of users:
Green (young, maybe < 7 days in the system): predict churn based on the first 7(?) days of activity, ideally while the user is still logging into the system, and if the churn probability is high, offer some sweeteners (rewards) to keep them around longer.
Senior users: predict churn based on a weekly(?) summary.

One way to accomplish this is to have one detailed daily table and some summary (weekly?) tables. New daily data gets ingested into the daily table; once a week, we summarize/move older daily data into the weekly table.
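The daily-to-weekly rollup described here could look roughly like the following sketch. The key shapes and event names are illustrative; the real input would be scans over the daily HBase table:

```python
from collections import defaultdict

def rollup_weekly(daily_rows):
    """Summarize daily event counts into weekly totals.

    `daily_rows` maps (user_id, day_number) -> {event_type: count},
    with day_number starting at 1 (days since registration).
    """
    weekly = defaultdict(lambda: defaultdict(int))
    for (user, day), events in daily_rows.items():
        week = (day - 1) // 7 + 1  # days 1-7 -> week 1, days 8-14 -> week 2, ...
        for event, count in events.items():
            weekly[(user, week)][event] += count
    return {key: dict(counts) for key, counts in weekly.items()}
```

After the rollup, the corresponding daily rows could be deleted (or aged out with a TTL) so only recent detail is kept.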




Re: hbase suitable for churn analysis ?

Posted by Pradeep Gollakota <pr...@gmail.com>.
I'm a little curious how you would be able to use no_of_days as a column qualifier at all... it changes every day for all users, right? So how will you keep your table updated?



Re: hbase suitable for churn analysis ?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
You can probably use your no_day as a column qualifier.

Column families are best suited to grouping column qualifiers with the same access (read/write) pattern. So if all your column qualifiers have the same pattern, simply put them in the same family.

JM
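A toy model of this single-family layout: one row per user, one column family, and qualifiers that combine the day number with the event type. Plain dicts stand in for HBase cells here; the family name "e" and the qualifier format are illustrative assumptions, not a fixed convention:

```python
def qualifier(no_day, event_type):
    # Zero-pad the day so qualifiers sort chronologically within a row.
    return f"d{no_day:02d}:{event_type}"

table = {
    "user-42": {                          # row key = user id
        "e": {                            # one column family for all events
            qualifier(1, "login"): "09:15",
            qualifier(1, "purchase"): "4.99",
            qualifier(2, "login"): "18:02",
        }
    }
}
```

Because all the event columns share one access pattern (written once per day, read together for feature extraction), a single family avoids the overhead of 30 separate ones.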



Re: hbase suitable for churn analysis ?

Posted by sam wu <sw...@gmail.com>.
Thanks for the advice.
What about a key of userId + no_day (days since the user registered), a column family per typeEvent, and qualifiers for the detailed transactions?



Re: hbase suitable for churn analysis ?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Sam,

So are you saying that you will have about 30 column families? If so, I don't think it's a good idea.

JM

