Posted to user@mahout.apache.org by Mubarak Seyed <mu...@gmail.com> on 2010/10/20 08:26:02 UTC

Design Question

Hello,

I am looking for a solution to address my problem related to building an
intelligent system using Machine learning techniques and Data mining.

My requirements are as follows:

- The client system performs transactions through a hub.  We have historical
data and can predict the trend (min/avg/max number of transactions) for a
given interval
- Using the historical data, we need to mine the data and derive the predictions
- We need to build an intelligent system (using ML techniques, neural network
algorithms): if there is no transaction for a client within the predicted
range, the system needs to send alarms


For example, Walmart sells gift cards.  Each sale is a transaction that
needs to reach the main system (from the hub).  We have historical sales
data for Walmart (for each day, each hour, each 10 minutes, peak volume,
holiday season).  If there is no transaction from Walmart for X amount of
time and that gap is not covered by the predictions, then the intelligent
system needs to raise an alarm.


I am interested in playing around with some of the technologies such as
Machine Learning (using Mahout/Hadoop), Neural Networks, and Data mining
techniques.

Thanks in advance.

-Mubarak.

Re: Design Question

Posted by Ted Dunning <te...@gmail.com>.
There is no direct support for this in Mahout, but some of the underpinnings
are there.  One thought that I have is that
the data involved in these processes are not usually massive and can be
handled using conventional systems.  The
reason that the scale isn't so large is that you either have a very low event
rate, which means that the total number of events
is small, or you have a high event rate in which the underlying Poisson
parameters vary quite slowly relative to the inter-arrival
time.  This means that you can measure counts over time periods that are
still pretty short with respect to the parameter rate
of change and have small data again.

Given this, my suggestions are to do one or more of the following:

- use JAGS in R or BUGS for doing the hierarchical Bayesian modeling
described in this paper

- use raw R to build an MCMC sampler for this model

- experiment with variational optimization of this model

- consider simplifying the MMPP model by directly estimating the output of
the Markov model using something like a Kalman filter and short time
averages for rate parameters.  This gives an incredibly simple model with
very good performance.  For instance, I have done this to create a system
that alerts when sales on a web site stop happening.  The method I used was
to take hourly estimates of rates and build a linear model based on the rate
for the same hour one week ago and one day ago.  Then, I could build a
Poisson process alert based simply on inter-arrival time and desired false
positive rate.  Normally I set the false positive rate to about one alarm
per week or two.  This worked extremely well.

- when you need to deploy the system and know specifically what you want to
do, come back to Mahout to code the system
using the basic numerical mathematical algorithms that you have developed in
the first three options.
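
The inter-arrival alert in the simplified-MMPP suggestion above can be
sketched roughly as follows.  This is only an illustration under stated
assumptions, not the code from the system described: it assumes the rate is
constant within the current hour, and the function names and the example
rate of 100 sales per hour are hypothetical.  The idea is that in a Poisson
process with rate r, each event is followed by a gap longer than t with
probability exp(-r * t), so gaps longer than t occur about r * exp(-r * t)
times per hour.  Setting that equal to the allowed false alarm rate and
solving for t gives the silence threshold.

```python
import math


def silence_threshold_hours(rate_per_hour, false_alarms_per_hour):
    """Gap length t (in hours) that a Poisson process with the given rate
    exceeds about `false_alarms_per_hour` times per hour."""
    if rate_per_hour <= false_alarms_per_hour:
        raise ValueError("event rate must exceed the allowed false alarm rate")
    # Solve rate * exp(-rate * t) = false_alarms_per_hour for t.
    return math.log(rate_per_hour / false_alarms_per_hour) / rate_per_hour


def should_alert(hours_since_last_event, rate_per_hour, false_alarms_per_hour):
    """True when the current silence is longer than the threshold."""
    threshold = silence_threshold_hours(rate_per_hour, false_alarms_per_hour)
    return hours_since_last_event > threshold


# Hypothetical example: 100 sales per hour expected, one false alarm per week.
ONE_PER_WEEK = 1.0 / (7 * 24)
threshold_minutes = 60 * silence_threshold_hours(100.0, ONE_PER_WEEK)
print("alert after %.1f minutes of silence" % threshold_minutes)
```

In practice the rate passed in would be the predicted rate for the current
hour, so the threshold stretches automatically during quiet hours and
tightens during busy ones.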

The reason that I suggest this is that Mahout is not a super efficient
experimental platform because for experimental purposes,
efficiency is measured in developer time, not run time.  Mahout does provide
good deployment efficiency because it supports scaling
well, but this comes at a developer time cost.

Speak up if my suggestions are silly.  You certainly know your problem
better than I do.

And while you are at it, can you say what your data represent?  Can you
publish your data?


On Sun, Oct 31, 2010 at 1:17 AM, Mubarak Seyed <mu...@gmail.com> wrote:

> Thanks Ted.
>
> Is there any way to use MMPP (Markov-modulated Poisson process) algorithm
> (www.datalab.uci.edu/papers/tkdd07.pdf) in Mahout 0.4?
> Can you please direct me to some examples?
>
> Thanks,
> Mubarak

Re: Design Question

Posted by Mubarak Seyed <mu...@gmail.com>.
Thanks Ted.

Is there any way to use MMPP (Markov-modulated Poisson process) algorithm
(www.datalab.uci.edu/papers/tkdd07.pdf) in Mahout 0.4?
Can you please direct me to some examples?

Thanks,
Mubarak


On Wed, Oct 20, 2010 at 4:06 PM, Ted Dunning <te...@gmail.com> wrote:

> For many situations, this can be done very simply, especially if you are
> working with web-based systems.  For that case,
> it is straightforward to model transactions coming as a Poisson process
> with
> a time varying rate.  In the simplest case,
> very simple seasonality models can be used to estimate the time varying
> rate.  I have used hourly estimates from one
> day ago and one week ago as good indicators in the past.  These indicators
> did not model long weekends as well as I would
> have liked, but the alarms based on these models were better than any other
> system available.  Long-term seasonality
> was handled very well because of the short term nature of the expected
> volume estimates.  For tighter bounds,
> it should be possible to use something akin to generalized linear models to
> incorporate more information to get better
> rate predictions.  Since the failures I was trying to detect quickly were
> typically total failures, I just had to raise an alert
> as quickly as possible when the inter-transaction time exceeded a
> reasonable
> bound.  For a specified false positive rate,
> this was very easily done and results were very nearly optimal.  More
> importantly, the alerts almost always were faster
> than our CEO who had an eagle eye for these things.
>
> For brick-and-mortar systems, this can be a bit more difficult because
> business practices tend to cause some very irregular
> volumes.  If you are dealing with transactions that are being reported in
> real-time rather than in batches, then you should be
> fine.  Batch reporting based on human triggers could probably be handled
> using longer/softer rate averaging windows, however.
>
> I really don't expect that you need anything all that fancy for the rate
> estimation.
>
> Can you say more about your data?  Can you post anonymous sample data for a
> two week period?



-- 
Thanks,
Mubarak Seyed.

Re: Design Question

Posted by Ted Dunning <te...@gmail.com>.
For many situations, this can be done very simply, especially if you are
working with web-based systems.  For that case,
it is straightforward to model transactions coming as a Poisson process with
a time varying rate.  In the simplest case,
very simple seasonality models can be used to estimate the time varying
rate.  I have used hourly estimates from one
day ago and one week ago as good indicators in the past.  These indicators
did not model long weekends as well as I would
have liked, but the alarms based on these models were better than any other
system available.  Long-term seasonality
was handled very well because of the short term nature of the expected
volume estimates.  For tighter bounds,
it should be possible to use something akin to generalized linear models to
incorporate more information to get better
rate predictions.  Since the failures I was trying to detect quickly were
typically total failures, I just had to raise an alert
as quickly as possible when the inter-transaction time exceeded a reasonable
bound.  For a specified false positive rate,
this was very easily done and results were very nearly optimal.  More
importantly, the alerts almost always were faster
than our CEO who had an eagle eye for these things.
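
As a rough illustration of the rate estimate described above, here is a
minimal sketch that predicts an hourly count from the counts one day ago and
one week ago.  It assumes a plain least-squares fit with no intercept, which
is only one possible reading of "very simple seasonality models"; the
function names, the lag defaults, and the synthetic history are all made up
for the example.

```python
def fit_lag_model(hourly_counts, day_lag=24, week_lag=168):
    """Least-squares fit of
    count[t] ~ a * count[t - day_lag] + b * count[t - week_lag]."""
    # Accumulate the 2x2 normal equations over every hour that has both lags.
    sxx = sxy = syy = sxz = syz = 0.0
    for t in range(week_lag, len(hourly_counts)):
        x = hourly_counts[t - day_lag]   # same hour, one day ago
        y = hourly_counts[t - week_lag]  # same hour, one week ago
        z = hourly_counts[t]             # value to be explained
        sxx += x * x
        sxy += x * y
        syy += y * y
        sxz += x * z
        syz += y * z
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("lagged counts are collinear or history is too short")
    # Solve the normal equations with Cramer's rule.
    a = (sxz * syy - syz * sxy) / det
    b = (syz * sxx - sxz * sxy) / det
    return a, b


def predict_next_hour(hourly_counts, a, b, day_lag=24, week_lag=168):
    """Predicted count for the hour right after the end of the history."""
    t = len(hourly_counts)
    return a * hourly_counts[t - day_lag] + b * hourly_counts[t - week_lag]


# Synthetic history, made up for illustration: one arbitrary week, then three
# weeks generated by the same kind of lag relationship the model assumes.
history = [float((t * 37) % 101 + 10) for t in range(168)]
for t in range(168, 4 * 168):
    history.append(0.3 * history[t - 24] + 0.7 * history[t - 168])

a, b = fit_lag_model(history)
print(a, b, predict_next_hour(history, a, b))
```

The predicted count for the coming hour can then serve as the Poisson rate
when deciding how long a silence is too long.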

For brick-and-mortar systems, this can be a bit more difficult because
business practices tend to cause some very irregular
volumes.  If you are dealing with transactions that are being reported in
real-time rather than in batches, then you should be
fine.  Batch reporting based on human triggers could probably be handled
using longer/softer rate averaging windows, however.
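
One way to read "longer/softer rate averaging windows" is an exponentially
weighted average of the hourly counts, sketched below.  This is a swapped-in
technique chosen for illustration, not something described in this thread;
the function name and the half-life parameter are hypothetical.  A longer
half-life spreads a batch that was reported in a single hour over the
neighbouring hours, so one batch upload does not look like a spike followed
by an outage.

```python
def smooth_counts(counts, half_life_hours):
    """Exponentially weighted average of hourly counts.
    A larger half-life gives a longer, softer averaging window."""
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    if not counts:
        return []
    # Weight of the newest observation, chosen so that the influence of an
    # old observation halves every `half_life_hours` steps.
    alpha = 1.0 - 0.5 ** (1.0 / half_life_hours)
    smoothed = []
    estimate = float(counts[0])  # seed with the first observation
    for count in counts:
        estimate += alpha * (count - estimate)
        smoothed.append(estimate)
    return smoothed


# Hypothetical batch-reported data: a whole day's sales arrive in one hour.
batch_hours = [0, 0, 0, 0, 0, 240, 0, 0, 0, 0, 0, 0]
print(smooth_counts(batch_hours, half_life_hours=6))
```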

I really don't expect that you need anything all that fancy for the rate
estimation.

Can you say more about your data?  Can you post anonymous sample data for a
two week period?

On Tue, Oct 19, 2010 at 11:26 PM, Mubarak Seyed <mu...@gmail.com> wrote:

> My requirements are as follows:
>
> - The client system performs transactions through a hub.  We have historical
> data and can predict the trend (min/avg/max number of transactions) for a
> given interval
> - Using the historical data, we need to mine the data and derive the predictions
> - We need to build an intelligent system (using ML techniques, neural network
> algorithms): if there is no transaction for a client within the predicted
> range, the system needs to send alarms
>
>
> For example, Walmart sells gift cards.  Each sale is a transaction that
> needs to reach the main system (from the hub).  We have historical sales
> data for Walmart (for each day, each hour, each 10 minutes, peak volume,
> holiday season).  If there is no transaction from Walmart for X amount of
> time and that gap is not covered by the predictions, then the intelligent
> system needs to raise an alarm.
>