Posted to user@mahout.apache.org by Srivathsan Srinivas <sr...@gmail.com> on 2010/11/01 16:02:45 UTC

outlier detection in time-series using Mahout

Hi,
       Any pointers to techniques/papers that detect outliers in time-series
of very large data sets using Mahout? I am interested in seeing what
techniques are favorable for use in large-scale distributed systems using
Hadoop/Mahout.

Thanks,
Sri.

Re: outlier detection in time-series using Mahout

Posted by Srivathsan Srinivas <sr...@gmail.com>.

Hi Ashwin,
   Thanks for the pointer. I will look into it and read up on it.

-Sri.

On Wed, Nov 3, 2010 at 12:40 PM, Ashwin Jayaprakash <
ashwin.jayaprakash@gmail.com> wrote:

>
> Have you had a look at jmotif (http://code.google.com/p/jmotif/)? It looks
> interesting, except that it's GPL.
>
> Ashwin Jayaprakash.

Re: outlier detection in time-series using Mahout

Posted by Ashwin Jayaprakash <as...@gmail.com>.

Have you had a look at jmotif (http://code.google.com/p/jmotif/)? It looks
interesting, except that it's GPL.

Ashwin Jayaprakash.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/outlier-detection-in-time-series-using-Mahout-tp1822136p1836429.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: outlier detection in time-series using Mahout

Posted by Srivathsan Srinivas <sr...@gmail.com>.

On the same note, a parallelizable form of AVF (Attribute Value
Frequency) looks promising for rapid outlier detection using
Hadoop. A paper titled "Fast Parallel Outlier Detection for
Categorical Datasets using MapReduce" gives more info.
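
For readers unfamiliar with AVF, here is a minimal single-machine sketch of
the scoring idea, assuming the usual definition: a record's score is the mean
frequency of its attribute values, and the lowest-scoring records are the
outlier candidates. The function names and the toy records are hypothetical,
and the two passes only mimic what a counting job followed by a scoring job
might do on Hadoop; this is not the paper's code.

```python
from collections import Counter


def avf_scores(records):
    """Return one AVF score per record: the mean frequency of its attribute values."""
    # Pass 1 (would be a counting job): count each (attribute index, value) pair.
    counts = Counter()
    for record in records:
        for attr_index, value in enumerate(record):
            counts[(attr_index, value)] += 1
    # Pass 2 (would be a scoring job): average the counts of a record's values.
    scores = []
    for record in records:
        total = sum(counts[(i, v)] for i, v in enumerate(record))
        scores.append(total / len(record))
    return scores


def lowest_scoring(records, k):
    """Indices of the k records with the lowest AVF score (outlier candidates)."""
    scores = avf_scores(records)
    return sorted(range(len(records)), key=lambda i: scores[i])[:k]


# Hypothetical categorical records: (protocol, service, status).
records = [
    ("tcp", "web", "ok"),
    ("tcp", "web", "ok"),
    ("tcp", "web", "fail"),
    ("udp", "dns", "ok"),
    ("tcp", "web", "ok"),
    ("icmp", "other", "fail"),
]
```

Calling lowest_scoring(records, 2) on the toy data picks the "icmp" and "udp"
records, since their values are the rarest. Both passes are simple sums over
independent records, which is why the approach parallelizes well.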

I am looking for various techniques and tools that would enable me to
detect and score outliers on massive datasets that might be streaming.
I have just begun studying some techniques and got some pointers from
you all.

Thanks,
Srinivas.

On Wednesday, November 3, 2010, Srivathsan Srinivas
<sr...@gmail.com> wrote:
> Thanks. I am reading a recent paper of Keogh's, "Time series shapelets:
> a novel technique that allows accurate, interpretable and fast
> classification", published by Springer in Data Mining and Knowledge
> Discovery, 18 June 2010.
>
> I am basically skimming several papers with different ideas to see
> what can be easily and efficiently parallelized using Hadoop...
>
> Thanks much for pointing to the presentation and the paper.
>
> Srinivas.

Re: outlier detection in time-series using Mahout

Posted by Ted Dunning <te...@gmail.com>.

I tried the shapelet approach for video signature generation once upon a
time and was not enormously impressed with the accuracy/recall tradeoffs.

I expect that this was partly due to my own deficient implementation, but
I really do think that there may be better approaches, such as vector
quantization of a state space of some kind.

On Wed, Nov 3, 2010 at 6:02 PM, Srivathsan Srinivas <
srivathsan.srinivas@gmail.com> wrote:

> Thanks. I am reading a recent paper of Keogh's, "Time series shapelets:
> a novel technique that allows accurate, interpretable and fast
> classification", published by Springer in Data Mining and Knowledge
> Discovery, 18 June 2010.
>
> I am basically skimming several papers with different ideas to see
> what can be easily and efficiently parallelized using Hadoop...
>
> Thanks much for pointing to the presentation and the paper.
>
> Srinivas.

Re: outlier detection in time-series using Mahout

Posted by Srivathsan Srinivas <sr...@gmail.com>.

Thanks. I am reading a recent paper of Keogh's, "Time series shapelets:
a novel technique that allows accurate, interpretable and fast
classification", published by Springer in Data Mining and Knowledge
Discovery, 18 June 2010.

I am basically skimming several papers with different ideas to see
what can be easily and efficiently parallelized using Hadoop...

Thanks much for pointing to the presentation and the paper.

Srinivas.

On Wednesday, November 3, 2010, Federico Castanedo <fc...@inf.uc3m.es> wrote:
> Hi,
>
> 2010/11/1 Srivathsan Srinivas <sr...@gmail.com>:
>> Dear Ted,
>>
>> Thanks for pointing me to the Dirichlet mixture model. I shall look into that.
>>
>> Basically, I am looking into the autocorrelation function, control charts,
>> moving averages, population stability, and Poisson regression (much of the
>> data can be described as daily or hourly counts). I’d like to build a tool that
>> would blend these approaches into a scorecard for proactive alerting on any
>> outliers...
>>
>> For these approaches, I am interested in seeing how the time-series data can be
>> broken into manageable segments and distributed to different machines in
>> a Hadoop network.
>>
> I've never seen something similar in Hadoop, but my suggestion for a
> good algorithm for segmenting time series is:
>
> Sliding Window And Bottom-Up (SWAB) from Keogh et al. Here is the paper:
>
> http://www.cs.ucr.edu/~eamonn/icdm-01.pdf
>
> and here is a presentation:
> www-scf.usc.edu/~selinach/segmentation-slides.pd

Re: outlier detection in time-series using Mahout

Posted by Federico Castanedo <fc...@inf.uc3m.es>.

Hi,

2010/11/1 Srivathsan Srinivas <sr...@gmail.com>:
> Dear Ted,
>
> Thanks for pointing me to the Dirichlet mixture model. I shall look into that.
>
> Basically, I am looking into the autocorrelation function, control charts,
> moving averages, population stability, and Poisson regression (much of the
> data can be described as daily or hourly counts). I’d like to build a tool that
> would blend these approaches into a scorecard for proactive alerting on any
> outliers...
>
> For these approaches, I am interested in seeing how the time-series data can be
> broken into manageable segments and distributed to different machines in
> a Hadoop network.
>
I've never seen something similar in Hadoop, but my suggestion for a
good algorithm for segmenting time series is:

Sliding Window And Bottom-Up (SWAB) from Keogh et al. Here is the paper:

http://www.cs.ucr.edu/~eamonn/icdm-01.pdf

and here is a presentation:
www-scf.usc.edu/~selinach/segmentation-slides.pd
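
To give a feel for the segmentation step, here is a rough sketch of the
bottom-up half only: start from tiny segments and keep merging the cheapest
adjacent pair while a straight line still fits the merged segment well. It is
not the full SWAB algorithm, which adds a sliding buffer on top so the data
can be processed online. The function names, the max_error parameter and the
toy series are hypothetical.

```python
def fit_error(points):
    """Sum of squared residuals of the least-squares line through the points."""
    n = len(points)
    if n < 3:
        return 0.0  # a line fits one or two points exactly
    mean_x = (n - 1) / 2
    mean_y = sum(points) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(points))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in enumerate(points))


def bottom_up(series, max_error):
    """Greedily merge adjacent segments while the cheapest merge stays under max_error.

    Returns a list of (start, end) index pairs, end exclusive.
    """
    # Start from the finest segmentation: pairs of neighbouring points.
    bounds = [(i, min(i + 2, len(series))) for i in range(0, len(series), 2)]
    while len(bounds) > 1:
        # Cost of merging each segment with its right-hand neighbour.
        costs = [fit_error(series[bounds[i][0]:bounds[i + 1][1]])
                 for i in range(len(bounds) - 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        if costs[best] > max_error:
            break  # every remaining merge would fit too poorly
        bounds[best:best + 2] = [(bounds[best][0], bounds[best + 1][1])]
    return bounds


# Hypothetical series: a flat stretch followed by a ramp.
series = [0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
```

With max_error=0.01, bottom_up(series, 0.01) splits the toy series at index 6,
where the flat stretch ends and the ramp begins. This version recomputes every
merge cost on each pass, which is fine for a sketch but not for large inputs.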


> Thanks again,
> Sri.

Re: outlier detection in time-series using Mahout

Posted by Srivathsan Srinivas <sr...@gmail.com>.

Dear Ted,

Thanks for pointing me to the Dirichlet mixture model. I shall look into that.

Basically, I am looking into the autocorrelation function, control charts,
moving averages, population stability, and Poisson regression (much of the
data can be described as daily or hourly counts). I’d like to build a tool that
would blend these approaches into a scorecard for proactive alerting on any
outliers...
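
As one illustration of how two of these ideas could combine, here is a small
sketch that scores each count against a trailing moving average, using
c-chart style limits (for Poisson counts the variance equals the mean, so
sigma is taken as the square root of the moving average). It is not Poisson
regression, and the function names, window length, threshold and sample
counts are all assumptions made up for the example.

```python
import math


def c_chart_scores(counts, window=7):
    """Score each count against a trailing moving average, c-chart style.

    Returns a list of sigma-unit scores (None until a full window of history exists).
    """
    scores = []
    for t, observed in enumerate(counts):
        if t < window:
            scores.append(None)  # not enough history yet
            continue
        history = counts[t - window:t]
        mean = sum(history) / window  # trailing moving average
        sigma = math.sqrt(mean) if mean > 0 else 1.0  # Poisson: variance == mean
        scores.append((observed - mean) / sigma)
    return scores


def alerts(counts, window=7, threshold=3.0):
    """Indices whose score falls outside +/- threshold sigma."""
    return [t for t, s in enumerate(c_chart_scores(counts, window))
            if s is not None and abs(s) > threshold]


# Hypothetical daily counts with a single spike at index 8.
daily = [100, 98, 103, 101, 99, 102, 100, 97, 160, 101, 99]
```

On the toy data, alerts(daily) flags only the spike. Note that the spike then
sits inside the following windows and inflates their averages, so a real
scorecard would probably want to exclude flagged points from the history.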

For these approaches, I am interested in seeing how the time-series data can be
broken into manageable segments and distributed to different machines in
a Hadoop network.
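
One simple way to picture the splitting is fixed-size chunks that overlap by
the amount of history a windowed statistic needs, so each machine can score
its own chunk without asking a neighbour for data. The sketch below is only
an assumption about how such a split might look; the function name and
parameters are hypothetical, and on Hadoop each (start, values) pair would
become one input record, probably keyed by a series id plus the start offset.

```python
def overlapping_chunks(series, chunk_size, overlap):
    """Yield (start_index, values) chunks; consecutive chunks share `overlap` points."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while start < len(series):
        yield start, series[start:start + chunk_size]
        if start + chunk_size >= len(series):
            break  # this chunk already reaches the end of the series
        start += step
```

For example, list(overlapping_chunks(list(range(10)), 4, 2)) gives four chunks
starting at 0, 2, 4 and 6. With a trailing window of w points, an overlap of
w keeps every point's full history available inside at least one chunk.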

Thanks again,
Sri.

On Mon, Nov 1, 2010 at 10:21 AM, Ted Dunning <te...@gmail.com> wrote:

> There is nothing explicit in Mahout for this, but you could use the
> Dirichlet mixture model clustering to do this.
>
> The idea would be to express your different observed time series or short
> segments of time sequences as mixture
> models and then find regions that are not well described by this mixture
> model.  Ideally, you would have a Markov
> model underneath the mixture coefficients, but that is out of scope for
> what
> Mahout does for you right off the bat.  It
> wouldn't be too hard to merge the HMM code and the DP clustering to get
> this, though.
>
> So the answer is no.
>
> But Mahout would be a decent substrate for building your own.

Re: outlier detection in time-series using Mahout

Posted by Ted Dunning <te...@gmail.com>.

There is nothing explicit in Mahout for this, but you could use the Dirichlet
mixture model clustering to do this.

The idea would be to express your different observed time series or short
segments of time sequences as mixture models and then find regions that are
not well described by this mixture model.  Ideally, you would have a Markov
model underneath the mixture coefficients, but that is out of scope for what
Mahout does for you right off the bat.  It wouldn't be too hard to merge the
HMM code and the DP clustering to get this, though.
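
To make the "not well described" idea concrete, here is a toy sketch that
swaps in something much simpler than Mahout's Dirichlet process clustering:
a two-component 1-D Gaussian mixture fitted with plain EM, with each segment
summarized by a single number (say, its mean level). Segments with the lowest
log-likelihood under the fitted mixture are the candidates. All names, the
summary feature and the data are assumptions made up for the example, and
there is no Markov model here.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(values, k=2, iterations=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM (deterministic start)."""
    ordered = sorted(values)
    # Start the means at evenly spaced order statistics.
    means = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    spread = ordered[-1] - ordered[0]
    variances = [max((spread / (2 * k)) ** 2, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            probs = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(probs) or 1e-300  # guard against underflow to zero
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component has effectively no data; leave it alone
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


def log_likelihood(x, model):
    weights, means, variances = model
    density = sum(w * gaussian_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances))
    return math.log(density + 1e-300)


# Hypothetical per-segment mean levels: two regimes plus one odd segment (15.0).
segment_means = [10.1, 9.9, 10.0, 10.2, 9.8, 15.0, 20.1, 19.9, 20.0, 20.2, 19.8]
model = fit_gmm_1d(segment_means)
scores = [log_likelihood(x, model) for x in segment_means]
worst = min(range(len(scores)), key=scores.__getitem__)
```

Here worst ends up pointing at the segment with level 15.0, which sits
between the two regimes. A Dirichlet process mixture would additionally choose
the number of components from the data instead of fixing k up front.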

So the answer is no.

But Mahout would be a decent substrate for building your own.

On Mon, Nov 1, 2010 at 8:02 AM, Srivathsan Srinivas <
srivathsan.srinivas@gmail.com> wrote:

> Hi,
>       Any pointers to techniques/papers that detect outliers in time-series
> of very large data sets using Mahout? I am interested in seeing what
> techniques are favorable for use in large-scale distributed systems using
> Hadoop/Mahout.
>
> Thanks,
> Sri.
>