Posted to user@mahout.apache.org by "Patterson, Josh" <jp...@tva.gov> on 2009/11/20 16:16:12 UTC

mahout examples

I've noticed that the article at:

 

http://www.ibm.com/developerworks/java/library/j-mahout/

 

uses Ant while the release of Mahout 0.2 uses Maven. Also, the article's
downloadable code includes what looks like a snapshot version
of Mahout 0.2 - is that essentially a copy of the release? Are there any
other differences I should note when working through these examples?
Thanks!

 

Josh Patterson

TVA


Re: mahout examples

Posted by Sean Owen <sr...@gmail.com>.
(Back in the day this project used Ant but we moved to Maven a while
ago. I don't think that's it -- it's that Grant's package of example
code uses Ant to build. That's all.)

On Fri, Nov 20, 2009 at 4:43 PM, Ted Dunning <te...@gmail.com> wrote:
> That article is a little bit out of date.  That could be why it uses ant.
> There have been small code changes since then as well.
>
> The best strategy is probably to get started and ask lots of questions
> here.  Your experiences would make a good nugget for an updated wiki page.

Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
That article is a little bit out of date.  That could be why it uses ant.
There have been small code changes since then as well.

The best strategy is probably to get started and ask lots of questions
here.  Your experiences would make a good nugget for an updated wiki page.

On Fri, Nov 20, 2009 at 8:10 AM, Sean Owen <sr...@gmail.com> wrote:

> I think it's Grant's package of demo code that uses Ant. The main
> project indeed uses Maven. You should probably use the real 0.2
> release, which was just finalized this week. It should work with these
> examples -- at least I cannot think of any change that would break it.
> It's possible it needs a small change or two, at most.
>
> On Fri, Nov 20, 2009 at 3:16 PM, Patterson, Josh <jp...@tva.gov>
> wrote:
> > I've noticed that the article at:
> >
> >
> >
> > http://www.ibm.com/developerworks/java/library/j-mahout/
> >
> >
> >
> > uses Ant while the release of Mahout 0.2 uses Maven. Also, the article's
> > downloadable code includes what looks like a snapshot version
> > of Mahout 0.2 - is that essentially a copy of the release? Are there any
> > other differences I should note when working through these examples?
> > Thanks!
> >
> >
> >
> > Josh Patterson
> >
> > TVA
> >
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Posted by Sean Owen <sr...@gmail.com>.
I think it's Grant's package of demo code that uses Ant. The main
project indeed uses Maven. You should probably use the real 0.2
release, which was just finalized this week. It should work with these
examples -- at least I cannot think of any change that would break it.
It's possible it needs a small change or two, at most.

On Fri, Nov 20, 2009 at 3:16 PM, Patterson, Josh <jp...@tva.gov> wrote:
> I've noticed that the article at:
>
>
>
> http://www.ibm.com/developerworks/java/library/j-mahout/
>
>
>
> uses Ant while the release of Mahout 0.2 uses Maven. Also, the article's
> downloadable code includes what looks like a snapshot version
> of Mahout 0.2 - is that essentially a copy of the release? Are there any
> other differences I should note when working through these examples?
> Thanks!
>
>
>
> Josh Patterson
>
> TVA
>
>

Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
A very natural next step would be spectral clustering, for which Jake's work
on decompositions will be invaluable.

On Fri, Nov 20, 2009 at 1:44 PM, Grant Ingersoll <gs...@apache.org> wrote:

> > Once we can get the basic kmeans rolling with some known dataset we'll
> > be able to iterate on that and move towards using more complex
> > techniques and other grid timeseries data. Any suggestions or discussion
> > is greatly appreciated,
>
> I think it is pretty wide open here, so I'd suggest focusing on specific
> questions as they arise and we can help at that point.




-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Posted by Grant Ingersoll <gs...@apache.org>.
First off, great feedback from everyone on the article!  It's sometimes funny what people take from something you present.  I saw one review saying I was too thin on collab filtering and too heavy on clustering, while the consensus here seems to be the opposite!  The important thing is that y'all read it!  For that, I am grateful.

As for Ant v. Mvn, I went back and forth on it.  In the end, I chose Ant, b/c I needed to build custom targets quickly that executed a variety of steps.  Mvn is not strong at this.  In fact, you'll see in the build file I have a section that updates SVN from Mahout and then invokes Maven!

The code is indeed pre-0.2.  I was, in fact, late on the article delivery b/c I had to do a fair amount of work in Mahout to get it up to what I deemed article quality (consistent Driver programs, the ability to create Vectors, etc.)  You should definitely be using the 0.2 release now that it is out.  There were also some things I had hoped to get in for the article and the release that I simply couldn't finish in time (MAHOUT-165 and also the log-likelihood cluster labeling stuff, M-163, I think).

Some more thoughts, Josh, about your specific issues below.

On Nov 20, 2009, at 2:56 PM, Patterson, Josh wrote:

> Drew,
> Yeah, I'm also very interested in clustering and the article is mainly
> focused on collaborative filtering. I've been reading through mailing
> list archives and Mahout source code such as KMeansDriver.java to get a
> better idea on how to get clustering working before I ask the simple
> "FAQ"-type questions. We'd like to be able to: 
> 
> a.) cluster
> b.) classify
> 
> our data so we can explore the data and project / classify trends. Other
> data mining tools are a little more turn-key than Mahout, but since we
> use Hadoop we believe Mahout is the best option at this point. My short
> term goal is to take known data sets such as:
> 
> http://www-stat.ucdavis.edu/~shumway/tsa.html
> and
> http://lib.stat.cmu.edu/datasets/
> 
> and get basic clustering to work on Hadoop with Mahout.

It should be fairly straightforward to get basics working.  In fact, one of the examples on the wiki is for the use case of control systems (albeit synthetic ones), but I'll defer to others for deeper insight.

> If I can get
> clustering working well and consistently on Mahout, it's a solid
> stepping stone towards working with other data sets (and any canonical
> time series data set suggestions are appreciated here). I've worked with
> timeseries data sets and machine learning before but more with neural
> nets (and others) in non-distributed situations. Since our project is
> open source, I'd love to be able to contribute back time series related
> code to future Mahout versions.

+1.  That would be great!  I know David Hall was doing some topics over time stuff, but haven't seen him around lately to know for sure.  That was related to his LDA contribution, so it may not be applicable for you.

> Our project's home page is:
> 
> http://openpdc.codeplex.com
> 
> Some background:
> 
> http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source/
> http://news.cnet.com/8301-13846_3-10393259-62.html?tag=mncol;title
> http://earth2tech.com/2009/11/10/the-google-android-of-the-smart-grid-openpdc/
> 
> I think in terms of clustering time series data, the first step looks to
> be vectorizing the input cases with possibly the DenseVector class and
> feeding that to a basic KMeans implementation like KMeansDriver.java.

Yep.

> Once we can get the basic kmeans rolling with some known dataset we'll
> be able to iterate on that and move towards using more complex
> techniques and other grid timeseries data. Any suggestions or discussion
> is greatly appreciated,

I think it is pretty wide open here, so I'd suggest focusing on specific questions as they arise and we can help at that point.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


RE: mahout examples

Posted by "Patterson, Josh" <jp...@tva.gov>.
I believe Ted is correct: the distance function is the key with k-means.

Another suggestion that I've heard for comparing similarity between time series data is comparing "energy", or something along the lines of CUSUM:

http://www.variation.com/cpa/help/hs108.htm
http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc323.htm
http://www.public.iastate.edu/~vardeman/asqcourse/cusumcharts.pdf
http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/cumulative_sum_charts_for_small_shifts.pdf
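
(For concreteness, a minimal sketch in plain Java of the one-sided CUSUM
recursion as I read it off the NIST handbook page above -- the target mean,
allowance k, and decision limit h are parameters you would have to tune for
the data:)

// Upper (one-sided) CUSUM: accumulate deviations beyond an allowance k
// above the target mean; signal a shift when the sum crosses limit h.
public class Cusum {

  /** Returns the index of the first detected upward shift, or -1. */
  static int firstAlarm(double[] x, double target, double k, double h) {
    double s = 0.0;
    for (int i = 0; i < x.length; i++) {
      s = Math.max(0.0, s + (x[i] - target - k));
      if (s > h) {
        return i;
      }
    }
    return -1;
  }
}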

Waveforms could be broken down into sequences (or n-grams even) of units of energy change, which might simplify the comparison calculation somewhat. I've used a similar technique (task-response-thresholds) when working with Ant Colony mechanics:

http://jpatterson.floe.tv/index.php/2008/09/25/discovery-and-social-insects/
http://jpatterson.floe.tv/index.php/2009/07/19/tinyos-and-tinytermite/


in that enough change over a time period triggers a task-change / response function. Here I think it's simply "more energy change than the previous N samples triggers an output", which ends up being a feature vector. I believe that one could see an "energy pattern" derived from a waveform, and that could possibly indicate an aspect of similarity. I think this would alleviate the vertical shift problem, and then you could focus on the horizontal shift issue, which has been dealt with in IR by using various forms of n-gram techniques, intersection of forward and reverse b-trees for wildcard sequences / comparisons, etc. Of course I'm guessing there, but this problem will involve a lot of trial and error in terms of basic building blocks for timeseries comparison -- and it's a starting point.
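
(Roughly what I mean, as an untested plain-Java sketch -- the window size N
and the binary 0/1 output are made-up simplifications:)

// Untested sketch: emit 1 when a sample's change exceeds the mean
// absolute change of the previous n samples, 0 otherwise. The resulting
// 0/1 array is the crude "energy pattern" feature vector.
public class EnergyChangeFeatures {

  static double[] extract(double[] samples, int n) {
    double[] features = new double[samples.length];
    for (int i = n + 1; i < samples.length; i++) {
      // mean absolute change over the previous n samples
      double meanDelta = 0.0;
      for (int j = i - n; j < i; j++) {
        meanDelta += Math.abs(samples[j] - samples[j - 1]);
      }
      meanDelta /= n;
      double delta = Math.abs(samples[i] - samples[i - 1]);
      features[i] = delta > meanDelta ? 1.0 : 0.0;
    }
    return features;
  }
}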

Josh Patterson
TVA




-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, November 30, 2009 2:54 AM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples

N-squared isn't all that scalable.

But k-means doesn't require all n^2 distances to be computed.  It just
requires that distances to the centroids be computed.

That is generally vastly less effort and since the number of clusters is
typically bounded by a constant, it makes k-means into a nearly linear
algorithm.

On Sun, Nov 29, 2009 at 11:07 PM, prasenjit mukherjee <
pmukherjee@quattrowireless.com> wrote:

> Thanks  for sharing the article.  The article focuses mainly on
> distance computation between sequences, which will help us in creating
> the self-similarity matrix.  And then you can probably apply any
> standard self-similarity based clustering techniques ( spectral
> clustering or k-means etc. ).
>
> Approach sounds okay, except that k-means requires the n x n matrix to
> be computed, which itself could be pretty huge.  But as long as you can
> distribute ( which you apparently can ) over mapreduce/mahout it
> should be fine.
>
> -Prasen
>
> On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <jp...@tva.gov>
> wrote:
> > Prasen,
> > I've been reviewing techniques and literature on data mining in time
> > series and I found another paper that you might be interested in from
> > the time series "search" domain that deals with similarity of time
> > series data:
> >
> > http://delab.csd.auth.gr/papers/PCI99am.pdf
> >
> > Sequences are transformed into a feature vector and Euclidean distances
> > between the feature vectors are then calculated. I'm still getting this
> > concept (plus other variations) and mahout in general "mapped out". I
> > read some on suffix trees and they look very similar to k-grams and
> > permutation indexes in Information Retrieval material. I'm still
> > digesting this time series problem (and its several sub problems) but I
> > thought I'd throw that paper out there and see what you thought.
> >
> > Josh Patterson
> > TVA
> >
> > -----Original Message-----
> > From: prasen.bea@gmail.com [mailto:prasen.bea@gmail.com] On Behalf Of
> > prasenjit mukherjee
> > Sent: Saturday, November 21, 2009 12:21 AM
> > To: mahout-user@lucene.apache.org
> > Subject: Re: mahout examples
> >
> > Hi Josh,
> >
> > I too am working on  clustering time-series-data, and basically trying
> > to come up with a sequence clustering model. Would like to know how
> > you intend to use K-means to achieve that.  Are you  treating each
> > sequence as a point ?  Then, what would be your vector representation
> > of a sequence and also more importantly which metric ( distance
> > computation logic ) will you be using ?
> >
> > BTW, I am thinking along the lines of STC ( suffix-tree based clustering
> > ).
> >
> > -Prasen
> >
> > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov>
> > wrote:
> >> I think in terms of clustering time series data, the first step looks
> > to
> >> be vectorizing the input cases with possibly the DenseVector class and
> >> feeding that to a basic KMeans implementation like KMeansDriver.java.
> >> Once we can get the basic kmeans rolling with some known dataset we'll
> >> be able to iterate on that and move towards using more complex
> >> techniques and other grid timeseries data. Any suggestions or
> > discussion
> >> is greatly appreciated,
> >>
> >> Josh Patterson
> >> TVA
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
N-squared isn't all that scalable.

But k-means doesn't require all n^2 distances to be computed.  It just
requires that distances to the centroids be computed.

That is generally vastly less effort and since the number of clusters is
typically bounded by a constant, it makes k-means into a nearly linear
algorithm.
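
(To make the cost concrete, here is a toy assignment step in plain Java --
nothing Mahout-specific. Each of the n points is compared against the k
centroids only, so an iteration costs O(n*k) distance computations, never
the full n^2 pairwise matrix:)

// Toy illustration of one k-means assignment pass: distances are only
// computed point-to-centroid, so the cost per iteration is n * k.
public class KMeansAssignStep {

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Returns, for each point, the index of its nearest centroid. */
  static int[] assign(double[][] points, double[][] centroids) {
    int[] assignment = new int[points.length];
    for (int p = 0; p < points.length; p++) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) { // k centroids only
        double d = squaredDistance(points[p], centroids[c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      assignment[p] = best;
    }
    return assignment;
  }
}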

On Sun, Nov 29, 2009 at 11:07 PM, prasenjit mukherjee <
pmukherjee@quattrowireless.com> wrote:

> Thanks  for sharing the article.  The article focuses mainly on
> distance computation between sequences, which will help us in creating
> the self-similarity matrix.  And then you can probably apply any
> standard self-similarity based clustering techniques ( spectral
> clustering or k-means etc. ).
>
> Approach sounds okay, except that k-means requires the n x n matrix to
> be computed, which itself could be pretty huge.  But as long as you can
> distribute ( which you apparently can ) over mapreduce/mahout it
> should be fine.
>
> -Prasen
>
> On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <jp...@tva.gov>
> wrote:
> > Prasen,
> > I've been reviewing techniques and literature on data mining in time
> > series and I found another paper that you might be interested in from
> > the time series "search" domain that deals with similarity of time
> > series data:
> >
> > http://delab.csd.auth.gr/papers/PCI99am.pdf
> >
> > Sequences are transformed into a feature vector and Euclidean distances
> > between the feature vectors are then calculated. I'm still getting this
> > concept (plus other variations) and mahout in general "mapped out". I
> > read some on suffix trees and they look very similar to k-grams and
> > permutation indexes in Information Retrieval material. I'm still
> > digesting this time series problem (and its several sub problems) but I
> > thought I'd throw that paper out there and see what you thought.
> >
> > Josh Patterson
> > TVA
> >
> > -----Original Message-----
> > From: prasen.bea@gmail.com [mailto:prasen.bea@gmail.com] On Behalf Of
> > prasenjit mukherjee
> > Sent: Saturday, November 21, 2009 12:21 AM
> > To: mahout-user@lucene.apache.org
> > Subject: Re: mahout examples
> >
> > Hi Josh,
> >
> > I too am working on  clustering time-series-data, and basically trying
> > to come up with a sequence clustering model. Would like to know how
> > you intend to use K-means to achieve that.  Are you  treating each
> > sequence as a point ?  Then, what would be your vector representation
> > of a sequence and also more importantly which metric ( distance
> > computation logic ) will you be using ?
> >
> > BTW, I am thinking along the lines of STC ( suffix-tree based clustering
> > ).
> >
> > -Prasen
> >
> > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov>
> > wrote:
> >> I think in terms of clustering time series data, the first step looks
> > to
> >> be vectorizing the input cases with possibly the DenseVector class and
> >> feeding that to a basic KMeans implementation like KMeansDriver.java.
> >> Once we can get the basic kmeans rolling with some known dataset we'll
> >> be able to iterate on that and move towards using more complex
> >> techniques and other grid timeseries data. Any suggestions or
> > discussion
> >> is greatly appreciated,
> >>
> >> Josh Patterson
> >> TVA
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
Thanks  for sharing the article.  The article focuses mainly on
distance computation between sequences, which will help us in creating
the self-similarity matrix.  And then you can probably apply any
standard self-similarity based clustering techniques ( spectral
clustering or k-means etc. ).

Approach sounds okay, except that k-means requires the n x n matrix to
be computed, which itself could be pretty huge.  But as long as you can
distribute ( which you apparently can ) over mapreduce/mahout it
should be fine.

-Prasen

On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <jp...@tva.gov> wrote:
> Prasen,
> I've been reviewing techniques and literature on data mining in time
> series and I found another paper that you might be interested in from
> the time series "search" domain that deals with similarity of time
> series data:
>
> http://delab.csd.auth.gr/papers/PCI99am.pdf
>
> Sequences are transformed into a feature vector and Euclidean distances
> between the feature vectors are then calculated. I'm still getting this
> concept (plus other variations) and mahout in general "mapped out". I
> read some on suffix trees and they look very similar to k-grams and
> permutation indexes in Information Retrieval material. I'm still
> digesting this time series problem (and its several sub problems) but I
> thought I'd throw that paper out there and see what you thought.
>
> Josh Patterson
> TVA
>
> -----Original Message-----
> From: prasen.bea@gmail.com [mailto:prasen.bea@gmail.com] On Behalf Of
> prasenjit mukherjee
> Sent: Saturday, November 21, 2009 12:21 AM
> To: mahout-user@lucene.apache.org
> Subject: Re: mahout examples
>
> Hi Josh,
>
> I too am working on  clustering time-series-data, and basically trying
> to come up with a sequence clustering model. Would like to know how
> you intend to use K-means to achieve that.  Are you  treating each
> sequence as a point ?  Then, what would be your vector representation
> of a sequence and also more importantly which metric ( distance
> computation logic ) will you be using ?
>
> BTW, I am thinking along the lines of STC ( suffix-tree based clustering
> ).
>
> -Prasen
>
> On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov>
> wrote:
>> I think in terms of clustering time series data, the first step looks
> to
>> be vectorizing the input cases with possibly the DenseVector class and
>> feeding that to a basic KMeans implementation like KMeansDriver.java.
>> Once we can get the basic kmeans rolling with some known dataset we'll
>> be able to iterate on that and move towards using more complex
>> techniques and other grid timeseries data. Any suggestions or
> discussion
>> is greatly appreciated,
>>
>> Josh Patterson
>> TVA
>

RE: mahout examples

Posted by "Patterson, Josh" <jp...@tva.gov>.
Prasen,
I've been reviewing techniques and literature on data mining in time
series and I found another paper that you might be interested in from
the time series "search" domain that deals with similarity of time
series data:

http://delab.csd.auth.gr/papers/PCI99am.pdf

Sequences are transformed into a feature vector and Euclidean distances
between the feature vectors are then calculated. I'm still getting this
concept (plus other variations) and mahout in general "mapped out". I
read some on suffix trees and they look very similar to k-grams and
permutation indexes in Information Retrieval material. I'm still
digesting this time series problem (and its several sub problems) but I
thought I'd throw that paper out there and see what you thought.

Josh Patterson
TVA

-----Original Message-----
From: prasen.bea@gmail.com [mailto:prasen.bea@gmail.com] On Behalf Of
prasenjit mukherjee
Sent: Saturday, November 21, 2009 12:21 AM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples

Hi Josh,

I too am working on  clustering time-series-data, and basically trying
to come up with a sequence clustering model. Would like to know how
you intend to use K-means to achieve that.  Are you  treating each
sequence as a point ?  Then, what would be your vector representation
of a sequence and also more importantly which metric ( distance
computation logic ) will you be using ?

BTW, I am thinking along the lines of STC ( suffix-tree based clustering
).

-Prasen

On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov>
wrote:
> I think in terms of clustering time series data, the first step looks
to
> be vectorizing the input cases with possibly the DenseVector class and
> feeding that to a basic KMeans implementation like KMeansDriver.java.
> Once we can get the basic kmeans rolling with some known dataset we'll
> be able to iterate on that and move towards using more complex
> techniques and other grid timeseries data. Any suggestions or
discussion
> is greatly appreciated,
>
> Josh Patterson
> TVA

Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
Also, this algorithm is inherently linear in time in the size of the input.
That means it is feasible on large data.

On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <ja...@gmail.com> wrote:

> The machinery to do the above in parallel on "ridiculously big" data on
> Hadoop
> should be coming in soon with some of the stuff I'm working on contributing
> to Mahout.
>



-- 
Ted Dunning, CTO
DeepDyve

RE: mahout examples

Posted by "Patterson, Josh" <jp...@tva.gov>.
Ted, 

Good to see some suggested reading. Today I was also thinking that possibly we could agree on a canonical time series example data set to reference and test on, since it seems like everyone has slight variations in what sort of "time series" data they are using. It would also be easier for me to discuss techniques when using a known example dataset as a common reference point. When talking about decomposition techniques it would probably make things somewhat more consistent as well, and then anyone could work out from there to adapt it to their specific needs, add "secret sauce", etc.

Josh Patterson
TVA


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Sun 11/22/2009 1:58 PM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples
 
This has connections back to the theoretical literature on dynamical systems
where the term "symbolic dynamics" is used to refer to the investigation of
the sequences of symbols that can be produced by various complex systems
depending on the quantization of the state space.  The important take-away
from that world is that essentially all of the important dynamical
properties of a system can be described by using the appropriate
quantization.  Of course, figuring out what the appropriate quantization is
can be a major problem, but for many systems, almost any non-trivial
quantization is useful.

The use of long n-grams also has analogies in state-space embedding of
dynamical systems where you don't have access to the full state of a system
but instead use repeated measurements to substitute for the full state.
Again, in many cases you can do as much with an embedded state-space as you
could with the (unobservable) full state.  A great example is the classic
dripping faucet problem (Shaw's paper isn't available, but this follow-up
is: http://metal.elte.hu/botond/pdf/CHA00059.pdf).  One of the original
papers on the subject of state-space reconstruction is available on-line at
http://cos.cumt.edu.cn/jpkc/dxwl/zl/zl1/Physical%20Review%20Classics/statistical/132.pdf

The point here is that n-gram techniques for time series are really not at
all far-fetched.



On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <ja...@gmail.com> wrote:

> While I'll add the caveat also that I haven't done this for temporal data
> (other than the case which Ted is referring to, where text technically has
> some temporal
> nature to it by its sequentiality), doing this kind of thing with
> "significant
> ngrams" as Ted describes can allow you to arbitrarily keep higher order
> correlations if you
> do this in a randomized SVD: instead of just keeping the interesting
> bi-grams,
> keep *all* ngrams up to some fixed size (even as large as 5, say), then do
> random
> projection on your bag-of-ngrams to map it down from the huge
> numUniqueSymbols^{5} dimensional space (well technically this probably
> overflows sizeof(long), so you probably are wrapping around mod some big
> prime close
> to 2^64, but collisions will still be rare and will just act as noise)
> down
> to some reasonable space still larger than you think is necessary (maybe
> 1k-10k),
> *then* do the SVD there.
>



-- 
Ted Dunning, CTO
DeepDyve


Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
This has connections back to the theoretical literature on dynamical systems
where the term "symbolic dynamics" is used to refer to the investigation of
the sequences of symbols that can be produced by various complex systems
depending on the quantization of the state space.  The important take-away
from that world is that essentially all of the important dynamical
properties of a system can be described by using the appropriate
quantization.  Of course, figuring out what the appropriate quantization is
can be a major problem, but for many systems, almost any non-trivial
quantization is useful.

The use of long n-grams also has analogies in state-space embedding of
dynamical systems where you don't have access to the full state of a system
but instead use repeated measurements to substitute for the full state.
Again, in many cases you can do as much with an embedded state-space as you
could with the (unobservable) full state.  A great example is the classic
dripping faucet problem (Shaw's paper isn't available, but this follow-up
is: http://metal.elte.hu/botond/pdf/CHA00059.pdf).  One of the original
papers on the subject of state-space reconstruction is available on-line at
http://cos.cumt.edu.cn/jpkc/dxwl/zl/zl1/Physical%20Review%20Classics/statistical/132.pdf

The point here is that n-gram techniques for time series are really not at
all far-fetched.



On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <ja...@gmail.com> wrote:

> While I'll add the caveat also that I haven't done this for temporal data
> (other than the case which Ted is referring to, where text technically has
> some temporal
> nature to it by its sequentiality), doing this kind of thing with
> "significant
> ngrams" as Ted describes can allow you to arbitrarily keep higher order
> correlations if you
> do this in a randomized SVD: instead of just keeping the interesting
> bi-grams,
> keep *all* ngrams up to some fixed size (even as large as 5, say), then do
> random
> projection on your bag-of-ngrams to map it down from the huge
> numUniqueSymbols^{5} dimensional space (well technically this probably
> overflows sizeof(long), so you probably are wrapping around mod some big
> prime close
> to 2^64, but collisions will still be rare and will just act as noise)
> down
> to some reasonable space still larger than you think is necessary (maybe
> 1k-10k),
> *then* do the SVD there.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Nov 21, 2009 at 10:22 PM, Ted Dunning <te...@gmail.com> wrote:

> Please keep in mind that the approach I am suggesting here is untried on
> your kind of data.  I have used it in text and transactional streams, but
> who knows if it will generalize.
>
> If we take your example of "a b a a c" and let us assume that "a a" and "a
> b"
> have been observed to be interesting phrases.  We would reduce your
> original
> sequence to "ab aa c".  At that point, we assume that all interesting
> temporality is captured in the ab and aa.  Almost by definition, we don't
> need anything else because in previous analysis, other phrases seemed to
> occur without predictive value.  Thus, we can switch to bag-of-terms format
> at this point without loss of (important) information.  From there, we are
> only concerned with the sequence x phrase occurrence matrix and can proceed
> with SVD or random indexing or whatever else we want.
>

While I'll add the caveat also that I haven't done this for temporal data
(other than the case which Ted is referring to, where text technically has
some temporal nature to it by its sequentiality), doing this kind of thing
with "significant ngrams" as Ted describes can allow you to arbitrarily keep
higher order correlations if you do this in a randomized SVD: instead of just
keeping the interesting bi-grams, keep *all* ngrams up to some fixed size
(even as large as 5, say), then do random projection on your bag-of-ngrams to
map it down from the huge numUniqueSymbols^{5} dimensional space (well,
technically this probably overflows sizeof(long), so you are probably
wrapping around mod some big prime close to 2^64, but collisions will still
be rare and will just act as noise) down to some reasonable space still
larger than you think is necessary (maybe 1k-10k), *then* do the SVD there.
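
(A back-of-the-envelope sketch of the hashing step in plain Java -- the 1024
dimensions, the string ngrams, and the sign trick are arbitrary illustrative
choices:)

import java.util.List;

// Sketch: hash every ngram (up to maxN symbols) into a fixed, modest
// number of buckets with a pseudo-random sign, so the huge implicit
// bag-of-ngrams space collapses to DIM dimensions before the SVD.
public class NgramHasher {

  static final int DIM = 1024; // "reasonable space", e.g. 1k-10k

  static double[] project(List<String> symbols, int maxN) {
    double[] v = new double[DIM];
    for (int start = 0; start < symbols.size(); start++) {
      StringBuilder ngram = new StringBuilder();
      for (int n = 0; n < maxN && start + n < symbols.size(); n++) {
        if (n > 0) {
          ngram.append(' ');
        }
        ngram.append(symbols.get(start + n));
        int h = ngram.toString().hashCode();
        int bucket = Math.abs(h % DIM);
        int sign = ((h >>> 16) & 1) == 0 ? 1 : -1;
        v[bucket] += sign; // collisions just act as noise
      }
    }
    return v;
  }
}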

At that point, your similarities should be capturing a fair amount of the
higher order time series effects.  I'm advocating Ted's (a) basically, but
keeping even more information than you think is necessary, because why not -
if you're doing a randomized projection before SVD, the only thing to lose is
that you add too much noise, which, while it could be a problem, weighting by
some combination of log-likelihood of the subsequences and an effective IDF
should help (here I'm being pretty vague, I'll admit - this weighting is
probably pretty important if you keep as much information as I'm advocating).

The machinery to do the above in parallel on "ridiculously big" data on
Hadoop
should be coming in soon with some of the stuff I'm working on contributing
to Mahout.

  -jake

Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
Please keep in mind that the approach I am suggesting here is untried on
your kind of data.  I have used it in text and transactional streams, but
who knows if it will generalize.

If we take your example of "a b a a c" and let us assume that "a a" and "a b"
have been observed to be interesting phrases.  We would reduce your original
sequence to "ab aa c".  At that point, we assume that all interesting
temporality is captured in the ab and aa.  Almost by definition, we don't
need anything else because in previous analysis, other phrases seemed to
occur without predictive value.  Thus, we can switch to bag-of-terms format
at this point without loss of (important) information.  From there, we are
only concerned with the sequence x phrase occurrence matrix and can proceed
with SVD or random indexing or whatever else we want.

There is a serious question in this approach about what to do with
overlapping phrases.  My two answers are to either (a) keep both or (b) pick
the tiling that gives us the most interesting sequence of tokens.  Approach
(b) is basically the same algorithm that is used to segment Chinese.   See
http://technology.chtsai.org/mmseg/ for a great example.
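
(For illustration, a minimal greedy version of the tiling in plain Java -- a
real segmenter like mmseg handles ambiguity much better, and the phrase set
is assumed to come from a prior significance analysis:)

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal greedy tiling: collapse known-interesting bigrams into single
// tokens, leaving everything else as unigrams, so "a b a a c" becomes
// [ab, aa, c] when "a b" and "a a" were found interesting earlier.
public class GreedyTiler {

  static List<String> tile(String[] tokens, Set<String> phrases) {
    List<String> tiled = new ArrayList<>();
    int i = 0;
    while (i < tokens.length) {
      if (i + 1 < tokens.length
          && phrases.contains(tokens[i] + " " + tokens[i + 1])) {
        tiled.add(tokens[i] + tokens[i + 1]); // e.g. "a b" -> "ab"
        i += 2;
      } else {
        tiled.add(tokens[i]);
        i += 1;
      }
    }
    return tiled;
  }

  public static void main(String[] args) {
    Set<String> phrases = new HashSet<>(Arrays.asList("a a", "a b"));
    System.out.println(tile("a b a a c".split(" "), phrases)); // [ab, aa, c]
  }
}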

Another class of techniques entirely which deals a bit better with the
problem of (slightly) longer range dependencies is some variant on random
indexing.  Here, you start with a random vector for each symbol and then
move each symbol vector a skoshi in the direction of the symbols it cooccurs
with.  If you use separate forward and backward looking vectors then you get
strong directionality and sequencing.  The vector of a sequence is then just
the sum (perhaps weighted by rarity) of the trained vectors for each term.
The size of the cooccurrence window is an important parameter here.
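
(And a crude sketch of the random indexing variant, again purely
illustrative -- the dimension, step size, and backward-only window are
made-up simplifications; a forward-looking vector would be trained the same
way in the other direction:)

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Crude random indexing sketch: every symbol starts with a random
// vector; on each co-occurrence within the window we nudge the symbol's
// vector a little toward its preceding context.
public class RandomIndexer {

  static final int DIM = 256;
  static final double STEP = 0.1; // the "skoshi"
  static final Random RNG = new Random(42);

  final Map<String, double[]> index = new HashMap<>();

  double[] vectorFor(String symbol) {
    return index.computeIfAbsent(symbol, s -> {
      double[] v = new double[DIM];
      for (int i = 0; i < DIM; i++) {
        v[i] = RNG.nextGaussian();
      }
      return v;
    });
  }

  void train(String[] sequence, int window) {
    for (int i = 0; i < sequence.length; i++) {
      double[] current = vectorFor(sequence[i]);
      for (int j = Math.max(0, i - window); j < i; j++) {
        double[] context = vectorFor(sequence[j]);
        for (int d = 0; d < DIM; d++) {
          current[d] += STEP * context[d]; // backward-looking only
        }
      }
    }
  }
}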

Surprisingly, these techniques have some pretty deep connections on the
mathematical level.  The first approach + random basis projection or SVD is
very comparable to the second approach except that the second handles
dependencies that skip over intervening symbols more directly.

In any case, you lose some sequence information.  This is, in fact, exactly
what gives you a nice metric over similar sequences.

On Sat, Nov 21, 2009 at 10:00 PM, prasenjit mukherjee <
pmukherjee@quattrowireless.com> wrote:

> On Sun, Nov 22, 2009 at 9:54 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > Expressing your symbolic sequences by tiling these phrases gives you much of
> > the temporality that you are interested in and lets you use algorithms like
> > k-means pretty much directly.
>
> Approach sounds interesting.  Can you explain a bit on how you intend
> to represent a sequence as a vector here?  Assuming the sequence is "a
> b a a c", I was thinking of the following 2 approaches:
>
> If I use symbols as my basis and the coefficients as time-slices then
> I would lose the information of recurring symbols  ( symbol a in my
> example ) .  e.g. vector representation of "a b a a c": 1(a)+ 2(b) +
> 5(c) ( problem : how to incorporate 3a,4a )
>
> On the other hand if I use time-slices as my basis and some mapping of
> terms as its coefficients then my simple Euclidean measure won't make
> any sense.  e.g. let a->1, b->2, c->3, then vector representation of
> "a b a a c":  1(t1) + 2(t2) + 1(t3) + 1(t4) + 3(t5)
>
> -Prasen
>
> >
> > If you don't have symbolic sequences, you have another problem, but you
> > might get similar results by doing vector quantization on your continuous
> > time-series expressed in terms of multi-scale localized spectral
> detectors.
> > Some problems work well with those techniques, some definitely need more
> > interesting feature detectors.  The spectral processing and vector
> > quantization are fairly natural tasks for map-reduce which is nice.  In
> > fact, vector quantization is commonly done with some variant on k-means.
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
On Sun, Nov 22, 2009 at 9:54 AM, Ted Dunning <te...@gmail.com> wrote:
> Expressing your symbolic sequences by tiling these phrases gives you much of
> the temporality that you are interested in and lets you use algorithms like
> k-means pretty much directly.

Approach sounds interesting.  Can you explain a bit on how you intend
to represent a sequence as a vector here?  Assuming the sequence is "a
b a a c", I was thinking of the following 2 approaches:

If I use symbols as my basis and the coefficients as time-slices then
I would lose the information of recurring symbols  ( symbol a in my
example ) .  e.g. vector representation of "a b a a c": 1(a)+ 2(b) +
5(c) ( problem : how to incorporate 3a,4a )

On the other hand if I use time-slices as my basis and some mapping of
terms as its coefficients then my simple Euclidean measure won't make
any sense.  e.g. let a->1, b->2, c->3, then vector representation of
"a b a a c":  1(t1) + 2(t2) + 1(t3) + 1(t4) + 3(t5)

-Prasen

>
> If you don't have symbolic sequences, you have another problem, but you
> might get similar results by doing vector quantization on your continuous
> time-series expressed in terms of multi-scale localized spectral detectors.
> Some problems work well with those techniques, some definitely need more
> interesting feature detectors.  The spectral processing and vector
> quantization are fairly natural tasks for map-reduce which is nice.  In
> fact, vector quantization is commonly done with some variant on k-means.
>

Re: mahout examples

Posted by Ted Dunning <te...@gmail.com>.
The Hadoop way of approaching a problem similar to suffix trees would be to
broadcast short symbol sequences into a large counter.  This is wasteful in
terms of total data transferred, but it makes much of that back up because
the transfers are all very sequential.  You can discard sequences with few
occurrences immediately and then use the counts to further filter the list,
eliminating uninteresting sequences.
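
(A single-machine caricature of that count-and-filter step in plain Java; on
Hadoop the map phase would emit (ngram, 1) pairs and the reduce phase would
sum and filter. The maximum ngram length and minimum count are arbitrary
knobs:)

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Caricature of the counting step: count every short symbol sequence,
// then throw away the rare ones immediately.
public class SequenceCounter {

  static Map<String, Integer> countAndFilter(List<String[]> sequences,
                                             int maxN, int minCount) {
    Map<String, Integer> counts = new HashMap<>();
    for (String[] seq : sequences) {
      for (int start = 0; start < seq.length; start++) {
        StringBuilder ngram = new StringBuilder();
        for (int n = 0; n < maxN && start + n < seq.length; n++) {
          if (n > 0) {
            ngram.append(' ');
          }
          ngram.append(seq[start + n]);
          counts.merge(ngram.toString(), 1, Integer::sum);
        }
      }
    }
    // discard sequences with few occurrences
    counts.values().removeIf(c -> c < minCount);
    return counts;
  }
}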

Expressing your symbolic sequences by tiling these phrases gives you much of
the temporality that you are interested in and lets you use algorithms like
k-means pretty much directly.

If you don't have symbolic sequences, you have another problem, but you
might get similar results by doing vector quantization on your continuous
time-series expressed in terms of multi-scale localized spectral detectors.
Some problems work well with those techniques, some definitely need more
interesting feature detectors.  The spectral processing and vector
quantization are fairly natural tasks for map-reduce which is nice.  In
fact, vector quantization is commonly done with some variant on k-means.

On Sat, Nov 21, 2009 at 6:21 PM, Patterson, Josh <jp...@tva.gov> wrote:

> Prasen,
> Well, I'm not entirely sure how I'm going to do it right now; it's going to
> come down to trial and error with multiple approaches. There are many
> obstacles to overcome, including the ones you are speaking of, like:
>
> - timeseries vector representation
> - timeseries shifts, scaling
> - for clustering, need a heuristic for distance metric
>
> I have not heard of suffix-tree based clustering, but it definitely sounds
> interesting and I'll check that one out. I suggested kmeans initially as it's
> a very basic and well known clustering technique. I need to review my older
> notes for approaches to solve some of these projects and review more papers
> dealing with timeseries data in general. My initial thought is to decompose
> a block of timeseries data into a set of time ordered "features" in order to
> reduce the effect of scaling issues and alignment effects. That would make
> it easier to calculate a delta between vectors, but also might create
> unintended approximation errors.
>
> As I get more Mahout code running, I'll probably post relevant research
> papers dealing with time series data that I'm looking at. That way anyone
> can chip in their opinion of the approach and we can build more example code
> for dealing with timeseries in Mahout.
>
> I'll take a look at the suffix-tree technique, and I look forward to any
> other suggestions you might have in terms of an approach.
>
> Josh Patterson
> TVA
>
>
> -----Original Message-----
> From: prasen.bea@gmail.com on behalf of prasenjit mukherjee
> Sent: Sat 11/21/2009 12:20 AM
> To: mahout-user@lucene.apache.org
> Subject: Re: mahout examples
>
> Hi Josh,
>
> I too am working on  clustering time-series-data, and basically trying
> to come up with a sequence clustering model. Would like to know how
> you intend to use K-means to achieve that.  Are you  treating each
> sequence as a point ?  Then, what would be your vector representation
> of a sequence and also more importantly which metric ( distance
> computation logic ) will you be using ?
>
> BTW, I am thinking along the lines of STC ( suffix-tree based clustering ).
>
> -Prasen
>
> On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov>
> wrote:
> > I think in terms of clustering time series data, the first step looks to
> > be vectorizing the input cases with possibly the DenseVector class and
> > feeding that to a basic KMeans implementation like KMeansDriver.java.
> > Once we can get the basic kmeans rolling with some known dataset we'll
> > be able to iterate on that and move towards using more complex
> > techniques and other grid timeseries data. Any suggestions or discussion
> > is greatly appreciated,
> >
> > Josh Patterson
> > TVA
>
>


-- 
Ted Dunning, CTO
DeepDyve

RE: mahout examples

Posted by "Patterson, Josh" <jp...@tva.gov>.
Prasen,
Well, I'm not entirely sure how I'm going to do it right now; it's going to come down to trial and error with multiple approaches. There are many obstacles to overcome, including the ones you are speaking of, like:

- timeseries vector representation
- timeseries shifts, scaling
- for clustering, need a heuristic for distance metric

I have not heard of suffix-tree based clustering, but it definitely sounds interesting and I'll check that one out. I suggested kmeans initially as it's a very basic and well known clustering technique. I need to review my older notes for approaches to solve some of these projects and review more papers dealing with timeseries data in general. My initial thought is to decompose a block of timeseries data into a set of time ordered "features" in order to reduce the effect of scaling issues and alignment effects. That would make it easier to calculate a delta between vectors, but also might create unintended approximation errors.

As I get more Mahout code running, I'll probably post relevant research papers dealing with time series data that I'm looking at. That way anyone can chip in their opinion of the approach and we can build more example code for dealing with timeseries in Mahout.

I'll take a look at the suffix-tree technique, and I look forward to any other suggestions you might have in terms of an approach.

Josh Patterson
TVA


-----Original Message-----
From: prasen.bea@gmail.com on behalf of prasenjit mukherjee
Sent: Sat 11/21/2009 12:20 AM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples
 
Hi Josh,

I too am working on  clustering time-series-data, and basically trying
to come up with a sequence clustering model. Would like to know how
you intend to use K-means to achieve that.  Are you  treating each
sequence as a point ?  Then, what would be your vector representation
of a sequence and also more importantly which metric ( distance
computation logic ) will you be using ?

BTW, I am thinking along the lines of STC ( suffix-tree based clustering ).

-Prasen

On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov> wrote:
> I think in terms of clustering time series data, the first step looks to
> be vectorizing the input cases with possibly the DenseVector class and
> feeding that to a basic KMeans implementation like KMeansDriver.java.
> Once we can get the basic kmeans rolling with some known dataset we'll
> be able to iterate on that and move towards using more complex
> techniques and other grid timeseries data. Any suggestions or discussion
> is greatly appreciated,
>
> Josh Patterson
> TVA


Re: mahout examples

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
Hi Josh,

I too am working on  clustering time-series-data, and basically trying
to come up with a sequence clustering model. Would like to know how
you intend to use K-means to achieve that.  Are you  treating each
sequence as a point ?  Then, what would be your vector representation
of a sequence and also more importantly which metric ( distance
computation logic ) will you be using ?

BTW, I am thinking along the lines of STC ( suffix-tree based clustering ).

-Prasen

On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jp...@tva.gov> wrote:
> I think in terms of clustering time series data, the first step looks to
> be vectorizing the input cases with possibly the DenseVector class and
> feeding that to a basic KMeans implementation like KMeansDriver.java.
> Once we can get the basic kmeans rolling with some known dataset we'll
> be able to iterate on that and move towards using more complex
> techniques and other grid timeseries data. Any suggestions or discussion
> is greatly appreciated,
>
> Josh Patterson
> TVA

RE: mahout examples

Posted by "Patterson, Josh" <jp...@tva.gov>.
Drew,
Yeah, I'm also very interested in clustering and the article is mainly
focused on collaborative filtering. I've been reading through mailing
list archives and Mahout source code such as KMeansDriver.java to get a
better idea on how to get clustering working before I ask the simple
"FAQ"-type questions. We'd like to be able to: 

a.) cluster
b.) classify

our data so we can explore the data and project / classify trends. Other
data mining tools are a little more turn-key than Mahout, but since we
use Hadoop we believe Mahout is the best option at this point. My short
term goal is to take known data sets such as:

http://www-stat.ucdavis.edu/~shumway/tsa.html
and
http://lib.stat.cmu.edu/datasets/

and get basic clustering to work on Hadoop with Mahout. If I can get
clustering working well and consistently on Mahout, it's a solid
stepping stone towards working with other data sets (and any canonical
time series data set suggestions are appreciated here). I've worked with
timeseries data sets and machine learning before but more with neural
nets (and others) in non-distributed situations. Since our project is
open source, I'd love to be able to contribute back time series related
code to future Mahout versions. Our project's home page is:

http://openpdc.codeplex.com

Some background:

http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source/
http://news.cnet.com/8301-13846_3-10393259-62.html?tag=mncol;title
http://earth2tech.com/2009/11/10/the-google-android-of-the-smart-grid-openpdc/

I think in terms of clustering time series data, the first step looks to
be vectorizing the input cases with possibly the DenseVector class and
feeding that to a basic KMeans implementation like KMeansDriver.java.
Once we can get the basic kmeans rolling with some known dataset we'll
be able to iterate on that and move towards using more complex
techniques and other grid timeseries data. Any suggestions or discussion
is greatly appreciated,
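
(As a first cut, something like this minimal sketch is what I mean by
vectorizing -- the DenseVector package name is from my reading of the 0.2
source tree, so check it against the release; windowing and normalization
choices are left open:)

// Minimal sketch: wrap one fixed-length window of raw samples in a
// Mahout vector for the k-means job. Package name is from memory of
// the 0.2 tree, so verify it against the release.
import org.apache.mahout.matrix.DenseVector;
import org.apache.mahout.matrix.Vector;

public class TimeSeriesVectorizer {

  /** One input case = one window of time series samples. */
  static Vector vectorize(double[] window) {
    Vector v = new DenseVector(window.length);
    for (int i = 0; i < window.length; i++) {
      v.set(i, window[i]);
    }
    return v;
  }
}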

Josh Patterson
TVA

-----Original Message-----
From: Drew Farris [mailto:drew.farris@gmail.com] 
Sent: Friday, November 20, 2009 12:04 PM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples

Hi Josh,

I got started with Mahout using Grant's article as an introduction --
the code included with it is older than the 0.2 release.

Ant is used primarily to build Grant's examples and as a launcher --
it sets up the classpath so that the examples can be run properly. The
build script uses the pre-built version of Mahout shipped with the
bundle.

I found that the article itself and the sample code provided were a
great way to get started, but I quickly started working from the
subversion repository once I had achieved the basics. I was mostly
interested in clustering and that wasn't covered in the article to a
great extent anyway. The mahout wiki is a great resource for exploring
further.

For what it's worth, the Mahout sources included in Grant's article and
the sources obtainable as part of the release or from subversion must
all be built using Maven.

Hope this helps,

Drew

On Fri, Nov 20, 2009 at 10:16 AM, Patterson, Josh <jp...@tva.gov>
wrote:
> I've noticed that the article at:
>
>
>
> http://www.ibm.com/developerworks/java/library/j-mahout/
>
>
>
> uses Ant while the release of Mahout 0.2 uses Maven. Also, the article's
> downloadable code includes what looks like a snapshot version
> of Mahout 0.2 - is that essentially a copy of the release? Are there any
> other differences I should note when working through these examples?
> Thanks!
>
>
>
> Josh Patterson
>
> TVA
>
>

Re: mahout examples

Posted by Drew Farris <dr...@gmail.com>.
Hi Josh,

I got started with Mahout using Grant's article as an introduction --
the code included with it is older than the 0.2 release.

Ant is used primarily to build Grant's examples and as a launcher --
it sets up the classpath so that the examples can be run properly. The
build script uses the pre-built version of Mahout shipped with the
bundle.

I found that the article itself and the sample code provided were a
great way to get started, but I quickly started working from the
subversion repository once I had achieved the basics. I was mostly
interested in clustering and that wasn't covered in the article to a
great extent anyway. The mahout wiki is a great resource for exploring
further.

For what it's worth, the Mahout sources included in Grant's article and
the sources obtainable as part of the release or from subversion must
all be built using Maven.

Hope this helps,

Drew

On Fri, Nov 20, 2009 at 10:16 AM, Patterson, Josh <jp...@tva.gov> wrote:
> I've noticed that the article at:
>
>
>
> http://www.ibm.com/developerworks/java/library/j-mahout/
>
>
>
> uses Ant while the release of Mahout 0.2 uses Maven. Also, the article's
> downloadable code includes what looks like a snapshot version
> of Mahout 0.2 - is that essentially a copy of the release? Are there any
> other differences I should note when working through these examples?
> Thanks!
>
>
>
> Josh Patterson
>
> TVA
>
>