Posted to user@mahout.apache.org by Florent Empis <fl...@gmail.com> on 2010/07/15 16:51:40 UTC

Beginner questions on clustering & M/R

Hi,

I want to learn more on clustering techniques. I have skimmed through
Programming Collective Intelligence and Mahout in Action in the past but I
don't have them on hand at the moment... :(
I've seen Isabel Drost's mail about test data on http://mldata.org/about/
I've had an idea of using http://mldata.org/repository/view/stockvalues/ for
a pet project.
My idea is as follows: can we see a common behaviour between companies' stock
values?
I would expect to end up with clusters of banking-sector shares, utility
shares, media, etc... and maybe some more unexpected clusters, who knows?

My idea is basically:
1°) Transform the dataset from values to daily variation as a percentage
drop/rise (the data is then normalized)
2°) Apply clustering technique(s)

The question may seem silly, but as I understand it, clustering happens in a
2- (or more) dimensional space.
I know I have 2 dimensions, variation and time, but I can't wrap my head
around the problem...

I *think* that the K-Means example does exactly what I intend to do in my
second step, is this correct?
However, I can't grasp what the 2-dimensional display represents exactly: what
are the x and y axes?
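To make the question concrete, here is how I picture the input to k-means: each stock becomes one point whose coordinates are its daily variations, so time just indexes the dimensions and variation is the value along each one. A toy sketch of that picture (plain Python/numpy, not the Mahout implementation; all data and names are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 6 "stocks" x 20 "days" of daily percentage changes.
# Each stock is one point in 20-dimensional space: one dimension per day,
# and the coordinate along each dimension is that day's variation.
up = 0.01 + 0.002 * rng.standard_normal((3, 20))     # up-trending group
down = -0.01 + 0.002 * rng.standard_normal((3, 20))  # down-trending group
points = np.vstack([up, down])

def kmeans(points, k, iters=20):
    """Tiny k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points; repeat."""
    # Naive initialization: pick k points spread across the dataset.
    centroids = points[:: len(points) // k][:k].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(points, k=2)
```

So there is no single x/y pair: any 2-D display of such a clustering has to project the 20-dimensional points down to two axes somehow.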

Added question: I am fairly new to the M/R paradigm, but let's say I would
like to do step 1 (data normalization) in an M/R fashion. Would the following
be a good idea?
My data is a matrix of k stock values S in n intervals of time.
I call the first stock in the file, first and second period:
S1,t & S1,t+1 ...

Map Step: input: ( (S1,t ... S1,t+n), ..., (Sk,t ... Sk,t+n) )
output: ( ((S1,t;S1,t+1), ..., (S1,t+n-1;S1,t+n)), ...,
((Sk,t;Sk,t+1), ..., (Sk,t+n-1;Sk,t+n)) )
Reduce Step:
( (%S1,t+1 ... %S1,t+n), ..., (%Sk,t+1 ... %Sk,t+n) )
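In plain Python, the map and reduce above would look something like this (just a stand-in for real Hadoop code; the function names and toy prices are made up):

```python
def map_step(record):
    """Map: (stock_id, [price_t, ..., price_t+n]) ->
    (stock_id, [(price_t, price_t+1), (price_t+1, price_t+2), ...]).
    Emits each pair of consecutive prices for the stock."""
    stock_id, prices = record
    return stock_id, list(zip(prices, prices[1:]))

def reduce_step(key_pairs):
    """Reduce: (stock_id, consecutive price pairs) ->
    (stock_id, [pct_change_1, ..., pct_change_n]).
    Turns each pair into a fractional daily change."""
    stock_id, pairs = key_pairs
    return stock_id, [(b - a) / a for a, b in pairs]

# Toy example: one stock, four daily closing prices.
mapped = map_step(("S1", [100.0, 102.0, 51.0, 102.0]))
reduced = reduce_step(mapped)
```

The reduce output is the normalized variation series I would then feed to the clustering step.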

I apologize for my beginner's questions but.... everyone has to start
somewhere :-)

BR,

Florent Empis

Re: Beginner questions on clustering & M/R

Posted by Florent Empis <fl...@gmail.com>.
Thanks :-)


2010/7/16 Ted Dunning <te...@gmail.com>

> Gabor transform retains some time domain information.  Since economic
> processes change somewhat, I think that would be important.  It also helps
> avoid questions about how to window the signal (it effectively *is* a
> windowed Fourier transform).
>

Re: Beginner questions on clustering & M/R

Posted by Ted Dunning <te...@gmail.com>.
Gabor transform retains some time domain information.  Since economic
processes change somewhat, I think that would be important.  It also helps
avoid questions about how to window the signal (it effectively *is* a
windowed Fourier transform).
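To illustrate: a Gabor-style transform is just an FFT taken under a sliding Gaussian window, so each row of the result is a local spectrum and the row index preserves time. A toy numpy sketch (the window shape, sizes, and test signal are arbitrary choices of mine, not a tuned implementation):

```python
import numpy as np

def gabor_features(x, win_len=16, hop=8):
    """Crude Gabor-style transform: slide a Gaussian window over the signal
    and take the FFT magnitude of each windowed segment. Rows index time
    frames, columns index frequency bins, so time structure is retained."""
    window = np.exp(-0.5 * ((np.arange(win_len) - win_len / 2) / (win_len / 4)) ** 2)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        segment = x[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(segment)))
    return np.array(frames)

# Toy "daily change" series: a slow oscillation that speeds up halfway through.
t = np.arange(128)
signal = np.concatenate([np.sin(2 * np.pi * t[:64] / 16),
                         np.sin(2 * np.pi * t[64:] / 4)])
feats = gabor_features(signal)
```

The toy signal changes frequency halfway through; a plain Fourier transform over the whole series would mix the two regimes, while the frame-by-frame spectra keep them separate, which is the property that matters for slowly changing economic processes.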

On Fri, Jul 16, 2010 at 5:01 AM, Florent Empis <fl...@gmail.com> wrote:

> What makes you think that Gabor would help? Because of phase shifting? I
> would then basically be clustering my data by phase shifting, is that right
> ?
>

Re: Beginner questions on clustering & M/R

Posted by Florent Empis <fl...@gmail.com>.
Hi,

First of all, let me stress I'm not actually trying to do quant analysis...
it's just for fun; no practical use is expected, other than learning some
new stuff.

I also thought of using a transform from time to frequency (Fourier...), but
it was only a wild guess based on my limited knowledge of electronics and
signal processing, where the usual answer to a complex signal analysis is "do
a Fourier transform, it will help" :)

What makes you think that Gabor would help? Because of phase shifting? I
would then basically be clustering my data by phase shifting, is that right
?

Thanks for your help!

Florent




2010/7/15 Ted Dunning <te...@gmail.com>

> Clustering of time series data is usually better done in an abstract
> relatively low dimensional coordinate space based on some transform like a
> locality sensitive frequency transform.  Gabor transforms might be
> appropriate.
>
> You might be able to get away with something like an SVD of your daily
> change data.
>

Re: Beginner questions on clustering & M/R

Posted by Ted Dunning <te...@gmail.com>.
Just speaking heuristically, time series data is very high dimensional.  For
the equities market, you have (at least) daily samples on nearly 10,000
publicly traded stocks.  With only 3 years of data, that gives you 10
million dimensions.  With 30 years of data, things are obviously 10x worse.
If you include options, futures and commodities, things get vastly worse.

Even more problematic, the direct time series data is not translation
invariant.  This means that learning something about the past only teaches
you about the past.  The direct prices are not even magnitude invariant
which is the motive for studying the first-order differences or for using
the log of the prices.

These make any kind of learning approach pretty difficult.

So... the requirement is to decrease the dimensionality somehow.
Essentially, that means to take those thousands of samples of thousands of
equities and describe them in a much more compact form of some kind.
Hopefully, this compact representation has important components that are
slowly varying so that predictions made using these components have a
reasonable range into the future.

There are lots of kinds of dimensionality reduction that you can try.  The
three general categories I would think of off the cuff are: combined
frequency/time representations like wavelets (the Gabor transforms that I
mentioned are in this category); SVD techniques, which might be able to
decode industry sectors that move together; and more general probabilistic
latent-variable techniques.  Combinations of these are also plausible.

If you take the SVD stuff in particular, you would start with, say, your
equity data in a matrix.  Each row would represent a different equity and
each column would represent a single time value.  Since equities appear and
disappear, you would have significant numbers of missing observations.  To
deal with the exponential growth phenomena associated with economic entities
in general, I would recommend starting with the log of the price.  If you
take the partial SVD decomposition of this matrix in a fashion suitably
adjusted for the missing values, you will have a left singular matrix that
transforms stocks into the internal representation and a right singular
matrix whose rows encode time-based patterns of price movement.  The SVD
expresses the price movements of individual stocks in terms of linear
combinations of these time-based patterns.
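As a toy sketch of that setup (numpy only; the prices are invented, and filling gaps with the row mean is a crude stand-in for a proper missing-value-aware partial SVD, which would iterate, ALS-style):

```python
import numpy as np

# Toy matrix: 4 "equities" x 6 "days" of prices; NaN marks a missing quote.
prices = np.array([
    [10.0, 10.5, 11.0, 11.6, 12.2, 12.8],    # grows steadily
    [20.0, 21.0, 22.1, 23.2, np.nan, 25.5],  # similar growth, one gap
    [50.0, 49.0, 48.0, 47.1, 46.2, 45.2],    # shrinks steadily
    [ 5.0,  4.9,  4.8,  4.7,  4.6,  4.5],    # similar decline
])

# Work on log prices, as suggested, to tame exponential growth effects.
logp = np.log(prices)

# Crude missing-value handling: fill each row's gaps with that row's mean.
row_means = np.nanmean(logp, axis=1, keepdims=True)
filled = np.where(np.isnan(logp), row_means, logp)

# Center each row, then decompose. U maps stocks into the internal
# representation; rows of Vt encode time-based patterns of movement.
centered = filled - filled.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep only the top-2 components: the "partial" part of the partial SVD.
rank2 = (U[:, :2] * s[:2]) @ Vt[:2]
```

With data like this, the leading component separates the two growing equities from the two declining ones, which is the "cohort" structure the text describes: a stock whose row stops being well approximated by its cohort's components is deviating from it.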

At this level, you can use the system as a method for detecting when a stock
starts to deviate from its cohort.  This might be an interesting signal, for
example, to alert you to examine something more carefully.

If you include various leading economic indicators in your data as well as
simple equity prices, then you begin to get some predictive power.  This is
especially true if you include the leading indicators in a delayed form so
that their predictive effect can be recognized and encoded by the SVD.
Another trick is to build the SVD initially using just a moderate number of
indicators in lagged form combined with a few strong indicators of current
conditions.  That will give you right singular vectors that are associated
with general patterns of economic activity.  You can then use those right
singular vectors to derive a matrix of approximate left singular vectors for
the equities of interest.  What you have done at this point is to shoe-horn
the equity prices into a shoe made out of general economic indicators that
are suitably lagged so as to force the model induced by this approximate
SVD to be as predictive as possible.

This is just an outline of how these techniques can be used.  To make
successful models along these lines will take a LOT of detail work.  For
instance, the details of how you express the prices in the beginning are a
big deal.  Another issue is how you express the lagged indicators.  Just
time shifting them is unlikely to be successful.  Convolving with a delay
filter (or several such) that is structured based on expert opinions is
probably much better.  A huge over-arching issue is how to deal with the
fact that if you pick over your data hundreds of times, you may well no
longer be predicting anything but the idiosyncrasies of the past due to
over-fitting.
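For the delay-filter idea specifically, here is a toy sketch (the filter weights are arbitrary stand-ins for the expert-opinion part):

```python
import numpy as np

def lag_with_filter(indicator, delay_filter):
    """Convolve an indicator series with a causal delay filter so that its
    value at time t is a smeared version of its own past, letting the SVD
    pick up a lead/lag relationship rather than a single fixed time shift.
    'full' mode plus truncation keeps the output causal and input-length."""
    out = np.convolve(indicator, delay_filter, mode="full")
    return out[:len(indicator)]

# Hypothetical delay filter standing in for "expert opinion": the effect
# peaks 2 steps after the indicator moves, then decays.
delay_filter = np.array([0.0, 0.3, 0.4, 0.2, 0.1])

impulse = np.zeros(10)
impulse[0] = 1.0  # the indicator spikes once, at t=0
lagged = lag_with_filter(impulse, delay_filter)
```

The impulse response shows the difference from plain time shifting: a single spike in the indicator becomes a distributed effect over the following several steps instead of one delayed copy.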

I hope this helps.

On Sat, Jul 17, 2010 at 2:11 PM, Florent Empis <fl...@gmail.com> wrote:

>
> On the SVD part... why would that help?
>
> Thanks  for your input:)
>
>

Re: Beginner questions on clustering & M/R

Posted by Florent Empis <fl...@gmail.com>.
Hi,

On the SVD part... why would that help?

Thanks  for your input:)

Florent

2010/7/15 Ted Dunning <te...@gmail.com>

> Clustering of time series data is usually better done in an abstract
> relatively low dimensional coordinate space based on some transform like a
> locality sensitive frequency transform.  Gabor transforms might be
> appropriate.
>
> You might be able to get away with something like an SVD of your daily
> change data.
>

Re: Beginner questions on clustering & M/R

Posted by Ted Dunning <te...@gmail.com>.
Clustering of time series data is usually better done in an abstract,
relatively low-dimensional coordinate space based on some transform like a
locality-sensitive frequency transform.  Gabor transforms might be
appropriate.

You might be able to get away with something like an SVD of your daily
change data.

On Thu, Jul 15, 2010 at 7:51 AM, Florent Empis <fl...@gmail.com> wrote:

> [...]

Re: Beginner questions on clustering & M/R

Posted by Joe Spears <js...@indieplaya.com>.
I once thought about being a quant, long before I started my current company
:).

I don't want to ruin the surprise for you, but because of volatility in the
market and the fact that you are looking at daily data (unless you spend a
lot of time writing a custom clustering implementation) you are more likely
to see clusters like 'defensive stocks', 'bellwether stocks' and the
like... whose performance is less about their sector and more about the
overall market's "like" of the company. Since some sectors have a higher
affinity to these categorizations, you are likely to see some patterns as
you describe and will also see some sectors being clustered apart. One
example is the 'tech' sector which has many companies that perform
differently in any particular market climate.

I found the much more interesting problem to be in the area of which
non-sector companies influence other companies and to what degree. For
instance, could I create a model that is "subconsciously aware" that stock1
is a supplier to stock2, which is a major vendor of stock3? If stock3 takes a
dip, to what extent does that influence stock2's share price, and could I
short stock2?

To build this model, I bought "tickdata" from a company called "Fitch" (
http://www.fitchgroup.com/fitchdata/) and built a stock simulator and neural
net as part of my Master's project. I had about 1 year of data and could get
pretty good at making some predictions, but my performance never came close
to beating the major indexes. I bet that with more data and a better
algorithm you could get much better performance. Whether you could do
better than the indexes is another question.

Joe


On Thu, Jul 15, 2010 at 7:51 AM, Florent Empis <fl...@gmail.com> wrote:

> [...]