Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2010/06/21 20:12:47 UTC

classifier architecture needed

We are now beginning to have lots of classifiers in Mahout.  The naive
Bayes, complementary naive Bayes, and random forest grandfathers have been
joined by my recent SGD work and Zhao Zhendong's prolific set of approaches
for logistic regression and SVM variants.

All of these implementations have similar characteristics and virtually none
are inter-operable.

Even worse, the model produced by a clustering system is really just like a
model produced by a classifier, so clustering stands to increase the number
of sources of incompatible classifiers even further.  Altogether, we probably
have a dozen ways of building classifiers.

I would like to start a discussion about a framework that we can fit all of
these approaches together in much the same way that the recommendations
stuff has such nice pluggable properties.

As I see it, the opportunities for commonality (aka our current
deficiencies)  include:

- original input format reading

-- the naive Bayes code uses an ad hoc format similar to what Jason Rennie
used for 20 news groups.  This code uses Lucene 3.0 style analyzers.

-- Zhao uses something a lot like SVMLight input format

-- The SGD code looks at CSV data

-- Drew wrote some Avro document code

-- Lucene has been used as a source of vectors for clustering

My summary here is that the Lucene analyzers look like they could be used
very effectively for our purposes.  We would need to write AttributeFilters
that do two kinds of vectorization (random projection and dictionary based).
We also should have 4 standard input format parsers as examples (CSV,
SVMLight, VowpalWabbit, current naive Bayes format).

We need something simple and general that subsumes all of these input use
cases.
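As a concrete illustration of one of the four standard parsers suggested above, here is a minimal sketch of an SVMLight-style parser ("label index:value index:value ...").  The class and method names are hypothetical, not existing Mahout code:

```java
// Illustrative sketch of an SVMLight-format parser ("label idx:val idx:val ...").
// Slot 0 of the returned array holds the label; SVMLight feature indices are
// 1-based, so they map directly to slots 1..cardinality.
class SvmLightParser {
    static double[] parse(String line, int cardinality) {
        String[] parts = line.trim().split("\\s+");
        double[] v = new double[cardinality + 1];
        v[0] = Double.parseDouble(parts[0]);        // the label
        for (int i = 1; i < parts.length; i++) {
            int colon = parts[i].indexOf(':');
            int idx = Integer.parseInt(parts[i].substring(0, colon));
            v[idx] = Double.parseDouble(parts[i].substring(colon + 1));
        }
        return v;                                   // sparse entries default to 0.0
    }
}
```

A real implementation would produce a sparse Vector rather than a dense array, but the parsing contract would look much the same for CSV or VowpalWabbit inputs.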

- conversion to vectors

-- The SGD code introduced conversion via random projection

-- Naive Bayes has some dictionary-based conversions

-- Other stuff does this or that

This should be subsumed into the AttributeFilters that I mentioned above.
We really just need random projection and Salton-style vector space models.
Clearly, we should allow direct input of vectors as well in case the user
is producing them for us.
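The two vectorization styles mentioned above could be sketched roughly as follows.  The dictionary encoder is the Salton-style approach; the hashed encoder stands in for the random-projection-style approach.  All names here are illustrative, not Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Dictionary-based (Salton-style) encoding: each term gets a stable index.
class DictionaryEncoder {
    private final Map<String, Integer> dictionary = new HashMap<>();

    // Assigns each previously unseen term the next free index.
    int indexOf(String term) {
        return dictionary.computeIfAbsent(term, t -> dictionary.size());
    }

    double[] encode(String[] terms, int cardinality) {
        double[] v = new double[cardinality];
        for (String t : terms) {
            v[indexOf(t)] += 1.0;   // plain term-frequency weighting for brevity
        }
        return v;
    }
}

// Hashed encoding: no dictionary at all; terms are bucketed by hash into a
// fixed-cardinality vector, with a signed update to reduce collision bias.
class HashedEncoder {
    private final int cardinality;

    HashedEncoder(int cardinality) { this.cardinality = cardinality; }

    double[] encode(String[] terms) {
        double[] v = new double[cardinality];
        for (String t : terms) {
            int h = t.hashCode();
            int idx = Math.floorMod(h, cardinality);
            v[idx] += (h & 1) == 0 ? 1.0 : -1.0;
        }
        return v;
    }
}
```

The trade-off is the usual one: the dictionary encoder is exactly invertible but carries mutable state that must be saved, while the hashed encoder is stateless but lossy.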

- command line option processing

We really need to have a simple way to integrate all of the input processing
options easily into new and old code

- model storage

It would be lovely if we could instantiate a model from a stored form
without even knowing what kind of learning produced the model.  All of the
classifiers and clustering algorithms should put out something that can be
instantiated this way.  I used Gson in the SGD code and found it pretty
congenial, but I didn't encode the class of the classifier, nor did I
provide a classifier abstract class.  I don't know what k-means or Canopy
clustering produce, nor random forests or Naive Bayes, but I am sure that
all of them are highly specific to the particular kind of model.

I don't know what is best here, but we definitely need something more common
than what we have.
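One common way to get this "instantiate without knowing the learner" property is a type-tagged serialized form plus a factory registry.  A minimal sketch, with a toy "type:payload" encoding standing in for a real JSON (e.g. Gson) representation; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Any model, regardless of which learner produced it.
interface Model {
    double classifyScalar(double[] instance);   // score for the first class
}

// The stored form carries a type tag; each learner registers a reader for
// its own tag, so callers never need to know the concrete model class.
final class ModelFactory {
    private static final Map<String, Function<String, Model>> READERS = new HashMap<>();

    static void register(String type, Function<String, Model> reader) {
        READERS.put(type, reader);
    }

    // Toy wire format "type:payload"; a real version would read a JSON
    // object with a type field and hand the rest to the registered reader.
    static Model read(String stored) {
        int colon = stored.indexOf(':');
        String type = stored.substring(0, colon);
        return READERS.get(type).apply(stored.substring(colon + 1));
    }
}
```

The blob stays opaque to everyone except the reader registered for its tag, which matches the "opaque blob plus standard factory" position taken later in this thread.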

What do others think?

Re: classifier architecture needed

Posted by Ted Dunning <te...@gmail.com>.
I agree that models should be highly generic.  I just don't think that we
should legislate the content of either their internal model or their
serialized representation.

The contract is pretty clear, however.  There are just a few methods and it
isn't hard for all models to support them, especially with an abstract class
providing default implementations.  There are a few unusual methods that I
would suggest for the API that can help a lot with performance.  For
instance, for two-class models, it should be possible to call the model and
get back a double value which is the score for the first class.  For k-class
models, it should be possible to pass in a vector that receives the scores
for the various classes, and both 1-of-k and 1-of-(k-1) encodings should be
supported.  All of these variants are useful to avoid having to cons up a
bunch of extra vectors at classification time.
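A minimal sketch of that API shape, with an abstract class deriving the convenience forms from a single primitive, so callers never allocate vectors they don't need.  All names are illustrative, not actual Mahout classes:

```java
// Sketch of the proposed classifier contract: the primitive is the
// 1-of-(k-1) form; 1-of-k and the two-class scalar form are derived.
abstract class AbstractClassifier {
    abstract int numCategories();

    // 1-of-(k-1) encoding: scores for categories 1..k-1 are written into
    // result; category 0 is implied by the remainder.
    abstract void classify(double[] instance, double[] result);

    // 1-of-k encoding, derived from the primitive above.
    void classifyFull(double[] instance, double[] result) {
        double[] partial = new double[numCategories() - 1];
        classify(instance, partial);
        double rest = 0;
        for (int i = 0; i < partial.length; i++) {
            result[i + 1] = partial[i];
            rest += partial[i];
        }
        result[0] = 1.0 - rest;   // implied category gets the remainder
    }

    // Two-class convenience: just the score for the first explicit class.
    double classifyScalar(double[] instance) {
        double[] r = new double[1];
        classify(instance, r);
        return r[0];
    }
}
```

With this shape a two-class caller pays for a single double, and a k-class caller can reuse one result buffer across calls.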



On Tue, Jun 22, 2010 at 10:07 AM, Robin Anil <ro...@gmail.com> wrote:

> The reason I said models should be generic is that they can then be read
> across classifiers.  For example, a classifier that does nearest-centroid
> matching, as in NB, could use the output of k-means.  Or a margin trained
> using Pegasos could be used by any SVM classifier.  That's all.
>

Re: classifier architecture needed

Posted by Robin Anil <ro...@gmail.com>.
The Wikipedia unigram dictionary is 381MB on disk, and bigram and trigram
sizes will explode like anything.  So the Vectorizer could be a pass-through
if each job reads vectors that were generated in parallel, or it could
convert on the fly if using the randomizer.

The reason I said models should be generic is that they can then be read
across classifiers.  For example, a classifier that does nearest-centroid
matching, as in NB, could use the output of k-means.  Or a margin trained
using Pegasos could be used by any SVM classifier.  That's all.


Robin

Re: classifier architecture needed

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 22, 2010 at 9:47 AM, Robin Anil <ro...@gmail.com> wrote:

> > Again, I would recommend a blob as the on-disk
> > format.
>
> Why a blob?  Why not a flexible multi-list of matrices and vectors?
> Is there any model storing byte-level information?


The SGD model has a parameter vector as well as a trace dictionary.  The
parameter vector is fine as a vector.  The trace is an int-to-string multi-map.

The random forest has several hundred decision trees in the model.  Each
decision tree is a collection of rules which contain a variable name and a
cut-point.

Re: classifier architecture needed

Posted by Robin Anil <ro...@gmail.com>.
> Again, I would recommend a blob as the on-disk
> format.

Why a blob?  Why not a flexible multi-list of matrices and vectors?
Is there any model storing byte-level information?

Re: classifier architecture needed

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 22, 2010 at 9:44 AM, Robin Anil <ro...@gmail.com> wrote:

> > > On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil <ro...@gmail.com>
> > wrote:
> > >
> > >> A Classifier Training Job will take a Trainer, and a Vector location
> and
> > >
> > > produce a Model
> >
>
> How about a transform layer which converts on-disk data into vectors
> seamlessly?  That should solve the issue.


This is what I meant when I said that the Classifier Training Job should
accept a Vectorizer.

This transform layer needs to be close to the object doing the training to
avoid data transfer and copy costs.  It should also be flexible in that
different kinds of transforms should be injectable into the job.

Re: classifier architecture needed

Posted by Robin Anil <ro...@gmail.com>.
>
> >
> >
> > On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> >> A Classifier Training Job will take a Trainer, and a Vector location and
> >
> > produce a Model
>

How about a transform layer which converts on-disk data into vectors
seamlessly?  That should solve the issue.

Re: classifier architecture needed

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 22, 2010 at 9:25 AM, Ted Dunning <te...@gmail.com> wrote:

>
>
> On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil <ro...@gmail.com> wrote:
>
>> A Classifier Training Job will take a Trainer, and a Vector location and
>
> produce a Model
>>
>
> No.  Well, not exclusively, anyway.  We can't be limited to reading vectors
> due to the fairly substantial (3x) performance hit that that would entail.
>


Ahhh... last minute thought here.

The output here also needs to include the vectorizer state.  Many
vectorizers require stored state in order to be repeatable.  For instance, a
dictionary-based vectorizer might develop a dictionary as it sees terms
during training.  Another example is AdaptiveWordValueEncoder, which doesn't
use a dictionary but does keep counts to help with weighting.  And finally,
all of the hashed representations should produce some kind of trace history
so that they can be reverse-engineered.  Again, I would recommend a blob as
the on-disk format.
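The repeatability point above can be made concrete with a small sketch: a dictionary-based vectorizer whose state is saved with the model, so test-time encoding replays exactly the training-time term-to-index mapping.  Names are illustrative, not Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// A vectorizer whose dictionary grows during training and must therefore be
// serialized alongside the model to make test-time encoding repeatable.
class StatefulVectorizer {
    private final Map<String, Integer> dictionary;

    StatefulVectorizer() { this(new HashMap<>()); }

    // Restores a vectorizer from saved state, e.g. deserialized with the model.
    StatefulVectorizer(Map<String, Integer> savedDictionary) {
        this.dictionary = savedDictionary;
    }

    int encode(String term, boolean training) {
        if (training) {
            // Training time: unseen terms extend the dictionary.
            return dictionary.computeIfAbsent(term, t -> dictionary.size());
        }
        // Test time: the dictionary is frozen; unseen terms are dropped.
        return dictionary.getOrDefault(term, -1);
    }

    Map<String, Integer> state() { return dictionary; }
}
```

Without the saved state, the same document would map to different indices on a fresh vectorizer and the model's weights would be meaningless.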

Re: classifier architecture needed

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil <ro...@gmail.com> wrote:

> See how this sounds (listing the requirements):
>
> A model can be a class with a list of matrices and a list of vectors.  Each
> algorithm takes care of naming these matrices/vectors and reading and
> writing values to it (similar to Datastore).
>

I think that this is too restrictive.  I would prefer that models be
essentially opaque blobs in wire or disk format, but that whatever model you
have can be instantiated using a standard factory.


> All Classifiers will work with vectors
> All Trainers will work with vectors
>

Yes.

There should also be a standard framework that allows conversion to vectors
without moving vectors across IPC links.


Multiple techniques to vectorize data.
> - Dictionary based
> - Random hashing based
>

Yes.

Random hashing especially needs to handle interaction variables well.

We need fielded data as well and support for continuous variables in
addition to text-like data.
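One reason hashing handles interaction variables well is that the interaction of two fielded values can be hashed as a combined key, so no cross-product dictionary is ever materialized.  A minimal sketch of the idea; the class and method names are hypothetical:

```java
// Hashed encoding of fielded categorical features and their interactions.
// The cross of two fields never needs an explicit dictionary: the combined
// key is simply hashed into the same fixed-cardinality space.
class InteractionHasher {
    private final int cardinality;

    InteractionHasher(int cardinality) { this.cardinality = cardinality; }

    // Encode a single fielded value, e.g. field "country", value "de".
    int index(String field, String value) {
        return Math.floorMod((field + "=" + value).hashCode(), cardinality);
    }

    // Encode the interaction of two fielded values as one hashed feature.
    int interactionIndex(String f1, String v1, String f2, String v2) {
        return Math.floorMod((f1 + "=" + v1 + "*" + f2 + "=" + v2).hashCode(),
                             cardinality);
    }
}
```

Continuous variables fit the same scheme: hash the field name for the index and use the numeric value as the feature weight rather than 1.0.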


> A Classifier Training Job will take a Trainer, and a Vector location and
> produce a Model
>

No.  Well, not exclusively, anyway.  We can't be limited to reading vectors
due to the fairly substantial (3x) performance hit that that would entail.

I would recommend that a training job will take a Trainer, a Vectorizer, an
InputSource and produce a Model.
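The Trainer + Vectorizer + InputSource composition could be sketched like this, with vectorization happening on the fly next to the learner rather than through a materialized vector file.  All interface and class names are illustrative, not Mahout API:

```java
import java.util.Iterator;

// The three pluggable pieces of the proposed training job.
interface InputSource { Iterator<String> records(); }
interface Vectorizer  { double[] vectorize(String record); }
interface Trainer {
    void train(double[] instance);
    TrainedModel build();
}
interface TrainedModel { double classifyScalar(double[] instance); }

// The job itself is just composition: read a record, vectorize it in place,
// feed it to the trainer, and emit the finished model at the end.
final class TrainingJob {
    static TrainedModel run(InputSource source, Vectorizer vectorizer,
                            Trainer trainer) {
        Iterator<String> it = source.records();
        while (it.hasNext()) {
            trainer.train(vectorizer.vectorize(it.next()));
        }
        return trainer.build();
    }
}
```

Because the vectorizer is injected, the same job can read raw CSV, SVMLight records, or pre-vectorized data (with a pass-through vectorizer) without the trainer knowing the difference.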

A Classifier Testing Job will take a Classifier, a Model and a Test Vector
> location and produce statistics
>

Again, need a vectorizer.


> A Classifier Job will take a Classifier, a Model and a vector location and
> label the vectors with probability or likelihood values and return 1 or top
> N labels
>

Again, need a vectorizer.

I think that we should designate a list of preserved fields, which may be no
more than the id, to which the output should be attached.  Possible forms are:

top k labels (with or without probabilities)
all probabilities


> Model Storage
> Datastore has a list of matrices and a list of vectors.  It can be
> serialized to disk, or stored in HBase or any other hashtable
> implementation (e.g., memcached).
>

I prefer that a model be a blob, preferably somewhat inspectable, such as
with JSON formats.

Re: classifier architecture needed

Posted by Robin Anil <ro...@gmail.com>.
See how this sounds (listing the requirements):

A model can be a class with a list of matrices and a list of vectors.  Each
algorithm takes care of naming these matrices/vectors and reading and
writing values to it (similar to Datastore).
All Classifiers will work with vectors
All Trainers will work with vectors

Multiple techniques to vectorize data.
- Dictionary based
- Random hashing based

A Classifier Training Job will take a Trainer, and a Vector location and
produce a Model
A Classifier Testing Job will take a Classifier, a Model and a Test Vector
location and produce statistics
A Classifier Job will take a Classifier, a Model, and a vector location and
label the vectors with probability or likelihood values, returning the top 1
or top N labels


Model Storage
Datastore has a list of matrices and a list of vectors.  It can be serialized
to disk, or stored in HBase or any other hashtable implementation (e.g., memcached).

Re: classifier architecture needed

Posted by Isabel Drost <is...@apache.org>.
On 21.06.2010 Ted Dunning wrote:
> I would like to start a discussion about a framework that we can fit all of
> these approaches together in much the same way that the recommendations
> stuff has such nice pluggable properties.

+1 Like the ideas that have been tossed around in this discussion.  Do you think 
those are specific enough already to open a JIRA issue to track further 
discussion?  (I didn't see one opened so far - sorry if I overlooked it.)

Isabel

Re: classifier architecture needed

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 22, 2010 at 8:33 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:
>
> > We really need to have a simple way to integrate all of the input
> processing
> > options easily into new and old code
>
> More or less, what we need is a pipeline that can ingest many different
> kinds of things and output Vectors, right (assuming Bayes is converted to
> use vectors)?  Ideally it would be easy to configure, work well in a
> cluster, and output various formats (for instance, frequent item sets as well).
>

Yes.

But classifiers need to be able to do the conversion on the fly as well.
Just recently a client had a model with almost 20 interaction variables
among categorical variables with a large number of possible values.  Very
soon, there will be interaction variables against text.  This means that the
vector form of the training or test examples will be 2-3x larger than the
original form.  SGD is already likely to be I/O bound, and killing
performance further seems a very bad idea.

We also very much need to be good at both command line and programmatic
composition of these pipelines.

>
> > - model storage
> >
> > It would be lovely if we could instantiate a model from a stored form
> > without even knowing what kind of learning produced the model.  All of the
> > classifiers and clustering algorithms should put out something that can
> be
> > instantiated this way.  I used Gson in the SGD code and found it pretty
> > congenial, but I didn't encode the class of the classifier, nor did I
> > provide a classifier abstract class.  I don't know what k-means or Canopy
> > clustering produce, nor random forests or Naive Bayes, but I am sure that
> > all of them are highly specific to the particular kind of model.
>
> Just to be clear, are you suggesting that, ultimately, the models can be
> used interchangeably?
>

Yes.

And in combination.  It is common for models and clusterers to be used as
feature extractors for other models (or clustering).  Model combination like
this is what won the Netflix Prize.

The most common use case, however, is evaluation.  It is important to be
able to throw any model at exactly the same test set and evaluation code.

An as-yet-unexplored use case (for Mahout) is to use feature sharding, as in
the random forest, with alternative models.

Another use case is semi-supervised learning where you train a model and use
the output of the model against a larger corpus as training data for another
model.  We shouldn't be limited as to which models go where in such an
architecture.

Re: classifier architecture needed

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:

> We are now beginning to have lots of classifiers in Mahout.  The naive
> Bayes, complementary naive Bayes and random Forest grandfathers have been
> joined by my recent SGD and Zhao Zhendong's prolific set of approaches for
> logistic regression and SVM variants.
> 
> All of these implementations have similar characteristics and virtually none
> are inter-operable.
> 
> Even worse, the model produced by a clustering system is really just like a
> model produced by a classifier, so clustering stands to increase the number
> of sources of incompatible classifiers even further.  Altogether, we probably
> have a dozen ways of building classifiers.
> 
> I would like to start a discussion about a framework that we can fit all of
> these approaches together in much the same way that the recommendations
> stuff has such nice pluggable properties.
> 
> As I see it, the opportunities for commonality (aka our current
> deficiencies)  include:
> 
> - original input format reading
> 
> -- the naive Bayes code uses an ad hoc format similar to what Jason Rennie
> used for 20 news groups.  This code uses Lucene 3.0 style analyzers.
> 
> -- Zhao uses something a lot like SVMLight input format
> 
> -- The SGD code looks at CSV data
> 
> -- Drew wrote some Avro document code
> 
> -- Lucene has been used as a source of vectors for clustering
> 
> My summary here is that the Lucene analyzers look like they could be used
> very effectively for our purposes.  We would need to write AttributeFilters
> that do two kinds of vectorization (random projection and dictionary based).
> We also should have 4 standard input format parsers as examples (CSV,
> SVMLight, VowpalWabbit, current naive Bayes format).
> 
> We need something simple and general that subsumes all of these input use
> cases.
> 
> - conversion to vectors
> 
> -- The SGD code introduced conversion via random projection
> 
> -- Naive Bayes has some dictionary-based conversions
> 
> -- Other stuff does this or that
> 
> This should be subsumed into the AttributeFilters that I mentioned above.
> We really just need random projection and Salton style vector space models.
> Clearly, we should allow direct input of vectors as well in case the user
> is producing them for us.
> 
> - command line option processing
> 
> We really need to have a simple way to integrate all of the input processing
> options easily into new and old code

More or less, what we need is a pipeline that can ingest many different kinds of things and output Vectors, right (assuming Bayes is converted to use vectors)?  Ideally it would be easy to configure, work well in a cluster, and output various formats (for instance, frequent item sets as well).



> 
> - model storage
> 
> It would be lovely if we could instantiate a model from a stored form
> without even knowing what kind of learning produced the model.  All of the
> classifiers and clustering algorithms should put out something that can be
> instantiated this way.  I used Gson in the SGD code and found it pretty
> congenial, but I didn't encode the class of the classifier, nor did I
> provide a classifier abstract class.  I don't know what k-means or Canopy
> clustering produce, nor random forests or Naive Bayes, but I am sure that
> all of them are highly specific to the particular kind of model.

Just to be clear, are you suggesting that, ultimately, the models can be used interchangeably?

> 
> I don't know what is best here, but we definitely need something more common
> than what we have.
> 
> What do others think?


Definitely agree.