Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2010/06/07 09:02:05 UTC

producing vectors from composite documents

I just posted another MAHOUT-228 patch that contains a working and almost
useful version of SGD learning.  See
https://issues.apache.org/jira/browse/MAHOUT-228 for details and the patch.

As part of this effort, I have included my latest ideas on how to vectorize
composite documents that combine numerical, textual and categorical
information, especially where the textual and categorical data have an
unbounded vocabulary.  An important improvement in the current code is that
as vectors are created, a trace is kept which allows the resulting vectors
to be reverse-engineered.  The basic idea is that values are inserted one or
more times into a vector at hashed locations that depend on the name of the
variable (for numeric values) or on the name of the variable and the word
being inserted (for textual and categorical data).  More than one probe is
used for textual and categorical data to ameliorate the problem of
collisions, and I am undecided on the virtues of multiple probes for
numerical variables.  Each update to the vector leaves a trace in a
dictionary so that an arbitrary vector can be reversed back to the original
data reasonably well.  Currently, the code is highly CSV-centric since that
is the data I am working with first.
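
To make that concrete, here is a minimal sketch of the hashing scheme (the
class and method names are made up for illustration and are not the actual
MAHOUT-228 code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not the patch itself: mixed records are hashed
// into a fixed-size vector and every update is recorded in a trace map so
// the vector can be explained afterwards.
public class HashedEncoderSketch {
  private final double[] vector;
  private final Map<Integer, List<String>> trace = new HashMap<>();

  public HashedEncoderSketch(int cardinality) {
    this.vector = new double[cardinality];
  }

  // Numeric variable: the probe depends only on the variable name.
  public void addNumeric(String name, double value) {
    int slot = probe(name, 0);
    vector[slot] += value;
    trace.computeIfAbsent(slot, k -> new ArrayList<>()).add(name);
  }

  // Text or categorical variable: several probes keyed on name + word,
  // which softens the effect of collisions.
  public void addWord(String name, String word, int numProbes) {
    for (int p = 0; p < numProbes; p++) {
      int slot = probe(name + ':' + word, p);
      vector[slot] += 1.0;
      trace.computeIfAbsent(slot, k -> new ArrayList<>()).add(name + ':' + word);
    }
  }

  // Explain a slot by listing the features that were hashed into it.
  public String explain(int slot) {
    return slot + " = " + vector[slot] + " from " + trace.get(slot);
  }

  private int probe(String key, int probe) {
    // Any reasonable string hash works here; the real code uses MurmurHash.
    return Math.floorMod((key + '#' + probe).hashCode(), vector.length);
  }
}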

The place to see this system in action is
examples/src/...classifier/sgd/TrainLogistic and the key class is
CsvRecordFactory.  A sample command line is on the JIRA ticket.  I would
love comments and critiques on how this might fit into other areas of our
systems.

Drew, I especially would like to hear what you think about how this would
relate to the Avro document stuff you did.

Robin, Jeff, I am curious what you think about these ideas relative to
k-means and other clustering techniques.

Jake, I am very interested in what you think of this kind of technique
relative to your needs for large SVD's.  It relates closely, of course, to
random projections (it is essentially a random projection).

Re: producing vectors from composite documents

Posted by Ted Dunning <te...@gmail.com>.
It is too simple right now as I explained in my previous message, but it is
just a stake in the ground.

On Mon, Jun 7, 2010 at 8:19 AM, Drew Farris <dr...@gmail.com> wrote:

> The approach to the vectorization sounds like a pretty neat idea. I'm
> interested in seeing how the vector + trace to human readable dump code
> works too.
>

Re: producing vectors from composite documents

Posted by Drew Farris <dr...@gmail.com>.
Ok, I'll look at it from this perspective. FWIW, I don't have any good tools
for creating asd's from various formats at this point either.

On Mon, Jun 7, 2010 at 12:12 PM, Ted Dunning <te...@gmail.com> wrote:

> I was thinking that my CsvRecordFactory should become more generic and have
> three different vector generating calls.
>
> One would be from csv data.
>
> One would be from a map.
>
> And a third would be from asd.
>
>

Re: producing vectors from composite documents

Posted by Ted Dunning <te...@gmail.com>.
I was thinking that my CsvRecordFactory should become more generic and have
three different vector generating calls.

One would be from csv data.

One would be from a map.

And a third would be from asd.
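
Roughly, I am imagining something along these lines (the names are only
illustrative, nothing is committed):

import java.util.Map;

import org.apache.avro.generic.GenericRecord;
import org.apache.mahout.math.Vector;

// Hypothetical shape of the generalized record factory; each call fills the
// target vector from a different source representation.
public interface RecordVectorizer {
  void vectorize(String csvLine, Vector target);
  void vectorize(Map<String, Object> fields, Vector target);
  void vectorize(GenericRecord avroStructuredDocument, Vector target);
}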

On Mon, Jun 7, 2010 at 8:19 AM, Drew Farris <dr...@gmail.com> wrote:

> At first read of your description it seems like we could consider
> implementing a csv -> avro structured document mapping and then modify the
> vectorization/learning code to take avro structured documents (asd) as
> input.
>

Re: producing vectors from composite documents

Posted by Drew Farris <dr...@gmail.com>.
On Mon, Jun 7, 2010 at 3:02 AM, Ted Dunning <te...@gmail.com> wrote:

>
> Drew, I especially would like to hear what you think about how this would
> relate to the Avro document stuff you did.
>

At first read of your description it seems like we could consider
implementing a csv -> avro structured document mapping and then modify the
vectorization/learning code to take avro structured documents (asd) as
input. Users could develop their own processors to convert from their format
to asd, or perhaps something in the vein of Solr's DataImportHandler could be
used as a general tool to load from databases or XML into asd. At a lower
level, this makes your proposed Lucene IndexWriter interface implementation
very attractive.

I am a little skeptical about the utility of Avro for storing the vectors
themselves. Some very early tests suggested that using Avro reflection to
derive a schema from an existing class (such as one of Mahout's vector
classes) did not produce a large win in terms of performance or space, but
more work in that direction still needs to happen.

I'll take a look at how the data loading relates to the rest of the code
in your patch and come back with questions.

The approach to the vectorization sounds like a pretty neat idea. I'm
interested in seeing how the vector + trace to human readable dump code
works too.

Drew

Re: producing vectors from composite documents

Posted by Olivier Grisel <ol...@ensta.org>.
2010/6/9 Olivier Grisel <ol...@ensta.org>:
> 2010/6/9 Jake Mannix <ja...@gmail.com>:
>> On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <ol...@ensta.org>wrote:
>>
>>> 2010/6/8 Ted Dunning <te...@gmail.com>:
>>> > Got it.
>>> >
>>> > This really needs to be done before vectorization, but you can segregate
>>> the
>>> > output vector for different handling by passing in a view to different
>>> parts
>>> > of the vector.
>>> >
>>> > My recommendation is that you apply IDF using the weight dictionary in
>>> the
>>> > vectorizer.  That will let you have multiple text fields with different
>>> > weighting schemes but still put all the results into a single result
>>> vector.
>>> >  As a side effect, if you put everything into a vector of dimension 1,
>>> then
>>> > you get multi-field weighted inputs for free.
>>>
>>> Instead of storing the exact IDF values in an explicit dictionary,
>>> one could use a counting Bloom filter data structure to reduce the
>>> memory footprint and speed up the lookups (though Lucene is able to
>>> handle millions of terms without any perf issues).
>>>
>>
>> Using counting bloom filters is a really good idea here.  Do you know
>> any good java implementations of these?
>
> Nope, but AFAIK Ted's combination of probes logic + Murmurhash
> implementation does 90% of the work.

Actually, according to this very interesting blog post by Jonathan
Ellis, both Hadoop and Cassandra provide tested and efficient counting
filters based on MurmurHash:

 http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: producing vectors from composite documents

Posted by Olivier Grisel <ol...@ensta.org>.
2010/6/9 Jake Mannix <ja...@gmail.com>:
> On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <ol...@ensta.org>wrote:
>
>> 2010/6/8 Ted Dunning <te...@gmail.com>:
>> > Got it.
>> >
>> > This really needs to be done before vectorization, but you can segregate
>> the
>> > output vector for different handling by passing in a view to different
>> parts
>> > of the vector.
>> >
>> > My recommendation is that you apply IDF using the weight dictionary in
>> the
>> > vectorizer.  That will let you have multiple text fields with different
>> > weighting schemes but still put all the results into a single result
>> vector.
>> >  As a side effect, if you put everything into a vector of dimension 1,
>> then
>> > you get multi-field weighted inputs for free.
>>
>> Instead of storing the exact IDF values in an explicit dictionary,
>> one could use a counting Bloom filter data structure to reduce the
>> memory footprint and speed up the lookups (though Lucene is able to
>> handle millions of terms without any perf issues).
>>
>
> Using counting bloom filters is a really good idea here.  Do you know
> any good java implementations of these?

Nope, but AFAIK Ted's combination of probes logic + Murmurhash
implementation does 90% of the work.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: producing vectors from composite documents

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <ol...@ensta.org>wrote:

> 2010/6/8 Ted Dunning <te...@gmail.com>:
> > Got it.
> >
> > This really needs to be done before vectorization, but you can segregate
> the
> > output vector for different handling by passing in a view to different
> parts
> > of the vector.
> >
> > My recommendation is that you apply IDF using the weight dictionary in
> the
> > vectorizer.  That will let you have multiple text fields with different
> > weighting schemes but still put all the results into a single result
> vector.
> >  As a side effect, if you put everything into a vector of dimension 1,
> then
> > you get multi-field weighted inputs for free.
>
> Instead of storing the exact IDF values in an explicit dictionary,
> one could use a counting Bloom filter data structure to reduce the
> memory footprint and speed up the lookups (though Lucene is able to
> handle millions of terms without any perf issues).
>

Using counting bloom filters is a really good idea here.  Do you know
any good java implementations of these?

  -jake

Re: producing vectors from composite documents

Posted by Olivier Grisel <ol...@ensta.org>.
2010/6/8 Ted Dunning <te...@gmail.com>:
> Got it.
>
> This really needs to be done before vectorization, but you can segregate the
> output vector for different handling by passing in a view to different parts
> of the vector.
>
> My recommendation is that you apply IDF using the weight dictionary in the
> vectorizer.  That will let you have multiple text fields with different
> weighting schemes but still put all the results into a single result vector.
>  As a side effect, if you put everything into a vector of dimension 1, then
> you get multi-field weighted inputs for free.

Instead of storing the exact IDF values in an explicit dictionary,
one could use a counting Bloom filter data structure to reduce the
memory footprint and speed up the lookups (though Lucene is able to
handle millions of terms without any perf issues).
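
A minimal sketch of the idea, just to illustrate (a real implementation
would use MurmurHash with different seeds rather than String.hashCode):

import java.util.List;

// Illustrative counting Bloom filter for approximate document frequencies;
// taking the minimum over the probes keeps the estimate close to the true
// count unless every probe collides with other terms.
public class CountingBloomDf {
  private final int[] counts;
  private final int numHashes;

  public CountingBloomDf(int size, int numHashes) {
    this.counts = new int[size];
    this.numHashes = numHashes;
  }

  // Record one document's distinct terms.
  public void addDocument(List<String> distinctTerms) {
    for (String term : distinctTerms) {
      for (int h = 0; h < numHashes; h++) {
        counts[slot(term, h)]++;
      }
    }
  }

  // Approximate document frequency: the minimum count over all probes.
  public int approximateDf(String term) {
    int df = Integer.MAX_VALUE;
    for (int h = 0; h < numHashes; h++) {
      df = Math.min(df, counts[slot(term, h)]);
    }
    return df;
  }

  // IDF derived from the approximate document frequency.
  public double idf(String term, int numDocs) {
    return Math.log((double) numDocs / (1 + approximateDf(term)));
  }

  private int slot(String term, int h) {
    // Stand-in for MurmurHash with a different seed per probe.
    return Math.floorMod((term + '#' + h).hashCode(), counts.length);
  }
}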

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: producing vectors from composite documents

Posted by Ted Dunning <te...@gmail.com>.
Got it.

This really needs to be done before vectorization, but you can segregate the
output vector for different handling by passing in a view to different parts
of the vector.

My recommendation is that you apply IDF using the weight dictionary in the
vectorizer.  That will let you have multiple text fields with different
weighting schemes but still put all the results into a single result vector.
 As a side effect, if you put everything into a vector of dimension 1, then
you get multi-field weighted inputs for free.
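
For example, assuming Mahout's Vector.viewPart gives a write-through view
into the backing vector (I believe it does), the segregation could look
like this; the field layout below is just made up for illustration:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class VectorViewSketch {
  public static void main(String[] args) {
    Vector out = new DenseVector(10000);

    // Hash text features into the first 9,000 slots; IDF weighting applies
    // only to this range.
    Vector textPart = out.viewPart(0, 9000);
    textPart.set(1234, 2.0);     // stand-in for a hashed, IDF-weighted term

    // Numeric features get their own small range and are left un-weighted.
    Vector numericPart = out.viewPart(9000, 1000);
    numericPart.set(0, 37.5);    // e.g. the "age" field

    // Both writes land in the single backing vector.
    System.out.println(out.get(1234) + " " + out.get(9000));
  }
}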

On Tue, Jun 8, 2010 at 11:01 AM, Robin Anil <ro...@gmail.com> wrote:

> > I think that you misunderstand me a little bit, and I know that I am not
> > understanding what you are saying here.
> >
>
> Okay, let's take an example. Say you have users with a text bio and
> features like age, weight, etc. The text is sparse and we need to apply
> tfidf on it, while we should not on age and weight. So in this case, we
> need to hash the text into some range and do a one-pass or two-pass idf
> calculation in that range. We need to leave the other features alone,
> right? Otherwise they will be squashed to log(1) by the idf.

Re: producing vectors from composite documents

Posted by Robin Anil <ro...@gmail.com>.
>
> > First the documents are normalized, then normalized sums of weights are
> > computed instead of computing the word count. This is the key step which
> > boosts the classification accuracy on text. I can move this to the
> document
> > vectorizer.
> >
>
> And the idf weighting can be done on-line or in two passes.  The two pass
> approach is more precise, but not necessarily very much.  A compromise is
> also possible where the first pass is a small subset of documents (say
> 10,000 docs).  That keeps it really fast and that dictionary can be used as
> the seed for the adaptive weighting (or just used directly).
>
We already do the two-pass approach in the tfidf stage of the Dictionary
vectorizer. What I need for CNB are the sums of weights of features and
labels and the total sum of weights of all cells in the matrix.

>
> > With this new vectorization, can we hash sparse features to a particular
> > id range and ask the tfidf job to compute tfidf for just that portion?
> > That means I can delete the tfidf calculation code for CNB. This can
> > exist as a separate vectorizer, and both clustering and classification
> > can use it. It will partially kill its online nature. We can circumvent
> > that using a Document-Frequency Map to compute approximate tf-idf during
> > the online stage.
> >
>
> I think that you misunderstand me a little bit, and I know that I am not
> understanding what you are saying here.
>

Okay, let's take an example. Say you have users with a text bio and
features like age, weight, etc. The text is sparse and we need to apply
tfidf on it, while we should not on age and weight. So in this case, we
need to hash the text into some range and do a one-pass or two-pass idf
calculation in that range. We need to leave the other features alone,
right? Otherwise they will be squashed to log(1) by the idf.

Re: producing vectors from composite documents

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 8, 2010 at 12:20 AM, Robin Anil <ro...@gmail.com> wrote:

>
> >  Generally, Euclidean or L_1 distances are about all that makes sense for
> > these vectors.  For clustering, I worry that I don't take IDF into
> account
> > (there is some provision for that in the AdaptiveWordEncoder, though).
>  For
> > most learning applications, IDF shouldn't matter except that it might
> make
> > convergence faster by reducing the size of the largest eigenvalue.
>
> NB/CNB actually computes a sort of cluster average for a class and selects
> the one with the minimum cosine/dot product
>

Dot product <=> Euclidean metric.

Tfidf will definitely help a lot for this sort of thing.


> First the documents are normalized, then normalized sums of weights are
> computed instead of computing the word count. This is the key step which
> boosts the classification accuracy on text. I can move this to the document
> vectorizer.
>

And the idf weighting can be done on-line or in two passes.  The two pass
approach is more precise, but not necessarily very much.  A compromise is
also possible where the first pass is a small subset of documents (say
10,000 docs).  That keeps it really fast and that dictionary can be used as
the seed for the adaptive weighting (or just used directly).
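
A sketch of that compromise, with made-up names (not code from the patch):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Seed document frequencies from a small sample, then keep updating them
// online so later IDF estimates converge toward the full-corpus values.
public class AdaptiveIdfSketch {
  private final Map<String, Integer> df = new HashMap<>();
  private int numDocs;

  // First pass over a small sample (say 10,000 docs) to seed the counts.
  public void seed(Iterable<List<String>> sampleDocs) {
    for (List<String> doc : sampleDocs) {
      observe(doc);
    }
  }

  // Online update: every document seen afterwards also refines the counts.
  public void observe(List<String> distinctTerms) {
    numDocs++;
    for (String term : distinctTerms) {
      df.merge(term, 1, Integer::sum);
    }
  }

  // Current IDF estimate; gets better as more documents are observed.
  public double idf(String term) {
    return Math.log((numDocs + 1.0) / (df.getOrDefault(term, 0) + 1.0));
  }
}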


> With this new vectorization, can we hash sparse features to a particular id
> range and ask the tfidf job to compute tfidf for just that portion? That
> means I can delete the tfidf calculation code for CNB. This can exist as a
> separate vectorizer, and both clustering and classification can use it. It
> will partially kill its online nature. We can circumvent that using a
> Document-Frequency Map to compute approximate tf-idf during the online
> stage.
>

I think that you misunderstand me a little bit, and I know that I am not
understanding what you are saying here.

The new vectorizer can definitely do IDF weighting and that definitely makes
it good as a driver for classifiers and clustering.

One important thing about the IDF weighting and conversion is that except
for the weights, the conversion to vector is stateless.  The same document
will convert to the same pattern of non-zero elements in the output vector.
 If you have a constant weight dictionary, then the same document will
convert to the exact same output vector no matter what.

Moreover, if you use adaptive weighting, the weights should be pretty close
to the actual weights after you have seen a few thousand documents.  If you
have a global estimate from a random sample of documents then the results
should be close to right no matter what.

So I don't understand the two comments that you make "hash sparse features
to a particular id
range and ... compute tfidf for just that portion" and  "partially kill its
online nature".  Can you explain these a little more?

Does "hashing to a particular id range" mean only hash some words and not
others?  Or does it mean hash to a sub-range of the output vector?  Why
would you do either of these?  Regardless of why, I think the answer may be
yes.  The first can be handled by only vectorizing some fields, the second
can be handled by passing a view of a sub-vector to the vectorizer instead
of passing the entire vector.  Again, though, since I don't understand why
you would need to do this, I think I misunderstand your question.

And how can we "kill the online nature" of a stateless algorithm?

>
> Actually, I am not seeing the big need at the moment to reverse engineer
> the data; it's good for debugging but not so necessary in production. Let's
> prioritise getting this plugged in, and work on this after.
>

The major use cases here are cluster dumping and model export from logistic
regression.

I agree with the down-prioritization of the accurate dumping.

Re: producing vectors from composite documents

Posted by Robin Anil <ro...@gmail.com>.
Now I am going to put some comments from the CNB classifier perspective;
let's see how we can integrate things together.


>  Generally, Euclidean or L_1 distances are about all that makes sense for
> these vectors.  For clustering, I worry that I don't take IDF into account
> (there is some provision for that in the AdaptiveWordEncoder, though).  For
> most learning applications, IDF shouldn't matter except that it might make
> convergence faster by reducing the size of the largest eigenvalue.

NB/CNB actually computes a sort of cluster average for a class and selects
the one with the minimum cosine/dot product

First the documents are normalized, then normalized sums of weights are
computed instead of computing the word count. This is the key step which
boosts the classification accuracy on text. I can move this to the document
vectorizer.
Once we have the tfidf vectors, CNB computes the sum of weights for each
feature, for each label, and for the whole matrix. So it sort of looks like
the 2x2 LLR computation table. Then the actual feature weight in a class
vector is calculated using these four values.
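
In code form, that is roughly the following (illustrative only; the actual
smoothing and normalization in the CNB code may differ):

// Sketch of a complement-NB style weight built from the four sums above:
// the feature-in-label cell, the feature total, the label total, and the
// grand total of the whole feature-by-label matrix.
public class CnbWeightSketch {
  public static double complementWeight(double featureLabelSum,
                                        double featureSum,
                                        double labelSum,
                                        double totalSum,
                                        int vocabularySize) {
    // How much of this feature's weight falls outside the label.
    double complementFeature = featureSum - featureLabelSum;
    // Total weight of everything outside the label.
    double complementTotal = totalSum - labelSum;
    // Laplace-smoothed complement likelihood, taken on a log scale.
    return Math.log((complementFeature + 1.0) / (complementTotal + vocabularySize));
  }
}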

With this new vectorization, can we hash sparse features to a particular id
range and ask the tfidf job to compute tfidf for just that portion? That
means I can delete the tfidf calculation code for CNB. This can exist as a
separate vectorizer, and both clustering and classification can use it. It
will partially kill its online nature. We can circumvent that using a
Document-Frequency Map to compute approximate tf-idf during the online stage.


>
> > About the dictionary-based trace: I need to actually see how the trace
> > is useful. Do you keep track of the most important feature from those
> > that go into a particular hashed location?
> >
>
> Right now, I pretty much assume that there are no collisions.  This isn't
> always all that great an assumption.  To get rid of that problem, it is
> probably pretty easy to do a relaxation step where I generate an
> explanation
> for a vector and then generate the vector for that explanation.  If there
> are collisions, this last vector will differ slightly from the original and
> the explanation of the difference should get us much closer to the
> original.
>
Actually, I am not seeing the big need at the moment to reverse engineer the
data; it's good for debugging but not so necessary in production. Let's
prioritise getting this plugged in, and work on this after.

Re: producing vectors from composite documents

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jun 7, 2010 at 7:50 AM, Robin Anil <ro...@gmail.com> wrote:

> .... But won't the similarity metrics need to be
> different for such a vector?
>

Generally, Euclidean or L_1 distances are about all that makes sense for
these vectors.  For clustering, I worry that I don't take IDF into account
(there is some provision for that in the AdaptiveWordEncoder, though).  For
most learning applications, IDF shouldn't matter except that it might make
convergence faster by reducing the size of the largest eigenvalue.


> About the dictionary-based trace: I need to actually see how the trace is
> useful. Do you keep track of the most important feature from those that go
> into a particular hashed location?
>

Right now, I pretty much assume that there are no collisions.  This isn't
always all that great an assumption.  To get rid of that problem, it is
probably pretty easy to do a relaxation step where I generate an explanation
for a vector and then generate the vector for that explanation.  If there
are collisions, this last vector will differ slightly from the original and
the explanation of the difference should get us much closer to the original.

> In clustering, we need to show the cluster centroids and the top features in
> it for text. I don't know if that is useful for types of data other than
> text. With these vectors how would the cluster dumper change?
>

I think that for general ideas about vector content, this style should be
fine.  The cluster dumper could just print the explanation of the input
vectors.  I have found explanations very helpful in mixed settings so far,
but that is because I am printing out classification models.  For most
clustering applications, that might be a different story.

Re: producing vectors from composite documents

Posted by Robin Anil <ro...@gmail.com>.
>
> Robin, Jeff, I am curious what you think about these ideas relative to
> k-means and other clustering techniques.
>
This is exciting; having a generic vectorizer (without the dependency on a
dictionary) is great! It will fit right in with the rest of clustering,
since it is all using vectors after all. But won't the similarity metrics
need to be different for such a vector?

About the dictionary-based trace: I need to actually see how the trace is
useful. Do you keep track of the most important feature from those that go
into a particular hashed location?

In clustering, we need to show the cluster centroids and the top features in
it for text. I don't know if that is useful for types of data other than
text. With these vectors how would the cluster dumper change?

Robin