Posted to user@mahout.apache.org by Max Heimel <mh...@googlemail.com> on 2009/11/29 22:44:32 UTC

Feature Extraction for TU Berlin Winter of Code project

Hello everybody,

I'd like to discuss some issues with you regarding the 3rd layer of
our proposed tuwoc architecture: extracting features from the
preprocessed, crawled blog entries.

Currently we use a rather simple process: for each document we compute
the TF-IDF of all terms in the corpus. This is implemented
straightforwardly as a three-step map/reduce pipeline. First, a map job
computes (and serializes to HBase) TF histograms for each document.
Then a reduce job computes the IDF of all terms occurring in the corpus
and serializes the list of term/IDF pairs to HDFS. Finally, a third map
job uses the serialized term/IDF pairs and TF histograms to compute a
feature vector for each document. So basically, our feature space is
the set of all term/IDF pairs.
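
For clarity, here is a minimal single-machine Python sketch of the same
three steps (plain dictionaries stand in for the HBase/HDFS
serialization, and all names are made up for illustration; the real
implementation is of course the map/reduce pipeline described above):

import math
from collections import Counter

# Step 1 (map job): per-document term frequency (TF) histograms.
def term_frequencies(docs):
    return {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

# Step 2 (reduce job): IDF for every term that occurs in the corpus.
def inverse_document_frequencies(tf_histograms):
    n_docs = len(tf_histograms)
    doc_freq = Counter()
    for tf in tf_histograms.values():
        doc_freq.update(tf.keys())          # count documents, not occurrences
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Step 3 (second map job): combine TF and IDF into one sparse vector per document.
def tfidf_vectors(tf_histograms, idf):
    return {doc_id: {term: count * idf[term] for term, count in tf.items()}
            for doc_id, tf in tf_histograms.items()}

docs = {"d1": "apache mahout canopy clustering",
        "d2": "clustering crawled blog entries"}
tf = term_frequencies(docs)
idf = inverse_document_frequencies(tf)
vectors = tfidf_vectors(tf, idf)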

I currently see one major issue with this approach: our feature space
- and thus our feature vectors - will probably get very large when
many documents are scanned. This will obviously lead to the clustering
being very slow. We probably will have to perform some kind of feature
reduction during the feature extraction to get smaller - but still
expressive - feature vectors. One idea would be, for example, to
perform PCA on the "complete" feature vectors in order to identify
dimensions that can be pruned. However, this might be computationally
too expensive. Since I am not very experienced in this field, I hoped
that some of you could share your thoughts or suggestions on the issue.
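
To make the PCA idea a bit more concrete, here is a tiny dense NumPy
sketch of what I mean (illustration only; our real vectors are sparse
and far too large for this naive approach, which is exactly the cost
concern mentioned above):

import numpy as np

# Tiny documents x terms TF-IDF matrix (dense toy data; the real matrix
# would be sparse and much larger).
X = np.array([[1.2, 0.0, 0.3, 0.0],
              [0.9, 0.1, 0.0, 0.4],
              [0.0, 1.1, 0.2, 0.5]])

k = 2                                  # number of principal components to keep
Xc = X - X.mean(axis=0)                # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                    # top-k principal directions in term space
X_reduced = Xc @ components.T          # documents projected into k dimensions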

Cheers,
Max

Re: Feature Extraction for TU Berlin Winter of Code project

Posted by Ted Dunning <te...@gmail.com>.
Check the mailing list archives for more details and discussion.  The best
single paper on the topic is this one:

http://arxiv.org/abs/0909.4061

On Tue, Dec 1, 2009 at 10:06 AM, Max Heimel <mh...@googlemail.com> wrote:

> Do you have further details on the SVD implementation you were
> mentioning (e.g., a paper detailing the approach)? I once implemented
> a streaming SVD algorithm for Cell and would be interested to see the
> algorithm that was used for Mahout. Thanks.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Feature Extraction for TU Berlin Winter of Code project

Posted by Max Heimel <mh...@googlemail.com>.
Hi Ted,

First of all, thanks for your answer.

> In any case, you should look at the formats already in use by the Mahout
> tools and match them to what you do.
>
>> I currently see one major issue with this approach: our feature space
>> - and thus our feature vectors - will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
>
> Not necessarily.  If you use sparse vectors, then the dot products required
> are pretty fast.

Yes, we are already using the Mahout-provided sparse vector implementation.

> In addition, you could make use of some of the random indexing or simhash
> methods.  At the simplest level, simply assign a low-dimensional, unit-length
> random vector to each feature.  Then the vector for each document would be the
> IDF-weighted sum of the vectors of its features.  If you use a moderate- to
> high-dimensional vector (> 50 dimensions, < 500), this will give you nice
> fixed-length vectors that will pretty accurately preserve dot products of the
> original data.

Random indexing sounds like a really interesting technique for feature
extraction in the tuwoc project. I think this approach, combined with
more selective data cleansing and feature selection, should provide us
with reasonable feature vectors. Thanks for the hint!

Cheers
Max

PS:
Do you have further details on the SVD implementation you were
mentioning (e.g., a paper detailing the approach)? I once implemented
a streaming SVD algorithm for Cell and would be interested to see the
algorithm that was used for Mahout. Thanks.


On Mon, Nov 30, 2009 at 12:25 AM, Ted Dunning <te...@gmail.com> wrote:
> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <mh...@googlemail.com> wrote:
>
>> ...
>> Currently we use a rather simple process: for each document we compute
>> the TF-IDF of all terms in the corpus. This is implemented
>> straightforwardly as a three-step map/reduce pipeline. First, a map job
>> computes (and serializes to HBase) TF histograms for each document.
>> Then a reduce job computes the IDF of all terms occurring in the corpus
>> and serializes the list of term/IDF pairs to HDFS. Finally, a third map
>> job uses the serialized term/IDF pairs and TF histograms to compute a
>> feature vector for each document. So basically, our feature space is
>> the set of all term/IDF pairs.
>>
>
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
>
> In any case, you should look at the formats already in use by the Mahout
> tools and match them to what you do.
>
>> I currently see one major issue with this approach: our feature space
>> - and thus our feature vectors - will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
>
> Not necessarily.  If you use sparse vectors, then the dot products required
> are pretty fast.
>
>
>> We probably will have to perform some kind of feature
>> reduction during the feature extraction to get smaller - but still
>> expressive - feature vectors. One idea would be, for example, to
>> perform PCA on the "complete" feature vectors in order to identify
>> dimensions that can be pruned. However, this might be computationally
>> too expensive. Since I am not very experienced in this field, I hoped
>> that some of you could share your thoughts or suggestions on the issue.
>>
>
> Read the archives for Mahout-dev.  Several developers are working on SVD
> decomposition algorithms that run in parallel to do what you need.
>
> In addition, you could make use of some of the random indexing or simhash
> methods.  At the simplest level, simply assign a low-dimensional, unit-length
> random vector to each feature.  Then the vector for each document would be the
> IDF-weighted sum of the vectors of its features.  If you use a moderate- to
> high-dimensional vector (> 50 dimensions, < 500), this will give you nice
> fixed-length vectors that will pretty accurately preserve dot products of the
> original data.  This is pretty much what a simhash does, except that simhashes
> go one step further and binarize the vectors.
>
> A slightly more involved approach would be to use these same initial random
> vectors and update them so that the new vector for each term or other
> feature is the IDF-weighted sum of the random vectors of the terms or
> features that occur nearby.  This makes it so that terms that appear in
> similar contexts will have similar directions, which is a simple step
> toward getting vectors with semantic values.  This can be done very
> efficiently in a single map-reduce pass over your data.  You would use
> these feature vectors just like you did with the simhash technique.  This
> class of techniques is known as random indexing.
>
> A third approach is to use random projections to make computation of the SVD
> of the document x term matrix tractable.  In fact, for your purposes, you can
> use a truncated and simplified algorithm that only computes the singular
> vectors for the terms, which you would then use in a similar way to the first
> two methods.  Alternatively, you can also use the truncated algorithm to only
> compute the singular vectors for the documents without ever computing term
> vectors.  This is useful if all you want to do is cluster the documents and
> don't really care about the term vectors.  As I mentioned, there should be a
> bit of code appearing shortly that does some or all of this.  The speed of
> this approach should be quite high.
>

Re: Feature Extraction for TU Berlin Winter of Code project

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 29, 2009, at 11:25 PM, Ted Dunning wrote:

> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <mh...@googlemail.com> wrote:
> 
>> ...
>> Currently we use a rather simple process: for each document we compute
>> the TF-IDF of all terms in the corpus. This is implemented
>> straightforwardly as a three-step map/reduce pipeline. First, a map job
>> computes (and serializes to HBase) TF histograms for each document.
>> Then a reduce job computes the IDF of all terms occurring in the corpus
>> and serializes the list of term/IDF pairs to HDFS. Finally, a third map
>> job uses the serialized term/IDF pairs and TF histograms to compute a
>> feature vector for each document. So basically, our feature space is
>> the set of all term/IDF pairs.
>> 
> 
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
> 
> In any case, you should look at the formats already in use by the Mahout
> tools and match them to what you do.

The categorization stuff also has an M/R-ready TF/IDF calculator.  It would be nice to see this abstracted from the categorization code and just used to produce various outputs as needed.

-Grant

Re: Feature Extraction for TU Berlin Winter of Code project

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <mh...@googlemail.com> wrote:

> ...
> Currently we use a rather simple process: for each document we compute
> the TF-IDF of all terms in the corpus. This is implemented
> straightforwardly as a three-step map/reduce pipeline. First, a map job
> computes (and serializes to HBase) TF histograms for each document.
> Then a reduce job computes the IDF of all terms occurring in the corpus
> and serializes the list of term/IDF pairs to HDFS. Finally, a third map
> job uses the serialized term/IDF pairs and TF histograms to compute a
> feature vector for each document. So basically, our feature space is
> the set of all term/IDF pairs.
>

You could also use the code in Mahout that allows you to write a Lucene
index as a sequence of document vectors.

In any case, you should look at the formats already in use by the Mahout
tools and match them to what you do.

> I currently see one major issue with this approach: our feature space
> - and thus our feature vectors - will probably get very large when
> many documents are scanned. This will obviously lead to the clustering
> being very slow.


Not necessarily.  If you use sparse vectors, then the dot products required
are pretty fast.
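
(To illustrate why: with a sparse representation, a dot product only
touches the non-zero entries, as in the toy Python sketch below, no
matter how large the nominal feature space is.)

def sparse_dot(a, b):
    # a, b are {feature: weight} maps; the cost is proportional to the
    # number of non-zeros in the smaller vector, not to the dimensionality.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(f, 0.0) for f, w in a.items())

d1 = {"mahout": 2.76, "clustering": 1.1}
d2 = {"clustering": 2.2, "blog": 0.7}
similarity = sparse_dot(d1, d2)   # only the shared terms contribute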


> We probably will have to perform some kind of feature
> reduction during the feature extraction to get smaller - but still
> expressive - feature vectors. One idea would be, for example, to
> perform PCA on the "complete" feature vectors in order to identify
> dimensions that can be pruned. However, this might be computationally
> too expensive. Since I am not very experienced in this field, I hoped
> that some of you could share your thoughts or suggestions on the issue.
>

Read the archives for Mahout-dev.  Several developers are working on SVD
decomposition algorithms that run in parallel to do what you need.

In addition, you could make use of some of the random indexing or simhash
methods.  At the simplest level, simply assign a low-dimensional, unit-length
random vector to each feature.  Then the vector for each document would be the
IDF-weighted sum of the vectors of its features.  If you use a moderate- to
high-dimensional vector (> 50 dimensions, < 500), this will give you nice
fixed-length vectors that will pretty accurately preserve dot products of the
original data.  This is pretty much what a simhash does, except that simhashes
go one step further and binarize the vectors.
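
For what it's worth, here is a minimal NumPy sketch of that simplest
variant (the term-to-seed hashing, the dimensionality, and the IDF
numbers are just illustrative choices, nothing Mahout-specific):

import zlib
import numpy as np

DIM = 128  # a fixed, moderate dimensionality for every document vector

def random_unit_vector(term, dim=DIM):
    # Deterministic seed per term, so the same term always gets the same vector.
    rng = np.random.default_rng(zlib.crc32(term.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def document_vector(term_counts, idf, dim=DIM):
    # IDF-weighted sum of the random vectors of the document's terms.
    v = np.zeros(dim)
    for term, count in term_counts.items():
        v += count * idf.get(term, 0.0) * random_unit_vector(term, dim)
    return v

def simhash_bits(vector):
    # The simhash variant: go one step further and keep only the signs.
    return (vector > 0).astype(np.uint8)

idf = {"mahout": 2.3, "clustering": 1.1, "blog": 0.7}   # made-up IDF values
doc = {"mahout": 2, "clustering": 1}                    # TF histogram of one document
v = document_vector(doc, idf)
bits = simhash_bits(v)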

A slightly more involved approach would be to use these same initial random
vectors and update them so that the new vector for each term or other
feature is the IDF-weighted sum of the random vectors of the terms or
features that occur nearby.  This makes it so that terms that appear in
similar contexts will have similar directions, which is a simple step
toward getting vectors with semantic values.  This can be done very
efficiently in a single map-reduce pass over your data.  You would use
these feature vectors just like you did with the simhash technique.  This
class of techniques is known as random indexing.
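
In the same toy style, the context-update step could look roughly like
this (the window size and the weighting are arbitrary illustrative
choices; a real implementation would do this as the single map-reduce
pass described above):

import zlib
import numpy as np

DIM = 128

def random_unit_vector(term, dim=DIM):
    # Same deterministic per-term random vector as in the previous sketch.
    rng = np.random.default_rng(zlib.crc32(term.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def context_vectors(tokenized_docs, idf, dim=DIM, window=2):
    # For every term, accumulate the IDF-weighted random vectors of the
    # terms that occur within +/- `window` positions of it.
    ctx = {}
    for tokens in tokenized_docs:
        for i, term in enumerate(tokens):
            acc = ctx.setdefault(term, np.zeros(dim))
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    acc += idf.get(tokens[j], 0.0) * random_unit_vector(tokens[j], dim)
    return ctx

idf = {"mahout": 2.3, "clustering": 1.1, "blog": 0.7}   # made-up IDF values
docs = [["apache", "mahout", "clustering"],
        ["blog", "clustering", "with", "mahout"]]
term_vectors = context_vectors(docs, idf)
# Terms sharing contexts ("mahout", "clustering") end up pointing in similar
# directions; document vectors can then be built from these term vectors just
# as in the sketch above.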

A third approach is to use random projections to make computation of the SVD
of the document x term matrix tractable.  In fact, for your purposes, you can
use a truncated and simplified algorithm that only computes the singular
vectors for the terms, which you would then use in a similar way to the first
two methods.  Alternatively, you can also use the truncated algorithm to only
compute the singular vectors for the documents without ever computing term
vectors.  This is useful if all you want to do is cluster the documents and
don't really care about the term vectors.  As I mentioned, there should be a
bit of code appearing shortly that does some or all of this.  The speed of
this approach should be quite high.
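
For intuition, here is a rough dense NumPy sketch of one standard way
such a random-projection SVD can be done (this is not the Mahout code
being discussed, just the textbook recipe from the randomized-SVD
literature, e.g. the arXiv paper linked elsewhere in this thread; it
omits the power iterations and sparse matrix handling you would want
in practice):

import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    # Approximate rank-k SVD of A (documents x terms) via a random projection:
    # sample the range of A, orthonormalize, then decompose the small matrix.
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], k + oversample))  # random projection
    Y = A @ omega                      # sample of the column space of A
    Q, _ = np.linalg.qr(Y)             # orthonormal basis for that sample
    B = Q.T @ A                        # small (k + oversample) x terms matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]     # document-side vectors, values, term-side vectors

A = np.random.default_rng(1).random((20, 50))   # toy document x term matrix
U, s, Vt = randomized_svd(A, k=5)
# U (scaled by s) gives low-dimensional document representations for clustering;
# the rows of Vt are the corresponding term-side singular vectors.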