Posted to user@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2014/03/18 22:40:31 UTC

Text clustering with hashing vector encoders

Hi all,

Would it be possible to use hashing vector encoders for text clustering
just like when classifying?

Currently we vectorize using a dictionary where we map each token to a
fixed position in the dictionary. After the clustering we then have to
retrieve the dictionary to determine the cluster labels.
This is quite a complex process where multiple outputs are read and written
over the course of the clustering run.

I think it would be great if both algorithms could use the same encoding
process but I don't know if this is possible.

The problem is that we lose the mapping between token and position when
hashing. We need this mapping to determine cluster labels.

However, maybe we could make it so hashed encoders can be used and
determining the top labels is left to the user. This might be a possibility
because I noticed a problem with the current cluster labeling code. This is
what happens: first documents are vectorized with TF-IDF and clustered. Then
the labels are ranked, but again according to TF-IDF instead of TF. So it
is possible that a token becomes the top-ranked label even though it is
rare within the cluster; the document with that token is in the cluster
because of other tokens. If the labels were determined based on a TF score
within the cluster I think you would get better labels, but this requires
a post-processing step over your original data to do a TF count.
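
To make that concrete, here is a rough non-Hadoop sketch of that
post-processing step in plain Java (the names are made up for illustration,
not existing Mahout code): count raw TF per cluster and take the top terms
as labels.

import java.util.*;
import java.util.stream.Collectors;

// Sketch: rank cluster labels by term frequency *within* each cluster,
// as a post-processing step over the original tokenized documents.
// Cluster assignments are assumed to be known; all names are illustrative.
public class ClusterLabelsByTf {

  // clusterId -> tokenized documents assigned to that cluster
  public static Map<Integer, List<String>> topLabels(
      Map<Integer, List<List<String>>> docsByCluster, int topN) {

    Map<Integer, List<String>> labels = new HashMap<>();
    for (Map.Entry<Integer, List<List<String>>> e : docsByCluster.entrySet()) {
      Map<String, Integer> tf = new HashMap<>();
      for (List<String> doc : e.getValue()) {
        for (String token : doc) {
          tf.merge(token, 1, Integer::sum);   // raw TF inside this cluster
        }
      }
      List<String> top = tf.entrySet().stream()
          .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
          .limit(topN)
          .map(Map.Entry::getKey)
          .collect(Collectors.toList());
      labels.put(e.getKey(), top);
    }
    return labels;
  }
}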

Thoughts?

Cheers,

Frank

Re: Text clustering with hashing vector encoders

Posted by Ted Dunning <te...@gmail.com>.
With text hashing, you have an issue because of collisions.  In spite of
this, you get good results and can decrease the dimension of the data
substantially using a single hashed location.

If you use more than one probe, the probability that two words will hash to
exactly the same two locations is much lower than the probability of a
single collision.  This means that there is a much better chance of being
able to distinguish important differences.  Practically, this allows you to
use yet smaller vectors to represent your data.  The downside of this is
that you will have more non-zero values, which increases the amount of math
that you actually need to do.  The trade-off is more complex than it looks
since the models you build from these values often have dense vectors
internally.  Thus, it may be good to hash down to 100K dimensions with
several probes rather than down to 100 million dimensions with a single
probe.

In the limit, you are basically doing a random projection except that by
preserving substantial sparsity, you can still get the benefits of L1
regularization.
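
To make the multiple-probe idea concrete, here is a rough sketch in plain
Java (illustrative only, not the Mahout encoder API; the class and method
names are made up).  Each token adds its weight at several hashed positions,
so two tokens only become indistinguishable when every one of their probes
collides:

import java.util.Arrays;

// Minimal multi-probe hashing encoder sketch (illustrative, not Mahout's API).
// Each token contributes its weight at numProbes hashed positions; two tokens
// are only confused if *all* of their probe positions collide.
public class MultiProbeEncoder {
  private final int dimension;
  private final int numProbes;

  public MultiProbeEncoder(int dimension, int numProbes) {
    this.dimension = dimension;
    this.numProbes = numProbes;
  }

  public void addToVector(String token, double weight, double[] vector) {
    for (int probe = 0; probe < numProbes; probe++) {
      // Seed the hash with the probe index so each probe lands elsewhere.
      int h = (token + "#" + probe).hashCode();
      int index = Math.floorMod(h, dimension);
      vector[index] += weight;              // collisions sum, they don't OR
    }
  }

  public static void main(String[] args) {
    MultiProbeEncoder enc = new MultiProbeEncoder(100_000, 2);
    double[] doc = new double[100_000];
    for (String token : "the quick brown fox the lazy dog".split(" ")) {
      enc.addToVector(token, 1.0, doc);     // TF-style counting; weights optional
    }
    System.out.println("non-zeros: " + Arrays.stream(doc).filter(x -> x != 0).count());
  }
}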




On Tue, Mar 18, 2014 at 9:23 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> How does "with multiple probes" affect distance preservation, and how would
> idf weighting get tricky just by hashing strings?
>
> Would we be computing distance between hashed strings, or distance between
> vectors based on counts of hashed strings?
>
>
> On Tue, Mar 18, 2014 at 8:50 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > +1 to this. We could then use Hamming Distance to compute the distances
> > between Hashed Vectors.
> >
> > We have  the code for HashedVector.java based on Moses Charikar's SimHash
> > paper.
> >
> >
> >
> >
> >
> >
> >
> > On Tuesday, March 18, 2014 7:14 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > Yes.  Hashing vector encoders will preserve distances when used with
> > multiple probes.
> >
> > Interpretation becomes somewhat difficult, but there is code available to
> > reverse engineer labels on hashed vectors.
> >
> > IDF weighting is slightly tricky, but quite doable if you keep a
> dictionary
> > of, say, the most common 50-200 thousand words and assume all others have
> > constant and equal frequency.
> >
> >
> >
> >
> > On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <frank@frankscholten.nl
> > >wrote:
> >
> > > Hi all,
> > >
> > > Would it be possible to use hashing vector encoders for text clustering
> > > just like when classifying?
> > >
> > > Currently we vectorize using a dictionary where we map each token to a
> > > fixed position in the dictionary. After the clustering we use have to
> > > retrieve the dictionary to determine the cluster labels.
> > > This is quite a complex process where multiple outputs are read and
> > written
> > > in the entire clustering process.
> > >
> > > I think it would be great if both algorithms could use the same
> encoding
> > > process but I don't know if this is possible.
> > >
> > > The problem is that we lose the mapping between token and position when
> > > hashing. We need this mapping to determine cluster labels.
> > >
> > > However, maybe we could make it so hashed encoders can be used and that
> > > determining top labels is left to the user. This might be a possibility
> > > because I noticed a problem with the current cluster labeling code.
> This
> > is
> > > what happens: first vectors are vectorized with TF-IDF and clustered.
> > Then
> > > the labels are ranked, but again according to TF-IDF, instead of TF. So
> > it
> > > is possible that a token becomes the top ranked label, even though it
> is
> > > rare within the cluster. The document with that token is in the cluster
> > > because of other tokens. If the labels are determined based on a TF
> score
> > > within the cluster I think you would have better labels. But this
> > requires
> > > a post-processing step on your original data and doing a TF count.
> > >
> > > Thoughts?
> > >
> > > Cheers,
> > >
> > > Frank
> > >
> >
>

Re: Text clustering with hashing vector encoders

Posted by Andrew Musselman <an...@gmail.com>.
How does "with multiple probes" affect distance preservation, and how would
idf weighting get tricky just by hashing strings?

Would we be computing distance between hashed strings, or distance between
vectors based on counts of hashed strings?


On Tue, Mar 18, 2014 at 8:50 PM, Suneel Marthi <su...@yahoo.com>wrote:

> +1 to this. We could then use Hamming Distance to compute the distances
> between Hashed Vectors.
>
> We have  the code for HashedVector.java based on Moses Charikar's SimHash
> paper.
>
>
>
>
>
>
>
> On Tuesday, March 18, 2014 7:14 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> Yes.  Hashing vector encoders will preserve distances when used with
> multiple probes.
>
> Interpretation becomes somewhat difficult, but there is code available to
> reverse engineer labels on hashed vectors.
>
> IDF weighting is slightly tricky, but quite doable if you keep a dictionary
> of, say, the most common 50-200 thousand words and assume all others have
> constant and equal frequency.
>
>
>
>
> On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <frank@frankscholten.nl
> >wrote:
>
> > Hi all,
> >
> > Would it be possible to use hashing vector encoders for text clustering
> > just like when classifying?
> >
> > Currently we vectorize using a dictionary where we map each token to a
> > fixed position in the dictionary. After the clustering we use have to
> > retrieve the dictionary to determine the cluster labels.
> > This is quite a complex process where multiple outputs are read and
> written
> > in the entire clustering process.
> >
> > I think it would be great if both algorithms could use the same encoding
> > process but I don't know if this is possible.
> >
> > The problem is that we lose the mapping between token and position when
> > hashing. We need this mapping to determine cluster labels.
> >
> > However, maybe we could make it so hashed encoders can be used and that
> > determining top labels is left to the user. This might be a possibility
> > because I noticed a problem with the current cluster labeling code. This
> is
> > what happens: first vectors are vectorized with TF-IDF and clustered.
> Then
> > the labels are ranked, but again according to TF-IDF, instead of TF. So
> it
> > is possible that a token becomes the top ranked label, even though it is
> > rare within the cluster. The document with that token is in the cluster
> > because of other tokens. If the labels are determined based on a TF score
> > within the cluster I think you would have better labels. But this
> requires
> > a post-processing step on your original data and doing a TF count.
> >
> > Thoughts?
> >
> > Cheers,
> >
> > Frank
> >
>

Re: Text clustering with hashing vector encoders

Posted by Suneel Marthi <su...@yahoo.com>.
+1 to this. We could then use Hamming distance to compute the distances between hashed vectors.

We have the code for HashedVector.java, based on Moses Charikar's SimHash paper.
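
For reference, a rough sketch of a Charikar-style signature in plain Java
(illustrative only, not the actual HashedVector code):

import java.util.Map;

// Rough SimHash sketch (illustrative; not Mahout's HashedVector).
// Collapses a weighted bag of tokens into a 64-bit signature; similar bags
// tend to produce signatures with small Hamming distance.
public final class SimHash {

  public static long signature(Map<String, Double> weightedTokens) {
    double[] acc = new double[64];
    for (Map.Entry<String, Double> e : weightedTokens.entrySet()) {
      long h = hash64(e.getKey());
      for (int bit = 0; bit < 64; bit++) {
        // Add the token weight if the bit is set, subtract it otherwise.
        acc[bit] += ((h >>> bit) & 1L) == 1L ? e.getValue() : -e.getValue();
      }
    }
    long sig = 0L;
    for (int bit = 0; bit < 64; bit++) {
      if (acc[bit] > 0) {
        sig |= 1L << bit;
      }
    }
    return sig;
  }

  public static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }

  private static long hash64(String s) {
    // Simple FNV-1a style hash; any decent 64-bit hash works here.
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }
}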

On Tuesday, March 18, 2014 7:14 PM, Ted Dunning <te...@gmail.com> wrote:
 
Yes.  Hashing vector encoders will preserve distances when used with
multiple probes.

Interpretation becomes somewhat difficult, but there is code available to
reverse engineer labels on hashed vectors.

IDF weighting is slightly tricky, but quite doable if you keep a dictionary
of, say, the most common 50-200 thousand words and assume all others have
constant and equal frequency.




On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <fr...@frankscholten.nl>wrote:

> Hi all,
>
> Would it be possible to use hashing vector encoders for text clustering
> just like when classifying?
>
> Currently we vectorize using a dictionary where we map each token to a
> fixed position in the dictionary. After the clustering we use have to
> retrieve the dictionary to determine the cluster labels.
> This is quite a complex process where multiple outputs are read and written
> in the entire clustering process.
>
> I think it would be great if both algorithms could use the same encoding
> process but I don't know if this is possible.
>
> The problem is that we lose the mapping between token and position when
> hashing. We need this mapping to determine cluster labels.
>
> However, maybe we could make it so hashed encoders can be used and that
> determining top labels is left to the user. This might be a possibility
> because I noticed a problem with the current cluster labeling code. This is
> what happens: first vectors are vectorized with TF-IDF and clustered. Then
> the labels are ranked, but again according to TF-IDF, instead of TF. So it
> is possible that a token becomes the top ranked label, even though it is
> rare within the cluster. The document with that token is in the cluster
> because of other tokens. If the labels are determined based on a TF score
> within the cluster I think you would have better labels. But this requires
> a post-processing step on your original data and doing a TF count.
>
> Thoughts?
>
> Cheers,
>
> Frank
>

Re: Text clustering with hashing vector encoders

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Mar 20, 2014 at 12:39 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> For representing the cluster we have a separate job that assigns users
> ("documents") to clusters and shows the most discriminating words for the
> cluster via the LogLikelihood class. The results are then visualized using
> http://wordcram.org/ for the whoah effect.
>

This is a good way to reverse engineer the hashing.

Re: Text clustering with hashing vector encoders

Posted by Frank Scholten <fr...@frankscholten.nl>.
Ah, interesting. I am going to try it out.

Thanks for your comments!


On Fri, Mar 21, 2014 at 9:29 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> Hi frank,
>
> no, no collocation job. You just take a big enough sample of documents and
> assign it to it's cluster with the learned ClusterClassifier. Parallel to
> that you count the total words in a guava multiset and the per cluster word
> counts in a multiset. The LogLikelihood class contains a convenient method
> that takes two multisets that you use for all clusters.
>
> there should be no need in starting a map reduce job for that, with some
> ram you can just stream the documents from the hdfs
>
>
>
>
> On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten <frank@frankscholten.nl
> >wrote:
>
> > Hi Johannes,
> >
> > Sounds good.
> >
> > The step for finding labels is still unclear to me. You use the
> > Loglikelihood class on the original documents? How? Or do you mean the
> > collocation job?
> >
> > Cheers,
> >
> > Frank
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
> > johannes.schulte@gmail.com> wrote:
> >
> > > Hi Frank, we are using a very similar system in production.
> > > Hashing text like data to a 50000 dimensional vector with two probes,
> and
> > > then applying tf-idf weighting.
> > >
> > > For IDF we dont keep a separate weight dictionary but just count the
> > > distinct training examples ("documents") that have a non null value per
> > > column.
> > > so there is a full idf vector that can be used.
> > > Instead of Euclidean Distance we use Cosine (Performance Reasons).
> > >
> > > The results are very good, building such a system is easy and maybe
> it's
> > > worth a try.
> > >
> > > For representing the cluster we have a separate job that assigns users
> > > ("documents") to clusters and shows the most discriminating words for
> the
> > > cluster via the LogLikelihood class. The results are then visualized
> > using
> > > http://wordcram.org/ for the whoah effect.
> > >
> > > Cheers,
> > >
> > > Johannes
> > >
> > >
> > > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <
> > frank@frankscholten.nl
> > > > >wrote:
> > > >
> > > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <
> ted.dunning@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Yes.  Hashing vector encoders will preserve distances when used
> > with
> > > > > > multiple probes.
> > > > > >
> > > > >
> > > > > So if a token occurs two times in a document the first token will
> be
> > > > mapped
> > > > > to a given location and when the token is hashed the second time it
> > > will
> > > > be
> > > > > mapped to a different location, right?
> > > > >
> > > >
> > > > No.  The same token will always hash to the same location(s).
> > > >
> > > >
> > > > > I am wondering if when many probes are used and a large enough
> vector
> > > > this
> > > > > process mimics TF weighting, since documents that have a high TF
> of a
> > > > given
> > > > > token will have the same positions marked in the vector. As Suneel
> > said
> > > > > when we then use the Hamming distance the vectors that are close to
> > > each
> > > > > other should be in the same cluster.
> > > > >
> > > >
> > > > Hamming distance doesn't quite work because you want to have
> collisions
> > > to
> > > > a sum rather than an OR.  Also, if you apply weights to the words,
> > these
> > > > weights will be added to all of the probe locations for the words.
> >  This
> > > > means we still need a plus/times/L2 dot product rather than an
> > > plus/AND/L1
> > > > dot product like the Hamming distance uses.
> > > >
> > > > >
> > > > > > Interpretation becomes somewhat difficult, but there is code
> > > available
> > > > to
> > > > > > reverse engineer labels on hashed vectors.
> > > > >
> > > > >
> > > > > I saw that AdaptiveWordEncoder has a built in dictionary so I can
> see
> > > > which
> > > > > words it has seen but I don't see how to go from a position or
> > several
> > > > > positions in the vector to labels. Is there an example in the code
> I
> > > can
> > > > > look at?
> > > > >
> > > >
> > > > Yes.  The newsgroups example applies.
> > > >
> > > > The AdaptiveWordEncoder counts word occurrences that it sees and uses
> > the
> > > > IDF based on the resulting counts.  This assumes that all instances
> of
> > > the
> > > > AWE will see the same rough distribution of words to work.  It is
> fine
> > > for
> > > > lots of applications and not fine for lots.
> > > >
> > > >
> > > > >
> > > > >
> > > > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > > > dictionary
> > > > > > of, say, the most common 50-200 thousand words and assume all
> > others
> > > > have
> > > > > > constant and equal frequency.
> > > > > >
> > > > >
> > > > > How would IDF weighting work in conjunction with hashing? First
> build
> > > up
> > > > a
> > > > > dictionary of 50-200 and pass that into the vector encoders? The
> > > drawback
> > > > > of this is that you have another pass through the data and another
> > > > 'input'
> > > > > to keep track of and configure. But maybe it has to be like that.
> > > >
> > > >
> > > > With hashing, you still have the option of applying a weight to the
> > > hashed
> > > > representation of each word.  The question is what weight.
> > > >
> > > > To build a small dictionary, you don't have to go through all of the
> > > data.
> > > >  Just enough to get reasonably accurate weights for most words.  All
> > > words
> > > > not yet seen can be assumed to be rare and thus get the nominal
> > > "rare-word"
> > > > weight.
> > > >
> > > > Keeping track of the dictionary of weights is, indeed, a pain.
> > > >
> > > >
> > > >
> > > > > The
> > > > > reason I like the hashed encoders is that vectorizing can be done
> in
> > a
> > > > > streaming manner at the last possible moment. With the current
> tools
> > > you
> > > > > have to do: data -> data2seq -> seq2sparse -> kmeans.
> > > > >
> > > >
> > > > Indeed.  That is the great virtue.
> > > >
> > > >
> > > > >
> > > > > If this approach is doable I would like to code up a Java
> non-Hadoop
> > > > > example using the Reuters dataset which vectorizes each doc using
> the
> > > > > hashing encoders, configures KMeans with Hamming distance and then
> > > write
> > > > > some code to get the labels.
> > > > >
> > > >
> > > > Use Euclidean distance, not Hamming.
> > > >
> > > > You can definitely use the AWE here if you randomize document
> ordering.
> > > >
> > >
> >
>

Re: Text clustering with hashing vector encoders

Posted by Johannes Schulte <jo...@gmail.com>.
Hi Frank,

no, no collocation job. You just take a big enough sample of documents and
assign each one to its cluster with the learned ClusterClassifier. In
parallel you count the total words in a Guava multiset and the per-cluster
word counts in a multiset. The LogLikelihood class contains a convenient
method that takes two multisets, which you use for all clusters.

There should be no need to start a MapReduce job for that; with some RAM
you can just stream the documents from HDFS.
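
Roughly like this, as a sketch, given the per-cluster multiset and the total
multiset (the LLR is computed directly here instead of calling a particular
Mahout method, and the names are made up):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import java.util.*;
import java.util.stream.Collectors;

// Sketch: pick the most discriminating words for one cluster via a 2x2
// log-likelihood ratio over (word in cluster) vs (word elsewhere).
// Tokenization and cluster assignment are assumed to happen elsewhere.
public class LlrClusterLabels {

  public static List<String> labelsForCluster(
      Multiset<String> clusterCounts, Multiset<String> totalCounts, int topN) {

    long inTotal = clusterCounts.size();
    long outTotal = totalCounts.size() - inTotal;

    Map<String, Double> scores = new HashMap<>();
    for (Multiset.Entry<String> e : clusterCounts.entrySet()) {
      long k11 = e.getCount();                                 // word, this cluster
      long k12 = totalCounts.count(e.getElement()) - k11;      // word, other clusters
      long k21 = inTotal - k11;                                // other words, this cluster
      long k22 = outTotal - k12;                               // other words, elsewhere
      scores.put(e.getElement(), llr(k11, k12, k21, k22));
    }
    return scores.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .limit(topN)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  // Dunning's G^2: 2 * (H(rows) + H(cols) - H(cells)), unnormalized entropies.
  static double llr(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
  }

  static double entropy(long... counts) {
    double sum = 0.0;
    double xLogX = 0.0;
    for (long c : counts) {
      sum += c;
      if (c > 0) {
        xLogX += c * Math.log(c);
      }
    }
    return sum > 0 ? sum * Math.log(sum) - xLogX : 0.0;
  }
}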




On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten <fr...@frankscholten.nl>wrote:

> Hi Johannes,
>
> Sounds good.
>
> The step for finding labels is still unclear to me. You use the
> Loglikelihood class on the original documents? How? Or do you mean the
> collocation job?
>
> Cheers,
>
> Frank
>
>
>
>
>
>
>
> On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> > Hi Frank, we are using a very similar system in production.
> > Hashing text like data to a 50000 dimensional vector with two probes, and
> > then applying tf-idf weighting.
> >
> > For IDF we dont keep a separate weight dictionary but just count the
> > distinct training examples ("documents") that have a non null value per
> > column.
> > so there is a full idf vector that can be used.
> > Instead of Euclidean Distance we use Cosine (Performance Reasons).
> >
> > The results are very good, building such a system is easy and maybe it's
> > worth a try.
> >
> > For representing the cluster we have a separate job that assigns users
> > ("documents") to clusters and shows the most discriminating words for the
> > cluster via the LogLikelihood class. The results are then visualized
> using
> > http://wordcram.org/ for the whoah effect.
> >
> > Cheers,
> >
> > Johannes
> >
> >
> > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <
> frank@frankscholten.nl
> > > >wrote:
> > >
> > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <ted.dunning@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Yes.  Hashing vector encoders will preserve distances when used
> with
> > > > > multiple probes.
> > > > >
> > > >
> > > > So if a token occurs two times in a document the first token will be
> > > mapped
> > > > to a given location and when the token is hashed the second time it
> > will
> > > be
> > > > mapped to a different location, right?
> > > >
> > >
> > > No.  The same token will always hash to the same location(s).
> > >
> > >
> > > > I am wondering if when many probes are used and a large enough vector
> > > this
> > > > process mimics TF weighting, since documents that have a high TF of a
> > > given
> > > > token will have the same positions marked in the vector. As Suneel
> said
> > > > when we then use the Hamming distance the vectors that are close to
> > each
> > > > other should be in the same cluster.
> > > >
> > >
> > > Hamming distance doesn't quite work because you want to have collisions
> > to
> > > a sum rather than an OR.  Also, if you apply weights to the words,
> these
> > > weights will be added to all of the probe locations for the words.
>  This
> > > means we still need a plus/times/L2 dot product rather than an
> > plus/AND/L1
> > > dot product like the Hamming distance uses.
> > >
> > > >
> > > > > Interpretation becomes somewhat difficult, but there is code
> > available
> > > to
> > > > > reverse engineer labels on hashed vectors.
> > > >
> > > >
> > > > I saw that AdaptiveWordEncoder has a built in dictionary so I can see
> > > which
> > > > words it has seen but I don't see how to go from a position or
> several
> > > > positions in the vector to labels. Is there an example in the code I
> > can
> > > > look at?
> > > >
> > >
> > > Yes.  The newsgroups example applies.
> > >
> > > The AdaptiveWordEncoder counts word occurrences that it sees and uses
> the
> > > IDF based on the resulting counts.  This assumes that all instances of
> > the
> > > AWE will see the same rough distribution of words to work.  It is fine
> > for
> > > lots of applications and not fine for lots.
> > >
> > >
> > > >
> > > >
> > > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > > dictionary
> > > > > of, say, the most common 50-200 thousand words and assume all
> others
> > > have
> > > > > constant and equal frequency.
> > > > >
> > > >
> > > > How would IDF weighting work in conjunction with hashing? First build
> > up
> > > a
> > > > dictionary of 50-200 and pass that into the vector encoders? The
> > drawback
> > > > of this is that you have another pass through the data and another
> > > 'input'
> > > > to keep track of and configure. But maybe it has to be like that.
> > >
> > >
> > > With hashing, you still have the option of applying a weight to the
> > hashed
> > > representation of each word.  The question is what weight.
> > >
> > > To build a small dictionary, you don't have to go through all of the
> > data.
> > >  Just enough to get reasonably accurate weights for most words.  All
> > words
> > > not yet seen can be assumed to be rare and thus get the nominal
> > "rare-word"
> > > weight.
> > >
> > > Keeping track of the dictionary of weights is, indeed, a pain.
> > >
> > >
> > >
> > > > The
> > > > reason I like the hashed encoders is that vectorizing can be done in
> a
> > > > streaming manner at the last possible moment. With the current tools
> > you
> > > > have to do: data -> data2seq -> seq2sparse -> kmeans.
> > > >
> > >
> > > Indeed.  That is the great virtue.
> > >
> > >
> > > >
> > > > If this approach is doable I would like to code up a Java non-Hadoop
> > > > example using the Reuters dataset which vectorizes each doc using the
> > > > hashing encoders, configures KMeans with Hamming distance and then
> > write
> > > > some code to get the labels.
> > > >
> > >
> > > Use Euclidean distance, not Hamming.
> > >
> > > You can definitely use the AWE here if you randomize document ordering.
> > >
> >
>

Re: Text clustering with hashing vector encoders

Posted by Frank Scholten <fr...@frankscholten.nl>.
Hi Johannes,

Sounds good.

The step for finding labels is still unclear to me. Do you use the
LogLikelihood class on the original documents? How? Or do you mean the
collocation job?

Cheers,

Frank

On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> Hi Frank, we are using a very similar system in production.
> Hashing text like data to a 50000 dimensional vector with two probes, and
> then applying tf-idf weighting.
>
> For IDF we dont keep a separate weight dictionary but just count the
> distinct training examples ("documents") that have a non null value per
> column.
> so there is a full idf vector that can be used.
> Instead of Euclidean Distance we use Cosine (Performance Reasons).
>
> The results are very good, building such a system is easy and maybe it's
> worth a try.
>
> For representing the cluster we have a separate job that assigns users
> ("documents") to clusters and shows the most discriminating words for the
> cluster via the LogLikelihood class. The results are then visualized using
> http://wordcram.org/ for the whoah effect.
>
> Cheers,
>
> Johannes
>
>
> On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <frank@frankscholten.nl
> > >wrote:
> >
> > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > Yes.  Hashing vector encoders will preserve distances when used with
> > > > multiple probes.
> > > >
> > >
> > > So if a token occurs two times in a document the first token will be
> > mapped
> > > to a given location and when the token is hashed the second time it
> will
> > be
> > > mapped to a different location, right?
> > >
> >
> > No.  The same token will always hash to the same location(s).
> >
> >
> > > I am wondering if when many probes are used and a large enough vector
> > this
> > > process mimics TF weighting, since documents that have a high TF of a
> > given
> > > token will have the same positions marked in the vector. As Suneel said
> > > when we then use the Hamming distance the vectors that are close to
> each
> > > other should be in the same cluster.
> > >
> >
> > Hamming distance doesn't quite work because you want to have collisions
> to
> > a sum rather than an OR.  Also, if you apply weights to the words, these
> > weights will be added to all of the probe locations for the words.  This
> > means we still need a plus/times/L2 dot product rather than an
> plus/AND/L1
> > dot product like the Hamming distance uses.
> >
> > >
> > > > Interpretation becomes somewhat difficult, but there is code
> available
> > to
> > > > reverse engineer labels on hashed vectors.
> > >
> > >
> > > I saw that AdaptiveWordEncoder has a built in dictionary so I can see
> > which
> > > words it has seen but I don't see how to go from a position or several
> > > positions in the vector to labels. Is there an example in the code I
> can
> > > look at?
> > >
> >
> > Yes.  The newsgroups example applies.
> >
> > The AdaptiveWordEncoder counts word occurrences that it sees and uses the
> > IDF based on the resulting counts.  This assumes that all instances of
> the
> > AWE will see the same rough distribution of words to work.  It is fine
> for
> > lots of applications and not fine for lots.
> >
> >
> > >
> > >
> > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > dictionary
> > > > of, say, the most common 50-200 thousand words and assume all others
> > have
> > > > constant and equal frequency.
> > > >
> > >
> > > How would IDF weighting work in conjunction with hashing? First build
> up
> > a
> > > dictionary of 50-200 and pass that into the vector encoders? The
> drawback
> > > of this is that you have another pass through the data and another
> > 'input'
> > > to keep track of and configure. But maybe it has to be like that.
> >
> >
> > With hashing, you still have the option of applying a weight to the
> hashed
> > representation of each word.  The question is what weight.
> >
> > To build a small dictionary, you don't have to go through all of the
> data.
> >  Just enough to get reasonably accurate weights for most words.  All
> words
> > not yet seen can be assumed to be rare and thus get the nominal
> "rare-word"
> > weight.
> >
> > Keeping track of the dictionary of weights is, indeed, a pain.
> >
> >
> >
> > > The
> > > reason I like the hashed encoders is that vectorizing can be done in a
> > > streaming manner at the last possible moment. With the current tools
> you
> > > have to do: data -> data2seq -> seq2sparse -> kmeans.
> > >
> >
> > Indeed.  That is the great virtue.
> >
> >
> > >
> > > If this approach is doable I would like to code up a Java non-Hadoop
> > > example using the Reuters dataset which vectorizes each doc using the
> > > hashing encoders, configures KMeans with Hamming distance and then
> write
> > > some code to get the labels.
> > >
> >
> > Use Euclidean distance, not Hamming.
> >
> > You can definitely use the AWE here if you randomize document ordering.
> >
>

Re: Text clustering with hashing vector encoders

Posted by Johannes Schulte <jo...@gmail.com>.
Hi Frank, we are using a very similar system in production:
hashing text-like data to a 50,000-dimensional vector with two probes and
then applying TF-IDF weighting.

For IDF we don't keep a separate weight dictionary but just count the
distinct training examples ("documents") that have a non-null value per
column, so there is a full IDF vector that can be used. Instead of
Euclidean distance we use cosine distance (for performance reasons).
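
Something like this sketch, with documents already hashed into sparse
index -> count maps (the IDF smoothing and all the names are illustrative,
not our production code):

import java.util.HashMap;
import java.util.Map;

// Sketch of IDF over the hashed columns: count, per hashed index, how many
// documents have a non-zero value there, turn that into an IDF vector, and
// compare weighted documents with cosine similarity. Sparse vectors are
// plain Map<Integer, Double> here to keep the example self-contained.
public class HashedIdf {

  public static double[] idfVector(Iterable<Map<Integer, Double>> docs, int dimension) {
    int[] docFreq = new int[dimension];
    int numDocs = 0;
    for (Map<Integer, Double> doc : docs) {
      numDocs++;
      for (Integer index : doc.keySet()) {
        docFreq[index]++;                       // one per document, not per occurrence
      }
    }
    double[] idf = new double[dimension];
    for (int i = 0; i < dimension; i++) {
      idf[i] = Math.log((numDocs + 1.0) / (docFreq[i] + 1.0));   // smoothed IDF
    }
    return idf;
  }

  public static Map<Integer, Double> weight(Map<Integer, Double> tf, double[] idf) {
    Map<Integer, Double> weighted = new HashMap<>();
    tf.forEach((index, count) -> weighted.put(index, count * idf[index]));
    return weighted;
  }

  public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
    }
    double n = norm(a) * norm(b);
    return n == 0.0 ? 0.0 : dot / n;
  }

  private static double norm(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return Math.sqrt(sum);
  }
}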

The results are very good; building such a system is easy and maybe it's
worth a try.

For representing the cluster we have a separate job that assigns users
("documents") to clusters and shows the most discriminating words for the
cluster via the LogLikelihood class. The results are then visualized using
http://wordcram.org/ for the whoah effect.

Cheers,

Johannes


On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <frank@frankscholten.nl
> >wrote:
>
> > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Yes.  Hashing vector encoders will preserve distances when used with
> > > multiple probes.
> > >
> >
> > So if a token occurs two times in a document the first token will be
> mapped
> > to a given location and when the token is hashed the second time it will
> be
> > mapped to a different location, right?
> >
>
> No.  The same token will always hash to the same location(s).
>
>
> > I am wondering if when many probes are used and a large enough vector
> this
> > process mimics TF weighting, since documents that have a high TF of a
> given
> > token will have the same positions marked in the vector. As Suneel said
> > when we then use the Hamming distance the vectors that are close to each
> > other should be in the same cluster.
> >
>
> Hamming distance doesn't quite work because you want to have collisions to
> a sum rather than an OR.  Also, if you apply weights to the words, these
> weights will be added to all of the probe locations for the words.  This
> means we still need a plus/times/L2 dot product rather than an plus/AND/L1
> dot product like the Hamming distance uses.
>
> >
> > > Interpretation becomes somewhat difficult, but there is code available
> to
> > > reverse engineer labels on hashed vectors.
> >
> >
> > I saw that AdaptiveWordEncoder has a built in dictionary so I can see
> which
> > words it has seen but I don't see how to go from a position or several
> > positions in the vector to labels. Is there an example in the code I can
> > look at?
> >
>
> Yes.  The newsgroups example applies.
>
> The AdaptiveWordEncoder counts word occurrences that it sees and uses the
> IDF based on the resulting counts.  This assumes that all instances of the
> AWE will see the same rough distribution of words to work.  It is fine for
> lots of applications and not fine for lots.
>
>
> >
> >
> > > IDF weighting is slightly tricky, but quite doable if you keep a
> > dictionary
> > > of, say, the most common 50-200 thousand words and assume all others
> have
> > > constant and equal frequency.
> > >
> >
> > How would IDF weighting work in conjunction with hashing? First build up
> a
> > dictionary of 50-200 and pass that into the vector encoders? The drawback
> > of this is that you have another pass through the data and another
> 'input'
> > to keep track of and configure. But maybe it has to be like that.
>
>
> With hashing, you still have the option of applying a weight to the hashed
> representation of each word.  The question is what weight.
>
> To build a small dictionary, you don't have to go through all of the data.
>  Just enough to get reasonably accurate weights for most words.  All words
> not yet seen can be assumed to be rare and thus get the nominal "rare-word"
> weight.
>
> Keeping track of the dictionary of weights is, indeed, a pain.
>
>
>
> > The
> > reason I like the hashed encoders is that vectorizing can be done in a
> > streaming manner at the last possible moment. With the current tools you
> > have to do: data -> data2seq -> seq2sparse -> kmeans.
> >
>
> Indeed.  That is the great virtue.
>
>
> >
> > If this approach is doable I would like to code up a Java non-Hadoop
> > example using the Reuters dataset which vectorizes each doc using the
> > hashing encoders, configures KMeans with Hamming distance and then write
> > some code to get the labels.
> >
>
> Use Euclidean distance, not Hamming.
>
> You can definitely use the AWE here if you randomize document ordering.
>

Re: Text clustering with hashing vector encoders

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <fr...@frankscholten.nl>wrote:

> On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Yes.  Hashing vector encoders will preserve distances when used with
> > multiple probes.
> >
>
> So if a token occurs two times in a document the first token will be mapped
> to a given location and when the token is hashed the second time it will be
> mapped to a different location, right?
>

No.  The same token will always hash to the same location(s).


> I am wondering if when many probes are used and a large enough vector this
> process mimics TF weighting, since documents that have a high TF of a given
> token will have the same positions marked in the vector. As Suneel said
> when we then use the Hamming distance the vectors that are close to each
> other should be in the same cluster.
>

Hamming distance doesn't quite work because you want collisions to add up
to a sum rather than to OR together.  Also, if you apply weights to the
words, these weights will be added to all of the probe locations for the
words.  This means we still need a plus/times/L2 dot product rather than a
plus/AND/L1 dot product like the Hamming distance uses.
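
A tiny sketch of the difference, with hashed vectors stored as dense arrays
for simplicity (illustrative only):

// Sketch contrasting the two dot products, on hashed vectors stored as
// dense double[] for simplicity. The plus/times/L2 form keeps the summed
// collision weights; the plus/AND/L1 (Hamming-style) form only sees whether
// a position is occupied at all.
public class DotProducts {

  // plus/times/L2: what k-means with Euclidean (or cosine) distance uses.
  public static double weightedDot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // plus/AND/L1: counts co-occupied positions, discarding the summed weights.
  public static int overlap(double[] a, double[] b) {
    int count = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) {
        count++;
      }
    }
    return count;
  }
}
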

>
> > Interpretation becomes somewhat difficult, but there is code available to
> > reverse engineer labels on hashed vectors.
>
>
> I saw that AdaptiveWordEncoder has a built in dictionary so I can see which
> words it has seen but I don't see how to go from a position or several
> positions in the vector to labels. Is there an example in the code I can
> look at?
>

Yes.  The newsgroups example applies.

The AdaptiveWordEncoder counts the word occurrences that it sees and uses an
IDF based on the resulting counts.  This assumes that all instances of the
AWE see roughly the same distribution of words in order to work.  It is fine
for lots of applications and not fine for lots of others.


>
>
> > IDF weighting is slightly tricky, but quite doable if you keep a
> dictionary
> > of, say, the most common 50-200 thousand words and assume all others have
> > constant and equal frequency.
> >
>
> How would IDF weighting work in conjunction with hashing? First build up a
> dictionary of 50-200 and pass that into the vector encoders? The drawback
> of this is that you have another pass through the data and another 'input'
> to keep track of and configure. But maybe it has to be like that.


With hashing, you still have the option of applying a weight to the hashed
representation of each word.  The question is what weight.

To build a small dictionary, you don't have to go through all of the data,
just enough to get reasonably accurate weights for most words.  All words
not yet seen can be assumed to be rare and thus get the nominal "rare-word"
weight.

Keeping track of the dictionary of weights is, indeed, a pain.
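
A sketch of what such a capped dictionary could look like (how you gather
the counts is up to you; the names are made up):

import java.util.HashMap;
import java.util.Map;

// Sketch of a capped IDF dictionary: keep weights for the most common words
// only, and hand every unseen word a fixed "rare-word" weight. Gathering the
// document-frequency counts (e.g. from a sample of the corpus) is assumed to
// happen elsewhere; all names are illustrative.
public class CappedIdfDictionary {
  private final Map<String, Double> idf = new HashMap<>();
  private final double rareWordWeight;

  public CappedIdfDictionary(Map<String, Long> docFreqOfCommonWords,
                             long numDocs, double rareWordWeight) {
    this.rareWordWeight = rareWordWeight;
    docFreqOfCommonWords.forEach((word, df) ->
        idf.put(word, Math.log((double) numDocs / (df + 1.0))));
  }

  public double weight(String word) {
    // Anything outside the most common 50-200 thousand words falls back
    // to the constant rare-word weight.
    return idf.getOrDefault(word, rareWordWeight);
  }
}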



> The
> reason I like the hashed encoders is that vectorizing can be done in a
> streaming manner at the last possible moment. With the current tools you
> have to do: data -> data2seq -> seq2sparse -> kmeans.
>

Indeed.  That is the great virtue.


>
> If this approach is doable I would like to code up a Java non-Hadoop
> example using the Reuters dataset which vectorizes each doc using the
> hashing encoders, configures KMeans with Hamming distance and then write
> some code to get the labels.
>

Use Euclidean distance, not Hamming.

You can definitely use the AWE here if you randomize document ordering.

Re: Text clustering with hashing vector encoders

Posted by Frank Scholten <fr...@frankscholten.nl>.
On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <te...@gmail.com> wrote:

> Yes.  Hashing vector encoders will preserve distances when used with
> multiple probes.
>

So if a token occurs two times in a document, the first occurrence will be
mapped to a given location and when the token is hashed the second time it
will be mapped to a different location, right?

I am wondering whether, when many probes and a large enough vector are used,
this process mimics TF weighting, since documents that have a high TF of a
given token will have the same positions marked in the vector. As Suneel
said, when we then use the Hamming distance, the vectors that are close to
each other should end up in the same cluster.


>
> Interpretation becomes somewhat difficult, but there is code available to
> reverse engineer labels on hashed vectors.


I saw that AdaptiveWordEncoder has a built-in dictionary so I can see which
words it has seen, but I don't see how to go from a position or several
positions in the vector to labels. Is there an example in the code I can
look at?


> IDF weighting is slightly tricky, but quite doable if you keep a dictionary
> of, say, the most common 50-200 thousand words and assume all others have
> constant and equal frequency.
>

How would IDF weighting work in conjunction with hashing? First build up a
dictionary of the 50-200 thousand most common words and pass that into the
vector encoders? The drawback of this is that you have another pass through
the data and another 'input' to keep track of and configure. But maybe it
has to be like that. The
reason I like the hashed encoders is that vectorizing can be done in a
streaming manner at the last possible moment. With the current tools you
have to do: data -> data2seq -> seq2sparse -> kmeans.

If this approach is doable I would like to code up a non-Hadoop Java
example using the Reuters dataset that vectorizes each doc with the
hashing encoders and configures KMeans with Hamming distance, and then
write some code to get the labels.
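
Roughly what I have in mind, as a sketch (a toy in-memory k-means over dense
arrays with Euclidean distance keeps it short; none of this is Mahout's
actual API and all names are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of a non-Hadoop pipeline: hash-encode each document on the fly and
// run a tiny in-memory k-means with Euclidean distance. Dense double[]
// vectors keep the example short; everything here is illustrative.
public class StreamingHashKMeans {
  static final int DIM = 10_000;
  static final int PROBES = 2;

  static double[] encode(String text) {
    double[] v = new double[DIM];
    for (String token : text.toLowerCase().split("\\W+")) {
      if (token.isEmpty()) continue;
      for (int p = 0; p < PROBES; p++) {
        v[Math.floorMod((token + "#" + p).hashCode(), DIM)] += 1.0;
      }
    }
    return v;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static int[] kmeans(List<double[]> docs, int k, int iterations, long seed) {
    Random rng = new Random(seed);
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = docs.get(rng.nextInt(docs.size())).clone();
    }
    int[] assignment = new int[docs.size()];
    for (int iter = 0; iter < iterations; iter++) {
      // Assign each document to its nearest centroid.
      for (int i = 0; i < docs.size(); i++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = distance(docs.get(i), centroids[c]);
          if (d < bestDist) { bestDist = d; best = c; }
        }
        assignment[i] = best;
      }
      // Recompute each centroid as the mean of its assigned documents.
      double[][] sums = new double[k][DIM];
      int[] counts = new int[k];
      for (int i = 0; i < docs.size(); i++) {
        counts[assignment[i]]++;
        double[] doc = docs.get(i);
        for (int j = 0; j < DIM; j++) {
          sums[assignment[i]][j] += doc[j];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int j = 0; j < DIM; j++) {
            centroids[c][j] = sums[c][j] / counts[c];
          }
        }
      }
    }
    return assignment;
  }

  public static void main(String[] args) {
    List<double[]> docs = new ArrayList<>();
    for (String text : new String[] {"wheat harvest grain", "oil prices opec",
                                     "grain wheat exports", "crude oil barrel"}) {
      docs.add(encode(text));   // vectorize at the last possible moment
    }
    int[] clusters = kmeans(docs, 2, 10, 42L);
    System.out.println(java.util.Arrays.toString(clusters));
  }
}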

Cheers,

Frank


>
>
>
> On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <frank@frankscholten.nl
> >wrote:
>
> > Hi all,
> >
> > Would it be possible to use hashing vector encoders for text clustering
> > just like when classifying?
> >
> > Currently we vectorize using a dictionary where we map each token to a
> > fixed position in the dictionary. After the clustering we use have to
> > retrieve the dictionary to determine the cluster labels.
> > This is quite a complex process where multiple outputs are read and
> written
> > in the entire clustering process.
> >
> > I think it would be great if both algorithms could use the same encoding
> > process but I don't know if this is possible.
> >
> > The problem is that we lose the mapping between token and position when
> > hashing. We need this mapping to determine cluster labels.
> >
> > However, maybe we could make it so hashed encoders can be used and that
> > determining top labels is left to the user. This might be a possibility
> > because I noticed a problem with the current cluster labeling code. This
> is
> > what happens: first vectors are vectorized with TF-IDF and clustered.
> Then
> > the labels are ranked, but again according to TF-IDF, instead of TF. So
> it
> > is possible that a token becomes the top ranked label, even though it is
> > rare within the cluster. The document with that token is in the cluster
> > because of other tokens. If the labels are determined based on a TF score
> > within the cluster I think you would have better labels. But this
> requires
> > a post-processing step on your original data and doing a TF count.
> >
> > Thoughts?
> >
> > Cheers,
> >
> > Frank
> >
>

Re: Text clustering with hashing vector encoders

Posted by Ted Dunning <te...@gmail.com>.
Yes.  Hashing vector encoders will preserve distances when used with
multiple probes.

Interpretation becomes somewhat difficult, but there is code available to
reverse engineer labels on hashed vectors.

IDF weighting is slightly tricky, but quite doable if you keep a dictionary
of, say, the most common 50-200 thousand words and assume all others have
constant and equal frequency.



On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <fr...@frankscholten.nl>wrote:

> Hi all,
>
> Would it be possible to use hashing vector encoders for text clustering
> just like when classifying?
>
> Currently we vectorize using a dictionary where we map each token to a
> fixed position in the dictionary. After the clustering we use have to
> retrieve the dictionary to determine the cluster labels.
> This is quite a complex process where multiple outputs are read and written
> in the entire clustering process.
>
> I think it would be great if both algorithms could use the same encoding
> process but I don't know if this is possible.
>
> The problem is that we lose the mapping between token and position when
> hashing. We need this mapping to determine cluster labels.
>
> However, maybe we could make it so hashed encoders can be used and that
> determining top labels is left to the user. This might be a possibility
> because I noticed a problem with the current cluster labeling code. This is
> what happens: first vectors are vectorized with TF-IDF and clustered. Then
> the labels are ranked, but again according to TF-IDF, instead of TF. So it
> is possible that a token becomes the top ranked label, even though it is
> rare within the cluster. The document with that token is in the cluster
> because of other tokens. If the labels are determined based on a TF score
> within the cluster I think you would have better labels. But this requires
> a post-processing step on your original data and doing a TF count.
>
> Thoughts?
>
> Cheers,
>
> Frank
>