Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2010/09/01 15:44:17 UTC

FeatureEncoder Question

Posted by Robin Anil <ro...@gmail.com>.
Ted,

I am trying to put FeatureEncoder in front of the Mahout Bayes trainer and
classifier, and I have a question about the interaction encoder. How does the
difference in the hash bits correlate with an interaction, say edit distance
or similarity? I am finding it random.


Robin

Re: FeatureEncoder Question

Posted by Ted Dunning <te...@gmail.com>.
You could go down that sort of path, but you lose all the power of
quasi-orthogonality that way.

The basic idea is that random points on the sphere are almost all orthogonal
to within an epsilon that depends on the dimensionality of the space and the
number of vectors.  For epsilon = 0, the number n of vectors that you
can place into the space is clearly the dimension d.  For non-trivial values
of epsilon, however, the number is exponential in d.  This holds
constructively as a hard bound, or in probability for random vectors.
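
As a quick sanity check on that, here is a small sketch (plain Java,
nothing Mahout-specific) that draws random unit vectors and measures their
cosine; the expected magnitude is about 1/sqrt(d), so it shrinks fast as
the dimension grows:

import java.util.Random;

// Sketch: random high-dimensional vectors are nearly orthogonal,
// and the spread of the cosine shrinks as the dimension d grows.
public class QuasiOrthogonality {
  public static void main(String[] args) {
    Random rand = new Random(42);
    for (int d : new int[] {100, 10000, 1000000}) {
      double[] u = randomUnit(rand, d);
      double[] v = randomUnit(rand, d);
      double dot = 0;
      for (int i = 0; i < d; i++) {
        dot += u[i] * v[i];
      }
      // |cosine| is around 1/sqrt(d), which is why the space can hold
      // exponentially many vectors orthogonal to within epsilon.
      System.out.printf("d = %8d   cosine = %+.5f%n", d, dot);
    }
  }

  private static double[] randomUnit(Random rand, int d) {
    double[] x = new double[d];
    double norm = 0;
    for (int i = 0; i < d; i++) {
      x[i] = rand.nextGaussian();
      norm += x[i] * x[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < d; i++) {
      x[i] /= norm;
    }
    return x;
  }
}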

The exponentiality means that if you divide the space into two sub-spaces,
you get vastly less than half capacity in each sub-space.

This behavior is related to Bloom filters, where a fixed false-positive rate
leads to a size proportional to the number of documents, and the number of
bits per document is proportional to the negative log of the false-positive
rate.
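
To make the analogy concrete, a quick check of the standard Bloom filter
sizing formula (textbook math, not anything from Mahout): at false-positive
rate p, an optimal filter needs about -log2(p)/ln 2 bits per item, so the
total size grows linearly with the number of items:

// Standard Bloom filter sizing; textbook formula, not Mahout code.
public class BloomSizing {
  public static void main(String[] args) {
    for (double p : new double[] {0.1, 0.01, 0.001}) {
      // optimal bits per item = -log2(p) / ln(2)
      double bitsPerItem = -(Math.log(p) / Math.log(2)) / Math.log(2);
      // e.g. p = 0.010 works out to about 9.6 bits per item
      System.out.printf("p = %.3f -> %.1f bits per item%n", p, bitsPerItem);
    }
  }
}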

The upshot is that it is better to mix everything into the same space and
take our chances.

On Wed, Sep 1, 2010 at 7:24 AM, Robin Anil <ro...@gmail.com> wrote:

>
> What if you don't mix them in the same vector? Make the vector 2l and add
> these in the second half so they won't mix, right?
>
>

Re: FeatureEncoder Question

Posted by Robin Anil <ro...@gmail.com>.
On Wed, Sep 1, 2010 at 7:49 PM, Ted Dunning <te...@gmail.com> wrote:

> The goal of the interaction encoder is to produce a vector that is
> orthogonal to the original.  The current strategy is to add the hash values
> together, which should leave the new locations different from either of the
> originals (on average).
>
> The place that this gets trickier is when text interacts with something,
> especially text.  This is because text encodes as a vector with (nearly) as
> many non-zeros as unique words in the original text for each probe.  When
> text with n words interacts with text with m words, you get n x m non-zeros
> in the result.  I think that is the best thing to do, but it can be costly
> if you have gobs of words in your text.
>
> I am wide open for suggestions on this.  What we have so far is good enough
> for the current application and the current tests verify the orthogonality
> for a few examples, but more thought would be good.


What if you don't mix them in the same vector? Make the vector 2l and add
these in the second half so they won't mix, right?


> On Wed, Sep 1, 2010 at 6:44 AM, Robin Anil <ro...@gmail.com> wrote:
>
> > I am trying to put FeatureEncoder in front of the Mahout Bayes trainer
> > and classifier, and I have a question about the interaction encoder. How
> > does the difference in the hash bits correlate with an interaction?
> >
> >
> > Robin
> >
>

Re: FeatureEncoder Question

Posted by Ted Dunning <te...@gmail.com>.
The goal of the interaction encoder is to produce a vector that is
orthogonal to the original.  The current strategy is to add the hash values
together, which should leave the new locations different from either of the
originals (on average).
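
For illustration, here is a minimal sketch of that add-the-hashes idea
(all class and method names are made up for the example; this is not the
actual Mahout encoder API):

// A minimal sketch of the add-the-hashes strategy described above.
// All names here are illustrative; this is not the Mahout encoder API.
public class InteractionSketch {
  static final int NUM_FEATURES = 1 << 20;   // size of the hashed feature space

  static int hash(String term, int probe) {
    // Any decent hash will do for the sketch.
    return Math.floorMod(31 * term.hashCode() + probe * 0x9E3779B9, NUM_FEATURES);
  }

  static int interactionHash(String a, String b, int probe) {
    // Adding the two hash values puts the interaction feature at a
    // location that, on average, differs from both original locations.
    return Math.floorMod(hash(a, probe) + hash(b, probe), NUM_FEATURES);
  }

  public static void main(String[] args) {
    System.out.println(hash("age", 0));                      // location of feature a
    System.out.println(hash("gender", 0));                   // location of feature b
    System.out.println(interactionHash("age", "gender", 0)); // location of a x b
  }
}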

The place that this gets trickier is when text interacts with something,
especially text.  This is because text encodes as a vector with (nearly) as
many non-zeros as unique words in the original text for each probe.  When
text with n words interacts with text with m words, you get n x m non-zeros
in the result.  I think that is the best thing to do, but it can be costly
if you have gobs of words in your text.
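
Continuing the toy sketch above (the word lists are made up), the
text-by-text case enumerates every pair, which is where the n x m count
comes from:

// Every word on the left interacts with every word on the right,
// giving n x m non-zero interaction locations per probe.
String[] left = {"cheap", "red", "shoes"};   // n = 3 unique words
String[] right = {"mens", "running"};        // m = 2 unique words
for (String a : left) {
  for (String b : right) {
    // 3 x 2 = 6 interaction features in the output vector
    System.out.println(a + " x " + b + " -> "
        + InteractionSketch.interactionHash(a, b, 0));
  }
}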

I am wide open for suggestions on this.  What we have so far is good enough
for the current application and the current tests verify the orthogonality
for a few examples, but more thought would be good.

On Wed, Sep 1, 2010 at 6:44 AM, Robin Anil <ro...@gmail.com> wrote:

> I am trying to put FeatureEncoder in front of the Mahout Bayes trainer and
> classifier, and I have a question about the interaction encoder. How does
> the difference in the hash bits correlate with an interaction?
>
>
> Robin
>