Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/04/25 05:56:40 UTC

Cosine distances to Random Vector basis

I just found this vector distance idea in a technical paper:

Create a space defined by X random vectors. For your data vectors,
take the cosine distance to each random vector and save the sign of
the value as a bit. This gives a bit set of X bits. A different
distance, or a different rule for picking the bit value, could also
be used.

The effect is to cease using numerical vectors as a "carrier signal"
for the concept of "positions and distances". This is a different,
more focused representation. And Hamming distance is somewhat faster
than Euclidean :) Of course, picking enough bits is a problem.
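
Something like this, in Java (just my sketch of the idea -- names and
details are mine, not from the paper):

    import java.util.BitSet;
    import java.util.Random;

    /** Sketch: one sign bit per random vector, compared with Hamming distance. */
    public class RandomBasisSignature {
      private final double[][] basis;   // X random vectors of dimension dim

      public RandomBasisSignature(int numBits, int dim, long seed) {
        Random rnd = new Random(seed);
        basis = new double[numBits][dim];
        for (double[] v : basis) {
          for (int i = 0; i < dim; i++) {
            v[i] = rnd.nextGaussian();  // Gaussian entries give random directions
          }
        }
      }

      /** The sign of the cosine equals the sign of the dot product
       *  (the norms are positive), so the norms can be skipped. */
      public BitSet signature(double[] data) {
        BitSet bits = new BitSet(basis.length);
        for (int b = 0; b < basis.length; b++) {
          double dot = 0.0;
          for (int i = 0; i < data.length; i++) {
            dot += data[i] * basis[b][i];
          }
          bits.set(b, dot >= 0.0);
        }
        return bits;
      }

      /** Hamming distance: XOR the signatures and count set bits. */
      public static int hamming(BitSet a, BitSet b) {
        BitSet x = (BitSet) a.clone();
        x.xor(b);
        return x.cardinality();
      }
    }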

However, I lost the paper. Does this ring a bell with anyone?

-- 
Lance Norskog
goksron@gmail.com

Re: Cosine distances to Random Vector basis

Posted by Randall McRee <ra...@gmail.com>.
I've done a new, clean implementation of this (just the kNN piece) at my
current company, which has agreed to allow an open-source contribution.

Thanks,
Randy

On Mon, Apr 25, 2011 at 11:09 PM, Ted Dunning <te...@gmail.com> wrote:

> Available cheaper at my old company.
>
>
> http://www.deepdyve.com/lp/association-for-computing-machinery/symbolic-regression-using-nearest-neighbor-indexing-GmDA73L5II
>
>
>
> On Mon, Apr 25, 2011 at 10:22 PM, Randall McRee <randall.mcree@gmail.com>
> wrote:
>
> > Symbolic Regression using Nearest Neighbor
> > Indexing
> >
>

Re: Cosine distances to Random Vector basis

Posted by Ted Dunning <te...@gmail.com>.
Available cheaper at my old company.

http://www.deepdyve.com/lp/association-for-computing-machinery/symbolic-regression-using-nearest-neighbor-indexing-GmDA73L5II



On Mon, Apr 25, 2011 at 10:22 PM, Randall McRee <ra...@gmail.com> wrote:

> Symbolic Regression using Nearest Neighbor
> Indexing
>

Re: Cosine distances to Random Vector basis

Posted by Randall McRee <ra...@gmail.com>.
Charikar is the definitive reference for this method. See

[1] Charikar, M. Similarity estimation techniques from rounding
algorithms. In Proceedings of the Symposium on Theory of Computing
(STOC), 2002.

I also created a simple LSH NN method based on this idea (refined, I
think), which you can find here: McRee, Symbolic Regression using
Nearest Neighbor Indexing
<http://portal.acm.org/citation.cfm?id=1830841&dl=ACM&coll=DL&CFID=17946055&CFTOKEN=18630366>
Access to the paper requires ACM membership -- let me know if you want me
to send you a copy.

The basic idea I introduced was to ignore random projections that do not
fall into the TopN by magnitude (funny how these ideas repeat in
different contexts!). When a projection is small, the vector lies close
to that hyperplane and the sign bit is unreliable, so projections that
are not in the TopN act like noise and are not useful IMHO.
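
Roughly like this, in Java (a sketch of the idea only -- names are mine,
and the real implementation differs):

    import java.util.Arrays;
    import java.util.Comparator;

    /** Sketch of the TopN refinement: keep only the N projections with
     *  the largest magnitude; the near-zero ones carry the noisy bits. */
    public class TopNSignature {

      /** Returns (projectionIndex, sign) pairs for the topN projections
       *  ranked by |dot product| against the random basis. */
      public static int[][] signature(double[] data, double[][] basis, int topN) {
        double[] dots = new double[basis.length];
        Integer[] order = new Integer[basis.length];
        for (int b = 0; b < basis.length; b++) {
          order[b] = b;
          for (int i = 0; i < data.length; i++) {
            dots[b] += data[i] * basis[b][i];
          }
        }
        // Rank projections by descending magnitude and keep the first topN.
        Arrays.sort(order, Comparator.comparingDouble(b -> -Math.abs(dots[b])));
        int[][] sig = new int[topN][2];
        for (int k = 0; k < topN; k++) {
          sig[k][0] = order[k];                      // which random vector
          sig[k][1] = dots[order[k]] >= 0 ? 1 : -1;  // which side of it
        }
        return sig;
      }
    }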

Randy McRee

On Sun, Apr 24, 2011 at 9:41 PM, Ted Dunning <te...@gmail.com> wrote:

> Sounds like a variant of LSH to me.
>
> See the Wikipedia article on LSH with random projections.
>
> On Sun, Apr 24, 2011 at 8:56 PM, Lance Norskog <go...@gmail.com> wrote:
>
> > I just found this vector distance idea in a technical paper:
> >
> > Create a space defined by X random vectors. For your data vectors,
> > take the cosine distance to each random vector and save the sign of
> > the value as a bit. This gives a bit set of X bits. A different
> > distance, or a different rule for picking the bit value, could also
> > be used.
> >
> > The effect is to cease using numerical vectors as a "carrier signal"
> > for the concept of "positions and distances". This is a different,
> > more focused representation. And Hamming distance is somewhat faster
> > than Euclidean :) Of course, picking enough bits is a problem.
> >
> > However, I lost the paper. Does this ring a bell with anyone?
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>

Re: Cosine distances to Random Vector basis

Posted by Ted Dunning <te...@gmail.com>.
Sounds like a variant of LSH to me.

See the Wikipedia article on LSH with random projections.

On Sun, Apr 24, 2011 at 8:56 PM, Lance Norskog <go...@gmail.com> wrote:

> I just found this vector distance idea in a technical paper:
>
> Create a space defined by X random vectors. For your data vectors,
> take the cosine distance to each random vector and save the sign of
> the value as a bit. This gives a bit set of X bits. A different
> distance, or a different rule for picking the bit value, could also
> be used.
>
> The effect is to cease using numerical vectors as a "carrier signal"
> for the concept of "positions and distances". This is a different,
> more focused representation. And Hamming distance is somewhat faster
> than Euclidean :) Of course, picking enough bits is a problem.
>
> However, I lost the paper. Does this ring a bell with anyone?
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Cosine distances to Random Vector basis

Posted by Jake Mannix <ja...@gmail.com>.
This is the starting point of the way I've always seen people do
Locality Sensitive Hashing with floating-point vectors. Once you have
these bit vectors, you can do the minhash-style bucketing on them to
complete LSH.
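
Something like this for the bucketing step, in Java (a sketch only -- it
uses plain banding of the bits rather than minhash, but it plays the
same role):

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Sketch: band the bit signatures into hash tables so that similar
     *  vectors collide in at least one bucket. */
    public class BandedLsh {
      private final int bands;
      private final int rowsPerBand;   // assumes rowsPerBand <= 64
      private final List<Map<Long, List<Integer>>> tables = new ArrayList<>();

      public BandedLsh(int bands, int rowsPerBand) {
        this.bands = bands;
        this.rowsPerBand = rowsPerBand;
        for (int b = 0; b < bands; b++) {
          tables.add(new HashMap<>());
        }
      }

      /** Index a signature under an id: one bucket per band. */
      public void add(int id, BitSet sig) {
        for (int b = 0; b < bands; b++) {
          tables.get(b).computeIfAbsent(bandKey(sig, b), k -> new ArrayList<>()).add(id);
        }
      }

      /** Candidates are everything sharing at least one band with the query. */
      public List<Integer> candidates(BitSet sig) {
        List<Integer> out = new ArrayList<>();
        for (int b = 0; b < bands; b++) {
          List<Integer> bucket = tables.get(b).get(bandKey(sig, b));
          if (bucket != null) {
            out.addAll(bucket);
          }
        }
        return out;
      }

      /** Pack one band's bits into a long used as the hash key. */
      private long bandKey(BitSet sig, int band) {
        long key = 0L;
        for (int r = 0; r < rowsPerBand; r++) {
          key = (key << 1) | (sig.get(band * rowsPerBand + r) ? 1L : 0L);
        }
        return key;
      }
    }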

On Sun, Apr 24, 2011 at 8:56 PM, Lance Norskog <go...@gmail.com> wrote:

> I just found this vector distance idea in a technical paper:
>
> Create a space defined by X random vectors. For your data vectors,
> take the cosine distance to each random vector and save the sign of
> the value as a bit. This gives a bit set of X bits. A different
> distance, or a different rule for picking the bit value, could also
> be used.
>
> The effect is to cease using numerical vectors as a "carrier signal"
> for the concept of "positions and distances". This is a different,
> more focused representation. And Hamming distance is somewhat faster
> than Euclidean :) Of course, picking enough bits is a problem.
>
> However, I lost the paper. Does this ring a bell with anyone?
>
> --
> Lance Norskog
> goksron@gmail.com
>