You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/05/16 05:54:44 UTC

"Geometric Random Hyperplane LSH" - simple implementation

Some minor testing, based on previous mail threads (thanks Ted-I-mean-everyone).

A Locality-Sensitive-Hash space is defined 50 random vectors (0->1). A
sample vector's position in the LSH space is a bitset. Each entry the
sign of the cosine distance to that LSH vector. Thus every LSH
distance has 50 bits. This space of course has no negative vectors.

Test data: 1000 random vectors as samples. All values 0->1, linear distribution.
This test data gives no negative cosine distances, and so all bits are
0. This is expected (from previous mail threads).

Now, define the random vectors in the space as the dimension-wise
delta between two random vectors. This gives vector values ranging
from 0.5 to -0.5; The same test of 1000 random vectors gives Hamming
distances of 935 to 950 (non-matching) bits. Here is the
OnlineSummarizer output:
[(count=1000.0),(sd=4.15466),(mean=938.856),(min=928.0),(25%=935.935),(median=938.932),(75%=941.725),(max=951.0),]

Changing the sample vectors from linear to Gaussian, the distances
range from 915 to 960. The mid quartiles are farther from the median.
[(count=1000.0),(sd=5.52686),(mean=935.912),(min=916.0),(25%=931.871),(median=935.810),(75%=939.730),(max=961.0),]

Given the central limit theorem blah blah blah, this seems appropriate.

Should I spin a patch? Where would this go? I have dense&sparse
implementations, a Vector wrapper and some unit tests.

Next week in Jerusalem,

Lance Norskog
goksron@gmail.com

Re: "Geometric Random Hyperplane LSH" - simple implementation

Posted by Ted Dunning <te...@gmail.com>.
This seems suspicious.

Why are the Hamming distances not distributed around 500?

On Sun, May 15, 2011 at 8:54 PM, Lance Norskog <go...@gmail.com> wrote:

> Now, define the random vectors in the space as the dimension-wise
> delta between two random vectors. This gives vector values ranging
> from 0.5 to -0.5; The same test of 1000 random vectors gives Hamming
> distances of 935 to 950 (non-matching) bits. Here is the
> OnlineSummarizer output:
>
> [(count=1000.0),(sd=4.15466),(mean=938.856),(min=928.0),(25%=935.935),(median=938.932),(75%=941.725),(max=951.0),]
>

Re: "Geometric Random Hyperplane LSH" - simple implementation

Posted by Ted Dunning <te...@gmail.com>.
Here is code to create a difference basis and also code to do LSH encoding.


# creates an LSH basis set by picking pairs of vectors at random
# note that a significant amount of the time this will pick the same
# vector both times and the result will be zero.  This would be easy
# to fix, but I am lazy so I just live with it.
lshBasis = function(vectors, m) {
  n = dim(vectors)[2]
  v1 = vectors[ceiling(runif(m) * n), ]
  v2 = vectors[ceiling(runif(m) * n), ]
  return(v1-v2)
}

# encode a data point into a binary vector based on a given
# LSH basis
lshCode = function(basis, v) {
  if (!is.null(dim(v))) {
    # handle a matrix full of row vectors
    return(basis %*% t(v) > 0)
  } else {
    # handle a single vector
    return(basis %*% v > 0)
  }
}


On Mon, May 16, 2011 at 4:08 PM, Lance Norskog <go...@gmail.com> wrote:

> Please send or post your R code.
>
> On Sun, May 15, 2011 at 10:35 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > I think I was the source of this expectation.
> >
> > And I also think I was wrong.
> >
> > I just did some experiments myself in R and random cut vectors seem to
> work
> > about as well as non-random ones for positive orthant vectors.  For oddly
> > distributed vectors, it still might be good to use difference vectors as
> a
> > basis for LSH, but I am much less convinced than before.
> >
> > On Sun, May 15, 2011 at 8:54 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> >> Test data: 1000 random vectors as samples. All values 0->1, linear
> >> distribution.
> >> This test data gives no negative cosine distances, and so all bits are
> >> 0. This is expected (from previous mail threads).
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: "Geometric Random Hyperplane LSH" - simple implementation

Posted by Lance Norskog <go...@gmail.com>.
Please send or post your R code.

On Sun, May 15, 2011 at 10:35 PM, Ted Dunning <te...@gmail.com> wrote:
> I think I was the source of this expectation.
>
> And I also think I was wrong.
>
> I just did some experiments myself in R and random cut vectors seem to work
> about as well as non-random ones for positive orthant vectors.  For oddly
> distributed vectors, it still might be good to use difference vectors as a
> basis for LSH, but I am much less convinced than before.
>
> On Sun, May 15, 2011 at 8:54 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> Test data: 1000 random vectors as samples. All values 0->1, linear
>> distribution.
>> This test data gives no negative cosine distances, and so all bits are
>> 0. This is expected (from previous mail threads).
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: "Geometric Random Hyperplane LSH" - simple implementation

Posted by Ted Dunning <te...@gmail.com>.
I think I was the source of this expectation.

And I also think I was wrong.

I just did some experiments myself in R and random cut vectors seem to work
about as well as non-random ones for positive orthant vectors.  For oddly
distributed vectors, it still might be good to use difference vectors as a
basis for LSH, but I am much less convinced than before.

On Sun, May 15, 2011 at 8:54 PM, Lance Norskog <go...@gmail.com> wrote:

> Test data: 1000 random vectors as samples. All values 0->1, linear
> distribution.
> This test data gives no negative cosine distances, and so all bits are
> 0. This is expected (from previous mail threads).
>