Posted to user@mahout.apache.org by tog <gu...@gmail.com> on 2011/03/23 16:30:52 UTC

Nearest Neighbour: Similarity search

Hi,

We are using Mahout as the main building block of our similarity search
platform. We are now at a point where we want to refine the similarity
among the top similar candidates (using a different algorithm).
That algorithm requires finding the nearest neighbor for different vectors
(dim > 50) - my sets of vectors are small (max 10k). I understand that such
an algorithm is not explicitly part of Mahout, but I was wondering if it
might be a building block of some other algorithm in Mahout.
If not, which is my guess, does anyone know a good Java library for this
(i.e. ANN search or similar), because for such "high" dimensions kd-trees
are known to be inefficient?

Thanks for your help
Guillaume

Re: Nearest Neighbour: Similarity search

Posted by Ted Dunning <te...@gmail.com>.
Thanks Lance.

On Wed, Mar 23, 2011 at 10:04 PM, Lance Norskog <go...@gmail.com> wrote:

> Oops: a typo killed the link.
>  http://arxiv.org/abs/0909.4061 <space> 'for'.
>
> On Wed, Mar 23, 2011 at 7:50 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > On Wed, Mar 23, 2011 at 7:32 PM, tog <gu...@gmail.com> wrote:
> >
> >> Don't have 4 seconds to do the job ;-)  and my dimension is > 50
> >>
> >
> > Your dimension is almost certainly not much >50 after dimensionality
> > reduction.
> >
> > Remember also, 4 seconds was to compute ALL 100 million of the distance
> > pairs.  Computing 10,000 pairs will be less than a millisecond.
> >
> >
> >>
> >> But let me explain my whole problem:
> >>
> >
> > Yeah.. I will have look at this more carefully off-line.  My key point is
> > that anything you are doing with L_2 distances can be very closely
> > approximated using a random projection.  See
> > http://arxiv.org/abs/0909.4061for the proof.
> >
> > To test for the applicability of this, try doing a partial SVD and look
> at
> > the die-off of the singular vectors. These give you tight bounds on the
> > accuracy of the reduced dimensionality distance relative to the actual
> > distance.
> >
> > The SVD is not necessary in practice.  Then only the random projection is
> > needed.
> >
> > Can you make these data available?  I could produce a small demo for you.
> >
> >
> >
> >>   - I have objects which can be described by a set of approximately
> 1-10k
> >> dense vectors (dim >50)
> >>   - I want to retrieve similar objects so what I did using Mahout is to
> >> cluster all vectors coming from all objects using K-Means (done off
> line)
> >>   - Then comes the user with a new object (with its own set of vector)
> :-)
> >>       - I am able to get similar objects according to the number of
> >> clusters they have in common (the higher the better)
> >>
> >
> > This might work reasonably well.  BUT, I think that you can do better.
> >
> >
> >>
> >> On Thu, Mar 24, 2011 at 12:13 AM, Ted Dunning <ted.dunning@gmail.com
> >wrote:
> >>
> >>> I just did a quick experiment, btw.
> >>>
> >>> In R, computing 10,000 x 10,000 dot products of 10 dimensional values
> >>> takes 2 seconds.  At 20 dimensions, it takes 4 seconds.
> >>>
> >>> This says to me that you don't really need Mahout here except that you
> >>> probably can't read your original data into R.  Mahout can
> >>> help because you can read your data row by row, do the random
> projection
> >>> and add the projected row to a dense matrix.  This
> >>> dense matrix can then be used to compute your distances.  The
> distances,
> >>> as I mentioned can be computed row by row so that
> >>> you can sparsify them on the fly.  Total memory should be roughly equal
> >>> to
> >>>
> >>>    number of rows x (number of reduced dimensions x sizeof(double) +
> >>> number of retained neighbors x (sizeof(double) + sizeof(int)))
> >>>
> >>> This clearly isn't a scalable algorithm but we can still help.
> >>>
> >>> On Wed, Mar 23, 2011 at 10:24 AM, Sean Owen <sr...@gmail.com> wrote:
> >>>
> >>>> OK, it probably won't quite work then. The item-item similarity
> >>>> metrics generally assume sparse vectors, and that the values are
> >>>> something like "ratings". If they're values from different scales, it
> >>>> won't make much sense.
> >>>>
> >>>> On Wed, Mar 23, 2011 at 4:27 PM, tog <gu...@gmail.com>
> wrote:
> >>>> > Ok and what is the underlying algorithm in that job ?
> >>>> > On my side the vectors are so called feature vectors in image
> >>>> processing so each entry has no particular meaning. Similarity will be
> based
> >>>> on the L2 norm
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Nearest Neighbour: Similarity search

Posted by Lance Norskog <go...@gmail.com>.
Oops: a typo killed the link.
 http://arxiv.org/abs/0909.4061 <space> 'for'.

On Wed, Mar 23, 2011 at 7:50 PM, Ted Dunning <te...@gmail.com> wrote:
> On Wed, Mar 23, 2011 at 7:32 PM, tog <gu...@gmail.com> wrote:
>
>> Don't have 4 seconds to do the job ;-)  and my dimension is > 50
>>
>
> Your dimension is almost certainly not much >50 after dimensionality
> reduction.
>
> Remember also, 4 seconds was to compute ALL 100 million of the distance
> pairs.  Computing 10,000 pairs will be less than a millisecond.
>
>
>>
>> But let me explain my whole problem:
>>
>
> Yeah.. I will have look at this more carefully off-line.  My key point is
> that anything you are doing with L_2 distances can be very closely
> approximated using a random projection.  See
> http://arxiv.org/abs/0909.4061for the proof.
>
> To test for the applicability of this, try doing a partial SVD and look at
> the die-off of the singular vectors. These give you tight bounds on the
> accuracy of the reduced dimensionality distance relative to the actual
> distance.
>
> The SVD is not necessary in practice.  Then only the random projection is
> needed.
>
> Can you make these data available?  I could produce a small demo for you.
>
>
>
>>   - I have objects which can be described by a set of approximately 1-10k
>> dense vectors (dim >50)
>>   - I want to retrieve similar objects so what I did using Mahout is to
>> cluster all vectors coming from all objects using K-Means (done off line)
>>   - Then comes the user with a new object (with its own set of vector) :-)
>>       - I am able to get similar objects according to the number of
>> clusters they have in common (the higher the better)
>>
>
> This might work reasonably well.  BUT, I think that you can do better.
>
>
>>
>> On Thu, Mar 24, 2011 at 12:13 AM, Ted Dunning <te...@gmail.com>wrote:
>>
>>> I just did a quick experiment, btw.
>>>
>>> In R, computing 10,000 x 10,000 dot products of 10 dimensional values
>>> takes 2 seconds.  At 20 dimensions, it takes 4 seconds.
>>>
>>> This says to me that you don't really need Mahout here except that you
>>> probably can't read your original data into R.  Mahout can
>>> help because you can read your data row by row, do the random projection
>>> and add the projected row to a dense matrix.  This
>>> dense matrix can then be used to compute your distances.  The distances,
>>> as I mentioned can be computed row by row so that
>>> you can sparsify them on the fly.  Total memory should be roughly equal
>>> to
>>>
>>>    number of rows x (number of reduced dimensions x sizeof(double) +
>>> number of retained neighbors x (sizeof(double) + sizeof(int)))
>>>
>>> This clearly isn't a scalable algorithm but we can still help.
>>>
>>> On Wed, Mar 23, 2011 at 10:24 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> OK, it probably won't quite work then. The item-item similarity
>>>> metrics generally assume sparse vectors, and that the values are
>>>> something like "ratings". If they're values from different scales, it
>>>> won't make much sense.
>>>>
>>>> On Wed, Mar 23, 2011 at 4:27 PM, tog <gu...@gmail.com> wrote:
>>>> > Ok and what is the underlying algorithm in that job ?
>>>> > On my side the vectors are so called feature vectors in image
>>>> processing so each entry has no particular meaning. Similarity will be based
>>>> on the L2 norm
>>>> >
>>>>
>>>
>>>
>>
>>
>> --
>> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Nearest Neighbour: Similarity search

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Mar 23, 2011 at 7:32 PM, tog <gu...@gmail.com> wrote:

> Don't have 4 seconds to do the job ;-)  and my dimension is > 50
>

Your dimension is almost certainly not much >50 after dimensionality
reduction.

Remember also, 4 seconds was to compute ALL 100 million of the distance
pairs.  Computing 10,000 pairs will be less than a millisecond.


>
> But let me explain my whole problem:
>

Yeah... I will have a look at this more carefully offline.  My key point is
that anything you are doing with L_2 distances can be very closely
approximated using a random projection.  See
http://arxiv.org/abs/0909.4061for the proof.

To test for the applicability of this, try doing a partial SVD and look at
the die-off of the singular values. These give you tight bounds on the
accuracy of the reduced dimensionality distance relative to the actual
distance.

The SVD is not necessary in practice; only the random projection is
needed.
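
For concreteness, a minimal sketch of that check, assuming Apache Commons
Math for an in-memory SVD on a small sample of the vectors (loadSample() is
only a placeholder for reading the real feature vectors, and at real scale a
truncated/stochastic SVD would be used instead):

    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.SingularValueDecomposition;

    public class SvdDieOff {
      public static void main(String[] args) {
        double[][] data = loadSample();
        RealMatrix a = new Array2DRowRealMatrix(data, false);
        double[] sigma = new SingularValueDecomposition(a).getSingularValues();
        // If sigma[k] / sigma[0] is already tiny for k around 20-30, a random
        // projection to that many dimensions preserves L2 distances closely.
        for (int k = 0; k < sigma.length; k++) {
          System.out.printf("sigma[%d] = %.4f  (ratio %.4f)%n",
              k, sigma[k], sigma[k] / sigma[0]);
        }
      }

      private static double[][] loadSample() {
        // Illustrative stand-in: random vectors; replace with a sample of the real data.
        java.util.Random rnd = new java.util.Random(42);
        double[][] x = new double[1000][60];
        for (double[] row : x) {
          for (int j = 0; j < row.length; j++) row[j] = rnd.nextGaussian();
        }
        return x;
      }
    }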

Can you make these data available?  I could produce a small demo for you.



>   - I have objects which can be described by a set of approximately 1-10k
> dense vectors (dim >50)
>   - I want to retrieve similar objects so what I did using Mahout is to
> cluster all vectors coming from all objects using K-Means (done off line)
>   - Then comes the user with a new object (with its own set of vector) :-)
>       - I am able to get similar objects according to the number of
> clusters they have in common (the higher the better)
>

This might work reasonably well.  BUT, I think that you can do better.


>
> On Thu, Mar 24, 2011 at 12:13 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> I just did a quick experiment, btw.
>>
>> In R, computing 10,000 x 10,000 dot products of 10 dimensional values
>> takes 2 seconds.  At 20 dimensions, it takes 4 seconds.
>>
>> This says to me that you don't really need Mahout here except that you
>> probably can't read your original data into R.  Mahout can
>> help because you can read your data row by row, do the random projection
>> and add the projected row to a dense matrix.  This
>> dense matrix can then be used to compute your distances.  The distances,
>> as I mentioned can be computed row by row so that
>> you can sparsify them on the fly.  Total memory should be roughly equal
>> to
>>
>>    number of rows x (number of reduced dimensions x sizeof(double) +
>> number of retained neighbors x (sizeof(double) + sizeof(int)))
>>
>> This clearly isn't a scalable algorithm but we can still help.
>>
>> On Wed, Mar 23, 2011 at 10:24 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> OK, it probably won't quite work then. The item-item similarity
>>> metrics generally assume sparse vectors, and that the values are
>>> something like "ratings". If they're values from different scales, it
>>> won't make much sense.
>>>
>>> On Wed, Mar 23, 2011 at 4:27 PM, tog <gu...@gmail.com> wrote:
>>> > Ok and what is the underlying algorithm in that job ?
>>> > On my side the vectors are so called feature vectors in image
>>> processing so each entry has no particular meaning. Similarity will be based
>>> on the L2 norm
>>> >
>>>
>>
>>
>
>
> --
> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
>

Re: Nearest Neighbour: Similarity search

Posted by tog <gu...@gmail.com>.
Don't have 4 seconds to do the job ;-)  and my dimension is > 50

But let me explain my whole problem:
  - I have objects which can be described by a set of approximately 1-10k
dense vectors (dim >50)
  - I want to retrieve similar objects, so what I did using Mahout is to
cluster all vectors coming from all objects using K-Means (done offline)
  - Then comes the user with a new object (with its own set of vectors) :-)
      - I am able to get similar objects according to the number of clusters
they have in common (the higher the better) - a small sketch of this scoring
follows below
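
A minimal sketch of that cluster-overlap scoring (the sets of cluster ids are
assumed to come from the offline K-Means assignment; the names are
illustrative, not Mahout APIs):

    import java.util.*;

    public class ClusterOverlapRanking {
      // Score every indexed object by how many K-Means clusters it shares with the query.
      public static List<Map.Entry<String, Integer>> rank(
          Map<String, Set<Integer>> clusterIdsByObject, Set<Integer> queryClusterIds) {
        List<Map.Entry<String, Integer>> scored = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : clusterIdsByObject.entrySet()) {
          Set<Integer> shared = new HashSet<>(e.getValue());
          shared.retainAll(queryClusterIds);                // clusters in common
          scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), shared.size()));
        }
        scored.sort((a, b) -> b.getValue() - a.getValue()); // higher overlap first
        return scored;
      }
    }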

Now comes my problem: I want to partially revisit the ranking of the (say)
first 100 similar objects. Many will probably share the same number of
clusters with the query object. In order to break the ties, I want to compute
an algebraic transformation of the vectors (the quality of this transformation
will basically refine the ranking). For this I need to find, for each of the
vectors of the query object, the closest one (nearest neighbor) in each of the
100 candidate objects.
My idea was to do it the other way around:
    Embed the dense vectors from the query image in some "efficient"
    data structure
    For each of the 100 candidate objects
       For each of its vectors
          Find its nearest neighbor in that data structure
       From the resulting pairs, compute the algebraic transformation quality measure
    Refine the ranking of the objects
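
For concreteness, a brute-force sketch of that loop in plain Java; with at
most a few thousand query vectors, a linear scan stands in for the "efficient"
data structure, and transformQuality() is only a placeholder for the real
algebraic-transformation measure:

    public class ReRanker {

      static double squaredL2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
      }

      // Linear-scan nearest neighbour among the query object's vectors.
      static double[] nearestQueryVector(double[] v, double[][] queryVectors) {
        double best = Double.MAX_VALUE;
        double[] arg = queryVectors[0];
        for (double[] q : queryVectors) {
          double d = squaredL2(v, q);
          if (d < best) { best = d; arg = q; }
        }
        return arg;
      }

      // Placeholder for the algebraic-transformation quality; here just the
      // negated squared distance (an assumption, not the actual measure).
      static double transformQuality(double[] v, double[] nn) {
        return -squaredL2(v, nn);
      }

      // Score one of the 100 candidate objects against the query object;
      // the candidates are then re-ranked by this value.
      static double score(double[][] candidateVectors, double[][] queryVectors) {
        double quality = 0;
        for (double[] v : candidateVectors) {
          quality += transformQuality(v, nearestQueryVector(v, queryVectors));
        }
        return quality;
      }
    }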

Hope this is making sense!

Any advice?

Best Regards
Guillaume



On Thu, Mar 24, 2011 at 12:13 AM, Ted Dunning <te...@gmail.com> wrote:

> I just did a quick experiment, btw.
>
> In R, computing 10,000 x 10,000 dot products of 10 dimensional values takes
> 2 seconds.  At 20 dimensions, it takes 4 seconds.
>
> This says to me that you don't really need Mahout here except that you
> probably can't read your original data into R.  Mahout can
> help because you can read your data row by row, do the random projection
> and add the projected row to a dense matrix.  This
> dense matrix can then be used to compute your distances.  The distances, as
> I mentioned can be computed row by row so that
> you can sparsify them on the fly.  Total memory should be roughly equal to
>
>    number of rows x (number of reduced dimensions x sizeof(double) + number
> of retained neighbors x (sizeof(double) + sizeof(int)))
>
> This clearly isn't a scalable algorithm but we can still help.
>
> On Wed, Mar 23, 2011 at 10:24 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> OK, it probably won't quite work then. The item-item similarity
>> metrics generally assume sparse vectors, and that the values are
>> something like "ratings". If they're values from different scales, it
>> won't make much sense.
>>
>> On Wed, Mar 23, 2011 at 4:27 PM, tog <gu...@gmail.com> wrote:
>> > Ok and what is the underlying algorithm in that job ?
>> > On my side the vectors are so called feature vectors in image processing
>> so each entry has no particular meaning. Similarity will be based on the L2
>> norm
>> >
>>
>
>


-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net

Re: Nearest Neighbour: Similarity search

Posted by Ted Dunning <te...@gmail.com>.
I just did a quick experiment, btw.

In R, computing 10,000 x 10,000 dot products of 10 dimensional values takes
2 seconds.  At 20 dimensions, it takes 4 seconds.

This says to me that you don't really need Mahout here except that you
probably can't read your original data into R.  Mahout can
help because you can read your data row by row, do the random projection and
add the projected row to a dense matrix.  This
dense matrix can then be used to compute your distances.  The distances, as
I mentioned, can be computed row by row so that
you can sparsify them on the fly.  Total memory should be roughly equal to

   number of rows x (number of reduced dimensions x sizeof(double) + number
of retained neighbors x (sizeof(double) + sizeof(int)))

This clearly isn't a scalable algorithm but we can still help.
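
As an illustrative back-of-the-envelope with the numbers in this thread
(10,000 rows, 30 reduced dimensions, 50 retained neighbors), that comes to:

    10,000 x (30 x 8 bytes + 50 x (8 + 4) bytes)
      = 10,000 x (240 + 600) bytes
      = 8,400,000 bytes, i.e. roughly 8 MB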

On Wed, Mar 23, 2011 at 10:24 AM, Sean Owen <sr...@gmail.com> wrote:

> OK, it probably won't quite work then. The item-item similarity
> metrics generally assume sparse vectors, and that the values are
> something like "ratings". If they're values from different scales, it
> won't make much sense.
>
> On Wed, Mar 23, 2011 at 4:27 PM, tog <gu...@gmail.com> wrote:
> > Ok and what is the underlying algorithm in that job ?
> > On my side the vectors are so called feature vectors in image processing
> so each entry has no particular meaning. Similarity will be based on the L2
> norm
> >
>

Re: Nearest Neighbour: Similarity search

Posted by Sean Owen <sr...@gmail.com>.
OK, it probably won't quite work then. The item-item similarity
metrics generally assume sparse vectors, and that the values are
something like "ratings". If they're values from different scales, it
won't make much sense.

On Wed, Mar 23, 2011 at 4:27 PM, tog <gu...@gmail.com> wrote:
> Ok and what is the underlying algorithm in that job ?
> On my side the vectors are so called feature vectors in image processing so each entry has no particular meaning. Similarity will be based on the L2 norm
>

Re: Nearest Neighbour: Similarity search

Posted by Ted Dunning <te...@gmail.com>.
Does that imply that your vectors are dense?

It also sounds like you want to look for things that are not just near
duplicates.

For that kind of problem, there is some chance that a few features dominate
the problem and you can do, say, a 10 or 20 dimensional first cut and then
revise the ordering of the top few hundred hits to get the results you need.

If that doesn't work, then you are pretty much stuck doing the all pairs
computation.  The good news is that your data set is so small that this can
be done pretty efficiently.

My own approach would be to use a random projection to drive your matrix
size down to 10K x 30 and then just do the multiplication giving a 10K x 10K
matrix that you use to extract the 50 best elements.  The identity (x-y)^2 =
x^2 - 2 x . y + y^2 allows you to use the matrix product to get distances.
 The random projection can be arranged to preserve the dot products.

Since this result is relatively small, you can probably do the computation
on a single node using Mahout (or any other) matrix arithmetic package.  It
can also be arranged to avoid storing the entire product by computing a row
at a time and keeping only the best 50 elements.  This allows you to have a
sparse result.  If you really, really want to, you can recompute the
distances for the 50 best elements without the random projection, but I
think that is overkill.
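
To make that concrete, a minimal single-node sketch in plain Java arrays
(not Mahout's matrix classes; the scaling and helper names are illustrative
assumptions): a Gaussian random projection down to k dimensions, then squared
L2 distances computed one row at a time via
||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, keeping only the nearest 50 per row
so the result stays sparse.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Random;

    public class ProjectedNeighbours {

      // Project n x d data down to n x k with a Gaussian random matrix scaled by
      // 1/sqrt(k), which preserves dot products (and hence L2 distances) in expectation.
      static double[][] randomProjection(double[][] x, int k, long seed) {
        Random rnd = new Random(seed);
        int d = x[0].length;
        double scale = 1.0 / Math.sqrt(k);
        double[][] r = new double[d][k];
        for (int i = 0; i < d; i++)
          for (int j = 0; j < k; j++)
            r[i][j] = rnd.nextGaussian() * scale;
        double[][] y = new double[x.length][k];
        for (int i = 0; i < x.length; i++)
          for (int j = 0; j < k; j++)
            for (int t = 0; t < d; t++)
              y[i][j] += x[i][t] * r[t][j];
        return y;
      }

      // For each row, the 'keep' nearest other rows by squared L2 distance,
      // computed one row of the n x n product at a time (never stored in full).
      static int[][] nearest(double[][] y, int keep) {
        int n = y.length;
        double[] norms = new double[n];
        for (int i = 0; i < n; i++)
          for (double v : y[i]) norms[i] += v * v;
        int[][] result = new int[n][];
        for (int i = 0; i < n; i++) {
          double[] dist = new double[n];
          for (int j = 0; j < n; j++) {
            double dot = 0;
            for (int t = 0; t < y[i].length; t++) dot += y[i][t] * y[j][t];
            dist[j] = norms[i] - 2 * dot + norms[j];   // ||x||^2 - 2 x.y + ||y||^2
          }
          result[i] = smallestIndices(dist, keep, i);
        }
        return result;
      }

      static int[] smallestIndices(double[] dist, int keep, int self) {
        Integer[] idx = new Integer[dist.length];
        for (int j = 0; j < idx.length; j++) idx[j] = j;
        Arrays.sort(idx, Comparator.comparingDouble(j -> dist[j]));
        int[] out = new int[Math.min(keep, idx.length - 1)];
        int written = 0;
        for (int j = 0; j < idx.length && written < out.length; j++)
          if (idx[j] != self) out[written++] = idx[j];  // skip the point itself
        return out;
      }
    }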

On Wed, Mar 23, 2011 at 9:27 AM, tog <gu...@gmail.com> wrote:

> Ok and what is the underlying algorithm in that job ?
> On my side the vectors are so called feature vectors in image processing so
> each entry has no particular meaning. Similarity will be based on the L2
> norm
>
>
> On Mar 23, 2011, at 9:13 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > This sounds exactly like what the item-simiarity Mahout job in
> > org.apache.mahout.cf.taste.hadoop.similarity.item does. To use it, you
> > would have to think about how to map your input onto what this process
> > expects.
> >
> > Your "items" are vectors -- what are the entries though? what do the
> > vectors encode?
> >
> >
> >
> > On Wed, Mar 23, 2011 at 3:30 PM, tog <gu...@gmail.com> wrote:
> >> Hi,
> >>
> >> We are using Mahout as the main building block of our similarity search
> >> platform. We are now at a point where we want (using a different
> algorithm)
> >> refine the similarity with top similar candidates.
> >> That algorithm requires to find nearest neighbor for different vectors
> (dim
> >>> 50) - my sets of vectors are small (max 10k). I do understand that such
> an
> >> algorithm  is not explicitly part of Mahout - but I was wondering if it
> >> might be a building block of some other algorithms in Mahout.
> >> If not, which is my guess, does anyone know a good JAVA library doing
> this
> >> (i.e ANN search or similar because for such "high" dimension Kd-Tree are
> >> known to be inefficient.
> >>
> >> Thanks for your help
> >> Guillaume
> >>
>

Re: Nearest Neighbour: Similarity search

Posted by tog <gu...@gmail.com>.
OK, and what is the underlying algorithm in that job?
On my side, the vectors are so-called feature vectors from image processing, so
each entry has no particular meaning. Similarity will be based on the L2 norm.


On Mar 23, 2011, at 9:13 PM, Sean Owen <sr...@gmail.com> wrote:

> This sounds exactly like what the item-simiarity Mahout job in
> org.apache.mahout.cf.taste.hadoop.similarity.item does. To use it, you
> would have to think about how to map your input onto what this process
> expects.
> 
> Your "items" are vectors -- what are the entries though? what do the
> vectors encode?
> 
> 
> 
> On Wed, Mar 23, 2011 at 3:30 PM, tog <gu...@gmail.com> wrote:
>> Hi,
>> 
>> We are using Mahout as the main building block of our similarity search
>> platform. We are now at a point where we want (using a different algorithm)
>> refine the similarity with top similar candidates.
>> That algorithm requires to find nearest neighbor for different vectors (dim
>>> 50) - my sets of vectors are small (max 10k). I do understand that such an
>> algorithm  is not explicitly part of Mahout - but I was wondering if it
>> might be a building block of some other algorithms in Mahout.
>> If not, which is my guess, does anyone know a good JAVA library doing this
>> (i.e ANN search or similar because for such "high" dimension Kd-Tree are
>> known to be inefficient.
>> 
>> Thanks for your help
>> Guillaume
>> 

Re: Nearest Neighbour: Similarity search

Posted by Sean Owen <sr...@gmail.com>.
This sounds exactly like what the item-similarity Mahout job in
org.apache.mahout.cf.taste.hadoop.similarity.item does. To use it, you
would have to think about how to map your input onto what this process
expects.

Your "items" are vectors -- what are the entries though? what do the
vectors encode?



On Wed, Mar 23, 2011 at 3:30 PM, tog <gu...@gmail.com> wrote:
> Hi,
>
> We are using Mahout as the main building block of our similarity search
> platform. We are now at a point where we want (using a different algorithm)
> refine the similarity with top similar candidates.
> That algorithm requires to find nearest neighbor for different vectors (dim
>> 50) - my sets of vectors are small (max 10k). I do understand that such an
> algorithm  is not explicitly part of Mahout - but I was wondering if it
> might be a building block of some other algorithms in Mahout.
> If not, which is my guess, does anyone know a good JAVA library doing this
> (i.e ANN search or similar because for such "high" dimension Kd-Tree are
> known to be inefficient.
>
> Thanks for your help
> Guillaume
>