Posted to user@mahout.apache.org by Felix Filozov <ff...@gmail.com> on 2011/10/10 21:26:38 UTC

Generic approach to kNN

I would like to perform a kNN similarity search, where each data point is an
N-dimensional vector and each coordinate in the vector may take on any value
(reals or strings). It seems to me that Mahout doesn't have the ability to
perform a generic kNN similarity search; instead the problem has to be
mapped to a recommender. Is Mahout the right tool for this task?

If it is, how have you dealt with the mapping, and if not, what would you
recommend?

Thanks.

Felix

Re: Generic approach to kNN

Posted by Josh Patterson <jo...@cloudera.com>.
If you want to keep it all in memory and you have under, say, a GB (or so)
of data, you could just use Weka's BallTree.

An example of where I used this before:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/src/TVA/Hadoop/MapReduce/Datamining/Weka/WekaUtils.java

I believe the search time for a kNN query on a BallTree is O(log N), which is
handy as well.
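
A rough, untested sketch of what that looks like (the data path is just a
placeholder; the classes are Weka's weka.core.neighboursearch API):

import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.BallTree;

public class BallTreeKnnExample {
  public static void main(String[] args) throws Exception {
    // Load the data set (placeholder path). Weka's default Euclidean distance
    // treats nominal (string) attributes as 0/1 differences.
    Instances data = new DataSource("/path/to/vectors.arff").getDataSet();

    BallTree tree = new BallTree();
    tree.setInstances(data);              // build the tree over the data set

    Instance query = data.instance(0);    // any query point with the same attributes
    Instances neighbours = tree.kNearestNeighbours(query, 5);  // k = 5
    double[] distances = tree.getDistances();

    for (int i = 0; i < neighbours.numInstances(); i++) {
      System.out.println(neighbours.instance(i) + " at distance " + distances[i]);
    }
  }
}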

JP

On Thu, Oct 13, 2011 at 1:32 AM, Felix Filozov <ff...@gmail.com> wrote:
> I decided that since I don't have as much data as I thought I would have, I
> would simply choose an optimized data structure to hold my data set, which
> I'd query locally.
>
> I did start looking into distributed NN options and ways of parallelizing NN
> as well.
>
> Thanks for all the help.
>
> On Wed, Oct 12, 2011 at 11:26 PM, Josh Patterson <jo...@cloudera.com> wrote:
>
>> Without knowing a lot about what you are doing, I'd say you could
>> do this rather simply, as Sean has said, with a basic similarity
>> function.
>>
>> The really simple "batch" version of this might be:
>>
>> 1. Define a similarity function.
>> 2. Take as input some sort of "base point / instance" which we'll use to
>> search against.
>> 3. The map side of the MR job just takes each input vector and scores
>> it with the distance function.
>> 4. Output using the total order partitioner, sorting on the distance score.
>> 5. Look at the first k entries at the front of the sorted output.
>>
>> A more complicated option might be something along the lines of "MD-HBase":
>>
>> http://www.cs.ucsb.edu/~sudipto/papers/md-hbase.pdf
>>
>> where they are storing a k-d tree in HBase to give relatively low
>> latency kNN search queries.
>>
>> The batch version seems like it might be a nice place to start.
>>
>> Hope this helps,
>>
>> JP
>>
>>
>> On Mon, Oct 10, 2011 at 3:26 PM, Felix Filozov <ff...@gmail.com> wrote:
>> > I would like to perform a kNN similarity search, where each data point
>> > is an N-dimensional vector and each coordinate in the vector may take on
>> > any value (reals or strings). It seems to me that Mahout doesn't have the
>> > ability to perform a generic kNN similarity search; instead the problem
>> > has to be mapped to a recommender. Is Mahout the right tool for this task?
>> >
>> > If it is, how have you dealt with the mapping, and if not, what would you
>> > recommend?
>> >
>> > Thanks.
>> >
>> > Felix
>> >
>>
>>
>>
>> --
>> Twitter: @jpatanooga
>> Solution Architect @ Cloudera
>> hadoop: http://www.cloudera.com
>>
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com

Re: Generic approach to kNN

Posted by Felix Filozov <ff...@gmail.com>.
I decided that since I don't have as much data as I thought I would have, I
would simply choose an optimized data structure to hold my data set, which
I'd query locally.

I did start looking into distributed NN options and ways of parallelizing NN
as well.

Thanks for all the help.

On Wed, Oct 12, 2011 at 11:26 PM, Josh Patterson <jo...@cloudera.com> wrote:

> Without knowing a lot about what you are doing, I'd say you could
> do this rather simply, as Sean has said, with a basic similarity
> function.
>
> The really simple "batch" version of this might be:
>
> 1. Define a similarity function.
> 2. Take as input some sort of "base point / instance" which we'll use to
> search against.
> 3. The map side of the MR job just takes each input vector and scores
> it with the distance function.
> 4. Output using the total order partitioner, sorting on the distance score.
> 5. Look at the first k entries at the front of the sorted output.
>
> A more complicated option might be something along the lines of "MD-HBase":
>
> http://www.cs.ucsb.edu/~sudipto/papers/md-hbase.pdf
>
> where they are storing a k-d tree in HBase to give relatively low
> latency kNN search queries.
>
> The batch version seems like it might be a nice place to start.
>
> Hope this helps,
>
> JP
>
>
> On Mon, Oct 10, 2011 at 3:26 PM, Felix Filozov <ff...@gmail.com> wrote:
> > I would like to perform a kNN similarity search, where each data point
> > is an N-dimensional vector and each coordinate in the vector may take on
> > any value (reals or strings). It seems to me that Mahout doesn't have the
> > ability to perform a generic kNN similarity search; instead the problem
> > has to be mapped to a recommender. Is Mahout the right tool for this task?
> >
> > If it is, how have you dealt with the mapping, and if not, what would you
> > recommend?
> >
> > Thanks.
> >
> > Felix
> >
>
>
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com
>

Re: Generic approach to kNN

Posted by Josh Patterson <jo...@cloudera.com>.
Without knowing a lot about what you are doing, I'd say you could
do this rather simply, as Sean has said, with a basic similarity
function.

The really simple "batch" version of this might be:

1. Define a similarity function.
2. Take as input some sort of "base point / instance" which we'll use to
search against.
3. The map side of the MR job just takes each input vector and scores
it with the distance function (see the sketch after this list).
4. Output using the total order partitioner, sorting on the distance score.
5. Look at the first k entries at the front of the sorted output.
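
A rough, untested sketch of the map side (step 3); the class name, the
"knn.base.point" configuration key, and the comma-separated input format are
just assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistanceScoreMapper
    extends Mapper<LongWritable, Text, DoubleWritable, Text> {

  private double[] basePoint;

  @Override
  protected void setup(Context context) {
    // The query / base point is shipped to every mapper via the configuration
    // ("knn.base.point" is just a made-up property name).
    String[] parts = context.getConfiguration().get("knn.base.point").split(",");
    basePoint = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      basePoint[i] = Double.parseDouble(parts[i]);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Score each input vector against the base point with the distance
    // function (plain Euclidean distance here).
    String[] parts = value.toString().split(",");
    double sum = 0.0;
    for (int i = 0; i < basePoint.length; i++) {
      double d = Double.parseDouble(parts[i]) - basePoint[i];
      sum += d * d;
    }
    // Emitting the distance as the key lets the sort phase order vectors by
    // distance; with the TotalOrderPartitioner (step 4) the order is global,
    // so the first k lines of the output are the k nearest neighbours.
    context.write(new DoubleWritable(Math.sqrt(sum)), value);
  }
}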

A more complicated option might be something along the lines of "MD-HBase":

http://www.cs.ucsb.edu/~sudipto/papers/md-hbase.pdf

where they store a k-d tree in HBase to give relatively low-latency
kNN search queries.

The batch version seems like it might be a nice place to start.

Hope this helps,

JP


On Mon, Oct 10, 2011 at 3:26 PM, Felix Filozov <ff...@gmail.com> wrote:
> I would like to perform a kNN similarity search, where each data point is an
> N-dimensional vector and each coordinate in the vector may take on any value
> (reals or strings). It seems to me that Mahout doesn't have the ability to
> perform a generic kNN similarity search; instead the problem has to be
> mapped to a recommender. Is Mahout the right tool for this task?
>
> If it is, how have you dealt with the mapping, and if not, what would you
> recommend?
>
> Thanks.
>
> Felix
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com

Re: Generic approach to kNN

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Oct 11, 2011 at 8:18 AM, Sean Owen <sr...@gmail.com> wrote:

> It doesn't sound like you really have or want a recommender problem,
> then. Really, this is neither a clustering nor recommender problem;
> it's just using a similarity metric.
>

Absolutely.


> Ted's right that you do have to encode these things as numeric vectors
> to use anything in Mahout. A vector can't have values like "cat" --
> how far is that from 2 or "dog"? You have to encode that first. There
> are utilities to help encode categorical data though.
>

A word about this encoding.

We use 1-of-n encoding for categorical data.  Since we usually don't know
how many categories there are, we also tend to use feature hashing, which is
a form of k-of-N encoding where k is usually 2 or 3 and N is a large fixed
number.  The effect is that the dot product of identical categories is
k and the dot product of non-identical categories is almost always 0.  The
"almost always" comes from the way that we use a hash function to pick which
k of the N locations to use.
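
A rough, untested sketch of that effect with Mahout's hashed encoders (the
field name, the vector size N = 10,000, and k = 2 probes are just
illustrative choices):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingExample {
  public static void main(String[] args) {
    int n = 10000;                       // large fixed vector size N
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("animal");
    encoder.setProbes(2);                // k = 2 hashed locations per value

    Vector dog1 = new RandomAccessSparseVector(n);
    Vector dog2 = new RandomAccessSparseVector(n);
    Vector cat = new RandomAccessSparseVector(n);
    encoder.addToVector("dog", dog1);
    encoder.addToVector("dog", dog2);
    encoder.addToVector("cat", cat);

    // Identical categories hash to the same k locations, so the dot product
    // is k; different categories almost never collide, so it is almost
    // always 0.
    System.out.println("dog . dog = " + dog1.dot(dog2));   // expect ~2
    System.out.println("dog . cat = " + dog1.dot(cat));    // expect ~0
  }
}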

But then you could easily make use of the bits from the recommender
> code if you wanted to find nearest neighbor, or reuse a bit of
> clustering on Hadoop that calculates similarities. (In fact there are
> bits of the recommender on Hadoop that calculate similarities too --
> plenty of pieces you could co-opt to just create similarities to one
> item.)
>

Indeed.  The key here is adapting existing code.


>
> On Tue, Oct 11, 2011 at 6:57 AM, Felix Filozov <ff...@gmail.com> wrote:
> > Ted, does this apply to recommenders?
>

It can.


> ...
>
> > I've been making my way through Mahout in Action (I just realized you guys
> > are the authors; great book!) and some online tutorials, and it seems to me
> > that I'd have to do quite a lot of shoehorning to achieve my goal. I don't
> > really have a notion of a user or a rating, similarity would have to be
> > defined by me, and any further optimization using k-d trees or LSH could be
> > difficult.
>

Skip ahead to chapter 14 for a bit.  Fancy encoding isn't usually necessary
for normal recommenders, and normal recommenders are what the first third of
the book is all about.

Re: Generic approach to kNN

Posted by Sean Owen <sr...@gmail.com>.
It doesn't sound like you really have or want a recommender problem,
then. Really, this is neither a clustering nor recommender problem;
it's just using a similarity metric.

Ted's right that you do have to encode these things as numeric vectors
to use anything in Mahout. A vector can't have values like "cat" --
how far is that from 2 or "dog"? You have to encode that first. There
are utilities to help encode categorical data though.

But then you could easily make use of the bits from the recommender
code if you wanted to find nearest neighbor, or reuse a bit of
clustering on Hadoop that calculates similarities. (In fact there are
bits of the recommender on Hadoop that calculate similarities too --
plenty of pieces you could co-opt to just create similarities to one
item.)

If you want to write a similarity function that works on what are
really Map<?,?> objects, you could easily do that. You're probably not
really using Mahout anymore at that stage. But it's simple enough that
you could do it. After writing the similarity metric, the rest of the
code is one for-loop.
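
For example, a minimal, untested sketch of that; the record shape and the
distance function here are made up purely for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SimpleKnn {

  // A made-up distance over map-like records: numeric fields contribute their
  // absolute difference, everything else contributes 1 when the values differ.
  static double distance(Map<String, Object> a, Map<String, Object> b) {
    double d = 0.0;
    for (Map.Entry<String, Object> e : a.entrySet()) {
      Object other = b.get(e.getKey());
      if (e.getValue() instanceof Number && other instanceof Number) {
        d += Math.abs(((Number) e.getValue()).doubleValue()
            - ((Number) other).doubleValue());
      } else if (!e.getValue().equals(other)) {
        d += 1.0;
      }
    }
    return d;
  }

  // The "one for-loop": score every record against the query, sort by the
  // score, and keep the first k.
  static List<Map<String, Object>> kNearest(final Map<String, Object> query,
                                            List<Map<String, Object>> data,
                                            int k) {
    List<Map<String, Object>> sorted = new ArrayList<Map<String, Object>>(data);
    Collections.sort(sorted, new Comparator<Map<String, Object>>() {
      public int compare(Map<String, Object> x, Map<String, Object> y) {
        return Double.compare(distance(query, x), distance(query, y));
      }
    });
    return sorted.subList(0, Math.min(k, sorted.size()));
  }
}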

On Tue, Oct 11, 2011 at 6:57 AM, Felix Filozov <ff...@gmail.com> wrote:
> Ted, does this apply to recommenders?
>
> Let me describe my problem more simply: imagine you have a set of N feature
> vectors, and you're given a vector X (not in the set), and you're asked
> to find the vector in the set which is nearest to X. I believe this is the
> classic description of NN.
>
> I've been making my way through Mahout in Action (I just realized you guys
> are the authors; great book!) and some online tutorials, and it seems to me
> that I'd have to do quite a lot of shoehorning to achieve my goal. I don't
> really have a notion of a user or a rating, similarity would have to be
> defined by me, and any further optimization using k-d trees or LSH could be
> difficult.
>
>
> On Mon, Oct 10, 2011 at 9:25 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> You need to encode these as numerical vectors.
>>
>> The classes in org.apache.mahout.vectorizer.encoders can help with converting
>> combined numerical, categorical and textual fields into a coherent vector
>> that can be used with standard distance measures.
>>
>> On Mon, Oct 10, 2011 at 11:54 PM, Felix Filozov <ff...@gmail.com>
>> wrote:
>>
>> > I have a set of feature vectors. They're composed of integers and other
>> > non-numerical values. This means that I would need the ability to supply
>> > my own distance function. My data has no notion of users, just vectors.
>> >
>> > Example:
>> >
>> > vector 1: (1, apple, dog, 34, 8766)
>> > ...
>> > vector n: (3, orange, cat, 3738, 3737)
>> >
>> > I would like to know if Mahout can perform a kNN similarity search using
>> > such arbitrary items/vectors. As a side question, can it perform that
>> > outside the context of a recommender? I think reducing some problems to a
>> > recommendation may be a bit awkward.
>> >
>> >
>> >
>> > On Monday, October 10, 2011, Sean Owen <sr...@gmail.com> wrote:
>> > > I think there are a lot of answers to this, depending on what exactly
>> > > you want. This is just one answer -- maybe you can clarify your
>> > > requirements.
>> > >
>> > > You want to just find the k most similar items, and you want to
>> > > construe this as a recommender problem?
>> > > The item-based recommenders have a mostSimilarItems() method. All it
>> > > does is find the k most similar items to the given item. It's just
>> > > applying a given similarity metric to search all possibilities. It
>> > > works on "items" but you can flip it around to work on users if you
>> > > like.
>> > >
>> > > Vectors really have to take on numeric values, or else they're not
>> > > really vectors! Are you trying to map discrete values to some numeric
>> > > range?
>> > >
>> > >
>> > > On Mon, Oct 10, 2011 at 8:26 PM, Felix Filozov <ff...@gmail.com>
>> > wrote:
>> > >> I would like to perform a kNN similarity search, where each data point
>> > >> is an N-dimensional vector and each coordinate in the vector may take
>> > >> on any value (reals or strings). It seems to me that Mahout doesn't
>> > >> have the ability to perform a generic kNN similarity search; instead
>> > >> the problem has to be mapped to a recommender. Is Mahout the right
>> > >> tool for this task?
>> > >>
>> > >> If it is, how have you dealt with the mapping, and if not, what would
>> > >> you recommend?
>> > >>
>> > >> Thanks.
>> > >>
>> > >> Felix
>> > >>
>> > >
>> >
>>
>

Re: Generic approach to kNN

Posted by Felix Filozov <ff...@gmail.com>.
Ted, does this apply to recommenders?

Let me describe my problem more simply: imagine you have a set of N feature
vectors, and you're given a vector X (not in the set), and you're asked
to find the vector in the set which is nearest to X. I believe this is the
classic description of NN.

I've been making my way through Mahout in Action (I just realized you guys
are the authors; great book!) and some online tutorials, and it seems to me
that I'd have to do quite a lot of shoehorning to achieve my goal. I don't
really have a notion of a user or a rating, similarity would have to be
defined by me, and any further optimization using k-d trees or LSH could be
difficult.


On Mon, Oct 10, 2011 at 9:25 PM, Ted Dunning <te...@gmail.com> wrote:

> You need to encode these as numerical vectors.
>
> The classes in org.apache.mahout.vectorizer.encoders can help with converting
> combined numerical, categorical and textual fields into a coherent vector
> that can be used with standard distance measures.
>
> On Mon, Oct 10, 2011 at 11:54 PM, Felix Filozov <ff...@gmail.com>
> wrote:
>
> > I have a set of feature vectors. They're composed of integers and other
> > non-numerical values. This means that I would need the ability to supply
> > my own distance function. My data has no notion of users, just vectors.
> >
> > Example:
> >
> > vector 1: (1, apple, dog, 34, 8766)
> > ...
> > vector n: (3, orange, cat, 3738, 3737)
> >
> > I would like to know if Mahout can perform a kNN similarity search using
> > such arbitrary items/vectors. As a side question, can it perform that
> > outside the context of a recommender? I think reducing some problems to a
> > recommendation may be a bit awkward.
> >
> >
> >
> > On Monday, October 10, 2011, Sean Owen <sr...@gmail.com> wrote:
> > > I think there are a lot of answers to this, depending on what exactly
> > > you want. This is just one answer -- maybe you can clarify your
> > > requirements.
> > >
> > > You want to just find the k most similar items, and you want to
> > > construe this as a recommender problem?
> > > The item-based recommenders have a mostSimilarItems() method. All it
> > > does is find the k most similar items to the given item. It's just
> > > applying a given similarity metric to search all possibilities. It
> > > works on "items" but you can flip it around to work on users if you
> > > like.
> > >
> > > Vectors really have to take on numeric values, or else they're not
> > > really vectors! Are you trying to map discrete values to some numeric
> > > range?
> > >
> > >
> > > On Mon, Oct 10, 2011 at 8:26 PM, Felix Filozov <ff...@gmail.com>
> > wrote:
> > >> I would like to perform a kNN similarity search, where each data point
> > >> is an N-dimensional vector and each coordinate in the vector may take
> > >> on any value (reals or strings). It seems to me that Mahout doesn't
> > >> have the ability to perform a generic kNN similarity search; instead
> > >> the problem has to be mapped to a recommender. Is Mahout the right
> > >> tool for this task?
> > >>
> > >> If it is, how have you dealt with the mapping, and if not, what would
> > >> you recommend?
> > >>
> > >> Thanks.
> > >>
> > >> Felix
> > >>
> > >
> >
>

Re: Generic approach to kNN

Posted by Ted Dunning <te...@gmail.com>.
You need to encode these as numerical vectors.

The classes in org.apache.mahout.vectorizer.encoders can help with converting
combined numerical, categorical, and textual fields into a coherent vector
that can be used with standard distance measures.
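
A rough, untested sketch of what that might look like for one mixed record;
the field names, values, and vector size below are just placeholders:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class MixedRecordEncoder {
  public static void main(String[] args) {
    // One vector per record; the cardinality is an arbitrary fixed size.
    Vector v = new RandomAccessSparseVector(1000);

    // Numerical field: hashed into the vector with its value as the weight.
    new ContinuousValueEncoder("count").addToVector("34", v);

    // Categorical fields: each category hashes to its own location(s).
    new StaticWordValueEncoder("fruit").addToVector("apple", v);
    new StaticWordValueEncoder("animal").addToVector("dog", v);

    // v can now be compared to other records' vectors with any of the
    // standard DistanceMeasure implementations (e.g. EuclideanDistanceMeasure).
    System.out.println(v);
  }
}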

On Mon, Oct 10, 2011 at 11:54 PM, Felix Filozov <ff...@gmail.com> wrote:

> I have a set of feature vectors. They're composed of integers and other
> non-numerical values. This means that I would need the ability to supply my
> own distance function. My data has no notion of users, just vectors.
>
> Example:
>
> vector 1: (1, apple, dog, 34, 8766)
> ...
> vector n: (3, orange, cat, 3738, 3737)
>
> I would like to know if Mahout can perform a kNN similarity search using such
> arbitrary items/vectors. As a side question, can it perform that outside
> the context of a recommender? I think reducing some problems to a
> recommendation may be a bit awkward.
>
>
>
> On Monday, October 10, 2011, Sean Owen <sr...@gmail.com> wrote:
> > I think there are a lot of answers to this, depending on what exactly
> > you want. This is just one answer -- maybe you can clarify your
> > requirements.
> >
> > You want to just find the k most similar items, and you want to
> > construe this as a recommender problem?
> > The item-based recommenders have a mostSimilarItems() method. All it
> > does is find the k most similar items to the given item. It's just
> > applying a given similarity metric to search all possibilities. It
> > works on "items" but you can flip it around to work on users if you
> > like.
> >
> > Vectors really have to take on numeric values, or else they're not
> > really vectors! Are you trying to map discrete values to some numeric
> > range?
> >
> >
> > On Mon, Oct 10, 2011 at 8:26 PM, Felix Filozov <ff...@gmail.com>
> wrote:
> >> I would like to perform a kNN similarity search, where each data point
> >> is an N-dimensional vector and each coordinate in the vector may take on
> >> any value (reals or strings). It seems to me that Mahout doesn't have the
> >> ability to perform a generic kNN similarity search; instead the problem
> >> has to be mapped to a recommender. Is Mahout the right tool for this task?
> >>
> >> If it is, how have you dealt with the mapping, and if not, what would
> >> you recommend?
> >>
> >> Thanks.
> >>
> >> Felix
> >>
> >
>

Re: Generic approach to kNN

Posted by Felix Filozov <ff...@gmail.com>.
I have a set of feature vectors. They're composed of integers and other
non-numerical values. This means that I would need the ability to supply my
own distance function. My data has no notion of users, just vectors.

Example:

vector 1: (1, apple, dog, 34, 8766)
...
vector n: (3, orange, cat, 3738, 3737)

I would like to know if Mahout can perform a kNN similarity search using such
arbitrary items/vectors. As a side question, can it perform that outside
the context of a recommender? I think reducing some problems to a
recommendation may be a bit awkward.



On Monday, October 10, 2011, Sean Owen <sr...@gmail.com> wrote:
> I think there are a lot of answers to this, depending on what exactly
> you want. This is just one answer -- maybe you can clarify your
> requirements.
>
> You want to just find the k most similar items, and you want to
> construe this as a recommender problem?
> The item-based recommenders have a mostSimilarItems() method. All it
> does is find the k most similar items to the given item. It's just
> applying a given similarity metric to search all possibilities. It
> works on "items" but you can flip it around to work on users if you
> like.
>
> Vectors really have to take on numeric values, or else they're not
> really vectors! Are you trying to map discrete values to some numeric
> range?
>
>
> On Mon, Oct 10, 2011 at 8:26 PM, Felix Filozov <ff...@gmail.com> wrote:
>> I would like to perform a kNN similarity search, where each data point
>> is an N-dimensional vector and each coordinate in the vector may take on
>> any value (reals or strings). It seems to me that Mahout doesn't have the
>> ability to perform a generic kNN similarity search; instead the problem
>> has to be mapped to a recommender. Is Mahout the right tool for this task?
>>
>> If it is, how have you dealt with the mapping, and if not, what would you
>> recommend?
>>
>> Thanks.
>>
>> Felix
>>
>

Re: Generic approach to kNN

Posted by Sean Owen <sr...@gmail.com>.
I think there are a lot of answers to this, depending on what exactly
you want. This is just one answer -- maybe you can clarify your
requirements.

You want to just find the k most similar items, and you want to
construe this as a recommender problem?
The item-based recommenders have a mostSimilarItems() method. All it
does is find the k most similar items to the given item. It's just
applying a given similarity metric to search all possibilities. It
works on "items" but you can flip it around to work on users if you
like.
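
For example, a rough (untested) sketch of that method, assuming a Taste-style
user,item,preference file; the path, item ID, and choice of similarity are
arbitrary:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarItemsExample {
  public static void main(String[] args) throws Exception {
    // user,item,preference triples; the path is just a placeholder.
    DataModel model = new FileDataModel(new File("/path/to/prefs.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // The k most similar items to item 42, under the chosen similarity metric.
    List<RecommendedItem> neighbours = recommender.mostSimilarItems(42L, 10);
    for (RecommendedItem item : neighbours) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}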

Vectors really have to take on numeric values, or else they're not
really vectors! Are you trying to map discrete values to some numeric
range?


On Mon, Oct 10, 2011 at 8:26 PM, Felix Filozov <ff...@gmail.com> wrote:
> I would like to perform a kNN similarity search, where each data point is an
> N-dimensional vector and each coordinate in the vector may take on any value
> (reals or strings). It seems to me that Mahout doesn't have the ability to
> perform a generic kNN similarity search; instead the problem has to be
> mapped to a recommender. Is Mahout the right tool for this task?
>
> If it is, how have you dealt with the mapping, and if not, what would you
> recommend?
>
> Thanks.
>
> Felix
>