Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2015/08/26 13:35:45 UTC

Build k-NN graph for large dataset

Dear all,

I'm trying to find an efficient way to build a k-NN graph for a large
dataset. Precisely, I have a large set of high-dimensional vectors (say d
>>> 10000) and I want to build a graph where those high-dimensional points
are the vertices and each one is linked to its k nearest neighbors based on
some kind of similarity defined on the vertex space.
My problem is to implement an efficient algorithm to compute the weight
matrix of the graph. I need to compute N*N similarities, and the only way
I know is to use the "cartesian" operation followed by a "map" operation
on RDDs. But this is very slow when N is large. Is there a more clever
way to do this for an arbitrary similarity function?
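For concreteness, the current approach looks roughly like this (a simplified sketch; the similarity function below is just a placeholder for the real one):

import org.apache.spark.rdd.RDD

// Stand-in for the real similarity; any symmetric function of two vectors works here.
def similarity(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// points: (id, vector) pairs. cartesian() materialises all N*N pairs, which is the bottleneck.
def bruteForceKnn(points: RDD[(Long, Array[Double])], k: Int): RDD[(Long, Array[(Long, Double)])] =
  points.cartesian(points)
    .filter { case ((i, _), (j, _)) => i != j }
    .map { case ((i, a), (j, b)) => (i, (j, similarity(a, b))) }
    .groupByKey()
    .mapValues(sims => sims.toArray.sortBy { case (_, s) => -s }.take(k))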

Cheers,

Jao

Re: Build k-NN graph for large dataset

Posted by Robin East <ro...@xense.co.uk>.
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space you would probably run into the curse of dimensionality.
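For instance, a rough MLlib sketch (untested, and the number of components is an arbitrary choice; note that computePrincipalComponents builds a d x d covariance matrix on the driver, so it only scales to moderate d):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Project the d-dimensional points onto their top principal components
// before doing any neighbour search. numComponents (e.g. 50) is a tuning choice.
def reduceDims(vectors: RDD[Vector], numComponents: Int): RowMatrix = {
  val mat = new RowMatrix(vectors)
  val pc  = mat.computePrincipalComponents(numComponents) // local d x numComponents matrix
  mat.multiply(pc)                                         // rows are now numComponents-dimensional
}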
> On 26 Aug 2015, at 12:35, Jaonary Rabarisoa <ja...@gmail.com> wrote:
> 
> Dear all,
> 
> I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high-dimensional vectors (say d >>> 10000) and I want to build a graph where those high-dimensional points are the vertices and each one is linked to its k nearest neighbors based on some kind of similarity defined on the vertex space.
> My problem is to implement an efficient algorithm to compute the weight matrix of the graph. I need to compute N*N similarities, and the only way I know is to use the "cartesian" operation followed by a "map" operation on RDDs. But this is very slow when N is large. Is there a more clever way to do this for an arbitrary similarity function?
> 
> Cheers,
> 
> Jao


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Build k-NN graph for large dataset

Posted by Maruf Aytekin <aa...@gmail.com>.
Yes, you need to use dimensionality reduction and/or locality-sensitive
hashing to reduce the number of pairs to compare. There is also an LSH
implementation for collections of vectors that I have just published here:
https://github.com/marufaytekin/lsh-spark. The implementation is based on
this paper:
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
I hope it helps.
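If anyone wants the core idea without pulling in the library, a minimal sketch of the sign-of-random-projection scheme from that paper could look like the following (this is not the lsh-spark API, just an illustration; numBits and the seed are arbitrary):

import scala.util.Random

// Sign-of-random-projection hashing for cosine similarity, as in the Charikar paper.
// Points whose signatures collide become candidate neighbours.
class CosineLsh(dim: Int, numBits: Int, seed: Long = 42L) extends Serializable {
  private val rng = new Random(seed)
  // One random hyperplane per output bit
  private val hyperplanes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

  def signature(v: Array[Double]): String =
    hyperplanes.map { h =>
      if (h.zip(v).map { case (a, b) => a * b }.sum >= 0) '1' else '0'
    }.mkString
}

// Usage idea: bucket points by signature, then compute exact similarities within buckets only.
// val lsh = new CosineLsh(dim = vectorSize, numBits = 16)
// val buckets = points.map { case (id, v) => (lsh.signature(v), (id, v)) }.groupByKey()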

Maruf


On Thu, Aug 27, 2015 at 9:16 AM, Jaonary Rabarisoa <ja...@gmail.com>
wrote:

> Thank you all for these links. I'll check them.
>
> On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack <ch...@gmail.com>
> wrote:
>
>> +1 to all of the above esp.  Dimensionality reduction and locality
>> sensitive hashing / min hashing.
>>
>> There's also an algorithm implemented in MLlib called DIMSUM which was
>> developed at Twitter for this purpose. I've been meaning to try it and
>> would be interested to hear about results you get.
>>
>> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
>>
>> ​Charlie
>>
>>
>> — Sent from Mailbox
>>
>> On Wednesday, Aug 26, 2015 at 09:57, Michael Malak <
>> michaelmalak@yahoo.com.invalid>, wrote:
>>
>>> Yes. And a paper that describes using grids (actually varying grids) is
>>> http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf In
>>> the Spark GraphX In Action book that Robin East and I are writing, we
>>> implement a drastically simplified version of this in chapter 7, which
>>> should become available in the MEAP mid-September.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>>
>>> ------------------------------
>>>
>>> If you don't want to compute all N^2 similarities, you need to implement
>>> some kind of blocking first. For example, LSH (locality-sensitive hashing).
>>> A quick search gave this link to a Spark implementation:
>>>
>>>
>>> http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing
>>>
>>>
>>>
>>> On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <ja...@gmail.com>
>>> wrote:
>>>
>>> Dear all,
>>>
>>> I'm trying to find an efficient way to build a k-NN graph for a large
>>> dataset. Precisely, I have a large set of high-dimensional vectors (say d
>>> >>> 10000) and I want to build a graph where those high-dimensional points
>>> are the vertices and each one is linked to its k nearest neighbors based on
>>> some kind of similarity defined on the vertex space.
>>> My problem is to implement an efficient algorithm to compute the weight
>>> matrix of the graph. I need to compute N*N similarities, and the only way
>>> I know is to use the "cartesian" operation followed by a "map" operation
>>> on RDDs. But this is very slow when N is large. Is there a more clever
>>> way to do this for an arbitrary similarity function?
>>>
>>> Cheers,
>>>
>>> Jao
>>>
>>>
>>>
>>>
>>>
>

Re: Build k-NN graph for large dataset

Posted by Jaonary Rabarisoa <ja...@gmail.com>.
Thank you all for these links. I'll check them.

On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack <ch...@gmail.com>
wrote:

> +1 to all of the above esp.  Dimensionality reduction and locality
> sensitive hashing / min hashing.
>
> There's also an algorithm implemented in MLlib called DIMSUM which was
> developed at Twitter for this purpose. I've been meaning to try it and
> would be interested to hear about results you get.
>
> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
>
> ​Charlie
>
>
> — Sent from Mailbox
>
> On Wednesday, Aug 26, 2015 at 09:57, Michael Malak <
> michaelmalak@yahoo.com.invalid>, wrote:
>
>> Yes. And a paper that describes using grids (actually varying grids) is
>> http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf In
>> the Spark GraphX In Action book that Robin East and I are writing, we
>> implement a drastically simplified version of this in chapter 7, which
>> should become available in the MEAP mid-September.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>> ------------------------------
>>
>> If you don't want to compute all N^2 similarities, you need to implement
>> some kind of blocking first. For example, LSH (locality-sensitive hashing).
>> A quick search gave this link to a Spark implementation:
>>
>>
>> http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing
>>
>>
>>
>> On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <ja...@gmail.com>
>> wrote:
>>
>> Dear all,
>>
>> I'm trying to find an efficient way to build a k-NN graph for a large
>> dataset. Precisely, I have a large set of high-dimensional vectors (say d
>> >>> 10000) and I want to build a graph where those high-dimensional points
>> are the vertices and each one is linked to its k nearest neighbors based on
>> some kind of similarity defined on the vertex space.
>> My problem is to implement an efficient algorithm to compute the weight
>> matrix of the graph. I need to compute N*N similarities, and the only way
>> I know is to use the "cartesian" operation followed by a "map" operation
>> on RDDs. But this is very slow when N is large. Is there a more clever
>> way to do this for an arbitrary similarity function?
>>
>> Cheers,
>>
>> Jao
>>
>>
>>
>>
>>

Re: Build k-NN graph for large dataset

Posted by Charlie Hack <ch...@gmail.com>.
+1 to all of the above, esp. dimensionality reduction and locality-sensitive hashing / min-hashing.

There's also an algorithm implemented in MLlib called DIMSUM, which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about the results you get.

https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
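If you want to give it a quick try, DIMSUM is exposed through RowMatrix.columnSimilarities when you pass a positive threshold. A rough spark-shell sketch with toy data (the vectors and the 0.1 threshold are just placeholders; note it compares the columns of the matrix, so each point has to be laid out as a column):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy data: columns of the RowMatrix are what get compared.
// A larger threshold samples more aggressively (faster, less exact).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 1.0, 1.0),
  Vectors.dense(3.0, 1.0, 0.0)
))
val mat  = new RowMatrix(rows)
val sims = mat.columnSimilarities(0.1)   // CoordinateMatrix of (i, j, cosine similarity)
sims.entries.collect().foreach(println)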

Charlie

—
Sent from Mailbox

On Wednesday, Aug 26, 2015 at 09:57, Michael Malak <mi...@yahoo.com.invalid>, wrote:

Yes. And a paper that describes using grids (actually varying grids) is http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf. In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter 7, which should become available in the MEAP in mid-September: http://www.manning.com/books/spark-graphx-in-action

If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation:

http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing

On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:

Dear all,

I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high-dimensional vectors (say d >>> 10000) and I want to build a graph where those high-dimensional points are the vertices and each one is linked to its k nearest neighbors based on some kind of similarity defined on the vertex space.
My problem is to implement an efficient algorithm to compute the weight matrix of the graph. I need to compute N*N similarities, and the only way I know is to use the "cartesian" operation followed by a "map" operation on RDDs. But this is very slow when N is large. Is there a more clever way to do this for an arbitrary similarity function?

Cheers,

Jao

Re: Build k-NN graph for large dataset

Posted by Michael Malak <mi...@yahoo.com.INVALID>.
Yes. And a paper that describes using grids (actually varying grids) is http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf. In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter 7, which should become available in the MEAP in mid-September: http://www.manning.com/books/spark-graphx-in-action

From: Kristina Rogale Plazonic <kp...@gmail.com>
To: Jaonary Rabarisoa <ja...@gmail.com>
Cc: user <us...@spark.apache.org>
Sent: Wednesday, August 26, 2015 7:24 AM
Subject: Re: Build k-NN graph for large dataset
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation:
http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing


On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:

Dear all,
I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high-dimensional vectors (say d >>> 10000) and I want to build a graph where those high-dimensional points are the vertices and each one is linked to its k nearest neighbors based on some kind of similarity defined on the vertex space.
My problem is to implement an efficient algorithm to compute the weight matrix of the graph. I need to compute N*N similarities, and the only way I know is to use the "cartesian" operation followed by a "map" operation on RDDs. But this is very slow when N is large. Is there a more clever way to do this for an arbitrary similarity function?
Cheers,
Jao

Re: Build k-NN graph for large dataset

Posted by Kristina Rogale Plazonic <kp...@gmail.com>.
If you don't want to compute all N^2 similarities, you need to implement
some kind of blocking first. For example, LSH (locality-sensitive hashing).
A quick search gave this link to a Spark implementation:

http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing
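
To make the blocking idea concrete, the generic pattern is: assign every point a bucket key (an LSH signature, a grid cell, a cluster id, ...), self-join on that key, and only score pairs inside a bucket. A hypothetical sketch, with the bucketing function left abstract:

import org.apache.spark.rdd.RDD

// blockKey maps a vector to a bucket id; only pairs landing in the same bucket are compared.
def candidatePairs(points: RDD[(Long, Array[Double])],
                   blockKey: Array[Double] => String)
    : RDD[((Long, Array[Double]), (Long, Array[Double]))] = {
  val keyed = points.map { case (id, v) => (blockKey(v), (id, v)) }
  keyed.join(keyed)                             // pairs within a bucket only
    .values
    .filter { case ((i, _), (j, _)) => i < j }  // drop self-pairs and mirrored duplicates
}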

On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <ja...@gmail.com>
wrote:

> Dear all,
>
> I'm trying to find an efficient way to build a k-NN graph for a large
> dataset. Precisely, I have a large set of high-dimensional vectors (say d
> >>> 10000) and I want to build a graph where those high-dimensional points
> are the vertices and each one is linked to its k nearest neighbors based on
> some kind of similarity defined on the vertex space.
> My problem is to implement an efficient algorithm to compute the weight
> matrix of the graph. I need to compute N*N similarities, and the only way
> I know is to use the "cartesian" operation followed by a "map" operation
> on RDDs. But this is very slow when N is large. Is there a more clever
> way to do this for an arbitrary similarity function?
>
> Cheers,
>
> Jao
>