Posted to user@spark.apache.org by "apu mishra . rr" <ap...@gmail.com> on 2016/02/03 23:04:48 UTC

Nearest neighbors in Spark with Annoy

As MLlib doesn't have nearest-neighbors functionality, I'm trying to use
Annoy <https://github.com/spotify/annoy> for approximate nearest neighbors.
I broadcast the Annoy index and query it on the workers; however, the
query returns None there instead of a list of neighbors.

Below is complete code to reproduce the problem. It shows up as the
difference between querying Annoy with and without Spark.

from annoy import AnnoyIndex
import random
random.seed(42)

f = 40  # Length of item vector that will be indexed
t = AnnoyIndex(f)
allvectors = []
for i in xrange(20):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)
    allvectors.append((i, v))
t.build(10) # 10 trees

# Use Annoy with Spark
sparkvectors = sc.parallelize(allvectors)
bct = sc.broadcast(t)
x = sparkvectors.map(lambda row: bct.value.get_nns_by_vector(vector=row[1], n=5))
print "Five closest neighbors for first vector with Spark:",
print x.first()

# Use Annoy without Spark
print "Five closest neighbors for first vector without Spark:",
print(t.get_nns_by_vector(vector=allvectors[0][1], n=5))
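
One way to narrow this down, offered as a sketch rather than a diagnosis:
PySpark ships a broadcast value to the workers by pickling it, so an object
whose state lives outside ordinary Python attributes may not arrive intact.
The helper below (the name survives_pickle and its probe argument are
hypothetical) runs the same query on an object and on a pickled copy of it,
without involving Spark. Whether AnnoyIndex actually loses its contents this
way is an assumption to be tested, not something this post establishes.

```python
import pickle


def survives_pickle(obj, probe):
    """Return (before, after): probe(obj) on the original and on a pickled copy."""
    before = probe(obj)
    # Round-trip through pickle, the same kind of serialization a broadcast uses.
    clone = pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
    after = probe(clone)
    return before, after


# Sanity check on a plain list, which pickles faithfully.
before, after = survives_pickle([3, 1, 2], sorted)
print(before == after)
```

With the index from the code above, the call would be
survives_pickle(t, lambda idx: idx.get_nns_by_vector(allvectors[0][1], 5)).
If the two results differ, or pickling raises an error, the broadcast step
itself is the likely culprit.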


Output seen:

Five closest neighbors for first vector with Spark: None

Five closest neighbors for first vector without Spark: [0, 13, 12, 6, 4]
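
The None result on the workers would be consistent with the index contents
not surviving the broadcast; that is an assumption, not something confirmed
here. Under that assumption, one possible workaround is to ship the index
file instead of the Python object: save the index on the driver, distribute
the file with sc.addFile, and load it on each worker inside mapPartitions.
The sketch below is not a verified fix. The names make_partition_fn,
load_annoy_index, and index.ann are hypothetical, and the metric is passed
explicitly as "angular" on the assumption that this matches the default used
by the one-argument AnnoyIndex(f) call in the post.

```python
def make_partition_fn(load_index, n=5):
    """Build a function for rdd.mapPartitions that loads the index once per partition."""
    def query_partition(rows):
        index = load_index()  # runs on the worker, not on the driver
        for item_id, vector in rows:
            yield item_id, index.get_nns_by_vector(vector, n)
    return query_partition


def load_annoy_index(f=40, filename="index.ann"):
    """Load the Annoy index from the file that sc.addFile distributed."""
    # Imported here so the imports happen on the worker that calls this.
    from annoy import AnnoyIndex
    from pyspark import SparkFiles
    index = AnnoyIndex(f, "angular")  # metric is an assumption, see the note above
    index.load(SparkFiles.get(filename))
    return index


# Driver-side wiring, shown as comments because it needs a live SparkContext `sc`
# and the objects `t` and `sparkvectors` from the original code:
#   t.save("index.ann")
#   sc.addFile("index.ann")
#   x = sparkvectors.mapPartitions(make_partition_fn(load_annoy_index))
#   print(x.first())
```

Loading inside mapPartitions rather than map means the file is opened once
per partition instead of once per vector. Each result is an (id, neighbors)
pair rather than a bare neighbor list, so the output differs in shape from
the original map call.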