Posted to dev@lucene.apache.org by "J. Delgado" <jo...@gmail.com> on 2017/12/09 08:00:15 UTC

Word Embedding stored in Lucene Index

It has been a couple of years since the Neu-IR workshop (
https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/craswell-report-2016.pdf).
I'm wondering if anyone has tinkered with storing word/document embeddings
in Lucene and using them to improve the core relevance model.


One of the key ideas of neural search is to leverage such representations
in order to improve the effectiveness of search engines. It would be very
nice to have a retrieval model that relies on word and document vectors
(also called *embeddings*), so that we could compute and exploit document
and word similarities very efficiently by looking at the "nearest
neighbours" of a given vector.
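
To make the "nearest neighbours" idea concrete, here is a minimal sketch
that assumes the embeddings are already loaded into an in-memory map and
that cosine similarity is the measure of interest. The class name
EmbeddingNeighbours, its methods, and the toy vectors are all hypothetical;
none of this is Lucene API, and a brute-force scan like this would not
scale to a real index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: brute-force nearest-neighbour
// lookup over word vectors held in memory.
public class EmbeddingNeighbours {

    // Cosine similarity between two vectors of equal length.
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // treat a zero vector as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the word whose vector is most similar to the query word's
    // vector, skipping the query word itself; null if the query is unknown.
    public static String nearest(String query, Map<String, float[]> vectors) {
        float[] q = vectors.get(query);
        if (q == null) {
            return null;
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (e.getKey().equals(query)) {
                continue;
            }
            double score = cosine(q, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy three-dimensional vectors, made up for illustration only.
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put("search", new float[] {0.9f, 0.1f, 0.0f});
        vectors.put("retrieval", new float[] {0.8f, 0.2f, 0.1f});
        vectors.put("banana", new float[] {0.0f, 0.1f, 0.9f});
        System.out.println(nearest("search", vectors)); // prints "retrieval"
    }
}
```

The open question for Lucene is how to get the same answer without scanning
every vector, which is where an index-backed structure would come in.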


I found this code, which can generate word2vec vectors from a Lucene index:

https://github.com/kojisekig/word2vec-lucene


But the closest work I have found on using deep learning in Lucene is the
paper "Large Scale Indexing and Searching Deep Convolutional Neural Network
Features" (https://link.springer.com/chapter/10.1007/978-3-319-43946-4_14),
which applies mainly to content-based image retrieval.


-- J