Posted to solr-user@lucene.apache.org by Benedict Holland <be...@gmail.com> on 2018/10/30 20:09:04 UTC

Integrating word2vec and glove results into Solr

Hello all,

We came up with a fascinating question. We actually have word2vec,
doc2vec, and GloVe results for our corpora. Is it possible to use these
datasets within the search engine? If so, could you please point me to
documentation on how to get Solr to use them?

Thank you so much,
~Ben

Re: Integrating word2vec and glove results into Solr

Posted by Benedict Holland <be...@gmail.com>.
Thanks Doug.

It is funny that you should mention that. It is very hard to convince
people that, even when words are somehow related, we really don't know
how they are related. This is especially true when they are handed the
results of a shallow neural net that took a research team a few weeks to
put together.

I am always happy to have the reminder about common and rare words.
Honestly, I am not that happy with the size of our corpus, but it might be
just enough. Alternatively, we could weight the embedding results very low
when the search engine ranks results from most to least relevant.
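
As a rough illustration of weighting embedding results low, here is a
minimal Python sketch that builds a Solr query string in which terms
suggested by an embedding model get a small boost through the standard
query parser's `^` syntax. The function name `expand_query`, the field
name `text`, and the boost value 0.1 are assumptions made up for this
example; nothing in this thread specifies them, and the sketch does not
escape special query characters.

```python
def expand_query(term, neighbors, field="text", expansion_boost=0.1):
    """Build a Solr standard-query-parser string.

    The original term keeps the default boost; each embedding neighbor
    is OR-ed in with a low boost so it can only nudge the ranking.
    `field` and `expansion_boost` are hypothetical defaults.
    """
    clauses = [f"{field}:{term}"]
    for neighbor in neighbors:
        # term^boost is the standard query parser's boost syntax.
        clauses.append(f"{field}:{neighbor}^{expansion_boost}")
    return " OR ".join(clauses)


if __name__ == "__main__":
    # The neighbors would come from word2vec/GloVe nearest-neighbor lookups.
    print(expand_query("laptop", ["notebook", "ultrabook"]))
```

The resulting string could be passed as the `q` parameter of a select
request. Keeping the boost well below 1 means an exact match on the
original term should still outrank documents that only match an expansion.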

Oh, given that a lack of text is a problem, would doing this on Twitter
data be an issue? I assume that running vector relationships over
Twitter data is probably not going to do much.

Thank you so much for the feedback.
~Ben



On Tue, Oct 30, 2018 at 5:59 PM Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> You may already know this, but just be very careful. Embeddings are useful,
> but people often think of them as detecting synonyms when they really just
> encode contexts. For example, antonyms and words with similar functions
> are often seen as similar.
>
> There are also issues with terms that occur sparsely (you don't get enough
> contexts to get a good embedding)
> and issues with terms that occur very commonly (they tend to clump together
> despite different meanings).
>
> Older form of embedding, but the lessons still apply
>
> https://opensourceconnections.com/blog/2016/03/29/semantic-search-with-latent-semantic-analysis/
>
> I'd also recommend my talk at Activate that spends a ton of time on
> building/customizing embeddings for your use case
>
> https://docs.google.com/presentation/d/1-nPKX5VYUR7uue5IL0tm7M2YH0agb0aRO1y9sMKl1Hs/edit#slide=id.g3abdd68a3e_0_192
>
> -Doug
>

Re: Integrating word2vec and glove results into Solr

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
You may already know this, but just be very careful. Embeddings are useful,
but people often think of them as detecting synonyms when they really just
encode contexts. For example, antonyms and words with similar functions
are often seen as similar.
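
To make the point about contexts concrete, here is a small Python sketch
of a cosine-similarity nearest-neighbor lookup. The three-dimensional
vectors are made-up toy values, not output from any real word2vec or
GloVe model; they are chosen only to mimic the situation where an
antonym pair ends up close together because both words appear in similar
contexts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: invented for illustration).
toy_vectors = {
    "hot": [0.9, 0.8, 0.1],
    "cold": [0.8, 0.9, 0.1],
    "invoice": [0.1, 0.0, 0.9],
}


def nearest(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in nearest("hot", toy_vectors):
        print(f"{other}: {score:.3f}")
```

With these toy values the top neighbor of "hot" is "cold", which is
exactly the kind of result that is related but not a synonym.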

There are also issues with terms that occur sparsely (you don't get enough
contexts to get a good embedding)
and issues with terms that occur very commonly (they tend to clump together
despite different meanings).
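
One possible way to act on both issues is to only trust embeddings for
terms in a middle frequency band. The Python sketch below is an
assumption about how that could be done, not a method described in this
thread: the function name `usable_terms` and the thresholds `min_count`
and `max_doc_ratio` are hypothetical and would need tuning per corpus.

```python
from collections import Counter


def usable_terms(tokenized_docs, min_count=5, max_doc_ratio=0.5):
    """Return terms that are neither too rare nor too common.

    A term is kept if it occurs at least `min_count` times overall
    (enough contexts) and appears in at most `max_doc_ratio` of the
    documents (not so common that it clumps with everything).
    """
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in tokenized_docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized_docs)
    return {
        term
        for term, count in term_counts.items()
        if count >= min_count and doc_counts[term] / n_docs <= max_doc_ratio
    }


if __name__ == "__main__":
    docs = [
        ["the", "solr", "index"],
        ["the", "solr", "query"],
        ["the", "glove"],
        ["the", "vector"],
    ]
    print(usable_terms(docs, min_count=2, max_doc_ratio=0.5))
```

Terms outside the returned set could simply be left out of any
embedding-based expansion and handled by ordinary keyword matching.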

Older form of embedding, but the lessons still apply
https://opensourceconnections.com/blog/2016/03/29/semantic-search-with-latent-semantic-analysis/

I'd also recommend my talk at Activate that spends a ton of time on
building/customizing embeddings for your use case
https://docs.google.com/presentation/d/1-nPKX5VYUR7uue5IL0tm7M2YH0agb0aRO1y9sMKl1Hs/edit#slide=id.g3abdd68a3e_0_192

-Doug

On Tue, Oct 30, 2018 at 5:37 PM Benedict Holland <
benedict.m.holland@gmail.com> wrote:

> Oh very cool. I will have to look into this more. This is something up and
> coming I take it?
>
> Thanks,
> ~Ben
>
-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug

Re: Integrating word2vec and glove results into Solr

Posted by Benedict Holland <be...@gmail.com>.
Oh very cool. I will have to look into this more. This is something up and
coming I take it?

Thanks,
~Ben

On Tue, Oct 30, 2018 at 4:36 PM Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> Simon Hughes's presentation at the just-finished Activate conference may
> be relevant:
>
> https://www.slideshare.net/SimonHughes13/vectors-in-search-towards-more-semantic-matching
> The video will be available in a couple of weeks, I am guessing on the
> LucidWorks channel.
>
> Related repos:
> *) https://github.com/DiceTechJobs/VectorsInSearch
> *) https://github.com/DiceTechJobs/ConceptualSearch (older)
> *) https://github.com/kojisekig/word2vec-lucene - something else quite old
>
> These are just keyword matches on your question. I am sure others may
> have some more real details.
>
> Regards,
>    Alex.

Re: Integrating word2vec and glove results into Solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Simon Hughes's presentation at the just-finished Activate conference may
be relevant:
https://www.slideshare.net/SimonHughes13/vectors-in-search-towards-more-semantic-matching
The video will be available in a couple of weeks, I am guessing on the
LucidWorks channel.

Related repos:
*) https://github.com/DiceTechJobs/VectorsInSearch
*) https://github.com/DiceTechJobs/ConceptualSearch (older)
*) https://github.com/kojisekig/word2vec-lucene - something else quite old

These are just keyword matches on your question. I am sure others may
have some more real details.

Regards,
   Alex.
On Tue, 30 Oct 2018 at 16:09, Benedict Holland
<be...@gmail.com> wrote:
>
> Hello all,
>
> We came up with a fascinating question. We actually have word2vec,
> doc2vec, and GloVe results for our corpora. Is it possible to use these
> datasets within the search engine? If so, could you please point me to
> documentation on how to get Solr to use them?
>
> Thank you so much,
> ~Ben