Posted to solr-user@lucene.apache.org by Upayavira <uv...@odoko.co.uk> on 2014/11/26 21:05:11 UTC

comparing feature vectors using Solr/Lucene

Hi,

I've been asked how to use Solr as a component in a machine learning
system, doing document comparison based upon feature vectors.

If I have two vectors, one in the index (in some form) and one in the
query (in some form), how can I do, for example, a vector multiplication
of the two vectors in order to calculate a score?

The feature space I am being given has 100 features, with numerical
scores for each feature. In this case, it is not sparse - most features
will have a value.
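
For concreteness, the score being asked about can be sketched as a plain
dot product of two dense vectors. This is only an illustration in Python,
outside of Solr; the vector values are made up, and the names
NUM_FEATURES and dot_product are hypothetical.

```python
import random

NUM_FEATURES = 100  # size of the feature space described above


def dot_product(doc_vector, query_vector):
    """Return the sum of element-wise products of two equal-length vectors."""
    if len(doc_vector) != len(query_vector):
        raise ValueError("vectors must have the same length")
    return sum(d * q for d, q in zip(doc_vector, query_vector))


# Made-up dense vectors, purely for illustration: every feature has a value.
rng = random.Random(0)
doc_vector = [rng.random() for _ in range(NUM_FEATURES)]
query_vector = [rng.random() for _ in range(NUM_FEATURES)]

score = dot_product(doc_vector, query_vector)
print(score)
```

The open question in this thread is how to have Solr evaluate something
equivalent at query time, with one vector stored in the index and the
other supplied with the query.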

I have some ideas, but they seem to get me only part of the way.

Has anyone worked with Solr in this way?

Thanks,

Upayavira

Re: comparing feature vectors using Solr/Lucene

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Hello,

Lucene excels at calculating the scalar product (the score, under whatever
similarity) of sparse feature vectors. That's it. Note that a 'feature'
usually means a term, and a 'feature vector' is a document, which might be
the opposite of your problem definition. You can either expand on the
definition of your problem, or explain it in terms of your current Solr
setup, i.e. what you index, what you filter on, etc.
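
The sparse case described above can be sketched as follows. This is a
minimal illustration in Python, not Lucene's actual scoring code: each
vector is assumed to be a mapping from term to weight, and only terms
present in both contribute to the product. The name sparse_dot and the
sample weights are hypothetical.

```python
def sparse_dot(doc_terms, query_terms):
    """Scalar product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller mapping; terms missing from either side add 0.
    if len(query_terms) > len(doc_terms):
        doc_terms, query_terms = query_terms, doc_terms
    return sum(
        weight * doc_terms[term]
        for term, weight in query_terms.items()
        if term in doc_terms
    )


# Made-up weights: the features are terms, the vector is a document.
document = {"solr": 2.0, "lucene": 1.0, "vector": 0.5}
query = {"vector": 1.0, "solr": 0.5, "mahout": 3.0}

print(sparse_dot(document, query))  # 0.5 * 1.0 + 2.0 * 0.5 = 1.5
```

With 100 features that almost always have a value, nearly every term
would match on every document, which is the dense case the original
question is about.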

Best wishes

On Thu, Nov 27, 2014 at 9:57 AM, Upayavira <uv...@odoko.co.uk> wrote:

> Thanks Nicholas, there is a sense in which Solr isn't the right tool.
> However, we already have lots of business rules encapsulated into filter
> queries, and already have content ingestion pipelines for our content in
> place.
>
> TF-IDF similarity is pluggable (even just by sorting on function
> queries), so am looking for an alternative way to encapsulate the
> scoring algorithm.
>
> Upayavira
>
> On Wed, Nov 26, 2014, at 10:14 PM, Nicholas Ding wrote:
> > I'm not sure if Solr is the right tool to do this task. You probably need
> > a
> > machine learning library like Mahout or Weka.
> >
> > PS: Lucene doesn't really use Cosine Similarity, it's using a practical
> > TF-IDF Similarity.
> >
> > Nicholas Ding



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: comparing feature vectors using Solr/Lucene

Posted by Paul Libbrecht <pa...@hoplahup.net>.

Upayavira,

On the Lucene list, two tools are sometimes mentioned that might do some of what you are looking for:
- semanticvectors (https://code.google.com/p/semanticvectors)
- word2vec https://github.com/kojisekig/word2vec-lucene/i
Maybe they help?
My impression is that you are after Lucene's performance rather than these tools, which I see mainly as explicit examples of the value of using vectors for word engineering.

Paul



Re: comparing feature vectors using Solr/Lucene

Posted by Upayavira <uv...@odoko.co.uk>.

Thanks Nicholas, there is a sense in which Solr isn't the right tool.
However, we already have lots of business rules encapsulated into filter
queries, and already have content ingestion pipelines for our content in
place.

TF-IDF similarity is pluggable (even just by sorting on function
queries), so I am looking for an alternative way to encapsulate the
scoring algorithm.
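
One way the function-query idea might look, as a sketch only: assume
each feature is indexed in its own numeric field (the field names
feature_0 ... feature_99 are hypothetical, as is the example filter
query), and build a sort expression from Solr's sum() and product()
functions with the query-side weights as constants. The Python below
only assembles request parameters; it does not talk to Solr, and
nothing in this thread confirms that this approach performs
acceptably.

```python
def dot_product_function_query(query_vector, field_prefix="feature_"):
    """Build a Solr function expression: sum(product(field_i, weight_i), ...)."""
    terms = [
        f"product({field_prefix}{i},{weight})"
        for i, weight in enumerate(query_vector)
    ]
    return "sum(" + ",".join(terms) + ")"


def build_params(query_vector, filter_queries):
    """Assemble request parameters: existing filters plus a function sort."""
    return {
        "q": "*:*",
        "fq": list(filter_queries),  # existing business rules stay as filters
        "sort": dot_product_function_query(query_vector) + " desc",
    }


# Hypothetical three-feature example; a real request would carry 100 weights.
params = build_params([0.5, 2.0, 1.25], ["type:article"])
print(params["sort"])
```

The appeal of this shape is that the filter queries already in place
are reused unchanged, and only the ordering is replaced.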

Upayavira


Re: comparing feature vectors using Solr/Lucene

Posted by Nicholas Ding <ni...@gmail.com>.

I'm not sure Solr is the right tool for this task. You probably need a
machine learning library like Mahout or Weka.

PS: Lucene doesn't really use cosine similarity; it uses a practical
TF-IDF similarity.
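
For reference, textbook cosine similarity is the dot product divided by
the product of the two vector lengths. The Python sketch below shows
only that definition; it does not reproduce Lucene's TF-IDF formula,
and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1.0 regardless of magnitude.
print(cosine_similarity([3, 4], [6, 8]))
# Orthogonal vectors score 0.0.
print(cosine_similarity([1, 0], [0, 1]))
```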

Nicholas Ding
