Posted to solr-user@lucene.apache.org by Maria Mestre <Ma...@oracle.com> on 2019/01/28 16:29:27 UTC

MLT - unexpected design choice

Hi all,

First of all, I’m not a Java developer, and I’m a Solr newbie. I have worked with Elasticsearch for some years (not contributing, just as a user), so I think I have the basics of text search engines covered. I am always learning new things though!

I created an index in Solr and used more-like-this on it, by passing a document_id. My data has a peculiarity: one of the fields, “description”, is only populated about 10% of the time and is otherwise empty. I am using that field to find similar documents.

So I query the /mlt endpoint using these parameters (for example):

{q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457",
mlt=true,
mlt.fl=description,
mlt.mindf=1,
mlt.mintf=1,
mlt.maxqt=5,
wt=json,
mlt.interestingTerms=details}

The issue I have is that when retrieving the key scored terms (interestingTerms), the code uses the total number of documents in the index, not the number of documents with a populated “description” field. This is where it’s done in the code: https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651

The effect of this choice is that the “idf” does not vary much, given that numDocs >> number of documents with “description”, so the key terms end up being just the terms with the highest term frequencies.

It is inconsistent because the MLT-search then uses these extracted key terms and scores all documents using an idf which is computed only on the subset of documents with “description”. So one part of the MLT uses a different numDocs than another part. This sounds like an odd choice, and not expected at all, and I wonder if I’m missing something.
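
To illustrate the effect with numbers, here is a small stand-alone sketch. It assumes the classic idf formula log((docCount+1)/(docFreq+1)) + 1; the class name, the document counts and the term frequencies are all made up for illustration and are not taken from my index or from the MoreLikeThis code.

```java
// Illustrative only: every count below is hypothetical.
final class IdfSpread {
    // Classic idf: log((docCount + 1) / (docFreq + 1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // How much more weight a rare term gets than a common one
    // for a given total document count.
    static double rareToCommonRatio(long rareDf, long commonDf, long docCount) {
        return idf(rareDf, docCount) / idf(commonDf, docCount);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;      // hypothetical: every document in the index
        long docsWithField = 100_000;  // hypothetical: the ~10% with "description"
        long rareDf = 50;
        long commonDf = 5_000;
        System.out.printf("ratio using numDocs:       %.3f%n",
                rareToCommonRatio(rareDf, commonDf, numDocs));
        System.out.printf("ratio using docs w/ field: %.3f%n",
                rareToCommonRatio(rareDf, commonDf, docsWithField));
    }
}
```

The larger the document count relative to the document frequencies, the closer the two idf values get in relative terms, which is why term frequency ends up deciding the ranking.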

Best,
Maria




 

Matrix Factorization possible with Streams?

Posted by Vidhya Kailash <vi...@gmail.com>.
Hi,
I am wondering if anyone has attempted Matrix Factorization with
Streams in Solr? If so, any pointers would be appreciated.

thanks
Vidhya

Re: MLT - unexpected design choice

Posted by Maria Mestre <Ma...@oracle.com>.
Hi Alessandro and Matt,

Thanks so much for your help!

@Alessandro: I will do so, thank you :-)



> On 29 Jan 2019, at 12:26, Alessandro Benedetti <a....@sease.io> wrote:
> 
> Hi Maria,
> this is actually a great catch!
> I have been working a lot on the More Like This and this mistake never
> caught my attention.
> 
> I agree with you, feel free to open a Jira Issue.

Re: MLT - unexpected design choice

Posted by Alessandro Benedetti <a....@sease.io>.
Hi Maria,
this is actually a great catch!
I have been working a lot on the More Like This and this mistake never
caught my attention.

I agree with you, feel free to open a Jira Issue.

First of all, what you say makes sense.
Secondly, it is the standard way used in the Lucene similarity
calculations:

public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
  final long df = termStats.docFreq();
  final long docCount = collectionStats.docCount();
  final float idf = idf(df, docCount);
  return Explanation.match(idf, "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
      Explanation.match(df, "docFreq, number of documents containing term"),
      Explanation.match(docCount, "docCount, total number of documents with field"));
}


Indeed the int numDocs = ir.numDocs(); should actually be allocated
per term in the for loop, using the field stats, something like:

numDocs = ir.getDocCount(fieldName)
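
To show why the choice of document count matters for term selection, here is a stand-alone sketch in plain Java with no Lucene dependency. It is not the MoreLikeThis code: the class, the maps standing in for index statistics, and all numbers are hypothetical. Only the idf formula is the one from idfExplain.

```java
import java.util.Map;

// Stand-alone sketch: plain maps stand in for index statistics.
// None of these names come from Lucene; the numbers are invented.
final class PerFieldDocCountSketch {
    // Classic idf: log((docCount + 1) / (docFreq + 1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // Returns the term with the highest tf * idf score for the given document count.
    static String topTerm(Map<String, Integer> termFreqs,
                          Map<String, Long> docFreqs,
                          long docCount) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double score = e.getValue() * idf(docFreqs.get(e.getKey()), docCount);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical source document: "common" occurs twice, "rare" once.
        Map<String, Integer> termFreqs = Map.of("common", 2, "rare", 1);
        Map<String, Long> docFreqs = Map.of("common", 5_000L, "rare", 50L);

        long numDocs = 1_000_000;      // stands in for ir.numDocs()
        long docsWithField = 100_000;  // stands in for ir.getDocCount(fieldName)

        System.out.println("index-wide count: " + topTerm(termFreqs, docFreqs, numDocs));
        System.out.println("per-field count:  " + topTerm(termFreqs, docFreqs, docsWithField));
    }
}
```

With these invented numbers the index-wide count favours the more frequent term, while the per-field count favours the rarer one, so the two calls pick different top terms.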

Feel free to open the Jira issue and attach a patch with at least a
testCase that shows the bugfix.

I will be available for doing the review.


Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce <ma...@flax.co.uk> wrote:

> Hi Maria,
>
> Would it help to add a filter to your query to restrict the results to
> just those where the description field is populated? Eg. add
>
> fq=description:[* TO *]
>
> to your query parameters.

Re: MLT - unexpected design choice

Posted by Matt Pearce <ma...@flax.co.uk>.
Hi Maria,

Would it help to add a filter to your query to restrict the results to
just those where the description field is populated? E.g. add

fq=description:[* TO *]

to your query parameters.
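
In case it is useful, here is a rough sketch of how the full request could be assembled with that filter added. The host, port and collection name are placeholders I made up, and the class is only an illustration; the parameters are the ones from your message plus the fq. Note that fq only restricts which documents come back, so I am not sure it changes how the interesting terms are chosen.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: builds a URL-encoded query string for the /mlt handler.
final class MltRequestSketch {
    // URL-encodes each key and value and joins them with '&'.
    static String queryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // The MLT parameters from the original message, plus the suggested filter.
    static Map<String, String> mltParams(String docId) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "id:\"" + docId + "\"");
        p.put("mlt", "true");
        p.put("mlt.fl", "description");
        p.put("mlt.mindf", "1");
        p.put("mlt.mintf", "1");
        p.put("mlt.maxqt", "5");
        p.put("mlt.interestingTerms", "details");
        p.put("wt", "json");
        p.put("fq", "description:[* TO *]"); // only documents with a description
        return p;
    }

    public static void main(String[] args) {
        // Placeholder base URL: adjust host and collection to your setup.
        String base = "http://localhost:8983/solr/mycollection/mlt";
        System.out.println(base + "?"
                + queryString(mltParams("0c7c4d74-0f37-44ea-8933-cd2ee7964457")));
    }
}
```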

Apologies if I'm misunderstanding the problem!

Best,

Matt



-- 
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk