Posted to solr-user@lucene.apache.org by Andrew Clegg <an...@gmail.com> on 2009/11/10 17:02:17 UTC

Selection of terms for MoreLikeThis

Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the interesting terms returned is "and" (I don't do any stopword
removal on this field).

However if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
with *much* higher tf.idf scores, e.g.:

<lst name="aquaspirillum">
<int name="tf">1</int>
<int name="df">10</int>
<double name="tf-idf">0.1</double>
</lst>

that *don't* appear in the MoreLikeThis list. (I tried adding &mlt.maxwl=999
to the end of the MLT query but it makes no difference.)

What's going on? Surely something with tf.idf = 0.1 is a far better
candidate for a MoreLikeThis query than something with tf.idf = 7.46E-4? Or
does MoreLikeThis do some other heuristic magic to select good candidates,
and sometimes get it wrong?
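
For reference, the tf-idf figures that the TermVectorComponent reports in this thread are consistent with a plain tf/df ratio. The short Python check below uses only tf and df values pasted in the thread; the function name is illustrative, and the idea that the component computes exactly this ratio is inferred from the numbers, not confirmed.

```python
# Term statistics as reported by the TermVectorComponent in this thread.
# Each entry is (tf, df, reported tf-idf).
reported = {
    "aquaspirillum": (1, 10, 0.1),
    "and": (45, 60316, 7.460706943431262e-4),
}

def tvc_tf_idf(tf, df):
    """Plain ratio tf/df (assumed to be what the component reports)."""
    return tf / df

for term, (tf, df, value) in reported.items():
    ratio = tvc_tf_idf(tf, df)
    # Compare the computed ratio with the value pasted from the response.
    print(f"{term}: tf/df = {ratio:.6g}, reported = {value:.6g}")
```

Note that this ratio has no logarithm in it, so it penalises common terms much more heavily than a log-damped idf would.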

BTW the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,

Andrew.

-- 
:: http://biotext.org.uk/ ::

-- 
View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26286005.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Selection of terms for MoreLikeThis

Posted by Andrew Clegg <an...@gmail.com>.


Chantal Ackermann wrote:
> 
> your URL does not include the parameter mlt.boost. Setting that to 
> "true" made a noticeable difference for my queries.
> 

Hmm, I'm really not sure if this is doing the right thing either. When I add
it I get:

 <float name="keywords:dehydrogenase">1.0</float>
 <float name="keywords:reductase">0.60737264</float>
 <float name="keywords:metabolism">0.27599618</float>
 <float name="keywords:activity">0.2476748</float>
 <float name="keywords:process">0.24487767</float>
 <float name="keywords:alcohol">0.23969446</float>
 <float name="keywords:and">0.1990452</float>
 <float name="keywords:malate">0.18447271</float>
 <float name="keywords:biosynthesis">0.13297324</float>
 <float name="keywords:biosynthetic">0.1233415</float>
 <float name="keywords:degradation">0.11993817</float>
 <float name="keywords:precursor">0.11789705</float>
 <float name="keywords:metabolic">0.117194556</float>
 <float name="keywords:protein">0.11164951</float>
 <float name="keywords:synthase">0.10744005</float>
 <float name="keywords:acid">0.09943076</float>
 <float name="keywords:enzyme">0.097062066</float>
 <float name="keywords:succinyl-coa">0.09287166</float>
 <float name="keywords:putative">0.0877542</float>
 <float name="keywords:(nadp+)">0.0864609</float>
 <float name="keywords:4,6-dehydratase">0.08362857</float>
 <float name="keywords:fatty">0.07988805</float>
 <float name="keywords:chloroplast">0.079598725</float>
 <float name="keywords:lactobacillus">0.07747293</float>
 <float name="keywords:glyoxylate">0.075560644</float>

"and" scores far more highly than much more discriminative words like
"chloroplast" and "glyoxylate", both of which have *much* higher tf.idf
scores than "and" according to the TermVectorComponent:

<lst name="chloroplast">
<int name="tf">8</int>
<int name="df">1887</int>
<double name="tf-idf">0.0042395336512983575</double>
</lst>

<lst name="glyoxylate">
<int name="tf">7</int>
<int name="df">1111</int>
<double name="tf-idf">0.0063006300630063005</double>
</lst>

<lst name="and">
<int name="tf">45</int>
<int name="df">60316</int>
<double name="tf-idf">7.460706943431262E-4</double>
</lst>

In fact they are roughly six to eight times higher.


Chantal Ackermann wrote:
> 
> If not, there is also the parameter
>   mlt.minwl
> "minimum word length below which words will be ignored."
> 
> All your other terms seem longer than 3, so it would help in this case? 
> But that seems a bit like a workaround.
> 

Yeah, I could do that, or add a stopword list to that field. But there are
some other common terms in the list like "protein" or "enzyme" that are long
and not really stopwords, but have a similarly low tf.idf to "and":

<lst name="protein">
<int name="tf">43</int>
<int name="df">189541</int>
<double name="tf-idf">2.2686384476181933E-4</double>
</lst>

<lst name="enzyme">
<int name="tf">15</int>
<int name="df">16712</int>
<double name="tf-idf">8.975586404978459E-4</double>
</lst>

Plus, of course, I'm curious to know exactly how MLT is identifying those
terms as important, and if it's a bug or my fault...

Thanks for your help though! Do any of the Solr devs have an idea of the
mechanism at work here?
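
One possible explanation, offered as a hedged sketch and not as a confirmed answer: if MLT ranks candidate terms by tf multiplied by a log-damped idf (in the style of Lucene's default similarity) instead of by tf/df, then a term with a high tf such as "and" can outrank rarer terms. The Python below compares the two weightings on the tf and df values pasted in this thread. The formula in `log_idf_score` and the value of `NUM_DOCS` are assumptions; the thread does not state the index size.

```python
import math

# ASSUMPTION: hypothetical index size; the real document count is not given
# in the thread (it must be at least 189541, the df of "protein").
NUM_DOCS = 300_000

# (tf, df) pairs copied from the TermVectorComponent output in this thread.
terms = {
    "and": (45, 60316),
    "chloroplast": (8, 1887),
    "glyoxylate": (7, 1111),
    "protein": (43, 189541),
    "enzyme": (15, 16712),
    "aquaspirillum": (1, 10),
}

def ratio_score(tf, df):
    """tf/df, matching the tf-idf values the TermVectorComponent reports."""
    return tf / df

def log_idf_score(tf, df, num_docs=NUM_DOCS):
    """tf * (ln(num_docs / (df + 1)) + 1): an assumed Lucene-style weighting."""
    return tf * (math.log(num_docs / (df + 1)) + 1.0)

# Rank the same terms under each weighting, best first.
by_ratio = sorted(terms, key=lambda t: ratio_score(*terms[t]), reverse=True)
by_log_idf = sorted(terms, key=lambda t: log_idf_score(*terms[t]), reverse=True)

print("ranked by tf/df:     ", by_ratio)
print("ranked by tf*log-idf:", by_log_idf)
```

Under these assumptions the log-damped ranking puts "and" ahead of "protein", "enzyme", "chloroplast" and "glyoxylate", which is the same relative order as the mlt.boost output above, while a term with tf = 1 such as "aquaspirillum" drops to the bottom.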

Andrew.

Re: Selection of terms for MoreLikeThis

Posted by Chantal Ackermann <ch...@btelligent.de>.

Hi Andrew,

your URL does not include the parameter mlt.boost. Setting that to 
"true" made a noticeable difference for my queries.

If not, there is also the parameter
  mlt.minwl
"minimum word length below which words will be ignored."

All your other terms seem longer than 3, so it would help in this case? 
But that seems a bit like a workaround.
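
As a rough illustration of the two suggestions above, the request could be assembled like this in Python. The base URL and the existing parameters are copied from the query in this thread; the `mlt.minwl` value of 4 is a hypothetical threshold chosen only so that three-letter words such as "and" are skipped, and `build_mlt_url` is an illustrative helper name.

```python
from urllib.parse import urlencode

# Base handler URL and parameters copied from the query in this thread.
BASE_URL = "http://www.cathdb.info/solr/mlt"

params = {
    "q": "id:3.40.50.720",
    "rows": 0,
    "mlt.interestingTerms": "details",
    "mlt.match.include": "false",
    "mlt.fl": "keywords",
    "mlt.mintf": 1,
    "mlt.mindf": 1,
    "mlt.boost": "true",  # boost each interesting term by its relevance
    "mlt.minwl": 4,       # hypothetical threshold: skip words shorter than 4 chars
}

def build_mlt_url(base_url, query_params):
    """Return the request URL; ':' is left unescaped for readability."""
    return base_url + "?" + urlencode(query_params, safe=":")

print(build_mlt_url(BASE_URL, params))
```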

Cheers,
Chantal

Re: Selection of terms for MoreLikeThis

Posted by Andrew Clegg <an...@gmail.com>.


Chantal Ackermann wrote:
> 
> no idea, I'm afraid - but could you send the output of 
> interestingTerms=details?
> This at least would show what MoreLikeThis uses, in comparison to the 
> TermVectorComponent you've already pasted.
> 

I can, but I'm afraid they're not very illuminating!

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=details&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">59</int>
</lst>
<result name="response" numFound="280227" start="0"/>
<lst name="interestingTerms">
 <float name="keywords:dehydrogenase">1.0</float>
 <float name="keywords:reductase">1.0</float>
 <float name="keywords:metabolism">1.0</float>
 <float name="keywords:activity">1.0</float>
 <float name="keywords:process">1.0</float>
 <float name="keywords:alcohol">1.0</float>
 <float name="keywords:and">1.0</float>
 <float name="keywords:malate">1.0</float>
 <float name="keywords:biosynthesis">1.0</float>
 <float name="keywords:biosynthetic">1.0</float>
 <float name="keywords:degradation">1.0</float>
 <float name="keywords:precursor">1.0</float>
 <float name="keywords:metabolic">1.0</float>
 <float name="keywords:protein">1.0</float>
 <float name="keywords:synthase">1.0</float>
 <float name="keywords:acid">1.0</float>
 <float name="keywords:enzyme">1.0</float>
 <float name="keywords:succinyl-coa">1.0</float>
 <float name="keywords:putative">1.0</float>
 <float name="keywords:(nadp+)">1.0</float>
 <float name="keywords:4,6-dehydratase">1.0</float>
 <float name="keywords:fatty">1.0</float>
 <float name="keywords:chloroplast">1.0</float>
 <float name="keywords:lactobacillus">1.0</float>
 <float name="keywords:glyoxylate">1.0</float>
</lst>
</response>
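
For anyone comparing these lists by script instead of by eye, a response of this shape can be parsed with Python's standard library. This is a minimal sketch: `SAMPLE` is an abbreviated stand-in for the real response, using three of the boosted weights posted elsewhere in this thread, and `interesting_terms` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample shaped like the response above, with boosted values
# taken from the mlt.boost output posted elsewhere in this thread.
SAMPLE = """<response>
<lst name="interestingTerms">
 <float name="keywords:dehydrogenase">1.0</float>
 <float name="keywords:and">0.1990452</float>
 <float name="keywords:chloroplast">0.079598725</float>
</lst>
</response>"""

def interesting_terms(xml_text):
    """Return (field, term, weight) tuples, highest weight first."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("./lst[@name='interestingTerms']/float"):
        # Names look like "keywords:and"; split on the first colon only.
        field, _, term = node.get("name").partition(":")
        rows.append((field, term, float(node.text)))
    return sorted(rows, key=lambda row: row[2], reverse=True)

for field, term, weight in interesting_terms(SAMPLE):
    print(f"{field}\t{term}\t{weight}")
```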

Cheers,

Andrew.

Re: Selection of terms for MoreLikeThis

Posted by Chantal Ackermann <ch...@btelligent.de>.

Hi Andrew,

no idea, I'm afraid - but could you send the output of 
interestingTerms=details?
This at least would show what MoreLikeThis uses, in comparison to the 
TermVectorComponent you've already pasted.

Chantal

Re: Selection of terms for MoreLikeThis

Posted by Andrew Clegg <an...@gmail.com>.

Any ideas on this? Is it worth sending a bug report?

Those links are live, by the way, in case anyone wants to verify that MLT is
returning suggestions with very low tf.idf.

Cheers,

Andrew.
