Posted to solr-user@lucene.apache.org by Rishabh Joshi <ri...@gmail.com> on 2007/11/22 13:45:18 UTC

Strange behavior MoreLikeThis Feature

Hi,

I am running Solr 1.3 to test how the 'MoreLikeThisHandler' works.
I have indexed only the two documents given below, using post.jar:

<add>
<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>
<doc>
  <field name="id">neardup06</field>
  <field name="features">Her kan du se traileren til nok en af efterårets
største action brag. Anmelderne sviner til men folk strømmer i biografen at se
denne tsættelse. Resident Evil Extinction byder på det klassiske setup der
startede mange år tilbage, hvor man har en lille skare af helte der kæmper mod
de store onde horder. Lidt ala lord of the rings bare med zombier og andet
skravl. Filmen skovler penge ind i USA allerede og mon ikke det også bliver en
succes i Europa.</field>
</doc>
</add>

Now when I run the following query:
http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on
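
For anyone reproducing this request, here is a minimal Python sketch of how the query string above is put together. The helper name build_mlt_url is hypothetical and not part of Solr; the host, handler path and parameters are copied from the URL above, and the optional debugQuery flag is the one suggested later in this thread.

```python
from urllib.parse import urlencode

# Handler URL exactly as used in the message.
BASE = "http://localhost:8080/solr/mlt"


def build_mlt_url(doc_id, debug=False):
    """Assemble the MoreLikeThis request for one source document (hypothetical helper)."""
    params = {
        "q": f"id:{doc_id}",          # selects the document to find neighbours for
        "mlt.fl": "features",         # field(s) used to judge similarity
        "mlt.mindf": 1,               # minimum document frequency for a term
        "mlt.mintf": 1,               # minimum term frequency in the source document
        "mlt.displayTerms": "details",
        "wt": "json",
        "indent": "on",
    }
    if debug:
        params["debugQuery"] = "on"   # asks Solr to explain why each document matched
    return f"{BASE}?{urlencode(params)}"


print(build_mlt_url("neardup06"))
print(build_mlt_url("neardup06", debug=True))
```

Note that urlencode percent-encodes the colon in the q parameter; the server decodes it back to id:neardup06.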

I get the following result:
----------

{
 "responseHeader":{
  "status":0,
  "QTime":156},
 "WARNING":"This response format is experimental.  It is likely to
change in the future.",
 "match":{"numFound":1,"start":0,"docs":[
	{
	 "id":"neardup06",
	 "sku":"neardup06",
	 "popularity":0,
	 "timestamp":"2007-11-22T12:14:51.747Z",
	 "features":[
	  "Her kan du se traileren til nok en af efter�rets st�rste action
brag. Anmelderne sviner til men folk str�mmer i biografen at se denne
ts�ttelse. Resident Evil Extinction byder p� det klassiske setup der
startede mange �r tilbage, hvor man har en lille skare af helte der
k�mper mod de store onde horder. Lidt ala lord of the rings bare med
zombier og andet skravl. Filmen skovler penge ind i USA allerede og
mon ikke det ogs� bliver en succes i Europa. Finally, the unequivocal
truth about air pollution in Singapore hits the news."]}]
 },
 "response":{"numFound":1,"start":0,"docs":[
	{
	 "id":"IW-02",
	 "sku":"IW-02",
	 "price":11.5,
	 "weight":2.0,
	 "manu":"Belkin",
	 "name":"iPod & iPod Mini USB 2.0 Cable",
	 "inStock":false,
	 "popularity":1,
	 "timestamp":"2007-11-22T12:14:35.275Z",
	 "cat":[
	  "electronics",
	  "connector"],
	 "features":[
	  "car power adapter for iPod, white"]}]
 }}
----------

What I fail to understand is that these words from the first
document (in the XML) -

"car power adapter for iPod, white"

are not present anywhere in document 2, yet the MoreLikeThis feature
indicates that the two are similar.
Where am I going wrong?


Regards,
Rishabh

Re: Strange behavior MoreLikeThis Feature

Posted by Rishabh Joshi <ri...@gmail.com>.
Thanks Ryan. I now know the reason why.
Before I explain it, let me correct a mistake I made in my earlier
mail. I was not using the first document mentioned in the XML. Instead it
was this one:
<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>

I was getting the strange result because of the character "i".
Here is what I learnt from the debug info:

"debug":{
  "rawquerystring":"id:neardup06",
  "querystring":"id:neardup06",
  "parsedquery":"features:og features:en features:til features:er
features:af features:der features:ts features:se features:i features:p
features:pet features:brag features:efter features:zombier features:k
features:tilbag features:ala features:sviner features:folk
features:klassisk features:resid features:horder features:lidt
features:man features:denn",
  "parsedquery_toString":"features:og features:en features:til
features:er features:af features:der features:ts features:se
features:i features:p features:pet features:brag features:efter
features:zombier features:k features:tilbag features:ala
features:sviner features:folk features:klassisk features:resid
features:horder features:lidt features:man features:denn",
  "explain":{
	"id=IW-02,internal_docid=8":"\n0.0050230525 = (MATCH) product of:\n
0.12557632 = (MATCH) sum of:\n    0.12557632 = (MATCH)
weight(features:i in 8), product of:\n      0.17474915 =
queryWeight(features:i), product of:\n        1.9162908 =
idf(docFreq=3)\n        0.09119135 = queryNorm\n      0.71860904 =
(MATCH) fieldWeight(features:i in 8), product of:\n        1.0 =
tf(termFreq(features:i)=1)\n        1.9162908 = idf(docFreq=3)\n
 0.375 = fieldNorm(field=features, doc=8)\n  0.04 = coord(1/25)\n"}}}
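
The score in the explain output above can be checked by hand. The Python sketch below multiplies the factors in the same order as the explain tree nests them. The variable names are illustrative labels only; every number is copied from the output above.

```python
# Factors copied from the explain output for id=IW-02.
idf = 1.9162908          # idf(docFreq=3)
query_norm = 0.09119135  # queryNorm
tf = 1.0                 # tf(termFreq(features:i)=1)
field_norm = 0.375       # fieldNorm(field=features, doc=8)
coord = 1 / 25           # coord(1/25): only 1 of the 25 query terms matched

query_weight = idf * query_norm            # queryWeight(features:i)
field_weight = tf * idf * field_norm       # fieldWeight(features:i in 8)
term_score = query_weight * field_weight   # weight(features:i in 8)
score = term_score * coord                 # final score after the coord factor

for label, value in [
    ("queryWeight", query_weight),
    ("fieldWeight", field_weight),
    ("weight", term_score),
    ("score", score),
]:
    print(f"{label:12s} {value:.8f}")
```

The four printed values agree with the explain output to about six decimal places. The whole match rests on the single term features:i, and the coord factor of 1/25 is why the final score is so small.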

The field "features" uses the default fieldtype - "text" - in schema.xml.
The problem was solved by adding the character "i" to the
stopwords.txt file. The "i"s in document 2 were being matched with the "i" in
"iPod" of document 1.

I still have to figure out why a single character - "i" - matched the "i" in
a word - "iPod".
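
One possible explanation, offered as an assumption rather than something confirmed in this thread: if the analyzer for the "text" field type includes a word-delimiter filter that splits tokens on lowercase-to-uppercase transitions (as Solr's WordDelimiterFilterFactory can do), then "iPod" would be indexed as the two terms "i" and "pod". The Python sketch below only imitates that splitting rule with a regular expression; it is not Solr's implementation, and the function name is hypothetical.

```python
import re


def split_on_case_change(token):
    """Split a token at each lowercase-to-uppercase boundary, then lowercase the parts.

    A rough imitation of case-change splitting; the real filter has more rules.
    """
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    return [part.lower() for part in spaced.split()]


features = "car power adapter for iPod, white"

# Crude tokenisation on runs of letters and digits, then case-change splitting.
terms = [
    part
    for word in re.findall(r"[A-Za-z0-9]+", features)
    for part in split_on_case_change(word)
]

print(terms)
print("i" in terms)
```

Under that assumption, the standalone Danish word "i" in the other document and the "i" split off from "iPod" are the same indexed term, which would account for the match on features:i.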

Regards,
Rishabh

On 22/11/2007, Ryan McKinley <ry...@gmail.com> wrote:
>
> >
> > Now when I run the following query:
> >
> > http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on
> >
>
> try adding:
>   &debugQuery=on
>
> to your query string and you can see why each document matches...
>
> My guess is that "features" uses a text field with stemming and a
> stemmed word matches
>
> ryan
>

Re: Strange behavior MoreLikeThis Feature

Posted by Ryan McKinley <ry...@gmail.com>.
> 
> Now when I run the following query:
> http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on
> 

try adding:
  &debugQuery=on

to your query string and you can see why each document matches...

My guess is that "features" uses a text field with stemming and a 
stemmed word matches

ryan