Posted to solr-user@lucene.apache.org by "John, Phil (CSS)" <ph...@capita.co.uk> on 2013/03/05 14:53:49 UTC

Search term matching on part of a token, not the whole token

Hi,

I'm hitting a brick wall trying to diagnose this issue. We have a field,
configured like this:

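<fieldType name="class" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>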

And it has Dewey Decimal Classifications fed into it, e.g.

100

100.10

100.22

Etc.

When performing a search against the field (using the edismax parser) a
search like:

class:100

or

class:"100"

is matching both records with the exact token of 100, but also records
where 100 is only a part of the token, e.g. 100.10, 100.22 etc.

I've checked the analysis section of the admin interface and the field
is being tokenised correctly (eg, 100.10 is a single token), so I'm at a
loss as to why this is happening.
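
For reference, the behaviour expected from this analysis chain can be
sketched in a few lines of Python. This only imitates a whitespace
tokenizer followed by a lowercase filter; it is not Solr's code, and the
names analyze, build_index and search are made up for illustration.

```python
# Illustrative sketch only: imitates whitespace tokenization plus
# lowercasing, then an exact term lookup. Not Solr's implementation.

def analyze(text):
    """Split on whitespace and lowercase each token."""
    return [token.lower() for token in text.split()]

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, value in docs.items():
        for token in analyze(value):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, term):
    """Exact term lookup, as a single-term query would do."""
    return index.get(term.lower(), set())

# Hypothetical documents, one classification each.
docs = {1: "100", 2: "100.10", 3: "100.22"}
index = build_index(docs)

print(analyze("100.10"))     # the whole value stays one token
print(search(index, "100"))  # only the document holding exactly "100"
```

Under these assumptions a query for 100 should never reach a document
whose only token is 100.10, which is why the observed matches are
surprising.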

Does anyone have any ideas?

Regards,

Phil John
Technical Lead

Software services
Capita, Knights Court, Solihull Parkway, B37 7YB

Office: 0870 400 5000
Fax: 0870 400 5001
email: philjohn@capita.co.uk <ma...@capita.co.uk>

Part of Capita plc www.capita.co.uk <http://www.capita.co.uk>

This email and any attachment to it are confidential. Unless you are the intended recipient, you may not use, copy or disclose either the message or any information contained in the message. If you are not the intended recipient, you should delete this email and notify the sender immediately.

Any views or opinions expressed in this email are those of the sender only, unless otherwise stated. All copyright in any Capita material in this email is reserved.

All emails, incoming and outgoing, may be recorded by Capita and monitored for legitimate business purposes.

Capita exclude all liability for any loss or damage arising or resulting from the receipt, use or transmission of this email to the fullest extent permitted by law.

RE: Search term matching on part of a token, not the whole token

Posted by "John, Phil (CSS)" <ph...@capita.co.uk>.

Hi Chris,

Thank you for taking the time to assist. Here's both the field and
fieldtype definition:

<field name="class" type="class" indexed="true" stored="false"
required="false" multiValued="true"/>

<fieldType name="class" class="solr.TextField"
positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

And here's a simplified set of params passed to Solr (which still
causes the issue to show up):

q=class:510
defType=edismax
fl=*,score
q.op=AND
mm=100%
tie=0.01
start=0
rows=10
sort=score desc
qf=class^1.0
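
To reproduce the request outside the application, the same parameters
can be encoded into a URL. The sketch below is only an illustration: the
host, port and handler path are assumptions (they are not given in this
thread), and debugQuery=true is added so that the parsed query and score
explanations come back with the results.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the real Solr host and core.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Parameters copied from the list above, plus debugQuery=true.
params = [
    ("q", "class:510"),
    ("defType", "edismax"),
    ("fl", "*,score"),
    ("q.op", "AND"),
    ("mm", "100%"),
    ("tie", "0.01"),
    ("start", "0"),
    ("rows", "10"),
    ("sort", "score desc"),
    ("qf", "class^1.0"),
    ("debugQuery", "true"),
]

# urlencode percent-escapes characters such as ':', '%' and '^'.
url = SOLR_SELECT_URL + "?" + urlencode(params)
print(url)
```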

The parsed query (from debug): (+class:510)/no_coord

Scoring debug for one of the unexpected items:

4.5180607 = (MATCH) weight(class:510 in 1394980) [DefaultSimilarity],
result of:
  4.5180607 = fieldWeight in 1394980, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    7.228897 = idf(docFreq=2894, maxDocs=1468332)
    0.625 = fieldNorm(doc=1394980)
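
As a sanity check, the numbers in that explanation can be recomputed by
hand. The sketch below assumes the classic DefaultSimilarity formulas
(tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))) and takes the
fieldNorm value as reported rather than deriving it.

```python
import math

# Values copied from the debug output above.
freq = 1.0
doc_freq = 2894
max_docs = 1468332
field_norm = 0.625  # taken as reported, not recomputed

# Classic TF-IDF pieces, assuming DefaultSimilarity.
tf = math.sqrt(freq)
idf = 1.0 + math.log(max_docs / (doc_freq + 1))

# fieldWeight is the product of the three factors.
score = tf * idf * field_norm

print(round(idf, 4), round(score, 4))
```

This should print roughly 7.2289 and 4.5181, in line with the debug
output. The docFreq of 2894 says the index holds the literal term 510 in
2894 documents, and this document is scored as one of them. A fieldNorm
below 1.0 would also suggest, absent index-time boosts, that the field
holds more than one token in this document.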

Kind regards,

Phil.

Re: Search term matching on part of a token, not the whole token

Posted by Chris Hostetter <ho...@fucit.org>.

:                 <fieldType name="class" class="solr.TextField"
	...
: When performing a search against the field (using the edismax parser) a
: search like:
	...
: class:100

You've shown us the fieldType "class" but you haven't provided any info about 
the *field* named "class" -- please verify that it uses fieldType "class"

You also didn't provide any details about how your edismax parser is 
configured (i.e. what kinds of defaults and query params you are using)

The debugQuery=true output showing what the final parsed query looks like 
is also kind of important here.

https://wiki.apache.org/solr/UsingMailingLists

-Hoss

Re: Search term matching on part of a token, not the whole token

Posted by Jack Krupansky <ja...@basetechnology.com>.

Use the Solr Admin UI Analysis page and enter "100.10" for field type 
"class" and see whether it keeps the number as one term or not.

Do you maybe have a multivalued field that has both "100" and "100.10"?

Also do a &debugQuery=true and see what the result documents are actually 
being matched on.
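
The multivalued case can be illustrated with a small Python sketch. It
is hypothetical (the function name and sample values are invented) and
only imitates whitespace tokenization plus lowercasing. It shows how a
record that carries both 100 and 100.10 matches a query for 100 through
its exact value, even though it looks like a partial match when only the
longer value is noticed.

```python
# Illustrative sketch only: which values of a multivalued field would
# produce an exact term match for the query term?

def matched_values(field_values, query_term):
    """Return the values whose tokens contain query_term exactly."""
    term = query_term.lower()
    return [v for v in field_values if term in v.lower().split()]

# Hypothetical record with two classifications in the same field.
both = ["100", "100.10"]
only_decimal = ["100.10"]

print(matched_values(both, "100"))          # matched via the value "100"
print(matched_values(only_decimal, "100"))  # no exact match here
```

If the field is not stored, the values will not be visible in the
search results, so a check like this would have to run against the
source data.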

-- Jack Krupansky

RE: Search term matching on part of a token, not the whole token

Posted by "John, Phil (CSS)" <ph...@capita.co.uk>.

No, double checked, and even went and reindexed yesterday and still the
same issue.

Regards,

Phil.

Re: Search term matching on part of a token, not the whole token

Posted by Jack Krupansky <ja...@basetechnology.com>.

Maybe you made changes to the analyzer but then failed to fully reindex your 
data. I mean, it sounds like your index still contains terms that had been 
tokenized by the standard tokenizer.

-- Jack Krupansky
