Posted to solr-user@lucene.apache.org by Sohail Aboobaker <sa...@gmail.com> on 2012/11/21 14:13:52 UTC

Inconsistent search results.

Hi,

We have 500k+ documents indexed with many fields. One of the fields is a
simple text field that is defined as the default search field, and we copy
many field values into it.

Some values are composed of two components with a "." as separator. When we
search for just one component of such a value, we get inconsistent results.
Following are some examples:

Value: KWJ1112.MC2850

Searching on MC2850 returns results.
Searching on KWJ1112 returns no results.

Value: ACW9920.KL1230

Searching on ACW9920 returns results.
Searching on KL1230 returns results.

The results are inconsistent. For some values, searching on either
component returns results. For others, only a search on the last component
does. Searching on the last part always works.

We are using standard tokenizer as follows:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What should we use in order to get consistent results for both components
of such a value? Should we be using the whitespace tokenizer with
WordDelimiterFilterFactory? Some examples would be helpful.

Thanks

Sohail

Re: Inconsistent search results.

Posted by Sohail Aboobaker <sa...@gmail.com>.
Sorry, a correction: it is the first part that doesn't give results.

SA8182B.BA0850  --> Searching on SA8182 gives no results; searching on
BA0850 gives results.
SA8182.BA0850  --> No issues; returns results for both BA0850 and SA8182.

Regards,
Sohail

Re: Inconsistent search results.

Posted by Sohail Aboobaker <sa...@gmail.com>.
Hi,

Thank you for your help. The issue is now resolved after using the analysis
tool as suggested by Jack and Chris. In the end, we used the following
analyzer for this field:

      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

WordDelimiterFilterFactory does the trick: it splits each value into its
component words appropriately.

Thanks to everyone for helping.

Regards,
Sohail Aboobaker.

Re: Inconsistent search results.

Posted by Chris Hostetter <ho...@fucit.org>.
: After further analysis it was found that the cases in which the search
: works as expected are where the "." is preceded by a number. Whenever we
: have a letter instead of a number, the search on the word on the right
: side doesn't return results.

Please note Jack's previous suggestion...

>> Try the Solr Admin Analysis page and see how your failing examples 
>> analyze for both index and query.

...that will show you exactly what in your analysis chain is responsible 
for each of the changes to your raw input to produce the final stream of 
tokens, and help you figure out how you might want to change things.

I suspect you'll find that the StandardTokenizer is the culprit, and if 
you are happy with all of its other behavior, you can use 
PatternReplaceCharFilterFactory to tweak the stream before tokenization.
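
A minimal sketch of that approach, assuming the only change wanted is to
treat "." as a separator before StandardTokenizer sees the text. The pattern
and replacement values are illustrative assumptions, not taken from the
original schema; the remaining filters are omitted for brevity, and the same
charFilter would need to appear in the query analyzer as well:

```xml
<analyzer type="index">
  <!-- Assumption: replace every "." with a space before tokenization,
       so "SA8182B.BA0850" reaches the tokenizer as "SA8182B BA0850". -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\." replacement=" "/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

A replacement this broad also splits decimal numbers and dotted
abbreviations, so it is worth checking the Analysis page output for a few
such values before reindexing.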


-Hoss

Re: Inconsistent search results.

Posted by Sohail Aboobaker <sa...@gmail.com>.
Hi,

After further analysis it was found that the cases in which the search
works as expected are where the "." is preceded by a number. Whenever we
have a letter instead of a number, the search on the word on the right
side doesn't return results.

SA8182B.BA0850  --> Searching on BA0850 gives no results.
SA8182.BA0850  --> No issues; returns results for both BA0850 and SA8182.

Does that help in figuring out what is needed?

We haven't tried the PatternTokenizer yet.

Regards,
Sohail

Re: Inconsistent search results.

Posted by Luis Cappa Banda <lu...@gmail.com>.
Hello!

I suggest you try PatternTokenizer with a regex that matches "." and
whitespace, for example, in both the query and index analyzers for that
fieldType. The values will be split wherever the regex matches, and your
queries on either component should then succeed. Unfortunately, you will
have to reindex everything if you change your schema.
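
A rough sketch of such a definition. The field type name (text_dot_split) is
only a placeholder, and the pattern, which splits on runs of "." and
whitespace, is an assumption to adjust for your data:

```xml
<fieldType name="text_dot_split" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer with no type attribute is used for both index and query. -->
  <analyzer>
    <!-- Assumption: split on one or more "." or whitespace characters. -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[.\s]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```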

Regards,

- Luis Cappa
On 21/11/2012 19:13, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> Try the Solr Admin Analysis page and see how your failing examples analyze
> for both index and query.
>
> Also, if you experiment with analyzer settings, be sure to FULLY reindex
> your documents since a mismatch between how the documents were ORIGINALLY
> analyzed and the latest query analysis can cause mismatches. Changing an
> index analyzer does not force an automatic reindex.
>
> Also, check to see that there is not a delimiter character, such as a
> colon, immediately before a term with no white space.
>
> -- Jack Krupansky

Re: Inconsistent search results.

Posted by Jack Krupansky <ja...@basetechnology.com>.
Try the Solr Admin Analysis page and see how your failing examples analyze 
for both index and query.

Also, if you experiment with analyzer settings, be sure to FULLY reindex 
your documents since a mismatch between how the documents were ORIGINALLY 
analyzed and the latest query analysis can cause mismatches. Changing an 
index analyzer does not force an automatic reindex.

Also, check to see that there is not a delimiter character, such as a colon, 
immediately before a term with no white space.

-- Jack Krupansky
