Posted to solr-user@lucene.apache.org by Mahmoud Ismail <ma...@gmail.com> on 2012/03/15 10:46:25 UTC

using KeywordTokenizer in indexing and StandardTokenizer with shingle filter in query

Hi all,

I have a reverse-search situation: the indexed fields are keywords
(usually a single word or a set of words separated by underscores), and in the
query I supply a large text and want Solr to find which keywords occur in
that text.

So in this case I want Solr to match the exact form of a keyword, not just
part of it.

Here's the field type I use for the keywords:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-underscore.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" outputUnigrams="true"/>
  </analyzer>
</fieldType>

After indexing my keywords, whenever I run a query it doesn't match the
whole keyword; instead it matches word by word.

So I changed the definition to the following:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-underscore.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" outputUnigrams="true"/>
    <!-- <filter class="solr.ArabicStemFilterFactory"/> -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" outputUnigrams="true"/>
  </analyzer>
</fieldType>

But in this case it also matches parts of the keywords, because of the
shingles in the index.
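
The partial matches can be illustrated with a rough Python simulation of the second definition. This is not Solr code: the tokenizer is a crude regex stand-in for StandardTokenizer, Arabic normalization is left out, and it assumes that mapping-underscore.txt (not shown in this thread) maps "_" to a space. The keyword and query text are made-up examples.

```python
import re

def shingles(tokens, max_size=10):
    # Every run of 1..max_size adjacent tokens, joined by one space
    # (roughly what ShingleFilter emits with outputUnigrams="true").
    return {" ".join(tokens[i:i + n])
            for i in range(len(tokens))
            for n in range(1, max_size + 1)
            if i + n <= len(tokens)}

def index_terms(keyword):
    # Assumption: the char filter turns "_" into a space before tokenizing.
    return shingles(re.findall(r"\w+", keyword.replace("_", " ")))

def query_terms(text):
    return shingles(re.findall(r"\w+", text))

def matches(keyword, text):
    # A hit is any term shared between the index side and the query side.
    return bool(index_terms(keyword) & query_terms(text))

# The index now holds "new", "york", "city", "new york", ... for this one
# keyword, so a text containing only "york" already produces a hit.
print(sorted(index_terms("new_york_city")))
print(matches("new_york_city", "a trip to york"))
```

Because the unigrams and the shorter shingles of each keyword are indexed, any fragment of a keyword is enough for a match, which is the behaviour described above.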

How can I force Solr to match only the whole keyword, while still using
shingles in the query to create all the possible combinations that could
match a keyword?

Best Regards,

Mahmoud Ismail

Re: using KeywordTokenizer in indexing and StandardTokenizer with shingle filter in query

Posted by Erick Erickson <er...@gmail.com>.
The best advice I can give is to spend some time on the admin/analysis page.
For instance, I believe that your first index analysis chain will do nothing.
KeywordTokenizerFactory does not break up the incoming text at all. Since there
is only a single token, the shingle filter isn't doing anything either (I don't
think, anyway).

Now, when you use StandardTokenizer in the query part, the incoming text
is broken up on whitespace, punctuation, and other characters.
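
As a rough illustration of that difference, here is a small Python simulation of the first field type from the question. It is a sketch under assumptions, not Solr code: the regex is only a crude approximation of StandardTokenizer, Arabic normalization is omitted, the content of mapping-underscore.txt is assumed to map "_" to a space, and the sample keywords are hypothetical. It also does not model the query parser, which in Solr may split the query on whitespace before the query analyzer sees it.

```python
import re

def map_underscores(text):
    # Assumption: mapping-underscore.txt maps "_" to a space.
    return text.replace("_", " ")

def keyword_tokenize(text):
    # Like KeywordTokenizer: the entire input becomes a single token.
    return [text]

def standard_tokenize(text):
    # Crude approximation of StandardTokenizer: runs of word characters.
    return re.findall(r"\w+", text)

def shingles(tokens, max_size=10):
    # Like ShingleFilter with outputUnigrams="true": every run of
    # 1..max_size adjacent tokens, joined by a single space.
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n > len(tokens):
                break
            out.append(" ".join(tokens[i:i + n]))
    return out

def index_tokens(keyword):
    # Index side: char filter first, then one token for the whole keyword.
    return keyword_tokenize(map_underscores(keyword))

def query_tokens(text):
    # Query side: split into words, then build shingles from them.
    return shingles(standard_tokenize(text))

def matching_keywords(keywords, text):
    # A keyword matches only if its single index token equals a query shingle.
    query = set(query_tokens(text))
    return [k for k in keywords if set(index_tokens(k)) & query]

keywords = ["new_york", "york_city_hall", "city"]
print(index_tokens("new_york"))
print(matching_keywords(keywords, "i live in new york city"))
```

In this simplified model a keyword matches only when the whole keyword appears as one of the query shingles. Comparing it with the real output on the admin/analysis page should show where an actual chain behaves differently.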

And unless this is a typo, it's just not recognized by Solr at all (I'm a little
surprised it doesn't throw an error on startup):
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-underscore.txt"/>


If you use the admin/analysis page with various analysis chains, I think
you'll get a much better sense of what the various filters actually *do*,
which is not always exactly what I thought at various points.

Best
Erick
