Posted to solr-user@lucene.apache.org by Andreas Owen <ao...@swissonline.ch> on 2014/04/06 22:24:20 UTC

ngramfilter minGramSize problem

I have a field type that uses the NGramFilter while indexing. Is there a
setting that can force the NGramFilter to index words shorter than
minGramSize? Mine is set to 3, and the search won't find words that are only
1 or 2 chars long. I would rather not set minGramSize=1 because the
results would be too diverse.

fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/> -->
    <!-- remove common words -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- remove common words -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>

Re: ngramfilter minGramSize problem

Posted by Andreas Owen <ao...@swissonline.ch>.

It works well. Now, why does a query that contains a stopword only find
something when the field name is added to the query?

"cug" -> 9 hits
"mit cug" -> 0 hits
"plain_text:mit cug" -> 9 hits

Why is this so? Could the problem be that stopwords aren't removed from the
query because not all of the fields that are searched have the stop-word filter?


On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI <fu...@gmail.com>  
wrote:

> Correction: My patch is at SOLR-5152
> On 7 Apr 2014 at 01:05, "Andreas Owen" <ao...@swissonline.ch> wrote:
>
>> I thought I could use <filter class="solr.LengthFilterFactory" min="1"
>> max="2"/> to index and search words that are only 1 or 2 chars long. It
>> seems to work, but I have to test it some more.

Re: ngramfilter minGramSize problem

Posted by Furkan KAMACI <fu...@gmail.com>.

Correction: My patch is at SOLR-5152
On 7 Apr 2014 at 01:05, "Andreas Owen" <ao...@swissonline.ch> wrote:

> I thought I could use <filter class="solr.LengthFilterFactory" min="1"
> max="2"/> to index and search words that are only 1 or 2 chars long. It
> seems to work, but I have to test it some more.

Re: ngramfilter minGramSize problem

Posted by Andreas Owen <ao...@swissonline.ch>.

I thought I could use <filter class="solr.LengthFilterFactory" min="1"
max="2"/> to index and search words that are only 1 or 2 chars long. It
seems to work, but I have to test it some more.
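
The message does not say where the LengthFilterFactory line was placed. One way it might be wired up, as a rough sketch only: keep the n-gram field type unchanged and copy the text into a second field whose analyzer keeps only 1- and 2-character tokens. The names text_de_short and plain_text_short are hypothetical, and using plain_text as the copy source is an assumption (that field name appears elsewhere in this thread, but its type is not shown).

```xml
<!-- Hypothetical companion type: keeps only tokens of length 1 or 2, no n-grams.
     Used for both indexing and querying, so short words match exactly. -->
<fieldType name="text_de_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop every token longer than 2 characters -->
    <filter class="solr.LengthFilterFactory" min="1" max="2"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field and copy rule; the source field name is an assumption. -->
<field name="plain_text_short" type="text_de_short" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_short"/>
```

With a layout like this, a search would have to include the short-token field as well as the n-gram field. Tokens of 3 or more characters never reach the short field, so the n-gram results should stay as selective as with minGramSize="3".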



Re: ngramfilter minGramSize problem

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi Andreas;

I've implemented a similar feature in EdgeNGramFilter because some Solr
users want it. My patch is here:
https://issues.apache.org/jira/browse/SOLR-5332 However, if you read the
conversation below the issue, you will see that you can do it
another way.

Thanks;
Furkan KAMACI

