Posted to solr-user@lucene.apache.org by Andreas Owen <ao...@swissonline.ch> on 2014/04/06 22:24:20 UTC
ngramfilter minGramSize problem
I have a fieldtype that uses the NGramFilter while indexing. Is there a
setting that can force the NGramFilter to index words shorter than
minGramSize? Mine is set to 3, and the search won't find words that are
only 1 or 2 chars long. I would rather not set minGramSize=1 because the
results would be too diverse.
fieldtype:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"
            enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"
            enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
Re: ngramfilter minGramSize problem
Posted by Andreas Owen <ao...@swissonline.ch>.
It works well. Now, why does a query that contains a stopword only find
something when the field name is added to the query?
"cug" -> 9 hits
"mit cug" -> 0 hits
"plain_text:mit cug" -> 9 hits
Why is this so? Could the problem be that stopwords aren't removed from
the query because not all fields that are searched have the StopFilter?
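If that guess is right, one possible explanation (assuming a multi-field
query parser such as edismax is in use, which the thread does not state) is
that "mit" is removed only for fields whose query analyzer has a stop filter,
while the remaining fields still have to match it. A sketch of one remedy is
to give the other searched field types the same stop filter. The type name
text_other below is hypothetical; the StopFilterFactory attributes are taken
from the text_de definition above.

```xml
<!-- Hypothetical type for the other searched fields. The stop filter uses the
     same word list as text_de, so "mit" is dropped for every field alike. -->
<fieldType name="text_other" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"/>
  </analyzer>
</fieldType>
```

Fields switched to a changed type would need to be reindexed before the
comparison of "cug" and "mit cug" is meaningful.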
On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI <fu...@gmail.com>
wrote:
> Correction: My patch is at SOLR-5152
> On 7 Apr 2014 at 01:05, "Andreas Owen" <ao...@swissonline.ch> wrote:
>
>> I thought I could use <filter class="solr.LengthFilterFactory" min="1"
>> max="2"/> to index and search words that are only 1 or 2 chars long. It
>> seems to work, but I have to test it some more.
Re: ngramfilter minGramSize problem
Posted by Furkan KAMACI <fu...@gmail.com>.
Correction: My patch is at SOLR-5152
On 7 Apr 2014 at 01:05, "Andreas Owen" <ao...@swissonline.ch> wrote:
> I thought I could use <filter class="solr.LengthFilterFactory" min="1"
> max="2"/> to index and search words that are only 1 or 2 chars long. It
> seems to work, but I have to test it some more.
Re: ngramfilter minGramSize problem
Posted by Andreas Owen <ao...@swissonline.ch>.
I thought I could use <filter class="solr.LengthFilterFactory" min="1"
max="2"/> to index and search words that are only 1 or 2 chars long. It
seems to work, but I have to test it some more.
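The message does not say where the LengthFilterFactory goes. Placed in the
same chain as the NGramFilterFactory, min="1" max="2" would drop every longer
token, so one plausible reading is a separate field type that keeps only the
short tokens and is filled through copyField. The following is a sketch under
that assumption: the names text_de_short and plain_text_short are
hypothetical, and plain_text is assumed to be a field of type text_de.

```xml
<!-- Hypothetical companion type: keeps only tokens of 1 or 2 characters -->
<fieldType name="text_de_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- longer tokens are dropped here; the n-gram field already covers them -->
    <filter class="solr.LengthFilterFactory" min="1" max="2"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field; it receives a copy of plain_text at index time -->
<field name="plain_text_short" type="text_de_short" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_short"/>
```

A query would then have to search both fields, and many 1 or 2 character
German tokens are stopwords, so a stop filter may be wanted in this chain as
well.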
Re: ngramfilter minGramSize problem
Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Andreas;
I've implemented a similar feature in EdgeNGramFilter because some Solr
users wanted it. My patch is here:
https://issues.apache.org/jira/browse/SOLR-5332 However, if you read the
conversation below the issue, you will see that you can also do it
another way.
Thanks;
Furkan KAMACI