Posted to solr-user@lucene.apache.org by Arkadi Colson <ar...@smartbit.be> on 2012/12/14 15:10:09 UTC
NGram with words
Hi
When "abcdefg 123456" is indexed in Solr, I would like it to match each of these queries:
- abcd
- cdef
- abcdefg 123456
- "abcdefg 123456"
- "defg 1234"
The last one does not work.
What am I doing wrong?
My config looks like this:
<field name="smsc_description" type="text" indexed="true"
       stored="false" multiValued="true" omitNorms="true" omitPositions="false"
       omitTermFreqAndPositions="false"/>
<field name="smsc_description_ngram" type="text_ngram"
       indexed="true" stored="false" multiValued="true" omitNorms="true"
       omitPositions="false" omitTermFreqAndPositions="false"/>

<copyField source="smsc_description" dest="smsc_description_ngram"/>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt,stopwords_du.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt,stopwords_du.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ngram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt,stopwords_du.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2"
            maxGramSize="8"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt,stopwords_du.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
BR,
Arkadi
Re: NGram with words
Posted by Walter Underwood <wu...@wunderwood.org>.
I specified "edge ngrams" because that is the variant I've investigated. --wunder
On Dec 14, 2012, at 8:30 AM, Jack Krupansky wrote:
> I can believe it.
>
> Note: He's using "ngrams", not "edge" ngrams.
>
> -- Jack Krupansky
--
Walter Underwood
wunder@wunderwood.org
Re: NGram with words
Posted by Jack Krupansky <ja...@basetechnology.com>.
I can believe it.
Note: He's using "ngrams", not "edge" ngrams.
-- Jack Krupansky
-----Original Message-----
From: Walter Underwood
Sent: Friday, December 14, 2012 11:21 AM
To: solr-user@lucene.apache.org
Cc: arkadi@smartbit.be
Subject: Re: NGram with words
Positions for edge ngrams are wrong. They should be handled like synonyms.
This breaks phrase matching with ngrams. Not sure if there is a bug filed
for this.
wunder
--
Walter Underwood
wunder@wunderwood.org
Re: NGram with words
Posted by Walter Underwood <wu...@wunderwood.org>.
Positions for edge ngrams are wrong. They should be handled like synonyms, that is, every gram generated from a word should share that word's position. As it stands, this breaks phrase matching with ngrams. Not sure if there is a bug filed for this.
wunder
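To make the position problem concrete, here is a minimal Python sketch (not Lucene code). It assumes a gram filter that gives every gram its own consecutive position and emits the shortest grams first, and compares that with synonym-style numbering where all grams of a word share the word's position. The function names are made up for illustration; the gram sizes 2 to 8 and the sample text come from the original post.

```python
from collections import defaultdict


def ngrams(token, min_size=2, max_size=8):
    """Yield every substring of token whose length is in [min_size, max_size]."""
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            yield token[start:start + size]


def build_positions(text, per_gram_increment):
    """Map each gram to the positions it would be indexed at.

    per_gram_increment=True  : assumed "wrong" numbering, one position per gram.
    per_gram_increment=False : synonym-style, all grams share the word's position.
    """
    positions = defaultdict(list)
    pos = -1
    for token in text.lower().split():
        if not per_gram_increment:
            pos += 1  # one position per whitespace-separated word
        for gram in ngrams(token):
            if per_gram_increment:
                pos += 1  # every gram advances the position
            positions[gram].append(pos)
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


per_gram = build_positions("abcdefg 123456", per_gram_increment=True)
shared = build_positions("abcdefg 123456", per_gram_increment=False)

print(per_gram["defg"], per_gram["1234"])     # [14] [30]
print(phrase_matches(per_gram, "defg 1234"))  # False
print(phrase_matches(shared, "defg 1234"))    # True
```

Under these assumptions "defg" and "1234" end up 16 positions apart, so an exact phrase query cannot match, while single-term queries such as abcd or cdef still find their grams.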
On Dec 14, 2012, at 8:16 AM, Jack Krupansky wrote:
> Yeah, the positions for ngrams have a good chance of not being what you want.
>
> But do try the Solr Admin Analysis web page for that index text and see what positions it generates for the sub-words. The two generated words used in your query may not have adjacent positions.
>
> -- Jack Krupansky
--
Walter Underwood
wunder@wunderwood.org
Re: NGram with words
Posted by Jack Krupansky <ja...@basetechnology.com>.
Yeah, the positions for ngrams have a good chance of not being what you
want.
But do try the Solr Admin Analysis web page for that index text and see what
positions it generates for the sub-words. The two generated words used in
your query may not have adjacent positions.
-- Jack Krupansky
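The same check can be scripted instead of clicked through in the admin page. The sketch below only builds a request URL for Solr's field analysis handler; it assumes that handler is registered at /analysis/field in solrconfig.xml, and the host, port, and core name (localhost:8983, collection1) are placeholders, not details from this thread. The helper name field_analysis_url is made up for illustration.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_type, index_text, query_text):
    """Build a field-analysis URL for one field type (hypothetical helper)."""
    params = {
        "analysis.fieldtype": field_type,    # analyze with this fieldType
        "analysis.fieldvalue": index_text,   # text run through the index analyzer
        "analysis.query": query_text,        # text run through the query analyzer
        "wt": "json",
    }
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)


url = field_analysis_url(
    "http://localhost:8983/solr/collection1",  # assumed host and core
    "text_ngram",
    "abcdefg 123456",
    "defg 1234",
)
print(url)
```

Fetching that URL should return the tokens produced at each analysis stage together with their positions, which is where to look for whether "defg" and "1234" come out adjacent.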
-----Original Message-----
From: Arkadi Colson
Sent: Friday, December 14, 2012 9:10 AM
To: solr-user@lucene.apache.org
Subject: NGram with words