Posted to solr-user@lucene.apache.org by ao...@gmail.com on 2009/09/30 19:54:29 UTC

NGramTokenFilter behaviour

If I index the following text: "I live in Dublin Ireland where
Guinness is brewed"

Then search for: duvlin

Should Solr return a match?

In the admin interface, under the analysis section, Solr highlights
some NGram matches.

However, when I enter the following query string into my browser
address bar, I get 0 results:

http://localhost:8983/solr/select/?q=duvlin&debugQuery=true

Nor do I get results for dub, dubli, ublin, dublin (du does return a result).

I also notice that when I use debugQuery=true, the parsed query is a
PhraseQuery. This doesn't make sense to me, as surely the point of the
NGram filter is to use a Boolean OR between the grams?

However, if I don't use an NGramFilterFactory at query time, I can get
results for: dub, ublin, du, but not duvlin.

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>

Can someone please clarify what the purpose of the
NGramFilter/tokenizer is, if not to allow for
misspellings/morphological variation, and also what the correct
configuration is in terms of use at index/query time?

Any help appreciated!

Aodh.

Solr 1.3, JDK 1.6

RE: NGramTokenFilter behaviour

Posted by "Feak, Todd" <To...@smss.sony.com>.

My understanding is that NGram tokenizing helps with languages that don't necessarily use spaces as word delimiters (Japanese et al.). In that case, bi-gramming is used to find words contained within a stream of unbroken characters, and you want a document to match all of the bi-grams produced from the search query. An "OR" wouldn't work as well, because you would get tons of hits.

-Todd Feak
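
To make the "all grams" versus "OR" distinction concrete, here is a rough Python sketch. It is not Solr or Lucene code: the ngrams helper is hypothetical and only mimics the gram sizes from the posted field type (minGramSize=2, maxGramSize=15), ignoring token positions entirely. It compares the grams of the indexed word "dublin" with those of the query "duvlin".

```python
def ngrams(token, min_gram=2, max_gram=15):
    """Return every substring of token with length min_gram..max_gram.

    Hypothetical helper for illustration only; it does not reproduce
    the position information that Solr's NGram filter also records.
    """
    token = token.lower()
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


indexed = set(ngrams("dublin"))
query = set(ngrams("duvlin"))
shared = indexed & query

# Grams that appear in both words.
print(sorted(shared))    # ['du', 'in', 'li', 'lin']

# Requiring every query gram to be present fails for the misspelling ...
print(query <= indexed)  # False

# ... whereas requiring any single gram (an OR) would succeed.
print(bool(shared))      # True
```

Under these assumptions only 4 of the 15 query grams overlap, so a query that needs every gram cannot match, while an OR over the grams would match this document and, as noted above, many others.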

Re: NGramTokenFilter behaviour

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Wed, Sep 30, 2009 at 11:24 PM, <ao...@gmail.com> wrote:

> If I index the following text: "I live in Dublin Ireland where
> Guinness is brewed"
>
> Then search for: duvlin
>
> Should Solr return a match?
>
> In the admin interface under the analysis section, Solr highlights
> some NGram matches?
>
> When I enter the following query string into my browser address bar, I
> get 0 results?
>
> http://localhost:8983/solr/select/?q=duvlin&debugQuery=true
>
> Nor do I get results for dub, dubli, ublin, dublin (du does return a
> result).
>
> I also notice when I use debugQuery=true, the parsed query is a
> PhraseQuery. This doesn't make sense to me, as surely the point of the
> NGram is to use a Boolean OR between each Gram??
>
> However, if I don't use an NGramFilterFactory at query time, I can get
> results for: dub, ublin, du, but not duvlin.
>
>
Is the n-grammed field specified as the <defaultSearchField> in your
schema.xml? If not, you will have to specify the field name when
querying, e.g. field_name:duvlin. You can see exactly how your query is
being parsed if you add debugQuery=on as a request parameter.

-- 
Regards,
Shalin Shekhar Mangar.
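
As a small illustration of the field-qualified query suggested above, the following Python sketch builds the request URL. The build_select_url helper is hypothetical, field_name is the same placeholder used in the reply (substitute the actual n-grammed field), and the base URL is the one from the original post.

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, term):
    """Build a select URL that searches one named field with debug output.

    Hypothetical helper; 'field' stands in for whatever the n-grammed
    field is called in schema.xml.
    """
    params = {"q": f"{field}:{term}", "debugQuery": "on"}
    return f"{base_url}?{urlencode(params)}"


url = build_select_url("http://localhost:8983/solr/select/", "field_name", "duvlin")
print(url)
# http://localhost:8983/solr/select/?q=field_name%3Aduvlin&debugQuery=on
```

The colon is percent-encoded as %3A by urlencode; typing field_name:duvlin directly in the browser address bar should be equivalent.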

Re: NGramTokenFilter behaviour

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Wed, Sep 30, 2009 at 11:24 PM, <ao...@gmail.com> wrote:

>
> Can someone please clarify what the purpose of the
> NGramFilter/tokenizer is, if not to allow for
> misspellings/morphological variation and also, what the correct
> configuration is in terms of use at index/query time.
>
>
If it is spellcheck you are interested in, take a look at
http://wiki.apache.org/solr/SpellCheckComponent

-- 
Regards,
Shalin Shekhar Mangar.