Posted to solr-user@lucene.apache.org by Kudrettin Güleryüz <ku...@gmail.com> on 2018/07/02 15:01:18 UTC

NgramTokenizerFactory question

Hi,

When using NGramTokenizerFactory with min ngram size=3 and max
ngram size=3, I get the following behaviour.

Assume that the search term is: face

I expect the results to show documents with strings:
* interface or
* face or
* faceted

but not
* ace or
* fac

Why would I get matches for ace or fac? Am I missing some
settings somewhere? What is the suggested way to change this
behaviour?

Thank you,

Re: NgramTokenizerFactory question

Posted by Kudrettin Güleryüz <ku...@gmail.com>.
Thank you for the explanation.

To close the loop, I was able to track the problem down to the Lucene query
parser on 5.2.1, which returned +body:"123 234 345 456" for the query string
123456.

It turned out that the same behavior is possible in Solr by turning on split
on whitespace (sow) and autoGeneratePhraseQueries when using
NGramTokenizerFactory.
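
A minimal plain-Java sketch of why the phrase form gives the stricter
matching. It assumes each trigram occupies the next token position, so a
phrase of consecutive trigrams only matches where the whole query string
occurs. The class and method names are hypothetical, and no Lucene or Solr
code is involved:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of a phrase query over trigrams; not the Lucene implementation.
final class GramPhraseSketch {

    // Every 3-character substring of the text, in order of start offset.
    static List<String> trigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    // Models +body:"123 234 345 456": the query trigrams must appear in the
    // document at consecutive positions and in the same order.
    static boolean matchesPhrase(String doc, String query) {
        List<String> queryGrams = trigrams(query);
        if (queryGrams.isEmpty()) {
            return false; // query shorter than one trigram
        }
        return Collections.indexOfSubList(trigrams(doc), queryGrams) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase("x123456y", "123456")); // true
        System.out.println(matchesPhrase("interface", "face"));  // true
        System.out.println(matchesPhrase("fac ace", "face"));    // false
        System.out.println(matchesPhrase("ace", "face"));        // false
    }
}
```

With a fixed gram size this is equivalent to a substring test, which is the
behaviour described in the original question.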



On Mon, Jul 2, 2018 at 3:24 PM Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> I am not familiar with Lucene method to create analyzer. Perhaps it
> was already doing just analyzes phase. But here is what the NGram
> would do to a string of '123456' with just trigrams:
> 123
> 234
> 345
> 456
>
> So, if you only apply it on the index side, and your query is '2345' -
> there is no such token in the index to match against.
>
> On the other hand, if you apply trigram on the query side as well,
> against the query '2349', it will split into:
> 234
> 349
>
> And 234 would match. If that's ok for you that 2349 would match
> against 123456, you are fine. But if you want any search string to be
> actually present fully, then you need index-only NGram and it needs to
> be maxed at your maximum possible string.
>
> So with index-only min=3 and max=4, you will get:
> 123
> 1234
> 234
> 2345
> 345
> 3456
> 456
>
> Then 2349, not being ngrammed will not match anything, but 2345 will.
>
> Again, Admin UI will show that to you.
>
> Regards,
>    Alex.
>

Re: NgramTokenizerFactory question

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I am not familiar with the Lucene way of creating an analyzer. Perhaps it
was already running in just the analysis phase. But here is what the NGram
tokenizer would do to the string '123456' with just trigrams:
123
234
345
456

So, if you only apply it on the index side, and your query is '2345' -
there is no such token in the index to match against.

On the other hand, if you apply trigram on the query side as well,
against the query '2349', it will split into:
234
349

And 234 would match. If it is OK for you that 2349 matches
against 123456, you are fine. But if you want the whole search string to
actually be present, then you need NGram on the index side only, and its
maximum gram size must cover your longest possible search string.

So with index-only min=3 and max=4, you will get:
123
1234
234
2345
345
3456
456

Then 2349, not being ngrammed, will not match anything, but 2345 will.
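
The cases above can be reproduced with a small plain-Java sketch. It is a
stand-in for the tokenizer rather than Lucene code, and the class and method
names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for an n-gram tokenizer; it does not use Lucene.
final class NGramSketch {

    // Every substring of the text with a length between minGram and maxGram,
    // ordered by start offset and then by length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    // Index-only n-grams: the query stays one token and must equal an indexed gram.
    static boolean matchesIndexOnly(String indexed, String query, int minGram, int maxGram) {
        return ngrams(indexed, minGram, maxGram).contains(query);
    }

    // N-grams on both sides with the query grams OR'd: one shared gram is enough.
    static boolean matchesBothSidesOr(String indexed, String query, int minGram, int maxGram) {
        List<String> indexGrams = ngrams(indexed, minGram, maxGram);
        for (String gram : ngrams(query, minGram, maxGram)) {
            if (indexGrams.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("123456", 3, 3)); // [123, 234, 345, 456]
        System.out.println(ngrams("123456", 3, 4)); // [123, 1234, 234, 2345, 345, 3456, 456]
        System.out.println(matchesBothSidesOr("123456", "2349", 3, 3)); // true
        System.out.println(matchesIndexOnly("123456", "2349", 3, 4));   // false
        System.out.println(matchesIndexOnly("123456", "2345", 3, 4));   // true
    }
}
```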

Again, Admin UI will show that to you.

Regards,
   Alex.

On 2 July 2018 at 14:33, Kudrettin Güleryüz <ku...@gmail.com> wrote:
>> 1) if you want face to match interface, you need max value to be at least 4.
> Can you please explain this a bit more? I am not following this one. Values
> are set to 3,3 and Solr already matches interface and interfaces when
> searched for face.  In addition to that Solr matches the trigrams of face
> (fac and ace) as well, which I find not as relevant as interface or faceted.
>
> Application I am working on moving to Solr 7.3.1 is currently using Lucene
> API 5.3.1 and has a custom analyzer like following:
>
>
> public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
>     private int indexType;
>
>     public TrigramCaseAnalyzer() {
>         indexType = 1;
>     }
>
>     @Override
>     public int getIndexType() {
>         return this.indexType;
>     }
>
>     @Override
>     public void setIndexType(int type) {
>         this.indexType = type;
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         Tokenizer st;
>         st = new NGramTokenizer(3, 3);
>         return new TokenStreamComponents(st);
>     }
> }
>
> This somehow behaves as I described. (for a search: face returns interface
> face faceted but not fac or ace).
>
> Is there a change since 5.3.1 regarding this behaviour in Lucene? Or is the
> difference in behaviour caused by Solr's implementation of the Lucene API?
>
> Thank you
>
>

Re: NgramTokenizerFactory question

Posted by Kudrettin Güleryüz <ku...@gmail.com>.
> 1) if you want face to match interface, you need max value to be at least 4.
Can you please explain this a bit more? I am not following this one. The values
are set to 3,3 and Solr already matches interface and interfaces when
searching for face. In addition, Solr matches the trigrams of face
(fac and ace) as well, which I find less relevant than interface or faceted.

The application I am moving to Solr 7.3.1 currently uses the Lucene
5.3.1 API and has a custom analyzer like the following:


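import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

// SourceSearchAnalyzer is the application's own base class (not shown here).
public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
    private int indexType;

    public TrigramCaseAnalyzer() {
        indexType = 1;
    }

    @Override
    public int getIndexType() {
        return this.indexType;
    }

    @Override
    public void setIndexType(int type) {
        this.indexType = type;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Emit every 3-character gram of the input (min gram = max gram = 3).
        Tokenizer st = new NGramTokenizer(3, 3);
        return new TokenStreamComponents(st);
    }
}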

This somehow behaves as I described (a search for face returns interface,
face and faceted, but not fac or ace).

Is there a change since 5.3.1 regarding this behaviour in Lucene? Or is the
difference in behaviour caused by Solr's implementation of the Lucene API?

Thank you


On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> Two things:
> 1) if you want face to match interface, you need max value to be at least
> 4.
> 2) you probably have the factory symmetrically or on Query analyzer. You
> probably want it on Index analyzer side only. Otherwise you are trying to
> match any 3-letter query substring against your index.
>
> Admin UI analysis screen will show that to you.
>
> Regards,
>     Alex
>

Re: NgramTokenizerFactory question

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Two things:
1) if you want face to match interface, you need max value to be at least 4.
2) you probably have the factory applied symmetrically, or on the query
analyzer. You probably want it on the index analyzer side only. Otherwise you
are trying to match any 3-letter substring of the query against your index.

Admin UI analysis screen will show that to you.

Regards,
    Alex

On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, <ku...@gmail.com>
wrote:

> Hi,
>
> When using NgramTokenizerFactory with settings min ngram size=3 and max
> ngram size=3 I get the following behaviour.
>
> Assume that search term is, face
>
> I expect the results to show documents with strings:
> * interface or
> * face or
> * faceted
>
> but not
> * ace or
> * fac
>
> Why would I get the matches with results ace or fac? Am I missing some
> settings somewhere? What is the suggested way to change this
> behaviour?
>
> Thank you,
>

Re: NgramTokenizerFactory question

Posted by Kudrettin Güleryüz <ku...@gmail.com>.
It is correct that the search string causes the following query to be generated:
+(field:fac field:ace)
Hence the results. However, I fail to see how (fac OR ace) is a relevant
query. Shouldn't it be
+field:fac +field:ace
instead?

What is the suggested way to change this behaviour?
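
To make the difference between the two query forms concrete, here is a
minimal plain-Java sketch. It only models the boolean logic over trigrams; it
is not Solr or Lucene code, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of combining the trigrams of "face"; no Solr or Lucene involved.
final class GramBooleanSketch {

    // Every 3-character substring of the text, in order of start offset.
    static List<String> trigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    // Models +(field:fac field:ace): at least one query gram must be present.
    static boolean matchesAny(String doc, String query) {
        List<String> docGrams = trigrams(doc);
        for (String gram : trigrams(query)) {
            if (docGrams.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    // Models +field:fac +field:ace: every query gram must be present.
    static boolean matchesAll(String doc, String query) {
        List<String> queryGrams = trigrams(query);
        return !queryGrams.isEmpty() && trigrams(doc).containsAll(queryGrams);
    }

    public static void main(String[] args) {
        for (String doc : Arrays.asList("interface", "face", "faceted", "ace", "fac", "fac ace")) {
            System.out.println(doc + ": any=" + matchesAny(doc, "face")
                    + " all=" + matchesAll(doc, "face"));
        }
    }
}
```

In this model the all-required form drops ace and fac, but it still accepts
"fac ace", because the order and adjacency of the grams are ignored.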

On Mon, Jul 2, 2018 at 11:47 AM Erick Erickson <er...@gmail.com>
wrote:

> Take a look at two things:
> 1> the admin/analysis page. This is probably mostly a sanity check to
> ensure you're seeing what you expect.
> 2> add debug=query to the query and look at the parsed query. My bet
> is that the grams are being OR'd together
>      and your search term is effectively
>
> fac OR ace
>
> Best,
> Erick
>

Re: NgramTokenizerFactory question

Posted by Erick Erickson <er...@gmail.com>.
Take a look at two things:
1> the admin/analysis page. This is probably mostly a sanity check to
ensure you're seeing what you expect.
2> add debug=query to the query and look at the parsed query. My bet
is that the grams are being OR'd together and your search term is
effectively

fac OR ace

Best,
Erick

On Mon, Jul 2, 2018 at 8:01 AM, Kudrettin Güleryüz <ku...@gmail.com> wrote:
> Hi,
>
> When using NgramTokenizerFactory with settings min ngram size=3 and max
> ngram size=3 I get the following behaviour.
>
> Assume that search term is, face
>
> I expect the results to show documents with strings:
> * interface or
> * face or
> * faceted
>
> but not
> * ace or
> * fac
>
> Why would I get the matches with results ace or fac? Am I missing some
> settings somewhere? What is the suggested way to change this
> behaviour?
>
> Thank you,