
Posted to solr-user@lucene.apache.org by Stefan Matheis <ma...@gmail.com> on 2016/11/07 15:05:32 UTC

edismax, phrase field gets ignored for keyword tokenizer

I’m guessing that i’m missing something obvious here - so feel free to
ask for more details and to point out other directions i should be
following.

the problem goes as follows: the input in one case might be a phone
number (like +49 1234 12345678). since we’re using edismax, the parts
get split on whitespace - which is fine. bringing the same field
(based on TextField) to the party as a phrase field (using pf) doesn’t
change a thing.

> responseHeader:
>     params:
>         q: '+49 1234 12345678'
>         defType: edismax
>         qf: person_mobile
>         pf: person_mobile^5
> debug:
>     rawquerystring: '+49 1234 12345678'
>     querystring: '+49 1234 12345678'
>     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) ())/no_coord'
>     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ()'

but .. as far as i was able to narrow down the culprit, that only happens
when i’m using solr.KeywordTokenizerFactory. as soon as i change
that to solr.StandardTokenizerFactory, the phrase query appears as
expected:

> responseHeader:
>     params:
>         q: '+49 1234 12345678'
>         defType: edismax
>         qf: person_mobile
>         pf: person_mobile^5
> debug:
>     rawquerystring: '+49 1234 12345678'
>     querystring: '+49 1234 12345678'
>     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) DisjunctionMaxQuery(((person_mobile:"49 1234 12345678")^5.0)))/no_coord'
>     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)'
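
For anyone who wants to reproduce the two debug outputs, here is a minimal sketch of how such a request could be built. The parameters q, defType, qf and pf are copied from the output above; the host, the collection name "mycollection" and the use of debugQuery=true are assumptions for illustration, not something stated in the mail. Note that the leading + has to be URL-encoded, otherwise it would be decoded as a space.

```python
from urllib.parse import urlencode

# Parameters copied from the debug output in the mail; debugQuery asks
# Solr to include the parsed query in the response.
params = {
    "q": "+49 1234 12345678",
    "defType": "edismax",
    "qf": "person_mobile",
    "pf": "person_mobile^5",
    "debugQuery": "true",
}

# Hypothetical host and collection name, adjust to your own setup.
base_url = "http://localhost:8983/solr/mycollection/select"

# urlencode escapes "+" as %2B and "^" as %5E, and turns spaces into "+".
url = base_url + "?" + urlencode(params)
print(url)
```

Running the same request once per tokenizer configuration and comparing the parsedquery entries should show whether the pf clause is present.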

removing the + at the beginning doesn’t make a difference either
(just mentioning it since tokee already asked this on #solr, where i’ve
brought up the question earlier)

it’s absolutely possible i’m focusing on a very wrong assumption - but
since switching the tokenizer does result in such a rather large
behaviour change, i think something is spooky here.

i’ve read older issues and posts from the list, some of them pointed
out that it might be an optimization that edismax brings to the table -
i didn’t find anything specific about that.

oh, and btw: if that worked - my idea is to drop
everything from a given phrase that is not a digit, to match the phone
number, like this:

> <fieldType name="phone_number" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\d]" replacement=""/>
>   </analyzer>
> </fieldType>
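
To make the intent of this field type concrete, here is a small sketch in Python that mimics the two analysis steps outside of Solr. It is only a simulation: the function name analyze_phone is made up for illustration, and it assumes the filter replaces every match of the pattern (which is my understanding of the default behaviour), not just the first one.

```python
import re

def analyze_phone(value):
    """Mimic the phone_number analysis chain from the mail.

    Step 1 (KeywordTokenizer): the whole input stays one single token.
    Step 2 (PatternReplaceFilter): every character matching [^\\d] is
    replaced with the empty string, so only digits remain.
    """
    token = value                        # one token, no splitting
    token = re.sub(r"[^\d]", "", token)  # strip everything that is not a digit
    return [token]

print(analyze_phone("+49 1234 12345678"))
```

With this chain, differently formatted inputs such as "+49 1234 12345678" and "49-1234-12345678" end up as the same single token, which is what makes the field useful for phone numbers.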

any thoughts? or wild guesses?

Thanks Stefan

Re: edismax, phrase field gets ignored for keyword tokenizer

Posted by Vincenzo D'Amore <v....@gmail.com>.

Hi Stefan, I've been very busy today; I read your mail but had no time to
write an answer.
So now, at last, everybody around me is sleeping :)

Let's start from the very beginning. Sorry if I didn't get everything about
your first question; I just understood that you're unable to find the phone
number when KeywordTokenizerFactory is enabled.

Let me say again, there isn't anything strange in what you're
experiencing with solr.KeywordTokenizerFactory; maybe you just have to
read up on how analyzers, tokenizers and filters work.

https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

I'm not the best guy here to explain your problem, but I'll do my
best to shed some light on it, or at least on what I know, while avoiding
the complexity of partial matching, boosts and so on.

There are two terms you should know: "index time" and "query time". Index
time occurs when you save your text into a document field, i.e. when your
text is tokenized and stored. Query time occurs when a search is run, i.e.
when your query text is tokenized and searched.

Consider that searches in Solr are usually based on token matching, so
search tokens should match index tokens as much as possible, as specified
by the parameter mm (minimum should match).

So when you search for something with Edismax, your phrase is divided into
its component tokens (aka words or terms), and Edismax will process all the
tokens across all the fields you have defined in the qf parameter.

So, when you ask for +49 1234 12345678, at search time your query is
divided into three tokens, and each token is searched across the fields (in
your case the field person_mobile).

The KeywordTokenizerFactory does not tokenize the text at index time, so you
have only one big token, '+49 1234 12345678'. At query time, on the other
hand, edismax is looking for 3 tokens: +49, 1234 and 12345678.

As you can see, not even one of those 3 tokens matches the single
token stored in the field person_mobile.

But when you use StandardTokenizerFactory, your input string '+49 1234
12345678' is tokenized into 3 tokens, just as edismax did at query time. I
think you can now understand what's happening.

There is no chance your phrase query will work unless you tokenize
the text at index time in a way edismax is able to search for.
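
The token counts described above can be illustrated with a short simulation. This is only a sketch in Python, not Solr code: keyword_tokens, word_tokens and query_terms are made-up helper names, and the regular expression is a rough stand-in for what StandardTokenizerFactory does with this particular input.

```python
import re

def keyword_tokens(text):
    """Stand-in for KeywordTokenizer: the whole input becomes one token."""
    return [text]

def word_tokens(text):
    """Rough stand-in for StandardTokenizer: runs of letters or digits."""
    return re.findall(r"\w+", text)

def query_terms(query):
    """Whitespace split as described for edismax; a leading + is an operator."""
    return [part.lstrip("+") for part in query.split()]

value = "+49 1234 12345678"
terms = query_terms(value)

# Which query terms also exist as index tokens under each tokenizer?
overlap_keyword = set(terms) & set(keyword_tokens(value))
overlap_words = set(terms) & set(word_tokens(value))

print("query terms:", terms)
print("shared with keyword token:", sorted(overlap_keyword))
print("shared with word tokens:", sorted(overlap_words))
```

The sketch only shows how many tokens each side produces and whether they overlap; it says nothing about how edismax decides to build or drop the pf clause internally.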

Hope this helps.

Best regards,
Vincenzo






On Tue, Nov 8, 2016 at 10:46 PM, Stefan Matheis <ma...@gmail.com>
wrote:

> Any more thoughts on this? The longer i look at this situation, the
> more i’m thinking i’m at fault here - expecting something that isn’t
> to be expected at all?
>
> Whatever is on your mind once you’ve read this mail - don’t keep it to
> yourself, let me know.
>
> -Stefan



-- 
Vincenzo D'Amore
email: v.damore@gmail.com
skype: free.dev
mobile: +39 349 8513251

Re: edismax, phrase field gets ignored for keyword tokenizer

Posted by Stefan Matheis <ma...@gmail.com>.

Any more thoughts on this? The longer i look at this situation, the
more i’m thinking i’m at fault here - expecting something that isn’t
to be expected at all?

Whatever is on your mind once you’ve read this mail - don’t keep it to yourself, let me know.

-Stefan


On November 7, 2016 at 5:23:58 PM, Stefan Matheis
(matheis.stefan@gmail.com) wrote:
> Which is all fine by itself - but doesn’t shed more light on my initial question,
> Vincenzo, does it? probably i shouldn’t have mentioned partial matches in the first place,
> that might have led in the wrong direction - they are not relevant for now / not for this
> question.
>
> I’d like to know why & where edismax drops phrase fields which are using a Keyword Tokenizer.
> Maybe there is a larger idea behind this behavior, but i don’t see it (yet).
>
> -Stefan

Re: edismax, phrase field gets ignored for keyword tokenizer

Posted by Stefan Matheis <ma...@gmail.com>.

Which is all fine by itself - but doesn’t shed more light on my
initial question, Vincenzo, does it? probably i shouldn’t have mentioned
partial matches in the first place, that might have led in the
wrong direction - they are not relevant for now / not for this
question.

I’d like to know why & where edismax drops phrase fields which are
using a Keyword Tokenizer. Maybe there is a larger idea behind this
behavior, but i don’t see it (yet).

-Stefan


On November 7, 2016 at 5:09:04 PM, Vincenzo D'Amore (v.damore@gmail.com) wrote:
> If you don't want partial matches with edismax, you should always use
> StandardTokenizerFactory and play with the mm parameter.

Re: edismax, phrase field gets ignored for keyword tokenizer

Posted by Vincenzo D'Amore <v....@gmail.com>.

If you don't want partial matches with edismax, you should always use
StandardTokenizerFactory and play with the mm parameter.

On Mon, Nov 7, 2016 at 4:50 PM, Stefan Matheis <ma...@gmail.com>
wrote:

> Vincenzo,
>
> thanks for the response - i know that the Keyword Tokenizer by
> itself does not do anything. as pointed out at the end of the initial
> mail, i’m applying a pattern replace for everything non-numeric to
> make it actually useful.
>
> and especially because of the tokenization based on whitespace, i’d
> like to use the very same field once again as a phrase field to work
> around this issue. Shawn mentioned in #solr in the meantime that there is
> SOLR-9185, which is similar and would be helpful, but is currently very
> much in the works.
>
> The Standard Tokenizer you’ve mentioned does split on whitespace - as
> edismax does by default in the first place. so i’m not sure how that
> would help? For now, i don’t want to have partial matches on phone
> numbers .. at least not yet.
>
> -Stefan



-- 
Vincenzo D'Amore
email: v.damore@gmail.com
skype: free.dev
mobile: +39 349 8513251

Re: edismax, phrase field gets ignored for keyword tokenizer

Posted by Stefan Matheis <ma...@gmail.com>.

Vincenzo,

thanks for the response - i know that the Keyword Tokenizer by
itself does not do anything. as pointed out at the end of the initial
mail, i’m applying a pattern replace for everything non-numeric to
make it actually useful.

and especially because of the tokenization based on whitespace, i’d
like to use the very same field once again as a phrase field to work
around this issue. Shawn mentioned in #solr in the meantime that there is
SOLR-9185, which is similar and would be helpful, but is currently very
much in the works.

The Standard Tokenizer you’ve mentioned does split on whitespace - as
edismax does by default in the first place. so i’m not sure how that
would help? For now, i don’t want to have partial matches on phone
numbers .. at least not yet.

-Stefan


On November 7, 2016 at 4:41:50 PM, Vincenzo D'Amore (v.damore@gmail.com) wrote:
> Hi Stefan,
>
> I think the problem is solr.KeywordTokenizerFactory.
> This tokeniser does not tokenise the string at all; it returns
> exactly what you give it.
>
> '+49 1234 12345678' -> '+49 1234 12345678'
>
> On the other hand, using edismax you are looking for '+49', '1234' and
> '12345678', and none of these keywords match your person_mobile field.
>
> Try using a different tokenizer like solr.StandardTokenizerFactory, this
> should change your results.
>
> Bests,
> Vincenzo
>
> On Mon, Nov 7, 2016 at 4:05 PM, Stefan Matheis
> wrote:
>
> > I’m guessing that i’m missing something obvious here - so feel free to
> > ask for more details and as well point out other directions i should
> > following.
> >
> > the problem goes as follows: the input in one case might be a phone
> > number (like +49 1234 12345678), since we’re using edismax the parts
> > gets split on whitespaces - which is fine. bringing the same field
> > (based on TextField) to the party (using qf) doesn’t change a thing.
> >
> > > responseHeader:
> > > params:
> > > q: '+49 1234 12345678'
> > > defType: edismax
> > > qf: person_mobile
> > > pf: person_mobile^5
> > > debug:
> > > rawquerystring: '+49 1234 12345678'
> > > querystring: '+49 1234 12345678'
> > > parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> > DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678)))
> > ())/no_coord'
> > > parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> > (person_mobile:12345678)) ()’
> >
> > but .. as far as i was able to reduce the culprit, that only happens
> > when i’m using solr.KeywordTokenizerFactory . as soon as i’m changing
> > that to solr.StandardTokenizerFactory the phrase query appears as
> > expected:
> >
> > > responseHeader:
> > >     params:
> > >         q: '+49 1234 12345678'
> > >         defType: edismax
> > >         qf: person_mobile
> > >         pf: person_mobile^5
> > > debug:
> > >     rawquerystring: '+49 1234 12345678'
> > >     querystring: '+49 1234 12345678'
> > >     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) DisjunctionMaxQuery(((person_mobile:"49 1234 12345678")^5.0)))/no_coord'
> > >     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)'
> >
> > removing the + at the beginning, doesn’t make a difference either
> > (just mentioning since tokee already asked this on #solr, where i’ve
> > brought up the question earlier)
> >
> > it’s absolutely possible i’m focusing on a very wrong assumption - but
> > since switching the tokenizer does result in such a rather large
> > behaviour change, i think something is spooky here.
> >
> > i’ve read older issues and posts from the list, some of them pointed
> > out that it might be an optimization that edismax brings to the table -
> > i didn’t find anything specific about that.
> >
> > oh, and btw: if that would be working - my idea is to drop out
> > everything for a given phrase that is not a number, to match the phone
> > number, like this:
> >
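> > > <fieldType name="phone_number" class="solr.TextField">
> > >   <analyzer>
> > >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\d]" replacement=""/>
> > >   </analyzer>
> > > </fieldType>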
> >
> > any thoughts? or wild guesses?
> >
> > Thanks Stefan
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.damore@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>

Re: edismax, phrase field gets ignored for keyword tokenizer

Posted by Vincenzo D'Amore <v....@gmail.com>.

Hi Stefan,

I think the problem is solr.KeywordTokenizerFactory.
This tokenizer does not split the string at all; it emits the whole input
as a single token.

'+49 1234 12345678' -> '+49 1234 12345678'

edismax, on the other hand, splits the query on whitespace and looks for
'49', '1234' and '12345678', and none of these terms matches the single
token indexed in your person_mobile field.

Try a different tokenizer such as solr.StandardTokenizerFactory; this
should change your results.
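
To illustrate the mismatch, here is a minimal Python sketch. It is only a
simplified model and not Solr's actual parser: the helper names
keyword_tokenize and whitespace_split_query are made up for this example,
and the handling of the leading '+' simply mirrors what the debug output
in the quoted message shows.

```python
def keyword_tokenize(text):
    """Model of a keyword tokenizer: the whole input becomes one token."""
    return [text]


def whitespace_split_query(q):
    """Rough model of the whitespace split visible in the debug output.

    A leading '+' is treated as the "required" operator and dropped, which
    is why the parsed query shows person_mobile:49 rather than +49.
    """
    return [part.lstrip("+") for part in q.split()]


stored_value = "+49 1234 12345678"

# Index side: the keyword tokenizer keeps the value as one token.
indexed_tokens = keyword_tokenize(stored_value)

# Query side: split on whitespace first, then analyze each part.
query_terms = [
    token
    for part in whitespace_split_query(stored_value)
    for token in keyword_tokenize(part)
]

# A term only matches if it is identical to an indexed token.
matches = [term for term in query_terms if term in indexed_tokens]

print(indexed_tokens)
print(query_terms)
print(matches)
```

In this model the match list stays empty, because each query term is only
a fragment of the one indexed token.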

Bests,
Vincenzo

On Mon, Nov 7, 2016 at 4:05 PM, Stefan Matheis <ma...@gmail.com>
wrote:

> I’m guessing that i’m missing something obvious here - so feel free to
> ask for more details and to point out other directions i should be
> following.
>
> the problem goes as follows: the input in one case might be a phone
> number (like +49 1234 12345678), since we’re using edismax the parts
> get split on whitespace - which is fine. bringing the same field
> (based on TextField) to the party (using qf) doesn’t change a thing.
>
> > responseHeader:
> >     params:
> >         q: '+49 1234 12345678'
> >         defType: edismax
> >         qf: person_mobile
> >         pf: person_mobile^5
> > debug:
> >     rawquerystring: '+49 1234 12345678'
> >     querystring: '+49 1234 12345678'
> >     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) ())/no_coord'
> >     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ()'
>
> but .. as far as i was able to reduce the culprit, that only happens
> when i’m using solr.KeywordTokenizerFactory . as soon as i’m changing
> that to solr.StandardTokenizerFactory the phrase query appears as
> expected:
>
> > responseHeader:
> >     params:
> >         q: '+49 1234 12345678'
> >         defType: edismax
> >         qf: person_mobile
> >         pf: person_mobile^5
> > debug:
> >     rawquerystring: '+49 1234 12345678'
> >     querystring: '+49 1234 12345678'
> >     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) DisjunctionMaxQuery(((person_mobile:"49 1234 12345678")^5.0)))/no_coord'
> >     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)'
>
> removing the + at the beginning, doesn’t make a difference either
> (just mentioning since tokee already asked this on #solr, where i’ve
> brought up the question earlier)
>
> it’s absolutely possible i’m focusing on a very wrong assumption - but
> since switching the tokenizer does result in such a rather large
> behaviour change, i think something is spooky here.
>
> i’ve read older issues and posts from the list, some of them pointed
> out that it might be an optimization that edismax brings to the table -
> i didn’t find anything specific about that.
>
> oh, and btw: if that would be working - my idea is to drop out
> everything for a given phrase that is not a number, to match the phone
> number, like this:
>
> > <fieldType name="phone_number" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\d]" replacement=""/>
> >   </analyzer>
> > </fieldType>
>
> any thoughts? or wild guesses?
>
> Thanks Stefan
>
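
As for the digits-only idea in the quoted fieldType, the regex itself can
be tried outside Solr. The sketch below uses Python's re module as a
stand-in for the pattern replace filter; normalize_phone is a hypothetical
helper, and re.ASCII is set so that \d only covers 0-9, which I assume is
closer to what the Java regex engine does by default.

```python
import re

# Same pattern as in the quoted fieldType: anything that is not a digit.
NON_DIGIT = re.compile(r"[^\d]", re.ASCII)


def normalize_phone(text):
    """Drop every non-digit character, as the pattern/replacement pair would."""
    return NON_DIGIT.sub("", text)


value = "+49 1234 12345678"

# The whole value treated as one token, then stripped to digits.
indexed = normalize_phone(value)

# The same input analyzed as one string versus split on whitespace first.
whole_query = normalize_phone(value)
split_query = [normalize_phone(part) for part in value.split()]

print(indexed)
print(whole_query == indexed)
print([term == indexed for term in split_query])
```

Under these assumptions the normalized value only matches when the query
reaches the analyzer as one string; the whitespace-split parts still
compare unequal to the single indexed token.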



-- 
Vincenzo D'Amore
email: v.damore@gmail.com
skype: free.dev
mobile: +39 349 8513251