
Posted to users@solr.apache.org by Julian Hugo <ju...@data4life.care> on 2022/08/23 13:45:48 UTC

Terms with hyphens and fuzzy search

Hello,

I am getting peculiar results when I query for a term containing hyphens
and add fuzzy search
<https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches>.

I have indexed two items (1) "term-with-hyphens" and (2) "term with
hyphens". When I query ("q") for "term-with-hyphens" or "term with hyphens"
both items are returned as expected. The same is the case for escaped
hyphens "term\-with\-hyphens".

The problem: when I add the fuzzy search parameter (i.e.,
"term-with-hyphens~1" or "term\-with\-hyphens~1"), I get zero results back.

I struggle to understand these results, or how to solve the problem. My
intuition tells me that adding a fuzzy search parameter should, if anything,
increase the size of the result set. I would be grateful for any help on this!

Our current setup uses the "Extended DisMax Query Parser"
<https://solr.apache.org/guide/6_6/the-extended-dismax-query-parser.html>;
however, we observe the same behaviour with the "Standard Query Parser"
<https://solr.apache.org/guide/6_6/the-standard-query-parser.html>. We are
using the "Standard Tokenizer"
<https://solr.apache.org/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer>,
which splits at hyphens. Is this related to the problem?

Thank you!

*Julian Hugo*

Working Student
Backend Development

(he/his)

julian.hugo@data4life.care

D4L data4life gGmbH
Charlottenstraße 109
14467 Potsdam, Germany

www.data4life.care

Amtsgericht Potsdam, HRB 30667

Managing Director: Christian-Cornelius Weiß

We are Data4Life. We've been certified by the German Federal Office for
Information Security (BSI) in accordance with ISO 27001 on the basis of
"IT-Grundschutz".

Diversity is the driving force behind our work towards a society where
digital health improves quality of life for everyone.
Data4Life warmly welcomes applicants from the LGBTQI+ community, people
with a migration background, People of Color, and individuals with
disabilities or chronic illnesses to the team.

Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>

Re: Terms with hyphens and fuzzy search

Posted by Morten Ernebjerg <mo...@data4life.care>.

Hi David & Markus,
Thanks for the input! I think we now have the tools to work out a
solution for this.
Best,
Morten



-- 

*Morten Ernebjerg, Ph.D.*

Senior Developer


morten.ernebjerg@data4life.care

D4L data4life gGmbH

Charlottenstraße 109

14467 Potsdam, Germany

www.data4life.care

Amtsgericht Potsdam, HRB 30667

Managing Director: Christian-Cornelius Weiß


We are Data4Life. We've been certified by the German Federal Office for
Information Security (BSI) in accordance with ISO 27001 on the basis of
"IT-Grundschutz".


Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>

Re: Terms with hyphens and fuzzy search

Posted by David Hastings <ha...@gmail.com>.

And if you want to get really fun, use natural language/entity
extraction, put just those values into their own index field with stop words
removed, then bring in shingles, raise the shingle size to about four, and
boost that field with the pf parameter. I promise you won’t get bored. Your
index size will grow, but you should already have some metal behind you when
you start doing that.
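
A rough sketch of what such a request could look like, written in Python purely for illustration. The parameter names defType, q, qf and pf are the eDisMax ones discussed in this thread; the field names "text" and "text_shingles" and the boost value are made-up placeholders, not anything from the poster's schema.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, query_field, phrase_field, phrase_boost):
    """Return request parameters for an eDisMax query with a pf phrase boost.

    The field names passed in are hypothetical; substitute the fields
    defined in your own schema.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": query_field,
        # pf takes "field^boost": documents where the query terms appear
        # close together in this field get a higher score.
        "pf": f"{phrase_field}^{phrase_boost}",
    }


# Hypothetical fields: "text" for matching, "text_shingles" for the phrase boost.
params = build_edismax_params("term with hyphens", "text", "text_shingles", 5)
print(urlencode(params))
```

The encoded string can be appended to a select handler URL; which handler and collection to use depends on your own setup.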


Re: Terms with hyphens and fuzzy search

Posted by Dave <ha...@gmail.com>.

Yes, now I think you’re getting the concept. The dash is effectively whitespace and means nothing, like a period or comma, so the query is now three separate words. And to quote:

Once the list of matching documents has been identified using the fq and qf parameters, the pf parameter can be used to "boost" the score of documents in cases where all of the terms in the q parameter appear in close proximity.

There is a lot of power in the pf parameter; it might be closer to what you’re looking for. On a side note, there is a whole concept of shingles, which combine adjacent words into single tokens and could help you further. For example, with the shingle size set to two,
Dark storm rising
can turn into
Dark_storm
Storm_rising
Dark
Storm
Rising
It can get really fun when you do this and mix in stop words.
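
To make the shingle idea concrete, here is a minimal Python sketch. It is only an illustration of the concept, not Lucene's ShingleFilter: the function name, the underscore separator (copied from the example above) and the output order are assumptions made for this sketch.

```python
def shingles(tokens, max_size=2, sep="_"):
    """Return all word n-grams of length max_size down to 1, longest first.

    Each n-gram is joined with `sep`; single tokens are kept as they are.
    """
    result = []
    for size in range(max_size, 0, -1):
        for start in range(len(tokens) - size + 1):
            result.append(sep.join(tokens[start:start + size]))
    return result


# Case is preserved exactly as given in the input tokens.
print(shingles(["Dark", "storm", "rising"]))
# ['Dark_storm', 'storm_rising', 'Dark', 'storm', 'rising']
```

With max_size=3 the same input would additionally produce the single three-word shingle, which is why the index grows as the shingle size goes up.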

> On Aug 23, 2022, at 11:50 AM, Morten Ernebjerg <mo...@data4life.care> wrote:
> 
> Hi again
> 
> OK, so I think this is starting to make sense. What was confusing us was
> that we thought of a hyphenated term (like term-with-hyphens) as
> just a single term, meaning that fuzzy search should apply as usual.
> However, if I understand you correctly, it sounds like the correct
> statement is actually that fuzzy search applies to *terms that result in a
> single token after indexing*. Since the standard tokenizer splits on
> hyphens, fuzzy search would then not apply. Did I get that right?
> 
>> phrase query fields
> 
> I'm not sure I quite follow - do you mean using the qf query parameter or
> setting up separate "parallel" fields of some sort?
> 
> Best,
> 
> Morten
> 
>> On Tue, 23 Aug 2022 at 17:29, Dave <ha...@gmail.com> wrote:
>> 
>> OK, so from what I’m looking at, you have a proximity search, so the terms
>> have to be within the distance value of each other. In my example that
>> value is 2, which obviously won’t work since there are three terms. A fuzzy
>> search is based on a single term/token, so you need to add ~2 to each term
>> if that’s what you want. There’s really good
>> documentation about the difference and why it’s not working as you
>> expected here:
>> 
>> https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
>> 
>> Also try to make use of phrase query fields and boosting them,
>> 
>> 
>> 
>>> On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg <mo...@data4life.care> wrote:
>>> 
>>> (replying on behalf of my colleague Julian, who wrote this question and is
>>> unable to reply for technical reasons)
>>> Hi David,
>>> 
>>> Thanks for the reply! I think your question may point to something we
>>> overlooked. We are actually using Solr 8.11 and we want to use fuzzy search
>>> (https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
>>> i.e. find words that differ from the query by one or a few characters. Our
>>> understanding was that to get matches that differ by at most two characters
>>> from (using a separate line to avoid adding confusing quotation marks)
>>> 
>>> term-with-hyphens
>>> 
>>> we should send the following query (without any quotation marks):
>>> 
>>> term-with-hyphens~2
>>> 
>>> Our thinking was that the hyphenated term is one word so there is no need
>>> to quote it. We had a quick try quoting the hyphenated term in the query as
>>> you suggested and it looks like it works (i.e. returns matches). Since the
>>> standard tokenizer splits on hyphens, I'm wondering whether the unquoted
>>> query somehow gets converted to the *proximity search* query
>>> 
>>> "term with hyphens"~2
>>> 
>>> which then fails (though it looks like it should still match
>>> term-with-hyphens). Would be great to understand what is happening.
>>> 
>>> Best,
>>> 
>>> Morten
>>> 
>>>> On Tue, 23 Aug 2022 at 16:30, David Hastings <hastings.recursive@gmail.com> wrote:
>>>> 
>>>> I’m not certain of course of your tokenizer but shouldn’t it be
>>>> “terms-with-hyphens”~1
>>>> 
>>>> ? Just a syntax thing that may not have translated over email but curious

Re: Terms with hyphens and fuzzy search

Posted by Markus Jelsma <ma...@openindex.io>.

It has been a while, but I seem to remember that fuzzy queries are not
analyzed. That means you are looking for term-with-hyphens as a single
token, within a maximum edit distance of 1. But because you use an analyzer
that splits on hyphens, there is no term containing a hyphen in your index.

If you move to a WhitespaceTokenizer (and no WordDelimiterFilter) and
reindex, term-with-hyphens will be in the index as is. Then you
can find it using FuzzyQuery.
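
The effect described above can be reproduced with a small stand-alone Python simulation. This is a toy model under stated assumptions, not Solr or Lucene code: the two tokenizer functions only approximate splitting on hyphens versus splitting on whitespace, plain Levenshtein distance stands in for the fuzzy matching, and all function names are invented for this sketch.

```python
import re


def levenshtein(a, b):
    """Plain Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def split_on_non_word(text):
    """Rough stand-in for a tokenizer that splits on hyphens and spaces."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def split_on_whitespace(text):
    """Rough stand-in for a whitespace-only tokenizer."""
    return text.lower().split()


def build_index(docs, tokenizer):
    """Collect the set of distinct terms the tokenizer produces."""
    return {term for doc in docs for term in tokenizer(doc)}


def fuzzy_matches(index_terms, query_term, max_edits):
    """The query term is NOT tokenized; it is compared whole to each indexed term."""
    return sorted(t for t in index_terms if levenshtein(t, query_term) <= max_edits)


docs = ["term-with-hyphens", "term with hyphens"]
split_index = build_index(docs, split_on_non_word)    # term, with, hyphens
whole_index = build_index(docs, split_on_whitespace)  # also keeps term-with-hyphens

print(fuzzy_matches(split_index, "term-with-hyphens", 1))  # []
print(fuzzy_matches(whole_index, "term-with-hyphens", 1))  # ['term-with-hyphens']
print(fuzzy_matches(split_index, "hyphen", 1))             # ['hyphens']
```

The last line shows the other option raised in this thread: keeping the splitting tokenizer and applying the fuzzy operator to each word separately, since each individual word does exist as a term in the index.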

Op di 23 aug. 2022 om 17:50 schreef Morten Ernebjerg
<mo...@data4life.care>:

> Hi again
>
> OK, so I think this is starting to make sense, What was confusing us was
> that we indeed thought of a hyphenated term (like: term-with-hyphens) as
> just a single term, meaning that fuzzy search should apply as usual.
> However, if I understand you correctly, it sounds like the correct
> statement is actually that fuzzy search applies to *terms that result in a
> single token after indexing*. Since the standard tokenizer splits on
> hyphens, fuzzy search would then not apply. Did I get that right
>
> >phrase query fields
>
> I'm not sure I quite follow - do you mean using the qf query parameter or
> setting up separate "parallel" fields of some sort?
>
> Best,
>
> Morten
>
> On Tue, 23 Aug 2022 at 17:29, Dave <ha...@gmail.com> wrote:
>
> > Ok so from what I’m looking at you have a proximity search so the terms
> > have to be within the distance value of each other. In my example, 2,
> which
> > obviously won’t work since there are three terms.  A fuzzy search is
> based
> > on a single term/token. So you need to add ~2 to each term if that’s what
> > you want. There’s really good
> > Documentation about the difference and why it’s not working as you
> > expected here:
> >
> > https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
> >
> > Also try to make use of phrase query fields and boosting them,
> >
> >
> >
> > > On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg
> > <mo...@data4life.care> wrote:
> > >
> > > (replying on behalf of  my colleague Julius who wrote this question
> who
> > is
> > > unable to reply for technical reasons)
> > > Hi David,
> > >
> > > Thanks for the reply! I think your question may point to something we
> > > overlooked. We are actually using Solr 8.11 and we want to use fuzzy
> > search
> > > (
> > >
> >
> https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches
> > ),
> > > i.e. find words that differ from the query by one or a few characters.
> > Our
> > > understanding was that to get matches that differ by max two chars from
> > > (using separate line to avoid adding confusing quotation marks)
> > >
> > > term-with-hyphens
> > >
> > > we should send the following query (without any quotation marks):
> > >
> > > term-with-hyphens~2
> > >
> > > Our thinking was that the hyphenated term is one word so there is no
> need
> > > to quote it. We had a quick try quoting the hyphenated term in the
> query
> > as
> > > you suggested and it looks like it works (i.e. returns matches). Since
> > the
> > > standard tokenizer splits on hyphens, I'm wondering the unquoted query
> > > somehow gets converted to the *proximity search* query
> > >
> > > "term with hyphens"~2
> > >
> > > which then fails (though it looks like it should still match
> > > term-with-hyphens). Would be great to understand what is happening.
> > >
> > > Best,
> > >
> > > Morten

Re: Terms with hyphens and fuzzy search

Posted by Morten Ernebjerg <mo...@data4life.care>.

Hi again

OK, so I think this is starting to make sense. What was confusing us was
that we thought of a hyphenated term (like: term-with-hyphens) as
just a single term, meaning that fuzzy search should apply as usual.
However, if I understand you correctly, it sounds like the correct
statement is actually that fuzzy search applies to *terms that result in a
single token after indexing*. Since the standard tokenizer splits on
hyphens, fuzzy search would then not apply. Did I get that right?
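
A minimal sketch of that reading, in Python rather than anything Solr-specific. It assumes (this is not confirmed in the thread) that the fuzzy term is compared as one unsplit string against the tokens produced at index time. The `tokenize` helper is a hypothetical stand-in for the standard tokenizer, and plain Levenshtein distance stands in for whatever edit distance Solr uses internally.

```python
import re


def tokenize(text):
    """Hypothetical stand-in for the standard tokenizer: split on anything
    that is not a letter or digit, and lowercase the pieces."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_matches(query_term, indexed_tokens, max_edits):
    """Indexed tokens within max_edits of the (unsplit) query term."""
    return [t for t in indexed_tokens if edit_distance(query_term, t) <= max_edits]


# Both documents from the original post end up as the same three tokens.
indexed = sorted(set(tokenize("term-with-hyphens") + tokenize("term with hyphens")))

# The unsplit query term is far from every individual token.
whole_term_hits = fuzzy_matches("term-with-hyphens", indexed, max_edits=2)

# A single-token query with a typo is within range of an indexed token.
single_token_hits = fuzzy_matches("hyphans", indexed, max_edits=2)

print(indexed, whole_term_hits, single_token_hits)
```

Under these assumptions the whole hyphenated string finds nothing, while a misspelled single token still does, which would be consistent with the zero results reported in the original post.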

>phrase query fields

I'm not sure I quite follow - do you mean using the qf query parameter or
setting up separate "parallel" fields of some sort?

Best,

Morten

On Tue, 23 Aug 2022 at 17:29, Dave <ha...@gmail.com> wrote:

> Ok so from what I’m looking at you have a proximity search so the terms
> have to be within the distance value of each other. In my example, 2, which
> obviously won’t work since there are three terms.  A fuzzy search is based
> on a single term/token. So you need to add ~2 to each term if that’s what
> you want. There’s really good
> Documentation about the difference and why it’s not working as you
> expected here:
>
> https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
>
> Also try to make use of phrase query fields and boosting them,

-- 

*Morten Ernebjerg, Ph.D.*

Senior Developer


morten.ernebjerg@data4life.care

D4L data4life gGmbH

Charlottenstraße 109

14467 Potsdam, Germany

www.data4life.care

Amtsgericht Potsdam, HRB 30667

Managing Director: Christian-Cornelius Weiß


We are Data4Life. We've been certified by the German Federal Office for
Information Security (BSI) in accordance with ISO 27001 on the basis of
"IT-Grundschutz".


Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>

Re: Terms with hyphens and fuzzy search

Posted by Dave <ha...@gmail.com>.

OK, so from what I’m looking at, the quoted form is a proximity search, so the terms have to be within the distance value of each other. In the example, the distance is 2, which obviously won’t work since there are three terms. A fuzzy search is based on a single term/token, so you need to add ~2 to each term if that’s what you want. There’s really good documentation about the difference, and why it’s not working as you expected, here:

https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/

Also try to make use of phrase query fields and boosting them.
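
One way to read the "add ~2 to each term" suggestion, as a small Python sketch. The `fuzzy_per_term` helper is hypothetical, and splitting on hyphens and whitespace only approximates what the standard tokenizer does.

```python
import re


def fuzzy_per_term(text, max_edits=2):
    """Split text on hyphens/whitespace and append ~max_edits to each piece."""
    # Lucene-based fuzzy queries support at most 2 edits.
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits should be 0, 1 or 2")
    terms = [t for t in re.split(r"[\s\-]+", text) if t]
    return " ".join(f"{t}~{max_edits}" for t in terms)


query = fuzzy_per_term("term-with-hyphens", max_edits=2)
print(query)  # term~2 with~2 hyphens~2
```

Whether all of the resulting terms must match depends on the default operator and minimum-match settings of the request handler, which this thread does not specify.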



> On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg <mo...@data4life.care> wrote:
> 
> (replying on behalf of  my colleague Julius who wrote this question who is
> unable to reply for technical reasons)
> Hi David,
> 
> Thanks for the reply! I think your question may point to something we
> overlooked. We are actually using Solr 8.11 and we want to use fuzzy search
> (
> https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
> i.e. find words that differ from the query by one or a few characters. Our
> understanding was that to get matches that differ by max two chars from
> (using separate line to avoid adding confusing quotation marks)
> 
> term-with-hyphens
> 
> we should send the following query (without any quotation marks):
> 
> term-with-hyphens~2
> 
> Our thinking was that the hyphenated term is one word so there is no need
> to quote it. We had a quick try quoting the hyphenated term in the query as
> you suggested and it looks like it works (i.e. returns matches). Since the
> standard tokenizer splits on hyphens, I'm wondering the unquoted query
> somehow gets converted to the *proximity search* query
> 
> "term with hyphens"~2
> 
> which then fails (though it looks like it should still match
> term-with-hyphens). Would be great to understand what is happening.
> 
> Best,
> 
> Morten

Re: Terms with hyphens and fuzzy search

Posted by Morten Ernebjerg <mo...@data4life.care>.

(replying on behalf of my colleague Julian, who wrote this question and is
unable to reply for technical reasons)

Hi David,

Thanks for the reply! I think your question may point to something we
overlooked. We are actually using Solr 8.11 and we want to use fuzzy search
(https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
i.e. find words that differ from the query by one or a few characters. Our
understanding was that to get matches that differ by at most two characters from
(using a separate line to avoid adding confusing quotation marks)

term-with-hyphens

we should send the following query (without any quotation marks):

term-with-hyphens~2

Our thinking was that the hyphenated term is one word, so there is no need
to quote it. We had a quick try quoting the hyphenated term in the query as
you suggested and it looks like it works (i.e. returns matches). Since the
standard tokenizer splits on hyphens, I'm wondering whether the unquoted query
somehow gets converted to the *proximity search* query

"term with hyphens"~2

which then fails (though it looks like it should still match
term-with-hyphens). Would be great to understand what is happening.
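
For comparison, here is how the two forms above look when sent as the q parameter, sketched in Python. The host, port and collection name are placeholders, not taken from our setup. In the query parser documentation, a tilde after a bare term requests a fuzzy search, while a tilde after a quoted phrase requests a proximity search. Adding debugQuery=true should make Solr include the parsed query in the response, which may show what the unquoted form is turned into.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: host, port and collection name are assumptions.
SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def select_url(q):
    """Build a select URL with q URL-encoded and query debugging switched on."""
    return SELECT_URL + "?" + urlencode({"q": q, "debugQuery": "true"})


fuzzy_q = "term-with-hyphens~2"        # tilde after a bare term: fuzzy search
proximity_q = '"term with hyphens"~2'  # tilde after a quoted phrase: proximity search

for q in (fuzzy_q, proximity_q):
    print(select_url(q))
```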

Best,

Morten



On Tue, 23 Aug 2022 at 16:30, David Hastings <ha...@gmail.com>
wrote:

> I’m not certain of course of your tokenizer but shouldn’t it be
> “terms-with-hyphens”~1
>
> ? Just a syntax thing that may not have translated over email but curious


Re: Terms with hyphens and fuzzy search

Posted by David Hastings <ha...@gmail.com>.

I’m not certain of your tokenizer, of course, but shouldn’t it be

"term-with-hyphens"~1

? Just a syntax thing that may not have translated over email, but I’m curious.

On Tue, Aug 23, 2022 at 10:12 AM Julian Hugo <ju...@data4life.care>
wrote:

> Hello,
>
> I am getting peculiar results when querying for a term containing hyphens
> and add fuzzy search
> <https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches>
> .
>
> I have indexed two items (1) "term-with-hyphens" and (2) "term with
> hyphens". When I query ("q") for "term-with-hyphens" or "term with hyphens"
> both items are returned as expected. The same is the case for escaped
> hyphens "term\-with\-hyphens".
>
> The problem: When I add the fuzzy search parameter (i.e.,
> "term-with-hyphens~1" or "term\-with\-hyphens~1"). I get zero results back.
>
> I struggle to understand the results, or how to solve this problem. My
> intuition tells me that adding a fuzzy search parameter should surely
> increase the size of the set of results. I am happy for any help on this!
>
> Our current setup is using the "Extended DisMax Query Parser"
> <https://solr.apache.org/guide/6_6/the-extended-dismax-query-parser.html>
> however we observe the same behaviour using the "Standard Query Parser
> <https://solr.apache.org/guide/6_6/the-standard-query-parser.html>". We are
> using the "Standard Tokenizer
> <https://solr.apache.org/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer>",
> which splits at hyphens. Does this relate to this problem?
>
> Thank you!