Posted to solr-user@lucene.apache.org by Mike Drob <md...@mdrob.com> on 2020/10/08 22:01:59 UTC

Folding Repeated Letters

I'm looking for a way to transform words with repeated letters into the
same token - does something like this exist out of the box? Do our stemmers
support it?

For example, say I would want all of these terms to return the same search
results:

YES
YESSS
YYYEEESSS
YYEESSSS[...]S

I don't know how long a user would hold down the S key at the end to
capture their level of excitement, and I don't want to manually define
synonyms for every length.

I'm pretty sure that I don't want PhoneticFilter here, maybe
PatternReplace? Not a huge fan of how that one is configured, and I think
I'd have to set up a bunch of patterns inline for it?

Mike

Re: Folding Repeated Letters

Posted by Walter Underwood <wu...@wunderwood.org>.
Actually, helping the humans to use proper spelling is a good approach. Include a
spelling correction step (non-optional) for user-generated content and spelling
suggestions for queries. Completion/suggestion is another way to guide people
to properly spelled words that exist in your index.

I agree that trying to fix this after you have the query is hard.

If edismax supported fuzzy matching, it would be much easier. I know that, because
we’ve been running that patch (SOLR-629) in prod for several years.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2020, at 4:27 AM, Erick Erickson <er...@gmail.com> wrote:
> 
> Anything you do will be wrong ;).
> 
> I suppose you could kick out words that weren’t in some dictionary, accumulate a list of them, and just deal with them “somehow”, but that’s labor-intensive since you then have to deal with proper names and the like. Sometimes you can get by with ignoring words with _only_ the first letter capitalized, which is also not perfect but might get you closer. You mentioned phonetic filters, but frankly I have no idea whether YES and YYYYYYEEEEEEEESSSSSSSS would reduce to the same code; I rather doubt it.
> 
> In general, you _can’t_ solve this problem perfectly without inspecting each input; you can only get an approximation. And at some point it’s worth asking “is it worth it?” I suppose you could try the regex Andy suggested in a copyField destination and use that as well as the primary field in queries; that might help at least find things like this.
> 
> If we were just able to require humans to use proper spelling, this would be a lot easier….
> 
> Wish there were a solution
> 
> Best,
> Erick


Re: Folding Repeated Letters

Posted by Erick Erickson <er...@gmail.com>.
Anything you do will be wrong ;).

I suppose you could kick out words that weren’t in some dictionary, accumulate a list of them, and just deal with them “somehow”, but that’s labor-intensive since you then have to deal with proper names and the like. Sometimes you can get by with ignoring words with _only_ the first letter capitalized, which is also not perfect but might get you closer. You mentioned phonetic filters, but frankly I have no idea whether YES and YYYYYYEEEEEEEESSSSSSSS would reduce to the same code; I rather doubt it.

In general, you _can’t_ solve this problem perfectly without inspecting each input; you can only get an approximation. And at some point it’s worth asking “is it worth it?” I suppose you could try the regex Andy suggested in a copyField destination and use that as well as the primary field in queries; that might help at least find things like this.
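
A rough sketch of that copyField idea, written as Python that only builds the request payloads. The field names body and body_norepeat and the boost values are assumptions for illustration; "norepeat" is the field type from Andy's message. The add-field and add-copy-field commands belong to the Schema API, and qf is the edismax query-fields parameter.

```python
import json

# Hypothetical field names; "norepeat" is the field type from Andy's example.
SOURCE_FIELD = "body"
FOLDED_FIELD = "body_norepeat"


def schema_commands():
    """Schema API payload: a folded field fed from the primary one via copyField."""
    return {
        "add-field": {
            "name": FOLDED_FIELD,
            "type": "norepeat",
            "indexed": True,
            "stored": False,
        },
        "add-copy-field": {"source": SOURCE_FIELD, "dest": FOLDED_FIELD},
    }


def query_params(user_query):
    """edismax parameters that search both fields, weighting the primary higher."""
    return {
        "defType": "edismax",
        "q": user_query,
        # The boost values are illustrative guesses, not tuned numbers.
        "qf": f"{SOURCE_FIELD}^2 {FOLDED_FIELD}^0.5",
    }


print(json.dumps(schema_commands(), indent=2))
print(query_params("YESSS"))
```

The payload would be POSTed to the collection's schema endpoint. Weighting the primary field higher is one way to keep correctly spelled matches on top, though the right weights depend on the data.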

If we were just able to require humans to use proper spelling, this would be a lot easier….

Wish there were a solution

Best,
Erick

> On Oct 8, 2020, at 10:59 PM, Mike Drob <md...@mdrob.com> wrote:
> 
> I was thinking about that, but there are words that are legitimately
> different with repeated consonants. My primary school teacher lost hair
> over getting us to learn the difference between desert and dessert.
> 
> Maybe we need something that can borrow the boosting behaviour of fuzzy
> query - match the exact term, but also the neighbors with a slight deboost,
> so that if the main term exists those others won't show up.


Re: Folding Repeated Letters

Posted by Mike Drob <md...@mdrob.com>.
I was thinking about that, but there are words that are legitimately
different with repeated consonants. My primary school teacher lost hair
over getting us to learn the difference between desert and dessert.

Maybe we need something that can borrow the boosting behaviour of fuzzy
query - match the exact term, but also the neighbors with a slight deboost,
so that if the main term exists those others won't show up.
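
To make the desert/dessert problem concrete, here is a toy Python sketch. It is not how Solr scores anything: the weights and the score/rank helpers are made up for illustration, and it only ranks folded matches below exact ones rather than hiding them.

```python
import re

REPEAT = re.compile(r"(.)\1+")


def fold(term):
    """Lowercase and collapse runs of a repeated character."""
    return REPEAT.sub(r"\1", term.lower())


# Hypothetical weights: an exact hit outranks a folded-only hit.
EXACT_WEIGHT = 1.0
FOLDED_WEIGHT = 0.3


def score(query, indexed_term):
    """Toy scoring: full weight for an exact match, less for a folded match."""
    if query.lower() == indexed_term.lower():
        return EXACT_WEIGHT
    if fold(query) == fold(indexed_term):
        return FOLDED_WEIGHT
    return 0.0


def rank(query, terms):
    """Return the matching terms, best score first."""
    scored = [(score(query, t), t) for t in terms]
    return [t for s, t in sorted(scored, key=lambda p: -p[0]) if s > 0]


print(fold("desert") == fold("dessert"))              # the two words collide
print(rank("dessert", ["desert", "dessert", "yes"]))  # exact match ranked first
print(rank("yesss", ["desert", "dessert", "yes"]))    # folded match still found
```

With folding alone, desert and dessert become the same token, so the exact-first weighting is what keeps them apart in the ranking.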

On Thu, Oct 8, 2020 at 5:46 PM Andy Webb <an...@gmail.com> wrote:

> How about something like this?
>
> {
>     "add-field-type": [
>         {
>             "name": "norepeat",
>             "class": "solr.TextField",
>             "analyzer": {
>                 "tokenizer": {
>                     "class": "solr.StandardTokenizerFactory"
>                 },
>                 "filters": [
>                     {
>                         "class": "solr.LowerCaseFilterFactory"
>                     },
>                     {
>                         "class": "solr.PatternReplaceFilterFactory",
>                         "pattern": "(.)\\1+",
>                         "replacement": "$1"
>                     }
>                 ]
>             }
>         }
>     ]
> }
>
> This finds a match...
>
> http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes&analysis.query=yyyyYyyyyyyeeEssSsssss&analysis.fieldtype=norepeat
>
> Andy

Re: Folding Repeated Letters

Posted by Andy Webb <an...@gmail.com>.
How about something like this?

{
    "add-field-type": [
        {
            "name": "norepeat",
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {
                    "class": "solr.StandardTokenizerFactory"
                },
                "filters": [
                    {
                        "class": "solr.LowerCaseFilterFactory"
                    },
                    {
                        "class": "solr.PatternReplaceFilterFactory",
                        "pattern": "(.)\\1+",
                        "replacement": "$1"
                    }
                ]
            }
        }
    ]
}

This finds a match...
http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes&analysis.query=yyyyYyyyyyyeeEssSsssss&analysis.fieldtype=norepeat
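
For anyone without a Solr instance handy, the effect of that filter chain can be approximated in a few lines of Python. This is only a sketch: str.split() stands in for StandardTokenizer, and Python's re module stands in for the Java regex engine (the two agree on this simple pattern).

```python
import re

# Same pattern as in the PatternReplaceFilterFactory config above: any character
# followed by one or more copies of itself. Java's "$1" is r"\1" in Python.
REPEAT = re.compile(r"(.)\1+")


def fold_repeats(text):
    """Rough stand-in for the norepeat chain: split, lowercase, collapse runs."""
    # str.split() is only an approximation of StandardTokenizer.
    return [REPEAT.sub(r"\1", token.lower()) for token in text.split()]


for sample in ["Yes", "YESSS", "YYYEEESSS", "yyyyYyyyyyyeeEssSsssss"]:
    print(sample, "->", fold_repeats(sample))
```

Lowercasing has to happen before the pattern is applied, otherwise a mixed-case run such as "sS" would not count as a repeat.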

Andy



On Thu, 8 Oct 2020 at 23:02, Mike Drob <md...@mdrob.com> wrote:

> I'm looking for a way to transform words with repeated letters into the
> same token - does something like this exist out of the box? Do our stemmers
> support it?
>
> For example, say I would want all of these terms to return the same search
> results:
>
> YES
> YESSS
> YYYEEESSS
> YYEESSSS[...]S
>
> I don't know how long a user would hold down the S key at the end to
> capture their level of excitement, and I don't want to manually define
> synonyms for every length.
>
> I'm pretty sure that I don't want PhoneticFilter here, maybe
> PatternReplace? Not a huge fan of how that one is configured, and I think
> I'd have to set up a bunch of patterns inline for it?
>
> Mike
>

Re: Folding Repeated Letters

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Are there that many of those words? Because even if you deal with
yyyyeeeessss, there is still yas!

Maybe you just have regexp synonyms? (ye+s+)
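
A quick Python check of what that pattern would and would not cover, assuming it is applied to a whole token and case-insensitively (both are assumptions). How a regexp synonym would actually be wired into Solr is left open here.

```python
import re

# The pattern from the message above, matched against the whole token.
YES_VARIANT = re.compile(r"ye+s+", re.IGNORECASE)


def is_yes_variant(token):
    """True if the entire token is 'y', then one or more 'e', then one or more 's'."""
    return YES_VARIANT.fullmatch(token) is not None


for token in ["YES", "YESSS", "YYYEEESSS", "yas"]:
    print(token, is_yes_variant(token))
```

As written, the pattern allows only a single leading y, so a token like YYYEEESSS would need y+e+s+ instead, and yas is not covered either way.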

Good luck,
   413x

On Thu., Oct. 8, 2020, 6:02 p.m. Mike Drob, <md...@mdrob.com> wrote:

> I'm looking for a way to transform words with repeated letters into the
> same token - does something like this exist out of the box? Do our stemmers
> support it?
>
> For example, say I would want all of these terms to return the same search
> results:
>
> YES
> YESSS
> YYYEEESSS
> YYEESSSS[...]S
>
> I don't know how long a user would hold down the S key at the end to
> capture their level of excitement, and I don't want to manually define
> synonyms for every length.
>
> I'm pretty sure that I don't want PhoneticFilter here, maybe
> PatternReplace? Not a huge fan of how that one is configured, and I think
> I'd have to set up a bunch of patterns inline for it?
>
> Mike
>