Posted to issues@lucene.apache.org by "Mayya Sharipova (Jira)" <ji...@apache.org> on 2021/06/23 13:39:03 UTC

[jira] [Closed] (LUCENE-9575) Add PatternTypingFilter

     [ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova closed LUCENE-9575.
-----------------------------------

Closing after the 8.9.0 release

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>             Fix For: 8.9
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> One of the key asks when the Library of Congress asked me to develop the Advanced Query Parser was the ability to recognize arbitrary patterns that include punctuation, such as POW/MIA, 401(k), or C++. Additionally, they wanted 401k and 401(k) to match documents with either style of reference, and NOT match documents that happen to have isolated 401 or k tokens (i.e. not documents about the HTTP status code). And of course we wanted to give up as few as possible of the text analysis features they were already using.
> This filter, in conjunction with the filters from LUCENE-9572 and LUCENE-9574 and one Solr-specific filter in SOLR-14597 that re-analyzes tokens with an arbitrary analyzer defined for a type in the Solr schema, achieves this.
> This filter has the job of spotting the patterns and adding the intended synonym as a type to the token (from which minimal punctuation has been removed). It also sets flags on the token, which are retained through the analysis chain. At the very end, the type is converted to a synonym and the original token(s) for that type are dropped, avoiding the match on 401 (for example).
> The pattern matching is specified in a file that looks like: 
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k, and 501(c)3, as well as C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement so the first line above would create synonyms such as:
> {code}
> 401k   --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}
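
The rule format described above can be illustrated with a small standalone sketch. This is not the filter's actual implementation: the class PatternTypingSketch and its methods parseRule, parseRules and typeFor are hypothetical names invented for this example, and it uses only java.util.regex rather than any Lucene API. It also assumes that a rule must match the whole token and that the first matching rule wins, neither of which is stated in the issue description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
class PatternTypingSketch {

  /** One parsed rule line: integer flags, compiled pattern, type template. */
  static final class Rule {
    final int flags;
    final Pattern pattern;
    final String template;

    Rule(int flags, Pattern pattern, String template) {
      this.flags = flags;
      this.pattern = pattern;
      this.template = template;
    }
  }

  private static final String SEPARATOR = " ::: ";

  /** Parses one line of the form: <flagsInt> <pattern> ::: <replacement> */
  static Rule parseRule(String line) {
    int firstSpace = line.indexOf(' ');
    int sep = line.indexOf(SEPARATOR);
    if (firstSpace < 1 || sep <= firstSpace) {
      throw new IllegalArgumentException("Malformed rule: " + line);
    }
    int flags = Integer.parseInt(line.substring(0, firstSpace));
    Pattern pattern = Pattern.compile(line.substring(firstSpace + 1, sep));
    String template = line.substring(sep + SEPARATOR.length());
    return new Rule(flags, pattern, template);
  }

  /** Parses every non-blank line of a rule file into a Rule. */
  static List<Rule> parseRules(List<String> lines) {
    List<Rule> rules = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        rules.add(parseRule(trimmed));
      }
    }
    return rules;
  }

  /**
   * Returns the type produced by the first rule whose pattern matches the
   * whole token (an assumption of this sketch), or null if no rule matches.
   */
  static String typeFor(String token, List<Rule> rules) {
    for (Rule rule : rules) {
      Matcher m = rule.pattern.matcher(token);
      if (m.matches()) {
        // Expand $1, $2, ... in the template from the groups just matched.
        StringBuffer type = new StringBuffer();
        m.appendReplacement(type, rule.template);
        return type.toString();
      }
    }
    return null;
  }
}
```

Fed the three rule lines from the description, typeFor maps both 401k and 401(k) to legal2_401_k, as in the listing above, while an isolated 401 or k matches no rule and yields null. Requiring a whole-token match is what keeps the first rule from claiming the 501(c) prefix of 501(c)3 in this sketch; the real filter may resolve that differently.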



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
