Posted to dev@lucene.apache.org by "Ivan Provalov (JIRA)" <ji...@apache.org> on 2016/06/08 17:47:21 UTC

[jira] [Updated] (LUCENE-7321) Character Mapping

     [ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated LUCENE-7321:
----------------------------------
    Attachment: LUCENE-7321.patch

Initial patch.

> Character Mapping
> -----------------
>
>                 Key: LUCENE-7321
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7321
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>            Reporter: Ivan Provalov
>            Priority: Minor
>              Labels: patch
>             Fix For: 6.0.1
>
>         Attachments: LUCENE-7321.patch
>
>
> One of the challenges in search is recalling an item despite a common typing variant.  These cases can be as simple as lower/upper case in most languages or accented characters, or as complex as morphological phenomena like prefix omission or composing a character with a combining mark.  This component addresses cases that are not covered by the ASCII folding component or that are more complex to design with other tools.  The idea is that a linguist could provide the mappings in a tab-delimited file, which can then be used directly by Solr.
> The mappings are maintained in a tab-delimited file, which could simply be copied and pasted from an Excel spreadsheet.  This lets linguists create the mappings and lets developers then include them in the Solr configuration.  In a few cases, when the mappings grow complex, some additional debugging may be required.  A mapping can take any sequence of characters to any other sequence of characters.
> Some of the cases I discuss in the detailed document are handling voiced vowels for Japanese; common typing substitutions for Korean, Russian, and Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and suffix folding for Japanese.
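
To make the tab-delimited idea in the description concrete, here is a minimal, standalone Java sketch. It is not the code from LUCENE-7321.patch, and it uses no Lucene or Solr classes. The class name TabMappingSketch, the '#' comment convention, the longest-match rule, and the sample mappings are all assumptions made for illustration. Each line is read as "source<TAB>target", and the input is rewritten left to right.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the component proposed in the patch.
final class TabMappingSketch {

    /**
     * Parses lines of the form "source<TAB>target".
     * Empty lines and lines starting with '#' are skipped (an assumed convention).
     * The target may be empty, which deletes the source sequence.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                // No tab at all, or an empty source before the tab.
                throw new IllegalArgumentException("Expected source<TAB>target: " + line);
            }
            mappings.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return mappings;
    }

    /**
     * Rewrites the input left to right. At each position the longest
     * matching source sequence wins; unmatched characters are copied as-is.
     */
    static String apply(Map<String, String> mappings, String input) {
        int maxLen = 0;
        for (String source : mappings.keySet()) {
            maxLen = Math.max(maxLen, source.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            for (int len = Math.min(maxLen, input.length() - i); len > 0; len--) {
                String target = mappings.get(input.substring(i, i + len));
                if (target != null) {
                    out.append(target);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample mappings are illustrative, not taken from the patch.
        List<String> lines = Arrays.asList(
            "# source<TAB>target",
            "\u0451\t\u0435",   // Cyrillic io -> Cyrillic ie
            "\u0142\tl");       // Polish l with stroke -> plain l
        Map<String, String> mappings = parse(lines);
        System.out.println(apply(mappings, "\u0451\u043b\u043a\u0430"));
    }
}
```

A real analysis component would also need to track character offsets for highlighting and to operate on a stream rather than a whole string. The sketch leaves both out.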



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
