Posted to dev@lucene.apache.org by "Marko Asplund (JIRA)" <ji...@apache.org> on 2007/10/15 09:26:51 UTC

[jira] Created: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Illegal character replacements in ISOLatin1AccentFilter
-------------------------------------------------------

                 Key: LUCENE-1029
                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.2
            Reporter: Marko Asplund


The ISOLatin1AccentFilter class is responsible for replacing "accented characters in the ISO Latin 1 character set by their unaccented equivalent".

Some of the replacements performed for Scandinavian characters (used e.g. in Finnish, Swedish, Danish etc.) are illegal. These characters differ from the accented characters used in Latin-based languages such as French: ä, ö and å represent entirely independent sounds in the language and therefore cannot be replaced with any other sound without a change of meaning. It is therefore illegal to replace these characters with any other character.

This means, for example, that you can't change the Finnish word sää (weather) to saa (will have), because these are two entirely different words with different meanings. The same applies to the Scandinavian languages as well.

There's no connection between the sounds represented by ä and a, ö and o, or å and a.

In addition to the three characters mentioned above, Danish and Norwegian use other special characters such as ø and æ. It should be checked whether the replacement is legal for these characters.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Marko Asplund (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534829 ] 

Marko Asplund commented on LUCENE-1029:
---------------------------------------

It's also very easy to find examples in the Finnish language where the meaning of the word changes when you make the character replacements done by the filter class.

Just to give you some examples:
- sää (weather) ==> saa (will have)
- pässi (goat) ==> passi (passport)
...

The filter class Javadoc says the following:

"A filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. The case will not be altered."

In my opinion changing the meaning of a word does not qualify as an "equivalent" replacement.
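
The two examples above can be reproduced with a short Java sketch. This is not the ISOLatin1AccentFilter source: it assumes that a generic accent stripper can be approximated with java.text.Normalizer (decompose to NFD, then drop the combining marks). The class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Lucene.
final class AccentStripDemo {

    // Decompose each character (NFD) and drop the combining marks (\p{M}).
    // This approximates "replace accented characters by unaccented ones".
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e4 is "a with diaeresis".
        System.out.println(stripAccents("s\u00e4\u00e4")); // prints "saa"
        System.out.println(stripAccents("p\u00e4ssi"));    // prints "passi"
    }
}
```

Both outputs are valid Finnish words with a different meaning from the input, which is the collision described above.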




[jira] Closed: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller closed LUCENE-1029.
-------------------------------

    Resolution: Invalid

The new ASCIIFoldingFilter is the current best work on this. Future issues should be targeted at that - but I think it does what we want it to do - individual issues can be brought up if they exist.

> Illegal character replacements in ISOLatin1AccentFilter
> -------------------------------------------------------
>
>                 Key: LUCENE-1029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Marko Asplund
>         Attachments: ISOLatin1AccentFilter-by-Collator.patch, ISOLatin1AccentFilter-javadoc.patch


[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534818 ] 

Karl Wettin commented on LUCENE-1029:
-------------------------------------

>> With the accent filter, running the Swedish word "kön" through the filter would 
>> create "kon". The first means "gender" and the second "cow". That would not be acceptable.
>
> I am feeling lazy right now, but it seems to me you could find a similar rare stemming
> example (e.g. something that means something else in its stemmed form). The process
> is algorithmic after all, and there are many languages with plenty of words out there.

Just to point out, pretty much any short word (less than, say, 6 letters or so) in Swedish containing å, ä or ö would get a completely different meaning if you replace the letters.







[jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroaki Kawai updated LUCENE-1029:
----------------------------------

    Attachment: ISOLatin1AccentFilter-by-Collator.patch

I wrote a patch that uses java.text.Collator.
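
The patch itself is not reproduced in this thread. As a rough illustration of why java.text.Collator is relevant here, the sketch below compares two strings at PRIMARY strength, where a collator looks only at what its locale considers base letters. The class and method names are hypothetical, and this is not claimed to be how the patch works.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class, not taken from the attached patch.
final class CollatorStrengthDemo {

    // True if the collator for the given locale treats the two strings as
    // equal at PRIMARY strength, i.e. when only base letters are compared.
    static boolean samePrimary(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // English rules: \u00e9 (e with acute) is an accented form of "e".
        System.out.println(samePrimary(Locale.ENGLISH, "e", "\u00e9"));
        // Swedish rules: the result depends on the collation data shipped
        // with the runtime, so no expected value is claimed here.
        System.out.println(samePrimary(Locale.forLanguageTag("sv-SE"), "a", "\u00e4"));
    }
}
```

The point of the design is that the decision "is this an accent or a separate letter?" is delegated to per-locale collation rules instead of a single hard-coded table.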



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535040 ] 

Karl Wettin commented on LUCENE-1029:
-------------------------------------

Hoss Man - 15/Oct/07 02:47 PM
> the equivalence described in the javadocs is one of visual character equivalence, not of semantic word equivalence - that would be a lot more complicated. if anyone would like to submit a patch containing a new filter that is capable of doing that, i'm sure the community would certainly welcome it.

I think you misunderstand why I focused on the stemmer. My point was that this filter cannot be compared with a stemmer, as was done in earlier posts.

I do not think that the documentation is misleading, nor do I think there is any need to break backwards compatibility. All I say is that I welcome a solution that makes this filter more configurable. Not sure what a smart way to do that would be, though. I'm open for a discussion. Perhaps one could feed it with exceptions, perhaps a per-language definition, perhaps something else?
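
One way to picture the "feed it with exceptions" idea is the sketch below. It is purely illustrative: the class name is hypothetical, it uses java.text.Normalizer as a stand-in for the filter's own replacement table, and it assumes the input uses precomposed characters (for example ö as the single code point U+00F6).

```java
import java.text.Normalizer;
import java.util.Set;

// Hypothetical sketch of an accent stripper with per-language exceptions.
final class ExceptionAwareStripper {
    private final Set<Character> keep;

    // "keep" lists the characters that must pass through unchanged.
    ExceptionAwareStripper(Set<Character> keep) {
        this.keep = keep;
    }

    String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (keep.contains(c)) {
                out.append(c); // protected letter, leave as is
            } else {
                // Decompose this one character and drop its combining marks.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD)
                                     .replaceAll("\\p{M}+", ""));
            }
        }
        return out.toString();
    }
}
```

A Swedish configuration might pass Set.of('å', 'ä', 'ö', 'Å', 'Ä', 'Ö'), so that "kön" stays "kön" while "café" still becomes "cafe".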



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534987 ] 

Hoss Man commented on LUCENE-1029:
----------------------------------

The functionality of ISOLatin1AccentFilter shouldn't change in a way that wouldn't be backward compatible. If people feel the documentation is misleading and doesn't accurately reflect what the Filter does, then by all means please submit a documentation patch.

First and foremost, the purpose of this filter is to replace accented characters with non-accented characters ... the equivalence described in the javadocs is one of visual character equivalence, not of semantic word equivalence -- that would be a lot more complicated. If anyone would like to submit a patch containing a new filter that is capable of doing that, I'm sure the community would certainly welcome it.



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535047 ] 

DM Smith commented on LUCENE-1029:
----------------------------------

One could maintain compatibility by adding a constructor that supplies a transliteration, where a transliteration is an implementation of an interface Transliteration. The default would be the current behavior. But I don't think that buys very much. It is kind of like saying a filter can contain a filter.
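
A minimal sketch of that constructor idea follows. Everything in it is hypothetical: Transliteration is the interface name used in the comment above, not an existing Lucene type, and the class is a plain stand-in operating on strings (a real Lucene filter would extend TokenFilter). The default shown uses java.text.Normalizer only to approximate the current "strip everything" behaviour.

```java
import java.text.Normalizer;

// Hypothetical interface named after the comment above; not a Lucene API.
interface Transliteration {
    String apply(String term);
}

// Hypothetical stand-in for the filter, working on plain strings.
final class ConfigurableAccentFilter {
    private final Transliteration transliteration;

    // No-argument constructor: keep the existing behaviour as the default.
    ConfigurableAccentFilter() {
        this(term -> Normalizer.normalize(term, Normalizer.Form.NFD)
                               .replaceAll("\\p{M}+", ""));
    }

    // New constructor: the caller supplies the replacement strategy.
    ConfigurableAccentFilter(Transliteration transliteration) {
        this.transliteration = transliteration;
    }

    String filter(String term) {
        return transliteration.apply(term);
    }
}
```

Existing callers would keep using the no-argument constructor, so nothing changes for them. The sketch also shows the objection raised above: the filter ends up doing little more than delegating to whatever it wraps.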



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534804 ] 

Mark Miller commented on LUCENE-1029:
-------------------------------------

I think Uwe nailed this one. Stripping accents in general is just not "legal". But many times it is desirable. This filter does that for you. It goes without saying that if you strip the accent you change the meaning...likewise, when you stem a word you create illegal words...



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534797 ] 

Uwe Schindler commented on LUCENE-1029:
---------------------------------------

This is true for other European languages, too. In German there is also a difference between "ä" and "a" (they sound different). A correct replacement in German would be to replace "ä" with "ae" (two chars).
But I think it is not a problem. The real use of this filter is to enable people from other countries, who lack these keys on their keyboards, to search in a Lucene index. Many Americans, for example, always search for the German last name "Müller" by typing "Muller", because they cannot enter the umlaut. In Scandinavian languages it will be the same: they would enter "o" instead of "ø". The accent filter is just there to enable this. If you create an index just for one Scandinavian country, just leave this filter out.
And in principle it is no problem to find documents that do not match the entered keywords exactly.
The filter is similar to the Soundex filter: after a transformation to Soundex the word looks different and never has its original meaning :)
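
For comparison, the German two-character replacement mentioned above could be sketched as follows. Only the "ä" to "ae" case comes from this comment; the other entries are the conventional German spellings and are filled in here as an assumption. The class name is hypothetical.

```java
import java.util.Map;

// Hypothetical demo class, not part of Lucene.
final class GermanFoldingDemo {

    // Conventional German two-letter spellings. Only a-umlaut -> "ae" is
    // taken from the comment above; the rest is an assumption.
    private static final Map<Character, String> GERMAN = Map.of(
            '\u00e4', "ae", '\u00f6', "oe", '\u00fc', "ue",
            '\u00c4', "Ae", '\u00d6', "Oe", '\u00dc', "Ue",
            '\u00df', "ss");

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = GERMAN.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00fc is "u with diaeresis".
        System.out.println(fold("M\u00fcller")); // prints "Mueller"
    }
}
```

Note the trade-off: an index folded this way matches a query for "Mueller" but not the "Muller" that a user without the umlaut key is likely to type, which is the use case the accent filter targets.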



[jira] Issue Comment Edited: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535049 ] 

markrmiller@gmail.com edited comment on LUCENE-1029 at 10/15/07 8:58 PM:
---------------------------------------------------------------

My comment about stemming was not meant to compare a stemmer to a diacritical stripper, but rather to point out that the result of such an operation does not necessarily have to create something 'legal' (just as a stemmer does not create 'legal' words). This was in response to the comment 'Some of the ISOLatin1AccentFilter are legal while others are illegal. '

Your point about semantic meaning is well taken, but was not intended to be part of the comparison I was going for. My bad. 

I think that the fact that ripping diacriticals can change the meaning of words goes without saying...otherwise, why even have them in the language? As Uwe said, the main motivating factor is to allow easy entry with the keyboard of another language. Of course this must come with a compromise. Other search engines I have seen offer the exact functionality of this class. (CPL, SearchServer, etc)

Literally, this thing is called an accent filter...letters go in, accents come off. Doing more really does seem like a job for another class. If I can borrow a word I didn't know from DM Smith, transliteration seems to go beyond an ISOLatin1AccentFilter. This is a tough sell I know -- programmers seem to push the definition of filter to its limits and IMHO into the realm of transform/translate.

Anyhow...I apologize for beating a dead horse...<g>

*edit* better mention that I realize my filter comment above does not necessarily fit with Lucene's already well defined use of the word Filter. Not looking to start that battle.



Re: [jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
This gets even more complicated when you throw Polish in. We do have diacritics 
(such as ó, ż, ź or ą)

http://www.fileformat.info/info/unicode/char/0105/index.htm

but we _also_ have things like "ł" (l with a stroke):

http://www.fileformat.info/info/unicode/char/0142/index.htm

I don't think the stroke in "ł" would qualify as a diacritic mark... to me it's 
more like a different letter.

Anyway, most Poles are _very_ comfortable with writing e-mails and querying 
search engines with stripped diacritics (and the letter ł replaced by l) even if 
this often leads to change of meaning of the original word. I guess it is so 
because typing diacritics slows you down a bit. Pragmatism.

Dawid
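
A small sketch of the distinction drawn above, not taken from Lucene: it uses java.text.Normalizer to decompose characters (NFD) and then drops combining marks. The class and method names are made up for illustration. Under Unicode canonical decomposition "ą" splits into "a" plus a combining ogonek, whereas "ł" has no decomposition, so an approach that only strips marks leaves it untouched.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not a Lucene class.
final class MarkStripper {
    // Decompose to NFD, then remove every combining mark (Unicode category M).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on source file encoding.
        String aOgonek = "\u0105"; // ą
        String lStroke = "\u0142"; // ł
        System.out.println(stripMarks(aOgonek).equals("a"));     // true
        System.out.println(stripMarks(lStroke).equals(lStroke)); // true: ł is left as-is
    }
}
```

Replacing "ł" with "l", as described above, would therefore need an explicit mapping on top of mark stripping.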




Re: [jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by Doug Cutting <cu...@apache.org>.
Mark Miller wrote:
> My point was that Wikipedia (the link I gave and other definitions I 
> saw) seem to refer to the little markings around a letter as 
> diacriticals whether they mean the letter is a completely different 
> letter or not (see the part mentioning Scandinavian, as well as possibly 
> Webster's dictionary). Marko disputed this in his last comment, and I 
> don't know that he is wrong. All I have seen seems to indicate this though.

It is confusing.

 From http://en.wikipedia.org/wiki/Diacritic:

   A diacritical mark or diacritic, also called an accent, is a small
   sign added to a letter to alter pronunciation or to distinguish
   between similar words.

In Swedish these are not added to a letter: they're part of the letter, 
so they're not diacritics.  Later in the page it says:

   The Scandinavian languages, by contrast, treat the characters with
   diacritics ä, ö and å as new and separate letters of the alphabet,
   and sort them after z.

Perhaps they could more properly say something like, "Scandinavian 
languages treat as separate letters things that other languages consider 
letters with diacritics".

Webster defines a diacritic as:

   a mark near or through an orthographic or phonetic character or
   combination of characters indicating a phonetic value different
   from that given the unmarked or otherwise marked element

Which points to the diacritic as a marker, but in Swedish the dots are 
no more a marker than the upright on a 'b' is a marker to pronounce it 
differently than an 'o'.

Ah, it's fun to be pedantic in the morning!

Doug



Re: [jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by Mark Miller <ma...@gmail.com>.
I feel like a fool continuing this debate, being the least intelligent 
guy in the room, but here goes:

My point was that Wikipedia (the link I gave and other definitions I 
saw) seem to refer to the little markings around a letter as 
diacriticals whether they mean the letter is a completely different 
letter or not (see the part mentioning Scandinavian, as well as possibly 
Webster's dictionary). Marko disputed this in his last comment, and I 
don't know that he is wrong. All I have seen seems to indicate this though.

I also dispute this sentence in the new javadoc patch proposed:

*It will also be impossible to search for the word in its original form.*

If you use the same analyzer at index and query time, there should be no such problem.


Doug Cutting wrote:
> Mark Miller wrote:
>> I wouldn't pretend to know the truth on this matter, but you might 
>> update the wikipedia article http://en.wikipedia.org/wiki/Diacritic 
>> if you do, as it does not agree with your comments.
>
> Wikipedia says, "Swedish uses characters identical to a-diaeresis (ä) 
> and o-diaeresis (ö)".  This is a little ambiguous.  Identical how?  I 
> think they mean "visually identical to".  The distinction is whether 
> Swedish treats 'ä' as a variant of 'a' or as a completely separate 
> letter.  The latter is the case.
>
> http://en.wikipedia.org/wiki/Umlaut_(diacritic) states:
>
>   Swedish [...] treat[s] them as independent letters.
>
> Doug


Re: [jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by Doug Cutting <cu...@apache.org>.
Mark Miller wrote:
> I wouldn't pretend to know the truth on this matter, but you might 
> update the wikipedia article http://en.wikipedia.org/wiki/Diacritic if 
> you do, as it does not agree with your comments.

Wikipedia says, "Swedish uses characters identical to a-diaeresis (ä) 
and o-diaeresis (ö)".  This is a little ambiguous.  Identical how?  I 
think they mean "visually identical to".  The distinction is whether 
Swedish treats 'ä' as a variant of 'a' or as a completely separate 
letter.  The latter is the case.

http://en.wikipedia.org/wiki/Umlaut_(diacritic) states:

   Swedish [...] treat[s] them as independent letters.

Doug



Re: [jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by Mark Miller <ma...@gmail.com>.
I wouldn't pretend to know the truth on this matter, but you might 
update the wikipedia article http://en.wikipedia.org/wiki/Diacritic if 
you do, as it does not agree with your comments.

Marko Asplund (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Marko Asplund updated LUCENE-1029:
> ----------------------------------
>
>     Attachment: ISOLatin1AccentFilter-javadoc.patch
>
> I think the class javadoc is very misleading so I'm attaching a documentation patch.
>
> For one, the Scandinavian characters do not contain diacritical marks or accents.  The dots in ä and ö as well as the ring in å are considered part of the letter, not diacritics. The class name implies that it does something with accents, so for this reason I would not have expected the class to replace the Scandinavian characters.
>
> The javadoc also says it replaces characters with their "equivalent" ASCII characters. There are no equivalents for the Scandinavian characters.


[jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Marko Asplund (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marko Asplund updated LUCENE-1029:
----------------------------------

    Attachment: ISOLatin1AccentFilter-javadoc.patch

I think the class javadoc is very misleading so I'm attaching a documentation patch.

For one, the Scandinavian characters do not contain diacritical marks or accents.  The dots in ä and ö as well as the ring in å are considered part of the letter, not diacritics. The class name implies that it does something with accents, so for this reason I would not have expected the class to replace the Scandinavian characters.

The javadoc also says it replaces characters with their "equivalent" ASCII characters. There are no equivalents for the Scandinavian characters.




[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Digy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536510 ] 

Digy commented on LUCENE-1029:
------------------------------


I think the phrase

+ * Please note that the replacements performed by this filter will result in words changing their original semantic meaning in many cases.<br>
>>>> + * It will also be impossible to search for the word in its original form. <<<<<

is wrong.

If you index and then search for the text "kön" using the same analyzer, one that uses ISOLatin1AccentFilter, you will get the result. Who cares that it is stored as "kon" in the index? (Of course, searching for "kön" will also return results containing "kon", but there are a lot of cases where getting more is better than getting nothing.)

DIGY
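
The behaviour described above can be sketched with a toy in-memory index, assuming a stand-in folding step built on java.text.Normalizer rather than the real ISOLatin1AccentFilter. Everything here (class name, methods, the map-based "index") is hypothetical and only illustrates why folding at both index time and query time makes "kön" findable while also matching "kon".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index whose analyzer strips accents; not Lucene code.
final class FoldedIndexDemo {
    // Maps a folded term to the original words that produced it.
    private final Map<String, List<String>> postings = new HashMap<>();

    // Stand-in for the accent filter: NFD-decompose, then drop combining marks.
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Index time: store the original word under its folded form.
    void add(String word) {
        postings.computeIfAbsent(fold(word), k -> new ArrayList<>()).add(word);
    }

    // Query time: fold the query the same way before the lookup.
    List<String> search(String query) {
        List<String> hits = postings.get(fold(query));
        return hits == null ? Collections.<String>emptyList() : hits;
    }

    public static void main(String[] args) {
        FoldedIndexDemo index = new FoldedIndexDemo();
        index.add("k\u00f6n"); // kön
        index.add("kon");
        // Either spelling of the query returns both indexed words.
        System.out.println(index.search("k\u00f6n")); // [kön, kon]
        System.out.println(index.search("kon"));      // [kön, kon]
    }
}
```

The original form is searchable, but it can no longer be told apart from the folded form, which is the trade-off under discussion.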



[jira] Issue Comment Edited: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534804 ] 

markrmiller@gmail.com edited comment on LUCENE-1029 at 10/15/07 4:47 AM:
---------------------------------------------------------------

I think Uwe nailed this one. Stripping accents in general is just not "legal". But many times it is desirable. This filter does that for you. It goes without saying that if you strip the accent you change the meaning...likewise, when you stem a word you create illegal words...

p.s.

Changing this filter is not really a great option as it would break indexes out there that use it. I think the better idea would be to create a new stripper that has the alternate functionality that you are thinking of -- rather than stripping accents, replace accented characters with letters that approximate the original sound/meaning.

      was (Author: markrmiller@gmail.com):
    I think Uwe nailed this one. Stripping accents in general is just not "legal". But many times it is desirable. This filter does that for you. It goes without saying that if you strip the accent you change the meaning...likewise, when you stem a word you create illegal words...
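
A rough sketch of what such an alternate replacer might look like, as opposed to plain accent stripping. The class name and the mapping table are assumptions made for illustration: the table uses German-style conventions (ä to "ae" and so on) only as an example, and elsewhere in this thread it is argued that the Scandinavian letters have no equivalent at all, so a real table would have to be chosen per language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transliterating replacer; the mapping table is an assumption.
final class TransliterationSketch {
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        // German-style conventions, picked only as an example.
        TABLE.put('\u00e4', "ae"); // ä
        TABLE.put('\u00f6', "oe"); // ö
        TABLE.put('\u00fc', "ue"); // ü
        TABLE.put('\u00df', "ss"); // ß
    }

    // Replace each mapped character with its multi-letter approximation;
    // characters without a table entry are copied unchanged.
    static String transliterate(String s) {
        StringBuilder out = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String mapped = TABLE.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("sch\u00f6n")); // schoen
    }
}
```

Because this lives in a separate class, existing indexes built with the current filter would not be affected.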
  


[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597703#action_12597703 ] 

Otis Gospodnetic commented on LUCENE-1029:
------------------------------------------

I only skimmed the comments.
Has anyone tried Hiroaki's patch?  Does it satisfy your needs?  Marko, Uwe, Karl, Mark?



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535049 ] 

Mark Miller commented on LUCENE-1029:
-------------------------------------

My comment about stemming was not meant to compare a stemmer to a diacritical stripper, but rather to point out that the result of such an operation does not necessarily have to create something 'legal' (just as a stemmer does not create 'legal' words). This was in response to the comment 'Some of the ISOLatin1AccentFilter are legal while others are illegal. '

Your point about semantic meaning is well taken, but was not intended to be part of the comparison I was going for. My bad. 

I think that the fact that ripping diacriticals can change the meaning of words goes without saying...otherwise, why even have them in the language? As Uwe said, the main motivating factor is to allow easy entry with the keyboard of another language. Of course this must come with a compromise. Other search engines I have seen offer the exact functionality of this class. (CPL, SearchServer, etc)

Literally, this thing is called an accent filter...letters go in, accents come off. Doing more really does seem like a job for another class. If I can borrow a word I didn't know from DM Smith, transliteration seems to go beyond an ISOLatin1AccentFilter. This is a tough sell I know -- programmers seem to push the definition of filter to its limits and IMHO into the realm of transform/translate.

Anyhow...I apologize for beating a dead horse...<g>



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534814 ] 

Mark Miller commented on LUCENE-1029:
-------------------------------------

> With the accent filter, running the Swedish word "kön" through the filter would create "kon". The first means "gender" and the second "cow". That would not be acceptable.

I am feeling lazy right now, but it seems to me you could find a similar rare stemming example (e.g. something that means something else in its stemmed form). The process is algorithmic after all, and there are many languages with plenty of words out there.

Regardless, it doesn't seem this filter claims it will maintain the meaning of "kön"...rather it will strip the '..' off the top of the 'o'. It's a brute-force and somewhat dangerous filter from the get-go...stripping accents is not a valid language operation that I know of.

I'll leave it at that from my side of the argument <g> Let the Lucene gods speak.



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Marko Asplund (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534800 ] 

Marko Asplund commented on LUCENE-1029:
---------------------------------------

I have to disagree, I think it's a problem that the filter makes illegal character replacements.
Soundex match is different since by definition it's all about non-exact or approximate matching.

In some languages accented characters may have equivalent unaccented characters with which the accented ones may be replaced without change or loss of meaning.
Some of the ISOLatin1AccentFilter replacements are legal while others are illegal. The illegal ones should be fixed.




[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578195#action_12578195 ] 

Hiroaki Kawai commented on LUCENE-1029:
---------------------------------------

I'd like to comment that we have another tool for this. :-)

java.text.Collator can collate text, and each instance is based on a Locale, wow! So, if we use a collator, you might get a better query result, i.e. less search noise; for example, German "ä" might match "ae".

I'd like to submit a patch later.
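
Until that patch is available, here is a minimal sketch of the general idea, not the patch itself: ask a locale-specific java.text.Collator, set to PRIMARY strength, whether two strings count as the same. The class and method names are hypothetical, and the outcome depends on the collation data shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative only; results depend on the collation data shipped with the JDK.
final class CollatorSketch {
    // True if the locale's collation rules see no primary-level difference
    // between the two strings (accent and case differences are typically
    // below primary level, unless the locale treats the letter as distinct).
    static boolean sameAtPrimaryStrength(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4"; // ä
        // English rules treat the diaeresis as a secondary difference.
        System.out.println(sameAtPrimaryStrength(Locale.ENGLISH, "a", aUmlaut));
        // Swedish rules may treat ä as a separate letter; check on your JDK.
        System.out.println(sameAtPrimaryStrength(Locale.forLanguageTag("sv"), "a", aUmlaut));
    }
}
```

On a JDK that ships Swedish collation data, the second call would be expected to report a difference, since Swedish sorts ä as its own letter, but that is worth verifying on the target runtime rather than assuming.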



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Marko Asplund (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536589 ] 

Marko Asplund commented on LUCENE-1029:
---------------------------------------

Perhaps it could be expressed more accurately using the following sentence:

It will be impossible to search for a word in its original form without also matching filtered forms of the word at the same time.
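
A small sketch of that effect, assuming a toy index rather than Lucene itself: both the indexed words and the query are folded by stripping combining marks (via java.text.Normalizer), so the original and the filtered form share one index key. FoldedIndexDemo and its methods are hypothetical names used only for this illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FoldedIndexDemo {

    // Strip accents: decompose to NFD, then drop all nonspacing marks.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    // Build a map from folded token to the original words that produced it.
    static Map<String, List<String>> index(List<String> words) {
        Map<String, List<String>> byFolded = new TreeMap<>();
        for (String word : words) {
            byFolded.computeIfAbsent(fold(word), k -> new ArrayList<>()).add(word);
        }
        return byFolded;
    }

    // The query goes through the same folding as the indexed words.
    static List<String> search(Map<String, List<String>> index, String query) {
        return index.getOrDefault(fold(query), Collections.emptyList());
    }

    public static void main(String[] args) {
        String weather = "s\u00e4\u00e4"; // Finnish word with two a-umlauts
        Map<String, List<String>> idx = index(List.of(weather, "saa"));
        // Searching for either form returns both words.
        System.out.println(search(idx, weather).size());
        System.out.println(search(idx, "saa").size());
    }
}
```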




Re: [jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by Mark Miller <ma...@gmail.com>.
> If you are to compare with stemmers, consider that these create unique tokens that do not interfere with semantic meaning.
>   
Not starting anything here again, but it took me so darn long to find
a word whose semantic meaning the Porter stemmer kills that I had to
share. That damn algorithm is amazing... I was coming to the conclusion
that it was absolutely perfect on the English language... until, after a
couple of days of searching, I found that "international" goes to
"intern". Eureka! Though a hollow victory at best. That algorithm is
pretty amazing...



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534810 ] 

Karl Wettin commented on LUCENE-1029:
-------------------------------------

I'm with Marko on this.

If you are to compare with stemmers, consider that these create unique tokens that do not interfere with semantic meaning.

With the accent filter, running the Swedish word "kön" through the filter would create "kon". The first means "gender" and the second "cow". That would not be acceptable.

I say this filter needs to be more configurable.
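
One way such configurability could look, as a rough sketch outside of Lucene's TokenFilter API: the caller passes a set of code points that must be left alone, and everything else has its combining marks stripped. ConfigurableAccentFolder is a hypothetical class, not an existing Lucene one, and the approach (Unicode NFD decomposition via java.text.Normalizer) is swapped in for the filter's own replacement table.

```java
import java.text.Normalizer;
import java.util.Set;

final class ConfigurableAccentFolder {

    // Code points that must pass through unchanged.
    private final Set<Integer> keep;

    ConfigurableAccentFolder(Set<Integer> keep) {
        this.keep = keep;
    }

    String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // Compose first so protected letters show up as single code points.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        composed.codePoints().forEach(cp -> {
            if (keep.contains(cp)) {
                out.appendCodePoint(cp);
                return;
            }
            // Decompose this one character and drop its nonspacing marks.
            String single = new String(Character.toChars(cp));
            Normalizer.normalize(single, Normalizer.Form.NFD)
                .codePoints()
                .filter(c -> Character.getType(c) != Character.NON_SPACING_MARK)
                .forEach(out::appendCodePoint);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Protect lowercase a-umlaut, o-umlaut and a-ring (a Swedish/Finnish setup).
        ConfigurableAccentFolder folder =
            new ConfigurableAccentFolder(Set.of(0x00E4, 0x00F6, 0x00E5));
        // The e-acute is folded, the o-umlaut is kept.
        System.out.println(folder.fold("caf\u00e9 k\u00f6n"));
    }
}
```

Letters such as ø and æ have no canonical decomposition, so this sketch leaves them unchanged whether or not they are in the protected set.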





[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "Marko Asplund (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597883#action_12597883 ] 

Marko Asplund commented on LUCENE-1029:
---------------------------------------

I switched projects some time ago, so it'll take a while for me to set up the tests.
I'll try to get around to doing this at a later time.


> Illegal character replacements in ISOLatin1AccentFilter
> -------------------------------------------------------
>
>                 Key: LUCENE-1029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Marko Asplund
>         Attachments: ISOLatin1AccentFilter-by-Collator.patch, ISOLatin1AccentFilter-javadoc.patch



[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534839 ] 

DM Smith commented on LUCENE-1029:
----------------------------------

Transliteration rules are language dependent. I suggest that the documentation for the ISOLatin1AccentFilter be adjusted to match its behavior, stating that it strips diacritics from characters and does further substitutions (giving the precise list) and that it does not do transliteration. Further, it should give examples, like those in the comments above, showing that such stripping may produce results that are entirely inappropriate.

ICU4J can be used to do per language transliteration.  IIRC, dependency on third party code is allowed in contrib. So, it would be appropriate for such filters to be in contrib.

