Posted to issues@openoffice.apache.org by bu...@apache.org on 2016/03/09 13:57:02 UTC

[Issue 126863] New: en_AU.dic has UTF-8 errors

https://bz.apache.org/ooo/show_bug.cgi?id=126863

          Issue ID: 126863
        Issue Type: DEFECT
           Summary: en_AU.dic has UTF-8 errors
           Product: General
           Version: 4.2.0-dev
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: Normal
          Priority: P5 (lowest)
         Component: spell checking
          Assignee: issues@openoffice.apache.org
          Reporter: ioot@yahoo.com

Regarding the en_AU.dic extension for Australian spelling: a number of
spellings are corrupted. This appears to have been caused by an incorrect
conversion to or from UTF-8 when new words were added or the file was edited in
2008, and the errors persist in the current version of en_AU.dic. I would
fix these errors myself, but surely there is a maintainer to contact about
this issue? Has it occurred with other dictionaries?

I see two options: delete all entries with characters that are not Australian
English, or change all the bad characters to good ones. Note that the
character � indicates an error, not a particular character. In other words, we
see variants such as pi�ata (should be piñata) and clich� (should be cliché).
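
A minimal sketch of why a single � can stand for different letters. It assumes
(the report does not confirm this) that the accented words were stored as
ISO-8859-1 bytes and later read as UTF-8. Under that assumption ñ and é are
each a single byte that is invalid in UTF-8, and a lenient decoder substitutes
U+FFFD for both. The function name is illustrative only.

```python
def misread_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for every invalid sequence."""
    return raw.decode("utf-8", errors="replace")


# Assumption: the words as a Latin-1 (ISO-8859-1) editor would have saved them.
for word in ("piñata", "cliché"):
    raw = word.encode("latin-1")
    # Both accented letters collapse into the same replacement character.
    print(word, "->", misread_as_utf8(raw))
```

Once the file is saved in this state the original byte is gone, so the
replacement character alone cannot tell ñ from é.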

I tracked this down through various versions of the en_AU.dic
http://extensions.services.openoffice.org/en/project/AustralianDictionary

Here is some analysis of version and line numbers of 2 words as they changed
over time. This problem is rife in the newest version of en_AU.dic, with at
least 211 occurrences of the ¿ character, which indicates a failed conversion.
The word cliche, for example, is misrepresented over time in different ways.
Note that many words containing the � character never appeared correctly in
the en_AU.dic file; the word cliché, by contrast, was originally correct and
was corrupted later.

Version 2016.03.01 (Newest)
1700: clich�/SM

Version: 2010.03.16
1700: clich�/SM

Version: 2008.11.25
1700: clich�/SM

Version: 2008.10.3
1702: cliché/MS
1703: clich�/SM

Version: 1.0.0
1523: cliché/MS

With reference to files at:
http://extensions.services.openoffice.org/en/project/english-dictionaries-apache-openoffice
http://extensions.services.openoffice.org/en/project/AustralianDictionary

-- 
You are receiving this mail because:
You are the assignee for the issue.

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

--- Comment #8 from Matthias Seidel <ms...@apache.org> ---
Hi,

thank you for the confirmation.

I want to close some old issues here. ;-)

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

Tom Anderson <io...@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ioot@yahoo.com

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

--- Comment #7 from Tom Anderson <io...@yahoo.com> ---
(In reply to Matthias Seidel from comment #6)
> I assume this is fixed, at least I couldn't find these errors in the latest
> en_AU dictionary...
> 
> If you disagree, feel free to reopen.

I also checked and the errors are no longer there. All the words with corrupted
accents now have no accents at all, e.g. "cliché/MS" is now "cliche/MDS".
Should work just fine.
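
For illustration, accent-free spellings like these can be derived from the
correct ones with Unicode decomposition. This is only a sketch of one way to
do it; the thread does not say how the released dictionary was actually
produced, and the function name is hypothetical. It also leaves the affix flags
after the slash untouched, so it does not account for the MS to MDS change.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("cliché", "piñata", "tête-à-tête"):
    print(word, "->", strip_accents(word))
```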

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

--- Comment #3 from Andrea Pescetti <pe...@apache.org> ---
About 200 words need fixing. The following search lists all "suspicious"
entries, each with its line number in the en_AU.dic file. It is likely that
some smart find/replace can fix the file easily.

$ grep -n -v -e "^[a-zA-Z0-9/'-\. \!]*$" en_AU.dic 
1104:bour�e
1110:boutonni�re/SM
1626:ch�telaine/SM
1700:clich�/SM
1701:clich�d
1713:cloisonn�/M
1882:Concepci�n/M
2083:coul�e/SM
2202:cr�pe/SM
2468:derri�re/S
2533:diamant�
2604:discoth�que/SM
2647:d�mod�
2721:d�pays�e
2761:D�sseldorf
2762:d'�tre
3010:entrec�te/SM
3011:entrep�t/S
3786:glac�/DGS
4919:kinderg�rtner/SM
5209:litt�rateur/S
5331:macram�/MS
5458:matin�e/S
5461:ma�tre
5652:�migr�/S
5730:m�nage
5985:n�e
6047:n�glig�
6196:Noun�a
6759:pi�on/S
7123:pur�eing
7480:r�gime/SM
7540:r�le/MS
8002:should’ve
8190:S�o
8208:soign�
8209:soir�e/SM
8731:Tannh�user/M
9119:t�te-b�che
9120:t�te-�-t�te
10250:appliqu�/SMG
10251:appliqu�d
10381:attach�/MS
10808:Bogot�/M
11388:ch�teau/MS
11445:�clat/M
11660:confr�re/MS
11819:cort�ge/SM
11896:cr�che/MS
11940:crudit�s
12044:Dana�
12075:d�colletage/S
12076:d�coupage
12391:divorc�
12392:divorc�e/SM
12472:d�pays�
12477:d�railleur/SM
12505:d�tente/S
12924:expos�/SM
12954:fa�ade/MS
13301:Fran�oise/M
13687:Gr�newald/M
13702:G�teborg/M
14541:jalape�o/S
15118:lyc�e
15228:manqu�/M
15294:mat�riel/MS
15527:m�l�e/SM
15528:m�moire
15663:m�tier/S
15784:na�vety/S
16016:Noum�a/M
16370:pass�/M
16824:premi�re/DMGS
16834:p�res/F
16995:pur�e/DSM
17101:raison d'�tre
17465:r�sum�/S
17491:R�union/M
17801:se�orita/SM
18913:touch�
19393:voil�
19772:abb�/S
19872:adi�s
20700:blas�
21017:caf�/SM
21838:cr�pey
21978:d�but/S
21980:d�collet�
22316:d�nouement
22492:Dvor�k/M
22929:Faberg�/M
23037:fianc�/MS
23275:Fran�ois
23589:gr�ce
23932:H�loise/M
24445:jardini�re/MS
24494:Jos�/M
24703:lam�
24717:�lan/M
24952:Lom�/M
25096:Mallarm�/M
25383:might’ve
25709:naivet�/MS
25745:na�vet�/S
26162:outr�
27135:recherch�
27421:ros�
27443:rou�/SM
27541:s�ance/MS
27776:se�ora/SM
27777:Se�ora/M
28142:soup�on/SM
28900:Tom�/M
29414:vis-�-vis
30136:anim�
30291:ar�te/MS
30341:Asunci�n/M
30549:Bart�k/M
30579:b�che
30854:boucl�
30935:bric-�-brac
31646:comp�re/M
31723:consomm�/S
32070:d�class�
32071:d�class�e
32073:d�cor/MS
32369:d�j�
32406:doppelg�nger
32633:Elys�e/M
32762:Esterh�zy/M
32920:fa�ence/S
33018:fianc�e/MS
33269:frapp�
33422:G�del/M
33484:Gew�rztraminer
33731:habitu�/SM
34305:ing�nue/S
35087:Lumi�re/M
35494:M�nchhausen/M
35719:na�veness
36676:porti�re/SM
36721:pr�cis/dMS
36869:prot�g�/SM
36870:prot�g�e/S
36903:p�t�/M
37244:rep�chage
37552:saut�/GSD
37596:Schr�dinger/M
37754:se�ores
38976:T�rshavn/M
39263:Vel�squez/M
39308:vicu�a/S
39889:aide-m�moire
40658:blowhole
40864:b�te/S
40865:b�tise
41015:canap�/S
41322:�clair/MS
41360:client�le/M
41490:communiqu�/SM
41741:coup�/SM
41840:cro�ton/SM
41853:C�te
41960:d�b�cle/MS
41962:d�butante/MS
41963:d�collet�e
42395:d�shabill�'s
42678:entr�e/S
43040:f�hn
43041:f�hrer/SM
43108:flamb�/DSG
43315:f�te/SM
43385:�gar
43397:gar�on/MS
43668:Gruy�re
44086:how’d
45140:ma�ana/M
45216:man�ge/GDS
45346:M�bius
45604:m�lange
45803:n�
45854:na�ve/Y
45881:neglig�e/SM
46200:ol�
46501:pass�e
46688:pi�ata/S
47063:Proven�al
47489:risqu�
47872:se�or/M
48226:souffl�/SM
49086:t�te
49797:Yaound�/M
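
A rough Python counterpart to the search above, as a sketch only. It uses a
simpler criterion than the grep pattern: it flags every entry that contains
any non-ASCII character, so it also reports correctly accented words. The
function name and the sample entries are illustrative.

```python
from typing import Iterable, Iterator, Tuple


def suspicious_entries(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, entry) for each entry with a non-ASCII character."""
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\n")
        if not entry.isascii():
            yield number, entry


# Hypothetical sample; "\ufffd" is the replacement character shown as � above.
sample = ["cliche/MDS\n", "clich\ufffd/SM\n", "blowhole\n", "pi\ufffdata/S\n"]
for number, entry in suspicious_entries(sample):
    print(f"{number}:{entry}")
```

To scan the real file, the lines could come from
open("en_AU.dic", encoding="utf-8", errors="replace"), which turns any invalid
byte into U+FFFD instead of raising an error.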

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

Matthias Seidel <ms...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mseidel@apache.org

--- Comment #5 from Matthias Seidel <ms...@apache.org> ---
Are these fixes in the latest en_AU dictionary?

If yes, can we close this issue?

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

marcoagpinto <ma...@mail.telepac.pt> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jza@oooes.org,
                   |                            |pescetti@apache.org

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

Marcus <ma...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marcoagpinto@mail.telepac.pt

--- Comment #1 from Marcus <ma...@apache.org> ---
@Marco:
Please can you have a look? Maybe you can explain and solve the issue. Thanks.

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

--- Comment #4 from marcoagpinto <ma...@mail.telepac.pt> ---
Created attachment 85357
  --> https://bz.apache.org/ooo/attachment.cgi?id=85357&action=edit
en_AU - accents fixes

Here is the fixed .DIC.

I auto-replaced all corrupted characters with an "é" and then had to check the
entire .DIC, because over 90% of the resulting é's should have been other
accented characters.

It seems that every corrupted character had been replaced by the same symbol.

Please tell me if you find any invalid words with accents.
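
A possible way to reduce the manual checking described above, sketched under
the assumption that a list of known-good spellings is available (for example
from an older, uncorrupted dictionary version). Each � is treated as a wildcard
for exactly one character. This is not the procedure used for the attachment;
the function name and the reference list are hypothetical.

```python
import re
from typing import Iterable, List

REPLACEMENT = "\ufffd"  # the character displayed as � in the listings


def candidates(corrupted: str, known_good: Iterable[str]) -> List[str]:
    """Return known-good words that match when each U+FFFD is any one character."""
    pattern = re.compile(
        "".join("." if ch == REPLACEMENT else re.escape(ch) for ch in corrupted)
    )
    return [word for word in known_good if pattern.fullmatch(word)]


# Hypothetical reference list of correct spellings.
reference = ["cliché", "piñata", "protégé", "rôle", "rule"]
print(candidates("clich\ufffd", reference))
print(candidates("r\ufffdle", reference))
```

A corrupted entry with exactly one candidate could be replaced automatically,
while entries with several candidates (or none) would still need a manual
check. The affix flags after the slash would have to be split off before
matching, e.g. with entry.split("/", 1).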

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

marcoagpinto <ma...@mail.telepac.pt> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Latest|---                         |4.1.2
    Confirmation in|                            |
             Status|UNCONFIRMED                 |CONFIRMED
     Ever confirmed|0                           |1

--- Comment #2 from marcoagpinto <ma...@mail.telepac.pt> ---
I have opened the .AFF + .DIC and I confirm that there are corrupted words.

The person who converted the .DIC to UTF-8 probably did something wrong in the
procedure.
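
For reference, a conversion that avoids this kind of damage could look like the
sketch below. It assumes the legacy encoding was ISO-8859-1, which the thread
does not state, and the function name is illustrative. The input is re-encoded
only when it is not already valid UTF-8, so a second run does not double-encode
the accents.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8, re-encoding only if it is not valid UTF-8 already."""
    try:
        raw.decode("utf-8")  # strict decoding: raises on any invalid sequence
    except UnicodeDecodeError:
        # Assumption: anything that is not UTF-8 is in the legacy encoding.
        return raw.decode(legacy_encoding).encode("utf-8")
    return raw


print(to_utf8(b"clich\xe9/MS"))                 # legacy bytes get converted
print(to_utf8("cliché/MS".encode("utf-8")))     # valid UTF-8 is left alone
```

This only helps before the damage is done: once an invalid byte has been saved
as U+FFFD, the original letter cannot be recovered from the file itself. If the
dictionary follows the usual Hunspell convention, the SET line in the .AFF
would also need to name the new encoding.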

I am the maintainer of the English dictionaries but I only add words to the
British one.

The other dictionaries are only packed by me in the monthly OXT.

@Tom, is there any chance you could fix the words since you know what to search
for?

If you do it, please also update the .AFF and README with your name and state
that you fixed the issue.

Then, ZIP and upload the files here and in the monthly update it will be fixed
for everyone.

Thanks!

:-)

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

Matthias Seidel <ms...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CONFIRMED                   |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Matthias Seidel <ms...@apache.org> ---
I assume this is fixed, at least I couldn't find these errors in the latest
en_AU dictionary...

If you disagree, feel free to reopen.

[Issue 126863] en_AU.dic has UTF-8 errors

Posted by bu...@apache.org.
https://bz.apache.org/ooo/show_bug.cgi?id=126863

Matthias Seidel <ms...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED
