Posted to dev@openoffice.apache.org by "Marco A.G.Pinto" <ma...@mail.telepac.pt> on 2013/06/23 23:26:29 UTC
Converting Thesaurus (.DAT), Spellers (.DIC + .AFF) to UTF-8: question
Hello!
I am preparing to release V2.0 of my tool "Proofing Tool GUI" on the 1st
of July.
I have optimised the code a lot and, in Windows, it just takes a few
seconds to open the big *th_en_US_v2.dat* (18 MB).
I still have one issue on Linux: a long pause after opening a
thesaurus or dictionary. I have been able to locate what is causing it,
but even so a pause remains. For example, the PT-pt thesaurus (12,940
words) takes one second to open on my Ubuntu 12 x86 VM, but then there
is a four-minute pause before it can be edited. I tried changing the
code and it now takes only 40 seconds on my VM. That is still slow but,
on a real Linux system, I believe it will be much faster. Also, the
good news is that I believe my tool now works on all Linux
distributions.
My question: the files need to be in UTF-8 to be edited with Proofing
Tool GUI, and in the user guide I was explaining how to convert them
using the UniRed editor *[1]*. Then I found out that UniRed messes up
the number of lines in some files, so I looked for an alternative. A
friend of mine who is an expert in coding suggested Notepad++ *[2]*;
I ran a few tests with it and it works perfectly, so I need to update
the user guide to explain how to use it. But I have a question: when I
convert the files to UTF-8, which option should I use?
1) Convert to UTF-8 without BOM
2) Convert to UTF-8
I am not 100% sure which one is correct.
[1] http://www.esperanto.mv.ru/UniRed/ENG/index.html
[2] http://notepad-plus-plus.org
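[Editor's note: for readers wondering what the two Notepad++ options
actually produce, here is a small Python sketch, not part of the
original thread. Assuming the second option writes a byte order mark
(as Notepad++ of that era did), the resulting files differ only by
three leading bytes, EF BB BF:]

```python
# Sample dictionary entry with a non-ASCII character (illustrative only).
text = "dicionário"

without_bom = text.encode("utf-8")      # option 1: plain UTF-8
with_bom = text.encode("utf-8-sig")     # "utf-8-sig" prepends the BOM

# The BOM-prefixed file is the same bytes with EF BB BF in front.
assert with_bom == b"\xef\xbb\xbf" + without_bom
```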
Thank you very much!
Kind regards from,
Marco A.G.Pinto
-----------------------
--
Re: Converting Thesaurus (.DAT), Spellers (.DIC + .AFF) to UTF-8:
question
Posted by Andrea Pescetti <pe...@apache.org>.
Marco A.G.Pinto wrote:
> when I convert the files to UTF-8, which option shall I use?:
> 1) Convert to UTF-8 without BOM
> 2) Convert to UTF-8
Without BOM is fine. The Italian dictionary still uses ISO-8859 for
historical reasons, but I've seen several working dictionary files
encoded as UTF-8 without BOM (it is likely that a BOM would work too,
but I didn't check). Also, on Linux-based systems you will usually have
many tools able to convert a text file to UTF-8, so you needn't
recommend a specific tool in that case.
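[Editor's note: a minimal sketch of such a conversion, not from the
thread; the file names are hypothetical. On Linux the usual one-liner
is `iconv -f ISO-8859-1 -t UTF-8 in.dic > out.dic`; the Python below
does the same and writes no BOM unless "utf-8-sig" is requested:]

```python
# Create a sample dictionary file in ISO-8859-1, the encoding the
# Italian dictionary still uses (hypothetical file names).
with open("sample.dic", "wb") as f:
    f.write("coração\n".encode("iso-8859-1"))

# Read it back as ISO-8859-1 and re-write it as UTF-8 (no BOM).
with open("sample.dic", encoding="iso-8859-1") as f:
    text = f.read()
with open("sample_utf8.dic", "w", encoding="utf-8") as f:
    f.write(text)

# Verify: valid UTF-8, same content, no leading BOM.
with open("sample_utf8.dic", "rb") as f:
    raw = f.read()
assert not raw.startswith(b"\xef\xbb\xbf")
assert raw.decode("utf-8") == "coração\n"
```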
Regards,
Andrea.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@openoffice.apache.org
For additional commands, e-mail: dev-help@openoffice.apache.org