Posted to dev@openoffice.apache.org by "Marco A.G.Pinto" <ma...@mail.telepac.pt> on 2013/06/23 23:26:29 UTC

Converting Thesaurus (.DAT), Spellers (.DIC + .AFF) to UTF-8: question

Hello!

I am preparing to release V2.0 of my tool "Proofing Tool GUI" on the 1st 
of July.

I have optimised the code a lot and, on Windows, it takes just a few 
seconds to open the big *th_en_US_v2.dat* (18 MB).

I still have some issues on Linux: there is a long pause after opening 
the thesaurus/dictionaries. I have been able to locate what is causing 
it, but even so there is still a pause. For example, the pt-PT 
thesaurus (12940 words) takes one second to open on my Ubuntu 12 x86 VM, 
but then there is a 4-minute pause before it can be edited. I tried 
changing the code and it now takes only 40 seconds on my VM. This is 
still slow but, on a real Linux system, I believe it will be much 
faster. Also, the good news is that I believe my tool now works on all 
Linux distributions.

The reason for my question is that the files need to be in UTF-8 to 
be edited with Proofing Tool GUI, and in the user guide I was explaining 
how to convert them using UniRed Editor *[1]*. Then I found out that 
UniRed messes up the number of lines in some files, so I looked for 
another solution.

A friend of mine who is an expert in coding suggested Notepad++ *[2]*. 
I made a few tests with it and it works perfectly, so I need to 
improve the user guide to explain how to use it. But I have a question: 
when I convert the files to UTF-8, which option shall I use?:
1) Convert to UTF-8 without BOM
2) Convert to UTF-8

I am not 100% sure which one is correct.
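
For reference, the two options differ only in whether the file starts 
with the three-byte UTF-8 byte order mark (EF BB BF). A minimal Python 
sketch of that difference follows. The helper name has_utf8_bom and the 
sample text are made up for illustration and are not part of Proofing 
Tool GUI or Notepad++.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample line; any text would do.
text = "ação\n"

# "utf-8-sig" prepends the BOM when encoding; "utf-8" does not.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

print(has_utf8_bom(with_bom))     # file as saved by "Convert to UTF-8"
print(has_utf8_bom(without_bom))  # file as saved by "... without BOM"
```

Apart from those three leading bytes, the two encodings of the same 
text are byte-for-byte identical.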


[1] http://www.esperanto.mv.ru/UniRed/ENG/index.html

[2] http://notepad-plus-plus.org


Thank you very much!

Kind regards from,
        >Marco A.G.Pinto
          -----------------------

-- 

Re: Converting Thesaurus (.DAT), Spellers (.DIC + .AFF) to UTF-8: question

Posted by Andrea Pescetti <pe...@apache.org>.
Marco A.G.Pinto wrote:
> when I convert the files to UTF-8, which option shall I use?:
> 1) Convert to UTF-8 without BOM
> 2) Convert to UTF-8

Without BOM is fine. The Italian dictionary still uses ISO-8859 for 
historical reasons, but I've seen several working dictionary files 
encoded as UTF-8 without BOM (it is likely that with BOM is fine too, 
but I didn't check). Also, on Linux-based systems you will usually have 
many tools able to convert a text file to UTF-8, so you needn't 
recommend a specific tool in that case.
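
As one possible illustration of such a conversion, here is a short 
Python sketch that re-encodes a file to UTF-8 without BOM. It works on 
raw bytes, so line endings and the number of lines are left untouched. 
The function names, the default source encoding (ISO-8859-1) and the 
file names in the usage comment are assumptions for the example; the 
real source encoding of each file has to be known beforehand, since 
decoding with the wrong one would silently produce garbled text.

```python
import codecs
from pathlib import Path


def to_utf8_no_bom(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode raw file bytes as UTF-8 without a BOM.

    Assumes the caller knows the real source encoding.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Already UTF-8 with a BOM: just drop the three BOM bytes.
        return data[len(codecs.BOM_UTF8):]
    # Decode from the legacy encoding, then encode as plain UTF-8.
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Read src as bytes, convert, and write the result to dst."""
    Path(dst).write_bytes(to_utf8_no_bom(Path(src).read_bytes(), source_encoding))


# Hypothetical usage (file names are placeholders):
# convert_file("th_example.dat", "th_example_utf8.dat")
```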

Regards,
   Andrea.
