Posted to dev@cocoon.apache.org by Sylvain Wallez <sy...@apache.org> on 2005/03/11 20:05:04 UTC

Flowscript encoding weirdness and a solution

Hi all,

I encountered some weird things with a flowscript containing strings 
with accented characters, saved in UTF-8. This is because the flow 
interpreter uses the platform's default encoding to read script files. 
And of course this default encoding isn't the same on Windows and Mac...
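
To make the symptom concrete, here is a minimal sketch in plain JavaScript (Node.js, not Cocoon flowscript). It assumes ISO-8859-1 as a stand-in for a platform default that differs from the file's real encoding, and the variable names are made up for the illustration.

```javascript
// Illustration only: the same bytes read with two different charsets.
const source = 'var msg = "déjà créé";';

// What an editor writes to disk when the file is saved as UTF-8.
const bytes = Buffer.from(source, 'utf8');

// Reading with the right charset gives the original text back.
const asUtf8 = bytes.toString('utf8');

// Reading with a single-byte charset (ISO-8859-1 here, standing in for a
// platform default) turns each two-byte accented letter into two characters.
const asLatin1 = bytes.toString('latin1');

console.log(asUtf8 === source);    // true
console.log(asLatin1 === source);  // false: the accented letters are garbled
```

The ASCII part of the script survives either way, which is why the problem only shows up in string literals with accented characters.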

To solve this, I added the possibility to specify the file's encoding as 
a comment in the very first line of the script, e.g.

  // encoding = UTF-8
  function blah()
  ...

If no special comment exists, we fall back to the platform's default 
encoding, as we do today.

This works beautifully, and I'm thinking of adding this to 2.1 even if 
(or especially because) the release is coming soon.

WDYT?

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }

Re: Flowscript encoding weirdness and a solution

Posted by Sylvain Wallez <sy...@apache.org>.

Bertrand Delacretaz wrote:

> On 11 March 05, at 21:42, Sylvain Wallez wrote:
>
>> ....Or even a more javadoc-like
>>
>> // @encoding UTF-8...
>
>
> Looks good.
>
> Note that IIUC the same problem exists for Java source files: unless 
> the -encoding switch is used for javac, the default platform encoding 
> is used to compile. Should we add it to our build targets?
>
> I haven't seen problems, but if you have a use case for encoded 
> strings in flowscript it might apply to Java source code as well.

Over time, I have written a small (but useful) library of flowscript 
dialog functions inspired by javax.swing.JOptionPane. For example, I can 
write:

if (Dialog.confirm("Item already exists. Overwrite it?")) {
    overwrite();
} else {
    cancel();
}

As you can see, the message is the one displayed to the user, and may 
therefore contain accented letters in French. There's also an i18n-ized 
version, but setting up a dictionary is overkill for quick 
single-language demos and prototypes.

I never encountered this problem in Java classes as they're used as 
logic components and therefore don't produce user-readable messages, and 
also because encoding problems are solved at compilation time and not at 
runtime.

Now with Javaflow+CompilingClassLoader, this problem is likely to 
arise. So this should probably be a setting of the CompilingClassLoader.

Sylvain

Re: Flowscript encoding weirdness and a solution

Posted by Bertrand Delacretaz <bd...@apache.org>.

On 11 March 05, at 21:42, Sylvain Wallez wrote:
> ....Or even a more javadoc-like
>
> // @encoding UTF-8...

Looks good.

Note that IIUC the same problem exists for Java source files: unless 
the -encoding switch is used for javac, the default platform encoding 
is used to compile. Should we add it to our build targets?

I haven't seen problems, but if you have a use case for encoded strings 
in flowscript it might apply to Java source code as well.

-Bertrand

Re: Flowscript encoding weirdness and a solution

Posted by Sylvain Wallez <sy...@apache.org>.

Stefano Mazzocchi wrote:

> Sylvain Wallez wrote:
>
>> Hi all,
>>
>> I encountered some weird things with a flowscript containing strings 
>> with accented characters, saved in UTF-8. This is because the flow 
>> interpreter uses the platform's default encoding to read script 
>> files. And of course this default encoding isn't the same on Windows 
>> and Mac...
>>
>> To solve this, I added the possibility to specify the file's encoding 
>> as a comment in the very first line of the script, e.g.
>>
>>  // encoding = UTF-8
>>  function blah()
>>  ...
>>
>> If no special comment exists, we fall back to the platform's default 
>> encoding as of today.
>>
>> This works beautifully, and I'm thinking of adding this to 2.1 even 
>> if (or especially because) the release is coming soon.
>
>
> how about
>
>  //@ encoding = UTF-8
>
> instead? so that we can discriminate between comments and 'metadata 
> comments'?


Or even a more javadoc-like

// @encoding UTF-8

However, just like <?xml encoding="..."?>, this comment must appear on 
the _first_ line: a PushbackInputStream is used to re-read the script 
with the correct encoding, so we cannot do any complicated parsing to 
determine the encoding.
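
The "peek at the first line, then read everything again" idea can be sketched as follows. This is plain Node.js JavaScript, not the Java code in Cocoon: decodeScript is a hypothetical helper, it expects a Node Buffer, and it borrows the "encoding = xxx" pattern given elsewhere in this thread.

```javascript
// Hypothetical helper, not Cocoon's actual code: decode script bytes using
// the encoding declared on the first line, else a caller-supplied default.
function decodeScript(bytes, defaultEncoding) {
  // Look only at the first line. Its bytes are read as latin1 so that every
  // byte maps to exactly one character and the ASCII part stays readable.
  let end = bytes.indexOf(0x0a);            // first "\n", or -1 if none
  if (end === -1) end = bytes.length;
  const firstLine = bytes.subarray(0, end).toString('latin1');

  const match = /^.*encoding\s*=\s*([^\s]+)/.exec(firstLine);
  const encoding = match ? match[1] : defaultEncoding;

  // The "push back" step: start again from byte 0 with the chosen charset.
  // TextDecoder throws a RangeError for a label the runtime does not know.
  return { encoding, text: new TextDecoder(encoding).decode(bytes) };
}

const script = Buffer.from('// encoding = UTF-8\nvar msg = "déjà";\n', 'utf8');
console.log(decodeScript(script, 'utf-8').encoding);   // UTF-8
```

Because only the bytes up to the first newline are inspected before the real decoding starts, a declaration further down the file is never seen.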

Sylvain

Re: Flowscript encoding weirdness and a solution

Posted by Sylvain Wallez <sy...@apache.org>.

Stefano Mazzocchi wrote:

>>> how about
>>>
>>>  //@ encoding = UTF-8
>>>
>>> instead? so that we can discriminate between comments and 'metadata 
>>> comments'?
>>>
>>
>>
>> had a similar reflex, but from a different angle though:
>> namely by considering how vim is doing this:
>>
>> // vim: set fileencoding=iso-8859-1 nu ai:
>>
>> so: I surely like the @ idea, but am wondering if we shouldn't 
>> 'namespace' it some more (god knows how many more apps out there 
>> might be willing to do interesting annotations inside comments)
>>
>>
>> thinking of annotations, and the resemblance of js to java: we could 
>> require /** comments?
>> (which is not single line however, so stretches the first-line 
>> requirement)
>
>
> here people would suggest to embed RDF in it ;-)
>
> KISS!


Ok, so here's the regexp:

^.*encoding\s*=\s*([^\s]+)

This matches "encoding = xxx" on the first line with any space 
combination around "=" and with anything you like before "encoding", be 
it "//", "// @" or "// vim: set file".
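
A quick way to check the pattern against the variants discussed in this thread, as a sketch in plain JavaScript (the declaredEncoding helper and the ENCODING_RE name are made up for the test, the regexp itself is the one above):

```javascript
// The regexp from this message, unchanged.
const ENCODING_RE = /^.*encoding\s*=\s*([^\s]+)/;

// Hypothetical helper: returns the declared encoding, or null if the
// line carries no "encoding = xxx" declaration.
function declaredEncoding(firstLine) {
  const m = ENCODING_RE.exec(firstLine);
  return m ? m[1] : null;
}

console.log(declaredEncoding('// encoding = UTF-8'));   // UTF-8
console.log(declaredEncoding('//@ encoding=UTF-8'));    // UTF-8
console.log(declaredEncoding('// vim: set fileencoding=iso-8859-1 nu ai:')); // iso-8859-1
console.log(declaredEncoding('// @encoding UTF-8'));    // null
console.log(declaredEncoding('function blah()'));       // null
```

Note that the javadoc-like form without "=" is not matched by this pattern, so such a file would fall back to the platform default.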

Sylvain

Re: Flowscript encoding weirdness and a solution

Posted by Stefano Mazzocchi <st...@apache.org>.

Marc Portier wrote:
> 
> 
> Stefano Mazzocchi wrote:
> 
>> Sylvain Wallez wrote:
>>
>>> Hi all,
>>>
>>> I encountered some weird things with a flowscript containing strings 
>>> with accented characters, saved in UTF-8. This is because the flow 
>>> interpreter uses the platform's default encoding to read script 
>>> files. And of course this default encoding isn't the same on Windows 
>>> and Mac...
>>>
>>> To solve this, I added the possibility to specify the file's encoding 
>>> as a comment in the very first line of the script, e.g.
>>>
>>>  // encoding = UTF-8
>>>  function blah()
>>>  ...
>>>
>>> If no special comment exists, we fall back to the platform's default 
>>> encoding as of today.
>>>
>>> This works beautifully, and I'm thinking of adding this to 2.1 even 
>>> if (or especially because) the release is coming soon.
>>
>>
>>
>> how about
>>
>>  //@ encoding = UTF-8
>>
>> instead? so that we can discriminate between comments and 'metadata 
>> comments'?
>>
> 
> 
> had a similar reflex, but from a different angle though:
> namely by considering how vim is doing this:
> 
> // vim: set fileencoding=iso-8859-1 nu ai:
> 
> so: I surely like the @ idea, but am wondering if we shouldn't 
> 'namespace' it some more (god knows how many more apps out there might 
> be willing to do interesting annotations inside comments)
> 
> 
> thinking of annotations, and the resemblance of js to java: we could 
> require /** comments?
> (which is not single line however, so stretches the first-line requirement)

here people would suggest to embed RDF in it ;-)

KISS!

-- 
Stefano.

Re: Flowscript encoding weirdness and a solution

Posted by Marc Portier <mp...@outerthought.org>.


Stefano Mazzocchi wrote:
> Sylvain Wallez wrote:
> 
>> Hi all,
>>
>> I encountered some weird things with a flowscript containing strings 
>> with accented characters, saved in UTF-8. This is because the flow 
>> interpreter uses the platform's default encoding to read script files. 
>> And of course this default encoding isn't the same on Windows and Mac...
>>
>> To solve this, I added the possibility to specify the file's encoding 
>> as a comment in the very first line of the script, e.g.
>>
>>  // encoding = UTF-8
>>  function blah()
>>  ...
>>
>> If no special comment exists, we fall back to the platform's default 
>> encoding as of today.
>>
>> This works beautifully, and I'm thinking of adding this to 2.1 even if 
>> (or especially because) the release is coming soon.
> 
> 
> how about
> 
>  //@ encoding = UTF-8
> 
> instead? so that we can discriminate between comments and 'metadata 
> comments'?
> 


had a similar reflex, but from a different angle though:
namely by considering how vim is doing this:

// vim: set fileencoding=iso-8859-1 nu ai:

so: I surely like the @ idea, but am wondering if we shouldn't 
'namespace' it some more (god knows how many more apps out there might 
be willing to do interesting annotations inside comments)


thinking of annotations, and the resemblance of js to java: we could 
require /** comments?
(which is not single line however, so stretches the first-line requirement)


-marc=
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org

Re: Flowscript encoding weirdness and a solution

Posted by Stefano Mazzocchi <st...@apache.org>.

Sylvain Wallez wrote:
> Hi all,
> 
> I encountered some weird things with a flowscript containing strings 
> with accented characters, saved in UTF-8. This is because the flow 
> interpreter uses the platform's default encoding to read script files. 
> And of course this default encoding isn't the same on Windows and Mac...
> 
> To solve this, I added the possibility to specify the file's encoding as 
> a comment in the very first line of the script, e.g.
> 
>  // encoding = UTF-8
>  function blah()
>  ...
> 
> If no special comment exists, we fall back to the platform's default 
> encoding as of today.
> 
> This works beautifully, and I'm thinking of adding this to 2.1 even if 
> (or especially because) the release is coming soon.

how about

  //@ encoding = UTF-8

instead? so that we can discriminate between comments and 'metadata 
comments'?

-- 
Stefano.