Posted to users@cocoon.apache.org by roy huang <li...@hotmail.com> on 2004/08/12 13:45:43 UTC

[Help]How can I use non-ascii file name?

Hi all,
    Using a reader to display a jpg or gif is quite simple, like:
   <map:match pattern="*.jpg">
    <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
   </map:match>
   But if the file name contains non-ASCII characters (UTF-8 or another encoding), like 花.jpg (simplified Chinese), the resolver doesn't resolve the name correctly and an error occurs:
org.apache.cocoon.ResourceNotFoundException: Error during resolving of the input stream: org.apache.excalibur.source.SourceNotFoundException: file:/C:/My Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg doesn't exist.

How can I use non-ASCII file names in Cocoon? I can't find any description or help in the wiki or the archived mailing list.

Roy Huang

Re: [Help]How can I use non-ascii file name?

Posted by Marc Portier <mp...@outerthought.org>.

Pier Fumagalli wrote:

> On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:
> 
>> I'll see why this happens in Jetty, I'll poke Jen and Greg to have 
>> either a fix, or an explanation and workaround... For now, brrrr, I 
>> think that the hack is the only way to go...
> 
> 
> I don't know about Tomcat, but if you're not on the jetty developers 
> list, here's the outcome:
> 

I'm not, thx for copying over...

> Jetty defaults (for compatibility to all the other broken containers, 
> and because there's no "official standard" about UTF-8 URIs) to 
> ISO-8859-1. And this ain't great.
> 
> Now, the good thing is that if you start your jetty specifying the 
> "org.mortbay.util.URI.charset" system property, it will use that one as 
> the charset used for decoding URLs.
> 
> So, by putting in "-Dorg.mortbay.util.URI.charset=UTF-8" we get the 
> expected behavior.
> 

cool

> How about setting it up as the default behavior for Cocoon's internal 
> Jetty distro?
> 

makes sense, but: (wishing all this brokenness wasn't there, but alas)

- it shouldn't keep us from actually getting around to solving it for all 
containers? (my guess is that just a fraction of cocoon deployments 
actually run on the internal jetty distro, i.e. using the cocoon.sh or 
.bat?)

- learning about this org.mortbay.util.URI.charset property we should 
probably use it to override (or at least log-warn deployers if it's 
different to) the container-encoding setting in the web.xml
(assuming that the mentioned property will also be in effect when 
decoding the request parameters, and taking into account that current 
cocoon code assumes ISO-8859-1 as the default there)

- once we've run that far, we might even consider making a scan of other 
servlet containers and how they possibly allow setting the 
container-encoding?
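
The log-warn idea from the second point could look roughly like the sketch below. It is only an illustration, not Cocoon code: the class EncodingConsistencyCheck and the method warningFor are made-up names, and the container-encoding value is hard-coded here where real code would read it from the init-param in web.xml. Only the system property name org.mortbay.util.URI.charset and the ISO-8859-1 default come from this thread.

```java
import java.nio.charset.Charset;

public class EncodingConsistencyCheck {

    // Returns null when both names denote the same charset (aliases included),
    // otherwise a message a deployer could be warned with.
    public static String warningFor(String uriCharset, String containerEncoding) {
        Charset uri = Charset.forName(uriCharset);
        Charset container = Charset.forName(containerEncoding);
        if (uri.equals(container)) {
            return null;
        }
        return "URI charset " + uri.name()
                + " differs from container-encoding " + container.name();
    }

    public static void main(String[] args) {
        // Property name taken from the thread; falls back to the ISO-8859-1 default.
        String uriCharset = System.getProperty("org.mortbay.util.URI.charset", "ISO-8859-1");
        // Hypothetical: stands in for the container-encoding init-param from web.xml.
        String containerEncoding = "ISO-8859-1";

        String warning = warningFor(uriCharset, containerEncoding);
        System.out.println(warning == null ? "encodings agree" : "WARN: " + warning);
    }
}
```

Comparing Charset objects instead of raw strings means that spellings such as "latin1" and "ISO-8859-1" are treated as the same encoding.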

wdyt?


while typing I started rethinking why we ended up with this 
container-encoding init-param in web.xml?

IIRC we did that because of required compliance to servlet spec versions 
prior to 2.3?  So first question is are we still on servlet 2.2?

If not: Since 2.3 there exists a setCharacterEncoding()
<quote from="servlet 2.3 javadoc" 
href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)">
   Overrides the name of the character encoding used in the body of this
   request. This method must be called prior to reading request
   parameters or reading input using getReader().
</quote>

- I assume the cocoon servlet could easily arrange for calling the 
method before anything else
- I'm a bit unsure here if the javadoc mentioning of 'in the body of 
this request' is going to be interpreted by implementations as a 
limiting scope, and if so if they include the URI (and the request 
params using get vs post) as part of it or not

(talk about possible confusion when writing specs like this, yuk!)


regards,
-marc=
(sorry for just popping up the questions, lacking the time to 
investigate deeper myself ATM)
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org



Re: [Help]How can I use non-ascii file name?

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On 17 Aug 2004, at 11:03, Pier Fumagalli wrote:

> On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:
>
>> I'll see why this happens in Jetty, I'll poke Jen and Greg to have 
>> either a fix, or an explanation and workaround... For now, brrrr, I 
>> think that the hack is the only way to go...
>
> I don't know about Tomcat, but if you're not on the jetty developers 
> list, here's the outcome:
>
> Jetty defaults (for compatibility to all the other broken containers, 
> and because there's no "official standard" about UTF-8 URIs) to 
> ISO-8859-1. And this ain't great.
>
> Now, the good thing is that if you start your jetty specifying the 
> "org.mortbay.util.URI.charset" system property, it will use that one 
> as the charset used for decoding URLs.
>
> So, by putting in "-Dorg.mortbay.util.URI.charset=UTF-8" we get the 
> expected behavior.
>
> How about setting it up as the default behavior for Cocoon's internal 
> Jetty distro?

you got my +1

regards Jeremy


--------------------------------------------------------

                   If email from this address is not signed
                                 IT IS NOT FROM ME

                         Always check the label, folks !!!!!
--------------------------------------------------------


Re: [Help]How can I use non-ascii file name?

Posted by Marc Portier <mp...@outerthought.org>.

Pier Fumagalli wrote:

> On 17 Aug 2004, at 16:20, Marc Portier wrote:
> 
>>> How about setting it up as the default behavior for Cocoon's 
>>> internal  Jetty distro?
>>
>>
>> makes sense, but: (wishing all this brokenness wasn't there, but alas)
> 
> 
> It's not really "brokenness" but more along the lines of an inversion  
> of the Robustness Principle, as outlined by J. Postel in RFC-791  
> (http://www.rfc-editor.org/rfc/rfc791.txt section 3.2) and later  
> dogmatized by R. Braden in RFC-1122  
> (http://www.rfc-editor.org/rfc/rfc1122.txt Section 1.2.2).
> 
> "Be liberal in what you accept, and conservative in what you send."
> 
> In this case browsers are liberal in what they send (URL-Encoded UTF-8)  
> and servlet containers are conservative in what they accept  
> (URL-Encoded ISO-8859-1).
> 

indeed

>> - it shouldn't keep us from actually getting around to solving it for all
>> containers? (my guess is that just a fraction of cocoon deployments
>> actually run on the internal jetty distro, i.e. using the cocoon.sh or
>> .bat?)
> 
> 
> Well, we found that Jetty in production was much better than anyone  
> else. So, in our production environment we have Jetty (not the Cocoon  
> distro one, a full blown copy)... Works pretty neatly! :-P
> 
>> - learning about this org.mortbay.util.URI.charset property we should
>> probably use it to override (or at least log-warn deployers if it's
>> different to) the container-encoding setting in the web.xml
>> (assuming that the mentioned property will also be in effect when
>> decoding the request parameters, and taking into account that current
>> cocoon code assumes ISO-8859-1 as the default there)
> 
> 
> I agree, but as I said, my world revolves around the best container in  
> the world (whoops, Jetty), so I already have "my" fix to the problem:  
> switch! :-P
> 
>> - once we've run that far, we might even consider making a scan of  other
>> servlet containers and how they possibly allow setting the
>> container-encoding?
> 
> 
> The "container-encoding" servlet initialization parameter simply  
> applies to request parameters (form data), and I suppose it only  
> affects the way in which we read full blown characters from the  
> ServletRequest.getInputStream(), and parse forms.
> 

I'd need to check, but I assume the request params are included regardless 
of the GET or POST method

of course the uri-part before ? would need to have been used already 
internally in the servlet container at least to point to the correct JSP 
or servlet...

hm, I'd need to try-out some jsp/servlet with a euro-sign in the 
file-name or so and check whether the path indication in the web.xml is 
able to find it...

>> while typing I started rethinking why we ended up with this
>> container-encoding init-param in web.xml?
>>
>> IIRC we did that because of required compliance to servlet spec  versions
>> prior to 2.3?  So first question is are we still on servlet 2.2?
>>

Just found the thread that answers the question:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=108858029423811&w=2

>> If not: Since 2.3 there exists a setCharacterEncoding()
>> <quote from="servlet 2.3 javadoc"
>> href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ 
>> ServletRequest.html#setCharacterEncoding(java.lang.String)">
>>   Overrides the name of the character encoding used in the body of this
>>   request. This method must be called prior to reading request
>>   parameters or reading input using getReader().
>> </quote>
> 
> 
> Indeed, the problem here is that it's nowhere specified how the request  
> BODY (not the URL, source of this problem) should be encoded.
> 

yep, but as stated above: I suppose that the border-case 'request-params 
in GET mode' is included (even if those are -strictly speaking- not in 
the body?).

This seems to suggest that the current use of the en-re-decoding trick 
in cocoon's request-wrapper could be cleaned out (since we voted to go 
with 2.3 from now on)

> Normally, from browser behaviour, I can see that usually browsers tend  
> to post application/x-www-form-urlencoded in the same charset they used  
> interpreting the form. So given an HTTP request like this:
> 
> C: GET /myForm HTTP/1.1
> C: Host: localhost:80
> C:
> S: HTTP/1.1 200 OK
> S: Date: Wed, 18 Aug 2004 08:30:28 GMT
> S: Server: Apache/2.0.49 (Unix) DAV/2 SVN/1.0.2
> S: Content-Type: text/html; charset=utf-8
> 
> When the form included in /myForm is posted back to its action, the  
> UTF-8 charset will be used to encode the form data...
> 
> That's normally a rule of thumb, and that's why (IMVHO) UTF-8 should be  
> used for all forms, and should always be used as the default encoding  
> for writing and reading.
> 

yep,
we have wiki info already indicating that to our users:
http://wiki.apache.org/cocoon/RequestParameterEncoding

(hm, more interesting stuff out there, and probably some of the new 
viewpoints from this thread could be added there)


>> - I assume the cocoon servlet could easily arrange for calling the
>> method before anything else
> 
> 
> Yes, hoping that it actually works. But cocoon should call the method  
> with the encoding used to send the form from where data is read...  

yep, they should be consistent.
fact is there was a patch on the serializers to do so by default

(but the other way around: by default they are taking the setting of 
form_encoding init param for doing the serialization)

fixcommit here:
http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java?r1=24666&r2=26246&p1=cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java&p2=cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java&diff_format=h&root=Apache-SVN

archived discussion here: 
http://marc.theaimsgroup.com/?t=106760662600010&r=1&w=2

> should be easy for continuations, but in most of the cases, I'd say  
> that it's a good principle to choose one encoding for your entire  
> application and stick to it...
> 

agree, just running through the (above mentioned) wiki page however I 
noticed some paragraph on wanting to 'locally' override the 
form-encoding for certain pipelines (use case being support for 
clients other than the classic browsers, which might behave 
differently)

the suggested setCharacterEncodingAction seems to be a good match to 
that issue and it somewhat suggests we should keep some form of possible 
en-re-decoding scheme in our request-wrapper (looks like the 2.3 switch 
should not make us jump to hasty conclusions on that part)

(boy this issue seems to be a rose with many thorns, and it seems to 
blossom every year or so :-))

>> - I'm a bit unsure here if the javadoc mentioning of 'in the body of
>> this request' is going to be interpreted by implementations as a
>> limiting scope, and if so if they include the URI (and the request
>> params using get vs post) as part of it or not
> 
> 
> The point you mentioned in the spec _DOES_NOT_ include the request URI.  
> We've talked quite extensively over it while writing Servlet 2.4, which  
> (in theory) should expand more on the concepts of charset and i18n.
> 

thx for the clarification and inside info

>> (talk about possible confusion when writing specs like this, yuk!)
> 
> 
> Well, it's a big gray area... Most of my knowledge is based on my  
> girlfriend's PC. She's japanese, and although I don't understand what's  
> all that gibberish on her screen, I can still test out few bits and  
> bobs...
> 
> For all our MacOS/X folks, if you want to try out playing with  
> different encodings and internationalization settings, close your  
> Safari, Mozilla, Firefox, and so on, go into the System Preferences and  
> drag the three "bookcase, christmas tree, lotsa-lines block"  
> (ni-hon-go) sequence of three characters right up to the top. Start  
> your browser, and then restore english (french, italian, german) up on  
> top where it was in the preferences.
> 
> Your browser will now think it's working on a Japanese PC and will do  
> everything like you were living in Tokyo.
> 
> On Windows, sorry, your best bet is to actually GO to Tokyo, and buy a  
> copy of WindowsXP in Japanese. :-(
> 

yeah testing isn't obvious as one also needs to rely on having an 
as-unicode-complete-as-they-come font so you are sure you are seeing 
what you think you are seeing...

any case: my personal testing-candidate for these cases is just using 
the euro-sign (\u20AC, utf-8: %E2%82%AC) in pathnames, filenames, 
classnames, request params and whatnot.

most european systems (even windows) would have a native encoding 
supporting the eurosign (while iso-8859-1 obviously doesn't)
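
A small Java sketch can confirm the numbers above. It is an illustration only: the class EuroSignCheck and its helper methods are made-up names, and windows-1252 is picked here merely as one example of a native European encoding that has the euro sign.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EuroSignCheck {

    // Percent-encodes every UTF-8 byte of the input, e.g. for use in a test URL.
    public static String percentEncodeUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Tells whether the charset has a representation for every character of s.
    public static boolean canEncode(String s, Charset charset) {
        return charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        System.out.println(percentEncodeUtf8(euro));                       // %E2%82%AC
        System.out.println(canEncode(euro, StandardCharsets.ISO_8859_1));  // false
        // Assumption: the JDK in use ships the windows-1252 charset.
        System.out.println(canEncode(euro, Charset.forName("windows-1252")));
    }
}
```

Because ISO-8859-1 cannot represent the euro sign at all, any place where it silently gets used as the decoding charset shows up immediately in such a test.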

geek detail: you can even use it in your Java source code:

public class \u20ACToBEF
{
...
}

(in fact java's compiler is completely unicode aware towards the source 
code: if you're sick enough you might even go about writing the keywords 
like 'public' and 'class' in their escaped unicode variants :-)
notice that you will need to be able to specify a euro-sign in the 
filename of that source though)


regards,
-marc=
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org

Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
On 17 Aug 2004, at 16:20, Marc Portier wrote:

>> How about setting it up as the default behavior for Cocoon's internal  
>> Jetty distro?
>
> makes sense, but: (wishing all this brokenness wasn't there, but alas)

It's not really "brokenness" but more along the lines of an inversion  
of the Robustness Principle, as outlined by J. Postel in RFC-791  
(http://www.rfc-editor.org/rfc/rfc791.txt section 3.2) and later  
dogmatized by R. Braden in RFC-1122  
(http://www.rfc-editor.org/rfc/rfc1122.txt Section 1.2.2).

"Be liberal in what you accept, and conservative in what you send."

In this case browsers are liberal in what they send (URL-Encoded UTF-8)  
and servlet containers are conservative in what they accept  
(URL-Encoded ISO-8859-1).
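
To make the mismatch concrete, here is a minimal Java sketch (plain JDK, no servlet container involved; the class UrlDecodeMismatch and its decode helper are made-up names). It decodes the percent-encoded request path from the access log quoted in this thread once as UTF-8, the way the browser meant it, and once as ISO-8859-1, the way the containers were found to read it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecodeMismatch {

    // Percent-decodes the input and turns the resulting bytes into characters
    // using the named charset.
    public static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The nine UTF-8 bytes of the three-character file name, as sent on the wire.
        String wire = "%E8%B0%B7%E7%90%86%E5%AD%90";

        String asUtf8 = decode(wire, "UTF-8");
        String asLatin1 = decode(wire, "ISO-8859-1");

        System.out.println(asUtf8.length());          // 3
        System.out.println(asLatin1.length());        // 9
        System.out.println(asUtf8.equals(asLatin1));  // false
    }
}
```

The ISO-8859-1 result has one character per byte, so it can never match a sitemap pattern or a file name written with the real characters.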

> - it shouldn't keep us from actually getting around to solving it for all
> containers? (my guess is that just a fraction of cocoon deployments
> actually run on the internal jetty distro, i.e. using the cocoon.sh or
> .bat?)

Well, we found that Jetty in production was much better than anyone  
else. So, in our production environment we have Jetty (not the Cocoon  
distro one, a full blown copy)... Works pretty neatly! :-P

> - learning about this org.mortbay.util.URI.charset property we should
> probably use it to override (or at least log-warn deployers if it's
> different to) the container-encoding setting in the web.xml
> (assuming that the mentioned property will also be in effect when
> decoding the request parameters, and taking into account that current
> cocoon code assumes ISO-8859-1 as the default there)

I agree, but as I said, my world revolves around the best container in  
the world (whoops, Jetty), so I already have "my" fix to the problem:  
switch! :-P

> - once we've run that far, we might even consider making a scan of  
> other
> servlet containers and how they possibly allow setting the
> container-encoding?

The "container-encoding" servlet initialization parameter simply  
applies to request parameters (form data), and I suppose it only  
affects the way in which we read full blown characters from the  
ServletRequest.getInputStream(), and parse forms.

> while typing I started rethinking why we ended up with this
> container-encoding init-param in web.xml?
>
> IIRC we did that because of required compliance to servlet spec  
> versions
> prior to 2.3?  So first question is are we still on servlet 2.2?
>
> If not: Since 2.3 there exists a setCharacterEncoding()
> <quote from="servlet 2.3 javadoc"
> href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ 
> ServletRequest.html#setCharacterEncoding(java.lang.String)">
>   Overrides the name of the character encoding used in the body of this
>   request. This method must be called prior to reading request
>   parameters or reading input using getReader().
> </quote>

Indeed, the problem here is that it's nowhere specified how the request  
BODY (not the URL, source of this problem) should be encoded.

Normally, from browser behaviour, I can see that usually browsers tend  
to post application/x-www-form-urlencoded in the same charset they used  
interpreting the form. So given an HTTP request like this:

C: GET /myForm HTTP/1.1
C: Host: localhost:80
C:
S: HTTP/1.1 200 OK
S: Date: Wed, 18 Aug 2004 08:30:28 GMT
S: Server: Apache/2.0.49 (Unix) DAV/2 SVN/1.0.2
S: Content-Type: text/html; charset=utf-8

When the form included in /myForm is posted back to its action, the  
UTF-8 charset will be used to encode the form data...

That's normally a rule of thumb, and that's why (IMVHO) UTF-8 should be  
used for all forms, and should always be used as the default encoding  
for writing and reading.

> - I assume the cocoon servlet could easily arrange for calling the
> method before anything else

Yes, hoping that it actually works. But cocoon should call the method  
with the encoding used to send the form from where data is read...  
should be easy for continuations, but in most of the cases, I'd say  
that it's a good principle to choose one encoding for your entire  
application and stick to it...

> - I'm a bit unsure here if the javadoc mentioning of 'in the body of
> this request' is going to be interpreted by implementations as a
> limiting scope, and if so if they include the URI (and the request
> params using get vs post) as part of it or not

The point you mentioned in the spec _DOES_NOT_ include the request URI.  
We've talked quite extensively over it while writing Servlet 2.4, which  
(in theory) should expand more on the concepts of charset and i18n.

> (talk about possible confusion when writing specs like this, yuk!)

Well, it's a big gray area... Most of my knowledge is based on my  
girlfriend's PC. She's japanese, and although I don't understand what's  
all that gibberish on her screen, I can still test out few bits and  
bobs...

For all our MacOS/X folks, if you want to try out playing with  
different encodings and internationalization settings, close your  
Safari, Mozilla, Firefox, and so on, go into the System Preferences and  
drag the three "bookcase, christmas tree, lotsa-lines block"  
(ni-hon-go) sequence of three characters right up to the top. Start  
your browser, and then restore english (french, italian, german) up on  
top where it was in the preferences.

Your browser will now think it's working on a Japanese PC and will do  
everything like you were living in Tokyo.

On Windows, sorry, your best bet is to actually GO to Tokyo, and buy a  
copy of WindowsXP in Japanese. :-(

	Pier

Re: [Help]How can I use non-ascii file name?

Posted by Marc Portier <mp...@outerthought.org>.
(repost: just noticed I forgot to copy dev-list)




Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:

> I'll see why this happens in Jetty, I'll poke Jen and Greg to have 
> either a fix, or an explanation and workaround... For now, brrrr, I 
> think that the hack is the only way to go...

I don't know about Tomcat, but if you're not on the jetty developers 
list, here's the outcome:

Jetty defaults (for compatibility to all the other broken containers, 
and because there's no "official standard" about UTF-8 URIs) to 
ISO-8859-1. And this ain't great.

Now, the good thing is that if you start your jetty specifying the 
"org.mortbay.util.URI.charset" system property, it will use that one as 
the charset used for decoding URLs.

So, by putting in "-Dorg.mortbay.util.URI.charset=UTF-8" we get the 
expected behavior.

How about setting it up as the default behavior for Cocoon's internal 
Jetty distro?

	Pier


Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
References to non-hack:

http://www.w3.org/International/O-URL-and-ident

	Pier

On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:

> Ok, I tracked the sucker down... It's the servlet container... They  
> all decode the stupid URL using ISO-8859-1... And therefore, utterly  
> incompatible with 3/4 of the non-english-speaking world...
>
> At best, I was able to _HACK_ the whole thing through, by getting the  
> path info in this way:
>
> <WARNING note="shit-code-follows">
>
> new String(request.getPathInfo().getBytes("ISO-8859-1"), "UTF-8");
>
> </WARNING>
>
> Therefore, I get the BYTES of the path-info string as if they were in  
> ISO-8859-1, and re-create a new string by taking those bytes and  
> forcing them to be in UTF-8...
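
The byte round-trip described above can be tried without a servlet container. The following is a minimal sketch under that assumption: the class PathInfoRedecode and the method redecode are made-up names, and the wrongly decoded path-info string is simulated in main instead of coming from request.getPathInfo().

```java
import java.nio.charset.StandardCharsets;

public class PathInfoRedecode {

    // Takes a string whose UTF-8 bytes were wrongly decoded as ISO-8859-1,
    // recovers the original bytes and decodes them as UTF-8.
    public static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "/\u8C37\u7406\u5B50";

        // Simulates the container: UTF-8 bytes from the wire read as ISO-8859-1.
        String misdecoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        System.out.println(misdecoded.equals(original));           // false
        System.out.println(redecode(misdecoded).equals(original)); // true
    }
}
```

This works because ISO-8859-1 maps every byte to exactly one character, so the original bytes survive the wrong decoding. It is only safe when the container really decoded with ISO-8859-1; applied to a string that was already decoded correctly, characters outside ISO-8859-1 would be lost.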
>
> Niiiiiiiiiiiiiiiiiiice!
>
> Note that this stupidity also happens with accented letters (that for  
> us Italians is a big p-i-t-a).
>
> I'll see why this happens in Jetty, I'll poke Jen and Greg to have  
> either a fix, or an explanation and workaround... For now, brrrr, I  
> think that the hack is the only way to go...
>
> Oh, I checked it also on Tomcat. Same problem there as well...
>
> 	Pier
>
>
>
> On 16 Aug 2004, at 12:05, Marc Portier wrote:
>
>> Pier,
>>
>>
>> As a coincidence we recently (last week) had a similar post on  
>> xreporter-list (which uses cocoon)
>>
>> Bad news is that I didn't track it down to the bottom yet, just some  
>> findings below:
>> (in fact the odd-char-in-filename for map:read and map:mount was one  
>> of the first things I was going to test, seems I'm already presented  
>> with the results)
>>
>>
>> what I did find already was this:
>>
>> Cocoon's Request.getSitemapURI() will return an assembly of  
>> javax.servlet.http.HttpServletRequest.getServletPath()
>> + javax.servlet.http.HttpServletRequest.getPathInfo()
>>
>> Servlet spec on those states they will be (url-) decoded
>> Thus 3 char sequences of the kind "%BYTE_HEX" will have been  
>> translated into single bytes. The obtained byte-sequence is then  
>> decoded using SOME_DECODING (my guess would be using ISO-8859-1, but  
>> haven't found yet if this is container specific, modifiable or hard  
>> noted in some spec. Only thing I found is this:  
>> http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but  
>> I'm yet unsure on how this influences servlet specs, or actual  
>> container and even browser implementations for that matter)
>>
>>
>> Alternatively there is:
>> Cocoon's Request.getRequestURI() which maps onto the
>> javax.servlet.http.HttpServletRequest.getRequestURI()
>>
>> This one resembles the URI as transferred over the wire: ie. not  
>> (url-)decoded, or in other words still holding the %XX sequences
>>
>>
>> As an extra clarification on all these the servlet spec explicitly  
>> states: (2.3 version, page 34, section SRV4.4 Request Path Elements)
>> <quote>
>> It is important to note that, *except for URL encoding differences*  
>> between the request URI and the path parts, the following equation is  
>> always true:
>>
>> requestURI = contextPath + servletPath + pathInfo
>> </quote>
>>
>>
>> I (for now) assume that this is the same encoding we expect  
>> cocoon-deploy people to specify in the 'container-encoding'  
>> init-parameter in the web.xml (allowing to correctly en-re-decode  
>> request-parameter-values in case of mismatching form and container  
>> encodings)
>>
>>
>>
>>
>> Ok, above is dull data, and not much into a direction of any solution  
>> yet.  My current feeling (long shot, needs time to test and try, and  
>> based on above assumption) is that we should
>>
>> In terms of backwards compatibility I'm unsure if we could just go  
>> about changing the semantics (historically implied use of iso-8859-1  
>> encoding) of getSitemapURI() or rather should deprecate and/or have a  
>> different method next to it?
>>
>> In any case this new implementation should then probably apply the  
>> same kind of dirty en-re-decoding-trick
>>
>> return new String(getSitemapURI().getBytes(container_encoding), form_encoding);
>>
>> as we do today with the request param values?
>>
>> (see  
>> http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/ 
>> cocoon/environment/http/HttpRequest.java?annotate=1.11#391
>> sorry for the old cvs-style link, the svn version of viewcvs doesn't  
>> seem to support 'annotate' ?)
>>
>>
>> For the record: the fast hack/workaround in the xreporter case was  
>> exactly to apply this.
>>
>>
>>
>>
>> Attached to this I'm also seeing the trouble of mount-points in  
>> cocoon.   I've seen a number of installments needing (well, 'using'  
>> at least) some insertion of that  
>> part-of-the-URL-that-maps-to-the-mounted-sitemap to be able to have  
>> links in source xml.files refer to other resources managed by the  
>> same mounted sitemap without the need to explicitly mention that  
>> part (but have it dynamically inserted by some xsl instead).
>>
>> In those occasions I've seen people mostly subtract siteMapURI from  
>> requestURI to obtain that prefix part. Regarding the above  
>> observations this algorithm will however fail due to encoding  
>> differences.
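
Why the subtraction fails, and one way around it, can be sketched in plain Java. This is an illustration under assumptions, not Cocoon API: the class MountPrefix, both methods and the example URIs are made up, and the sitemap URI is assumed to be already correctly decoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class MountPrefix {

    // Naive approach: strip the sitemap URI from the end of the raw request URI.
    // Returns null when the request URI does not end with the sitemap URI.
    public static String naivePrefix(String requestUri, String sitemapUri) {
        if (!requestUri.endsWith(sitemapUri)) {
            return null;
        }
        return requestUri.substring(0, requestUri.length() - sitemapUri.length());
    }

    // Decodes the %XX sequences of the request URI first, then strips.
    // Note: URLDecoder also turns '+' into a space, which is acceptable for a sketch.
    public static String decodedPrefix(String requestUri, String sitemapUri, String charsetName) {
        try {
            return naivePrefix(URLDecoder.decode(requestUri, charsetName), sitemapUri);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String requestUri = "/app/reports/caf%C3%A9.xml"; // as sent over the wire
        String sitemapUri = "caf\u00E9.xml";              // decoded form

        System.out.println(naivePrefix(requestUri, sitemapUri));            // null
        System.out.println(decodedPrefix(requestUri, sitemapUri, "UTF-8")); // /app/reports/
    }
}
```

The raw request URI still holds the %XX sequences while the sitemap URI does not, so the two only line up once both are in the same decoded form.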
>>
>> My proposal would be to not only add a method for decoding the  
>> sitemapURI properly, but in the mean time adding the convenience  
>> method to return the mounted-sitemap-part as well on the level of  
>> cocoon's request.
>>
>>
>>
>> Above are early observations that need some backing, so comments  
>> welcome. (and hoping someone beats me to this since I'm lacking the  
>> time to pursue myself)
>> -marc=
>>
>>
>> Pier Fumagalli wrote:
>>> On 12 Aug 2004, at 12:45, roy huang wrote:
>>>> Hi,all:
>>>>     Use reader to display jpg or gif is quite simple,like:
>>>>    <map:match pattern="*.jpg">
>>>>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>>>>    </map:match>
>>>>    But if the file name is not ASCII but utf-8 or other encoding  
>>>> like 花.jpg (simplified Chinese),the resolver didn't resolve the  
>>>> name correctly,error occur:
>>>> org.apache.cocoon.ResourceNotFoundException: Error during resolving  
>>>> of the input stream:  
>>>> org.apache.excalibur.source.SourceNotFoundException: file:/C:/My  
>>>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg  
>>>> doesn't exist.
>>>>
>>>> How can I use non-ASCII file name in cocoon?I can't find any  
>>>> description or help in wiki or archived mail list.
>>>>
>>>> Roy Huang
>>> It appears indeed as a bug...
>>> I have this sitemap snippet:
>>>     <map:match pattern="谷*">
>>>       <map:generate src="谷{1}.xml"/>
>>>       <map:transform src="welcome.xslt">
>>>         <map:parameter name="contextPath"  
>>> value="{request:contextPath}"/>
>>>       </map:transform>
>>>       <map:serialize type="xhtml"/>
>>>     </map:match>
>>> and a file on the disk called "谷理子.xml". Somewhere, when I make a  
>>> request for "http://localhost:8888/谷理子", the whole thing goes  
>>> berserk...
>>> Now, the URL is passed correctly, as I see that in the access log:
>>> INFO    (2004-08-16) 10:26.36:538   [access]  
>>> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????'  
>>> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
>>> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0  
>>> B7 E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow  
>>> it gets lost in the process.
>>> Now, if I modify my sitemap to
>>>     <map:match pattern="tanisatoko">
>>>       <map:generate src="谷理子.xml"/>
>>>       <map:transform src="welcome.xslt">
>>>         <map:parameter name="contextPath"  
>>> value="{request:contextPath}"/>
>>>       </map:transform>
>>>       <map:serialize type="xhtml"/>
>>>     </map:match>
>>> And I make a request to "http://localhost:8888/tanisatoko", the  
>>> thing works perfectly. We can safely exclude the fact that it's the  
>>> generation process.
>>> Now, the _odd_ thing I noticed is that in those cases, I get an  
>>> error of "PipelineNotFound", not a "ResourceNotFound", which means  
>>> that the matcher seriously doesn't see that request.
>>> Changing over the matcher to a 'regexp' matcher doesn't change, so,  
>>> I bet it's the data we feed to the matcher.
>>> Now, changing that matcher to  
>>> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", the  
>>> encoding, and running it again, I get my nice page correctly.
>>> I bet that somewhere (I don't know where, but surely somewhere), the  
>>> UTF-8 encoded URL is converted into a string using the current locale
>>> (MacRoman on my system), or a default of "ISO-8859-1", before the  
>>> string is actually given to the sitemap.
>>> Not having the sources at hand at the moment, I can't do a quick  
>>> build to put out some debugging instruction, but  you get the idea.
>>>     Pier
>>
>> -- 
>> Marc Portier                            http://outerthought.org/
>> Outerthought - Open Source, Java & XML Competence Support Center
>> Read my weblog at                http://blogs.cocoondev.org/mpo/
>> mpo@outerthought.org                              mpo@apache.org
>>

Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:

> I'll see why this happens in Jetty, I'll poke Jen and Greg to have 
> either a fix, or an explanation and workaround... For now, brrrr, I
> think that the hack is the only way to go...

I don't know about Tomcat, but if you're not on the jetty developers 
list, here's the outcome:

Jetty defaults (for compatibility with all the other broken containers,
and because there's no "official standard" about UTF-8 URIs) to 
ISO-8859-1. And this ain't great.

Now, the good thing is that if you start your jetty specifying the 
"org.mortbay.util.URI.charset" system property, it will use that one as 
the charset used for decoding URLs.

So, by putting in "-Dorg.mortbay.util.URI.charset=UTF-8" we get the 
expected behavior.
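
To illustrate what a charset switch of this kind changes, here is a minimal sketch. It is not Jetty's code: the class PathDecodingSketch and the method decodePath are made up for illustration, and only the property name and the ISO-8859-1 default come from the description above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative only: NOT Jetty's implementation. It mimics a container that
// picks its URL-decoding charset from a system property.
final class PathDecodingSketch {

    // Property name taken from the message above; the fallback mirrors the
    // ISO-8859-1 default described there.
    static final String CHARSET_PROPERTY = "org.mortbay.util.URI.charset";

    static String decodePath(String rawPath) {
        String charset = System.getProperty(CHARSET_PROPERTY, "ISO-8859-1");
        try {
            // Caveat: URLDecoder also turns '+' into a space, which is form
            // decoding rather than strict path decoding.
            return URLDecoder.decode(rawPath, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "/%e8%b0%b7%e7%90%86%e5%ad%90";
        // Prints 10 with the default charset, 4 when the property is UTF-8.
        System.out.println(decodePath(raw).length());
    }
}
```

With the default, each of the nine escaped bytes becomes its own character, so no sitemap pattern written with the real three characters can match.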

How about setting it up as the default behavior for Cocoon's internal 
Jetty distro?

	Pier


Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
References to non-hack:

http://www.w3.org/International/O-URL-and-ident

	Pier

On 16 Aug 2004, at 14:02, Pier Fumagalli wrote:

> Ok, I tracked the sucker down... It's the servlet container... They  
> all decode the stupid URL using ISO-8859-1... And therefore, utterly  
> incompatible with 3/4 of the non-english-speaking world...
>
> At best, I was able to _HACK_ the whole thing through, by getting the  
> path info in this way:
>
> <WARNING note="shit-code-follows">
>
> new String(request.getPathInfo().getBytes("ISO-8859-1"), "UTF-8");
>
> </WARNING>
>
> Therefore, I get the BYTES of the path-info string as if they were in  
> ISO-8859-1, and re-create a new string by taking those bytes and  
> forcing them to be in UTF-8...
>
> Niiiiiiiiiiiiiiiiiiice!
>
> Note that this stupidity also happens with accented letters (that for  
> us Italians is a big p-i-t-a).
>
> I'll see why this happens in Jetty, I'll poke Jen and Greg to have  
> either a fix, or an explanation and workaround... For now, brrrr, I
> think that the hack is the only way to go...
>
> Oh, I checked it also on Tomcat. Same problem there as well...
>
> 	Pier
>
>
>
> On 16 Aug 2004, at 12:05, Marc Portier wrote:
>
>> Pier,
>>
>>
>> As a coincidence we recently (last week) had a similar post on  
>> xreporter-list (which uses cocoon)
>>
>> Bad news is that I didn't track it down to the bottom yet, just some  
>> findings below:
>> (in fact the odd-char-in-filename for map:read and map:mount was one  
>> of the first things I was going to test, seems I'm already presented  
>> with the results)
>>
>>
>> what I did find already was this:
>>
>> Cocoon's Request.getSitemapURI() will return an assembly of  
>> javax.servlet.http.HttpServletRequest.getServletPath()
>> + javax.servlet.http.HttpServletRequest.getPathInfo()
>>
>> Servlet spec on those states they will be (url-) decoded
>> Thus 3 char sequences of the kind "%BYTE_HEX" will have been  
>> translated into single bytes. The obtained byte-sequence is then  
>> decoded using SOME_DECODING (my guess would be using ISO-8859-1, but  
>> haven't found yet if this is container specific, modifiable or hard  
>> noted in some spec. Only thing I found is this:  
>> http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but  
>> I'm yet unsure on how this influences servlet specs, or actual  
>> container and even browser implementations for that matter)
>>
>>
>> Alternatively there is:
>> Cocoon's Request.getRequestURI() which maps onto the
>> javax.servlet.http.HttpServletRequest.getRequestURI()
>>
>> This one resembles the URI as transferred over the wire: ie. not  
>> (url-)decoded, or in other words still holding the %XX sequences
>>
>>
>> As an extra clarification on all these the servlet spec explicitly
>> states: (2.3 version, page 34, section SRV4.4 Request Path Elements)
>> <quote>
>> It is important to note that, *except for URL encoding differences*  
>> between the request URI and the path parts, the following equation is  
>> always true:
>>
>> requestURI = contextPath + servletPath + pathInfo
>> </quote>
>>
>>
>> I (for now) assume that this is the same encoding we expect  
>> cocoon-deploy people to specify in the 'container-encoding'  
>> init-parameter in the web.xml (allowing to correctly en-re-decode  
>> request-parameter-values in case of mismatching form and container
>> encodings)
>>
>>
>>
>>
>> Ok, above is dull data, and not much into a direction of any solution  
>> yet.  My current feeling (long shot, needs time to test and try, and  
>> based on above assumption) is that we should
>>
>> In terms of backwards compatibility I'm unsure if we could just go  
>> about changing the semantics (historically implied use of iso-8859-1
>> encoding) of getSitemapURI() or rather should deprecate and/or have a  
>> different method next to it?
>>
>> In any case this new implementation should then probably apply the  
>> same kind of dirty en-re-decoding-trick
>>
>> return new String(getSitemapURI().getBytes(container_encoding), form_encoding);
>>
>> as we do today with the request param values?
>>
>> (see  
>> http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/ 
>> cocoon/environment/http/HttpRequest.java?annotate=1.11#391
>> sorry for the old cvs-style link, the svn version of viewcvs doesn't  
>> seem to support 'annotate' ?)
>>
>>
>> For the record: the fast hack/workaround in the xreporter case was  
>> exactly to apply this.
>>
>>
>>
>>
>> Attached to this I'm also seeing the trouble of mount-points in  
>> cocoon.   I've seen a number of installations needing (well, 'using'
>> at least) some insertion of that
>> part-of-the-URL-that-maps-to-the-mounted-sitemap to be able to have
>> links in source xml files refer to other resources managed by the
>> same mounted sitemap without the need to explicitly mention that
>> part (but have it dynamically inserted by some xsl instead).
>>
>> In those occasions I've seen people mostly subtract siteMapURI from  
>> requestURI to obtain that prefix part. Regarding the above  
>> observations this algorithm will however fail due to encoding  
>> differences.
>>
>> My proposal would be to not only add a method for decoding the  
>> sitemapURI properly, but in the mean time adding the convenience  
>> method to return the mounted-sitemap-part as well on the level of  
>> cocoon's request.
>>
>>
>>
>> Above are early observations that need some backing, so comments  
>> welcome. (and hoping someone beats me to this since I'm lacking the  
>> time to pursue myself)
>> -marc=
>>
>>
>> Pier Fumagalli wrote:
>>> On 12 Aug 2004, at 12:45, roy huang wrote:
>>>> Hi,all:
>>>>     Use reader to display jpg or gif is quite simple,like:
>>>>    <map:match pattern="*.jpg">
>>>>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>>>>    </map:match>
>>>>    But if the file name is not ASCII but utf-8 or other encoding  
>>>> like 花.jpg (simplified Chinese),the resolver didn't resolve the  
>>>> name correctly,error occur:
>>>> org.apache.cocoon.ResourceNotFoundException: Error during resolving  
>>>> of the input stream:  
>>>> org.apache.excalibur.source.SourceNotFoundException: file:/C:/My  
>>>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg  
>>>> doesn't exist.
>>>>
>>>> How can I use non-ASCII file name in cocoon?I can't find any  
>>>> description or help in wiki or archived mail list.
>>>>
>>>> Roy Huang
>>> It appears indeed as a bug...
>>> I have this sitemap snippet:
>>>     <map:match pattern="谷*">
>>>       <map:generate src="谷{1}.xml"/>
>>>       <map:transform src="welcome.xslt">
>>>         <map:parameter name="contextPath"  
>>> value="{request:contextPath}"/>
>>>       </map:transform>
>>>       <map:serialize type="xhtml"/>
>>>     </map:match>
>>> and a file on the disk called "谷理子.xml". Somewhere, when I make a  
>>> request for "http://localhost:8888/谷理子", the whole thing goes  
>>> berserk...
>>> Now, the URL is passed correctly, as I see that in the access log:
>>> INFO    (2004-08-16) 10:26.36:538   [access]  
>>> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????'  
>>> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
>>> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0  
>>> B7 E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow  
>>> it gets lost in the process.
>>> Now, if I modify my sitemap to
>>>     <map:match pattern="tanisatoko">
>>>       <map:generate src="谷理子.xml"/>
>>>       <map:transform src="welcome.xslt">
>>>         <map:parameter name="contextPath"  
>>> value="{request:contextPath}"/>
>>>       </map:transform>
>>>       <map:serialize type="xhtml"/>
>>>     </map:match>
>>> And I make a request to "http://localhost:8888/tanisatoko", the  
>>> thing works perfectly. We can safely exclude the fact that it's the  
>>> generation process.
>>> Now, the _odd_ thing I noticed is that in those cases, I get an  
>>> error of "PipelineNotFound", not a "ResourceNotFound", which means  
>>> that the matcher seriously doesn't see that request.
>>> Changing over the matcher to a 'regexp' matcher doesn't change, so,  
>>> I bet it's the data we feed to the matcher.
>>> Now, changing that matcher to  
>>> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", the  
>>> encoding, and running it again, I get my nice page correctly.
>>> I bet that somewhere (I don't know where, but surely somewhere), the  
>>> UTF-8 encoded URL is converted into a string using the current locale
>>> (MacRoman on my system), or a default of "ISO-8859-1", before the  
>>> string is actually given to the sitemap.
>>> Not having the sources at hand at the moment, I can't do a quick  
>>> build to put out some debugging instruction, but  you get the idea.
>>>     Pier
>>
>> -- 
>> Marc Portier                            http://outerthought.org/
>> Outerthought - Open Source, Java & XML Competence Support Center
>> Read my weblog at                http://blogs.cocoondev.org/mpo/
>> mpo@outerthought.org                              mpo@apache.org
>>

Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
Ok, I tracked the sucker down... It's the servlet container... They all
decode the stupid URL using ISO-8859-1... And are therefore utterly
incompatible with 3/4 of the non-English-speaking world...

At best, I was able to _HACK_ the whole thing through, by getting the  
path info in this way:

<WARNING note="shit-code-follows">

new String(request.getPathInfo().getBytes("ISO-8859-1"), "UTF-8");

</WARNING>

Therefore, I get the BYTES of the path-info string as if they were in  
ISO-8859-1, and re-create a new string by taking those bytes and  
forcing them to be in UTF-8...
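
Spelled out as a self-contained sketch (the class and method names are hypothetical, and the container's wrong decode is simulated here instead of coming from a real request), the round trip looks like this:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the re-decoding workaround; the class and method names are
// hypothetical and not part of Cocoon or the servlet API.
final class RedecodeSketch {

    // Undo a wrong ISO-8859-1 decode: recover the original bytes, then
    // decode them as UTF-8.
    static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the container: UTF-8 bytes of the path decoded as ISO-8859-1.
        String original = "/\u8c37\u7406\u5b50";
        String misdecoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(redecode(misdecoded).equals(original)); // prints true
    }
}
```

This only works because ISO-8859-1 maps every byte value to exactly one character, so the original bytes survive the wrong decode. Applied to a string that was already decoded correctly, the same trick would corrupt it.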

Niiiiiiiiiiiiiiiiiiice!

Note that this stupidity also happens with accented letters (which for
us Italians is a big p-i-t-a).

I'll see why this happens in Jetty, I'll poke Jen and Greg to have  
either a fix, or an explanation and workaround... For now, brrrr, I
think that the hack is the only way to go...

Oh, I checked it also on Tomcat. Same problem there as well...

	Pier



On 16 Aug 2004, at 12:05, Marc Portier wrote:

> Pier,
>
>
> As a coincidence we recently (last week) had a similar post on  
> xreporter-list (which uses cocoon)
>
> Bad news is that I didn't track it down to the bottom yet, just some  
> findings below:
> (in fact the odd-char-in-filename for map:read and map:mount was one  
> of the first things I was going to test, seems I'm already presented  
> with the results)
>
>
> what I did find already was this:
>
> Cocoon's Request.getSitemapURI() will return an assembly of  
> javax.servlet.http.HttpServletRequest.getServletPath()
> + javax.servlet.http.HttpServletRequest.getPathInfo()
>
> Servlet spec on those states they will be (url-) decoded
> Thus 3 char sequences of the kind "%BYTE_HEX" will have been  
> translated into single bytes. The obtained byte-sequence is then  
> decoded using SOME_DECODING (my guess would be using ISO-8859-1, but  
> haven't found yet if this is container specific, modifiable or hard  
> noted in some spec. Only thing I found is this:  
> http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but  
> I'm yet unsure on how this influences servlet specs, or actual  
> container and even browser implementations for that matter)
>
>
> Alternatively there is:
> Cocoon's Request.getRequestURI() which maps onto the
> javax.servlet.http.HttpServletRequest.getRequestURI()
>
> This one resembles the URI as transferred over the wire: ie. not  
> (url-)decoded, or in other words still holding the %XX sequences
>
>
> As an extra clarification on all these the servlet spec explicitly
> states: (2.3 version, page 34, section SRV4.4 Request Path Elements)
> <quote>
> It is important to note that, *except for URL encoding differences*  
> between the request URI and the path parts, the following equation is  
> always true:
>
> requestURI = contextPath + servletPath + pathInfo
> </quote>
>
>
> I (for now) assume that this is the same encoding we expect  
> cocoon-deploy people to specify in the 'container-encoding'  
> init-parameter in the web.xml (allowing to correctly en-re-decode  
> request-parameter-values in case of mismatching form and container
> encodings)
>
>
>
>
> Ok, above is dull data, and not much into a direction of any solution  
> yet.  My current feeling (long shot, needs time to test and try, and  
> based on above assumption) is that we should
>
> In terms of backwards compatibility I'm unsure if we could just go  
> about changing the semantics (historically implied use of iso-8859-1
> encoding) of getSitemapURI() or rather should deprecate and/or have a  
> different method next to it?
>
> In any case this new implementation should then probably apply the  
> same kind of dirty en-re-decoding-trick
>
> return new String(getSitemapURI().getBytes(container_encoding), form_encoding);
>
> as we do today with the request param values?
>
> (see  
> http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/ 
> cocoon/environment/http/HttpRequest.java?annotate=1.11#391
> sorry for the old cvs-style link, the svn version of viewcvs doesn't  
> seem to support 'annotate' ?)
>
>
> For the record: the fast hack/workaround in the xreporter case was  
> exactly to apply this.
>
>
>
>
> Attached to this I'm also seeing the trouble of mount-points in  
> cocoon.   I've seen a number of installations needing (well, 'using' at
> least) some insertion of that
> part-of-the-URL-that-maps-to-the-mounted-sitemap to be able to have
> links in source xml files refer to other resources managed by the same
> mounted sitemap without the need to explicitly mention that part (but
> have it dynamically inserted by some xsl instead).
>
> In those occasions I've seen people mostly subtract siteMapURI from  
> requestURI to obtain that prefix part. Regarding the above  
> observations this algorithm will however fail due to encoding  
> differences.
>
> My proposal would be to not only add a method for decoding the  
> sitemapURI properly, but in the mean time adding the convenience  
> method to return the mounted-sitemap-part as well on the level of  
> cocoon's request.
>
>
>
> Above are early observations that need some backing, so comments  
> welcome. (and hoping someone beats me to this since I'm lacking the  
> time to pursue myself)
> -marc=
>
>
> Pier Fumagalli wrote:
>> On 12 Aug 2004, at 12:45, roy huang wrote:
>>> Hi,all:
>>>     Use reader to display jpg or gif is quite simple,like:
>>>    <map:match pattern="*.jpg">
>>>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>>>    </map:match>
>>>    But if the file name is not ASCII but utf-8 or other encoding  
>>> like 花.jpg (simplified Chinese),the resolver didn't resolve the name  
>>> correctly,error occur:
>>> org.apache.cocoon.ResourceNotFoundException: Error during resolving  
>>> of the input stream:  
>>> org.apache.excalibur.source.SourceNotFoundException: file:/C:/My  
>>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg  
>>> doesn't exist.
>>>
>>> How can I use non-ASCII file name in cocoon?I can't find any  
>>> description or help in wiki or archived mail list.
>>>
>>> Roy Huang
>> It appears indeed as a bug...
>> I have this sitemap snippet:
>>     <map:match pattern="谷*">
>>       <map:generate src="谷{1}.xml"/>
>>       <map:transform src="welcome.xslt">
>>         <map:parameter name="contextPath"  
>> value="{request:contextPath}"/>
>>       </map:transform>
>>       <map:serialize type="xhtml"/>
>>     </map:match>
>> and a file on the disk called "谷理子.xml". Somewhere, when I make a  
>> request for "http://localhost:8888/谷理子", the whole thing goes  
>> berserk...
>> Now, the URL is passed correctly, as I see that in the access log:
>> INFO    (2004-08-16) 10:26.36:538   [access]  
>> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????'  
>> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
>> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7  
>> E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it  
>> gets lost in the process.
>> Now, if I modify my sitemap to
>>     <map:match pattern="tanisatoko">
>>       <map:generate src="谷理子.xml"/>
>>       <map:transform src="welcome.xslt">
>>         <map:parameter name="contextPath"  
>> value="{request:contextPath}"/>
>>       </map:transform>
>>       <map:serialize type="xhtml"/>
>>     </map:match>
>> And I make a request to "http://localhost:8888/tanisatoko", the thing  
>> works perfectly. We can safely exclude the fact that it's the  
>> generation process.
>> Now, the _odd_ thing I noticed is that in those cases, I get an error  
>> of "PipelineNotFound", not a "ResourceNotFound", which means that the  
>> matcher seriously doesn't see that request.
>> Changing over the matcher to a 'regexp' matcher doesn't change, so, I  
>> bet it's the data we feed to the matcher.
>> Now, changing that matcher to  
>> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", the  
>> encoding, and running it again, I get my nice page correctly.
>> I bet that somewhere (I don't know where, but surely somewhere), the  
>> UTF-8 encoded URL is converted into a string using the current locale
>> (MacRoman on my system), or a default of "ISO-8859-1", before the  
>> string is actually given to the sitemap.
>> Not having the sources at hand at the moment, I can't do a quick  
>> build to put out some debugging instruction, but  you get the idea.
>>     Pier
>
> -- 
> Marc Portier                            http://outerthought.org/
> Outerthought - Open Source, Java & XML Competence Support Center
> Read my weblog at                http://blogs.cocoondev.org/mpo/
> mpo@outerthought.org                              mpo@apache.org
>

Re: [Help]How can I use non-ascii file name?

Posted by Marc Portier <mp...@outerthought.org>.
Pier,


As a coincidence we recently (last week) had a similar post on 
xreporter-list (which uses cocoon)

Bad news is that I didn't track it down to the bottom yet, just some 
findings below:
(in fact the odd-char-in-filename for map:read and map:mount was one of 
the first things I was going to test, seems I'm already presented with 
the results)


what I did find already was this:

Cocoon's Request.getSitemapURI() will return an assembly of 
javax.servlet.http.HttpServletRequest.getServletPath()
+ javax.servlet.http.HttpServletRequest.getPathInfo()

The servlet spec states that both of these are (url-)decoded.
Thus 3-char sequences of the kind "%BYTE_HEX" will have been translated
into single bytes. The resulting byte sequence is then decoded using
SOME_DECODING (my guess would be ISO-8859-1, but I haven't found yet
whether this is container-specific, modifiable or fixed by some spec.
The only thing I found is this:
http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but I'm
not yet sure how this influences servlet specs, or actual container and
even browser implementations for that matter).
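
A minimal sketch of those two stages (percent-escapes to bytes, then bytes to characters) follows. The helper class is hypothetical and only meant to make the stages visible. It is not how any particular container implements them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: separates the two decoding
// stages (percent-escapes to bytes, then bytes to characters).
final class TwoStageDecodeSketch {

    // Stage 1: turn each "%XX" escape into one byte; all other characters
    // are assumed to be plain ASCII and copied through unchanged.
    static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()) {
                out.write(Integer.parseInt(raw.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = percentDecode("%e8%b0%b7%e7%90%86%e5%ad%90");
        // Stage 2: the same nine bytes give different strings per charset.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length()); // prints 9
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length());      // prints 3
    }
}
```

Stage 1 is unambiguous. The open question is only which charset is applied in stage 2.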


Alternatively there is:
Cocoon's Request.getRequestURI() which maps onto the
javax.servlet.http.HttpServletRequest.getRequestURI()

This one reflects the URI as transferred over the wire, i.e. not
(url-)decoded, or in other words still holding the %XX sequences


As an extra clarification on all these the servlet spec explicitly
states: (2.3 version, page 34, section SRV4.4 Request Path Elements)
<quote>
It is important to note that, *except for URL encoding differences* 
between the request URI and the path parts, the following equation is 
always true:

requestURI = contextPath + servletPath + pathInfo
</quote>


I (for now) assume that the encoding used for this path decoding is the 
same one we expect cocoon-deploy people to specify in the 
'container-encoding' init-parameter in the web.xml (which allows us to 
correctly en-re-decode request-parameter values when the form and 
container encodings don't match).




Ok, above is dull data, and not much into a direction of any solution 
yet.  My current feeling (long shot, needs time to test and try, and 
based on above assumption) is that we should

In terms of backwards compatibility I'm unsure whether we could just go 
about changing the semantics (historically implied use of ISO-8859-1 
encoding) of getSitemapURI(), or should rather deprecate it and/or add a 
different method next to it.

In any case this new implementation should then probably apply the same 
kind of dirty en-re-decoding-trick

return new String(getSitemapURI().getBytes(container_encoding), form_encoding);

as we do today with the request param values?

(see 
http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/cocoon/environment/http/HttpRequest.java?annotate=1.11#391
sorry for the old cvs-style link, the svn version of viewcvs doesn't 
seem to support 'annotate' ?)
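
A sketch of that trick in plain Java, for illustration only. The class 
and method below are hypothetical and are not the actual HttpRequest 
code; container_encoding and form_encoding from the line above are 
simply passed in as parameters.

```java
import java.nio.charset.Charset;

// Hypothetical helper showing the "en-re-decoding" trick in isolation.
class SitemapUriRedecoder {

    /**
     * Recovers the original bytes by encoding the string with the charset
     * the container used, then decodes those bytes with the charset the
     * client really used.
     */
    static String redecode(String decodedByContainer,
                           String containerEncoding,
                           String formEncoding) {
        byte[] raw = decodedByContainer.getBytes(Charset.forName(containerEncoding));
        return new String(raw, Charset.forName(formEncoding));
    }

    public static void main(String[] args) {
        // What an ISO-8859-1 container makes of %e8%b0%b7%e7%90%86%e5%ad%90
        String garbled = "\u00e8\u00b0\u00b7\u00e7\u0090\u0086\u00e5\u00ad\u0090";
        String fixed = redecode(garbled, "ISO-8859-1", "UTF-8");
        System.out.println(fixed.length()); // 3 characters after re-decoding
    }
}
```

This only round-trips safely when the container encoding maps every 
byte to a distinct character, as ISO-8859-1 does. With a container 
encoding that cannot represent some bytes, the original bytes are 
already lost and the trick cannot bring them back.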


For the record: the fast hack/workaround in the xreporter case was 
exactly to apply this.




Related to this I'm also seeing trouble with mount-points in cocoon. 
I've seen a number of installations needing (well, 'using' at least) 
some insertion of that part-of-the-URL-that-maps-to-the-mounted-sitemap 
so that links in source xml files can refer to other resources 
managed by the same mounted sitemap without the need to explicitly 
mention that part (having it dynamically inserted by some xsl instead).

On those occasions I've mostly seen people subtract the sitemapURI from 
the requestURI to obtain that prefix part. Given the above observations, 
however, this algorithm will fail due to encoding differences.
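
A small Java sketch of why the subtraction breaks, and one possible way 
around it. Everything here is an assumption made for illustration: the 
"/mount/" prefix, the class and method names, and the use of 
java.net.URLDecoder to bring the request URI into decoded form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo of computing the mounted-sitemap prefix.
class MountPrefixDemo {

    /** The common approach: strip sitemapUri from the end of requestUri. */
    static String naivePrefix(String requestUri, String sitemapUri) {
        if (!requestUri.endsWith(sitemapUri)) {
            return null; // no common suffix, the subtraction is impossible
        }
        return requestUri.substring(0, requestUri.length() - sitemapUri.length());
    }

    /** Decodes requestUri first so both strings are in the same form. */
    static String decodedPrefix(String requestUri, String sitemapUri, String charset) {
        try {
            return naivePrefix(URLDecoder.decode(requestUri, charset), sitemapUri);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String requestUri = "/mount/%e8%b0%b7%e7%90%86%e5%ad%90"; // still %XX-encoded
        String sitemapUri = "\u8c37\u7406\u5b50";                   // already decoded

        System.out.println(naivePrefix(requestUri, sitemapUri));            // null
        System.out.println(decodedPrefix(requestUri, sitemapUri, "UTF-8")); // /mount/
    }
}
```

The returned prefix is in decoded form, so it would need encoding again 
before being written into a link if it contains non-ASCII characters.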

My proposal would be to not only add a method for decoding the 
sitemapURI properly, but at the same time add a convenience method on 
cocoon's request that returns the mounted-sitemap part as well.



Above are early observations that need some backing, so comments 
welcome. (and hoping someone beats me to this since I'm lacking the time 
to pursue myself)
-marc=


Pier Fumagalli wrote:
> On 12 Aug 2004, at 12:45, roy huang wrote:
> 
>> Hi,all:
>>     Use reader to display jpg or gif is quite simple,like:
>>    <map:match pattern="*.jpg">
>>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>>    </map:match>
>>    But if the file name is not ASCII but utf-8 or other encoding like 
>> 花.jpg (simplified Chinese),the resolver didn't resolve the name 
>> correctly,error occur:
>> org.apache.cocoon.ResourceNotFoundException: Error during resolving of 
>> the input stream: org.apache.excalibur.source.SourceNotFoundException: 
>> file:/C:/My 
>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg 
>> doesn't exist.
>>
>> How can I use non-ASCII file name in cocoon?I can't find any 
>> description or help in wiki or archived mail list.
>>
>> Roy Huang
> 
> 
> It appears indeed as a bug...
> 
> I have this sitemap snippet:
> 
>     <map:match pattern="谷*">
>       <map:generate src="谷{1}.xml"/>
>       <map:transform src="welcome.xslt">
>         <map:parameter name="contextPath" value="{request:contextPath}"/>
>       </map:transform>
>       <map:serialize type="xhtml"/>
>     </map:match>
> 
> and a file on the disk called "谷理子.xml". Somewhere, when I make a 
> request for "http://localhost:8888/谷理子", the whole thing goes berserk...
> 
> Now, the URL is passed correctly, as I see that in the access log:
> 
> INFO    (2004-08-16) 10:26.36:538   [access] 
> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????' 
> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
> 
> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7 E7 
> 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it gets 
> lost in the process.
> 
> Now, if I modify my itemap to
> 
>     <map:match pattern="tanisatoko">
>       <map:generate src="谷理子.xml"/>
>       <map:transform src="welcome.xslt">
>         <map:parameter name="contextPath" value="{request:contextPath}"/>
>       </map:transform>
>       <map:serialize type="xhtml"/>
>     </map:match>
> 
> And I make a request to "http://localhost:8888/tanisatoko", the thing 
> works perfectly. We can safely exclude the fact that it's the generation 
> process.
> 
> Now, the _odd_ thing I noticed is that in those cases, I get an error of 
> "PipelineNotFound", not a "ResourceNotFound", which means that the 
> matcher seriously doesn't see that request.
> 
> Changing over the matcher to a 'regexp' matcher doesn't change, so, I 
> bet it's the data we feed to the matcher.
> 
> Now, changing that matcher to 
> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", the encoding, 
> and running it again, I get my nice page correctly.
> 
> I bet that somewhere (I don't know where, but surely somewhere), the 
> UTF-8 encoded URL converted into a string using the current locale 
> (MacRoman on my system), or a default of "ISO-8859-1", before the string 
> is actually given to the sitemap.
> 
> Not having the sources at hand at the moment, I can't do a quick build 
> to put out some debugging instruction, but  you get the idea.
> 
>     Pier
> 

-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: [Help]How can I use non-ascii file name?

Posted by Pier Fumagalli <pi...@betaversion.org>.
On 12 Aug 2004, at 12:45, roy huang wrote:

> Hi,all:
>     Use reader to display jpg or gif is quite simple,like:
>    <map:match pattern="*.jpg">
>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>    </map:match>
>    But if the file name is not ASCII but utf-8 or other encoding like 
> 花.jpg (simplified Chinese),the resolver didn't resolve the name 
> correctly,error occur:
> org.apache.cocoon.ResourceNotFoundException: Error during resolving of 
> the input stream: org.apache.excalibur.source.SourceNotFoundException: 
> file:/C:/My 
> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg 
> doesn't exist.
>
> How can I use non-ASCII file name in cocoon?I can't find any 
> description or help in wiki or archived mail list.
>
> Roy Huang

It does indeed appear to be a bug...

I have this sitemap snippet:

     <map:match pattern="谷*">
       <map:generate src="谷{1}.xml"/>
       <map:transform src="welcome.xslt">
         <map:parameter name="contextPath" value="{request:contextPath}"/>
       </map:transform>
       <map:serialize type="xhtml"/>
     </map:match>

and a file on the disk called "谷理子.xml". Somewhere, when I make a 
request for "http://localhost:8888/谷理子", the whole thing goes 
berserk...

Now, the URL is passed correctly, as I see that in the access log:

INFO    (2004-08-16) 10:26.36:538   [access] 
(/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????' 
Processed by Apache Cocoon 2.1.5 in 27 milliseconds.

The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7 
E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it 
gets lost in the process.
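
As a quick cross-check of those bytes, here is a minimal Java sketch 
that percent-encodes the UTF-8 bytes of a string. The class and method 
names are made up for this example, and the method encodes every byte 
(including ASCII), which a real URL encoder would not do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes every UTF-8 byte of a string as %xx.
class Utf8PercentEncoding {

    static String percentEncodeUtf8(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append('%').append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three characters of the file name, written as escapes so the
        // source file compiles regardless of its own encoding.
        System.out.println(percentEncodeUtf8("\u8c37\u7406\u5b50"));
        // prints %e8%b0%b7%e7%90%86%e5%ad%90
    }
}
```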

Now, if I modify my sitemap to

     <map:match pattern="tanisatoko">
       <map:generate src="谷理子.xml"/>
       <map:transform src="welcome.xslt">
         <map:parameter name="contextPath" value="{request:contextPath}"/>
       </map:transform>
       <map:serialize type="xhtml"/>
     </map:match>

And I make a request to "http://localhost:8888/tanisatoko", the thing 
works perfectly. So we can safely rule out the generation process as 
the culprit.

Now, the _odd_ thing I noticed is that in those cases, I get an error 
of "PipelineNotFound", not a "ResourceNotFound", which means that the 
matcher seriously doesn't see that request.

Changing the matcher over to a 'regexp' matcher doesn't change anything, 
so I bet it's the data we feed to the matcher.

Now, changing that matcher's pattern to 
"&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", i.e. the 
UTF-8 bytes each written as a separate character, and running it again, 
I get my nice page correctly.
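
Why that pattern matches can be reproduced outside Cocoon. The sketch 
below (plain Java, hypothetical class and method names) decodes the 
UTF-8 bytes of the file name as ISO-8859-1 and compares the result with 
the nine characters that the character references in the pattern stand 
for. It is only a model of the suspected mis-decoding, not a trace of 
what the container actually does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: UTF-8 bytes read as ISO-8859-1 give one char per byte.
class MojibakePatternDemo {

    static String misdecodeAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The characters U+00E8, U+00B0, ... that the pattern spells out.
        String pattern = "\u00e8\u00b0\u00b7\u00e7\u0090\u0086\u00e5\u00ad\u0090";
        String seenBySitemap = misdecodeAsLatin1("\u8c37\u7406\u5b50");
        System.out.println(seenBySitemap.equals(pattern)); // true
    }
}
```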

I bet that somewhere (I don't know where, but surely somewhere), the 
UTF-8 encoded URL is converted into a string using the current locale 
(MacRoman on my system), or a default of "ISO-8859-1", before the 
string is actually given to the sitemap.

Not having the sources at hand at the moment, I can't do a quick build 
to put in some debugging statements, but you get the idea.

	Pier

Re: [Help]How can I use non-ascii file name?

Posted by roy huang <li...@hotmail.com>.
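Sorry, it should be:
function display(){
 name=cocoon.request.getParameter("name");
 name1=new Packages.java.lang.String(name);
 // re-read the ISO-8859-1 bytes using the platform default charset
 name2=new Packages.java.lang.String(name1.getBytes("ISO-8859-1"));
 cocoon.sendPage(name2);
}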
 

----- Original Message ----- 
From: "roy huang" <li...@hotmail.com>
To: <de...@cocoon.apache.org>
Sent: Thursday, September 02, 2004 7:13 PM
Subject: Re: [Help]How can I use non-ascii file name?


> After reading all the following mail,I finally using flowscript to solve this problem(thought I don't like this way)
> sitemap:
>    <map:match pattern="images">
>     <map:call function="display" >
>     </map:call>
>    </map:match>
>    <map:match pattern="*.jpg">
>     <map:read mime-type="image/jpg" src="jpg/花.jpg" />
>    </map:match>
> flowscript:
> function display(){
> name=cocoon.request.getParameter("name");
> name1=new Packages.java.lang.String(name);
> cocoon.sendPage(name1);
> }
> 
> it works,if you want to decode it,you can also :
> name2=new Packages.java.lang.String(name1.getBytes("ISO-8859-1"));
> 
> Thought,I don't like this way,just post it hope it is helpful for somebody.
> 
> Roy Huang
> ----- Original Message ----- 
> From: "roy huang" <li...@hotmail.com>
> To: <us...@cocoon.apache.org>; <de...@cocoon.apache.org>
> Sent: Thursday, August 12, 2004 7:45 PM
> Subject: [Help]How can I use non-ascii file name?
> 
> 
> > Hi,all:
> >     Use reader to display jpg or gif is quite simple,like:
> >    <map:match pattern="*.jpg">
> >     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
> >    </map:match>
> >    But if the file name is not ASCII but utf-8 or other encoding like 花.jpg (simplified Chinese),the resolver didn't resolve the name correctly,error occur:
> > org.apache.cocoon.ResourceNotFoundException: Error during resolving of the input stream: org.apache.excalibur.source.SourceNotFoundException: file:/C:/My Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg doesn't exist.
> > 
> > How can I use non-ASCII file name in cocoon?I can't find any description or help in wiki or archived mail list. 
> > 
> > Roy Huang

Re: [Help]How can I use non-ascii file name?

Posted by roy huang <li...@hotmail.com>.
After reading all the mail in this thread, I finally used flowscript to solve this problem (though I don't like this way).
sitemap:
   <map:match pattern="images">
    <map:call function="display" >
    </map:call>
   </map:match>
   <map:match pattern="*.jpg">
    <map:read mime-type="image/jpg" src="jpg/花.jpg" />
   </map:match>
flowscript:
function display(){
name=cocoon.request.getParameter("name");
name1=new Packages.java.lang.String(name);
cocoon.sendPage(name1);
}

It works. If you want to decode it, you can also use:
name2=new Packages.java.lang.String(name1.getBytes("ISO-8859-1"));
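
One thing to be aware of with the line above: the one-argument String 
constructor decodes the bytes with the JVM's platform default charset, 
so the result can differ from machine to machine. The Java sketch below 
(hypothetical class and method names, not part of the flowscript) shows 
the difference between relying on the default and naming the target 
charset; using UTF-8 as the target is an assumption about how the 
browser encoded the request.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo comparing default-charset and explicit re-decoding.
class ExplicitCharsetDemo {

    /** Same as the flowscript line: the result depends on the platform default. */
    static String redecodeWithDefault(String name) {
        return new String(name.getBytes(StandardCharsets.ISO_8859_1));
    }

    /** Names the target charset, so the result is the same on every machine. */
    static String redecodeAs(String name, Charset target) {
        return new String(name.getBytes(StandardCharsets.ISO_8859_1), target);
    }

    public static void main(String[] args) {
        // UTF-8 bytes E8 8A B1 (the character U+82B1) read as ISO-8859-1.
        String garbled = "\u00e8\u008a\u00b1";
        System.out.println(Charset.defaultCharset());
        System.out.println(redecodeAs(garbled, StandardCharsets.UTF_8).length()); // 1
    }
}
```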

Though I don't like this way, I'm posting it in the hope that it is helpful for somebody.

Roy Huang
----- Original Message ----- 
From: "roy huang" <li...@hotmail.com>
To: <us...@cocoon.apache.org>; <de...@cocoon.apache.org>
Sent: Thursday, August 12, 2004 7:45 PM
Subject: [Help]How can I use non-ascii file name?


> Hi,all:
>     Use reader to display jpg or gif is quite simple,like:
>    <map:match pattern="*.jpg">
>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>    </map:match>
>    But if the file name is not ASCII but utf-8 or other encoding like 花.jpg (simplified Chinese),the resolver didn't resolve the name correctly,error occur:
> org.apache.cocoon.ResourceNotFoundException: Error during resolving of the input stream: org.apache.excalibur.source.SourceNotFoundException: file:/C:/My Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg doesn't exist.
> 
> How can I use non-ASCII file name in cocoon?I can't find any description or help in wiki or archived mail list. 
> 
> Roy Huang

Re: [Help]How can I use non-ascii file name?

Posted by "Volkm@r" <pl...@arcor.de>.
roy huang wrote:
> [...]
> How can I use non-ASCII file name in cocoon?I can't find any description or help in wiki or archived mail list. 

Not yet tested. But maybe the SetCharacterEncodingAction described in 
<http://wiki.apache.org/cocoon/RequestParameterEncoding> would help.

-- 
Volkmar W. Pogatzki



