Posted to users@cocoon.apache.org by Jan Hoskens <jh...@schaubroeck.be> on 2004/02/13 11:32:46 UTC

Bug? Reading File Source

Hi,

I ran into some problems with special characters in my flowscript and managed
to fix the first one. It occurred when loading a document the way the Woody
binding sample proposes:

        source = resolver.resolveURI(uri);  // resolving works fine
        var is = new Packages.org.xml.sax.InputSource(source.getInputStream());
        is.setSystemId(source.getURI());
        return parser.parseDocument(is);    // crashes here

Whenever an 'Ü' appeared in the filename, I got an exception about illegal
UTF-8 characters. I worked around it with an encoding function that makes sure
the URI string is UTF-8 encoded:

        source = resolver.resolveURI(uri);
        // just another way to access the file
        var file = new java.io.File(new java.net.URI(encodeURI(source.getURI())));
        var is = new Packages.org.xml.sax.InputSource(new java.io.FileReader(file));
        return parser.parseDocument(is);

The encodeURI() function essentially does this: it splits the URI on '/' so
that the separators are preserved, runs
java.net.URLEncoder.encode(part, "UTF-8") on each piece (the directory and
file names), and then replaces every '+' (which stands for a space) with '%20'.
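
The steps described above can be sketched in Java roughly as follows. This is
a minimal sketch, not the poster's actual function (the thread does not show
it): the class name UriPathEncoder, the method names, and the way a leading
scheme such as "file:" is left untouched are assumptions made for
illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the names are illustrative and not from the thread.
class UriPathEncoder {

    // Encodes each '/'-separated piece of a URI as UTF-8 percent escapes,
    // leaving a leading scheme such as "file:" and all '/' separators alone.
    static String encodeUri(String uri) {
        int colon = uri.indexOf(':');
        int slash = uri.indexOf('/');
        // Assumption: a ':' that comes before the first '/' ends the scheme.
        boolean hasScheme = colon > 0 && (slash < 0 || colon < slash);
        String scheme = hasScheme ? uri.substring(0, colon + 1) : "";
        String rest = hasScheme ? uri.substring(colon + 1) : uri;

        String[] parts = rest.split("/", -1); // -1 keeps empty pieces
        StringBuilder out = new StringBuilder(scheme);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            out.append(encodePart(parts[i]));
        }
        return out.toString();
    }

    // Encodes one directory or file name.
    static String encodePart(String part) {
        try {
            // URLEncoder does form encoding, so a space comes back as '+';
            // a literal '+' in the input has already become %2B by then.
            return URLEncoder.encode(part, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeUri("file:/data/my docs/AYG\u00DCL.xml"));
    }
}
```

For the hypothetical input in main this prints
file:/data/my%20docs/AYG%C3%9CL.xml. Note that URLEncoder escapes more than a
URI path strictly needs (':' and '~', for example), so the result is
conservative rather than minimal.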

This works and my file is loaded correctly.
I thought I had overcome the special character problem, but no, I hadn't! I
tried to read a directory of XML files and aggregate them into one big XML
document so that I can create a single PDF file, but this failed again because
of the special character 'Ü' in my filename. I tried two combinations:
A) directory generator with an XSL that creates includes, followed by the
include transformer
B) the easy way: XPathDirectoryGenerator

The first combination just crashes on the include; the second one logs a
warning and skips the file:

XPathDirectoryGenerator: Warning: Problem while reading the file AYGÜL.xml. Ignoring.
java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
 at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)

Since I get the same UTF-8 error, it seems to me that the same way of reading
a file is used here (which would be logical, as the parts are reused). So I
think that the InputSource doesn't take the special characters into account,
and that it simply crashes when the input stream is set because no conversion
is done. Isn't this a bug? Isn't it the responsibility of the InputSource
object to provide a valid input stream, even when special characters are used?
(Or maybe the Source returns an incorrect InputSource?)
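
As background to the question above: an InputSource built from an InputStream
hands raw bytes to the parser, which has to work out the encoding itself and
falls back to UTF-8 when the document declares none, whereas an InputSource
built from a Reader receives characters that have already been decoded (a
java.io.FileReader decodes with the platform default charset). The sketch
below only illustrates that difference. The class name, the tiny test document
and its ISO-8859-1 encoding are assumptions for illustration; the thread does
not establish that this is what happened with AYGÜL.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Cocoon or of the thread.
class InputSourceEncodingDemo {

    // Assumed test document: no XML declaration, bytes stored as ISO-8859-1.
    static byte[] latin1Xml() {
        return "<name>AYG\u00DCL</name>".getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte stream: the parser decodes the bytes itself, assuming UTF-8 here.
    static String parseAsByteStream(byte[] xml) {
        try {
            return parse(new InputSource(new ByteArrayInputStream(xml)));
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    // Character stream: the Reader decodes the bytes before the parser sees them.
    static String parseAsCharacterStream(byte[] xml) {
        try {
            return parse(new InputSource(new InputStreamReader(
                    new ByteArrayInputStream(xml), StandardCharsets.ISO_8859_1)));
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    // Parses the source and returns the text of the root element.
    static String parse(InputSource source) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(source);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        byte[] xml = latin1Xml();
        System.out.println("byte stream:      " + parseAsByteStream(xml));
        System.out.println("character stream: " + parseAsCharacterStream(xml));
    }
}
```

In this sketch the byte-stream variant is expected to fail with a malformed
UTF-8 error, because the single ISO-8859-1 byte for 'Ü' looks like the start of
a two-byte UTF-8 sequence, while the character-stream variant returns the
name intact.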

Greetings,
Jan


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: REPOST: Re: Bug? Reading File Source

Posted by Jan Hoskens <jh...@schaubroeck.be>.
> But all this doesn't immediately help you of course...
>
> I guess trying to avoid filenames containing non-ascii characters is a
> bad suggestion? ;-)
>

I had already thought of that, but the filenames contain names, so I would
prefer them to be spelled correctly. I thought it better to try to solve the
problem than to avoid it. I guess there will be a fix in some future version
of Cocoon? (Well, I won't wait for it, of course ;-)

Thanks for the reply!

Best Regards,

Jan



Re: REPOST: Re: Bug? Reading File Source

Posted by Bruno Dumon <br...@outerthought.org>.
AFAIK the source of this problem is a "bug" in the File.toURL method of
the Java API. (In the JDK 1.4 javadoc I see that this limitation is now
documented and an alternative method is provided.)
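
A minimal Java sketch of that alternative, assuming the method meant here is
File.toURI() (added in JDK 1.4; the reply does not name it). The class name,
the helper names and the relative path are hypothetical, and the file does not
need to exist.

```java
import java.io.File;
import java.net.URI;

// Hypothetical demo class; the path below is for illustration only.
class FileUriDemo {

    static final File FILE = new File("my docs", "AYG\u00DCL.xml");

    // File.toURI() escapes characters that are illegal in a URI, such as
    // the space; toASCIIString() additionally escapes non-ASCII characters
    // as UTF-8 percent escapes.
    static String asciiUri(File file) {
        return file.toURI().toASCIIString();
    }

    // The File(URI) constructor decodes the escapes again.
    static File fromUri(String uri) {
        return new File(URI.create(uri));
    }

    public static void main(String[] args) {
        String uri = asciiUri(FILE);
        System.out.println(uri);
        System.out.println(fromUri(uri).getName());
    }
}
```

The printed URI depends on the working directory, but it should end in
my%20docs/AYG%C3%9CL.xml, and converting it back yields the original file
name.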

The trouble with fixing this is that there are probably already people
depending on this incorrect behaviour and doing the encoding themselves,
so fixing it would lead to double-encoding for them.
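
The double-encoding concern can be illustrated with a short Java sketch. The
class and method names are hypothetical; URLEncoder and URLDecoder merely
stand in for whatever encoding callers apply themselves.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class showing what a second encoding pass does.
class DoubleEncodingDemo {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String once = encode("AYG\u00DCL.xml"); // caller encodes by hand
        String twice = encode(once);            // a "fixed" library encodes again
        System.out.println(once);
        System.out.println(twice);
        // A single decode no longer gets back to the real file name.
        System.out.println(decode(twice));
    }
}
```

The first pass gives AYG%C3%9CL.xml; the second pass escapes the '%' signs
themselves, giving AYG%25C3%259CL.xml, which a single decode turns back into
the once-encoded string rather than the file name.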

But all this doesn't immediately help you of course...

I guess trying to avoid filenames containing non-ascii characters is a
bad suggestion? ;-)

On Mon, 2004-02-16 at 12:04, Jan Hoskens wrote:
> Nobody has any remarks about this? Or is it because it was posted at the end
> of the week;-)
> 
> Or should I ask dev list?
> 
> Kind Regards,
> Jan
> 

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org



REPOST: Re: Bug? Reading File Source

Posted by Jan Hoskens <jh...@schaubroeck.be>.
Nobody has any remarks about this? Or is it because it was posted at the end
of the week;-)

Or should I ask dev list?

Kind Regards,
Jan
