Posted to j-dev@xerces.apache.org by Jonas Berlin <jo...@linuxfan.com> on 2000/10/14 01:54:53 UTC

Xalan & Xerces: Problem with pathnames (URIs) containing non-ASCII characters..

Hello ! (warning: a 151-line bug report ahead ;-)

I've just started out using xerces & xalan, but I already found a
quite annoying bug. Or at least I consider it a bug. ;-O

The problem is that it doesn't handle non-7bit-ascii characters in
pathnames correctly.

I had my software in a subdirectory called "ohjelmatyö", the "ö" being
the problem. In the directory I had a program that reads XML files,
and converts them to HTML files by using a simple XSL file.

So I run the program like:

$ java XML2HTML test.xml test.xsl test.html

And the program basically contains the following code:

    XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
    processor.process(new XSLTInputSource(args[0]), 
                      new XSLTInputSource(args[1]),
                      new XSLTResultTarget(args[2]));

However, this program fails:

test.xml; Line 0; Column 0
XSL Error: Could not parse test.xml document!
XSL Error: SAX Exception
org.apache.xalan.xslt.XSLProcessorException: File "test.xml" not found.
        at org.apache.xalan.xslt.XSLTEngineImpl.error(XSLTEngineImpl.java:1674)
        at org.apache.xalan.xslt.XSLTEngineImpl.getSourceTreeFromInput(XSLTEngineImpl.java:894)
        at org.apache.xalan.xslt.XSLTEngineImpl.process(XSLTEngineImpl.java:568)
        at XML2HTML.main(XML2HTML.java:15)

I took a 75-minute dive into the xerces sources and found that there
was a bug (as I see it) in the handling of pathnames in
org.apache.xerces.readers.DefaultEntityHandler that kept me from
succeeding in my test.

When I did some direct testing with the URI class involved, I instead
got the original exception:

org.apache.xerces.utils.URI$MalformedURIException: Path contains
invalid character: ö

.. which sadly got lost during

try { ... } catch(ABC x) { throw new DEF(); }

-style code.
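
For illustration, here is a minimal sketch of how such code could keep
the original message by passing the caught exception along as the
cause. The class and method names (LowLevelException,
HighLevelException, ChainingSketch) are hypothetical stand-ins for the
"ABC" and "DEF" above, not actual xerces or xalan classes, and the
sketch assumes a Java version with exception chaining (1.4 or later).

```java
// Hypothetical stand-in for "ABC": the low-level exception with the real reason.
class LowLevelException extends Exception {
    LowLevelException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for "DEF": the exception the caller finally sees.
class HighLevelException extends Exception {
    HighLevelException(String message, Throwable cause) {
        super(message, cause);
    }
}

class ChainingSketch {
    // Rejects any character outside 7-bit ASCII, mimicking the URI check.
    static void checkPath(String path) throws LowLevelException {
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c > 0x7F) {
                throw new LowLevelException("Path contains invalid character: " + c);
            }
        }
    }

    static void open(String path) throws HighLevelException {
        try {
            checkPath(path);
        } catch (LowLevelException e) {
            // Pass the original exception as the cause instead of dropping it.
            throw new HighLevelException("File \"" + path + "\" could not be read.", e);
        }
    }

    public static void main(String[] args) {
        try {
            open("ohjelmaty\u00f6/test.xml");
        } catch (HighLevelException e) {
            System.out.println(e.getMessage());
            System.out.println("Caused by: " + e.getCause().getMessage());
        }
    }
}
```

With the cause attached, the stack trace (or a call to getCause())
shows the "invalid character" message next to the misleading "not
found" one.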

--

But, let's start from the ground up:

The XSLTInputSource class' constructor that takes a String contains
the following javadoc:

    * Create a new input source with a system identifier (for a URL or
    * file name) -- the equivalent of creating an input source with
    * the zero-argument constructor and setting the new object's
    * SystemId property.

    * If the system identifier is a URL, it must be fully resolved.

According to this, I find that it should be perfectly legal to pass
filenames like "/absolute/path/to/file" and "relative/path/to/file" to
this constructor. The constructor calls setSystemId, which is a
method in org.xml.sax.InputSource . Its javadoc says:

    * Set the system identifier for this input source.
    *
    * <p>The system identifier is optional if there is a byte stream
    * or a character stream, but it is still useful to provide one,
    * since the application can use it to resolve relative URIs
    * and can include it in error messages and warnings (the parser
    * will attempt to open a connection to the URI only if
    * there is no byte stream or character stream specified).</p>
    *
    * <p>If the application knows the character encoding of the
    * object pointed to by the system identifier, it can register
    * the encoding using the setEncoding method.</p>
    *
    * <p>If the system ID is a URL, it must be fully resolved.</p>

There is this talk about the character encoding of the object, but a
bit of study shows that it has nothing to do with filename encoding.

After this, the created InputSource object travels in mysterious
ways (through a bunch of *Liaison and other more or less complex
classes) and makes its way down to
org.apache.xerces.readers.DefaultEntityHandler , which is the first to
actually do anything useful with the filename I originally gave to the
constructor.

We are in the startReadingFromDocument() method, which (at least in my
case) is the one that actually starts reading from the file. There it
calls [this.]expandSystemId() with the filename as its first parameter.

This method first tries to interpret it as a URI ("file://foo..",
"http://bar.." etc.), and if successful, it returns the filename, now
verified to be a valid URI (or URL if you like). That is not what
happens in my case, since it's just a plain relative filename. So the
method catches the MalformedURIException thrown due to the "invalid"
URI, and then continues by interpreting it as an ordinary file. As it
goes on, it also notices that the filename is a relative one, and
fetches the current directory from the system properties by doing
System.getProperty("user.dir") . It then creates a URI from these
components, resulting in the URI
"file://home/jonas/ohjelmatyö/test.xml".

This is where the actual bug is. Once the standard unix path/filename
has been converted into a URI, the "ö" character in the pathname is no
longer an allowed character. Instead the URI should have the form
"file://home/jonas/ohjelmaty%F6/test.xml", as the URL-encoding syntax
requires.
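
To make the expected behaviour concrete, here is a rough sketch of the
missing escaping step. It is not the xerces code; the class
PathToUriSketch and its methods are made up for this example. It
assumes unix-style paths with "/" as separator, and it leaves the
charset as a parameter, because the "%F6" above corresponds to
ISO-8859-1 while UTF-8 would give "%C3%B6" for the same character.
Note that with an empty host part a well-formed absolute file URI gets
three slashes ("file:///home/...").

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PathToUriSketch {
    // Characters left unescaped: ASCII letters, digits, the RFC 2396 "mark"
    // characters, and "/" as the path separator.
    static boolean isSafe(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9') || "-_.!~*'()/".indexOf(cp) >= 0;
    }

    // Percent-encodes every character of the path that is not "safe",
    // using the bytes the character has in the given charset.
    static String encodePath(String path, Charset charset) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < path.length()) {
            int cp = path.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80 && isSafe(cp)) {
                out.append((char) cp);
            } else {
                for (byte b : path.substring(i, i + len).getBytes(charset)) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    // Resolves a relative name against baseDir (e.g. the "user.dir"
    // property) and escapes the whole path, including the directory part.
    static String toFileUri(String baseDir, String fileName, Charset charset) {
        String path = fileName;
        if (!fileName.startsWith("/")) {
            path = baseDir.endsWith("/") ? baseDir + fileName : baseDir + "/" + fileName;
        }
        return "file://" + encodePath(path, charset);
    }

    public static void main(String[] args) {
        String userDir = System.getProperty("user.dir");
        System.out.println(toFileUri(userDir, "test.xml", StandardCharsets.UTF_8));
    }
}
```

The point is that the directory taken from the system property goes
through the same escaping as the filename itself. On newer Java
versions, new File(name).toURI().toASCIIString() should give a similar
result using UTF-8, so a hand-written encoder like this one may not be
needed there.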

--

One might argue that the filename should already be in URL-encoded
form when given to the XSLTInputSource (or similar) constructor, and
even if this is not what the javadoc states (in my opinion), I could
happily accept this. But as this case shows, the code picks parts of
the path from the system properties (the current directory), without
doing the necessary url-encoding, and there is no question about how
to interpret javadocs or the like in this case. The code simply can't
assume that the system property would already be in its url-encoded
form at entry.

Although I think that plain filenames should not be expected to be in
their url-encoded form unless preceded by the file:// "URIfyer", I
can live with whatever turns out to be your solution. Still, at least
the system property should be url-encoded.

   Please note that (seemingly) identical code also exists in an
   identically named method in
   org.apache.xerces.validators.schema.TraverseSchema .

--

I did some further testing and noticed that the system did not work
even if I url-encoded the filenames (and gave them as absolute paths,
thus not involving the system property). I did not have time to
investigate this matter further, though.

I hope you can fix all this (also the problem with the original
exception message not getting through and thus misleading the user
about what the real problem is). These are perhaps not critical bugs,
but still bugs.

Otherwise the product seems great, confirming what I had heard from my
friends...

-- 
 Jonas