Posted to j-users@xerces.apache.org by Olivier DURAND <od...@clever-age.com> on 2009/04/24 12:52:42 UTC

Accessing xml prolog via SAX

Hi!

I am trying to use Xerces to read the XML prolog in order to get the 
file encoding/version.

Unfortunately, I could not get it to work (the getEncoding() method 
returns ""). So I must have missed something, even though I have read 
the FAQ, the API docs, and previous posts:
<http://xerces.apache.org/xerces2-j/faq-sax.html#faq-6>
<http://www.nabble.com/Accessing-xml-and-doctype-declaration-via-SAX-td15992692.html#a16039477>
<http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ext/Locator2.html#getEncoding()>

So in an attempt to be as accurate as possible, here is a code sample:

    private class CustomHandler extends DefaultHandler2 {
       
        Locator2 myLocator = null;
       
        public CustomHandler() {
            super();
        }
       
        public void startDocument() {
            log.debug("Start analysing the XML document");              
        }
       
        public void startElement(String uri, String localName, String qName, Attributes attributes){
            String encoding = myLocator.getEncoding();
            log.debug("encoding: " + encoding);
        }
       
        public void setDocumentLocator(Locator2 aLocator){
            this.myLocator = aLocator;
            super.setDocumentLocator(aLocator);               
        }           
    }

    ...

       try{
            //SAX Parser instantiation
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
           
            //Handler + Locator instantiation
            CustomHandler myHandler = new CustomHandler();
            Locator2 myLocator = new Locator2Impl();
            myHandler.setDocumentLocator(myLocator);
           
            parser.parse(aFile, myHandler);
           
           
           }catch(Throwable e){
             log.debug("Error!", e); 
           }
   

For information, the XML file I am working with looks like:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
    <tag1></tag1>
    <tag2></tag2>
    ...
</root>

I hope someone can help.

Thanks in advance,

-- 
Olivier DURAND - odurand@clever-age.com
Clever Age  - http://www.clever-age.com
37, Bd des Capucines    -   75002 Paris
Tel:..................+33 1 53 34 66 10
FAX:..................+33 1 53 34 65 20


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: Accessing xml prolog via SAX

Posted by Olivier DURAND <od...@clever-age.com>.
Hi Elliotte,

Well spotted: I was actually after the real encoding! However, as you 
mentioned, this approach is only reliable 90% of the time, so I might end 
up using the declared encoding instead. From what I understand, I can 
read this information from the Locator once startElement() is invoked.

To summarise, the mistakes I was making were:
 * I was instantiating the Locator myself instead of letting the parser
   supply it.
 * I declared the handler's setDocumentLocator() method with the Locator2
   type, so it did not override setDocumentLocator(Locator).
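
For readers of the archive, the two fixes listed above could look roughly
like the sketch below. It is only an illustration, not the code Olivier
ended up with: the class name DeclaredEncodingHandler and the helper
readEncoding() are made up for the example, and it assumes the parser hands
over a Locator that also implements Locator2 (hence the instanceof check).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.Locator2;

// Hypothetical handler name; not part of SAX or Xerces.
class DeclaredEncodingHandler extends DefaultHandler2 {
    private Locator locator;   // supplied by the parser, never created by us
    private String encoding;   // filled in on the first startElement()

    // The parameter type must be org.xml.sax.Locator; with Locator2 this
    // would be an overload that the parser never calls.
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        // Locator2 is an optional SAX extension, so check before casting.
        if (encoding == null && locator instanceof Locator2) {
            encoding = ((Locator2) locator).getEncoding();
        }
    }

    String getEncoding() {
        return encoding;
    }

    // Assumed helper: parse a byte stream and return what the locator reported.
    static String readEncoding(InputStream in) throws Exception {
        DeclaredEncodingHandler handler = new DeclaredEncodingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.getEncoding();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>"
                + "<root><tag1></tag1></root>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("encoding: " + readEncoding(new ByteArrayInputStream(xml)));
    }
}
```

Note that no Locator2Impl is created anywhere: the handler only stores
whatever the parser passes to setDocumentLocator().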

Thanks a lot for your help.

Olivier DURAND - odurand@clever-age.com
Clever Age  - http://www.clever-age.com
37, Bd des Capucines    -   75002 Paris
Tel:..................+33 1 53 34 66 10
FAX:..................+33 1 53 34 65 20



Elliotte Harold wrote:
> Do you want the declared encoding or the real encoding? If the latter, see here:
>
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
>   



RE: Accessing xml prolog via SAX

Posted by Adam Retter <Ad...@landmark.co.uk>.
Sorry wrong email ;-)

 
Adam Retter
Software Developer
Landmark Information Group
 
T: 01392 685403 (x5403) 
 
5-7 Abbey Court, Eagle Way, Sowton,
Exeter, Devon, EX2 7HY
 
www.landmark.co.uk
 




RE: Accessing xml prolog via SAX

Posted by Adam Retter <Ad...@landmark.co.uk>.
Can I let you know at around 4pm?


Got a meeting first....

 
Adam Retter
Software Developer
Landmark Information Group
 
T: 01392 685403 (x5403) 
 
5-7 Abbey Court, Eagle Way, Sowton,
Exeter, Devon, EX2 7HY
 
www.landmark.co.uk
 



Re: Accessing xml prolog via SAX

Posted by Olivier DURAND <od...@clever-age.com>.
Hi Michael,

I can confirm this. In my case I am working with ANSI documents, and the 
encoding read in the startDocument() method was consistently "UTF-8", 
which is wrong.

So the best bet is to read the prolog, or otherwise to rely on the 
parser's guess.
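
One way to "read the prolog" directly is to look at the first bytes of the
file and pull the encoding pseudo-attribute out of the XML declaration. The
sketch below is only a rough illustration under stated assumptions:
PrologSniffer and declaredEncoding() are hypothetical names, the document is
assumed to be in an ASCII-compatible encoding (so UTF-16 files are not
handled), and the declaration is assumed to fit in the first 256 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of SAX or Xerces.
final class PrologSniffer {
    // Matches an XML declaration at the start of the text and captures the
    // value of its encoding pseudo-attribute in group 2.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml\\s[^>]*?encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    // Returns the declared encoding, or null if there is no declaration
    // or the declaration has no encoding pseudo-attribute.
    static String declaredEncoding(InputStream in) throws IOException {
        byte[] buf = new byte[256];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        // Skip a UTF-8 byte order mark (EF BB BF) if present.
        if (head.startsWith("\u00EF\u00BB\u00BF")) {
            head = head.substring(3);
        }
        Matcher m = ENCODING.matcher(head);
        return m.lookingAt() ? m.group(2) : null;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n<root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc)));
    }
}
```

This only reports what the document claims about itself; it does not check
that the bytes really are in that encoding.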

BR,

Olivier DURAND - odurand@clever-age.com
Clever Age  - http://www.clever-age.com
37, Bd des Capucines    -   75002 Paris
Tel:..................+33 1 53 34 66 10
FAX:..................+33 1 53 34 65 20






Re: Accessing xml prolog via SAX

Posted by Jacob Kjome <ho...@visi.com>.
I suggest looking at Rome's [1] XmlReader [2][3][4][5].  For instance, it can 
be used like this...

// Imports needed by this snippet (XmlReader comes from the Rome library)
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import com.sun.syndication.io.XmlReader;
import com.sun.syndication.io.XmlReaderException;

// 'url' is the java.net.URL of the document to parse
InputSource inputSource = new InputSource(url.toExternalForm());
try {
    // XmlReader detects the encoding and decodes the stream accordingly
    XmlReader reader = new XmlReader(url);
    inputSource.setCharacterStream(reader);
    inputSource.setEncoding(reader.getEncoding());
} catch (XmlReaderException xre) {
    //This is somewhat unlikely to happen, but doesn't hurt to have
    //extra fallback, which XmlReader conveniently allows for by
    //providing access to the original unconsumed inputstream via
    //the XmlReaderException
    inputSource.setByteStream(xre.getInputStream());
    String encoding = xre.getBomEncoding();
    if (encoding == null) encoding = xre.getXmlGuessEncoding();
    if (encoding == null) encoding = xre.getXmlEncoding();
    inputSource.setEncoding(encoding != null ? encoding : "UTF-8");
}
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
parser.parse(inputSource);



[1] https://rome.dev.java.net/
[2] https://rome.dev.java.net/apidocs/1_0/com/sun/syndication/io/XmlReader.html
[3] https://rome.dev.java.net/source/browse/rome/src/java/com/sun/syndication/io/XmlReader.java?rev=1.19&view=markup
[4] https://rome.dev.java.net/apidocs/1_0/com/sun/syndication/io/XmlReaderException.html
[5] https://rome.dev.java.net/source/browse/rome/src/java/com/sun/syndication/io/XmlReaderException.java?rev=1.1&view=markup


Jake






Re: Accessing xml prolog via SAX

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Elliotte,

I had a peek at your article and see in the code snippets that what you're
calling the "actual encoding" or "real encoding" actually isn't. The one
passed to startDocument() in XNI is the auto-detected encoding, the one
which Xerces guessed by peeking at the first few bytes in the document. The
actual encoding may not be known until the XML declaration has been read
and at this point it hasn't been read yet.
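
To make "guessed by peeking at the first few bytes" concrete, here is a
simplified sketch of that kind of auto-detection, loosely following the
byte patterns described in Appendix F of the XML 1.0 specification. It is
not the Xerces implementation: EncodingGuess and guess() are hypothetical
names, and the UCS-4 and EBCDIC cases are left out for brevity.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a simplified illustration, not Xerces code.
final class EncodingGuess {

    // Guesses an encoding from the first bytes of a document only.
    static String guess(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";          // UTF-8 BOM
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";             // UTF-16 BOM
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";             // UTF-16 BOM
        if (startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE"; // "<?" without BOM
        if (startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE"; // "<?" without BOM
        // Everything else, including a plain "<?xm" start, is treated as
        // UTF-8 until the XML declaration says otherwise. ISO-8859-1 and
        // UTF-8 documents are indistinguishable at this point.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("guess before reading the declaration: " + guess(latin1));
    }
}
```

For a document declared as ISO-8859-1 the first bytes look exactly like
UTF-8, which is why a guess made this early cannot be trusted for such files.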

In SAX it's not legal to read from the Locator in startDocument() so any
calls to the Locator you make in that method may not be correct and
generally won't be with Xerces because at the point it calls
startDocument() it hasn't read enough of the document yet to be sure of
what the actual encoding is. If it looked like it was working you were
probably just getting lucky because the documents you tried were in UTF-8
or UTF-16. Specifically the Javadoc [1] says: "Note that the locator will
return correct information only during the invocation SAX event callbacks
after startDocument returns and before endDocument is called. The
application should not attempt to use it at any other time." So you have to
wait until an event following startDocument() before you can read the
encoding (or anything else) from the Locator.
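
The timing rule can be observed with a small handler that asks the Locator
for the encoding twice: once in startDocument(), which SAX does not allow,
and once in the first startElement(). This is only a sketch: the class name
LocatorTimingDemo and its fields are invented for the example, and what the
early read returns depends on the parser, so no particular value should be
expected from it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records what the Locator reports at two points.
class LocatorTimingDemo extends DefaultHandler {
    private Locator locator;
    String atStartDocument;   // read too early; may only be the parser's guess
    String atFirstElement;    // read in a callback after startDocument()

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startDocument() {
        atStartDocument = currentEncoding();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (atFirstElement == null) {
            atFirstElement = currentEncoding();
        }
    }

    private String currentEncoding() {
        // Locator2 is optional, so fall back to null if it is not supported.
        return (locator instanceof Locator2)
                ? ((Locator2) locator).getEncoding()
                : null;
    }

    // Assumed helper: parse the given bytes and return the filled-in demo.
    static LocatorTimingDemo run(byte[] xml) throws Exception {
        LocatorTimingDemo demo = new LocatorTimingDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), demo);
        return demo;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        LocatorTimingDemo demo = run(xml);
        System.out.println("in startDocument(): " + demo.atStartDocument);
        System.out.println("in startElement():  " + demo.atFirstElement);
    }
}
```

Only the second value is read at a point where the Javadoc quoted above
says the locator returns correct information.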

Thanks.

[1] http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html#setDocumentLocator(org.xml.sax.Locator)

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Elliotte Harold <el...@ibiblio.org> wrote on 04/24/2009 08:48:52 AM:

> Do you want the declared encoding or the real encoding? If the
> latter, see here:
>
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> --
> Elliotte Rusty Harold
> elharo@ibiblio.org

Re: Accessing xml prolog via SAX

Posted by Elliotte Harold <el...@ibiblio.org>.
Do you want the declared encoding or the real encoding? If the latter, see here:

http://www.ibm.com/developerworks/library/x-tipsaxxni/

-- 
Elliotte Rusty Harold
elharo@ibiblio.org
