
Posted to c-dev@xerces.apache.org by "Leif Halvard Silli (JIRA)" <xe...@xml.apache.org> on 2011/06/09 10:15:58 UTC

[jira] [Created] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header
--------------------------------------------------------------------------------------------------------------------------------

                 Key: XERCESC-1967
                 URL: https://issues.apache.org/jira/browse/XERCESC-1967
             Project: Xerces-C++
          Issue Type: Bug
          Components: Non-Validating Parser
    Affects Versions: 3.1.1
         Environment: Mac OS X Snow Leopard (Intel).  (http://mirrorservice.nomedia.no/apache.org//xerces/c/3/binaries/xerces-c-3.1.1-x86-macosx-gcc-4.0.tar.gz)
Also tested with the XMLmind XML editor on the same platform.
            Reporter: Leif Halvard Silli


[1] http://www.w3.org/mid/20110609033243875895.0f711adc@xn--mlform-iua.no
[2] http://www.w3.org/mid/20110609090401531862.04ce13e8@xn--mlform-iua.no

It is an XML 1.0 spec violation: a well-formedness violation.

Test cases without XML declaration: http://malform.no/testing/html5/bom/
Test cases *with* XML declaration to be added later.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org

Re: [jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by John Snelson <jo...@snelson.org.uk>.

That sounds like a good plan :-).

On 13/06/11 11:18, Alberto Massari wrote:
> [...]


Re: [jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by Alberto Massari <Al...@progress.com>.

Hi John,
I know the API, and I was planning on reusing it by changing ReaderMgr from

         if (src.getEncoding())
         {
             retVal = new (fMemoryManager) XMLReader
                 (
                 src.getPublicId()
                 , src.getSystemId()
                 , newStream
                 , src.getEncoding()

to

         const XMLCh* encoding = src.getEncoding();
         if(encoding == 0)
             encoding = newStream->getContentType();
         if (encoding)
         {
             retVal = new (fMemoryManager) XMLReader
                 (
                 src.getPublicId()
                 , src.getSystemId()
                 , newStream
                 , encoding

i.e. if the InputSource doesn't have a user-specified encoding, check if 
the actual stream carries an encoding.

However, getContentType returns the full header value, e.g.
"application/xhtml+xml; charset=koi8-r", instead of an encoding; I
guess you need getContentType to stay the same to support XQilla's
unparsed-text(), so I am inclined to add a getEncoding method to
BinInputStream.

Alberto
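
A minimal sketch of the charset extraction that such a getEncoding method would need, assuming the header value is available as a plain std::string (Xerces itself works with XMLCh*). The name charsetFromContentType is hypothetical and not part of the Xerces-C API; escapes inside quoted values and other header subtleties are not handled.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not a Xerces-C function): returns the value of the
// charset parameter of a Content-Type header value, or "" if there is none.
std::string charsetFromContentType(const std::string& contentType)
{
    // Lower-cased copy so the parameter name is matched case-insensitively.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string::size_type pos = lower.find(';');
    while (pos != std::string::npos)
    {
        // Skip the ';' and any whitespace before the parameter name.
        ++pos;
        while (pos < lower.size() && std::isspace(static_cast<unsigned char>(lower[pos])))
            ++pos;

        if (lower.compare(pos, 8, "charset=") == 0)
        {
            std::string::size_type start = pos + 8;
            std::string::size_type end;
            if (start < contentType.size() && contentType[start] == '"')
            {
                // Quoted value: take everything up to the closing quote.
                ++start;
                end = contentType.find('"', start);
            }
            else
            {
                // Token value: ends at the next ';' or whitespace.
                end = contentType.find_first_of("; \t", start);
            }
            if (end == std::string::npos)
                end = contentType.size();
            // Return the value with its original casing.
            return contentType.substr(start, end - start);
        }
        pos = lower.find(';', pos);
    }
    return std::string();
}
```

With the header from the example above, charsetFromContentType("application/xhtml+xml; charset=koi8-r") yields "koi8-r", and a header without a charset parameter yields an empty string, which the ReaderMgr change could treat like a null encoding.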


On 13/06/2011 12:04, John Snelson wrote:
> [...]



Re: [jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by John Snelson <jo...@snelson.org.uk>.

Hi Alby,

I added BinInputStream::getContentType() some time ago so that I could 
accomplish this kind of thing in XQilla. My guess is that you can build 
Xerces-C stream encoding support on top of this. InputSource currently 
has a getEncoding() method, but the HTTP call hasn't been made by this 
point - maybe BinInputStream also needs a getEncoding() method which 
takes its default from the InputSource?

John

On 09/06/11 13:44, Alberto Massari (JIRA) wrote:
> [...]


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046508#comment-13046508 ] 

Alberto Massari commented on XERCESC-1967:
------------------------------------------

I don't agree with your request to reverse the priorities, but that's a discussion that shouldn't take place here. Good luck trying to convince the W3C.
The XML spec says that the BOM plus the internal encoding declaration take precedence when the XML is in a *file*, because it is likely that no transcoding has been performed on it. For all other scenarios (when the XML is in a byte stream), the component that does the wrapping should take care of telling the parser the new setting. This is what Xerces is doing now, and in my opinion it's correct and shouldn't be changed.
What is missing in Xerces is the capability of propagating the content-type read from the HTTP stream to the parser; whether the content type is text/xml or application/xml only affects the default encoding used when the charset is not specified. And in case 8.20 there is an encoding specified, so it doesn't matter which one (text/xml or application/xml) was specified.

In short, if you think that pparse (or saxcount) should refuse to parse your web page (that has an HTTP content type specifying KOI8-R, plus a UTF-8 BOM), I agree and I will try to fix it.
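
The conflicting case described above (the HTTP charset names one encoding while the bytes start with a UTF-8 BOM) can be detected before parsing. A rough sketch, assuming the first bytes of the entity and the charset label are already available as std::string values; hasUtf8Bom and bomContradictsCharset are illustrative names, not Xerces functions.

```cpp
#include <cctype>
#include <string>

// True if the data starts with the UTF-8 byte order mark EF BB BF.
bool hasUtf8Bom(const std::string& bytes)
{
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Case-insensitive comparison of two charset labels.
bool equalsIgnoreCase(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::string::size_type i = 0; i < a.size(); ++i)
    {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    }
    return true;
}

// True when the transport declares a charset other than UTF-8 but the
// payload begins with a UTF-8 BOM. An empty charset means "not declared"
// and never counts as a contradiction in this sketch.
bool bomContradictsCharset(const std::string& bytes, const std::string& httpCharset)
{
    if (httpCharset.empty() || !hasUtf8Bom(bytes))
        return false;
    return !equalsIgnoreCase(httpCharset, "utf-8");
}
```

A caller could report a fatal error when bomContradictsCharset returns true instead of silently dropping the BOM; whether that is the right policy is exactly what this issue debates.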



[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Leif Halvard Silli (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046521#comment-13046521 ] 

Leif Halvard Silli commented on XERCESC-1967:
---------------------------------------------

Of course it says so about files, in other places in the spec. However, that specific section *does* speak about files that are served over HTTP. The thinking goes like this: as you say, it is likely that the file has a correct declaration, and therefore it should be adhered to over HTTP too. But as you say, this is a spec issue, not something to be decided in this bug. However, attention to the problem and, eventually, disagreement with the spec do matter. The transcoding issue is the justification for text/xml. But it is not the justification in RFC3023 (perhaps it is in the HTTP RFC?) for application/xml.

I agree that parse should refuse, yes, per the specs as they stand. I am not sure that parsers should behave that way, though. So, you must make a judgement yourself about whether to fix it.

Libxml2 adheres to RFC3023. However, for files on the computer, libxml2 does not adhere to XML 1.0. See bug: https://bugzilla.gnome.org/show_bug.cgi?id=652185 The libxml2 bug also includes an attachment which you should test. The fact is that Xerces has the same bug as libxml2. Link to attachment: https://bugzilla.gnome.org/attachment.cgi?id=189543


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046411#comment-13046411 ] 

Alberto Massari commented on XERCESC-1967:
------------------------------------------

In the mailing list thread you reference I see that you complain about HTML5, IE, Webkit, XML and "also Xerces". Can you narrow down your report to a sentence like "I ran SAXCount <URL> and it reported valid/invalid data instead of invalid/valid"?


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046487#comment-13046487 ] 

Alberto Massari commented on XERCESC-1967:
------------------------------------------

No, I believe that RFC3023 is correct, but I am making a distinction between the parser and the code invoking the parser.
The XML parser is responsible for providing a parse(stream) function, and only knows what is written inside the stream; so, it expects a BOM, an encoding declaration and an XML-compliant sequence of bytes. If the BOM and/or the encoding declaration is missing, it has its own fallback mechanism in place to determine the encoding to be used in parsing. It obeys only the XML specification.
It also allows the stream to state "this is the encoding you should use, regardless of what you think", which someone from outside takes care of setting.
RFC3023 regulates how an HTTP transport can specify an encoding for the HTTP communication of an XML fragment, and is correct in saying that the HTTP envelope takes precedence over the XML content. After all, it's the HTTP transport that took the original payload and decided to re-encode it (case 8.20 in the RFC), so the client should trust the HTTP content type more than the internal XML fragment. In the Xerces case, the NetAccessor is the piece of code, external to the parser, that should take care of telling the stream "this is your encoding, ignore what you find in the XML".
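
The precedence described in this comment can be sketched as follows. This is an illustration under simplifying assumptions, not the Xerces implementation: the function names are hypothetical, only UTF-8 and UTF-16 BOMs are recognized, and a real parser would also cross-check the BOM against the encoding declaration rather than just picking the first hint available.

```cpp
#include <string>

// Encoding implied by a byte order mark at the start of the stream, or ""
// if there is none. Only UTF-8 and UTF-16 are covered in this sketch.
std::string bomEncoding(const std::string& bytes)
{
    const std::string::size_type n = bytes.size();
    const unsigned char b0 = n > 0 ? static_cast<unsigned char>(bytes[0]) : 0;
    const unsigned char b1 = n > 1 ? static_cast<unsigned char>(bytes[1]) : 0;
    const unsigned char b2 = n > 2 ? static_cast<unsigned char>(bytes[2]) : 0;

    if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
    if (n >= 2 && b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
    if (n >= 2 && b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
    return "";
}

// forced:   encoding set from outside the parser (e.g. from the HTTP
//           envelope); empty if nobody set one.
// bytes:    the first bytes of the stream.
// declared: encoding named in the XML declaration; empty if absent.
std::string resolveEncoding(const std::string& forced,
                            const std::string& bytes,
                            const std::string& declared)
{
    // 1. An externally forced encoding wins, regardless of the content.
    if (!forced.empty())
        return forced;
    // 2. Otherwise the parser looks at the stream itself: BOM first...
    const std::string fromBom = bomEncoding(bytes);
    if (!fromBom.empty())
        return fromBom;
    // 3. ...then the encoding declaration...
    if (!declared.empty())
        return declared;
    // 4. ...and finally the XML default for entities with neither.
    return "UTF-8";
}
```

Under this ordering, a stream forced to koi8-r that starts with a UTF-8 BOM resolves to koi8-r, which is the situation the issue title describes as the BOM being ignored.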


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046454#comment-13046454 ] 

Alberto Massari commented on XERCESC-1967:
------------------------------------------

In my opinion, the parser should obey only to the encoding seen in the XML declaration, as it has no control over whatever envelope was used to transmit the XML. It's up to the code involved in the transmission of the XML to decide how to encode/decode the data for its communication channel; so, speaking of Xerces, the entity resolver that reads from HTTP should read the content-type and force this encoding in the parser. I'll check if this is working as expected.


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Leif Halvard Silli (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046478#comment-13046478 ] 

Leif Halvard Silli commented on XERCESC-1967:
---------------------------------------------

On one side you say: "should obey only to the encoding seen in the XML declaration". This sounds as if you think RFC3023 should be ignored. (Which I am leaning towards myself, but note that my focus is principally UTF-8 encoded documents with a BOM.)

On the other side, you say: "entity resolver that reads from HTTP should read the content-type and force this encoding in the parser". This sounds as if you think RFC3023 nevertheless should apply.




[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Leif Halvard Silli (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046493#comment-13046493 ] 

Leif Halvard Silli commented on XERCESC-1967:
---------------------------------------------

Note that case 8.20 (http://tools.ietf.org/html/rfc3023#section-8.20) is 'text/xml'. The RFC does not discuss transcoding for application/xml (of which e.g. 'application/xhtml+xml' is a subtype, at least according to Mark Pilgrim: http://feedparser.org/docs/character-encoding.html )

For application/xml, the RFC only presents a "because the HTTP RFC says so" justification: http://tools.ietf.org/html/rfc3023#section-3.2 And, as far as I understand, transcoding should not happen for application/xml.

Note, also, that all this started because of a bug against HTML5/XHTML5: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 In my view (outlined in the bug), one should consider giving priority to the (UTF-8) BOM over both HTTP and the XML encoding declaration. This would be in the "interests of interoperability", as XML 1.0 puts it (http://www.w3.org/TR/xml/#sec-guessing-with-ext-info)
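
To make the two orderings concrete, here is a minimal Python sketch. It is only an illustration, not what Xerces or any browser does: the function names and the policy labels "http-first" and "bom-first" are hypothetical, "http-first" stands for the reading in which the HTTP charset parameter is authoritative, and "bom-first" for the priority suggested above.

```python
import codecs

# BOM signatures. UTF-32 comes first because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def bom_encoding(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(data, http_charset=None, declared=None, policy="http-first"):
    """Pick an encoding from the BOM, the HTTP charset and the XML declaration.

    policy is a hypothetical switch used only in this sketch:
      "http-first": HTTP charset, then BOM, then XML declaration
      "bom-first":  BOM, then HTTP charset, then XML declaration
    """
    bom = bom_encoding(data)
    if policy == "bom-first":
        candidates = [bom, http_charset, declared]
    else:
        candidates = [http_charset, bom, declared]
    for candidate in candidates:
        if candidate:
            return candidate
    return "utf-8"  # XML default when nothing else is known


doc = codecs.BOM_UTF8 + b"<root/>"
print(choose_encoding(doc, http_charset="iso-8859-1"))
print(choose_encoding(doc, http_charset="iso-8859-1", policy="bom-first"))
```

With the test document from this issue (UTF-8 BOM, served as ISO-8859-1), the two policies disagree, which is exactly the interoperability problem under discussion.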


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Leif Halvard Silli (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046449#comment-13046449 ] 

Leif Halvard Silli commented on XERCESC-1967:
---------------------------------------------

Well, it is not only "also Xerces". I describe in [2]  how Xerces behaves. I also link to test cases without XML declaration. 

But to give what you ask for: 

* I ran this from the command line:
   $ pparse http://malform.no/testing/html5/bom/xml.html
* That test case page is an 'application/xhtml+xml' document.
* This document is UTF-8 encoded, with a BOM, but is *served* by HTTP as ISO-8859-1 encoded.
* Because HTTP says that the Content-Type charset parameter has priority over document-internal encoding information, the document is not well-formed: there is an illegal character - the "BOM" - at the beginning of the document. (It actually isn't a BOM character when it is read as ISO-8859-1.)
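
The last point can be checked without any XML parser. The following is a small Python sketch (Python is only my choice for illustration, and the one-element document is made up): it decodes the same bytes once as ISO-8859-1 and once as UTF-8 and lists the leading code points.

```python
import codecs

# A made-up minimal document: UTF-8 BOM (EF BB BF) followed by a root element.
doc = codecs.BOM_UTF8 + b"<root/>"

as_latin1 = doc.decode("iso-8859-1")  # what the HTTP charset parameter claims
as_utf8 = doc.decode("utf-8")         # what the bytes actually are

# Read as ISO-8859-1, the three BOM bytes become three ordinary characters
# (U+00EF, U+00BB, U+00BF) in front of the root element.
latin1_prefix = [hex(ord(c)) for c in as_latin1[:3]]

# Read as UTF-8, the same bytes are the single character U+FEFF.
utf8_prefix = [hex(ord(c)) for c in as_utf8[:1]]

print(latin1_prefix)
print(utf8_prefix)
```

Three non-whitespace characters before the root element are character data in the prolog, which is why the document cannot be well-formed under the ISO-8859-1 reading.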

Xerces' pparse should therefore emit a 'fatal error' message. But instead of doing so, it simply emits this:
 
http://malform.no/testing/html5/bom/xml.html: 122 ms (24 elems, 7 attrs, 0 spaces, 2469 chars)

PS: Please note that I am not sure that Xerces should actually be corrected to adhere to RFC3023. I am actually advocating that XML 1.0 should be changed to say that the document's own encoding information overrides the HTTP information, because the only parsers that seem to behave as RFC3023 says are Opera and Firefox.

PPS: I will add an example with a document containing an XML encoding declaration later on. (Time constraints.)


[jira] [Resolved] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.

     [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alberto Massari resolved XERCESC-1967.
--------------------------------------

       Resolution: Fixed
    Fix Version/s: 3.2.0
         Assignee: Alberto Massari

A fix is in SVN; please verify.


[jira] [Issue Comment Edited] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Michael Glavassevich (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052988#comment-13052988 ] 

Michael Glavassevich edited comment on XERCESC-1967 at 6/22/11 2:33 AM:
------------------------------------------------------------------------

I'm not sure what was done in Xerces-C, but in my opinion reading HTTP headers to obtain "external encoding information" is the responsibility of the application not the XML parser. Users of Java implementations have always had to fetch that information themselves and provide the encoding by setting it on the InputSource (i.e. InputSource.setEncoding()). This is by design. It's not a bug.

  

Re: [jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by Steve Hathaway <sh...@e-z.net>.

The build instructions for Xerces Version 3 are here:

   http://xerces.apache.org/xerces-c/build-3.html

- Steve Hathaway

On 6/21/2011 3:31 PM, Leif Halvard Silli (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052898#comment-13052898 ]
>
> Leif Halvard Silli commented on XERCESC-1967:
> ---------------------------------------------
>
> 1) Is it possible to provide some command line commands which I could use to build it from SVN locally? I'm sorry, but my knowledge is quite narrow. Then I can check how it works.
>



[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Leif Halvard Silli (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052898#comment-13052898 ] 

Leif Halvard Silli commented on XERCESC-1967:
---------------------------------------------

Two questions: 
1) Is it possible to provide some command line commands which I could use to build it from SVN locally? I'm sorry, but my knowledge is quite narrow. Then I can check how it works.
2) Will this fix be "propagated" to Xerces Java too? (I mentioned the XMLmind editor initially, and that editor turns out to be based on Xerces Java, so I know the bug is also there. This was verified with an XMLmind developer.)


[jira] [Commented] (XERCESC-1967) Xerces ignores (deletes, swallow, ignores) the UTF-8 BOM and also ignores the charset parameter of the HTTP content-type: header

Posted by "Michael Glavassevich (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESC-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052988#comment-13052988 ] 

Michael Glavassevich commented on XERCESC-1967:
-----------------------------------------------

I'm not sure what was done in Xerces-C, but in my opinion reading HTTP headers to obtain "external encoding information" is the responsibility of the application, not the XML parser. Users of Java implementations have always had to fetch that information themselves and provide the encoding by setting it on the InputSource (i.e. InputSource.setEncoding()). This is by design. It's not a bug.
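
As a rough illustration of that division of labour, here is a sketch in Python rather than Java, using the SAX InputSource class from Python's standard library, which offers a setEncoding() method like its Java counterpart. The helper names charset_from_content_type and make_input_source are hypothetical, and the header value is assumed to have been fetched by the application already.

```python
import io
from email.message import Message
from xml.sax.xmlreader import InputSource


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased by the email package


def make_input_source(body, content_type):
    """Wrap the response body and hand the external encoding to the parser.

    The application, not the parser, decides what the HTTP header means.
    """
    source = InputSource()
    source.setByteStream(io.BytesIO(body))
    charset = charset_from_content_type(content_type)
    if charset:
        source.setEncoding(charset)
    return source


source = make_input_source(
    b"<root/>", "application/xhtml+xml; charset=ISO-8859-1"
)
print(source.getEncoding())
```

When the header carries no charset parameter, the sketch leaves the encoding unset, so the parser falls back to its own detection (BOM or XML declaration).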
