Posted to j-dev@xerces.apache.org by Christophe Marchand <cm...@oxiane.com> on 2022/08/21 15:09:18 UTC

Fwd: XML Schema validation and https redirects

This is a forward from xmlschema-dev@w3.org; I think Xerces is affected by
this. There is an active thread on that mailing list, with archives
available at https://lists.w3.org/Archives/Public/xmlschema-dev/2022Aug/

Best regards,
Christophe

    W3C's main web site https://www.w3.org/ will soon start to redirect
    all http requests to https. Will this cause issues for XML
    Schema-related resources hosted on www.w3.org?

    We announced this intended change a few weeks ago,

    [[
    W3C’s main web site www.w3.org has been available via https for over
    a decade, but until now we have not been redirecting all requests to
    https as is commonly done on most other sites.

    The primary reason for this is that we wanted to avoid causing
    issues for software requesting machine-readable resources from
    www.w3.org such as HTML DTDs, XML Schemas, and namespace documents.

    We believe enough time has passed for most such software to have
    been updated to handle redirects and https, so we are planning to
    start redirecting all requests received over http to https within a
    month or two.
    ]]
    -- https://www.w3.org/blog/2022/07/redirecting-to-https-on-www-w3-org/

    And following an initial test of this change on August 1 we received
    some feedback that this caused issues with XML Schema validation. We
    are planning a followup test for 3 days starting at 14:00 UTC
    tomorrow, August 18.

    Some questions I have:

    Is it intended that www.w3.org is in the critical path when
    performing XML Schema validation? Are .xsd files and/or namespace
    documents retrieved each time a validation is done? Are there other
    use cases besides validation that might cause automated requests to
    www.w3.org?

    What are the most popular software packages that might be making
    these requests to www.w3.org? In what contexts do they make these
    requests? Do the latest versions typically have the ability to
    follow http to https redirects? Would XML catalogs help?

    The top UAs making requests for .xsd resources on www.w3.org are:

       127574 Java/1.8.0_121
        96712
        25860 Python-urllib/2.7
        16673 Apache-CXF/3.3.4
        16215 Zeep/4.1.0 (www.python-zeep.org)
         6481 Apache-CXF/3.2.10
         6205 Java/1.6.0_26
         4176 Java/17.0.2
         1827 Java/1.8.0_162
         1485 Python-urllib/3.7

    (1st col is the number of requests in a 90-min sample of the logs)

    Omitting version numbers:

       159765 Java
       101314
        29012 Python-urllib
        27912 Apache-CXF
        17640 Zeep
         1467 Mozilla
          623 Apache CXF
          322 sax Java
          211 Apache-HttpClient
          187 Oracle HTTPClient Version 10h
          120 node-soap
           88 SOA Model (see http:
           87 Elastic-Heartbeat
           74 python-requests
           74 curl

    Top UAs making requests matching /2001/XMLSchema :

        43290 Java
        15014 Python-urllib
         8358
         6106 ALTOVA
         3427 Mozilla
          364 Go-http-client
          130 Java1.8.0_291
           88 Zabbix
           70 WebexTeams
           66 MVision
           53 curl
           44 Baiduspider+(+http:
           42 Apache-HttpClient
           40 MapForce
           40 cubebot

    If we start redirecting http to https, will that fundamentally break
    compliance with W3C RECs that specify http: in references to .xsd
    files and namespaces? If so, which URIs would we need to continue to
    serve via http?

    Thanks,

    -- 
    Gerald Oskoboiny <ge...@w3.org>
    http://www.w3.org/People/Gerald/
    tel:+1-604-906-1232 (mobile)

Re: XML Schema validation and https redirects

Posted by Michael Glavassevich <mr...@gmail.com>.
Something many people don't realize is that the parser bundled in Java must
have been reworked or rewritten by Oracle/Sun in order to support the StAX
API (javax.xml.stream), because Apache Xerces has never had that support. It
would not surprise me if the core parsing code in Java is quite different
from the Apache version and has its own behavioural quirks that never
existed here. The Java folks are in the best position to comment on what it
is doing; we would have no idea. There is no feedback from Java to this
community.

On Mon, Aug 22, 2022 at 2:42 AM Christophe Marchand <cm...@oxiane.com>
wrote:

> Nice, Michael, that Xerces supports redirects!
>
> There was a warning in the thread about the Xerces bundled in Java 11, which
> does not seem to support redirects. But I know it's an old one!
>
> Best regards,
> Christophe
> On 22/08/2022 at 00:03, Michael Glavassevich wrote:
>
> My first thought when reading this was of the action [1] the W3C took
> against excessive access to DTD and XML Schema documents hosted on their
> site. I would hope that in the years since, users of XML parsers like Xerces
> have learned a lesson and are caching these resources and using a resolver
> (such as an XML catalog) to load them.
>
> As for concerns about redirects, I recall that java.net.URL supports them
> by default, or at least it did in the pre-Oracle days of Java. I am
> responsible for patching Xerces’ XMLEntityManager to check whether an HTTP
> URL was redirected and, if so, to use the redirected URL for resolving any
> resources relative to it. This worked with the versions of Java that were
> current when it was implemented, and the code has not changed in the Apache
> version.
>
> [1] https://www.w3.org/Help/Webmaster#block
>

Re: XML Schema validation and https redirects

Posted by Christophe Marchand <cm...@oxiane.com>.
Nice, Michael, that Xerces supports redirects!

There was a warning in the thread about the Xerces bundled in Java 11, which
does not seem to support redirects. But I know it's an old one!

Best regards,
Christophe

On 22/08/2022 at 00:03, Michael Glavassevich wrote:
> My first thought when reading this was of the action [1] the W3C took
> against excessive access to DTD and XML Schema documents hosted on
> their site. I would hope that in the years since, users of XML parsers
> like Xerces have learned a lesson and are caching these resources and
> using a resolver (such as an XML catalog) to load them.
>
> As for concerns about redirects, I recall that java.net.URL supports
> them by default, or at least it did in the pre-Oracle days of Java. I
> am responsible for patching Xerces’ XMLEntityManager to check whether
> an HTTP URL was redirected and, if so, to use the redirected URL for
> resolving any resources relative to it. This worked with the versions
> of Java that were current when it was implemented, and the code has
> not changed in the Apache version.
>
> [1] https://www.w3.org/Help/Webmaster#block
>

Re: XML Schema validation and https redirects

Posted by Michael Glavassevich <mr...@gmail.com>.
My first thought when reading this was of the action [1] the W3C took against excessive access to DTD and XML Schema documents hosted on their site. I would hope that in the years since, users of XML parsers like Xerces have learned a lesson and are caching these resources and using a resolver (such as an XML catalog) to load them.
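
To illustrate the caching approach mentioned above, here is a minimal sketch of a SAX EntityResolver that answers from local copies instead of contacting www.w3.org. The class name CachingEntityResolver, its register method and the in-memory map are assumptions made for this example; they are not part of Xerces, and a real deployment would more likely use an XML catalog or files on disk.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves cached copies instead of fetching them remotely. */
final class CachingEntityResolver implements EntityResolver {
    // System identifier -> cached document text (assumption: small enough to hold in memory).
    private final Map<String, String> cache = new HashMap<>();

    /** Registers a cached copy; the key is stored in its http: form. */
    void register(String systemId, String content) {
        cache.put(normalize(systemId), content);
    }

    /** Maps https: identifiers onto their http: form so both spellings hit the same entry. */
    static String normalize(String systemId) {
        String prefix = "https://";
        return systemId.startsWith(prefix)
                ? "http://" + systemId.substring(prefix.length())
                : systemId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = systemId == null ? null : cache.get(normalize(systemId));
        if (content == null) {
            // Returning null asks the parser to open the system identifier itself.
            return null;
        }
        InputSource source = new InputSource(new StringReader(content));
        // Keep the requested identifier so relative references are resolved against it.
        source.setSystemId(systemId);
        return source;
    }
}
```

A resolver like this could be installed with DocumentBuilder.setEntityResolver or XMLReader.setEntityResolver; for javax.xml.validation the comparable hook is SchemaFactory.setResourceResolver, which takes an LSResourceResolver rather than an EntityResolver.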

As for concerns about redirects, I recall that java.net.URL supports them by default, or at least it did in the pre-Oracle days of Java. I am responsible for patching Xerces’ XMLEntityManager to check whether an HTTP URL was redirected and, if so, to use the redirected URL for resolving any resources relative to it. This worked with the versions of Java that were current when it was implemented, and the code has not changed in the Apache version.
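
The redirect handling described above can be sketched roughly as follows. This is not the actual XMLEntityManager code; the class RedirectBase and its methods are hypothetical names chosen for this example. The idea is that once a request has been redirected, relative references are resolved against the redirected URI rather than the one originally requested. The effectiveUri helper follows redirects by hand because HttpURLConnection is generally reported not to follow automatically a redirect that switches protocol from http to https.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper showing how a redirected URI becomes the new base URI. */
final class RedirectBase {
    /** Resolves a Location header value against the URI that was requested. */
    static URI redirectTarget(URI requested, String location) {
        return requested.resolve(location);
    }

    /** Resolves a relative reference (for example from xs:include) against the post-redirect URI. */
    static URI resolveRelative(URI effectiveBase, String relative) {
        return effectiveBase.resolve(relative);
    }

    /** Follows at most maxHops redirects by hand, including http to https, and returns the final URI. */
    static URI effectiveUri(URI start, int maxHops) throws IOException {
        URI current = start;
        for (int i = 0; i < maxHops; i++) {
            HttpURLConnection conn = (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // inspect the redirect ourselves
            conn.setRequestMethod("HEAD");
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (status < 300 || status >= 400 || location == null) {
                return current; // not a redirect: this is the effective URI
            }
            current = redirectTarget(current, location);
        }
        throw new IOException("Too many redirects from " + start);
    }
}
```

For instance, if http://www.w3.org/2001/XMLSchema.xsd is redirected to its https form, a relative reference to xml.xsd would then resolve to https://www.w3.org/2001/xml.xsd.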

[1] https://www.w3.org/Help/Webmaster#block
