Posted to dev@cocoon.apache.org by Donald Ball <ba...@webslingerZ.com> on 2002/03/07 20:08:34 UTC

problems parsing xml with dtd from a foreign source

(sent to cocoon-users, no help there...)

hey guys. i'm trying to retrieve some xml content over http to begin one
of my pipelines:

/nlm/query?author=Smith

<map:match pattern="nlm/query">
  <map:match type="request" pattern="author">
    <map:generate src="http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]"/>
    <map:serialize type="xml"/>
  </map:match>
</map:match>

the xml returned from the nih server will begin like so:

<?xml version="1.0"?>
<!DOCTYPE QueryResult PUBLIC "-//NLM//DTD QueryResult, 22 Jan 2002//EN"
"/entrez/query/DTD/pmqty_020122.dtd" >
<QueryResult>

unfortunately, i get an exception when cocoon tries to parse this
document. it claims that it cannot access the dtd:

java.net.MalformedURLException: no protocol: /entrez/query/DTD/pmqty_020122.dtd
	at java.net.URL.<init>(URL.java:473)
	at java.net.URL.<init>(URL.java:376)
	at java.net.URL.<init>(URL.java:330)
	at org.apache.xerces.impl.XMLEntityManager.startEntity(XMLEntityManager.java:731)
	at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:691)
	at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:258)
	at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(XMLDocumentScannerImpl.java:811)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:333)
	at org.apache.xerces.parsers.StandardParserConfiguration.parse(StandardParserConfiguration.java:525)
	at org.apache.xerces.parsers.StandardParserConfiguration.parse(StandardParserConfiguration.java:581)
	at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:147)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1157)
	at org.apache.avalon.excalibur.xml.JaxpParser.parse(JaxpParser.java:241)
	at org.apache.cocoon.components.source.AbstractStreamSource.toSAX(AbstractStreamSource.java:204)
	at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:142)

shouldn't it be trying to download the DTD from this url:

http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pmqty_020122.dtd

where it does, in fact, live?

i did manage to work around this problem using the excellent entity
catalogs facility, and i suspect that's what we'll want to use in the long
term, but i would like to track down why this isn't working as (i think)
it ought to. thanks in advance.

- donald


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: problems parsing xml with dtd from a foreign source

Posted by David Crossley <cr...@indexgeo.com.au>.
Donald Ball wrote:
> David Crossley wrote:
> > Donald Ball wrote:
> > 
> > > hey guys. i'm trying to retrieve some xml content over http to begin one
> > > of my pipelines:
> > >
> > > /nlm/query?author=Smith
> > >
> > > <map:match pattern="nlm/query">
> > >   <map:match type="request" pattern="author">
> > >     <map:generate src="http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]"/>
> > >     <map:serialize type="xml"/>
> > >   </map:match>
> > > </map:match>
> > >
> > > the xml returned from the nih server will begin like so:
> > >
> > > <?xml version="1.0"?>
> > > <!DOCTYPE QueryResult PUBLIC "-//NLM//DTD QueryResult, 22 Jan 2002//EN"
> > > "/entrez/query/DTD/pmqty_020122.dtd" >
> > > <QueryResult>
> > >
> > > unfortunately, i get an exception when cocoon tries to parse this
> > > document. it claims that it cannot access the dtd:
> > >
> > > java.net.MalformedURLException: no protocol:
> > > /entrez/query/DTD/pmqty_020122.dtd
> > > 
> > > ....... <snip what="rest of error log">
> > >
> > > shouldn't it be trying to download the DTD from this url:
> > > http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pmqty_020122.dtd
> > > 
> > > where it does, in fact, live?
> > >
> > > i did manage to work around this problem using the excellent entity
> > > catalogs facility, and i suspect that's what we'll want to use in the long
> > > term, but i would like to track down why this isn't working as (i think)
> > > it ought to. thanks in advance.
> > > 
> > > - donald
> >
> > Good to hear that the entity catalogs worked for you.
> > I think that the reason that you cannot do without the
> > entity catalog resolver, is that the document type declaration
> > in the XML instance document is not using a full URL, i.e.
> > http://www.ncbi.nlm.nih.gov/entrez/qu...
> > So the parser is trying to find the DTD at the root of your
> > local filesystem, i.e. /entrez/qu...
> 
> but it shouldn't do that. according to the xml spec on system ids:
> 
> http://www.w3.org/TR/REC-xml#dt-sysid
> 
> "Unless otherwise provided by information outside the scope of this
> specification (e.g. a special XML element type defined by a particular
> DTD, or a processing instruction defined by a particular application
> specification), relative URIs are relative to the location of the resource
> within which the entity declaration occurs."
> 
> the location of the resource in this case is clearly its url:
> 
> http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]
> 
> and that's the context in which the system identifier should be resolved,
> right? (i could easily be wrong, i'm a little sketchy on the doctype
> stuff. the spec seems clear enough on this point to me tho.)
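
As a rough illustration of the rule quoted above (a sketch only, not
code from Cocoon or Xerces): the class and method names below are made
up for this example, and the query string is shortened, leaving out the
sitemap's {1} placeholder. It resolves the system identifier against
the URL of the document that declared it, and shows that the same
identifier taken on its own has no protocol, which matches the message
in the error log.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper class, for illustration only.
class SystemIdResolution {

    // Resolve a (possibly relative) system identifier against the URI of
    // the document in which the declaration occurs.
    static String resolve(String baseUri, String systemId) {
        return URI.create(baseUri).resolve(systemId).toString();
    }

    // Treat the system identifier as a stand-alone URL. Returns the error
    // message if that fails, or null if the identifier is already absolute.
    static String standaloneError(String systemId) {
        try {
            new URL(systemId);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String base =
            "http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&mode=XML";
        String systemId = "/entrez/query/DTD/pmqty_020122.dtd";

        // With a base, the identifier resolves to the NLM server.
        System.out.println(resolve(base, systemId));
        // Without a base, there is nothing to supply the protocol or host.
        System.out.println(standaloneError(systemId));
    }
}
```

The point of the sketch is that the outcome depends entirely on whether
a base URI is available when the identifier is resolved.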

The issue may be that Cocoon has retrieved the XML instance
across the network and the parser is treating it as though it is
in the context of your local filesystem. The error log snippet that
you provided certainly indicates that the parser is looking
locally for the DTD ...
----
java.net.MalformedURLException: no protocol:
/entrez/query/DTD/pmqty_020122.dtd
----

Consider this. The system at NLM will do quality control
behind-the-scenes. Their server would retrieve instances
from their app and parse them. All will be OK because the
DTD can be found on their local filesystem, on the same
machine.

I just used the Entrez URL that you provided to get an XML
instance document. I tried processing with a command-line
parser. As expected, the parser reported an error looking
for the DTD on my local system.

Your Cocoon app is also taking the XML instance out of
its original context and into yours. So I think that we are
seeing the expected behaviour - locate the DTD relative
to the document because that is what the instance declared.
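
One way to see the effect of that lost context is a small SAX
experiment. This is only a sketch against the JAXP parser bundled with
a current JDK, not the Xerces or Cocoon versions in the stack trace,
and the class and method names are invented for the example. It parses
a copy of the document from memory and records which system identifier
the parser asks its entity resolver for, with and without a base
system identifier set on the InputSource.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class BaseSystemIdDemo {

    // Same prolog as the NLM response; the root element is closed here so
    // that the in-memory document is well-formed.
    static final String DOC =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE QueryResult PUBLIC \"-//NLM//DTD QueryResult, 22 Jan 2002//EN\"\n"
        + "\"/entrez/query/DTD/pmqty_020122.dtd\" >\n"
        + "<QueryResult/>";

    // Parse DOC and return the system identifier that the parser passes to
    // the entity resolver for the DTD. baseSystemId may be null.
    static String requestedDtd(String baseSystemId) {
        final String[] seen = new String[1];
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setEntityResolver((publicId, systemId) -> {
                seen[0] = systemId;
                // Hand back an empty DTD so nothing is fetched from anywhere.
                return new InputSource(new StringReader(""));
            });
            InputSource in = new InputSource(new StringReader(DOC));
            if (baseSystemId != null) {
                // This is the only place the parser learns where the
                // document came from.
                in.setSystemId(baseSystemId);
            }
            reader.parse(in);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(requestedDtd(
            "http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&mode=XML"));
        System.out.println(requestedDtd(null));
    }
}
```

If the caller hands the parser a bare stream, as in the second call,
the parser has no remote location to resolve against.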

Of course, modifying the document type declaration in my
local instance, to use a full URL to the DTD at NLM,
satisfies the parser in a brute-force manner - not what you
want for production, but sufficient to explain what the issue is.

I suppose that the crux is ... where is the essential XML instance
in a generated network environment? Does it exist at NLM?
Is it the stream coming out of Cocoon's generator?

> if so, then while entity catalogs are a nice workaround, they don't work
> unless you know in advance the dtd of the remote xml and also know that
> it's not going to change. ...

Ah, that is a different issue. The Public Identifier in the
document type declaration is the key for the entity catalog
resolver. The resolver looks at the PublicId (-//NLM//DTD Qu...)
and tries to find a mapping in the catalog to a local copy
of the DTD.
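
The lookup just described can be sketched with the standard SAX
EntityResolver interface. This is a simplified stand-in for a real
catalog resolver, not the resolver that ships with Cocoon: the class
name, the map() method and the local file location are all assumptions
made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical, minimal stand-in for an entity catalog: it maps a Public
// Identifier to the location of a local copy of the DTD.
class PublicIdCatalog implements EntityResolver {

    private final Map<String, String> localCopies = new HashMap<>();

    // Register a local copy for a Public Identifier.
    void map(String publicId, String localUri) {
        localCopies.put(publicId, localUri);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            // No mapping: returning null tells the parser to fall back to
            // the system identifier from the document.
            return null;
        }
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }

    public static void main(String[] args) {
        PublicIdCatalog catalog = new PublicIdCatalog();
        // The local path is made up for this example.
        catalog.map("-//NLM//DTD QueryResult, 22 Jan 2002//EN",
                    "file:///opt/dtd/pmqty_020122.dtd");
        InputSource hit = catalog.resolveEntity(
            "-//NLM//DTD QueryResult, 22 Jan 2002//EN",
            "/entrez/query/DTD/pmqty_020122.dtd");
        System.out.println(hit.getSystemId());
    }
}
```

Because the key is the Public Identifier, the System Identifier in the
remote document never has to be resolvable from the client side.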

The potential for the remote DTD (and hence the structure of
the instance) to change, is certainly a management issue.
Catalog resolver will not be able to handle that automatically
for you. Thankfully NLM are using proper Public Identifiers
(with a proper version number) so you can identify when
there is a change and act accordingly.

> ... otherwise, your webapp can break without notice.
> that's not cool! ...

Certainly not. Perhaps get your stylesheet to send an alert
if the document type declaration changes. At least you will
then know that your Cocoon app needs attention. Anyway
you will need to know, because your stylesheets will then
need to change to accommodate the modified XML structure.
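
The same kind of check could also be done outside the stylesheet. As a
sketch only, and as a different technique from the stylesheet alert
suggested above, the following uses a SAX LexicalHandler to read the
Public Identifier from the DOCTYPE and compare it with the one the
application was built against. The class and method names are
hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical watcher class, for illustration only.
class DoctypeWatch {

    // The Public Identifier that the application expects.
    static final String EXPECTED = "-//NLM//DTD QueryResult, 22 Jan 2002//EN";

    // Returns the Public Identifier declared in the DOCTYPE, or null if the
    // document declares none.
    static String declaredPublicId(String xml) {
        final String[] found = new String[1];
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DOCTYPE events are reported through the lexical handler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler",
                new DefaultHandler2() {
                    @Override
                    public void startDTD(String name, String publicId, String systemId) {
                        found[0] = publicId;
                    }
                });
            // Supply an empty DTD so that the check never touches the network.
            reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found[0];
    }

    // True when the document no longer declares the expected DTD version.
    static boolean doctypeChanged(String xml) {
        return !EXPECTED.equals(declaredPublicId(xml));
    }
}
```

A caller could log or send a notification whenever doctypeChanged()
returns true; how the alert is delivered is left open here.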

Of course, this issue is present whenever we use a source
that is out of our control. The Cocoon dist samples that do
a HTML scrape from a remote site, have the same problem.
Actually this is a bigger issue because there is no DTD to
specify the structure or control its revisions.

> ... i'm sorry that i've not been able to come up with a patch
> for this, i can't figure out which component is guilty. any clues?

The XML instance at NLM is the main issue. However, that is
not to say that they are doing the wrong thing ... it is actually
good to have a local System Identifier. This is the rub - when
you declare the DTD URL in the XML instance then crappy
clients will try to retrieve the DTD every time - the parser
cannot help it. 

The only way around that is you maintain a collection of local
DTD copies and use the excellent catalog entity resolver.
This is an application management issue - yet another
"separate concern". Hopefully you can collaborate with
NLM so that you can anticipate DTD revisions.

Jeff Turner's Doctypechanger solution could be plugged
into Cocoon to help with this, by modifying the document
type declaration. But that may be too late, the parser has
already started to read the XML instance. So you might
need a special Cocoon Generator to intercept.
Re: Doctypechanger tool for xml-commons
http://marc.theaimsgroup.com/?l=xml-commons-dev&m=101482808223187&w=2

-- David



RE: problems parsing xml with dtd from a foreign source

Posted by Vadim Gritsenko <va...@verizon.net>.
> From: Donald Ball [mailto:balld@webslingerZ.com]
> 
> On Sun, 10 Mar 2002, David Crossley wrote:
... 
> > > the xml returned from the nih server will begin like so:
> > >
> > > <?xml version="1.0"?>
> > > <!DOCTYPE QueryResult PUBLIC "-//NLM//DTD QueryResult, 22 Jan 2002//EN"
> > > "/entrez/query/DTD/pmqty_020122.dtd" >
> > > <QueryResult>
> > >
> > > unfortunately, i get an exception when cocoon tries to parse this
> > > document. it claims that it cannot access the dtd:
> > >
> > > java.net.MalformedURLException: no protocol:
> > > /entrez/query/DTD/pmqty_020122.dtd
 
...
 
> but it shouldn't do that. according to the xml spec on system ids:
> 
> http://www.w3.org/TR/REC-xml#dt-sysid
> 
> "Unless otherwise provided by information outside the scope of this
> specification (e.g. a special XML element type defined by a particular
> DTD, or a processing instruction defined by a particular application
> specification), relative URIs are relative to the location of the resource
> within which the entity declaration occurs."
> 
> the location of the resource in this case is clearly its url:
> 
> http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]
> 
> and that's the context in which the system identifier should be resolved,
> right? (i could easily be wrong, i'm a little sketchy on the doctype
> stuff. the spec seems clear enough on this point to me tho.)
> 
> if so, then while entity catalogs are a nice workaround, they don't work
> unless you know in advance the dtd of the remote xml and also know that
> it's not going to change. otherwise, your webapp can break without notice.
> that's not cool! i'm sorry that i've not been able to come up with a patch
> for this, i can't figure out which component is guilty. any clues?

Have you tried to parse this XML with standalone Xerces?

Vadim

> 
> - donald
> 




Re: problems parsing xml with dtd from a foreign source

Posted by Donald Ball <ba...@webslingerZ.com>.
On Sun, 10 Mar 2002, David Crossley wrote:

> > hey guys. i'm trying to retrieve some xml content over http to begin one
> > of my pipelines:
> >
> > /nlm/query?author=Smith
> >
> > <map:match pattern="nlm/query">
> >   <map:match type="request" pattern="author">
> >     <map:generate src="http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]"/>
> >     <map:serialize type="xml"/>
> >   </map:match>
> > </map:match>
> >
> > the xml returned from the nih server will begin like so:
> >
> > <?xml version="1.0"?>
> > <!DOCTYPE QueryResult PUBLIC "-//NLM//DTD QueryResult, 22 Jan 2002//EN"
> > "/entrez/query/DTD/pmqty_020122.dtd" >
> > <QueryResult>
> >
> > unfortunately, i get an exception when cocoon tries to parse this
> > document. it claims that it cannot access the dtd:
> >
> > java.net.MalformedURLException: no protocol:
> > /entrez/query/DTD/pmqty_020122.dtd

...

> Good to hear that the entity catalogs worked for you.
> I think that the reason that you cannot do without the
> entity catalog resolver, is that the document type declaration
> in the XML instance document is not using a full URL, i.e.
> http://www.ncbi.nlm.nih.gov/entrez/qu...
> So the parser is trying to find the DTD at the root of your
> local filesystem, i.e. /entrez/qu...

but it shouldn't do that. according to the xml spec on system ids:

http://www.w3.org/TR/REC-xml#dt-sysid

"Unless otherwise provided by information outside the scope of this
specification (e.g. a special XML element type defined by a particular
DTD, or a processing instruction defined by a particular application
specification), relative URIs are relative to the location of the resource
within which the entity declaration occurs."

the location of the resource in this case is clearly its url:

http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]

and that's the context in which the system identifier should be resolved,
right? (i could easily be wrong, i'm a little sketchy on the doctype
stuff. the spec seems clear enough on this point to me tho.)

if so, then while entity catalogs are a nice workaround, they don't work
unless you know in advance the dtd of the remote xml and also know that
it's not going to change. otherwise, your webapp can break without notice.
that's not cool! i'm sorry that i've not been able to come up with a patch
for this, i can't figure out which component is guilty. any clues?

- donald




Re: problems parsing xml with dtd from a foreign source

Posted by David Crossley <cr...@indexgeo.com.au>.
Donald Ball wrote:
> (sent to cocoon-users, no help there...)
> 
> hey guys. i'm trying to retrieve some xml content over http to begin one
> of my pipelines:
> 
> /nlm/query?author=Smith
> 
> <map:match pattern="nlm/query">
>   <map:match type="request" pattern="author">
>     <map:generate src="http://www.ncbi.nlm.nih.gov/entrez/utils/pmqty.fcgi?db=PubMed&amp;mode=XML&amp;dispmax=999&amp;term={1}[au]"/>
>     <map:serialize type="xml"/>
>   </map:match>
> </map:match>
> 
> the xml returned from the nih server will begin like so:
> 
> <?xml version="1.0"?>
> <!DOCTYPE QueryResult PUBLIC "-//NLM//DTD QueryResult, 22 Jan 2002//EN"
> "/entrez/query/DTD/pmqty_020122.dtd" >
> <QueryResult>
> 
> unfortunately, i get an exception when cocoon tries to parse this
> document. it claims that it cannot access the dtd:
> 
> java.net.MalformedURLException: no protocol: /entrez/query/DTD/pmqty_020122.dtd
> 	at java.net.URL.<init>(URL.java:473)
> 	at java.net.URL.<init>(URL.java:376)
> 	at java.net.URL.<init>(URL.java:330)
> 	at org.apache.xerces.impl.XMLEntityManager.startEntity(XMLEntityManager.java:731)
> 	at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:691)
> 	at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:258)
> 	at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(XMLDocumentScannerImpl.java:811)
> 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:333)
> 	at org.apache.xerces.parsers.StandardParserConfiguration.parse(StandardParserConfiguration.java:525)
> 	at org.apache.xerces.parsers.StandardParserConfiguration.parse(StandardParserConfiguration.java:581)
> 	at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:147)
> 	at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1157)
> 	at org.apache.avalon.excalibur.xml.JaxpParser.parse(JaxpParser.java:241)
> 	at org.apache.cocoon.components.source.AbstractStreamSource.toSAX(AbstractStreamSource.java:204)
> 	at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:142)
> 
> shouldn't it be trying to download the DTD from this url:
> 
> http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pmqty_020122.dtd
> 
> where it does, in fact, live?
> 
> i did manage to work around this problem using the excellent entity
> catalogs facility, and i suspect that's what we'll want to use in the long
> term, but i would like to track down why this isn't working as (i think)
> it ought to. thanks in advance.
> 
> - donald

Good to hear that the entity catalogs worked for you.
I think that the reason that you cannot do without the
entity catalog resolver, is that the document type declaration
in the XML instance document is not using a full URL, i.e.
http://www.ncbi.nlm.nih.gov/entrez/qu...
So the parser is trying to find the DTD at the root of your
local filesystem, i.e. /entrez/qu...
--David

