Posted to general@xml.apache.org by Ricardo A Capurro <rc...@ta.telecom.com.ar> on 2000/08/09 20:43:07 UTC

Question about defining DTDs

Hello there.

I am evaluating XML for exchanging structured data between several
applications, so I am trying to define a couple of DTDs that the applications
can agree on.
The kind of data I am working with is highly structured, for example this
C++ struct:

struct Contact {
    string last_name;
    string first_name;
    string company;
    int year_of_birth;
};

After thinking for a while I decided not to use attributes for the members of
the Contact structure, because certain kinds of members (like vectors or
hashes) would be very difficult to represent as attributes. Instead I decided
to use an XML element for each member of my C++ structure.

The definition of my DTD would become something like this:

<!DOCTYPE Contact [
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT year_of_birth (#PCDATA)>
<!ELEMENT Contact (last_name,first_name,company?,year_of_birth)>
]>

I can live with the element year_of_birth being of type #PCDATA even though it
is an integer in my application; I would need to validate the whole value
against the regular expression /[1-9][0-9]*/ before accepting it in my program.
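A minimal sketch of that check in C++, assuming a C++11 or later standard library for std::regex. The function name is_valid_year_of_birth is only illustrative:

```cpp
#include <regex>
#include <string>

// Returns true when `text` is a decimal integer with no leading zero,
// i.e. when the whole string matches [1-9][0-9]*.
// std::regex_match succeeds only if the entire string matches, so no
// explicit ^ and $ anchors are needed.
bool is_valid_year_of_birth(const std::string& text) {
    static const std::regex pattern("[1-9][0-9]*");
    return std::regex_match(text, pattern);
}
```

Because the match covers the whole string, a value with surrounding white space such as " 1968 " would be rejected unless it is trimmed first.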

So an example of a Contact element could be as follows:

<Contact><last_name>Capurro</last_name><first_name>Ricardo</first_name><year_of_birth>1968</year_of_birth></Contact>

But as you can see, this text is hard to read. I mean it is hard for a person,
not for an XML processor, and the difficulty increases with the size of the
structure, so I would like to write it this way to increase readability:

<Contact>
    <last_name>Capurro</last_name>
    <first_name>Ricardo</first_name>
    <year_of_birth>1968</year_of_birth>
</Contact>

But if I use this XML idiom, perhaps I am getting a Contact element that does
not agree with its definition,

<!ELEMENT Contact (last_name,first_name,company?,year_of_birth)>

because I am using white space between the elements that make up a
Contact.

Looking at the XML 1.0 specification, according to the element production you
can see that the content must be a sequence of zero, one or more of element,
CharData, Reference, CDSect, PI or Comment. Of all these, the only one that
matches white space is CharData, and it matches almost any other character too!

[39] element ::= EmptyElemTag
                 | STag content ETag [WFC: Element Type Match]
                 [VC: Element Valid]

[40] STag ::= '<' Name (S Attribute)* S? '>' [WFC: Unique Att Spec]
[41] Attribute ::= Name Eq AttValue [VC: Attribute Value Type]
                   [WFC: No External Entity References]
                   [WFC: No < in Attribute Values]
[42] ETag ::= '</' Name S? '>'
[43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*
[44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>' [WFC: Unique Att Spec]
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

In the case above, I can rewrite the Contact element declaration to accept
white space (and other character data) between the elements this way:

<!ELEMENT Contact (#PCDATA|last_name|first_name|company|year_of_birth)*>

But if I do this, I would have to do a lot more validation in my applications
to ensure that I am receiving the elements I need
(last_name, first_name, year_of_birth), exactly one of each, and to reject
elements written like this:

<Contact>
trash ... trash
    <last_name>Capurro</last_name>
    <first_name>Ricardo</first_name>
more trash ... and trash
    <year_of_birth>1968</year_of_birth>
and the end trash
</Contact>
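
The extra checks described above might look roughly like this in C++. This is only a sketch: the Child struct is a hypothetical stand-in for whatever a real parser hands to the application, and the function names are made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical view of one child of <Contact> as the application might
// receive it: either a run of character data or a child element.
struct Child {
    bool is_text;       // true: character data, false: element
    std::string value;  // the text itself, or the element name
};

// True when `text` holds only XML white space (space, tab, CR, LF).
bool is_xml_whitespace(const std::string& text) {
    return text.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Checks the children against (last_name,first_name,company?,year_of_birth),
// tolerating white space between elements and rejecting any other text.
bool is_valid_contact(const std::vector<Child>& children) {
    std::vector<std::string> names;
    for (const Child& child : children) {
        if (child.is_text) {
            if (!is_xml_whitespace(child.value)) return false;  // "trash" text
        } else {
            names.push_back(child.value);
        }
    }
    const std::vector<std::string> with_company =
        {"last_name", "first_name", "company", "year_of_birth"};
    const std::vector<std::string> without_company =
        {"last_name", "first_name", "year_of_birth"};
    return names == with_company || names == without_company;
}
```

Comparing the collected names against the two allowed sequences enforces both the order and the "exactly one of each" rule, which the mixed-content declaration no longer checks.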

So I suppose that I am not understanding this part of the XML specification
very well.

The final questions are:

Is white space interpreted as PCDATA inside an element?

In case the white space is not interpreted as PCDATA, if I have for
example
<last_name>   Capurro        </last_name>
am I going to lose the white space on the left and the right when I process
that element?

I have another question about the XML specification: if you look at the
productions for Names and Nmtokens, you can see that they are not used by any
other production at all.
Did I miss something, or are these productions not needed at all in the XML
1.0 specification?

Thank you very much.

Ricardo

Re: Question about defining DTDs

Posted by Arnaud Le Hors <le...@us.ibm.com>.

Ricardo A Capurro wrote:
> 
> so I would like to write it this way to increase
> readability
> 
> <Contact>
>     <last_name>Capurro</last_name>
>     <first_name>Ricardo</first_name>
>     <year_of_birth>1968</year_of_birth>
> </Contact>
> 
> But if I use this XML idiom, perhaps I am getting a Contact element that does
> not agree with its definition,
> 
> <!ELEMENT Contact (last_name,first_name,company?,year_of_birth)>
> 
> because I am using white space between the elements that make up a
> Contact.

No, you'd be just fine. The parser would report the whitespace as being
"ignorable", or if you tell it not to bother (by setting the right
option), it would simply skip it.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group