Posted to dev@cocoon.apache.org by Miles Elam <mi...@pcextremist.com> on 2004/02/25 09:48:50 UTC

USQL candidate? [Was Re: [RT] Connecting databases with Cocoon]

Here's an idea I've been kicking around.  Let's see what folks think of 
it.  I'm only covering the non-Java-programmer crowd here.  Programmatic 
access via EJB/Component is and should be done through Flow IMHO.  For 
others, especially those with more publishing-oriented and/or simple in-out 
requirements, this is my proposal.

---------------------------------------------------------------------

Have a mapping like XML-DBMS's, but not via a RelaxNG/XMLSchema 
wannabe.  No offense intended toward the XML-DBMS folks, but the task 
didn't seem all that inviting to me.  I've seen similar approaches via 
DTD annotation.

This aversion is on multiple levels.  First and foremost, the mapping 
file is yet another file to create and maintain.  Sure it's not Java, 
but the layer-specific syntax might as well be another programming 
language.  "Just" XML is a misnomer.  After all, is XSLT just another 
document schema or is it a programming language?  (Didn't someone find 
it to be Turing-complete some time back?)  I also have concerns over 
complexity which I will cover later.

Most relevant to me, I am tired of doing the same old thing over and 
over again. This is what computers are for: to take the burden of 
repetitive, tedious tasks from us and to do these tasks faster than we 
can.  And in my book that includes writing yet another set of SQL 
statements.  To be perfectly honest, I do not consider knowledge of SQL 
syntax to be a significant enhancement to my life. Don't get me wrong. 
It has come in very handy over the years, but it is simply an obscure 
(albeit relatively simple) syntax to reference an ephemeral data 
organization pattern. I might say the same about XML were it not for the 
fact that XML is just the data structure without much in the way of 
syntax; XML is a serialization and processing format and serves that 
function far better than any alternative I can dream up.

This brings me to my point.  A relational database is a largely 
interrelated data structure.  XML is a largely interrelated data 
structure.  SQL is simply the convenient middleman.  Others have seen 
this fact as is evidenced by the number of relational-to-XML tools out 
there.  But where these tools strive desperately to solve every issue 
that may come up with data structure synchronization, I just wanted my 
common cases to go away.  If greater than ninety percent of my cases can 
be solved with a simple syntax and less than ten percent of my cases 
require more difficult roads, I would be satisfied.  Let's face it, 
we've all been confronted with the ninety percent case more times than 
we care to speak about.  It is the corner case that intrigues us so much 
that we enjoy spending time to solve it.  The less time the common case 
takes, the more time we can spend in enjoyable problem solving.

I'm tired of spending time on getting an integer out of a result set and 
explicitly converting it to a string representation of the hex value.  
Frankly, I'm tired of having to remind the computer that I'm working 
with an integer.  My database knows what type the field is.  I'm the one 
who told it when setting up the database schema.  My application server 
knows where my database is.  I'm the one who specified the datasource.  
I don't see the need to tell my servers over and over again when they 
have things called "memory" and "processing power."

So here's my proposal for a syntax that's as simple as possible but no 
simpler.

------------------------------------------------

Step 1: Let's assume we have a table called "books" in a database with 
the following structure and information. To make the example as simple 
as possible, we'll keep the number of items in the table small. It 
should be trivial to imagine a larger dataset.

Table "books"
  id [serial]
  isbn [integer]
  title [varchar(100)]
  subtitle [varchar(100)]
  author [varchar(50)]
  pubdate [date]

------------------------------
1
764543555
Professional XML Development with Apache Tools
Xerces, Xalan, FOP, Cocoon, Axis, Xindice
Theodore W. Leung
1 Dec 2000
------------------------------
2
735712352
Cocoon
Building XML Applications
Carsten Ziegeler, Matthew Langham
24 July 2002
------------------------------
3
672322579
Cocoon Developer's Handbook
NULL
Lajos Moczar, Jeremy Aston
10 Dec 2002
------------------------------

------------------------------------------------

Step 2: Create the Cocoon (or EJB?) datasource that points to the 
above.  In this case, we'll call it "library."

------------------------------------------------

Step 3: Create a simple XML file that reflects the structure and what we 
want the output to look like.

<?xml version="1.0" encoding="ISO-8859-15"?>
<BookList>
  <Book>
    <ISBN/>
    <Title/>
    <Subtitle/>
    <Author/>
    <Published/>
  </Book>
</BookList>

------------------------------------------------

Step 4: Add a few "extra" attributes to that XML file. Adding the 
attributes "table" and "datasource" signal a context point. All results 
from this query will be relative to this XML element.

<?xml version="1.0" encoding="ISO-8859-15"?>
<BookList>
  <Book xmlns:ds="http://geekspeak.org/datasource/1.0"
        ds:datasource="library"
        ds:table="books">
    <ISBN ds:select="isbn:##########"/>
    <Title ds:select="title"/>
    <Subtitle ds:select="subtitle"/>
    <Author ds:select="author"/>
    <Published ds:select="pubdate:yyyy-MM-dd"/>
  </Book>
</BookList>
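
Behind the scenes, the transformer would presumably gather the
ds:select column names under a context point and derive a single query
from them.  Here's a minimal sketch in Java of that derivation; the
class and method names are invented for illustration, not an existing
API.

import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class QuerySketch {

    /** Builds a SELECT from a context point's table and columns. */
    public static String buildSelect(String table, Set columns) {
        StringBuffer sql = new StringBuffer("SELECT ");
        for (Iterator it = columns.iterator(); it.hasNext();) {
            sql.append((String) it.next());
            if (it.hasNext()) sql.append(", ");
        }
        return sql.append(" FROM ").append(table).toString();
    }

    public static void main(String[] args) {
        Set cols = new LinkedHashSet();  // preserves document order
        cols.add("isbn");
        cols.add("title");
        cols.add("subtitle");
        cols.add("author");
        cols.add("pubdate");
        // Prints: SELECT isbn, title, subtitle, author, pubdate FROM books
        System.out.println(buildSelect("books", cols));
    }
}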

------------------------------------------------

The output XML after transformation would be the following.

<?xml version="1.0" encoding="ISO-8859-15"?>
<BookList>
  <Book>
    <ISBN>0764543555</ISBN>
    <Title>Professional XML Development with Apache Tools</Title>
    <Subtitle>Xerces, Xalan, FOP, Cocoon, Axis, Xindice</Subtitle>
    <Author>Theodore W. Leung</Author>
    <Published>2000-12-01</Published>
  </Book>
  <Book>
    <ISBN>0735712352</ISBN>
    <Title>Cocoon</Title>
    <Subtitle>Building XML Applications</Subtitle>
    <Author>Carsten Ziegeler, Matthew Langham</Author>
    <Published>2002-07-24</Published>
  </Book>
  <Book>
    <ISBN>0672322579</ISBN>
    <Title>Cocoon Developer's Handbook</Title>
    <Author>Lajos Moczar, Jeremy Aston</Author>
    <Published>2002-12-10</Published>
  </Book>
</BookList>

The datasource namespace is stripped during transformation. Note that 
for the ISBN, we simply use (Java) standard number formatting syntax. 
How does it know that it's an integer? Because the database told it so. 
The same is true for how it knows that the <Published> element is a date 
datatype. The database has defined the corresponding field to be a date. 
All we need to do at this point is tell it what format we want the date 
returned in.

The third row doesn't have a <Subtitle> tag because the table column 
"subtitle" was NULL.  How do we enforce that the tag always exists?  
Your datasource's data model decides that.  If it shouldn't be 
missing/NULL, the table column should be "NOT NULL" or have some other 
similar constraint.  The resultant XML will always be as strict/loose as 
your underlying data model.
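
To make the type-driven formatting concrete, here's a rough sketch
assuming the transformer keys off JDBC's ResultSetMetaData.  One
caveat: java.text.DecimalFormat only emits leading zeros for '0'
pattern digits, so the "##########" pattern above appears here as
"0000000000" to match the leading zero in the sample output.  A NULL
column yields null, i.e. no element at all.

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;
import java.text.DecimalFormat;
import java.text.SimpleDateFormat;

public class ColumnFormatter {

    /** Formats one column using the type the database reports. */
    public static String format(ResultSet rs, int col, String pattern)
            throws SQLException {
        Object value = rs.getObject(col);
        if (value == null) {
            return null;  // NULL column: emit no element
        }
        switch (rs.getMetaData().getColumnType(col)) {
            case Types.INTEGER:
            case Types.BIGINT:
                return new DecimalFormat(pattern)
                        .format(((Number) value).longValue());
            case Types.DATE:
            case Types.TIMESTAMP:
                return new SimpleDateFormat(pattern)
                        .format((java.util.Date) value);
            default:
                return value.toString();
        }
    }
}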

------------------------------------------------

This mucks with the idea of a logicsheet obviously (this isn't XSP 
generator input).  It fits well with Stefano's notion of virtual 
generators though: a file generator followed by a datasource output 
transformer could form a virtual generator named "datasource." 
(http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=106131747919501&w=2)  
In fact, virtual sitemap components could obviate the need for many 
logicsheets, but that is another (probably very controversial) topic to 
be covered later.

This is actually more of a move toward "compiler vs. assembly": make the 
syntax as simple as possible for the author and leave the heavy lifting 
for the computer to handle.  In other words, computer processing time is 
cheap; developer time is expensive.  The most important goal for me is 
ease of authorship, comprehension, and maintainability.  Authorship is 
covered by having the output document be as close to the mapping file as 
possible.  Comprehension is handled (as with logicsheets) through the 
use of namespaces -- making separate portions as distinct from one 
another as possible.  Maintainability comes from having the underlying 
database handle all datatype issues and keeping everything in a single 
mapping file -- fewer possibilities for stale description files.

With a mapping/template file, you could have more complex structures 
such as the following for a radio show I work on:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- "shows" is the main table reference -->
<show xmlns:ds="http://geekspeak.org/datasource/1.0"
      ds:table="shows"
      ds:constraint="livedate = {#date}">
  <showinfo>
    <topic ds:select="topic"/>
    <livedate ds:select="livedate"/>
    <participationgroup>
      <!-- nested table reference -->
      <participant ds:table="showparticipants"
                   ds:constraint="showid = {id}"
                   username="{username}">
        <ds:value-of select="info:xml"/>
        <role ds:select="role"/>
      </participant>
    </participationgroup>
    <abstract ds:select="abstract:xml"/>
  </showinfo>
  <ds:value-of select="content:xml"/>
</show>

Here the model is the same as XSLT's attribute value templates, with 
'{' and '}' as value delimiters.  Also similar to XSLT is the <value-of> 
element albeit with a different namespace.  In the main table 
declaration, we have the placeholder "{#date}".  This is meant to refer 
to a parameter passed from the sitemap but could just as easily be 
removed in favor of a variable replacement further up the sitemap 
processing chain (in a dedicated generator/transformer or XSP).  The 
nested table reference would take the "{id}" value from the parent query 
context.  Presumably this nested query would be translated into a 
PreparedStatement by the transformer with the parameter updated on each 
iteration, but this is an implementation detail.  I'm talking about 
designing the API and covering use cases here.
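
For illustration only, here's what that implementation detail might
look like in straight JDBC, using the table and column names from the
mapping above (connection setup and SAX output elided):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class NestedQuerySketch {

    public static void run(Connection conn, java.sql.Date date)
            throws SQLException {
        PreparedStatement outer = conn.prepareStatement(
            "SELECT id, topic, livedate FROM shows WHERE livedate = ?");
        PreparedStatement inner = conn.prepareStatement(
            "SELECT username, info, role FROM showparticipants"
            + " WHERE showid = ?");
        outer.setDate(1, date);  // the {#date} sitemap parameter
        ResultSet shows = outer.executeQuery();
        while (shows.next()) {
            // re-bind the {id} context value on each iteration
            inner.setInt(1, shows.getInt("id"));
            ResultSet parts = inner.executeQuery();
            while (parts.next()) {
                // ... emit one <participant> element per row ...
            }
            parts.close();
        }
        shows.close();
        outer.close();
        inner.close();
    }
}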

The item:

    <abstract ds:select="abstract:xml"/>

is meant to be an extraction of XML from the specified (text-based) 
column with the root being the "abstract" element.  Unlike with XInclude 
or Cocoon's CInclude, I see it as a document fragment rather than a 
complete document on its own; in other words, while requiring 
well-formedness, it could have multiple roots since it's being added to a 
larger document with an already-defined root element.
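
One plausible way to honor "well-formed but possibly multiple roots"
is to wrap the column's text in a synthetic root element before
parsing and strip the wrapper's events on the way out.  A sketch (not
necessarily how my actual code does it):

import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentParser {

    /** Parses a multi-root fragment, forwarding events to 'out'. */
    public static void parse(String fragment, final DefaultHandler out)
            throws Exception {
        String wrapped = "<wrapper>" + fragment + "</wrapper>";
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(wrapped)),
            new DefaultHandler() {
                public void startElement(String uri, String local,
                        String qName, Attributes atts) throws SAXException {
                    if (!"wrapper".equals(qName)) {
                        out.startElement(uri, local, qName, atts);
                    }
                }
                public void endElement(String uri, String local,
                        String qName) throws SAXException {
                    if (!"wrapper".equals(qName)) {
                        out.endElement(uri, local, qName);
                    }
                }
                public void characters(char[] ch, int start, int len)
                        throws SAXException {
                    out.characters(ch, start, len);
                }
            });
    }
}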

------------------------------------------------

Caching becomes an interesting aspect.  Cache keys would simply be a 
sitemap parameter concatenation.  Geoff Howard's event-based cache is 
one option for validity.  Another is to check the value of a particular 
field (which gets updated by a trigger for example).  Even in nested 
table mapping files, I'd suggest just one cache trigger or you could end 
up with validity objects that are in fact aggregate validities of every 
joined value.
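
As a sketch of the trigger-updated-field option: the validity check
could boil down to a single query against an assumed "last_modified"
column (the column name is invented here for illustration).

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class TriggerValidity {

    /** True if the row hasn't been touched since we cached it. */
    public static boolean isStillValid(Connection conn, int showId,
            Timestamp cachedAt) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT last_modified FROM shows WHERE id = ?");
        try {
            ps.setInt(1, showId);
            ResultSet rs = ps.executeQuery();
            return rs.next() && !rs.getTimestamp(1).after(cachedAt);
        } finally {
            ps.close();
        }
    }
}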

Constructs such as the following would allow for flexible datasource 
caching strategies.

<?xml version="1.0" encoding="ISO-8859-1"?>
<show xmlns="http://geekspeak.org/2.0"
      xmlns:ds="http://geekspeak.org/datasource/1.0"
      ds:table="shows"
      ds:constraint="livedate = {#date}">
  <showinfo>
    <topic ds:select="topic"/>
    <livedate ds:select="livedate"/>

    <!-- ### Include a piece from elsewhere with its own caching
         (or lack thereof) ### -->
    <include href="show{id}_participants.xml"
             xmlns="http://www.w3.org/2001/XInclude"/>

    <abstract ds:select="abstract:xml"/>
  </showinfo>
  <ds:value-of select="content:xml"/>
</show>



<?xml version="1.0" encoding="ISO-8859-1"?>
<participationgroup xmlns:ds="http://geekspeak.org/datasource/1.0">
  <!-- new main table reference -->
  <participant ds:table="showparticipants"
               ds:constraint="showid = {#showid}"
               username="{username}">
    <ds:value-of select="info:xml"/>
    <role ds:select="role"/>
  </participant>
</participationgroup>



------------------------------------------------

What about insert/update/delete?  This is simultaneously easier and 
harder.  Easier because the exact same input file can be used as an 
input map as well.

<map:transform type="datasourceinsert" src="mappingfile.xml"/>

Is this less flexible than mapping syntaxes based on or similar to DTD, 
RelaxNG, or XMLSchema?  Yes.  But it is far easier to read and 
understand.  Also that flexibility only really shows up with 
document-oriented sections of the input.  Data-oriented portions are 
usually tightly locked with the XML schema.  In this case, 
document-oriented portions are simply treated as database text items and 
thrown in wholesale.  Validation can be performed on the source document 
to check correctness, but I haven't yet come across a concrete reason to 
address the issue in the datasource interchange layer.  Perhaps the 
relatively strict constraints on the input could be more boon than bane?

Extracting the data from the source via rules from the map should be 
pretty straightforward.  Most items will simply be "ds:select" lookups 
and the attribute placeholder syntax lends itself well to regexes.

     <MyElement myattr="a {foo} deserves a {bar}"/>

"myattr" transforms easily by converting the "{...}" patterns to "(.*)" 
as in the following regex pattern for input processing:

     /^a (.*) deserves a (.*)$/
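
Here's that conversion sketched with java.util.regex (also a 1.4
feature).  Note this naive version assumes the literal parts of the
attribute value contain no regex metacharacters; a real implementation
would escape them.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateToRegex {

    /** "a {foo} deserves a {bar}" -> /^a (.*) deserves a (.*)$/ */
    public static Pattern toPattern(String template) {
        Matcher m = Pattern.compile("\\{[^}]*\\}").matcher(template);
        return Pattern.compile("^" + m.replaceAll("(.*)") + "$");
    }

    public static void main(String[] args) {
        Matcher m = toPattern("a {foo} deserves a {bar}")
                .matcher("a hug deserves a smile");
        if (m.matches()) {
            System.out.println(m.group(1));  // prints: hug
            System.out.println(m.group(2));  // prints: smile
        }
    }
}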

So when is it harder?  When it isn't clear exactly what you want.  
Should it be a database update?  Should it be an insert?  Should it be a 
sync (checking for an existing entry, inserting if missing, and updating 
if present)?  With this mapping syntax, only one choice is possible at 
one time; no explicit sequence of updates followed by some inserts like 
with the SQLTransformer.  It can do multiple (in the sync case), but 
that is more processing time and may require table locks.  In most web 
environments, data is viewed far more often than updated so perhaps this 
isn't a real performance issue.  The 80-20 rule may dictate that 
inserts/updates are not in the 20% of code that takes up 80% of the 
time/resources.  Your mileage may vary.  Personally, I usually do one 
type of operation at a time, but I am not everyone.  Anyone have ideas 
or thoughts about this one?
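
For concreteness, here's one sketch of the sync case.  I've done it
update-first (saving a round trip) rather than select-first as
described above; a real implementation would also wrap this in a
transaction, or take the table locks mentioned, to avoid races.
Column names come from the books example.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SyncSketch {

    /** Updates the row for this isbn, inserting it if missing. */
    public static void sync(Connection conn, int isbn, String title)
            throws SQLException {
        PreparedStatement upd = conn.prepareStatement(
            "UPDATE books SET title = ? WHERE isbn = ?");
        upd.setString(1, title);
        upd.setInt(2, isbn);
        if (upd.executeUpdate() == 0) {
            // nothing matched: fall back to an insert
            PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO books (isbn, title) VALUES (?, ?)");
            ins.setInt(1, isbn);
            ins.setString(2, title);
            ins.executeUpdate();
            ins.close();
        }
        upd.close();
    }
}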

------------------------------------------------

In short, there is strong reliance on the database for processing.  How 
does it work?  I don't know. (Well, I *do* know how the code I've 
written works, but this is the hypothetical user "I".)  I shouldn't 
know.  In certain predictable cases, a join could be employed behind the 
scenes.  SAX pass-throughs are possible until you hit a context point, 
but then only the context need be saved in memory, not the whole 
document.  Something akin to XSLTC could be written as well.  But I'm 
getting way too far ahead of myself.  The point is that making the 
syntax as clear and functional as possible matters most.  Just as 
XSLT processors are much faster today than initial attempts, this could 
be optimized as time and need present themselves.

Some caveats though.

While I have very clear ideas on how it could be implemented and have 
written about 1500 lines of Java for the RDBMS->XML transformer portion, 
it doesn't currently exist (unlike the XSP logicsheet and 
SQLTransformer).  I'm getting slammed in school and paid work takes 
precedence as well, so I can't guarantee when I can get it 
working/debugged.  I have every intention of donating the code if/when 
finished, so that's not the issue.  I also wouldn't mind giving what I 
have to folks who request it so that they can tinker with it (or rewrite 
it completely after badmouthing my coding skills...  ;-)).

And finally, it requires Java 1.4.  This is because of the 
PreparedStatements and parameter passing.  ParameterMetaData is 
required to see what each parameter's datatype is.  This class is only 
available with JDBC 3.0, hence the dependency on v1.4.
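
A minimal sketch of why that matters: ParameterMetaData lets the
transformer ask the driver what type each placeholder expects, so the
mapping file never has to say.

import java.sql.Connection;
import java.sql.ParameterMetaData;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

public class ParamTypeSketch {

    public static void bind(Connection conn, String value)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT * FROM shows WHERE livedate = ?");
        ParameterMetaData meta = ps.getParameterMetaData();  // JDBC 3.0
        // convert the sitemap's string parameter to what the driver expects
        if (meta.getParameterType(1) == Types.DATE) {
            ps.setDate(1, java.sql.Date.valueOf(value));  // "yyyy-mm-dd"
        } else {
            ps.setString(1, value);
        }
        // ... execute and stream the results ...
    }
}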

Bottom line: I'm going to use it when finished unless someone in the 
Cocoon community comes up with something that blows my socks off (it's 
happened before).  The real question is whether others want it too.

Do you?

- Miles



Reinhard Poetz wrote:

>After using Cocoon for years I still think we are not very clear in
>telling our users what is the right tool for their needs. Of course this
>depends on their use cases and their skills.
>
>I see three different "Cocoon & databases" scenarios:
>
>1. Use Cocoon in enterprise scenarios which are 
>   highly interactive (--> many forms) + data which
>   has to be saved in databases
>
>   --> use CocoonForms, controlled by the Flow, a
>       business facade written in Java, 
>       O/R-mapping as persistence layer
>
>   --> or use a tool like XML-DBMS (see Mail from
>       Daniel) as CocoonForms - database binding
>       tool.
>
>2. Use Cocoon as 'pure' publishing engine
>   --> use ESQL or SQL Transformer (or sooner or
>       later USQL ...)
>
>
>3. Use Cocoon 'mainly' for publishing and there are
>   only a few interactive pages that have to be
>   persistent in a database and the user is not
>   familiar with Java (this is for many people
>   a reason why to use Cocoon because they can
>   use XML everywhere and don't need to learn
>   a programming language)
>
>As you can see, scenarios 1 and 2 have clear solutions; scenario 3
>doesn't.  Some use ESQL with the drawback of mixing concerns, some use
>SQLTransformer which doesn't make it easy to build
>create/update/delete applications, and some use DatabaseActions which
>are IMO the simplest way but are not supported by flow and are hidden
>in the depths of Cocoon (IMHO). 
>
>IMO we should make the DatabaseActions more general components (...
>shouldn't be too difficult) which can be used from within flow or from
>actions, and we should make clear statements about what we as
>developers consider 'best practices' and what the pros and cons of
>each technology are.
>
>WDYT?
>
>--
>Reinhard
>  
>