You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by Bryan Pendleton <bp...@amberpoint.com> on 2006/08/07 00:03:52 UTC
DERBY-688: Some review comments and feedback
Hi Army, thanks for posting the patches and for continuing the work
on the XML support. I think this is going to be a great feature!
Here's my feedback; hope it's useful.
thanks,
bryan
1) The patches read well. The comments are fantastic! The effort is greatly
appreciated.
2) The patches all applied cleanly for me, once I locally edited the
absolute file names in the patches. After each individual patch in the
sequence, I had no problems re-building derby. So really no build problems
to mention.
3) Who will run these tests, and when? If all the execution code is optional,
how do we ensure that it doesn't get broken?
4) Can you further explain the BY VALUE vs. BY REF behaviors? What do these
clauses mean, why is BY REF better, at what point would we want to
re-introduce BY VALUE, how does this manifest itself in the code?
5) If/when you re-generate the patches, please use relative path names for
the files in the patches so that we don't get strings like
c:/private/derby_src/java in the file names.
6) This is kind of a user-level question, and shows my ignorance about how
XML support is supposed to fit into Derby: most of your examples and tests
show the use of extremely tiny XML documents; they can fit into literal
strings and are at most a few hundred bytes long. But in practice, XML
documents are often ridiculous gigantic things which are hundreds of
thousands of bytes long, and people try not to manipulate them in memory,
but rather read them from files and write them to files, streaming them
through parsers and into in-memory DOM trees only as needed.
How does this work in Derby? Some questions that occur:
a) If I have a large XML document in a file, how do I get that into my
XML column in my database? Is it like a CLOB/BLOB where I work with
some sort of a special stream class?
b) The mirror-image question is how do I fetch a large XML document from
my table and stream it to my file on my client efficiently?
c) Internally, does the store use CLOB/BLOB techniques for XML storage?
does it store them in separate files?
d) how does DRDA tranmit XML over the net? Is it externalized data?
Obviously, these questions are motivated by some of the work that
Tomohito Nakayama and others have been doing recently with BLOB/CLOB
efficiency, for example DERBY-326 and DERBY-550.
7) Another user-level question: in your test programs, your XML documents
tend to be quite simple. They don't have the sorts of things that
real-life XML documents have, like:
a) <?xml ... ?> headers, with varying encodings and the like
b) multiple namespaces with various namespace prefixes
c) strange sections of escaped CDATA
d) DTD declarations with external DTDs
e) named external entities
etc.
Presumably, since all of this is handled by the parser, "it just works".
However, I'm a little confused about how the parsing happens in a
client-server scenario: is the XMLPARSE processing performed on the
client side? Or on the server side? I think this only becomes relevant
when the user must do something to ensure that the XML parser and the
XPATH/XQUERY engines are configured properly; they need to know which
"side" (client/server) of their environment needs to be so configured.
8) We need to make sure that the documentation clearly specifies which versions
of the add-on XML software (parsers, XPATH, etc.) are specified, and we
need to do our best to make the error messages when a bad version is used
clear and specific. For example, XALAN 2.4 is bundled with the Sun 1.4
JDK but it is probably far too old to be used successfully. Yet how to
install a newer version as an endorsed standard, and how to recognize the
error messages when the wrong version is being used, is pretty subtle
right now.
9) When I run lang/xmlBinding.java, I see the following diff. This diff occurs
in all three configurations I tried (embedded, DerbyNet, and DerbyNetClient)
-bash-2.05b$ java org.apache.derbyTesting.functionTests.harness.RunTest lang/xmlBinding.java
*** Start: xmlBinding jdk1.4.2_11 2006-08-05 17:28:52 ***
9 del
< Inserted roughly 40k of data.
10 del
< Inserted roughly 40k of data.
10a9,10
> Inserted roughly 39k of data.
> Inserted roughly 37k of data.
21 del
< 1, [ roughly 40k ]
22 del
< 2, [ roughly 40k ]
22a21,22
> 1, [ roughly 39k ]
> 2, [ roughly 37k ]
Test Failed.
*** End: xmlBinding jdk1.4.2_11 2006-08-05 17:28:59 ***
Re: DERBY-688: ready for commit?
Posted by Bryan Pendleton <bp...@amberpoint.com>.
Yip Ng wrote:
> I think these patches are ready for commit. +1
Thanks Yip!
I will proceed with committing the patches starting today.
thanks,
bryan
Re: DERBY-688: ready for commit?
Posted by Yip Ng <yi...@gmail.com>.
Hi Army:
I have reviewed your Derby-688 XML patches. Thanks for continuing
improving
XML support for Derby. I have applied the patches and was able to compile
without any problems (except for the part where I have to modified the
patches manually to remove the absolute paths). I think Bryan have asked
most of my concerns/questions and they have
been addressed. (Great comments in the code btw). I think these patches
are ready
for commit. +1
Yip
On 8/7/06, Bryan Pendleton <bp...@amberpoint.com> wrote:
>
> > Otherwise, if any of my answers above would make you uncomfortable with
> > committing the patches (or with approving their commit), please let me
> > know and I will try to address your concerns.
>
> Hi Army,
>
> I am comfortable with your responses, and in my opinion the 5 patches
> are ready for commit.
>
> Is anybody else intending to review these patches over the next few days?
>
> If no other reviews are underway, I propose to commit these patches by Aug
> 10.
>
> A question: is it best that I should commit them as 5 separate commit
> operations? An alternative would be to apply all 5 patches to my sandbox,
> and then commit them with a single commit, which matches the way that I
> reviewed them, but does not match the way that they are attached to the
> JIRA issue. Is there any particular reason to favor one technique versus
> the other?
>
> thanks,
>
> bryan
>
>
Re: DERBY-688: ready for commit?
Posted by Daniel John Debrunner <dj...@apache.org>.
Bryan Pendleton wrote:
>> Otherwise, if any of my answers above would make you uncomfortable
>> with committing the patches (or with approving their commit), please
>> let me know and I will try to address your concerns.
>
>
> Hi Army,
>
> I am comfortable with your responses, and in my opinion the 5 patches
> are ready for commit.
>
> Is anybody else intending to review these patches over the next few days?
>
> If no other reviews are underway, I propose to commit these patches by
> Aug 10.
>
> A question: is it best that I should commit them as 5 separate commit
> operations? An alternative would be to apply all 5 patches to my sandbox,
> and then commit them with a single commit, which matches the way that I
> reviewed them, but does not match the way that they are attached to the
> JIRA issue. Is there any particular reason to favor one technique versus
> the other?
I'd vote for separate patches it makes it " easier to identify as a
cause of a regression after the fact".
http://wiki.apache.org/db-derby/IncrementalDevelopment
Dan.
Re: DERBY-688: ready for commit?
Posted by Bryan Pendleton <bp...@amberpoint.com>.
> Otherwise, if any of my answers above would make you uncomfortable with
> committing the patches (or with approving their commit), please let me
> know and I will try to address your concerns.
Hi Army,
I am comfortable with your responses, and in my opinion the 5 patches
are ready for commit.
Is anybody else intending to review these patches over the next few days?
If no other reviews are underway, I propose to commit these patches by Aug 10.
A question: is it best that I should commit them as 5 separate commit
operations? An alternative would be to apply all 5 patches to my sandbox,
and then commit them with a single commit, which matches the way that I
reviewed them, but does not match the way that they are attached to the
JIRA issue. Is there any particular reason to favor one technique versus
the other?
thanks,
bryan
Re: DERBY-688: Some review comments and feedback
Posted by Army <qo...@gmail.com>.
Hi Bryan,
Thanks so much for reviewing the changes and for verifying the patches. I
appreciate your time with this.
> 3) Who will run these tests, and when? If all the execution code is
> optional, how do we ensure that it doesn't get broken?
I still have a couple of more patches to post to complete the XML work (phases
4, 5, ...). One of those patches (probably the final one) will enable the
xmlSuite to be run as part of derbyall. Due to the dependencies on Xalan and on
a JAXP implementation, the suite will not run against all JVMs: to start, I'll
just have it running against JVMs that have the required classes included in
them. This was originally going to be Sun and IBM 1.4, but as you discovered
this past weekend, the Xalan that's embedded with Sun jdk 1.4.2 is not a recent
enough version to pass. The tests cannot run against Sun 1.5 because the Xalan
classes have (I believe) been renamed in Sun jdk 1.5 and thus will not be
available to Derby (so far as I understand it, the user would have to download
an external version of Xalan, as with Sun jdk 1.4.2).
So at this point, I think the xmlSuite will only be run for IBM 1.4 and IBM 1.5,
since those jvms include Xalan 2.5 or greater and also include a JAXP parser.
So anyone running derbyall with either of those jvms would run the XML tests
automatically. That said, the nightly tests that are reported here:
http://people.apache.org/~fuzzylogic/derby_test_results/
show results against ibm1.4.2. So while XML failures will not show up in Ole's
report, we should at least be able to see them at the above-indicated URL.
> 4) Can you further explain the BY VALUE vs. BY REF behaviors? What do these
> clauses mean, why is BY REF better, at what point would we want to
> re-introduce BY VALUE, how does this manifest itself in the code?
The main way in which BY VALUE vs BY REF manifests itself in the code is when
dealing with variable bindings. SQL/XML[2006] defines a syntax by which a value
can be bound into a query expression. For example:
select
xmlserialize(
xmlquery('$ci/my/stuff' passing by ref xcol as "ci" empty on empty)
as clob)
from xt_1
In this query "xcol" is bound into the variable "$ci" and then the query is
executed. A key way in which BY REF and BY VALUE come into play, then, is when
comparison operations between more than one XML value are part of the query.
Take the following query:
select
xmlserialize(
xmlquery('$ci/stuff[@id=4] = $c2/stuff[@id=4]' passing by ref
xcol as "ci", xcol as "c2" empty on empty
)
as clob)
from xt_1
If "ci" and "c2" are passed BY REF then the result of this query would be
"true"; if either was passed BY VALUE, the result would be "false". This is
what I tried to capture in the comments in sqlgrammar.jj, where I have:
* [I]f the same XML value is passed BY REF into two different XML arguments
* for a single operator, then every node in the first XML argument must have an
* identical node in the second XML argument, and the ids for both nodes must be
* the same.
I admit that the comment there could use some more explanation--but hopefully I
can do that as a follow-up patch, instead of re-generate the patches from square
one...?
A while back I was prototyping some code to support XML binding and I found that
it was both easier and more efficient to support BY REF, so that's the
experience on which I've based the decision to use BY REF.
That said, though, it turns out that Xalan does not support variable binding (or
if it does, I haven't figured it out yet), so the difference between BY REF and
BY VALUE is just syntactic right now. I've chosen "BY REF" because that was
easiest for me to implement when I did the prototyping for variable binding, and
I think that's the way to go in future. If at any point someone wants to fry
the "BY VALUE" fish, then s/he should certainly feel free to do so :)
> 5) If/when you re-generate the patches, please use relative path names for
> the files in the patches so that we don't get strings like
> c:/private/derby_src/java in the file names.
Yes, will do. Sorry.
> most of your examples and tests show the use of extremely tiny XML documents;
> they can fit into literal strings and are at most a few hundred bytes long.
> But in practice, XML documents are often ridiculous gigantic things which are
> hundreds of thousands of bytes long, and people try not to manipulate them in
> memory, but rather read them from files and write them to files, streaming
> them through parsers and into in-memory DOM trees only as needed.
>
> How does this work in Derby?
For 10.2 I am only working to add XML support to Derby in the SQL layer. I do
not plan to address XML-specific JDBC processing. I'm planning to include in
the documentation something to this effect:
<begin doc>
There is no JDBC-side support for the XML datatype in Derby. This means it is
not possible to bind directly into an XML value nor to retrieve an XML value
directly from a result set using JDBC. Instead, users must bind and retrieve
the XML data as Java strings or character streams by explicitly specifying the
appropriate XML operators (XMLPARSE and XMLSERIALIZE) as part of their SQL
queries. This also means that there is no JDBC metatadata support for the XML
datatype.
<end_doc>
Note that with respect to keeping large XML documents in memory, this is
undoubtedly a place for improvement in the future. For 10.2, though, I'm just
taking some "baby steps" to introduce XML as a Derby SQL type and to enable its
usage. Thus far I've been working with a target XML size of up to about
40k-ish, as a sort of "starter" for XML support in Derby. When Derby becomes
serious about larger XML documents, more work is definitely going to be required.
> Some questions that occur:
> a) If I have a large XML document in a file, how do I get that into my
> XML column in my database? Is it like a CLOB/BLOB where I work with
> some sort of a special stream class?
In order to store an XML value into a Derby database using JDBC, one must
explicitly use the XMLPARSE operator in the SQL statement and then use any of
the setXXXmethods that are compatible with String types. For example:
insert into myXmlTable(xcol) values XMLPARSE(DOCUMENT ? PRESERVE WHITESPACE)
And then use setString/setCharacterStream to bind the operator.
> b) The mirror-image question is how do I fetch a large XML document from
> my table and stream it to my file on my client efficiently?
In order to retrieve XML values from a Derby database using JDBC, one must
explicitly use the XMLSERIALIZE operator in the SQL query and then use the
getXXX method that corresponds to the target serialization type. For example:
Query:
select XMLSERIALIZE(xcol as CLOB) from myXmlTable
Then to retrieve the XML value, one would use any of the getXXXmethods that work
for CLOB.
> c) Internally, does the store use CLOB/BLOB techniques for XML storage?
> does it store them in separate files?
The XML datatype internally wraps a SQLChar value, so any writing/reading from
store occurs via calls to the corresponding SQLChar methods. For example, see
XML.writeExternal(), XML.setStream(), etc. There is currently no special XML
logic with respect to store.
> d) how does DRDA tranmit XML over the net? Is it externalized data?
The answer to this one is "I don't know". And the reason I think we can get
away with this answer is that, as mentioned above, I've only implemented XML at
the SQL side. Thus XML data that flows from client to server can only flow as
part of the SQL/XML operators, namely XMLPARSE and XMLSERIALIZE. Parameters to
these operators are always string values of an existing SQL type (CHAR, VARCHAR,
CLOB) and thus those are the types that are sent by DRDA. We never transmit
"XML" per se across the network.
Note that this is one of the big reasons why I decided to put off any JDBC side
support for now: the issue of sending "XML" across the wire without a DRDA XML
type was not one I felt like addressing in the short term.
> Obviously, these questions are motivated by some of the work that
> Tomohito Nakayama and others have been doing recently with BLOB/CLOB
> efficiency, for example DERBY-326 and DERBY-550.
Understood--thanks for bringing them up. Please let me know if you have any
other questions about how this is going to work.
> 7) Another user-level question: in your test programs, your XML documents
> tend to be quite simple.
<snip different kinds of XML items>
> Presumably, since all of this is handled by the parser, "it just works".
Right :) If the JAXP parser that is in the user's classpath can handle it, then
the XMLPARSE operand should be able to handle it, as well. If the XML document
relies on external files, such as external schema documents, then those files
must be accessible to the JAXP parser that Derby is using. Which brings us to
the next question...
> However, I'm a little confused about how the parsing happens in a
> client-server scenario: is the XMLPARSE processing performed on the
> client side? Or on the server side? I think this only becomes relevant
> when the user must do something to ensure that the XML parser and the
> XPATH/XQUERY engines are configured properly; they need to know which
> "side" (client/server) of their environment needs to be so configured.
Since all XML operations occur through SQL, all XML processing occurs where the
SQL processing occurs--i.e. on the server. The client is just sending a SQL
statement (as a string) to the server and then sending the XML content as a
string (or stream) as well. The actual execution of XMLPARSE is performed on
the server, and thus it is the server that must have the XML parser (JAXP) and
the XML query engine (Xalan) in its classpath.
> 8) We need to make sure that the documentation clearly specifies which
> versions of the add-on XML software (parsers, XPATH, etc.) are specified,
> and we need to do our best to make the error messages when a bad version is
> used clear and specific.
Agreed. Phase 4 of the DERBY-688 changes is almost complete, and that's the
phase in which I check for the required XML classes and throw a user-friendly
error if the classes are not found.
I have not yet looked at checking for specific versions of the software, and I'm
not exactly sure how that should work. If, for example, we say that we require
Xalan 2.5 because of the bug described in XALANJ-1643, then what happens if/when
some other bug in Xalan 2.5.xxx (or whatever) pops up? Do we now require that
Xalan 2.5.xxx be installed? Do we want to code the required version into the
error message? Do we want to disallow XML operations if the required version
isn't present? Or just a warning? What happens if some app is using 10.2.2
which requires version 2.5 and then upgrades to version 10.2.3 which requires
versions 2.6--are we going to force the app to upgrade its Xalan release, as
well? I don't know what the policies around this kind of thing are, but that
seems a bit unfriendly...
> 9) When I run lang/xmlBinding.java, I see the following diff. This diff
> occurs in all three configurations I tried (embedded, DerbyNet, and
> DerbyNetClient)
Thanks for pointing this out; my guess is that it's a test issue, but I will
look into it. This seems like something that could be addressed in follow-up
patch--perhaps the one in which I enable the tests to run as part of derbyall?
Many thanks for taking the time to review the changes, Bryan, and for the
excellent questions. If you have any more, please do ask!
Otherwise, if any of my answers above would make you uncomfortable with
committing the patches (or with approving their commit), please let me know and
I will try to address your concerns.
Thanks again (and again!) for the review,
Army