Posted to dev@xmlbeans.apache.org by David Bau <da...@bea.com> on 2003/09/26 22:42:34 UTC

V2 store

Eric,

You called this morning about a difficult design problem you're facing with
the v2 store, given the features listed on the feature page, and I'm
summarizing it here.  Perhaps somebody reading this will have some ideas.

Some of the problems that need to be solved are:
(1) Support DOM in addition to our cursor API
(2) Work with very large payloads without running out of RAM
(3) Keep us small, keep us fast.  That means trying to reduce object
allocation and avoiding slower constructs like synchronized blocks.
(4) When dealing with read-only data, a naive multithreaded user should be
able to assume that they do not need to synchronize reads. (This is not on
the feature list, but seems like an important API property.)

But when you put (1), (2), (3), and (4) together, you get some fundamental
tensions:

(a) The DOM API (1) implies many more objects than you actually need.  For
example, who really cares about the whitespace between tags in a typical
app?  And if you can bind directly to "int", who really wants to ever
allocate the string object that contains "413231"?  So that's in conflict
with goal (3), being small, unless we build a "lazy DOM" that creates
objects on demand.

(b) Dealing with very large instances (2) also seems to lead to "lazy
objects" created on demand.  For example, if the bulk of a 20GB instance is
stored on disk, yet an app can hold on to an object that represents a node,
then certainly not all nodes can be in memory at once.  They're created on
demand.

(c) But creating objects on demand means that read operations mutate the
underlying data structure.  This is in conflict with goal (4); that is,
multiple readers on multiple threads need to synchronize against each other,
unless we synchronize for them.  But if we synchronize for them, that's
again in conflict with goal (3).
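
To make the tension in (a)-(c) concrete, here's a tiny sketch (hypothetical
names, nothing like the real store): the first "read" of a node allocates a
wrapper and caches it, so even pure readers mutate shared state, and two
unsynchronized reader threads can race on the cache.

// Hypothetical sketch only; the point is that getNode() looks like a
// read but writes to a shared cache.
import java.util.HashMap;
import java.util.Map;

final class LazyStore {
    private final String[] tokens;                 // compact internal rep
    private final Map<Integer, Node> cache = new HashMap<Integer, Node>();

    LazyStore(String[] tokens) { this.tokens = tokens; }

    Node getNode(int index) {
        Node n = cache.get(index);
        if (n == null) {
            n = new Node(tokens[index]);           // allocated on demand
            cache.put(index, n);                   // a "read" mutates state
        }
        return n;
    }

    static final class Node {
        final String name;
        Node(String name) { this.name = name; }
    }
}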

(d) The upshot: it seems like
- we need to synchronize at a low level to satisfy (4) at the same time as
allocate-on-demand
- to satisfy (3), i.e., no synchronization cost, perhaps we should have a
global per-instance option to turn off synchronization; users can use this
option if they are synchronizing themselves in a savvy multithreaded app, or
if they are truly single-threaded.

That last bullet is a bit clumsy.  But I don't see anything better....
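
To picture that per-instance switch (hypothetical API, nothing that exists
today): the instance picks its locking policy once, at load time, and every
public entry point either takes the per-instance monitor or skips locking
entirely.

// Hypothetical sketch of a per-instance synchronization option.
final class Store {
    // Null means the user promised to synchronize themselves (or is
    // genuinely single-threaded), so we skip locking entirely.
    private final Object monitor;

    private Store(boolean synchronize) {
        this.monitor = synchronize ? new Object() : null;
    }

    static Store load(String xml, boolean synchronize) {
        Store s = new Store(synchronize);
        // ... parse xml into the compact internal representation ...
        return s;
    }

    String getText(int node) {
        if (monitor == null)
            return doGetText(node);          // no synchronization cost
        synchronized (monitor) {             // protects lazy materialization
            return doGetText(node);
        }
    }

    private String doGetText(int node) {
        // ... materialize-on-demand work happens here ...
        return "";
    }
}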

Thoughts?

David


---------------------------------------------------------------------
To unsubscribe, e-mail:   xmlbeans-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xmlbeans-dev-help@xml.apache.org
Apache XMLBeans Project -- URL: http://xml.apache.org/xmlbeans/


Re: [xmlbeans-dev] Re: V2 store

Posted by David Bau <da...@bea.com>.
From: Patrick Calahan
> In any event, though, maybe we could just defer the question of
> synchronization until later?  It seems like it would be easier to just get
> going on an unsynchronized impl, and then come back and see where we need
> to lock it down in order to provide a threadsafe one.

I think the reason Eric and I are obsessing a bit about synchronization
up-front is that during the development of v1, we left the synchronization
issue to the end and paid a significant performance penalty for it.  (We
still ended up being quite fast, but if you remove the synchronization, it's
measurably faster.)

This issue might be inherent to the problem; it is sorta looking that way.
Or it's possible that there is some way out if we plan ahead, e.g., by
altering our public API design (although obviously the W3C DOM is
immovable); that's why we're worrying about it early.

David


---------------------------------------------------------------------
To unsubscribe, e-mail:   xmlbeans-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xmlbeans-dev-help@xml.apache.org
Apache XMLBeans Project -- URL: http://xml.apache.org/xmlbeans/


Re: V2 store

Posted by David Bau <da...@bea.com>.
Pcal writes:
> I'm not sure I agree that large payloads do necessarily lead us to lazy
> object creation.  In many (most?) cases, large payloads are large because
> they contain big chunks of base64 data, and those can be dealt with
> out-of-band.

Agreed, big base64 seems like it should be special-cased if you're going
straight from parser->unmarshaller or marshaller->outputter.  We should
build awareness of large base64 blobs into our fast lossy binding framework.
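
For instance (a rough sketch; the helper is hypothetical, not anything in
the binding framework today), when the unmarshaller knows an element is a
big blob, it could decode the parser's character chunks straight to an
OutputStream instead of ever building the whole String:

import java.io.IOException;
import java.io.OutputStream;
import java.util.Base64;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class BlobAwareBinding {
    // Call with the reader positioned on the START_ELEMENT of a large
    // base64 element; decodes chunk by chunk into 'sink' so the blob is
    // never held in memory as one big String or byte[].
    static void copyBase64(XMLStreamReader r, OutputStream sink)
            throws XMLStreamException, IOException {
        Base64.Decoder decoder = Base64.getDecoder();
        StringBuilder carry = new StringBuilder();
        while (r.next() == XMLStreamConstants.CHARACTERS) {
            String text = r.getText();
            for (int i = 0; i < text.length(); i++) {      // drop whitespace
                char c = text.charAt(i);
                if (!Character.isWhitespace(c)) carry.append(c);
            }
            int whole = (carry.length() / 4) * 4;          // whole base64 groups
            sink.write(decoder.decode(carry.substring(0, whole)));
            carry.delete(0, whole);
        }
        sink.write(decoder.decode(carry.toString()));      // any unpadded tail
    }
}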

I'm not convinced blobs are what "large data" means for most people.
Anecdotally, there are indeed folks whose data is large because it has many
rows, not just because it has big blobs.  I just had a (gee, fun work ;-)
dinner sitting next to somebody in the pharma IT business who told me all
about their many rows of data.  A single regulatory filing, which is
basically a single atomic transaction, when printed out (as is sometimes
required by law), weighs many tons and fills an entire tractor-trailer.
They're transmitting things like tables of data for millions of drug-trial
data points at once.

Pcal writes:
> And in the case where someone really does have to bite off 20GB of
> structured XML data at once, I have to wonder if they aren't better served
> by writing directly to an API like 173.

Maybe; but maybe it's also possible to make that easier, and imho it's worth
some thought.  It seems a shame to have to rewrite your app when some
parameter (like message size) grows.
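
For reference, assuming "173" here means JSR 173 (the streaming pull-parser
API), the direct approach looks roughly like this (element names made up):

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public final class StreamingRowCount {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        InputStream in = new FileInputStream(args[0]);
        try {
            XMLStreamReader r = factory.createXMLStreamReader(in);
            long rows = 0;
            while (r.hasNext()) {
                // Only one event is live at a time, so memory stays flat no
                // matter how big the instance is.
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(r.getLocalName())) {
                    rows++;                  // handle one row, then move on
                }
            }
            System.out.println("rows: " + rows);
        } finally {
            in.close();
        }
    }
}

The memory profile is great, but the app is now written against the stream
rather than against nodes or strong types, which is exactly the rewrite I'd
rather not force on people when a message merely grows.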

David


---------------------------------------------------------------------
To unsubscribe, e-mail:   xmlbeans-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xmlbeans-dev-help@xml.apache.org
Apache XMLBeans Project -- URL: http://xml.apache.org/xmlbeans/


Re: V2 store

Posted by Patrick Calahan <pc...@bea.com>.
At 04:42 PM 9/26/2003 -0400, David Bau wrote:

>(b) Dealing with very large instances (2) also seems to lead to "lazy
>objects" created on demand.  For example, if the bulk of a 20GB instance is
>stored on disk, yet an app can hold on to an object that represents a node,
>then certainly not all nodes can be in memory at once.  They're created on
>demand.

I'm not sure I agree that large payloads do necessarily lead us to lazy 
object creation.  In many (most?) cases, large payloads are large because 
they contain big chunks of base64 data, and those can be dealt with 
out-of-band.

And in the case where someone really does have to bite off 20GB of 
structured XML data at once, I have to wonder if they aren't better served 
by writing directly to an API like 173.


>(d) The upshot: it seems like
>- we need to synchronize at a low level to satisfy (4) at the same time as
>allocate-on-demand
>- to satisfy (3), i.e., no synchronization cost, perhaps we should have a
>global per-instance option to turn off synchronization; users can use this
>option if they are synchronizing themselves in a savvy multithreaded app, or
>if they are truly single-threaded.
>
>That last bullet is a bit clumsy.  But I don't see anything better....

Agreed.  Maybe we could consider having separate synchronized and
unsynchronized impls.  This would spare us the overhead of repeatedly
checking a flag at runtime and might help keep the implementation cleaner.
It obviously would result in more code, though, and at least a little
redundancy.
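
A rough sketch of what I mean (hypothetical names, obviously not real
interfaces): one read interface, an unsynchronized implementation, and a
synchronized wrapper chosen once at load time, so there's no per-call flag
check.

// Hypothetical sketch of separate synchronized/unsynchronized impls.
interface NodeStore {
    String getText(int node);
}

final class UnsyncStore implements NodeStore {
    public String getText(int node) {
        // ... materialize-on-demand work, no locking ...
        return "";
    }
}

final class SyncStore implements NodeStore {
    private final NodeStore delegate;
    SyncStore(NodeStore delegate) { this.delegate = delegate; }

    public synchronized String getText(int node) {
        return delegate.getText(node);       // same code path, just locked
    }
}

final class Stores {
    // The choice is made exactly once, when the instance is loaded.
    static NodeStore load(String xml, boolean threadSafe) {
        UnsyncStore raw = new UnsyncStore(); // ... parse xml into it ...
        return threadSafe ? new SyncStore(raw) : raw;
    }
}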

In any event, though, maybe we could just defer the question of
synchronization until later?  It seems like it would be easier to just get
going on an unsynchronized impl, and then come back and see where we need
to lock it down in order to provide a threadsafe one.

-p

