Posted to users@jackrabbit.apache.org by Sriram Narayanan <sr...@gmail.com> on 2007/03/02 12:40:44 UTC

Results of a JR Oracle test that we conducted

Hi list:

We ran a few tests on JR and Oracle and have posted our results here.
We welcome comments from the community.

JR and Database:
Single Workspace
Single Repository (Oracle 10g)
Data for 225 clients
DB Size:     5330 MB
Index Size:  947 MB

All times are in milliseconds

With Sequential Client Access
100 millisecond sleep between each thread creation

Clients  Users   Min   Max    Total   Avg
      5      1   313   703     2688   537
      5     20   156   828    23793   237
     10      5   156   938    14313   286
     10     20   171   719    49475   247
     20      5   187   812    27029   270
     20     10   156   797    49588   247
     50      5   156   844    59746   238
     50     10   156   671   113150   226

With Random Client Access
100 millisecond sleep between each thread creation

Clients  Users   Min   Max    Total   Avg
      5      1   297   766     2548   509
      5     20   171   781    25196   251
     10      5   172   922    15418   308
     10     20   156   813    46451   232
     20      5   172   890    25684   256
     20     10   141   813    45023   225
     50      5   156   860    57672   230
     50     10   156   812   130177   260


With Random Client Access
100 millisecond sleep between each thread creation
Read-Write-Read on one subset per user (thread)

Clients  Users    Min    Max     Total     Avg
      5      1    594   1016      3954     790
      5     20   1172   3187    218414    2184
     10      5    875   2219     77668    1553
     10     20   1125   4485    587238    2936
     20      5   1109   3391    216036    2160
     20     20    938   7438   1732304    4330
     50     10    578   8438   2461820    4923
    100      8   1671  28593   9677890   12097

Hardware Configuration running JR:
- Intel Xeon 5130 (QuadCore) 2.0 GHz processor
- 3 GB RAM
- Windows XP Professional (32-Bit)
- Sun JDK 1.5.0_07
- JackRabbit 1.2.1
- Lucene 2.1.0
- JCR 1.0

Hardware Configuration running Oracle 10g
- Intel Xeon 5130 (QuadCore) 2.0 GHz processor
- 3 GB RAM
- Windows XP Professional (32-Bit)
- Sun JDK 1.5.0_07
- Oracle 10g as database for repository

Test Data:
The test data holds full upgrade information for 225 clients in the Oracle database.
The Lucene search index is stored on the local machine.

A Customer has the following hierarchy:

/Product/Customer1/Configuration
/Product/Customer1/SalesData
/Product/Customer1/OtherData

Each of the above has some 30 to 50 nodes with properties.
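For illustration, populating one such customer subtree might look like the
sketch below. This is not the actual loader we used: node names like "node0"
and the property values are made up, and it assumes a writable
javax.jcr.Session.

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    // Sketch: create /Product/CustomerN/{Configuration,SalesData,OtherData}
    // and fill each section with a few property-bearing child nodes.
    static void createCustomer(Session session, int n) throws RepositoryException {
        Node root = session.getRootNode();
        Node product = root.hasNode("Product")
                ? root.getNode("Product") : root.addNode("Product");
        Node customer = product.addNode("Customer" + n);
        for (String name : new String[]{"Configuration", "SalesData", "OtherData"}) {
            Node section = customer.addNode(name);
            for (int i = 0; i < 30; i++) {             // 30 to 50 nodes in our data
                Node child = section.addNode("node" + i);
                child.setProperty("type", "primary");  // illustrative values
                child.setProperty("name", "connectivity");
            }
        }
        session.save();                                // persist the subtree
    }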

Test Scenario:
The test is run against different numbers of clients (Customer1-Customer225)
with different combinations of simultaneous users. Each user is a concurrent
thread and queries 3 different subsets (nodes within
SalesData/Configuration/OtherData).
For example: when 10 customers are each accessed by 5 concurrent users,
10 x 5 x 3 = 150 subset calls are made to JackRabbit.

Sequential / Random Customer Access:
In sequential customer access, we request customer-specific information
in order: if a test run needs to process 10 customers, it queries
Customer1 through Customer10. In random customer access, any 10 of the
225 customers are accessed.

Each table above displays statistics for a single run.

The Read/Write/Read scenario tests the concurrent read/write capabilities
of JackRabbit. Each user issues 3 queries: the 1st and 3rd read the data,
and the 2nd updates the node with a test property.
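The update code for the 2nd query is not included in the snippet below; it
presumably amounts to setting a property and saving, along the lines of this
sketch using standard JCR calls (the property name and value are
illustrative):

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    // Hypothetical sketch of the 2nd query's update step:
    // write a test property on the matched node, then persist the change.
    static void updateWithTestProperty(Session session, Node node)
            throws RepositoryException {
        node.setProperty("testProperty", "testValue"); // illustrative name/value
        session.save();                                // flush transient changes
    }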

The code:
import java.util.Random;

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

final String ORDER_BY_JCR_SCORE_DESCENDING = " order by @jcr:score descending";

// nextInt(225) yields 0..224, so add 1 to match the
// Customer1..Customer225 range described above.
String clientId = "Customer" + (new Random().nextInt(225) + 1);

String[] xpaths = new String[]{
        "Product/" + clientId
                + "/Configuration/*/*[@type='primary' and @name='connectivity']",
        "Reno/" + clientId
                + "/Sales/*/*[@type='local' and @name='Bill']",
        "Reno/" + clientId
                + "/OtherData/*/*[@type='admin' and @name='Carl']"};

long before = System.currentTimeMillis();
String nodeName = null;
QueryManager queryManager = _session.getWorkspace().getQueryManager();
for (int i = 0; i < xpaths.length; i++) {
    Query query = queryManager.createQuery(
            xpaths[i] + ORDER_BY_JCR_SCORE_DESCENDING, Query.XPATH);
    QueryResult queryResult = query.execute();
    NodeIterator iterator = queryResult.getNodes();

    if (i != 1) {
        // 1st and 3rd query: read-only pass over the result set.
        while (iterator.hasNext()) {
            Node node = iterator.nextNode();
        }
    } else {
        // 2nd query: in the Read/Write/Read runs this is where the matched
        // node is updated with a test property (update code omitted here).
        while (iterator.hasNext()) {
            nodeName = iterator.nextNode().getName();
        }
    }
}
// Elapsed time is measured from 'before' once the loop completes.

Notes:
1. The above code runs within a thread.
2. Between creating threads, we introduced a 1000 ms delay; the above
results were obtained with that delay in place.
3. Without the sleep between thread creations, the Java process quickly
runs out of heap space for new threads.
4. Each time a thread starts, it logs on.
5. Each time work completes in a thread, it logs out.
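A minimal sketch of a driver matching notes 1-5, assuming a
javax.jcr.Repository handle and SimpleCredentials (the actual harness is not
shown here, so the class name and credentials are placeholders):

    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    public class LoadDriver {
        public static void run(final Repository repository, int threadCount)
                throws InterruptedException {
            for (int i = 0; i < threadCount; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        Session session = null;
                        try {
                            // Note 4: log on when the thread starts.
                            session = repository.login(new SimpleCredentials(
                                    "user", "pass".toCharArray()));
                            // ... run the three XPath queries shown above ...
                        } catch (Exception e) {
                            e.printStackTrace();
                        } finally {
                            // Note 5: log out when the work completes.
                            if (session != null) {
                                session.logout();
                            }
                        }
                    }
                }).start();
                // Notes 2/3: sleep between thread creations so the JVM
                // does not exhaust heap/stack space creating threads.
                Thread.sleep(100);
            }
        }
    }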

-- Sriram

Re: Results of a JR Oracle test that we conducted

Posted by Sriram Narayanan <sr...@gmail.com>.
On 3/2/07, Marcin Nowak <ma...@comarch.com> wrote:
> Hi,
>
> Could you share some information about the structure of the repository,
> such as the node count, the maximum and average depth and width of the
> repository tree, and whether you used versioning or references between
> nodes?
>

My apologies for the delayed response.

Our hierarchy is

A/B/C/D

Here, each "D" has about 50 nodes.
Each node has 10 properties.
Each property has about 50 KB of content within it.
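(Roughly: 50 nodes x 10 properties x 50 KB is about 24 MB per customer,
and 225 customers x ~24 MB is about 5.3 GB, which lines up with the
5330 MB database size reported earlier.)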

We have not yet used versioning or node references.

-- Sriram

Re: Results of a JR Oracle test that we conducted

Posted by Marcin Nowak <ma...@comarch.com>.
Hi,

Could you share some information about the structure of the repository,
such as the node count, the maximum and average depth and width of the
repository tree, and whether you used versioning or references between
nodes?

BR,
Marcin Nowak

Sriram Narayanan wrote:
> Hi list:
>
> We ran a few tests on JR and Oracle and have posted our results here.
> We welcome comments from the community.
>
> [...]

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by David Nuescheler <da...@gmail.com>.
Hi Bryan,

Do you think it would be possible to share the load test (JUnit?) that
you initially conducted with this list? I think we are still learning
about the performance-related use cases of content repository
applications...

I think this would help to untangle the issues. While I agree that
there is room for improvement in the area of concurrency, I think
other areas may matter more for your real-life performance bottleneck.

Generally, I would be interested to understand how many writes/updates
to the repository your application expects (let's say on a daily basis),
just to get a feeling for whether we are talking about 1 update/s or
100 updates/s.

If you have such a real-life test that reflects the reading and writing
patterns of your application, would it be possible to share your
performance goals or expectations? For example, is the performance
satisfactory if you use the embedded Derby persistence manager?

regards,
david

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/12/07, Jukka Zitting <ju...@gmail.com> wrote:
> On 3/10/07, Stefan Guggisberg <st...@gmail.com> wrote:
> > thanks for sharing your elaborate analysis. however i don't agree with your
> > analysis and proposals regarding the database persistence manager.
> > i tried to explain my point in my previous replies. it seems like you either
> > didn't read it,  don't agree or don't care. that's all perfectly fine with me,
> > but i am not gonna repeat myself.
>
> It seems to me that *you* didn't read Bryan's points. He does have
> some good points and just summarily denying them is not only wrong for
> the sake of this argument but also very destructive for the community.

Based on an IM discussion with Stefan it seems that I overreacted. His
comment about not seeing value in continuing the discussion referred to
the earlier DB PM issues he had already covered in his previous
messages. I interpreted the comment as referring to Bryan's entire last
message, which contained a number of points. I hope others didn't get
the same impression and that we can continue the discussion on the
technical issues.

BR,

Jukka Zitting

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

[sorry to chime in this late in the thread]

On 3/10/07, Stefan Guggisberg <st...@gmail.com> wrote:
> thanks for sharing your elaborate analysis. however i don't agree with your
> analysis and proposals regarding the database persistence manager.
> i tried to explain my point in my previous replies. it seems like you either
> didn't read it,  don't agree or don't care. that's all perfectly fine with me,
> but i am not gonna repeat myself.

It seems to me that *you* didn't read Bryan's points. He does have
some good points and just summarily denying them is not only wrong for
the sake of this argument but also very destructive for the community.

> since you seem to know exactly what's wrong in the current implementation,
> feel free to create a jira issue and submit a patch with the proposed changes.

Bryan, I would very much like to encourage you to pursue this
approach! I recall proposing similar changes quite a while ago and
would very much like to see how they would turn out to work in
practice.

BR,

Jukka Zitting

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Stefan Guggisberg <st...@gmail.com>.
On 3/11/07, Michael Neale <mi...@gmail.com> wrote:
> Stefan - I think one of the main issues is the low priority of JCR-314  - it
> is much more serious than "minor" - my understanding is that any long
> running action - eg an upload of a large piece of content (file) - will
> basically block any other concurrent actions. For more fine grained CMS uses
> - this is probably not a problem (as content is read-mostly a lot of the
> time) - BUT, for people wanting to store large blobs (something that people
> would look at using JCR for) - this is a showstopper. Many transaction
> monitors in app servers have timeouts of 30 seconds - on the web or other
> tiers (of course this can be adjusted - but it's not exactly a
> user-friendly solution!).
>
> (my understanding may be wrong - I sincerely hope it is).

i agree with you, i changed the priority of JCR-314 to 'major'.

cheers
stefan

>
> On 3/10/07, Stefan Guggisberg <st...@gmail.com> wrote:
> >
> > On 3/10/07, Bryan Davis <br...@bea.com> wrote:
> > > Stefan,
> > >
> > > There are a couple of issues that, collectively, we need to address in
> > order
> > > to successfully use Jackrabbit.
> > >
> > > Issue #1: Serialization of all repository updates.  See
> > > https://issues.apache.org/jira/browse/JCR-314, which I think seriously
> > > understates the significance of the issue.  In any environment where
> > users
> > > are routinely writing anything at all to the repository (like audit or
> > log
> > > information), a large file upload (or a small file over a slow link)
> > will
> > > effectively block all other users until it completes.
> > >
> > > Having all other threads hang while a file is being uploaded is simply a
> > > show stopper for us (unfortunately this issue is marked as minor,
> > reported
> > > in 0.9, and not currently slated for a particular release).  Trying to
> > solve
> > > this issue outside of Jackrabbit is impossible, providing only stopgap
> > > solutions; plus external mitigation strategies (like uploading to a
> > local
> > > file and then streaming into Jackrabbit as fast as possible) all seem
> > fairly
> > > complex to make robust, seeing as how some data management would have to
> > be
> > > handled outside of the repository transaction.  Which leaves us with
> > trying
> > > to resolve the issue by patching Jackrabbit.
> > >
> > > I now understand that the  Jackrabbit fixes are multifaceted, and that
> > (at
> > > least) they involve changes to both the Persistence Manager and Shared
> > Item
> > > State Manager.  The Persistence Manager changes (which I will talk about
> > > separately), I think, are easy enough.  The SISM obviously needs to be
> > > upgraded to have more granular locking semantics (possibly by using
> > > item-level nested locks, or maybe a partial solution that depends on
> > > database-level locking in the Persistence Manager).
> > >
> > > There are a number of lock manager implementations floating around that
> > > could potentially be repurposed for use inside the SISM.  I am uncertain
> > of
> > > the requirements for distribution here, although on the surface it seems
> > > like a local locking implementation is all that is required since it
> > seems
> > > like clustering support is handled at a higher level.
> > >
> > > It is also tempting to try and push this functionality into the
> > database,
> > > since it is already doing all of the required locking anyway.  A custom
> > > transaction manager that delegated to the repository session transaction
> > > manager (thereby associating JDBC connections with the repository
> > session),
> > > in conjunction with a stock data source implementation (see below) might
> > do
> > > the trick.  Of course this would only work with database PM¹s, but
> > perhaps
> > > other TM¹s could still have the existing SISM locking enabled.  This
> > would
> > > be good enough for us since we only use database PM¹s, and a better,
> > more
> > > universal solution could be implemented at a later date.
> > >
> > > Has anyone looked into this issue at all or have any advice / thoughts?
> > >
> > > Issue #2: Multiple issues in database persistence managers.  I believe
> > the
> > > database persistence managers have multiple issues (please correct me if
> > I
> > > get any of this wrong).
> >
> > bryan,
> >
> > thanks for sharing your elaborate analysis. however i don't agree with
> > your
> > analysis and proposals regarding the database persistence manager.
> > i tried to explain my point in my previous replies. it seems like you
> > either
> > didn't read it,  don't agree or don't care. that's all perfectly fine with
> > me,
> > but i am not gonna repeat myself.
> >
> > since you seem to know exactly what's wrong in the current implementation,
> > feel free to create a jira issue and submit a patch with the proposed
> > changes.
> >
> > btw: i agree that synchronization is bad if you don't understand it and
> > use it
> > incorrectly ;)
> >
> > cheers
> > stefan
> >
> > >
> > > 1. JDBC connection details should not be in the repository.xml.  I
> > should be
> > > free to change the specifics of a particular database connection without
> > it
> > > constituting a repository initialization parameter change (which is
> > > effectively what changing the repository.xml is, since it gets copied
> > and
> > > subsequently accessed from inside the repository itself).  If a host
> > name or
> > > driver class or even connection URL changes, I should not have to
> > manually
> > > edit internal repository configuration files to effect the change.
> > > 2. Sharing JDBC connections (and objects obtained from them, like
> > prepared
> > > statements) between multiple threads is not a good practice.  Even
> > though
> > > many drivers support such activity, it is not specifically required by
> > the
> > > spec, and many drivers do not support it.  Even for ones that do, there
> > are
> > > always a significant list of caveats (like changing the transaction
> > > isolation of a connection impacting all threads, or rollbacks sometimes
> > > being executed against the wrong thread).  Plus, as far as I can tell,
> > there
> > > is also no particular good reason to attempt this in this
> > case.  Connection
> > > pooling is extremely well understood and there are a variety of
> > > implementations to choose from (including Apache's own in Jakarta
> > Commons).
> > > 3. Synchronization is  bad (generally speaking of course :).  Once the
> > > multithreaded issues of JDBC are removed (via a connection pool), there
> > are
> > > no good reasons that I can see to have any synchronization in the
> > database
> > > persistence managers.  Since any sort of requirement for synchronized
> > > operation would be coming from a higher layer, it should also be
> > provided at
> > > a higher layer.  I have always felt that a good rule of thumb in server
> > code
> > > is to avoid synchronization at all costs, particularly in core server
> > > functionality (like reading and writing to a repository).  It is
> > extremely
> > > difficult to fully understand the global implications of synchronized
> > code,
> > > particularly code that is synchronized at a low level.  Any serialization
> > of
> > > user requests is extremely serious in a multithreaded server and, in my
> > > experience, will lead to show-stopping performance and scalability
> > issues in
> > > nearly all cases.  In addition, serialization of requests at such a low
> > > level probably means that other synchronized code that is intended to be
> > > properly multithreaded is probably not well tested since the request
> > > serialization has eliminated (or greatly reduced) the possibility of the
> > > code being reentered like it normally would be.
> > >
> > > The solution to all of these issues, maybe, is to use the standard JDBC
> > > DataSource interface to encapsulate the details of managing the JDBC
> > > connections.  If all of the current PM and FS implementations that use
> > JDBC
> > > were refactored to have a DataSource member and to get and release
> > > connections inside of each method, then parity with the existing
> > > implementations could be achieved by providing a default
> > DataSourceLookup
> > > strategy implementation that simply encapsulated the existing connection
> > > creation code (ignoring connection release requests).  This would allow
> > us
> > > (and others) to externally extend the implementation with alternative
> > > DataSourceLookup strategy implementations, say for accessing a
> > datasource
> > > from JNDI, or getting it from a Spring application context.  This
> > solution
> > > also neatly externalizes all of the details of the actual datasource
> > > configuration from the repository.xml.
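A minimal sketch of the shape Bryan describes here: a persistence-manager-style
method that borrows a pooled connection per call through a DataSource. The
class, table, and column names are illustrative, not Jackrabbit's actual API:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Illustrative only: a load-style method that gets and releases a pooled
    // connection per call instead of sharing one connection across threads.
    class DataSourceBackedStore {
        private final DataSource dataSource; // e.g. looked up from JNDI or Spring

        DataSourceBackedStore(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        byte[] load(String nodeId) throws SQLException {
            Connection con = dataSource.getConnection(); // borrow from the pool
            try {
                PreparedStatement stmt =
                        con.prepareStatement("select data from node where id = ?");
                try {
                    stmt.setString(1, nodeId);
                    ResultSet rs = stmt.executeQuery();
                    try {
                        return rs.next() ? rs.getBytes(1) : null;
                    } finally {
                        rs.close();
                    }
                } finally {
                    stmt.close();
                }
            } finally {
                con.close(); // return the connection to the pool
            }
        }
    }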
> > >
> > > Thanks!
> > > Bryan.
> > >
> > >
> > >
> > > On 3/8/07 2:48 AM, "Stefan Guggisberg" <st...@gmail.com>
> > wrote:
> > >
> > > > On 3/7/07, Bryan Davis <br...@bea.com> wrote:
> > > >> Well, serializing on the prepared statement is still fairly
> > serialized since
> > > >> we are really only talking about nodes and properties (two locks
> > instead of
> > > >> one).  If concurrency is controlled at a higher level then why is
> > > >> synchronization in the PM necessary?
> > > >
> > > > a PM's implementation should be thread-safe because it might be used
> > > > in another context or e.g. by a tool.
> > > >
> > > >>
> > > >> The code now seems to assume that the connection object is
> > thread-safe (and
> > > >> the specifics for thread-safeness of of connection objects and other
> > object
> > > >> derived from them is pretty much up to the driver).  This is one of
> > the
> > > >> reasons why connection pooling is used pretty much universally.
> > > >>
> > > >> If the built-in PM's used data sources instead of connections then
> > the
> > > >> connection settings could be more easily externalized (as these are
> > > >> typically configurable by the end user). Is there any way to externalize
> > the
> > > >> JDBC connection settings from repository.xml right now (in 1.2.2) and
> > > >> configure them at runtime?
> > > >
> > > > i strongly disagree. the pm configuration is *not* supposed to be
> > > > configurable by the end user and certainly not at runtime. do you
> > think
> > > > that e.g. the tablespace settings (physical datafile paths etc) of an
> > oracle
> > > > db should be user configurable? i hope not...
> > > >
> > > >>
> > > >> You didn't really answer my question about Jackrabbit and its ability
> > to
> > > >> fetch and store information through the PM concurrently... What is
> > the
> > > >> synchronization at the higher level and how does it work?
> > > >
> > > > the current synchronization is used to guarantee data consistency
> > (such as
> > > > referential integrity).
> > > >
> > > > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> > > > and you'll get the idea.
> > > >
> > > >>
> > > >> Finally, we are seeing a new issue where if a particular user uploads
> > a
> > > >> large document all other users start to get exceptions (doing a
> > normal mix
> > > >> of mostly reads/some writes).  If there is no way to do concurrent
> > writes to
> > > >> the PM I don't see any way around this problem (and it is pretty
> > serious for
> > > >> us).
> > > >
> > > > there's a related improvement issue:
> > > > https://issues.apache.org/jira/browse/JCR-314
> > > >
> > > > please feel free to comment on this issue or file a new issue if you
> > think
> > > > that it doesn't cover your use case.
> > > >
> > > > cheers
> > > > stefan
> > > >
> > > >>
> > > >> Bryan.
> > > >>
> > > >>
> > > >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <st...@gmail.com>
> > wrote:
> > > >>
> > > >>> On 3/5/07, Bryan Davis <br...@bea.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com>
> > wrote:
> > > >>>>
> > > >>>>> hi bryan
> > > >>>>>
> > > >>>>> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
> > > >>>>>> What persistence manager are you using?
> > > >>>>>>
> > > >>>>>> Our tests indicate that the stock persistence managers are a
> > significant
> > > >>>>>> bottleneck for both writes and also initial reads to load the
> > transient
> > > >>>>>> store (on the order of .5 seconds per node when using a remote
> > database
> > > >>>>>> like
> > > >>>>>> MSSQL or Oracle).
> > > >>>>>
> > > >>>>> what do you mean by "load the transient store"?
> > > >>>>>
> > > >>>>>>
> > > >>>>>> The stock db persistence managers have all methods marked as
> > > >>>>>> "synchronized",
> > > >>>>>> which blocks on the classdef (which means that even different
> > persistence
> > > >>>>>> managers for different workspaces will serialize all load, exists
> > and
> > > >>>>>> store
> > > >>>>>
> > > >>>>> assuming you're talking about DatabasePersistenceManager:
> > > >>>>> the store/destroy methods are 'synchronized' on the instance, not
> > on
> > > >>>>> the 'classdef'.
> > > >>>>> see e.g.
> > > >>>>>
> > > >>>>">>>>>">
> > http://java.sun.com/docs/books/tutorial/essential/concurrency/syncme
> > > th.htm>>>>>
> > > <
> > http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.htm
> > >
> > > <
> > http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.htm
> > >  l
> > > >>>>>
> > > >>>>> the load/exists methods are synchronized on the specific prepared
> > stmt
> > > >>>>> they're
> > > >>>>> using.
> > > >>>>>
> > > >>>>> since every workspace uses its own persistence manager instance i
> > can't
> > > >>>>> follow your conclusion that all load, exists and store operations
> > would be
> > > >>>>> be globally serialized across all workspaces.
> > > >>>>
> > > >>>> Hm, this is my bad... It does seem that sync methods are on the
> > instance.
> > > >>>> Since the db persistence manager has "synchronized" on load, store
> > and
> > > >>>> exists, though, this would still serialize all of these operations
> > for a
> > > >>>> particular workspace.
> > > >>>
> > > >>> ?? the load methods are *not* synchronized. they contain a section
> > which
> > > >>> is synchronized on the particular prepared stmt.
> > > >>>
> > > >>> <quote from my previous reply>
> > > >>> wrt synchronization:
> > > >>> concurrency is controlled outside the persistence manager on a
> > higher level.
> > > >>> eliminating the method synchronization would imo therefore have *no*
> > impact
> > > >>> on concurrency/performance.
> > > >>> </quote>
> > > >>>
> > > >>> cheers
> > > >>> stefan
> > > >>>
> > > >>>>
> > > >>>>>> operations).  Presumably this is because they allocate a JDBC
> > connection
> > > >>>>>> at
> > > >>>>>> startup and use it throughout, and the connection object is not
> > > >>>>>> multithreaded.
> > > >>>>>
> > > >>>>> what leads you to this assumption?
> > > >>>>
> > > >>>> Are there other requirements that all of these operations are
> > serialized
> > > >>>> for
> > > >>>> a particular PM instance?  This seems like a pretty serious
> > bottleneck
> > > >>>> (and,
> > > >>>> in fact, is a pretty serious bottleneck when the database is remote
> > from
> > > >>>> the
> > > >>>> repository).
> > > >>>>
> > > >>>>>>
> > > >>>>>> This problem isn't as noticeable when you are using embedded
> > Derby and
> > > >>>>>> reading/writing to the file system, but when you are doing a
> > network
> > > >>>>>> operation to a database server, the network latency in
> > combination with
> > > >>>>>> the
> > > >>>>>> serialization of all database operations results in a significant
> > > >>>>>> performance degradation.
> > > >>>>>
> > > >>>>> again: serialization of 'all' database operations?
> > > >>>>
> > > >>>> The distinction between all and all for a workspace would really
> > only be
> > > >>>> relevant during versioning, right?
> > > >>>>
> > > >>>>>>
> > > >>>>>> The new bundle persistence manager (which isn't yet in SVN)
> > improves
> > > >>>>>> things
> > > >>>>>> dramatically since it inlines properties into the node, so
> > loading or
> > > >>>>>> persisting a node is only one operation (plus the additional
> > connection
> > > >>>>>> for
> > > >>>>>> the LOB) instead of one for the node and and one for each
> > property.  The
> > > >>>>>> bundle persistence manager also uses prepared statements and
> > keeps a
> > > >>>>>> PM-level cache of nodes (with properties) and also non-existent
> > nodes
> > > >>>>>> (which
> > > >>>>>> permits many exists() calls to return without accessing the
> > database).
> > > >>>>>>
> > > >>>>>> Changing all db persistence managers to use a datasource and get
> > and
> > > >>>>>> release
> > > >>>>>> connections inside of load, exists and store operations and
> > eliminating
> > > >>>>>> the
> > > >>>>>> method synchronization is a relatively simple change that further
> > > >>>>>> improves
> > > >>>>>> performance for connecting to database servers.
> > > >>>>>
> > > >>>>> the use of datasources, connection pools and the like have been
> > discussed
> > > >>>>> in extenso on the list. see e.g.
> > > >>>>>
> > > >>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> > > >>>>> http://issues.apache.org/jira/browse/JCR-313
> > > >>>>>
> > > >>>>> i don't see how getting & releasing connections in every load,
> > exists and
> > > >>>>> store
> > > >>>>> call would improve performance. could you please elaborate?
> > > >>>>>
> > > >>>>> please note that you wouldn't be able to use prepared statements
> > over
> > > >>>>> multiple
> > > >>>>> load, store etc operations because you'd have to return the
> > connection
> > > >>>>> at the end
> > > >>>>> of every call. the performance might therefore be even worse.
> > > >>>>>
> > > >>>>> further note that write operations must occur within a single jdbc
> > > >>>>> transaction, i.e.
> > > >>>>> you can't get a new connection for every store/destroy operation.
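Stefan's "jdbc transaction bracket" as a generic JDBC sketch: every
store/destroy of one logical update runs on the same connection and commits
or rolls back as a unit (illustrative code, not Jackrabbit's):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Illustrative JDBC transaction bracket: writes of one logical update
    // share a single connection and become visible atomically on commit.
    static void writeAtomically(Connection con, byte[] nodeData, String nodeId)
            throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try {
            PreparedStatement stmt =
                    con.prepareStatement("update node set data = ? where id = ?");
            try {
                stmt.setBytes(1, nodeData);
                stmt.setString(2, nodeId);
                stmt.executeUpdate();
                // ... further store/destroy statements on the same connection ...
            } finally {
                stmt.close();
            }
            con.commit();   // all writes in the bracket succeed together
        } catch (SQLException e) {
            con.rollback(); // or fail together
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }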
> > > >>>>>
> > > >>>>> wrt synchronization:
> > > >>>>> concurrency is controlled outside the persistence manager on a
> > higher
> > > >>>>> level.
> > > >>>>> eliminating the method synchronization would imo therefore have
> > *no*
> > > >>>>> impact
> > > >>>>> on concurrency/performance.
> > > >>>>
> > > >>>> So you are saying that it is impossible to concurrently load or
> > store data
> > > >>>> in Jackrabbit?
> > > >>>>
> > > >>>>>> There is a persistence manager with an ASL license called
> > > >>>>>> "DataSourcePersistenceManager" which seems to the PM of choice
> > for people
> > > >>>>>> using Magnolia (which is backed by Jackrabbit).  It also uses
> > prepared
> > > >>>>>> statements and eliminates the current single-connection issues
> > associated
> > > >>>>>> with all of the stock db PMs.  It doesn't seem to have been
> > submitted
> > > >>>>>> back
> > > >>>>>> to the Jackrabbit project.  If you Google for
> > > >>>>>> "
> > com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
> > > >>>>>> you
> > > >>>>>> should be able to find it.
> > > >>>>>
> > > >>>>> thanks for the hint. i am aware of this pm and i had a look at it
> > a couple
> > > >>>>> of
> > > >>>>> months ago. the major issue was that it didn't implement the
> > > >>>>> correct/required
> > > >>>>> semantics. it used a new connection for every write operation
> > which
> > > >>>>> clearly violates the contract that the write operations should
> > occur
> > > >>>>> within
> > > >>>>> a jdbc transaction bracket. further it creates a prepared stmt on
> > every
> > > >>>>> load, store etc. which is hardly efficient...
> > > >>>>
> > > >>>> Yes, this PM does have this issue.  The bundle PM implements
> > prepared
> > > >>>> statements in the correct way.
> > > >>>>
> > > >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do
> > not need
> > > >>>>>> to
> > > >>>>>> use the Oracle-specific PMs because the 10g drivers support the
> > standard
> > > >>>>>> BLOB API (in addition to the Oracle-specific BLOB API required by
> > the
> > > >>>>>> older
> > > >>>>>> 9i drivers).  This is true even if you are connecting to an older
> > > >>>>>> database
> > > >>>>>> server as the limitation was in the driver itself.  Frankly you
> > should
> > > >>>>>> never
> > > >>>>>> use the 9i drivers as they are pretty buggy and the 10g drivers
> > represent
> > > >>>>>> a
> > > >>>>>> complete rewrite.  Make sure you use the new driver package
> > because the
> > > >>>>>> 10g
> > > >>>>>> driver JAR also includes the older 9i drivers for
> > backward-compatibility.
> > > >>>>>> The new driver is in a new package (can't remember the exact name
> > off the
> > > >>>>>> top of my head).
> > > >>>>>
> > > >>>>> thanks for the information.
> > > >>>>>
> > > >>>>> cheers
> > > >>>>> stefan
> > > >>>>
> > > >>>> We are very interested in getting a good understanding of the
> > specifics of
> > > >>>> how PM's work, as initial reads and writes, according to our
> > profiling, are
> > > >>>> spending 80-90% of the time inside the PM.
> > > >>>>
> > > >>>> Bryan.
> > > >>>>
> > > >>>>
> > >
> >
>

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Michael Neale <mi...@gmail.com>.
Stefan - I think one of the main issues is the low priority of JCR-314  - it
is much more serious than "minor" - my understanding is that any long
running action - eg an upload of a large piece of content (file) - will
basically block any other concurrent actions. For more fine grained CMS uses
- this is probably not a problem (as content is read-mostly a lot of the
time) - BUT, for people wanting to store large blobs (something that people
would look at using JCR for) - this is a showstopper. Many transaction
monitors in app servers have timeouts of 30 seconds - on the web or other
tiers (of course this can be adjusted - but it's not exactly a
user-friendly solution!).

(my understanding may be wrong - I sincerely hope it is).

On 3/10/07, Stefan Guggisberg <st...@gmail.com> wrote:
>
> On 3/10/07, Bryan Davis <br...@bea.com> wrote:
> > Stefan,
> >
> > There are a couple of issues that, collectively, we need to address in
> order
> > to successfully use Jackrabbit.
> >
> > Issue #1: Serialization of all repository updates.  See
> > https://issues.apache.org/jira/browse/JCR-314, which I think seriously
> > understates the significance of the issue.  In any environment where
> users
> > are routinely writing anything at all to the repository (like audit or
> log
> > information), a large file upload (or a small file over a slow link)
> will
> > effectively block all other users until it completes.
> >
> > Having all other threads hang while a file is being uploaded is simply a
> > show stopper for us (unfortunately this issue is marked as minor,
> reported
> > in 0.9, and not currently slated for a particular release).  Trying to
> solve
> > this issue outside of Jackrabbit is impossible, providing only stopgap
> > solutions; plus external mitigation strategies (like uploading to a
> local
> > file and then streaming into Jackrabbit as fast as possible) all seem
> fairly
> > complex to make robust, seeing as how some data management would have to
> be
> > handled outside of the repository transaction.  Which leaves us with
> trying
> > to resolve the issue by patching Jackrabbit.
> >
> > I now understand that the  Jackrabbit fixes are multifaceted, and that
> (at
> > least) they involve changes to both the Persistence Manager and Shared
> Item
> > State Manager.  The Persistence Manager changes (which I will talk about
> > separately), I think, are easy enough.  The SISM obviously needs to be
> > upgraded to have more granular locking semantics (possibly by using
> > item-level nested locks, or maybe a partial solution that depends on
> > database-level locking in the Persistence Manager).
> >
> > There are a number of lock manager implementations floating around that
> > could potentially be repurposed for use inside the SISM.  I am uncertain
> of
> > the requirements for distribution here, although on the surface it seems
> > like a local locking implementation is all that is required since it
> seems
> > like clustering support is handled at a higher level.
> >
> > It is also tempting to try and push this functionality into the
> database,
> > since it is already doing all of the required locking anyway.  A custom
> > transaction manager that delegated to the repository session transaction
> > manager (thereby associating JDBC connections with the repository
> session),
> > in conjunction with a stock data source implementation (see below) might
> do
> > the trick.  Of course this would only work with database PM¹s, but
> perhaps
> > other TM¹s could still have the existing SISM locking enabled.  This
> would
> > be good enough for us since we only use database PM¹s, and a better,
> more
> > universal solution could be implemented at a later date.
> >
> > Has anyone looked into this issue at all or have any advice / thoughts?
> >
> > Issue #2: Multiple issues in database persistence managers.  I believe
> the
> > database persistence managers have multiple issues (please correct me if
> I
> > get any of this wrong).
>
> bryan,
>
> thanks for sharing your elaborate analysis. however i don't agree with
> your
> analysis and proposals regarding the database persistence manager.
> i tried to explain my point in my previous replies. it seems like you
> either
> didn't read it,  don't agree or don't care. that's all perfectly fine with
> me,
> but i am not gonna repeat myself.
>
> since you seem to know exactly what's wrong in the current implementation,
> feel free to create a jira issue and submit a pach with the proposed
> changes.
>
> btw: i agree that synchronization is bad if you don't understand it and
> use it
> incorrectly ;)
>
> cheers
> stefan
>
> >
> > 1. JDBC connection details should not be in the repository.xml.  I
> should be
> > free to change the specifics of a particular database connection without
> it
> > constituting a repository initialization parameter change (which is
> > effectively what changing the repository.xml is, since it gets copied
> and
> > subsequently accessed from inside the repository itself).  If a host
> name or
> > driver class or even connection URL changes, I should not have to
> manually
> > edit internal repository configuration files to effect the change.
> > 2. Sharing JDBC connections (and objects obtained from them, like
> prepared
> > statements) between multiple threads is not a good practice.  Even
> though
> > many drivers support such activity, it is not specifically required by
> the
> > spec, and many drivers do not support it.  Even for ones that do, there
> are
> > always a significant list of caveats (like changing the transaction
> > isolation of a connection impacting all threads, or rollbacks sometimes
> > being executed against the wrong thread).  Plus, as far as I can tell,
> there
> > is also no particular good reason to attempt this in this
> case.  Connection
> > pooling is extremely well understood and there are a variety of
> > implementations to choose from (including Apache¹s own in Jakarta
> Commons).
> > 3. Synchronization is  bad (generally speaking of course :).  Once the
> > multithreaded issues of JDBC are removed (via a connection pool), there
> are
> > no good reasons that I can see to have any synchronization in the
> database
> > persistence managers.  Since any sort of requirement for synchronized
> > operation would be coming from a higher layer, it should also be
> provided at
> > a higher layer.  I have always felt that a good rule of thumb in server
> code
> > is to avoid synchronization at all costs, particular in core server
> > functionality (like reading and writing to a repository).  It if
> extremely
> > difficult to fully understand the global implications of synchronized
> code,
> > particular code that is synchronized at a low level.  Any serialization
> of
> > user requests is extremely serious in a multithreaded server and, in my
> > experience, will lead to show-stopping performance and scalability
> issues in
> > nearly all cases.  In addition, serialization of requests at such a low
> > level probably means that other synchronized code that is intended to be
> > properly multithreaded is probably not well tested since the request
> > serialization has eliminated (or greatly reduced) the possibility of the
> > code being reentered like it normally would be.
> >
> > The solution to all of these issues, maybe, is to use the standard JDBC
> > DataSource interface to encapsulate the details of managing the JDBC
> > connections.  If all of the current PM and FS implementations that use
> JDBC
> > were refactored to have a DataSource member and to get and release
> > connections inside of each method, then parity with the existing
> > implementations could be achieved by providing a default
> DataSourceLookup
> > strategy implemention that simply encapsulated the existing connection
> > creation code (ignoring connection release requests).  This would allow
> us
> > (and others) to externally extend the implementation with alternative
> > DataSourceLookup strategy implementations, say for accessing a
> datasource
> > from JNDI, or getting it from a Spring application context.  This
> solution
> > also neatly externalizes all of the details of the actual datasource
> > configuration from the repository.xml.
> >
> > Thanks!
> > Bryan.
> >
> >
> >
> > On 3/8/07 2:48 AM, "Stefan Guggisberg" <st...@gmail.com>
> wrote:
> >
> > > On 3/7/07, Bryan Davis <br...@bea.com> wrote:
> > >> Well, serializing on the prepared statement is still fairly
> serialized since
> > >> we are really only talking about nodes and properties (two locks
> instead of
> > >> one).  If concurrency is controlled at a higher level then why is
> > >> synchronization in the PM necessary?
> > >
> > > a PM's implementation should be thread-safe because it might be used
> > > in another context or e.g. by a tool.
> > >
> > >>
> > >> The code now seems to assume that the connection object is
> thread-safe (and
> > >> the specifics for thread-safeness of of connection objects and other
> object
> > >> derived from them is pretty much up to the driver).  This is one of
> the
> > >> reasons why connection pooling is used pretty much universally.
> > >>
> > >> If the built-in PM's used data sources instead of connections then
> the
> > >> connection settings could be more easily externalized (as these are
> > >> typically configurable by the end user). Is there anyway to external
> the
> > >> JDBC connection settings from repository.xml right now (in 1.2.2) and
> > >> configure them at runtime?
> > >
> > > i strongly disagree. the pm configuration is *not* supposed to be
> > > configurable by the end user and certainly not at runtime. do you
> think
> > > that e.g. the tablespace settings (physical datafile paths etc) of an
> oracle
> > > db should be user configurable? i hope not...
> > >
> > >>
> > >> You didn't really answer my question about Jackrabbit and its ability
> to
> > >> fetch and store information through the PM concurrently... What is
> the
> > >> synchronization at the higher level and how does it work?
> > >
> > > the current synchronization is used to guarantee data consistency
> (such as
> > > referential integrity).
> > >
> > > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> > > and you'll get the idea.
> > >
> > >>
> > >> Finally, we are seeing a new issue where if a particular user uploads
> a
> > >> large document all other users start to get exceptions (doing a
> normal mix
> > >> of mostly reads/some writes).  If there is no way to do concurrent
> writes to
> > >> the PM I don't see any way around this problem (and it is pretty
> serious for
> > >> us).
> > >
> > > there's a related improvement issue:
> > > https://issues.apache.org/jira/browse/JCR-314
> > >
> > > please feel free to comment on this issue or file a new issue if you
> think
> > > that it doesn't cover your use case.
> > >
> > > cheers
> > > stefan
> > >
> > >>
> > >> Bryan.
> > >>
> > >>
> > >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <st...@gmail.com>
> wrote:
> > >>
> > >>> On 3/5/07, Bryan Davis <br...@bea.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com>
> wrote:
> > >>>>
> > >>>>> hi bryan
> > >>>>>
> > >>>>> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
> > >>>>>> What persistence manager are you using?
> > >>>>>>
> > >>>>>> Our tests indicate that the stock persistence managers are a
> significant
> > >>>>>> bottleneck for both writes and also initial reads to load the
> transient
> > >>>>>> store (on the order of .5 seconds per node when using a remote
> database
> > >>>>>> like
> > >>>>>> MSSQL or Oracle).
> > >>>>>
> > >>>>> what do you mean by "load the transient store"?
> > >>>>>
> > >>>>>>
> > >>>>>> The stock db persistence managers have all methods marked as
> > >>>>>> "synchronized",
> > >>>>>> which blocks on the classdef (which means that even different
> persistence
> > >>>>>> managers for different workspaces will serialize all load, exists
> and
> > >>>>>> store
> > >>>>>
> > >>>>> assuming you're talking about DatabasePersistenceManager:
> > >>>>> the store/destroy methods are 'synchronized' on the instance, not
> on
> > >>>>> the 'classdef'.
> > >>>>> see e.g.
> > >>>>>
> > >>>>">>>>>">
> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncme
> > th.htm>>>>>
> > <
> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.htm
> >
> > <
> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.htm
> >  l
> > >>>>>
> > >>>>> the load/exists methods are synchronized on the specific prepared
> stmt
> > >>>>> they're
> > >>>>> using.
> > >>>>>
> > >>>>> since every workspace uses its own persistence manager instance i
> can't
> > >>>>> follow your conclusion that all load, exists and store operations
> would be
> > >>>>> be globally serialized across all workspaces.
> > >>>>
> > >>>> Hm, this is my bad... It does seem that sync methods are on the
> instance.
> > >>>> Since the db persistence manager has "synchronized" on load, store
> and
> > >>>> exists, though, this would still serialize all of these operations
> for a
> > >>>> particular workspace.
> > >>>
> > >>> ?? the load methods are *not* synchronized. they contain a section
> which
> > >>> is synchronized on the particular prepared stmt.
> > >>>
> > >>> <quote from my previous reply>
> > >>> wrt synchronization:
> > >>> concurrency is controlled outside the persistence manager on a
> higher level.
> > >>> eliminating the method synchronization would imo therefore have *no*
> impact
> > >>> on concurrency/performance.
> > >>> </quote>
> > >>>
> > >>> cheers
> > >>> stefan
> > >>>
> > >>>>
> > >>>>>> operations).  Presumably this is because they allocate a JDBC connection at
> > >>>>>> startup and use it throughout, and the connection object is not
> > >>>>>> multithreaded.
> > >>>>>
> > >>>>> what leads you to this assumption?
> > >>>>
> > >>>> Are there other requirements that all of these operations are serialized for
> > >>>> a particular PM instance?  This seems like a pretty serious bottleneck (and,
> > >>>> in fact, is a pretty serious bottleneck when the database is remote from the
> > >>>> repository).
> > >>>>
> > >>>>>>
> > >>>>>> This problem isn't as noticeable when you are using embedded Derby and
> > >>>>>> reading/writing to the file system, but when you are doing a network
> > >>>>>> operation to a database server, the network latency in combination with the
> > >>>>>> serialization of all database operations results in a significant
> > >>>>>> performance degradation.
> > >>>>>
> > >>>>> again: serialization of 'all' database operations?
> > >>>>
> > >>>> The distinction between all and all for a workspace would really only be
> > >>>> relevant during versioning, right?
> > >>>>
> > >>>>>>
> > >>>>>> The new bundle persistence manager (which isn't yet in SVN) improves things
> > >>>>>> dramatically since it inlines properties into the node, so loading or
> > >>>>>> persisting a node is only one operation (plus the additional connection for
> > >>>>>> the LOB) instead of one for the node and one for each property.  The
> > >>>>>> bundle persistence manager also uses prepared statements and keeps a
> > >>>>>> PM-level cache of nodes (with properties) and also non-existent nodes (which
> > >>>>>> permits many exists() calls to return without accessing the database).
> > >>>>>>
> > >>>>>> Changing all db persistence managers to use a datasource and get and release
> > >>>>>> connections inside of load, exists and store operations and eliminating the
> > >>>>>> method synchronization is a relatively simple change that further improves
> > >>>>>> performance for connecting to database servers.
> > >>>>>
> > >>>>> the use of datasources, connection pools and the like have been discussed
> > >>>>> in extenso on the list. see e.g.
> > >>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> > >>>>> http://issues.apache.org/jira/browse/JCR-313
> > >>>>>
> > >>>>> i don't see how getting & releasing connections in every load, exists and store
> > >>>>> call would improve performance. could you please elaborate?
> > >>>>>
> > >>>>> please note that you wouldn't be able to use prepared statements over multiple
> > >>>>> load, store etc operations because you'd have to return the connection at the end
> > >>>>> of every call. the performance might therefore be even worse.
> > >>>>>
> > >>>>> further note that write operations must occur within a single jdbc transaction, i.e.
> > >>>>> you can't get a new connection for every store/destroy operation.
> > >>>>>
> > >>>>> wrt synchronization:
> > >>>>> concurrency is controlled outside the persistence manager on a higher level.
> > >>>>> eliminating the method synchronization would imo therefore have *no* impact
> > >>>>> on concurrency/performance.
> > >>>>
> > >>>> So you are saying that it is impossible to concurrently load or store data
> > >>>> in Jackrabbit?
> > >>>>
> > >>>>>> There is a persistence manager with an ASL license called
> > >>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice for people
> > >>>>>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
> > >>>>>> statements and eliminates the current single-connection issues associated
> > >>>>>> with all of the stock db PMs.  It doesn't seem to have been submitted back
> > >>>>>> to the Jackrabbit project.  If you Google for
> > >>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
> > >>>>>> should be able to find it.
> > >>>>>
> > >>>>> thanks for the hint. i am aware of this pm and i had a look at it a couple of
> > >>>>> months ago. the major issue was that it didn't implement the correct/required
> > >>>>> semantics. it used a new connection for every write operation which
> > >>>>> clearly violates the contract that the write operations should occur within
> > >>>>> a jdbc transaction bracket. further it creates a prepared stmt on every
> > >>>>> load, store etc. which is hardly efficient...
> > >>>>
> > >>>> Yes, this PM does have this issue.  The bundle PM implements prepared
> > >>>> statements in the correct way.
> > >>>>
> > >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
> > >>>>>> use the Oracle-specific PMs because the 10g drivers support the standard
> > >>>>>> BLOB API (in addition to the Oracle-specific BLOB API required by the older
> > >>>>>> 9i drivers).  This is true even if you are connecting to an older database
> > >>>>>> server as the limitation was in the driver itself.  Frankly you should never
> > >>>>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
> > >>>>>> complete rewrite.  Make sure you use the new driver package because the 10g
> > >>>>>> driver JAR also includes the older 9i drivers for backward-compatibility.
> > >>>>>> The new driver is in a new package (can't remember the exact name off the
> > >>>>>> top of my head).
> > >>>>>
> > >>>>> thanks for the information.
> > >>>>>
> > >>>>> cheers
> > >>>>> stefan
> > >>>>
> > >>>> We are very interested in getting a good understanding of the specifics of
> > >>>> how PM's work, as initial reads and writes, according to our profiling, are
> > >>>> spending 80-90% of the time inside the PM.
> > >>>>
> > >>>> Bryan.
> > >>>>

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Stefan Guggisberg <st...@gmail.com>.
On 3/10/07, Bryan Davis <br...@bea.com> wrote:
> Stefan,
>
> There are a couple of issues that, collectively, we need to address in order
> to successfully use Jackrabbit.
>
> Issue #1: Serialization of all repository updates.  See
> https://issues.apache.org/jira/browse/JCR-314, which I think seriously
> understates the significance of the issue.  In any environment where users
> are routinely writing anything at all to the repository (like audit or log
> information), a large file upload (or a small file over a slow link) will
> effectively block all other users until it completes.
>
> Having all other threads hang while a file is being uploaded is simply a
> show stopper for us (unfortunately this issue is marked as minor, reported
> in 0.9, and not currently slated for a particular release).  Trying to solve
> this issue outside of Jackrabbit is impossible, providing only stopgap
> solutions; plus external mitigation strategies (like uploading to a local
> file and then streaming into Jackrabbit as fast as possible) all seem fairly
> complex to make robust, seeing as how some data management would have to be
> handled outside of the repository transaction.  Which leaves us with trying
> to resolve the issue by patching Jackrabbit.
>
> I now understand that the  Jackrabbit fixes are multifaceted, and that (at
> least) they involve changes to both the Persistence Manager and Shared Item
> State Manager.  The Persistence Manager changes (which I will talk about
> separately), I think, are easy enough.  The SISM obviously needs to be
> upgraded to have more granular locking semantics (possibly by using
> item-level nested locks, or maybe a partial solution that depends on
> database-level locking in the Persistence Manager).
>
> There are a number of lock manager implementations floating around that
> could potentially be repurposed for use inside the SISM.  I am uncertain of
> the requirements for distribution here, although on the surface it seems
> like a local locking implementation is all that is required since it seems
> like clustering support is handled at a higher level.
>
> It is also tempting to try and push this functionality into the database,
> since it is already doing all of the required locking anyway.  A custom
> transaction manager that delegated to the repository session transaction
> manager (thereby associating JDBC connections with the repository session),
> in conjunction with a stock data source implementation (see below) might do
> the trick.  Of course this would only work with database PMs, but perhaps
> other TMs could still have the existing SISM locking enabled.  This would
> be good enough for us since we only use database PMs, and a better, more
> universal solution could be implemented at a later date.
>
> Has anyone looked into this issue at all or have any advice / thoughts?
>
> Issue #2: Multiple issues in database persistence managers.  I believe the
> database persistence managers have multiple issues (please correct me if I
> get any of this wrong).

bryan,

thanks for sharing your elaborate analysis. however i don't agree with your
analysis and proposals regarding the database persistence manager.
i tried to explain my point in my previous replies. it seems like you either
didn't read them, don't agree or don't care. that's all perfectly fine with me,
but i am not gonna repeat myself.

since you seem to know exactly what's wrong in the current implementation,
feel free to create a jira issue and submit a patch with the proposed changes.

btw: i agree that synchronization is bad if you don't understand it and use it
incorrectly ;)

cheers
stefan

>
> 1. JDBC connection details should not be in the repository.xml.  I should be
> free to change the specifics of a particular database connection without it
> constituting a repository initialization parameter change (which is
> effectively what changing the repository.xml is, since it gets copied and
> subsequently accessed from inside the repository itself).  If a host name or
> driver class or even connection URL changes, I should not have to manually
> edit internal repository configuration files to effect the change.
> 2. Sharing JDBC connections (and objects obtained from them, like prepared
> statements) between multiple threads is not a good practice.  Even though
> many drivers support such activity, it is not specifically required by the
> spec, and many drivers do not support it.  Even for ones that do, there is
> always a significant list of caveats (like changing the transaction
> isolation of a connection impacting all threads, or rollbacks sometimes
> being executed against the wrong thread).  Plus, as far as I can tell, there
> is also no particular good reason to attempt this in this case.  Connection
> pooling is extremely well understood and there are a variety of
> implementations to choose from (including Apache's own in Jakarta Commons).
> 3. Synchronization is  bad (generally speaking of course :).  Once the
> multithreaded issues of JDBC are removed (via a connection pool), there are
> no good reasons that I can see to have any synchronization in the database
> persistence managers.  Since any sort of requirement for synchronized
> operation would be coming from a higher layer, it should also be provided at
> a higher layer.  I have always felt that a good rule of thumb in server code
> is to avoid synchronization at all costs, particularly in core server
> functionality (like reading and writing to a repository).  It is extremely
> difficult to fully understand the global implications of synchronized code,
> particularly code that is synchronized at a low level.  Any serialization of
> user requests is extremely serious in a multithreaded server and, in my
> experience, will lead to show-stopping performance and scalability issues in
> nearly all cases.  In addition, serialization of requests at such a low
> level probably means that other synchronized code that is intended to be
> properly multithreaded is probably not well tested since the request
> serialization has eliminated (or greatly reduced) the possibility of the
> code being reentered like it normally would be.
>
> The solution to all of these issues, maybe, is to use the standard JDBC
> DataSource interface to encapsulate the details of managing the JDBC
> connections.  If all of the current PM and FS implementations that use JDBC
> were refactored to have a DataSource member and to get and release
> connections inside of each method, then parity with the existing
> implementations could be achieved by providing a default DataSourceLookup
> strategy implementation that simply encapsulated the existing connection
> creation code (ignoring connection release requests).  This would allow us
> (and others) to externally extend the implementation with alternative
> DataSourceLookup strategy implementations, say for accessing a datasource
> from JNDI, or getting it from a Spring application context.  This solution
> also neatly externalizes all of the details of the actual datasource
> configuration from the repository.xml.
>
> Thanks!
> Bryan.
>
>
>
> On 3/8/07 2:48 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
>
> > On 3/7/07, Bryan Davis <br...@bea.com> wrote:
> >> Well, serializing on the prepared statement is still fairly serialized since
> >> we are really only talking about nodes and properties (two locks instead of
> >> one).  If concurrency is controlled at a higher level then why is
> >> synchronization in the PM necessary?
> >
> > a PM's implementation should be thread-safe because it might be used
> > in another context or e.g. by a tool.
> >
> >>
> >> The code now seems to assume that the connection object is thread-safe (and
> >> the specifics for thread-safeness of connection objects and other objects
> >> derived from them is pretty much up to the driver).  This is one of the
> >> reasons why connection pooling is used pretty much universally.
> >>
> >> If the built-in PM's used data sources instead of connections then the
> >> connection settings could be more easily externalized (as these are
> >> typically configurable by the end user). Is there any way to externalize the
> >> JDBC connection settings from repository.xml right now (in 1.2.2) and
> >> configure them at runtime?
> >
> > i strongly disagree. the pm configuration is *not* supposed to be
> > configurable by the end user and certainly not at runtime. do you think
> > that e.g. the tablespace settings (physical datafile paths etc) of an oracle
> > db should be user configurable? i hope not...
> >
> >>
> >> You didn't really answer my question about Jackrabbit and its ability to
> >> fetch and store information through the PM concurrently... What is the
> >> synchronization at the higher level and how does it work?
> >
> > the current synchronization is used to guarantee data consistency (such as
> > referential integrity).
> >
> > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> > and you'll get the idea.
> >
> >>
> >> Finally, we are seeing a new issue where if a particular user uploads a
> >> large document all other users start to get exceptions (doing a normal mix
> >> of mostly reads/some writes).  If there is no way to do concurrent writes to
> >> the PM I don't see any way around this problem (and it is pretty serious for
> >> us).
> >
> > there's a related improvement issue:
> > https://issues.apache.org/jira/browse/JCR-314
> >
> > please feel free to comment on this issue or file a new issue if you think
> > that it doesn't cover your use case.
> >
> > cheers
> > stefan
> >
> >>
> >> Bryan.
> >>
> >>
> >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
> >>
> >>> On 3/5/07, Bryan Davis <br...@bea.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
> >>>>
> >>>>> hi bryan
> >>>>>
> >>>>> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
> >>>>>> What persistence manager are you using?
> >>>>>>
> >>>>>> Our tests indicate that the stock persistence managers are a significant
> >>>>>> bottleneck for both writes and also initial reads to load the transient
> >>>>>> store (on the order of .5 seconds per node when using a remote database
> >>>>>> like
> >>>>>> MSSQL or Oracle).
> >>>>>
> >>>>> what do you mean by "load the transient store"?
> >>>>>
> >>>>>>
> >>>>>> The stock db persistence managers have all methods marked as
> >>>>>> "synchronized",
> >>>>>> which blocks on the classdef (which means that even different persistence
> >>>>>> managers for different workspaces will serialize all load, exists and
> >>>>>> store
> >>>>>
> >>>>> assuming you're talking about DatabasePersistenceManager:
> >>>>> the store/destroy methods are 'synchronized' on the instance, not on
> >>>>> the 'classdef'.
> >>>>> see e.g.
> >>>>>
> >>>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> >>>>>
> >>>>> the load/exists methods are synchronized on the specific prepared stmt
> >>>>> they're
> >>>>> using.
> >>>>>
> >>>>> since every workspace uses its own persistence manager instance i can't
> >>>>> follow your conclusion that all load, exists and store operations would be
> >>>>> be globally serialized across all workspaces.
> >>>>
> >>>> Hm, this is my bad... It does seem that sync methods are on the instance.
> >>>> Since the db persistence manager has "synchronized" on load, store and
> >>>> exists, though, this would still serialize all of these operations for a
> >>>> particular workspace.
> >>>
> >>> ?? the load methods are *not* synchronized. they contain a section which
> >>> is synchronized on the particular prepared stmt.
> >>>
> >>> <quote from my previous reply>
> >>> wrt synchronization:
> >>> concurrency is controlled outside the persistence manager on a higher level.
> >>> eliminating the method synchronization would imo therefore have *no* impact
> >>> on concurrency/performance.
> >>> </quote>
> >>>
> >>> cheers
> >>> stefan
> >>>
> >>>>
> >>>>>> operations).  Presumably this is because they allocate a JDBC connection
> >>>>>> at
> >>>>>> startup and use it throughout, and the connection object is not
> >>>>>> multithreaded.
> >>>>>
> >>>>> what leads you to this assumption?
> >>>>
> >>>> Are there other requirements that all of these operations are serialized
> >>>> for
> >>>> a particular PM instance?  This seems like a pretty serious bottleneck
> >>>> (and,
> >>>> in fact, is a pretty serious bottleneck when the database is remote from
> >>>> the
> >>>> repository).
> >>>>
> >>>>>>
> >>>>>> This problem isn't as noticeable when you are using embedded Derby and
> >>>>>> reading/writing to the file system, but when you are doing a network
> >>>>>> operation to a database server, the network latency in combination with
> >>>>>> the
> >>>>>> serialization of all database operations results in a significant
> >>>>>> performance degradation.
> >>>>>
> >>>>> again: serialization of 'all' database operations?
> >>>>
> >>>> The distinction between all and all for a workspace would really only be
> >>>> relevant during versioning, right?
> >>>>
> >>>>>>
> >>>>>> The new bundle persistence manager (which isn't yet in SVN) improves
> >>>>>> things
> >>>>>> dramatically since it inlines properties into the node, so loading or
> >>>>>> persisting a node is only one operation (plus the additional connection
> >>>>>> for
> >>>>>> the LOB) instead of one for the node and one for each property.  The
> >>>>>> bundle persistence manager also uses prepared statements and keeps a
> >>>>>> PM-level cache of nodes (with properties) and also non-existent nodes
> >>>>>> (which
> >>>>>> permits many exists() calls to return without accessing the database).
> >>>>>>
> >>>>>> Changing all db persistence managers to use a datasource and get and
> >>>>>> release
> >>>>>> connections inside of load, exists and store operations and eliminating
> >>>>>> the
> >>>>>> method synchronization is a relatively simple change that further
> >>>>>> improves
> >>>>>> performance for connecting to database servers.
> >>>>>
> >>>>> the use of datasources, connection pools and the like have been discussed
> >>>>> in extenso on the list. see e.g.
> >>>>>
> >>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> >>>>> http://issues.apache.org/jira/browse/JCR-313
> >>>>>
> >>>>> i don't see how getting & releasing connections in every load, exists and
> >>>>> store
> >>>>> call would improve performance. could you please elaborate?
> >>>>>
> >>>>> please note that you wouldn't be able to use prepared statements over
> >>>>> multiple
> >>>>> load, store etc operations because you'd have to return the connection
> >>>>> at the end
> >>>>> of every call. the performance might therefore be even worse.
> >>>>>
> >>>>> further note that write operations must occur within a single jdbc
> >>>>> transaction, i.e.
> >>>>> you can't get a new connection for every store/destroy operation.
> >>>>>
> >>>>> wrt synchronization:
> >>>>> concurrency is controlled outside the persistence manager on a higher
> >>>>> level.
> >>>>> eliminating the method synchronization would imo therefore have *no*
> >>>>> impact
> >>>>> on concurrency/performance.
> >>>>
> >>>> So you are saying that it is impossible to concurrently load or store data
> >>>> in Jackrabbit?
> >>>>
> >>>>>> There is a persistence manager with an ASL license called
> >>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice for people
> >>>>>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
> >>>>>> statements and eliminates the current single-connection issues associated
> >>>>>> with all of the stock db PMs.  It doesn't seem to have been submitted
> >>>>>> back
> >>>>>> to the Jackrabbit project.  If you Google for
> >>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
> >>>>>> you
> >>>>>> should be able to find it.
> >>>>>
> >>>>> thanks for the hint. i am aware of this pm and i had a look at it a couple
> >>>>> of
> >>>>> months ago. the major issue was that it didn't implement the
> >>>>> correct/required
> >>>>> semantics. it used a new connection for every write operation which
> >>>>> clearly violates the contract that the write operations should occur
> >>>>> within
> >>>>> a jdbc transaction bracket. further it creates a prepared stmt on every
> >>>>> load, store etc. which is hardly efficient...
> >>>>
> >>>> Yes, this PM does have this issue.  The bundle PM implements prepared
> >>>> statements in the correct way.
> >>>>
> >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need
> >>>>>> to
> >>>>>> use the Oracle-specific PMs because the 10g drivers support the standard
> >>>>>> BLOB API (in addition to the Oracle-specific BLOB API required by the
> >>>>>> older
> >>>>>> 9i drivers).  This is true even if you are connecting to an older
> >>>>>> database
> >>>>>> server as the limitation was in the driver itself.  Frankly you should
> >>>>>> never
> >>>>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent
> >>>>>> a
> >>>>>> complete rewrite.  Make sure you use the new driver package because the
> >>>>>> 10g
> >>>>>> driver JAR also includes the older 9i drivers for backward-compatibility.
> >>>>>> The new driver is in a new package (can't remember the exact name off the
> >>>>>> top of my head).
> >>>>>
> >>>>> thanks for the information.
> >>>>>
> >>>>> cheers
> >>>>> stefan
> >>>>
> >>>> We are very interested in getting a good understanding of the specifics of
> >>>> how PM's work, as initial reads and writes, according to our profiling, are
> >>>> spending 80-90% of the time inside the PM.
> >>>>
> >>>> Bryan.
> >>>>

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Bryan,

Bryan Davis wrote:
> It is also tempting to try and push this functionality into the database,
> since it is already doing all of the required locking anyway.  A custom
> transaction manager that delegated to the repository session transaction
> manager (thereby associating JDBC connections with the repository session),
> in conjunction with a stock data source implementation (see below) might do
> the trick.  Of course this would only work with database PMs, but perhaps
> other TMs could still have the existing SISM locking enabled.  This would
> be good enough for us since we only use database PMs, and a better, more
> universal solution could be implemented at a later date.
> 
> Has anyone looked into this issue at all or have any advice / thoughts?

well, it's not that easy. locking in SISM is required because there is a cache
of item states. the same requirements that apply to a persistence manager also
apply to this cache. specifically, after the change log is persisted the cache
must be updated atomically.

there were some recent thoughts about implementing a multi-version cache 
(similar to MVCC in databases), which would support better concurrency. but that 
requires more effort than just removing some synchronized modifiers in the 
source code.
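
to illustrate the idea, here is a minimal copy-on-write sketch (illustrative
only, not the actual jackrabbit cache; the generic map below stands in for the
item state cache):

import java.util.HashMap;
import java.util.Map;

class MultiVersionCache<K, V> {
    // volatile so readers always see a fully built snapshot
    private volatile Map<K, V> snapshot = new HashMap<K, V>();

    // reads never block, even while a writer prepares the next version
    public V get(K key) {
        return snapshot.get(key);
    }

    // writers are serialized, but publish the new version in one atomic step
    public synchronized void applyChangeLog(Map<K, V> changeLog) {
        Map<K, V> next = new HashMap<K, V>(snapshot); // copy current version
        next.putAll(changeLog);                       // apply the change log
        snapshot = next;                              // atomic publish
    }
}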

> Issue #2: Multiple issues in database persistence managers.  I believe the
> database persistence managers have multiple issues (please correct me if I
> get any of this wrong).
> 
> 1. JDBC connection details should not be in the repository.xml.  I should be
> free to change the specifics of a particular database connection without it
> constituting a repository initialization parameter change (which is
> effectively what changing the repository.xml is, since it gets copied and
> subsequently accessed from inside the repository itself).  If a host name or
> driver class or even connection URL changes, I should not have to manually
> edit internal repository configuration files to effect the change.

there's an
org.apache.jackrabbit.core.persistence.db.JNDIDatabasePersistenceManager, which
does exactly what you need.

> 2. Sharing JDBC connections (and objects obtained from them, like prepared
> statements) between multiple threads is not a good practice.  Even though
> many drivers support such activity, it is not specifically required by the
> spec, and many drivers do not support it.  Even for ones that do, there is
> always a significant list of caveats (like changing the transaction
> isolation of a connection impacting all threads, or rollbacks sometimes
> being executed against the wrong thread).  Plus, as far as I can tell, there
> is also no particular good reason to attempt this in this case.  Connection
> pooling is extremely well understood and there are a variety of
> implementations to choose from (including Apache's own in Jakarta Commons).

so far we didn't see any issues with our approach of sharing a connection among 
threads. we are always happy to accept contributions in that area ;)

> 3. Synchronization is  bad (generally speaking of course :).

one could also say (generally speaking) 'no synchronization' is bad. if you have
a multi-threaded program without any synchronization that is super-fast but just
doesn't behave correctly, what good is it then?

> Once the
> multithreaded issues of JDBC are removed (via a connection pool), there are
> no good reasons that I can see to have any synchronization in the database
> persistence managers.

that's correct but won't solve your issue, because of JCR-314.

btw. you can easily implement your own persistence manager, based on the
JNDIDatabasePersistenceManager, that provides concurrent reads. I think we didn't
implement one so far because we didn't see much value in it.

> Since any sort of requirement for synchronized
> operation would be coming from a higher layer, it should also be provided at
> a higher layer.  I have always felt that a good rule of thumb in server code
> is to avoid synchronization at all costs, particularly in core server
> functionality (like reading and writing to a repository).  It is extremely
> difficult to fully understand the global implications of synchronized code,
> particularly code that is synchronized at a low level.  Any serialization of
> user requests is extremely serious in a multithreaded server and, in my
> experience, will lead to show-stopping performance and scalability issues in
> nearly all cases.  In addition, serialization of requests at such a low
> level probably means that other synchronized code that is intended to be
> properly multithreaded is probably not well tested since the request
> serialization has eliminated (or greatly reduced) the possibility of the
> code being reentered like it normally would be.

please note that this serialization of writes currently only takes place at the
very last step when changes are stored. all other operations, like constraint and
node type validation, access right checks, checking whether the save is
self-contained, etc., can be done with multiple threads concurrently.

> The solution to all of these issues, maybe, is to use the standard JDBC
> DataSource interface to encapsulate the details of managing the JDBC
> connections.  If all of the current PM and FS implementations that use JDBC
> were refactored to have a DataSource member and to get and release
> connections inside of each method, then parity with the existing
> implementations could be achieved by providing a default DataSourceLookup
> strategy implementation that simply encapsulated the existing connection
> creation code (ignoring connection release requests).

I can't follow you here, how would that help?

> This would allow us
> (and others) to externally extend the implementation with alternative
> DataSourceLookup strategy implementations, say for accessing a datasource
> from JNDI, or getting it from a Spring application context.  This solution
> also neatly externalizes all of the details of the actual datasource
> configuration from the repository.xml.

using JNDI is already possible and implementing a database persistence manager 
which obtains its connection from a spring context is fairly easy. you just have 
to implement the method DatabasePersistenceManager.getConnection() accordingly.
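
for illustration, the lookup could be as simple as this (a standalone sketch;
the JNDI name is a placeholder, and the same body would go into the
getConnection() override mentioned above):

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;

class JndiConnectionFactory {
    private final String dataSourceName; // e.g. "java:comp/env/jdbc/JackrabbitDS"

    JndiConnectionFactory(String dataSourceName) {
        this.dataSourceName = dataSourceName;
    }

    // look up the container-managed DataSource and borrow a connection from it
    Connection getConnection() throws Exception {
        DataSource ds = (DataSource) new InitialContext().lookup(dataSourceName);
        return ds.getConnection();
    }
}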

regards
  marcel

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Bryan Davis <br...@bea.com>.
Stefan,

There are a couple of issues that, collectively, we need to address in order
to successfully use Jackrabbit.

Issue #1: Serialization of all repository updates.  See
https://issues.apache.org/jira/browse/JCR-314, which I think seriously
understates the significance of the issue.  In any environment where users
are routinely writing anything at all to the repository (like audit or log
information), a large file upload (or a small file over a slow link) will
effectively block all other users until it completes.

Having all other threads hang while a file is being uploaded is simply a
show stopper for us (unfortunately this issue is marked as minor, reported
in 0.9, and not currently slated for a particular release).  Trying to solve
this issue outside of Jackrabbit is impossible, providing only stopgap
solutions; plus external mitigation strategies (like uploading to a local
file and then streaming into Jackrabbit as fast as possible) all seem fairly
complex to make robust, seeing as how some data management would have to be
handled outside of the repository transaction.  Which leaves us with trying
to resolve the issue by patching Jackrabbit.

I now understand that the  Jackrabbit fixes are multifaceted, and that (at
least) they involve changes to both the Persistence Manager and Shared Item
State Manager.  The Persistence Manager changes (which I will talk about
separately), I think, are easy enough.  The SISM obviously needs to be
upgraded to have more granular locking semantics (possibly by using
item-level nested locks, or maybe a partial solution that depends on
database-level locking in the Persistence Manager).
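
To make the idea concrete, here is a minimal sketch of per-item read/write
locking (a hypothetical illustration, not the actual SISM; deadlock avoidance,
e.g. acquiring locks in a canonical order, is deliberately omitted):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ItemLockManager {
    private final ConcurrentMap<String, ReadWriteLock> locks =
            new ConcurrentHashMap<String, ReadWriteLock>();

    private ReadWriteLock lockFor(String itemId) {
        ReadWriteLock lock = locks.get(itemId);
        if (lock == null) {
            ReadWriteLock fresh = new ReentrantReadWriteLock();
            ReadWriteLock existing = locks.putIfAbsent(itemId, fresh);
            lock = (existing != null) ? existing : fresh;
        }
        return lock;
    }

    // many readers of the same item may proceed concurrently
    public void acquireRead(String itemId)  { lockFor(itemId).readLock().lock(); }
    public void releaseRead(String itemId)  { lockFor(itemId).readLock().unlock(); }

    // a writer blocks only readers/writers of the *same* item
    public void acquireWrite(String itemId) { lockFor(itemId).writeLock().lock(); }
    public void releaseWrite(String itemId) { lockFor(itemId).writeLock().unlock(); }
}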

There are a number of lock manager implementations floating around that
could potentially be repurposed for use inside the SISM.  I am uncertain of
the requirements for distribution here, although on the surface it seems
like a local locking implementation is all that is required since it seems
like clustering support is handled at a higher level.

It is also tempting to try and push this functionality into the database,
since it is already doing all of the required locking anyway.  A custom
transaction manager that delegated to the repository session transaction
manager (thereby associating JDBC connections with the repository session),
in conjunction with a stock data source implementation (see below) might do
the trick.  Of course this would only work with database PMs, but perhaps
other TMs could still have the existing SISM locking enabled.  This would
be good enough for us since we only use database PMs, and a better, more
universal solution could be implemented at a later date.

Has anyone looked into this issue at all or have any advice / thoughts?

Issue #2: Multiple issues in database persistence managers.  I believe the
database persistence managers have multiple issues (please correct me if I
get any of this wrong).

1. JDBC connection details should not be in the repository.xml.  I should be
free to change the specifics of a particular database connection without it
constituting a repository initialization parameter change (which is
effectively what changing the repository.xml is, since it gets copied and
subsequently accessed from inside the repository itself).  If a host name or
driver class or even connection URL changes, I should not have to manually
edit internal repository configuration files to effect the change.
2. Sharing JDBC connections (and objects obtained from them, like prepared
statements) between multiple threads is not a good practice.  Even though
many drivers support such activity, it is not specifically required by the
spec, and many drivers do not support it.  Even for ones that do, there is
always a significant list of caveats (like changing the transaction
isolation of a connection impacting all threads, or rollbacks sometimes
being executed against the wrong thread).  Plus, as far as I can tell, there
is also no particular good reason to attempt this in this case.  Connection
pooling is extremely well understood and there are a variety of
implementations to choose from (including Apache's own in Jakarta Commons); a
minimal pooled-DataSource sketch follows this list.
3. Synchronization is  bad (generally speaking of course :).  Once the
multithreaded issues of JDBC are removed (via a connection pool), there are
no good reasons that I can see to have any synchronization in the database
persistence managers.  Since any sort of requirement for synchronized
operation would be coming from a higher layer, it should also be provided at
a higher layer.  I have always felt that a good rule of thumb in server code
is to avoid synchronization at all costs, particularly in core server
functionality (like reading and writing to a repository).  It is extremely
difficult to fully understand the global implications of synchronized code,
particularly code that is synchronized at a low level.  Any serialization of
user requests is extremely serious in a multithreaded server and, in my
experience, will lead to show-stopping performance and scalability issues in
nearly all cases.  In addition, serialization of requests at such a low
level probably means that other synchronized code that is intended to be
properly multithreaded is probably not well tested since the request
serialization has eliminated (or greatly reduced) the possibility of the
code being reentered like it normally would be.
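
Here is the pooled-DataSource sketch referenced in point 2, using Commons DBCP
(the driver class, URL, credentials and pool size are placeholders):

import java.sql.Connection;
import org.apache.commons.dbcp.BasicDataSource;

public class PooledConnectionSketch {
    public static void main(String[] args) throws Exception {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("oracle.jdbc.OracleDriver"); // 10g driver package
        ds.setUrl("jdbc:oracle:thin:@dbhost:1521:ORCL");   // placeholder URL
        ds.setUsername("jackrabbit");
        ds.setPassword("secret");
        ds.setMaxActive(20); // cap on concurrently borrowed connections

        Connection con = ds.getConnection(); // each thread borrows its own
        try {
            // ... perform one load/exists/store operation here ...
        } finally {
            con.close(); // close() returns the connection to the pool
        }
    }
}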

The solution to all of these issues, maybe, is to use the standard JDBC
DataSource interface to encapsulate the details of managing the JDBC
connections.  If all of the current PM and FS implementations that use JDBC
were refactored to have a DataSource member and to get and release
connections inside of each method, then parity with the existing
implementations could be achieved by providing a default DataSourceLookup
strategy implementation that simply encapsulated the existing connection
creation code (ignoring connection release requests).  This would allow us
(and others) to externally extend the implementation with alternative
DataSourceLookup strategy implementations, say for accessing a datasource
from JNDI, or getting it from a Spring application context.  This solution
also neatly externalizes all of the details of the actual datasource
configuration from the repository.xml.
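
To sketch what I mean (DataSourceLookup and the class below are hypothetical
illustrations, not existing Jackrabbit interfaces, and the SQL is a
placeholder):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// strategy interface: a JNDI, Spring or legacy-style lookup sits behind it
interface DataSourceLookup {
    DataSource getDataSource() throws SQLException;
}

class DataSourceBackedPersistenceManager {
    private final DataSourceLookup lookup;

    DataSourceBackedPersistenceManager(DataSourceLookup lookup) {
        this.lookup = lookup;
    }

    // each call borrows a connection and returns it when done, so reads need
    // no instance-level synchronization
    public boolean exists(String nodeId) throws SQLException {
        Connection con = lookup.getDataSource().getConnection();
        try {
            PreparedStatement stmt =
                    con.prepareStatement("select 1 from NODE where NODE_ID = ?");
            try {
                stmt.setString(1, nodeId);
                ResultSet rs = stmt.executeQuery();
                try {
                    return rs.next();
                } finally {
                    rs.close();
                }
            } finally {
                stmt.close();
            }
        } finally {
            con.close(); // with a pooling DataSource this releases to the pool
        }
    }
}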

Thanks!
Bryan.



On 3/8/07 2:48 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:

> On 3/7/07, Bryan Davis <br...@bea.com> wrote:
>> Well, serializing on the prepared statement is still fairly serialized since
>> we are really only talking about nodes and properties (two locks instead of
>> one).  If concurrency is controlled at a higher level then why is
>> synchronization in the PM necessary?
> 
> a PM's implementation should be thread-safe because it might be used
> in another context or e.g. by a tool.
> 
>> 
>> The code now seems to assume that the connection object is thread-safe (and
>> the specifics for thread-safeness of connection objects and other objects
>> derived from them is pretty much up to the driver).  This is one of the
>> reasons why connection pooling is used pretty much universally.
>> 
>> If the built-in PM's used data sources instead of connections then the
>> connection settings could be more easily externalized (as these are
>> typically configurable by the end user). Is there any way to externalize the
>> JDBC connection settings from repository.xml right now (in 1.2.2) and
>> configure them at runtime?
> 
> i strongly disagree. the pm configuration is *not* supposed to be
> configurable by the end user and certainly not at runtime. do you think
> that e.g. the tablespace settings (physical datafile paths etc) of an oracle
> db should be user configurable? i hope not...
> 
>> 
>> You didn't really answer my question about Jackrabbit and its ability to
>> fetch and store information through the PM concurrently... What is the
>> synchronization at the higher level and how does it work?
> 
> the current synchronization is used to guarantee data consistency (such as
> referential integrity).
> 
> have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> and you'll get the idea.
> 
>> 
>> Finally, we are seeing a new issue where if a particular user uploads a
>> large document all other users start to get exceptions (doing a normal mix
>> of mostly reads/some writes).  If there is no way to do concurrent writes to
>> the PM I don't see any way around this problem (and it is pretty serious for
>> us).
> 
> there's a related improvement issue:
> https://issues.apache.org/jira/browse/JCR-314
> 
> please feel free to comment on this issue or file a new issue if you think
> that it doesn't cover your use case.
> 
> cheers
> stefan
> 
>> 
>> Bryan.
>> 
>> 
>> On 3/6/07 4:12 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
>> 
>>> On 3/5/07, Bryan Davis <br...@bea.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
>>>> 
>>>>> hi bryan
>>>>> 
>>>>> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
>>>>>> What persistence manager are you using?
>>>>>> 
>>>>>> Our tests indicate that the stock persistence managers are a significant
>>>>>> bottleneck for both writes and also initial reads to load the transient
>>>>>> store (on the order of .5 seconds per node when using a remote database
>>>>>> like
>>>>>> MSSQL or Oracle).
>>>>> 
>>>>> what do you mean by "load the transient store"?
>>>>> 
>>>>>> 
>>>>>> The stock db persistence managers have all methods marked as
>>>>>> "synchronized",
>>>>>> which blocks on the classdef (which means that even different persistence
>>>>>> managers for different workspaces will serialize all load, exists and
>>>>>> store
>>>>> 
>>>>> assuming you're talking about DatabasePersistenceManager:
>>>>> the store/destroy methods are 'synchronized' on the instance, not on
>>>>> the 'classdef'.
>>>>> see e.g.
>>>>> 
>>>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
>>>>> 
>>>>> the load/exists methods are synchronized on the specific prepared stmt
>>>>> they're
>>>>> using.
>>>>> 
>>>>> since every workspace uses its own persistence manager instance i can't
>>>>> follow your conclusion that all load, exists and store operations would be
>>>>> be globally serialized across all workspaces.
>>>> 
>>>> Hm, this is my bad... It does seem that sync methods are on the instance.
>>>> Since the db persistence manager has "synchronized" on load, store and
>>>> exists, though, this would still serialize all of these operations for a
>>>> particular workspace.
>>> 
>>> ?? the load methods are *not* synchronized. they contain a section which
>>> is synchronized on the particular prepared stmt.
>>> 
>>> <quote from my previous reply>
>>> wrt synchronization:
>>> concurrency is controlled outside the persistence manager on a higher level.
>>> eliminating the method synchronization would imo therefore have *no* impact
>>> on concurrency/performance.
>>> </quote>
>>> 
>>> cheers
>>> stefan
>>> 
>>>> 
>>>>>> operations).  Presumably this is because they allocate a JDBC connection
>>>>>> at
>>>>>> startup and use it throughout, and the connection object is not
>>>>>> multithreaded.
>>>>> 
>>>>> what leads you to this assumption?
>>>> 
>>>> Are there other requirements that all of these operations are serialized
>>>> for
>>>> a particular PM instance?  This seems like a pretty serious bottleneck
>>>> (and,
>>>> in fact, is a pretty serious bottleneck when the database is remote from
>>>> the
>>>> repository).
>>>> 
>>>>>> 
>>>>>> This problem isn't as noticeable when you are using embedded Derby and
>>>>>> reading/writing to the file system, but when you are doing a network
>>>>>> operation to a database server, the network latency in combination with
>>>>>> the
>>>>>> serialization of all database operations results in a significant
>>>>>> performance degradation.
>>>>> 
>>>>> again: serialization of 'all' database operations?
>>>> 
>>>> The distinction between all and all for a workspace would really only be
>>>> relevant during versioning, right?
>>>> 
>>>>>> 
>>>>>> The new bundle persistence manager (which isn't yet in SVN) improves
>>>>>> things
>>>>>> dramatically since it inlines properties into the node, so loading or
>>>>>> persisting a node is only one operation (plus the additional connection
>>>>>> for
>>>>>> the LOB) instead of one for the node and one for each property.  The
>>>>>> bundle persistence manager also uses prepared statements and keeps a
>>>>>> PM-level cache of nodes (with properties) and also non-existent nodes
>>>>>> (which
>>>>>> permits many exists() calls to return without accessing the database).
>>>>>> 
>>>>>> Changing all db persistence managers to use a datasource and get and
>>>>>> release
>>>>>> connections inside of load, exists and store operations and eliminating
>>>>>> the
>>>>>> method synchronization is a relatively simple change that further
>>>>>> improves
>>>>>> performance for connecting to database servers.
>>>>> 
>>>>> the use of datasources, connection pools and the like have been discussed
>>>>> in extenso on the list. see e.g.
>>>>> 
>>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
>>>>> http://issues.apache.org/jira/browse/JCR-313
>>>>> 
>>>>> i don't see how getting & releasing connections in every load, exists and
>>>>> store
>>>>> call would improve performance. could you please elaborate?
>>>>> 
>>>>> please note that you wouldn't be able to use prepared statements over
>>>>> multiple
>>>>> load, store etc operations because you'd have to return the connection
>>>>> at the end
>>>>> of every call. the performance might therefore be even worse.
>>>>> 
>>>>> further note that write operations must occur within a single jdbc
>>>>> transaction, i.e.
>>>>> you can't get a new connection for every store/destroy operation.
>>>>> 
>>>>> wrt synchronization:
>>>>> concurrency is controlled outside the persistence manager on a higher
>>>>> level.
>>>>> eliminating the method synchronization would imo therefore have *no*
>>>>> impact
>>>>> on concurrency/performance.
>>>> 
>>>> So you are saying that it is impossible to concurrently load or store data
>>>> in Jackrabbit?
>>>> 
>>>>>> There is a persistence manager with an ASL license called
>>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice for people
>>>>>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
>>>>>> statements and eliminates the current single-connection issues associated
>>>>>> with all of the stock db PMs.  It doesn't seem to have been submitted
>>>>>> back
>>>>>> to the Jackrabbit project.  If you Google for
>>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
>>>>>> you
>>>>>> should be able to find it.
>>>>> 
>>>>> thanks for the hint. i am aware of this pm and i had a look at it a couple
>>>>> of
>>>>> months ago. the major issue was that it didn't implement the
>>>>> correct/required
>>>>> semantics. it used a new connection for every write operation which
>>>>> clearly violates the contract that the write operations should occur
>>>>> within
>>>>> a jdbc transaction bracket. further it creates a prepared stmt on every
>>>>> load, store etc. which is hardly efficient...
>>>> 
>>>> Yes, this PM does have this issue.  The bundle PM implements prepared
>>>> statements in the correct way.
>>>> 
>>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need
>>>>>> to
>>>>>> use the Oracle-specific PMs because the 10g drivers support the standard
>>>>>> BLOB API (in addition to the Oracle-specific BLOB API required by the
>>>>>> older
>>>>>> 9i drivers).  This is true even if you are connecting to an older
>>>>>> database
>>>>>> server as the limitation was in the driver itself.  Frankly you should
>>>>>> never
>>>>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent
>>>>>> a
>>>>>> complete rewrite.  Make sure you use the new driver package because the
>>>>>> 10g
>>>>>> driver JAR also includes the older 9i drivers for backward-compatibility.
>>>>>> The new driver is in a new package (can't remember the exact name off the
>>>>>> top of my head).
>>>>> 
>>>>> thanks for the information.
>>>>> 
>>>>> cheers
>>>>> stefan
>>>> 
>>>> We are very interested in getting a good understanding of the specifics of
>>>> how PM's work, as initial reads and writes, according to our profiling, are
>>>> spending 80-90% of the time inside the PM.
>>>> 
>>>> Bryan.
>>>> 

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Stefan Guggisberg <st...@gmail.com>.
On 3/7/07, Bryan Davis <br...@bea.com> wrote:
> Well, serializing on the prepared statement is still fairly serialized since
> we are really only talking about nodes and properties (two locks instead of
> one).  If concurrency is controlled at a higher level then why is
> synchronization in the PM necessary?

a PM's implementation should be thread-safe because it might be used
in another context or e.g. by a tool.
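
to illustrate the pattern (a condensed sketch, not the actual source; the SQL
is a placeholder):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class SketchDbPersistenceManager {
    private final Connection con;             // single shared connection
    private final PreparedStatement loadStmt; // prepared once, reused

    SketchDbPersistenceManager(Connection con) throws SQLException {
        this.con = con;
        this.loadStmt =
                con.prepareStatement("select NODE_DATA from NODE where NODE_ID = ?");
    }

    // store/destroy: synchronized on the instance so concurrent writers cannot
    // interleave statements within the shared jdbc transaction
    public synchronized void store(String nodeId, byte[] data) throws SQLException {
        PreparedStatement stmt =
                con.prepareStatement("update NODE set NODE_DATA = ? where NODE_ID = ?");
        try {
            stmt.setBytes(1, data);
            stmt.setString(2, nodeId);
            stmt.executeUpdate();
        } finally {
            stmt.close();
        }
    }

    // load: not a synchronized method; only the shared prepared statement is
    // locked, and only for the duration of one query
    public byte[] load(String nodeId) throws SQLException {
        synchronized (loadStmt) {
            loadStmt.setString(1, nodeId);
            ResultSet rs = loadStmt.executeQuery();
            try {
                return rs.next() ? rs.getBytes(1) : null;
            } finally {
                rs.close();
            }
        }
    }
}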

>
> The code now seems to assume that the connection object is thread-safe (and
> the specifics for thread-safeness of connection objects and other objects
> derived from them is pretty much up to the driver).  This is one of the
> reasons why connection pooling is used pretty much universally.
>
> If the built-in PM's used data sources instead of connections then the
> connection settings could be more easily externalized (as these are
> typically configurable by the end user). Is there any way to externalize the
> JDBC connection settings from repository.xml right now (in 1.2.2) and
> configure them at runtime?

i strongly disagree. the pm configuration is *not* supposed to be
configurable by the end user and certainly not at runtime. do you think
that e.g. the tablespace settings (physical datafile paths etc) of an oracle
db should be user configurable? i hope not...

>
> You didn't really answer my question about Jackrabbit and its ability to
> fetch and store information through the PM concurrently... What is the
> synchronization at the higher level and how does it work?

the current synchronization is used to guarantee data consistency (such as
referential integrity).

have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
and you'll get the idea.
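
in very simplified terms the update path looks conceptually like this
(illustrative only, not the actual code):

import java.util.concurrent.locks.ReentrantReadWriteLock;

class SketchSharedItemStateManager {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    // reads share the lock and proceed concurrently
    public Object getItemState(String id) {
        rwLock.readLock().lock();
        try {
            return lookupInCacheOrPersistenceManager(id);
        } finally {
            rwLock.readLock().unlock();
        }
    }

    // an update holds the exclusive lock for the whole persist step, which is
    // why one long write (e.g. a large file upload) blocks all other writers
    public void update(Iterable<String> changeLog) {
        rwLock.writeLock().lock();
        try {
            persistAndUpdateCache(changeLog);
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    private Object lookupInCacheOrPersistenceManager(String id) { return null; } // stub
    private void persistAndUpdateCache(Iterable<String> changeLog) { }           // stub
}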

>
> Finally, we are seeing a new issue where if a particular user uploads a
> large document all other users start to get exceptions (doing a normal mix
> of mostly reads/some writes).  If there is no way to do concurrent writes to
> the PM I don't see any way around this problem (and it is pretty serious for
> us).

there's a related improvement issue:
https://issues.apache.org/jira/browse/JCR-314

please feel free to comment on this issue or file a new issue if you think
that it doesn't cover your use case.

cheers
stefan

>
> Bryan.
>
>
> On 3/6/07 4:12 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
>
> > On 3/5/07, Bryan Davis <br...@bea.com> wrote:
> >>
> >>
> >>
> >> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
> >>
> >>> hi bryan
> >>>
> >>> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
> >>>> What persistence manager are you using?
> >>>>
> >>>> Our tests indicate that the stock persistence managers are a significant
> >>>> bottleneck for both writes and also initial reads to load the transient
> >>>> store (on the order of .5 seconds per node when using a remote database
> >>>> like
> >>>> MSSQL or Oracle).
> >>>
> >>> what do you mean by "load the transient store"?
> >>>
> >>>>
> >>>> The stock db persistence managers have all methods marked as
> >>>> "synchronized",
> >>>> which blocks on the classdef (which means that even different persistence
> >>>> managers for different workspaces will serialize all load, exists and store
> >>>
> >>> assuming you're talking about DatabasePersistenceManager:
> >>> the store/destroy methods are 'synchronized' on the instance, not on
> >>> the 'classdef'.
> >>> see e.g.
> >>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> >>>
> >>> the load/exists methods are synchronized on the specific prepared stmt
> >>> they're
> >>> using.
> >>>
> >>> since every workspace uses its own persistence manager instance i can't
> >>> follow your conclusion that all load, exists and store operations would be
> >>> be globally serialized across all workspaces.
> >>
> >> Hm, this is my bad... It does seem that sync methods are on the instance.
> >> Since the db persistence manager has "synchronized" on load, store and
> >> exists, though, this would still serialize all of these operations for a
> >> particular workspace.
> >
> > ?? the load methods are *not* synchronized. they contain a section which
> > is synchronized on the particular prepared stmt.
> >
> > <quote from my previous reply>
> > wrt synchronization:
> > concurrency is controlled outside the persistence manager on a higher level.
> > eliminating the method synchronization would imo therefore have *no* impact
> > on concurrency/performance.
> > </quote>
> >
> > cheers
> > stefan
> >
> >>
> >>>> operations).  Presumably this is because they allocate a JDBC connection at
> >>>> startup and use it throughout, and the connection object is not
> >>>> multithreaded.
> >>>
> >>> what leads you to this assumption?
> >>
> >> Are there other requirements that all of these operations are serialized for
> >> a particular PM instance?  This seems like a pretty serious bottleneck (and,
> >> in fact, is a pretty serious bottleneck when the database is remote from the
> >> repository).
> >>
> >>>>
> >>>> This problem isn't as noticeable when you are using embedded Derby and
> >>>> reading/writing to the file system, but when you are doing a network
> >>>> operation to a database server, the network latency in combination with the
> >>>> serialization of all database operations results in a significant
> >>>> performance degradation.
> >>>
> >>> again: serialization of 'all' database operations?
> >>
> >> The distinction between all and all for a workspace would really only be
> >> relevant during versioning, right?
> >>
> >>>>
> >>>> The new bundle persistence manager (which isn't yet in SVN) improves things
> >>>> dramatically since it inlines properties into the node, so loading or
> >>>> persisting a node is only one operation (plus the additional connection for
> >>>> the LOB) instead of one for the node and one for each property.  The
> >>>> bundle persistence manager also uses prepared statements and keeps a
> >>>> PM-level cache of nodes (with properties) and also non-existent nodes
> >>>> (which
> >>>> permits many exists() calls to return without accessing the database).
> >>>>
> >>>> Changing all db persistence managers to use a datasource and get and
> >>>> release
> >>>> connections inside of load, exists and store operations and eliminating the
> >>>> method synchronization is a relatively simple change that further improves
> >>>> performance for connecting to database servers.
> >>>
> >>> the use of datasources, connection pools and the like have been discussed
> >>> in extenso on the list. see e.g.
> >>>
> >>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> >>> http://issues.apache.org/jira/browse/JCR-313
> >>>
> >>> i don't see how getting & releasing connections in every load, exists and
> >>> store
> >>> call would improve performance. could you please elaborate?
> >>>
> >>> please note that you wouldn't be able to use prepared statements over
> >>> multiple
> >>> load, store etc operations because you'd have to return the connection
> >>> at the end
> >>> of every call. the performance might therefore be even worse.
> >>>
> >>> further note that write operations must occur within a single jdbc
> >>> transaction, i.e.
> >>> you can't get a new connection for every store/destroy operation.
> >>>
> >>> wrt synchronization:
> >>> concurrency is controlled outside the persistence manager on a higher level.
> >>> eliminating the method synchronization would imo therefore have *no* impact
> >>> on concurrency/performance.
> >>
> >> So you are saying that it is impossible to concurrently load or store data
> >> in Jackrabbit?
> >>
> >>>> There is a persistence manager with an ASL license called
> >>>> "DataSourcePersistenceManager" which seems to the PM of choice for people
> >>>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
> >>>> statements and eliminates the current single-connection issues associated
> >>>> with all of the stock db PMs.  It doesn't seem to have been submitted back
> >>>> to the Jackrabbit project.  If you Google for
> >>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
> >>>> should be able to find it.
> >>>
> >>> thanks for the hint. i am aware of this pm and i had a look at it a couple
> >>> of
> >>> months ago. the major issue was that it didn't implement the
> >>> correct/required
> >>> semantics. it used a new connection for every write operation which
> >>> clearly violates the contract that the write operations should occur within
> >>> a jdbc transaction bracket. further it creates a prepared stmt on every
> >>> load, store etc. which is hardly efficient...
> >>
> >> Yes, this PM does have this issue.  The bundle PM implements prepared
> >> statements in the correct way.
> >>
> >>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
> >>>> use the Oracle-specific PMs because the 10g drivers support the standard
> >>>> BLOB API (in addition to the Oracle-specific BLOB API required by the older
> >>>> 9i drivers).  This is true even if you are connecting to an older database
> >>>> server as the limitation was in the driver itself.  Frankly you should
> >>>> never
> >>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
> >>>> complete rewrite.  Make sure you use the new driver package because the 10g
> >>>> driver JAR also includes the older 9i drivers for backward-compatibility.
> >>>> The new driver is in a new package (can't remember the exact name off the
> >>>> top of my head).
> >>>
> >>> thanks for the information.
> >>>
> >>> cheers
> >>> stefan
> >>
> >> We are very interested in getting a good understanding of the specifics of
> >> how PM's work, as initial reads and writes, according to our profiling, are
> >> spending 80-90% of the time inside the PM.
> >>
> >> Bryan.
> >>

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Bryan Davis <br...@bea.com>.
Well, synchronizing on the prepared statement still leaves things fairly
serialized, since we are really only talking about nodes and properties (two
locks instead of one).  If concurrency is controlled at a higher level, then
why is the synchronization in the PM necessary at all?

The code now seems to assume that the connection object is thread-safe (and
the specifics of thread-safety for connection objects and other objects
derived from them are pretty much up to the driver).  This is one of the
reasons why connection pooling is used pretty much universally.
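
For reference, the pattern under discussion looks roughly like the minimal
sketch below. It is illustrative only, not the actual
DatabasePersistenceManager source, and the class and method names are made
up.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of the shared-connection pattern: one JDBC connection opened at
// startup and one prepared statement reused for every read, so concurrent
// readers take turns on the statement's monitor.
public class SharedConnectionReader {

    private final Connection connection;     // held open for the PM's lifetime
    private final PreparedStatement select;  // prepared once on that connection

    public SharedConnectionReader(Connection connection, String selectSQL)
            throws SQLException {
        this.connection = connection;
        this.select = connection.prepareStatement(selectSQL);
    }

    public byte[] load(String id) throws SQLException {
        // Every concurrent load() queues up here; with a remote database the
        // round-trip latency is paid serially, one caller at a time.
        synchronized (select) {
            select.setString(1, id);
            ResultSet rs = select.executeQuery();
            try {
                return rs.next() ? rs.getBytes(1) : null;
            } finally {
                rs.close();
            }
        }
    }

    public void close() throws SQLException {
        select.close();
        connection.close();
    }
}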

If the built-in PM's used data sources instead of connections, then the
connection settings could be more easily externalized (as these are
typically configurable by the end user). Is there any way to externalize the
JDBC connection settings from repository.xml right now (in 1.2.2) and
configure them at runtime?

You didn't really answer my question about Jackrabbit and its ability to
fetch and store information through the PM concurrently... What is the
synchronization at the higher level and how does it work?

Finally, we are seeing a new issue where, if a particular user uploads a
large document, all other users start to get exceptions (while doing a
normal mix of mostly reads and some writes).  If there is no way to do
concurrent writes to the PM, I don't see any way around this problem (and it
is pretty serious for us).

Bryan.


On 3/6/07 4:12 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:

> On 3/5/07, Bryan Davis <br...@bea.com> wrote:
>> 
>> 
>> 
>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
>> 
>>> hi bryan
>>> 
>>> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
>>>> What persistence manager are you using?
>>>> 
>>>> Our tests indicate that the stock persistence managers are a significant
>>>> bottleneck for both writes and also initial reads to load the transient
>>>> store (on the order of .5 seconds per node when using a remote database
>>>> like
>>>> MSSQL or Oracle).
>>> 
>>> what do you mean by "load the transient store"?
>>> 
>>>> 
>>>> The stock db persistence managers have all methods marked as
>>>> "synchronized",
>>>> which blocks on the classdef (which means that even different persistence
>>>> managers for different workspaces will serialize all load, exists and store
>>> 
>>> assuming you're talking about DatabasePersistenceManager:
>>> the store/destroy methods are 'synchronized' on the instance, not on
>>> the 'classdef'.
>>> see e.g.
>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
>>> 
>>> the load/exists methods are synchronized on the specific prepared stmt
>>> they're
>>> using.
>>> 
>>> since every workspace uses its own persistence manager instance i can't
>>> follow your conclusion that all load, exists and store operations would
>>> be globally serialized across all workspaces.
>> 
>> Hm, this is my bad... It does seem that sync methods are on the instance.
>> Since the db persistence manager has "synchronized" on load, store and
>> exists, though, this would still serialize all of these operations for a
>> particular workspace.
> 
> ?? the load methods are *not* synchronized. they contain a section which
> is synchronized on the particular prepared stmt.
> 
> <quote from my previous reply>
> wrt synchronization:
> concurrency is controlled outside the persistence manager on a higher level.
> eliminating the method synchronization would imo therefore have *no* impact
> on concurrency/performance.
> </quote>
> 
> cheers
> stefan
> 
>> 
>>>> operations).  Presumably this is because they allocate a JDBC connection at
>>>> startup and use it throughout, and the connection object is not
>>>> multithreaded.
>>> 
>>> what leads you to this assumption?
>> 
>> Are there other requirements that force all of these operations to be serialized for
>> a particular PM instance?  This seems like a pretty serious bottleneck (and,
>> in fact, is a pretty serious bottleneck when the database is remote from the
>> repository).
>> 
>>>> 
>>>> This problem isn't as noticeable when you are using embedded Derby and
>>>> reading/writing to the file system, but when you are doing a network
>>>> operation to a database server, the network latency in combination with the
>>>> serialization of all database operations results in a significant
>>>> performance degradation.
>>> 
>>> again: serialization of 'all' database operations?
>> 
>> The distinction between all and all-for-a-workspace would really only be
>> relevant during versioning, right?
>> 
>>>> 
>>>> The new bundle persistence manager (which isn't yet in SVN) improves things
>>>> dramatically since it inlines properties into the node, so loading or
>>>> persisting a node is only one operation (plus the additional connection for
>>>> the LOB) instead of one for the node and one for each property.  The
>>>> bundle persistence manager also uses prepared statements and keeps a
>>>> PM-level cache of nodes (with properties) and also non-existent nodes
>>>> (which
>>>> permits many exists() calls to return without accessing the database).
>>>> 
>>>> Changing all db persistence managers to use a datasource and get and
>>>> release
>>>> connections inside of load, exists and store operations and eliminating the
>>>> method synchronization is a relatively simple change that further improves
>>>> performance for connecting to database servers.
>>> 
>>> the use of datasources, connection pools and the like have been discussed
>>> in extenso on the list. see e.g.
>>> 
>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
>>> http://issues.apache.org/jira/browse/JCR-313
>>> 
>>> i don't see how getting & releasing connections in every load, exists and
>>> store
>>> call would improve performance. could you please elaborate?
>>> 
>>> please note that you wouldn't be able to use prepared statements over
>>> multiple
>>> load, store etc operations because you'd have to return the connection
>>> at the end
>>> of every call. the performance might therefore be even worse.
>>> 
>>> further note that write operations must occur within a single jdbc
>>> transaction, i.e.
>>> you can't get a new connection for every store/destroy operation.
>>> 
>>> wrt synchronization:
>>> concurrency is controlled outside the persistence manager on a higher level.
>>> eliminating the method synchronization would imo therefore have *no* impact
>>> on concurrency/performance.
>> 
>> So you are saying that it is impossible to concurrently load or store data
>> in Jackrabbit?
>> 
>>>> There is a persistence manager with an ASL license called
>>>> "DataSourcePersistenceManager" which seems to the PM of choice for people
>>>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
>>>> statements and eliminates the current single-connection issues associated
>>>> with all of the stock db PMs.  It doesn't seem to have been submitted back
>>>> to the Jackrabbit project.  If you Google for
>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
>>>> should be able to find it.
>>> 
>>> thanks for the hint. i am aware of this pm and i had a look at it a couple
>>> of
>>> months ago. the major issue was that it didn't implement the
>>> correct/required
>>> semantics. it used a new connection for every write operation which
>>> clearly violates the contract that the write operations should occur within
>>> a jdbc transaction bracket. further it creates a prepared stmt on every
>>> load, store etc. which is hardly efficient...
>> 
>> Yes, this PM does have this issue.  The bundle PM implements prepared
>> statements in the correct way.
>> 
>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
>>>> use the Oracle-specific PMs because the 10g drivers support the standard
>>>> BLOB API (in addition to the Oracle-specific BLOB API required by the older
>>>> 9i drivers).  This is true even if you are connecting to an older database
>>>> server as the limitation was in the driver itself.  Frankly you should
>>>> never
>>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
>>>> complete rewrite.  Make sure you use the new driver package because the 10g
>>>> driver JAR also includes the older 9i drivers for backward-compatibility.
>>>> The new driver is in a new package (can't remember the exact name off the
>>>> top of my head).
>>> 
>>> thanks for the information.
>>> 
>>> cheers
>>> stefan
>> 
>> We are very interested in getting a good understanding of the specifics of
>> how PM's work, as initial reads and writes, according to our profiling, are
>> spending 80-90% of the time inside the PM.
>> 
>> Bryan.
>> 

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Stefan Guggisberg <st...@gmail.com>.
On 3/5/07, Bryan Davis <br...@bea.com> wrote:
>
>
>
> On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:
>
> > hi bryan
> >
> > On 3/2/07, Bryan Davis <br...@bea.com> wrote:
> >> What persistence manager are you using?
> >>
> >> Our tests indicate that the stock persistence managers are a significant
> >> bottleneck for both writes and also initial reads to load the transient
> >> store (on the order of .5 seconds per node when using a remote database like
> >> MSSQL or Oracle).
> >
> > what do you mean by "load the transient store"?
> >
> >>
> >> The stock db persistence managers have all methods marked as "synchronized",
> >> which blocks on the classdef (which means that even different persistence
> >> managers for different workspaces will serialize all load, exists and store
> >
> > assuming you're talking about DatabasePersistenceManager:
> > the store/destroy methods are 'synchronized' on the instance, not on
> > the 'classdef'.
> > see e.g.
> > http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> >
> > the load/exists methods are synchronized on the specific prepared stmt they're
> > using.
> >
> > since every workspace uses its own persistence manager instance i can't
> > follow your conclusion that all load, exists and store operations would
> > be globally serialized across all workspaces.
>
> Hm, this is my bad... It does seem that sync methods are on the instance.
> Since the db persistence manager has "synchronized" on load, store and
> exists, though, this would still serialize all of these operations for a
> particular workspace.

?? the load methods are *not* synchronized. they contain a section which
is synchronized on the particular prepared stmt.

<quote from my previous reply>
wrt synchronization:
concurrency is controlled outside the persistence manager on a higher level.
eliminating the method synchronization would imo therefore have *no* impact
on concurrency/performance.
</quote>

cheers
stefan

>
> >> operations).  Presumably this is because they allocate a JDBC connection at
> >> startup and use it throughout, and the connection object is not
> >> multithreaded.
> >
> > what leads you to this assumption?
>
> Are there other requirements that force all of these operations to be serialized for
> a particular PM instance?  This seems like a pretty serious bottleneck (and,
> in fact, is a pretty serious bottleneck when the database is remote from the
> repository).
>
> >>
> >> This problem isn't as noticeable when you are using embedded Derby and
> >> reading/writing to the file system, but when you are doing a network
> >> operation to a database server, the network latency in combination with the
> >> serialization of all database operations results in a significant
> >> performance degradation.
> >
> > again: serialization of 'all' database operations?
>
> The distinction between all and all-for-a-workspace would really only be
> relevant during versioning, right?
>
> >>
> >> The new bundle persistence manager (which isn't yet in SVN) improves things
> >> dramatically since it inlines properties into the node, so loading or
> >> persisting a node is only one operation (plus the additional connection for
> >> the LOB) instead of one for the node and one for each property.  The
> >> bundle persistence manager also uses prepared statements and keeps a
> >> PM-level cache of nodes (with properties) and also non-existent nodes (which
> >> permits many exists() calls to return without accessing the database).
> >>
> >> Changing all db persistence managers to use a datasource and get and release
> >> connections inside of load, exists and store operations and eliminating the
> >> method synchronization is a relatively simple change that further improves
> >> performance for connecting to database servers.
> >
> > the use of datasources, connection pools and the like have been discussed
> > in extenso on the list. see e.g.
> > http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> > http://issues.apache.org/jira/browse/JCR-313
> >
> > i don't see how getting & releasing connections in every load, exists and
> > store
> > call would improve performance. could you please elaborate?
> >
> > please note that you wouldn't be able to use prepared statements over multiple
> > load, store etc operations because you'd have to return the connection
> > at the end
> > of every call. the performance might therefore be even worse.
> >
> > further note that write operations must occur within a single jdbc
> > transaction, i.e.
> > you can't get a new connection for every store/destroy operation.
> >
> > wrt synchronization:
> > concurrency is controlled outside the persistence manager on a higher level.
> > eliminating the method synchronization would imo therefore have *no* impact
> > on concurrency/performance.
>
> So you are saying that it is impossible to concurrently load or store data
> in Jackrabbit?
>
> >> There is a persistence manager with an ASL license called
> >> "DataSourcePersistenceManager" which seems to the PM of choice for people
> >> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
> >> statements and eliminates the current single-connection issues associated
> >> with all of the stock db PMs.  It doesn't seem to have been submitted back
> >> to the Jackrabbit project.  If you Google for
> >> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
> >> should be able to find it.
> >
> > thanks for the hint. i am aware of this pm and i had a look at it a couple of
> > months ago. the major issue was that it didn't implement the correct/required
> > semantics. it used a new connection for every write operation which
> > clearly violates the contract that the write operations should occur within
> > a jdbc transaction bracket. further it creates a prepared stmt on every
> > load, store etc. which is hardly efficient...
>
> Yes, this PM does have this issue.  The bundle PM implements prepared
> statements in the correct way.
>
> >> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
> >> use the Oracle-specific PMs because the 10g drivers support the standard
> >> BLOB API (in addition to the Oracle-specific BLOB API required by the older
> >> 9i drivers).  This is true even if you are connecting to an older database
> >> server as the limitation was in the driver itself.  Frankly you should never
> >> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
> >> complete rewrite.  Make sure you use the new driver package because the 10g
> >> driver JAR also includes the older 9i drivers for backward-compatibility.
> >> The new driver is in a new package (can't remember the exact name off the
> >> top of my head).
> >
> > thanks for the information.
> >
> > cheers
> > stefan
>
> We are very interested in getting a good understanding of the specifics of
> how PM's work, as initial reads and writes, according to our profiling, are
> spending 80-90% of the time inside the PM.
>
> Bryan.
>

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/13/07, Bryan Davis <br...@bea.com> wrote:
> I am working on a response to the many recent additions to this thread
> (hopefully will have something later today).

If you're interested, see below for some code I drafted together last
year when this subject was up earlier. I quickly updated the code to
match the latest changes in Jackrabbit. The class is just a quick
prototype, i.e. it compiles but is not tested and not really
documented.

PS. How about moving this discussion to the development mailing list?

BR,

Jukka Zitting

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.jackrabbit.core.persistence.db;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;

import javax.naming.InitialContext;
import javax.sql.DataSource;

import org.apache.jackrabbit.core.NodeId;
import org.apache.jackrabbit.core.PropertyId;
import org.apache.jackrabbit.core.fs.BasedFileSystem;
import org.apache.jackrabbit.core.fs.FileSystem;
import org.apache.jackrabbit.core.persistence.PMContext;
import org.apache.jackrabbit.core.persistence.PersistenceManager;
import org.apache.jackrabbit.core.persistence.util.BLOBStore;
import org.apache.jackrabbit.core.persistence.util.FileSystemBLOBStore;
import org.apache.jackrabbit.core.persistence.util.Serializer;
import org.apache.jackrabbit.core.state.ChangeLog;
import org.apache.jackrabbit.core.state.ItemState;
import org.apache.jackrabbit.core.state.ItemStateException;
import org.apache.jackrabbit.core.state.NoSuchItemStateException;
import org.apache.jackrabbit.core.state.NodeReferences;
import org.apache.jackrabbit.core.state.NodeReferencesId;
import org.apache.jackrabbit.core.state.NodeState;
import org.apache.jackrabbit.core.state.PropertyState;

public class DataSourcePersistenceManager implements PersistenceManager {

    /**
     * The underlying data source.
     */
    private DataSource database;

    /**
     * JNDI location of the data source used to acquire database connections.
     */
    private String location;

    /**
     * Schema object prefix.
     */
    private String prefix;

    /**
     * Blob store.
     */
    private BLOBStore blobs;

    private String nodeExistsSQL;
    private String propExistsSQL;
    private String refsExistsSQL;

    private String nodeSelectSQL;
    private String propSelectSQL;
    private String refsSelectSQL;

    private String nodeInsertSQL;
    private String propInsertSQL;
    private String refsInsertSQL;

    private String nodeUpdateSQL;
    private String propUpdateSQL;
    private String refsUpdateSQL;

    private String nodeDeleteSQL;
    private String propDeleteSQL;
    private String refsDeleteSQL;

    //----------------------------------------------------< setters & getters >

    /**
     * Returns the JNDI location of the data source.
     *
     * @return data source location
     */
    public String getDataSourceLocation() {
        return location;
    }

    /**
     * Sets the JNDI location of the data source.
     *
     * @param location data source location
     */
    public void setDataSourceLocation(String location) {
        this.location = location;
    }

    /**
     * Returns the schema object prefix.
     *
     * @return schema object prefix
     */
    public String getSchemaObjectPrefix() {
        return prefix;
    }

    /**
     * Sets the schema object prefix.
     *
     * @param prefix
     */
    public void setSchemaObjectPrefix(String prefix) {
        this.prefix = prefix.toUpperCase();
    }

    //--------------------------------------------------< PersistenceManager >

    /**
     * Initializes this persistence manager.
     */
    public void init(PMContext context) throws Exception {
        database = (DataSource) new InitialContext().lookup(location);

        FileSystem filesystem =
            new BasedFileSystem(context.getFileSystem(), "blobs");
        filesystem.init();
        blobs = new FileSystemBLOBStore(filesystem);

        nodeExistsSQL = "SELECT 1 FROM " + prefix + "NODE WHERE NODE_ID=?";
        propExistsSQL = "SELECT 1 FROM " + prefix + "PROP WHERE PROP_ID=?";
        refsExistsSQL = "SELECT 1 FROM " + prefix + "REFS WHERE NODE_ID=?";
        nodeSelectSQL = "SELECT NODE_DATA FROM " + prefix + "NODE
WHERE NODE_ID=?";
        propSelectSQL = "SELECT PROP_DATA FROM " + prefix + "PROP
WHERE PROP_ID=?";
        refsSelectSQL = "SELECT REFS_DATA FROM " + prefix + "REFS
WHERE NODE_ID=?";
        nodeInsertSQL = "INSERT INTO " + prefix + "NODE
(NODE_DATA,NODE_ID) VALUES (?,?)";
        propInsertSQL = "INSERT INTO " + prefix + "PROP
(PROP_DATA,PROP_ID) VALUES (?,?)";
        refsInsertSQL = "INSERT INTO " + prefix + "REFS
(REFS_DATA,NODE_ID) VALUES (?,?)";
        nodeUpdateSQL = "UPDATE " + prefix + "NODE SET NODE_DATA=?
WHERE NODE_ID=?";
        propUpdateSQL = "UPDATE " + prefix + "PROP SET PROP_DATA=?
WHERE PROP_ID=?";
        refsUpdateSQL = "UPDATE " + prefix + "REFS SET REFS_DATA=?
WHERE NODE_ID=?";
        nodeDeleteSQL = "DELETE FROM " + prefix + "NODE WHERE NODE_ID=?";
        propDeleteSQL = "DELETE FROM " + prefix + "PROP WHERE PROP_ID=?";
        refsDeleteSQL = "DELETE FROM " + prefix + "REFS WHERE NODE_ID=?";
    }

    /**
     * Closes this persistence manager.
     */
    public void close() {
        database = null;
        blobs = null;
    }

    /**
     * Creates a new node state instance.
     *
     * @param id node identifier
     * @return node state
     */
    public NodeState createNew(NodeId id) {
        return new NodeState(id, null, null, NodeState.STATUS_NEW, false);
    }

    /**
     * Creates a new property state instance.
     *
     * @param id property identifier
     * @return property state
     */
    public PropertyState createNew(PropertyId id) {
        return new PropertyState(id, PropertyState.STATUS_NEW, false);
    }

    /**
     * Checks whether the identified node state exists.
     *
     * @param id node identifier
     * @return <code>true</code> if the node state exists,
     *         <code>false</code> otherwise
     * @throws ItemStateException if a database error occurred
     */
    public boolean exists(NodeId id) throws ItemStateException {
        return exists(nodeExistsSQL, id.toString());
    }

    /**
     * Checks whether the identified property state exists.
     *
     * @param id property identifier
     * @return <code>true</code> if the property state exists,
     *         <code>false</code> otherwise
     * @throws ItemStateException if a database error occurred
     */
    public boolean exists(PropertyId id) throws ItemStateException {
        return exists(propExistsSQL, id.toString());
    }

    /**
     * Checks whether references to the identified node exists.
     *
     * @param targetId reference identifier
     * @return <code>true</code> if references to the identified node exist,
     *         <code>false</code> otherwise
     * @throws ItemStateException if a database error occurred
     */
    public boolean exists(NodeReferencesId targetId) throws ItemStateException {
        return exists(refsExistsSQL, targetId.toString());
    }

    /**
     * Loads the identified node state.
     *
     * @param id node identifier
     * @return node state
     * @throws NoSuchItemStateException if the node state does not exist
     * @throws ItemStateException if a database error occurred
     */
    public NodeState load(NodeId id)
            throws NoSuchItemStateException, ItemStateException {
        final NodeState state = createNew(id);
        load(nodeSelectSQL, id.toString(), new RecordReader() {
            public void readRecord(InputStream stream) throws Exception {
                Serializer.deserialize(state, stream);
            }
        });
        return state;
    }

    /**
     * Loads the identified property state.
     *
     * @param id property identifier
     * @return property state
     * @throws NoSuchItemStateException if the property state does not exist
     * @throws ItemStateException if a database error occurred
     */
    public PropertyState load(PropertyId id)
            throws NoSuchItemStateException, ItemStateException {
        final PropertyState state = createNew(id);
        load(propSelectSQL, id.toString(), new RecordReader() {
            public void readRecord(InputStream stream) throws Exception {
                Serializer.deserialize(state, stream, blobs);
            }
        });
        return state;
    }

    /**
     * Loads references to the identified node.
     *
     * @param id reference identifier
     * @return node references
     * @throws NoSuchItemStateException if there are no references to the node
     * @throws ItemStateException if a database error occurred
     */
    public NodeReferences load(NodeReferencesId id)
            throws NoSuchItemStateException, ItemStateException {
        final NodeReferences references = new NodeReferences(id);
        load(refsSelectSQL, id.toString(), new RecordReader() {
            public void readRecord(InputStream stream) throws Exception {
                Serializer.deserialize(references, stream);
            }
        });
        return references;
    }

    /**
     * Persists all the changes in the given change log. No changes are
     * persisted if an error occurs.
     *
     * @param changeLog change log
     * @throws ItemStateException if a database error occurred
     */
    public void store(ChangeLog changeLog) throws ItemStateException {
        try {
            Connection connection = database.getConnection();
            try {
                // persist the whole change log within a single jdbc
                // transaction, so it is stored as a whole or not at all
                connection.setAutoCommit(false);
                try {
                    storeItemStates(
                            connection, changeLog.addedStates(),
                            nodeInsertSQL, propInsertSQL);
                    storeItemStates(
                            connection, changeLog.modifiedStates(),
                            nodeUpdateSQL, propUpdateSQL);
                    deleteItemStates(connection, changeLog.deletedStates());
                    storeNodeReferences(connection, changeLog.modifiedRefs());
                    connection.commit();
                } catch (SQLException e) {
                    connection.rollback();
                    throw e;
                }
            } finally {
                connection.close();
            }
        } catch (SQLException e) {
            throw new ItemStateException("Database error", e);
        }
    }

    //-------------------------------------------------------------< private >

    private interface RecordReader {

        void readRecord(InputStream stream) throws Exception;

    }

    private interface RecordWriter {

        String getId(Object record);

        void writeRecord(Object record, OutputStream stream) throws Exception;

    }

    /**
     * Checks whether the identified database record exists.
     *
     * @param sql the SQL SELECT statement to use for the check
     * @param id record identifier
     * @return <code>true</code> if the identified record exists,
     *         <code>false</code> otherwise
     * @throws ItemStateException if a database error occurred
     */
    private boolean exists(String sql, String id) throws ItemStateException {
        try {
            Connection connection = database.getConnection();
            try {
                PreparedStatement select = connection.prepareStatement(sql);
                try {
                    select.setString(1, id);
                    ResultSet rs = select.executeQuery();
                    try {
                        return rs.next();
                    } finally {
                        rs.close();
                    }
                } finally {
                    select.close();
                }
            } finally {
                connection.close();
            }
        } catch (SQLException e) {
            throw new ItemStateException("Database error", e);
        }
    }

    /**
     * Loads the identified database record. The record is deserialized using
     * the given deserializer instance.
     *
     * @param sql the SQL SELECT statement to use for loading the record
     * @param id record identifier
     * @param reader record reader
     * @throws NoSuchItemStateException if the record does not exist
     * @throws ItemStateException if a database error occurred
     */
    private void load(String sql, String id, RecordReader reader)
            throws NoSuchItemStateException, ItemStateException {
        try {
            Connection connection = database.getConnection();
            try {
                PreparedStatement select = connection.prepareStatement(sql);
                try {
                    select.setString(1, id);
                    ResultSet rs = select.executeQuery();
                    try {
                        if (rs.next()) {
                            InputStream stream = rs.getBinaryStream(1);
                            try {
                                reader.readRecord(stream);
                            } catch (Exception e) {
                                throw new ItemStateException(
                                        "Deserialization failed: " + id, e);
                            } finally {
                                stream.close();
                            }
                        } else {
                            throw new NoSuchItemStateException(id);
                        }
                    } catch (IOException e) {
                        throw new ItemStateException("Database error", e);
                    } finally {
                        rs.close();
                    }
                } finally {
                    select.close();
                }
            } finally {
                connection.close();
            }
        } catch (SQLException e) {
            throw new ItemStateException("Database error", e);
        }
    }

    private void classifyItemStates(
            Iterator iterator, Collection nodes, Collection props) {
        while (iterator.hasNext()) {
            ItemState state = (ItemState) iterator.next();
            if (state.isNode()) {
                nodes.add(state);
            } else {
                props.add(state);
            }
        }
    }

    private void storeItemStates(
            Connection connection, Iterator iterator,
            String nodeSQL, String propSQL) throws SQLException {
        Collection nodes = new ArrayList();
        Collection props = new ArrayList();
        classifyItemStates(iterator, nodes, props);
        if (!nodes.isEmpty()) {
            storeRecords(connection, nodeSQL, new RecordWriter() {
                public String getId(Object record) {
                    return ((NodeState) record).getId().toString();
                }
                public void writeRecord(
                        Object record, OutputStream stream) throws Exception {
                    Serializer.serialize((NodeState) record, stream);
                }
            }, nodes.iterator());
        }
        if (!props.isEmpty()) {
            storeRecords(connection, propSQL, new RecordWriter() {
                public String getId(Object record) {
                    return ((PropertyState) record).getId().toString();
                }
                public void writeRecord(
                        Object record, OutputStream stream) throws Exception {
                    Serializer.serialize((PropertyState) record, stream, blobs);
                }
            }, props.iterator());
        }
    }

    private void storeRecords(
            Connection connection, String sql,
            RecordWriter writer, Iterator iterator) throws SQLException {
        PreparedStatement statement = connection.prepareStatement(sql);
        try {
            while (iterator.hasNext()) {
                Object record = iterator.next();
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                try {
                    writer.writeRecord(record, buffer);
                } catch (Exception e) {
                    // preserve the original cause (SQLException has no
                    // cause constructor in JDK 5)
                    SQLException sqle =
                            new SQLException("Serialization failed: " + record);
                    sqle.initCause(e);
                    throw sqle;
                }
                byte[] bytes = buffer.toByteArray();
                statement.setBinaryStream(
                        1, new ByteArrayInputStream(bytes), bytes.length);
                statement.setString(2, writer.getId(record));
                statement.execute();
            }
        } finally {
            statement.close();
        }
    }

    private void deleteItemStates(Connection connection, Iterator iterator)
            throws SQLException {
        Collection nodes = new ArrayList();
        Collection props = new ArrayList();
        classifyItemStates(iterator, nodes, props);
        if (!nodes.isEmpty()) {
            deleteItemStates(connection, nodeDeleteSQL, nodes.iterator());
        }
        if (!props.isEmpty()) {
            deleteItemStates(connection, propDeleteSQL, props.iterator());
        }
    }

    private void deleteItemStates(
            Connection connection, String sql, Iterator iterator)
            throws SQLException {
        PreparedStatement statement = connection.prepareStatement(sql);
        try {
            while (iterator.hasNext()) {
                ItemState state = (ItemState) iterator.next();
                statement.setString(1, state.getId().toString());
                statement.execute();
            }
        } finally {
            statement.close();
        }
    }

    private void storeNodeReferences(
            Connection connection, Iterator iterator) throws SQLException {
        if (iterator.hasNext()) {
            Collection insert = new ArrayList();
            Collection update = new ArrayList();
            Collection delete = new ArrayList();
            classifyNodeReferences(
                    connection, iterator, insert, update, delete);

            if (!insert.isEmpty()) {
                storeNodeReferences(
                        connection, refsInsertSQL, insert.iterator());
            }
            if (!update.isEmpty()) {
                storeNodeReferences(
                        connection, refsUpdateSQL, update.iterator());
            }
            if (!delete.isEmpty()) {
                deleteNodeReferences(connection, delete.iterator());
            }
        }
    }

    private void storeNodeReferences(
            Connection connection, String sql, Iterator iterator)
            throws SQLException {
        storeRecords(connection, sql, new RecordWriter() {
            public String getId(Object record) {
                return ((NodeReferences) record).getId().toString();
            }
            public void writeRecord(Object record, OutputStream stream)
                    throws Exception {
                Serializer.serialize((NodeReferences) record, stream);
            }
        }, iterator);
    }

    private void deleteNodeReferences(Connection connection, Iterator iterator)
            throws SQLException {
        PreparedStatement delete = connection.prepareStatement(refsDeleteSQL);
        try {
            while (iterator.hasNext()) {
                NodeReferences references = (NodeReferences) iterator.next();
                delete.setString(1, references.getId().toString());
                delete.execute();
            }
        } finally {
            delete.close();
        }
    }

    private void classifyNodeReferences(
            Connection connection, Iterator iterator,
            Collection insert, Collection update, Collection delete)
            throws SQLException {
        PreparedStatement select = connection.prepareStatement(refsExistsSQL);
        try {
            while (iterator.hasNext()) {
                NodeReferences references = (NodeReferences) iterator.next();
                if (!references.hasReferences()) {
                    delete.add(references);
                } else {
                    select.setString(1, references.getId().toString());
                    ResultSet rs = select.executeQuery();
                    try {
                        if (rs.next()) {
                            update.add(references);
                        } else {
                            insert.add(references);
                        }
                    } finally {
                        rs.close();
                    }
                }
            }
        } finally {
            select.close();
        }
    }

}
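
To wire the prototype up, dataSourceLocation has to resolve to a pooled
DataSource in JNDI. Below is a minimal setup sketch, assuming Apache Commons
DBCP for the pool and a container (or test harness) that provides a writable
JNDI context; the driver class, URL, credentials and JNDI name are
placeholders.

import javax.naming.InitialContext;

import org.apache.commons.dbcp.BasicDataSource;

public class DataSourceSetup {

    public static void main(String[] args) throws Exception {
        // Pooled data source: connections and, optionally, their prepared
        // statements are reused across load/exists/store calls.
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("oracle.jdbc.OracleDriver"); // 10g driver package
        ds.setUrl("jdbc:oracle:thin:@dbhost:1521:ORCL");   // placeholder URL
        ds.setUsername("jackrabbit");                      // placeholder account
        ds.setPassword("secret");
        ds.setMaxActive(20);                // cap on concurrent connections
        ds.setPoolPreparedStatements(true); // cache statements per connection

        // Bind it under the name that setDataSourceLocation() will be given
        // in the persistence manager configuration.
        new InitialContext().bind("jdbc/JackrabbitDS", ds);
    }
}

With statement pooling enabled, the per-call connection.prepareStatement(...)
in the prototype is served from a cache instead of being re-parsed, which
addresses the earlier concern about creating a prepared statement on every
load and store.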

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Bryan Davis <br...@bea.com>.
I am working on a response to the many recent additions to this thread
(hopefully will have something later today).

Thanks!
Bryan.


Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Nicolas <nt...@gmail.com>.
Hi,

On 3/12/07, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Jukka Zitting wrote:
>
> > ACK, the key is the write lock on SharedItemStateManager. In fact, do
> > we even need the database persistence managers to be transactional
> > over multiple method calls?
>
> I can't follow you here. what exactly do you mean by weakening transaction
> requirements on the persistence manager? e.g. reading of uncommitted
> items?
>

I think Jukka means that the underlying DB already ensures transactional
capability, so the DatabasePersistenceManager could still be transactional
without a write lock on JR's side.

It seems achievable, but I wonder whether the performance gain would be
noticeable or whether it would make things worse (we would, for instance,
have to use different PreparedStatement instances).

IMO it seems we don't really know where the bottlenecks are; by this I mean
we have no cold, hard facts.

BR,
Nico
my blog! http://www.deviant-abstraction.net !!

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Marcel Reutegger <ma...@gmx.net>.
Jukka Zitting wrote:
>> A) a change log must be persisted as a whole or not at all
>> I) while a change log is persisted a read must not see partially 
>> stored content
> 
> These could both be achieved also with connection pooling, just
> acquire a connection at the beginning of PersistenceManager.store()
> and commit the changes at the end of the method before releasing the
> connection.
> 
> Similar pattern would also work for all the load() and exists()
> methods to avoid the need to synchronize things on the prepared
> statements.

Agreed.

>> C) this is actually handled by the upper level
> 
> ACK, the key is the write lock on SharedItemStateManager. In fact, do
> we even need the database persistence managers to be transactional
> over multiple method calls?

I can't follow you here. what exactly do you mean by weakening transaction 
requirements on the persistence manager? e.g. reading of uncommitted items?

> And to follow, could we in fact already
> now remove the synchronization of read operations given that
> consistency is already achieved on a higher level?

yes, we can. the persistence manager just has to ensure that only committed data 
is returned.

regards
  marcel

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/12/07, Marcel Reutegger <ma...@gmx.net> wrote:
> Jukka Zitting wrote:
> >> further note that write operations must occur within a single jdbc
> >> transaction, i.e. you can't get a new connection for every store/destroy
> >> operation.
> >
> > I think this is a design flaw. On the other hand we require
> > persistence managers to be "dumb" components, but then we rely on them
> > for complex features like transactions.
>
> I'd say those components are 'simple' rather than 'dumb' or 'complex'. the
> requirements are therefore also relatively simple: operations must have A(C)ID
> properties.
>
> A) a change log must be persisted as a whole or not at all
> I) while a change log is persisted a read must not see partially stored content

These could both be achieved also with connection pooling, just
acquire a connection at the beginning of PersistenceManager.store()
and commit the changes at the end of the method before releasing the
connection.

Similar pattern would also work for all the load() and exists()
methods to avoid the need to synchronize things on the prepared
statements.
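
A minimal sketch of that read path, assuming a pooled DataSource (the class
name is illustrative): with no shared statement there is nothing to
synchronize on, and the transaction isolation level, rather than a Java
lock, keeps partially stored change logs invisible to readers.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import javax.sql.DataSource;

// Sketch: per-call connection borrowing; no synchronized blocks needed.
public class PooledReader {

    private final DataSource dataSource;
    private final String selectSQL;

    public PooledReader(DataSource dataSource, String selectSQL) {
        this.dataSource = dataSource;
        this.selectSQL = selectSQL;
    }

    public byte[] load(String id) throws SQLException {
        Connection connection = dataSource.getConnection();
        try {
            // readers only ever see committed change logs
            connection.setTransactionIsolation(
                    Connection.TRANSACTION_READ_COMMITTED);
            PreparedStatement select = connection.prepareStatement(selectSQL);
            try {
                select.setString(1, id);
                ResultSet rs = select.executeQuery();
                try {
                    return rs.next() ? rs.getBytes(1) : null;
                } finally {
                    rs.close();
                }
            } finally {
                select.close();
            }
        } finally {
            connection.close(); // hands the connection back to the pool
        }
    }
}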

> D) durability, well you get the idea...

Obviously. :-)

> C) this is actually handled by the upper level

ACK, the key is the write lock on SharedItemStateManager. In fact, do
we even need the database persistence managers to be transactional
over multiple method calls? And to follow, could we in fact already
now remove the synchronization of read operations given that
consistency is already achieved on a higher level?

BR,

Jukka Zitting

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Marcel Reutegger <ma...@gmx.net>.
Jukka Zitting wrote:
>> further note that write operations must occur within a single jdbc
>> transaction, i.e. you can't get a new connection for every store/destroy
>> operation.
> 
> I think this is a design flaw. On the other hand we require
> persistence managers to be "dumb" components, but then we rely on them
> for complex features like transactions.

I'd say those components are 'simple' rather than 'dumb' or 'complex'. the 
requirements are therefore also relatively simple: operations must have A(C)ID 
properties.

A) a change log must be persisted as a whole or not at all
C) this is actually handled by the upper level
I) while a change log is persisted a read must not see partially stored content
D) durability, well you get the idea...

to implement those requirements you don't necessarily need a database (as you 
pointed out in another thread ;)).

regards
  marcel

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/3/07, Stefan Guggisberg <st...@gmail.com> wrote:
> i don't see how getting & releasing connections in every load, exists and store
> call would improve performance. could you please elaborate?
>
> please note that you wouldn't be able to use prepared statements over multiple
> load, store etc operations because you'd have to return the connection
> at the end of every call. the performance might therefore be even worse.

With a decent connection pool those get/release operations would be
very fast since most of the time you'd just be getting and releasing
existing database connections and prepared statements. I think the
pooling overhead should be very minor and easily countered by gains in
concurrency.

> further note that write operations must occur within a single jdbc
> transaction, i.e. you can't get a new connection for every store/destroy
> operation.

I think this is a design flaw. On the other hand we require
persistence managers to be "dumb" components, but then we rely on them
for complex features like transactions.

IMHO we should be looking at ways to make Jackrabbit properly
transactional already on top of the persistence layer. As Stefan
mentioned, this would imply changing not only the database persistence
managers, but also the item state management higher up the call chain.

As mentioned in my previous message, I'd be very interested in seeing
what such changes would mean in practice. It's probably a lot of work,
but there aren't any fundamental reasons why such changes couldn't
be implemented.

BR,

Jukka Zitting

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)

Posted by Bryan Davis <br...@bea.com>.


On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:

> hi bryan
> 
> On 3/2/07, Bryan Davis <br...@bea.com> wrote:
>> What persistence manager are you using?
>> 
>> Our tests indicate that the stock persistence managers are a significant
>> bottleneck for both writes and also initial reads to load the transient
>> store (on the order of .5 seconds per node when using a remote database like
>> MSSQL or Oracle).
> 
> what do you mean by "load the transient store"?
> 
>> 
>> The stock db persistence managers have all methods marked as "synchronized",
>> which blocks on the classdef (which means that even different persistence
>> managers for different workspaces will serialize all load, exists and store
> 
> assuming you're talking about DatabasePersistenceManager:
> the store/destroy methods are 'synchronized' on the instance, not on
> the 'classdef'.
> see e.g. 
> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> 
> the load/exists methods are synchronized on the specific prepared stmt they're
> using.
> 
> since every workspace uses its own persistence manager instance i can't
> follow your conclusion that all load, exists and store operations would
> be globally serialized across all workspaces.

Hm, this is my bad... It does seem that sync methods are on the instance.
Since the db persistence manager has "synchronized" on load, store and
exists, though, this would still serialize all of these operations for a
particular workspace.

>> operations).  Presumably this is because they allocate a JDBC connection at
>> startup and use it throughout, and the connection object is not
>> multithreaded.
> 
> what leads you to this assumption?

Are there other requirements that force all of these operations to be serialized for
a particular PM instance?  This seems like a pretty serious bottleneck (and,
in fact, is a pretty serious bottleneck when the database is remote from the
repository).

>> 
>> This problem isn't as noticeable when you are using embedded Derby and
>> reading/writing to the file system, but when you are doing a network
>> operation to a database server, the network latency in combination with the
>> serialization of all database operations results in a significant
>> performance degradation.
> 
> again: serialization of 'all' database operations?

The distinction between all and all-for-a-workspace would really only be
relevant during versioning, right?
 
>> 
>> The new bundle persistence manager (which isn't yet in SVN) improves things
>> dramatically since it inlines properties into the node, so loading or
>> persisting a node is only one operation (plus the additional connection for
>> the LOB) instead of one for the node and one for each property.  The
>> bundle persistence manager also uses prepared statements and keeps a
>> PM-level cache of nodes (with properties) and also non-existent nodes (which
>> permits many exists() calls to return without accessing the database).
>> 
>> Changing all db persistence managers to use a datasource and get and release
>> connections inside of load, exists and store operations and eliminating the
>> method synchronization is a relatively simple change that further improves
>> performance for connecting to database servers.
> 
> the use of datasources, connection pools and the like have been discussed
> in extenso on the list. see e.g.
> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> http://issues.apache.org/jira/browse/JCR-313
> 
> i don't see how getting & releasing connections in every load, exists and
> store
> call would improve performance. could you please elaborate?
> 
> please note that you wouldn't be able to use prepared statements over multiple
> load, store etc operations because you'd have to return the connection
> at the end
> of every call. the performance might therefore be even worse.
> 
> further note that write operations must occur within a single jdbc
> transaction, i.e.
> you can't get a new connection for every store/destroy operation.
> 
> wrt synchronization:
> concurrency is controlled outside the persistence manager on a higher level.
> eliminating the method synchronization would imo therefore have *no* impact
> on concurrency/performance.

So you are saying that it is impossible to concurrently load or store data
in Jackrabbit?

>> There is a persistence manager with an ASL license called
>> "DataSourcePersistenceManager" which seems to the PM of choice for people
>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
>> statements and eliminates the current single-connection issues associated
>> with all of the stock db PMs.  It doesn't seem to have been submitted back
>> to the Jackrabbit project.  If you Google for
>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
>> should be able to find it.
> 
> thanks for the hint. i am aware of this pm and i had a look at it a couple of
> months ago. the major issue was that it didn't implement the correct/required
> semantics. it used a new connection for every write operation which
> clearly violates the contract that the write operations should occur within
> a jdbc transaction bracket. further it creates a prepared stmt on every
> load, store etc. which is hardly efficient...

Yes, this PM does have this issue.  The bundle PM implements prepared
statements in the correct way.

>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
>> use the Oracle-specific PMs because the 10g drivers support the standard
>> BLOB API (in addition to the Oracle-specific BLOB API required by the older
>> 9i drivers).  This is true even if you are connecting to an older database
>> server as the limitation was in the driver itself.  Frankly you should never
>> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
>> complete rewrite.  Make sure you use the new driver package because the 10g
>> driver JAR also includes the older 9i drivers for backward-compatibility.
>> The new driver is in a new package (can't remember the exact name off the
>> top of my head).
> 
> thanks for the information.
> 
> cheers
> stefan

We are very interested in getting a good understanding of the specifics of
how PMs work, as initial reads and writes, according to our profiling, are
spending 80-90% of the time inside the PM.

Bryan.


Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Bryan Davis <br...@bea.com>.
Some drivers also reuse the compiled statement for a PreparedStatement even
if a new PreparedStatement instance is created each time.  This seems to
be the case for the drivers we are using, although I can't say that
conclusively...
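
For Oracle in particular, the 10g driver lets you turn this on explicitly
("implicit statement caching") rather than hoping the driver does it for
you.  A sketch, assuming the Oracle 10g JDBC driver; the URL, credentials
and cache size are placeholders:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import oracle.jdbc.OracleConnection;
import oracle.jdbc.pool.OracleDataSource;

class CachedStatements {
    static void demo() throws SQLException {
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:@dbhost:1521:ORCL");  // placeholder URL
        ds.setUser("jackrabbit");
        ds.setPassword("secret");
        ds.setImplicitCachingEnabled(true);  // cache compiled statements

        Connection con = ds.getConnection();
        ((OracleConnection) con).setStatementCacheSize(20);

        // the second prepare with identical SQL is served from the cache
        PreparedStatement s1 = con.prepareStatement("select 1 from dual");
        s1.close();  // close() returns the statement to the driver's cache
        PreparedStatement s2 = con.prepareStatement("select 1 from dual");
        s2.close();
        con.close();
    }
}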


On 3/3/07 7:11 AM, "Stefan Guggisberg" <st...@gmail.com> wrote:

>> There is a persistence manager with an ASL license called
>> "DataSourcePersistenceManager" which seems to the PM of choice for people
>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
>> statements and eliminates the current single-connection issues associated
>> with all of the stock db PMs.  It doesn't seem to have been submitted back
>> to the Jackrabbit project.  If you Google for
>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
>> should be able to find it.
> 
> thanks for the hint. i am aware of this pm and i had a look at it a couple of
> months ago. the major issue was that it didn't implement the correct/required
> semantics. it used a new connection for every write operation which
> clearly violates the contract that the write operations should occur within
> a jdbc transaction bracket. further it creates a prepared stmt on every
> load, store etc. which is hardly efficient...


Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Stefan Guggisberg <st...@gmail.com>.
hi bryan

On 3/2/07, Bryan Davis <br...@bea.com> wrote:
> What persistence manager are you using?
>
> Our tests indicate that the stock persistence managers are a significant
> bottleneck for both writes and initial reads to load the transient
> store (on the order of .5 seconds per node when using a remote database like
> MSSQL or Oracle).

what do you mean by "load the transient store"?

>
> The stock db persistence managers have all methods marked as "synchronized",
> which blocks on the classdef (which means that even different persistence
> managers for different workspaces will serialize all load, exists and store

assuming you're talking about DatabasePersistenceManager:
the store/destroy methods are 'synchronized' on the instance, not on
the 'classdef'.
see e.g. http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html

the load/exists methods are synchronized on the specific prepared stmt they're
using.
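
in java terms, the three locking scopes look like this (bare illustration
of the semantics, not jackrabbit code):

class ExamplePM {
    private final Object loadStmt = new Object(); // stands in for a prepared stmt

    public synchronized void store() {
        // locks this instance only; other pm instances are unaffected
    }

    public static synchronized void storeStatic() {
        // locks ExamplePM.class -- this *would* serialize across all instances
    }

    public void load() {
        synchronized (loadStmt) {
            // locks just this statement, independently of store()
        }
    }
}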

since every workspace uses its own persistence manager instance i can't
follow your conclusion that all load, exists and store operations would
be globally serialized across all workspaces.

> operations).  Presumably this is because they allocate a JDBC connection at
> startup and use it throughout, and the connection object is not
> thread-safe.

what leads you to this assumption?

>
> This problem isn't as noticeable when you are using embedded Derby and
> reading/writing to the file system, but when you are doing a network
> operation to a database server, the network latency in combination with the
> serialization of all database operations results in a significant
> performance degradation.

again: serialization of 'all' database operations?

>
> The new bundle persistence manager (which isn't yet in SVN) improves things
> dramatically since it inlines properties into the node, so loading or
> persisting a node is only one operation (plus the additional connection for
> the LOB) instead of one for the node and one for each property.  The
> bundle persistence manager also uses prepared statements and keeps a
> PM-level cache of nodes (with properties) and also non-existent nodes (which
> permits many exists() calls to return without accessing the database).
>
> Changing all db persistence managers to use a datasource, getting and
> releasing connections inside the load, exists and store operations, and
> eliminating the method synchronization is a relatively simple change that
> further improves performance when connecting to database servers.

the use of datasources, connection pools and the like have been discussed
in extenso on the list. see e.g.
http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
http://issues.apache.org/jira/browse/JCR-313

i don't see how getting & releasing connections in every load, exists and store
call would improve performance. could you please elaborate?

please note that you wouldn't be able to reuse prepared statements over
multiple load, store etc. operations because you'd have to return the
connection at the end of every call. the performance might therefore be
even worse.

further note that write operations must occur within a single jdbc
transaction, i.e. you can't get a new connection for every store/destroy
operation.
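
i.e. the write path needs a transaction bracket roughly like this (a sketch
only; storeNode/destroyNode are stand-ins for the pm's write operations):

import java.sql.Connection;
import java.sql.SQLException;

class ChangeSetWriter {
    void storeChangeSet(Connection con) throws SQLException {
        con.setAutoCommit(false);
        try {
            storeNode(con, "node-1");    // every write uses the same connection
            destroyNode(con, "node-2");
            con.commit();                // the change set commits atomically...
        } catch (SQLException e) {
            con.rollback();              // ...or is rolled back as a whole
            throw e;
        }
    }

    private void storeNode(Connection con, String id) throws SQLException {
        // insert/update the node's row (stub)
    }

    private void destroyNode(Connection con, String id) throws SQLException {
        // delete the node's row (stub)
    }
}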

wrt synchronization:
concurrency is controlled outside the persistence manager on a higher level.
eliminating the method synchronization would imo therefore have *no* impact
on concurrency/performance.
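
e.g. a layer above the pm can guard access like this (illustration only,
not jackrabbit's actual item state manager):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GuardedStore {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    Object load(String id) {
        lock.readLock().lock();       // many readers may proceed concurrently
        try {
            return pmLoad(id);
        } finally {
            lock.readLock().unlock();
        }
    }

    void store(Object changeSet) {
        lock.writeLock().lock();      // writers get exclusive access
        try {
            pmStore(changeSet);
        } finally {
            lock.writeLock().unlock();
        }
    }

    private Object pmLoad(String id) { return null; /* stub pm call */ }
    private void pmStore(Object changeSet) { /* stub pm call */ }
}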

>
> There is a persistence manager with an ASL license called
> "DataSourcePersistenceManager" which seems to the PM of choice for people
> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
> statements and eliminates the current single-connection issues associated
> with all of the stock db PMs.  It doesn't seem to have been submitted back
> to the Jackrabbit project.  If you Google for
> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
> should be able to find it.

thanks for the hint. i am aware of this pm and i had a look at it a couple of
months ago. the major issue was that it didn't implement the correct/required
semantics. it used a new connection for every write operation which
clearly violates the contract that the write operations should occur within
a jdbc transaction bracket. further it creates a prepared stmt on every
load, store etc. which is hardly efficient...

>
> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
> use the Oracle-specific PMs because the 10g drivers support the standard
> BLOB API (in addition to the Oracle-specific BLOB API required by the older
> 9i drivers).  This is true even if you are connecting to an older database
> server as the limitation was in the driver itself.  Frankly you should never
> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
> complete rewrite.  Make sure you use the new driver package because the 10g
> driver JAR also includes the older 9i drivers for backward-compatibility.
> The new driver is in a new package (can't remember the exact name off the
> top of my head).

thanks for the information.

cheers
stefan

>
> Bryan.
>

Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Sriram Narayanan <sr...@gmail.com>.
Hi:

I've read the complete thread, and am responding to the first post
in the thread.

On 3/3/07, Bryan Davis <br...@bea.com> wrote:
> What persistence manager are you using?
>

OraclePersistenceManager

>
> This problem isn't as noticeable when you are using embedded Derby and
> reading/writing to the file system, but when you are doing a network
> operation to a database server, the network latency in combination with the
> serialization of all database operations results in a significant
> performance degradation.
>

> The new bundle persistence manager (which isn't yet in SVN) improves things
> dramatically since it inlines properties into the node, so loading or
> persisting a node is only one operation (plus the additional connection for
> the LOB) instead of one for the node and one for each property.  The
> bundle persistence manager also uses prepared statements and keeps a
> PM-level cache of nodes (with properties) and also non-existent nodes (which
> permits many exists() calls to return without accessing the database).
>

Hmm... are you saying that it's possible to get better results than
what I've reported at
http://permalink.gmane.org/gmane.comp.apache.jackrabbit.user/2436 ?

> There is a persistence manager with an ASL license called
> "DataSourcePersistenceManager" which seems to the PM of choice for people
> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
> statements and eliminates the current single-connection issues associated
> with all of the stock db PMs.  It doesn't seem to have been submitted back
> to the Jackrabbit project.  If you Google for
> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
> should be able to find it.
>

Ack. I'll go download the Magnolia source. I tried doing so just now but
got a "page not found" error. I'll try again later, and write to the
Magnolia webmaster if required.

> Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
> use the Oracle-specific PMs because the 10g drivers support the standard
> BLOB API (in addition to the Oracle-specific BLOB API required by the older
> 9i drivers).  This is true even if you are connecting to an older database
> server as the limitation was in the driver itself.  Frankly you should never
> use the 9i drivers as they are pretty buggy and the 10g drivers represent a
> complete rewrite.  Make sure you use the new driver package because the 10g
> driver JAR also includes the older 9i drivers for backward-compatibility.
> The new driver is in a new package (can't remember the exact name off the
> top of my head).
>

Oh. Thanks for this information. I'll look this up and ensure that I
use the Oracle 10g drivers.
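
If I've understood correctly, the standard BLOB API usage would look like
the following sketch (table and column names are placeholders; with the 9i
driver one had to go through oracle.sql.BLOB instead):

import java.io.ByteArrayInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BlobWriter {
    void writeBlob(Connection con, String nodeId, byte[] data)
            throws SQLException {
        PreparedStatement stmt = con.prepareStatement(
            "update NODE set NODE_DATA = ? where NODE_ID = ?");
        try {
            // portable JDBC call; no cast to oracle.sql.BLOB required
            stmt.setBinaryStream(1, new ByteArrayInputStream(data), data.length);
            stmt.setString(2, nodeId);
            stmt.executeUpdate();
        } finally {
            stmt.close();
        }
    }
}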

> Bryan.
>
-- Sriram

Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")

Posted by Bryan Davis <br...@bea.com>.
What persistence manager are you using?

Our tests indicate that the stock persistence managers are a significant
bottleneck for both writes and initial reads to load the transient
store (on the order of .5 seconds per node when using a remote database like
MSSQL or Oracle).

The stock db persistence managers have all methods marked as "synchronized",
which blocks on the classdef (which means that even different persistence
managers for different workspaces will serialize all load, exists and store
operations).  Presumably this is because they allocate a JDBC connection at
startup and use it throughout, and the connection object is not
thread-safe.

This problem isn't as noticeable when you are using embedded Derby and
reading/writing to the file system, but when you are doing a network
operation to a database server, the network latency in combination with the
serialization of all database operations results in a significant
performance degradation.

The new bundle persistence manager (which isn't yet in SVN) improves things
dramatically since it inlines properties into the node, so loading or
persisting a node is only one operation (plus the additional connection for
the LOB) instead of one for the node and one for each property.  The
bundle persistence manager also uses prepared statements and keeps a
PM-level cache of nodes (with properties) and also non-existent nodes (which
permits many exists() calls to return without accessing the database).
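
The negative-cache idea is roughly the following (a sketch of the concept
only, not the bundle PM's actual code; MISSING is a sentinel for ids known
to be absent):

import java.util.HashMap;
import java.util.Map;

class BundleCache {
    private static final Object MISSING = new Object();  // negative entry
    private final Map<String, Object> cache = new HashMap<String, Object>();

    synchronized boolean exists(String nodeId) {
        Object cached = cache.get(nodeId);
        if (cached != null) {
            return cached != MISSING;  // hit, positive or negative: no DB access
        }
        Object bundle = loadFromDatabase(nodeId);  // miss: one DB round trip
        cache.put(nodeId, bundle != null ? bundle : MISSING);
        return bundle != null;
    }

    private Object loadFromDatabase(String nodeId) {
        return null;  // stub for the real bundle query
    }
}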

Changing all db persistence managers to use a datasource, getting and
releasing connections inside the load, exists and store operations, and
eliminating the method synchronization is a relatively simple change that
further improves performance when connecting to database servers.
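
The per-call pattern would look something like this (a sketch; the
DataSource is assumed to come from a connection pool, and the table name
is a placeholder):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

class PooledPM {
    private final DataSource ds;

    PooledPM(DataSource ds) {
        this.ds = ds;
    }

    boolean exists(String nodeId) throws SQLException {
        Connection con = ds.getConnection();  // borrow from the pool
        try {
            PreparedStatement stmt = con.prepareStatement(
                "select 1 from NODE where NODE_ID = ?");
            try {
                stmt.setString(1, nodeId);
                ResultSet rs = stmt.executeQuery();
                try {
                    return rs.next();
                } finally {
                    rs.close();
                }
            } finally {
                stmt.close();
            }
        } finally {
            con.close();  // return to the pool; no instance-wide lock needed
        }
    }
}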

There is a persistence manager with an ASL license called
"DataSourcePersistenceManager" which seems to the PM of choice for people
using Magnolia (which is backed by Jackrabbit).  It also uses prepared
statements and eliminates the current single-connection issues associated
with all of the stock db PMs.  It doesn't seem to have been submitted back
to the Jackrabbit project.  If you Google for
"com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" you
should be able to find it.

Finally, if you always use the Oracle 10g JDBC drivers, you do not need to
use the Oracle-specific PMs because the 10g drivers support the standard
BLOB API (in addition to the Oracle-specific BLOB API required by the older
9i drivers).  This is true even if you are connecting to an older database
server as the limitation was in the driver itself.  Frankly you should never
use the 9i drivers as they are pretty buggy and the 10g drivers represent a
complete rewrite.  Make sure you use the new driver package because the 10g
driver JAR also includes the older 9i drivers for backward-compatibility.
The new driver is in a new package (can't remember the exact name off the
top of my head).

Bryan.
