You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-user@db.apache.org by Six Fried Rice <te...@sixfriedrice.com> on 2008/04/15 21:24:49 UTC

Managing many databases

I'm a first-time poster so I hope I'm following protocol here. I  
searched the MarkMail archive and I don't think this is a FAQ.

We're considering using derby in an atypical situation, and I'm  
looking for some general feedback on how best to proceed. The  
application processes very large XML reports (100MB to 2GB) for our  
customers, and then presents the data in an explorable fashion through  
the browser. A typical report might produce around 500,000 records,  
with up to maybe 2 million records or so at the (rare) top end. To  
keep this under control, we are using this model:

1: The user interacts with our web site to set up an account and  
prepare to process a report.
2: When they opt to process a report, a Java WebStart application  
launches and processes the report with an embedded derby.
3: When the processing is complete, the derby database is jarred and  
uploaded to the server.
4: At that point, all the data is completely read-only.

All of this is largely working (less a few bugs) and we're very happy  
with the performance and the notion that the heavy lifting happens on  
the client side.

Now I'm trying to decide how we will handle the server side database  
interaction if we continue with this model. In the simplest case, I'd  
like to interact directly with the user's individual derby databases  
(one per report). This has several advantages:

1: We don't have any time-consuming import process to put all that  
data into a centralized database
2: We get built-in partitioning of the data on the server side which  
is good news for scalability
3: The data  model is somewhat complex and join-heavy, and I suspect  
several smaller databases will, in general, perform better than one  
very large database with hundreds of millions of records
4: Cleanup is a breeze: to remove a report we just whack a directory  
on the file system

But I'm not sure how best to actually manage all these databases. I  
suspect we will have on the order of 1000 databases in play, with  
maybe 20 of those being actively used at a single busy time. It is  
conceivable that we will have more than this, depending on the success  
of the system. So I guess I'm looking for any general insights, plus  
answers to a few concrete questions:

1: What are the performance characteristics of using zipped or jarred  
DBs? It doesn't bother me to unzip them, but I saw this option in the  
documentation and I was curious. Can these jars be in arbitrary  
locations on the file system, and be connected to ad-hoc? Can a derby  
server provide access to a jarred database at an arbitrary filesystem  
location?

2: Are there any performance concerns with having many databases in a  
single derby install? Would it be better to run one derby server, with  
1000 databases, or run multiple derby servers on the same hardware and  
partition the databases across them? I'm not looking for exact  
numbers, since they obviously depend on a lot of factors. But in  
general, can I load a ton of databases into derby server and be OK?  
(We have no problem throwing additional hardware at this system as  
needed.)

3: Can derby server discover new databases if I simply copy (or  
symlink?) a derby database directory to its DERBY_HOME? Or do the  
databases need to be *created* programmatically through JDBC?

4: Anybody have any experience with rails and derby? I see a few hints  
on line that people are doing it but I'm not too certain of the  
stability and details on what is supported. I'll have to write my own  
connection pooling and switching code in rails, which I don't think  
will be too tough. But an alternative would be to build a JEE-based  
web service to manage the derby interaction, and then have my rails  
application interact with that data server, if Rails/Derby is not a  
reliable or well-performing option.

I know this is an open-ended question. I appreciate any time and  
insight any of you may offer :)

Thanks,

Geoff

Re: Managing many databases

Posted by Six Fried Rice <te...@sixfriedrice.com>.

Thanks for the input. All excellent stuff...

One point of clarification:
>
> Give yourself lots of memory and lots of file descriptors. Use the  
> most
> modern JVM you have access to. And, if you're going to put thousands  
> of
> multi-gigabyte individual databases onto a single filesystem, you'll
> need a multi-terabyte file system, so make sure you get a solid server
> operating system that can handle that.

The original XML reports are into the multi-gigabyte size on occasion,  
but XML is inherently inefficient and these files are de-normalized.  
Once we get them into derby, 1GB of XML boils down to a 20 to 30MB  
derby database. So we won't have a major space problem with reasonable  
hardware.

Thanks so much for the feedback.

Geoff

Re: Managing many databases

Posted by Bryan Pendleton <bp...@amberpoint.com>.

> But I'm not sure how best to actually manage all these databases. I 
> suspect we will have on the order of 1000 databases in play, with maybe 
> 20 of those being actively used at a single busy time. 

When a database is not open it should be consuming no resources. There
is no global registry of databases in a particular Derby install; each
database is accessed independently, with its own set of in-memory data structures.

So, as long as you are careful to fully close and shutdown each database
when it is not in use, a carefully-written server app should be able
to alternate usage of a moderate-sized handful (20 seems well within reason)
of a much larger universe of latent databases on your filesystem (1000
seems well within reason here, too).

Give yourself lots of memory and lots of file descriptors. Use the most
modern JVM you have access to. And, if you're going to put thousands of
multi-gigabyte individual databases onto a single filesystem, you'll
need a multi-terabyte file system, so make sure you get a solid server
operating system that can handle that.

One caution: in recent history there have been a (small) number of bugs
involving Derby datastructures which were not 100% garbage collected
properly. Although those bugs have (we believe) been fixed, you might
encounter more of them, *plus* you need to be extremely careful in your
own application to ensure that you fully close all result sets, statements,
connections, and other similar JDBC objects and remove those objects
from all global hash tables and other collections, to ensure that when
you shut down one database, all of its resources can be safely cleaned
up by the JVM.

And set up a thorough test suite that explores these scalability
behaviors, and run both simulated tests and real live tests. You should
be able to run a simulated test fairly early in your development, by
simply fabricating a large number of synthetic databases and verifying
that your application works well in that environment. Modern JVMs have
superb profiling tools that you can use to verify the desired behavior.

Good luck, and keep us updated on how it goes!

thanks,

bryan

Re: Managing many databases

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.

Rick Hillegas <Ri...@Sun.COM> writes:

> Six Fried Rice wrote:

>> 1: What are the performance characteristics of using zipped or
>> jarred DBs? It doesn't bother me to unzip them, but I saw this
>> option in the documentation and I was curious.

The performance for zipped databases will depend on the access pattern
and the ratio between the size of the frequently accessed data in your
database and the size of the page cache. As long as all the data you
need to access is cached in the page cache, you should see no overhead
compared to having a normal database directory. If your application
frequently needs to fetch data into the cache (typically because the
page cache is smaller than the working set), you may see a small
overhead (roughly equivalent to the difference between
RandomAccessFile.read() and ZipInputStream.read() per page read into the
page cache).

>> 2: Are there any performance concerns with having many databases in
>> a single derby install? Would it be better to run one derby server,
>> with 1000 databases, or run multiple derby servers on the same
>> hardware and partition the databases across them? I'm not looking
>> for exact numbers, since they obviously depend on a lot of
>> factors. But in general, can I load a ton of databases into derby
>> server and be OK? (We have no problem throwing additional hardware
>> at this system as needed.)
> Hard to say. Most of our performance work has measured the performance
> of many clients hammering a single database. I don't know where Derby
> maxes  out in its ability to saturate multiple processors when you are
> running an application against many databases. I think that against a
> single database, there is a limit (4?) to the number of processors
> which a Derby server can keep busy. That may or may not scale up if
> your server is managing more than one database.

How many processors you can keep busy depends heavily on the type of
load. I have seen reports on derby-dev about machines with 32 processors
utilizing more than 80% of the available CPU power when running against
a single database (using the code in the 10.4 development branch). If
you have hot spots in your data (many threads accessing the same row),
you'll not be able to utilize that many processors, but since it seems
like each user of the system has his own private data set, that
shouldn't be a problem for you.

What could become a problem when you load 1000 different databases, is
that each of the databases has its own page cache, so you may have
problems finding the ideal page cache size. You'll probably end up using
a small page cache per database to make all the caches fit into the
JVM's heap, although the different users could have different needs
(some don't need all the cache they have been given, others need more
but cannot take the free space from one of the other databases). If all
users work against the same database, you can instead allocate a large,
shared page cache where those users that need much cache can utilize the
space other users don't need. That could give a more efficient use of
the available memory resources.

As Bryan mentioned, if you only have a limited number of databases
loaded at the same time and make sure the unused ones are shut down, it
shouldn't be problematic to have one database per user.

>> 3: Can derby server discover new databases if I simply copy (or
>> symlink?) a derby database directory to its DERBY_HOME? Or do the
>> databases need to be *created* programmatically through JDBC?

If I understand your question correctly, yes. If you create a symlink in
your server's derby.system.home directory pointing to a database
directory somewhere else in your file hierarchy, you should be able to
connect to the database via the network server by using a connection URL
on this form: jdbc:derby://hostname/name_of_symlink

-- 
Knut Anders

Re: Managing many databases

Posted by Six Fried Rice <te...@sixfriedrice.com>.

On Apr 15, 2008, at 1:42 PM, Rick Hillegas wrote:

> If the connection URL does not specify an absolute path to the  
> database, then the database spec is taken to be a path relative to  
> DERBY_HOME.

Ahhh... Somehow I missed this when reviewing the docs. I will look at  
the pages you linked, and thanks for spelling this out for me. That  
makes me a very happy person. I think I will be able to manage my  
databases in a directory structure of my choosing, and then connect to  
them easily. Very cool.

Thanks!

Geoff

Re: Managing many databases

Posted by Rick Hillegas <Ri...@Sun.COM>.

Six Fried Rice wrote:
> Rick:
>
> Thank you very much for your responses. I have a few followup 
> questions below:
>
> On Apr 15, 2008, at 1:03 PM, Rick Hillegas wrote:
>
>> Please take a look at the section titled "Accessing a read-only 
>> database in a zip/jar file" in the Derby Developer's Guide: 
>> http://db.apache.org/derby/docs/10.3/devguide/ The jars can live 
>> anywhere in the file system or on the classpath.
>
> Ok, super. I had read that information, but I wasn't sure if that 
> applied to embedded derby, or also to the derby network server. Can I 
> connect to a derby server and then target a jar file at an arbitrary 
> location? Also, does this perform reasonably well? My jars will be 
> between 5 and 50 MB. Of course I will do testing as well. Just 
> gathering info before I start writing the server-side code.
Hi Geoff,

Other than an extra prefix on the connection URL, the network server 
case shouldn't be any different from the embedded case. Just for 
reference, the network server itself is simply an application that 
embeds the Derby engine. Network or not, you should be able to connect 
to any database on the machine which is running the Derby engine. You 
may have difficulty connecting to databases that are on machines other 
than the engine's host. Please see the section titled "Conventions for 
specifying the database path" in the Derby Developer's Guide. I don't 
know what the performance comparison is between a read-only database 
inside a jar file and a read-only database unpacked on the local file 
system.
>
>> Derby has no heuristic for knowing where to look for databases.  I 
>> think that database discovery has to be done by your application.  
>> Basically, you need to locate the database via a JDBC connection URL.
>
> I'm thinking about derby network server here again. My understanding 
> is that I launch the server with a DERBY_HOME and it knows about 
> databases as named directories inside this home. Is that correct? 
Right. If the connection URL does not specify an absolute path to the 
database, then the database spec is taken to be a path relative to 
DERBY_HOME. Please see the following section in the Derby Developer's 
Guide: "Database connection examples".
> Can you connect to a derby server and then connect to a database at 
> any filesystem path? 
Provided that the filespec is on a disk on the machine where the Derby 
engine runs.
> And if things do need to be in DERBY_HOME, can I just copy or symlink 
> a directory to this location at the file system level, then use JDBC 
> to connect to the database by name?
Yes, Derby will find your database even if you indirect through a symlink.

As Bryan said, please let us know what your experiments show. This 
information may be helpful to other Derby users.

Hope this helps,
-Rick
>
> Thanks!
>
> Geoff
>

Re: Managing many databases

Posted by Six Fried Rice <te...@sixfriedrice.com>.

Rick:

Thank you very much for your responses. I have a few followup  
questions below:

On Apr 15, 2008, at 1:03 PM, Rick Hillegas wrote:

> Please take a look at the section titled "Accessing a read-only  
> database in a zip/jar file" in the Derby Developer's Guide: http://db.apache.org/derby/docs/10.3/devguide/ 
>  The jars can live anywhere in the file system or on the classpath.

Ok, super. I had read that information, but I wasn't sure if that  
applied to embedded derby, or also to the derby network server. Can I  
connect to a derby server and then target a jar file at an arbitrary  
location? Also, does this perform reasonably well? My jars will be  
between 5 and 50 MB. Of course I will do testing as well. Just  
gathering info before I start writing the server-side code.

> Derby has no heuristic for knowing where to look for databases.  I  
> think that database discovery has to be done by your application.   
> Basically, you need to locate the database via a JDBC connection URL.

I'm thinking about derby network server here again. My understanding  
is that I launch the server with a DERBY_HOME and it knows about  
databases as named directories inside this home. Is that correct? Can  
you connect to a derby server and then connect to a database at any  
filesystem path? And if things do need to be in DERBY_HOME, can I just  
copy or symlink a directory to this location at the file system level,  
then use JDBC to connect to the database by name?

Thanks!

Geoff

Re: Managing many databases

Posted by Rick Hillegas <Ri...@Sun.COM>.

Hi Geoff,

You have asked a lot of interesting questions. I will try to give you 
some feedback on some of your questions. Hopefully others can provide 
more information. Please see my responses inline...


Six Fried Rice wrote:
> I'm a first-time poster so I hope I'm following protocol here. I 
> searched the MarkMail archive and I don't think this is a FAQ.
>
> We're considering using derby in an atypical situation, and I'm 
> looking for some general feedback on how best to proceed. The 
> application processes very large XML reports (100MB to 2GB) for our 
> customers, and then presents the data in an explorable fashion through 
> the browser. A typical report might produce around 500,000 records, 
> with up to maybe 2 million records or so at the (rare) top end. To 
> keep this under control, we are using this model:
>
> 1: The user interacts with our web site to set up an account and 
> prepare to process a report.
> 2: When they opt to process a report, a Java WebStart application 
> launches and processes the report with an embedded derby.
> 3: When the processing is complete, the derby database is jarred and 
> uploaded to the server.
> 4: At that point, all the data is completely read-only.
>
> All of this is largely working (less a few bugs) and we're very happy 
> with the performance and the notion that the heavy lifting happens on 
> the client side.
>
> Now I'm trying to decide how we will handle the server side database 
> interaction if we continue with this model. In the simplest case, I'd 
> like to interact directly with the user's individual derby databases 
> (one per report). This has several advantages:
>
> 1: We don't have any time-consuming import process to put all that 
> data into a centralized database
> 2: We get built-in partitioning of the data on the server side which 
> is good news for scalability
> 3: The data  model is somewhat complex and join-heavy, and I suspect 
> several smaller databases will, in general, perform better than one 
> very large database with hundreds of millions of records
> 4: Cleanup is a breeze: to remove a report we just whack a directory 
> on the file system
>
> But I'm not sure how best to actually manage all these databases. I 
> suspect we will have on the order of 1000 databases in play, with 
> maybe 20 of those being actively used at a single busy time. It is 
> conceivable that we will have more than this, depending on the success 
> of the system. So I guess I'm looking for any general insights, plus 
> answers to a few concrete questions:
>
> 1: What are the performance characteristics of using zipped or jarred 
> DBs? It doesn't bother me to unzip them, but I saw this option in the 
> documentation and I was curious. Can these jars be in arbitrary 
> locations on the file system, and be connected to ad-hoc? Can a derby 
> server provide access to a jarred database at an arbitrary filesystem 
> location?
Please take a look at the section titled "Accessing a read-only database 
in a zip/jar file" in the Derby Developer's Guide: 
http://db.apache.org/derby/docs/10.3/devguide/ The jars can live 
anywhere in the file system or on the classpath.
>
> 2: Are there any performance concerns with having many databases in a 
> single derby install? Would it be better to run one derby server, with 
> 1000 databases, or run multiple derby servers on the same hardware and 
> partition the databases across them? I'm not looking for exact 
> numbers, since they obviously depend on a lot of factors. But in 
> general, can I load a ton of databases into derby server and be OK? 
> (We have no problem throwing additional hardware at this system as 
> needed.)
Hard to say. Most of our performance work has measured the performance 
of many clients hammering a single database. I don't know where Derby 
maxes  out in its ability to saturate multiple processors when you are 
running an application against many databases. I think that against a 
single database, there is a limit (4?) to the number of processors which 
a Derby server can keep busy. That may or may not scale up if your 
server is managing more than one database.
>
> 3: Can derby server discover new databases if I simply copy (or 
> symlink?) a derby database directory to its DERBY_HOME? Or do the 
> databases need to be *created* programmatically through JDBC?
Derby has no heuristic for knowing where to look for databases.  I think 
that database discovery has to be done by your application.  Basically, 
you need to locate the database via a JDBC connection URL.
>
> 4: Anybody have any experience with rails and derby? I see a few hints 
> on line that people are doing it but I'm not too certain of the 
> stability and details on what is supported. I'll have to write my own 
> connection pooling and switching code in rails, which I don't think 
> will be too tough. But an alternative would be to build a JEE-based 
> web service to manage the derby interaction, and then have my rails 
> application interact with that data server, if Rails/Derby is not a 
> reliable or well-performing option.
Sorry, I'm out of my league here.

Hope this is a little helpful,
-Rick
>
> I know this is an open-ended question. I appreciate any time and 
> insight any of you may offer :)
>
> Thanks,
>
> Geoff