Posted to derby-user@db.apache.org by Ture Munter <tu...@fysik.dtu.dk> on 2007/10/08 11:48:03 UTC

Inserting data into a database on a Derby network server

I'm using embedded Derby in a Java SE application, but I run Derby as
a network server while building the database. In that connection I
ran into problems.
I have a data set of 65000 rows that I wanted to add to a Derby
database (total size before the insert around 10000 records, 30MB).
Each row consists of some strings and some floats (6 fields in total),
and the operation requires two inserts and one select. During the
insert I run Derby as a network server, and the problem is that no
matter what I do I get OutOfMemory exceptions (Derby runs out of heap
space) every time. Setting the max memory use of the JVM to 1GB
allowed me to add around 17000 rows, but no more than that.

The problem was "solved" by running Derby embedded in the program that
inserted the data into the database and committing only after every
50th row. When running Derby embedded I didn't run out of memory.
After adding all 65000 records I can still run Derby as a network
server and do queries, but there are problems when trying to insert
data. There are no problems when running Derby embedded. Has anybody
experienced the same problem/features?

I also experience a decrease in performance after having inserted many
rows, but the slow-down is not as bad as described in the thread
"exponential increase in insert time".

Apart from the problems when inserting data in bulk, embedded Derby works
great in my application.

Thanks
- Ture


Re: Inserting data into a database on a Derby network server

Posted by Bryan Pendleton <bp...@amberpoint.com>.
> The problem was "solved" by running Derby embedded in the program that
> inserted the data into the database and committing only after every
> 50th row.

When you were running in the Network Server configuration, were you
also committing your transaction after every 50th row?

thanks,

bryan


Re: Foreign key constraints

Posted by Kristian Waagan <Kr...@Sun.COM>.
Williamson, Nick wrote:
> Hi all,
> 
> I see that in Derby, foreign key constraints don't support the deferred
> / deferrable syntax that I'm used to using in Oracle, and - creating
> data in IJ by running a script - validation seems to take place straight
> away by default (in other words, Derby doesn't wait until the COMMIT).
> Now what if I have a transaction inserting into several tables and one
> of the tables that I'm inserting into has an FK constraint pointing at
> one of the others? It would seem to be the case that unless I issue all
> the INSERTs in the "right" order, they will fail. Similarly, if I have a
> circular dependency where two tables each have an FK constraint pointing
> at the other one, I would imagine that I will never be able to insert
> data into those tables. Is there a way of changing Derby's behaviour
> with regard to the time when FK constraints are enforced?

Hello Nick,

I think you are out of luck on this one...
There is a Jira for it, but it has not seen any activity: 
https://issues.apache.org/jira/browse/DERBY-532

If you only need the deferred constraint checking in an initial 
population phase, you can of course add the constraint after the data 
is inserted. You should make sure the data is consistent first, 
though; I'm not sure Derby tells you which row(s) violate(s) the 
constraint when it fails the constraint addition.
Repeatedly dropping and recreating the indexes on a database seems 
infeasible to me, but for some situations it might be acceptable (?).
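
A minimal sketch of that approach in JDBC (the table and column names
are made up here, not taken from your schema):

   Statement s = conn.createStatement();
   // Bulk-load 'parent' and 'child' first, in any order; no FK
   // exists yet, so insertion order doesn't matter. Then:
   s.executeUpdate("ALTER TABLE child ADD CONSTRAINT child_parent_fk"
       + " FOREIGN KEY (parent_id) REFERENCES parent (id)");
   s.close();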


I don't know of any plans for implementing this functionality, but you 
are free to comment or vote on the issue, or do something else to help 
the cause :)


regards,
-- 
Kristian

NB! There used to be some restrictions on ALTER TABLE operations. I 
hope there are none left that make the proposed approach unusable.

> 
> TIA
> Nick


Foreign key constraints

Posted by "Williamson, Nick" <ni...@mentor.com>.
Hi all,

I see that in Derby, foreign key constraints don't support the deferred
/ deferrable syntax that I'm used to using in Oracle, and - creating
data in IJ by running a script - validation seems to take place straight
away by default (in other words, Derby doesn't wait until the COMMIT).
Now what if I have a transaction inserting into several tables and one
of the tables that I'm inserting into has an FK constraint pointing at
one of the others? It would seem to be the case that unless I issue all
the INSERTs in the "right" order, they will fail. Similarly, if I have a
circular dependency where two tables each have an FK constraint pointing
at the other one, I would imagine that I will never be able to insert
data into those tables. Is there a way of changing Derby's behaviour
with regard to the time when FK constraints are enforced?

TIA
Nick

Re: Inserting data into a database on a Derby network server

Posted by Kristian Waagan <Kr...@Sun.COM>.
John Embretsen wrote:
> Kristian Waagan wrote:
>> John Embretsen wrote:
>>> Hi Ture, 
>>
>> [ snip ]
>>
>>> I didn't think it was this easy to fill up the heap (by not 
>>> explicitly closing Statement objects) anymore (see e.g. DERBY-210), 
>>> but there are obviously some vulnerabilities left. So thank you for 
>>> sharing your code and for reporting this!
>>
>> Hi John,
>>
>> I think the reason why the OOME happened so fast in this application 
>> was because of the sorting (ORDER BY). I don't know enough about how 
>> sorting is implemented in Derby to have an opinion, but I think we 
>> should have a look at it.
>>
>> I will do a little pre-investigation and create a Jira if appropriate.
> 
> OK, thanks! From quickly reading the thread I was under the impression 
> that adding a call to Statement.close() solved the problem, but I guess 
> that was only part of it.
> 
> One of the DOTS tests I've been running also stumbled across a memory 
> leak related to sorting, see 
> http://issues.apache.org/jira/browse/DERBY-2176 . Interesting 
> coincidence. That was fixed with 10.2.2.0, though, so this must be 
> something else.

Hi again,

After a quick look at the repro, I have concluded that Derby does not 
have a bug in this area.
That said, it might be possible to code very defensively and try to 
make up for the lack of explicit closing of resources (statements and 
result sets). I believe this works in the embedded case, due to 
garbage collection. For this to work in client/server mode, the 
server must somehow be notified that the objects on the client side 
have been gc'ed.

Regarding the repro, only one change is needed to make it complete the 
loading: close the result set 'rs' inside the loop. It might or might 
not work when you remove the reconnection logic, depending on how many 
resources are tied to the statement object on the server side and how 
big the heap is.
The sort resources are tied to the result set.
Even better, close the statement or use a prepared statement.
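
A minimal sketch of that fix, assuming the lookup query from the repro
(the variable names are mine):

   PreparedStatement lookup = cDerby.prepareStatement(
       "SELECT id FROM Quaternary WHERE name = ?");
   while (moreRows) {              // 'moreRows' is a stand-in
       lookup.setString(1, name);
       ResultSet rs = lookup.executeQuery();
       try {
           if (rs.next()) {
               int id = rs.getInt(1);
               // ... use id ...
           }
       } finally {
           rs.close();   // frees the server-side (sort) resources
       }
   }
   lookup.close();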

Regarding the reduced insertion performance over time, this was simply 
caused by a missing index. After adding an index on the column 'name' 
in Quaternary, the times look like this on my computer:
Connecting to the database
5000 rows inserted (8s)
10000 rows inserted (5s)
15000 rows inserted (5s)
20000 rows inserted (5s)
25000 rows inserted (5s)
30000 rows inserted (5s)
35000 rows inserted (5s)
40000 rows inserted (5s)
45000 rows inserted (5s)
50000 rows inserted (5s)
55000 rows inserted (5s)
60000 rows inserted (5s)

real    1m11.429s
user    0m13.909s
sys     0m2.606s

Now that's a nice performance increase :) (from ~30 minutes to ~1 
minute). And you pay for it with an increased storage cost (for the 
index). With the index, the database doesn't have to scan the whole 
table on each insert to answer the 'SELECT id FROM Quaternary WHERE 
name=?' query.
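
For reference, the index can be added like this (the index name is my
own choice):

   Statement s = cDerby.createStatement();
   // Lets the WHERE name=? lookup use an index scan instead of a
   // full table scan on every insert.
   s.executeUpdate(
       "CREATE INDEX quaternary_name_idx ON Quaternary (name)");
   s.close();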



-- 
Kristian

> 
> 


Re: Inserting data into a database on a Derby network server

Posted by John Embretsen <Jo...@Sun.COM>.
Kristian Waagan wrote:
> John Embretsen wrote:
>> Hi Ture, 
> 
> [ snip ]
> 
>> I didn't think it was this easy to fill up the heap (by not explicitly 
>> closing Statement objects) anymore (see e.g. DERBY-210), but there are 
>> obviously some vulnerabilities left. So thank you for sharing your 
>> code and for reporting this!
> 
> Hi John,
> 
> I think the reason why the OOME happened so fast in this application was 
> because of the sorting (ORDER BY). I don't know enough about how sorting 
> is implemented in Derby to have an opinion, but I think we should have a 
> look at it.
> 
> I will do a little pre-investigation and create a Jira if appropriate.

OK, thanks! From quickly reading the thread I was under the impression that 
adding a call to Statement.close() solved the problem, but I guess that was only 
part of it.

One of the DOTS tests I've been running also stumbled across a memory leak 
related to sorting, see http://issues.apache.org/jira/browse/DERBY-2176 . 
Interesting coincidence. That was fixed with 10.2.2.0, though, so this must be 
something else.


-- 
John


Re: Inserting data into a database on a Derby network server

Posted by Kristian Waagan <Kr...@Sun.COM>.
John Embretsen wrote:
> Hi Ture, 

[ snip ]

> I didn't think it was this easy to fill up the heap (by not explicitly 
> closing Statement objects) anymore (see e.g. DERBY-210), but there are 
> obviously some vulnerabilities left. So thank you for sharing your code 
> and for reporting this!

Hi John,

I think the reason why the OOME happened so fast in this application was 
because of the sorting (ORDER BY). I don't know enough about how sorting 
is implemented in Derby to have an opinion, but I think we should have a 
look at it.

I will do a little pre-investigation and create a Jira if appropriate.


-- 
Kristian

> 
> 


Re: Inserting data into a database on a Derby network server

Posted by John Embretsen <Jo...@Sun.COM>.
Hi Ture,

Ture Munter wrote:
> For me this "minor" bug in my own code is not so bad, as the uploaded
> program is only used once (by myself) to insert all the data into the
> database. But the fact that Derby can run out of memory if somebody
> forgets to close created statements is potentially a more serious
> bug. Of course it requires an attacker to first be able to create a
> connection to the database server, and in that case he can do more
> interesting things than just making the server crash.

I would say making the server crash is quite interesting, especially 
when it is this easy... Anyway, I was running a test a while ago that 
contained code which resulted in similar errors. You may be interested 
in reading the related discussion we had on this list almost two years 
ago; it's available at:

http://www.nabble.com/OutOfMemoryErrors-when-testing-Derby-with-DOTS-t1010027.html

Let me quote myself:

"However, my main concern right now is that Derby is not robust enough to
handle code of this type without running out of memory within a
relatively short period of time. I guess that since (even) the DOTS
creators wrote such code, other Derby users may be inclined to do so in
the future."

Roughly speaking, two camps emerged during that discussion: Those who think that 
Derby should be able to withstand such code (e.g. not explicitly closing 
statement objects) if possible, and those who won't cut you any slack because 
this is not the recommended way to do it.

I didn't think it was this easy to fill up the heap (by not explicitly closing 
Statement objects) anymore (see e.g. DERBY-210), but there are obviously some 
vulnerabilities left. So thank you for sharing your code and for reporting this!


-- 
John


Re: Inserting data into a database on a Derby network server

Posted by Ture Munter <tu...@fysik.dtu.dk>.
Bryan Pendleton <bp...@...> writes:

> 
> > database. As the data are published scientific data I can share my 
> > source code
> > and the data. Everything is available on 
> > 
> > www.fysik.dtu.dk/~munter/derby/
> 
> Thanks for sharing your sample program. It looks like you are creating
> statements, but never closing them.
> 
> Each call to cDerby.createStatement() should be matched with a call to 
> close() on the statement.
> 
> thanks,
> 
> bryan
> 
> 

The program is the same one I used both when running embedded and in 
network server mode. Of course, the commit after 50 rows and the 
reconnect parts are things I only added when the original version of 
the program failed to add all records. But this exact program makes 
the network server run out of memory after 6956 and 23154 rows, 
respectively (as described in an earlier post).

But thanks for pointing out the missing call to close(). I think you 
spotted the memory leak there. And it explains why running 
"derbyrun.jar server runtimeinfo" showed so many Statements in the 
cache. Of course this would have gone undetected if I had only 
inserted a few thousand rows. I have done some runs today where I 
remember to call close() and don't reconnect all the time, and it 
seems to run without problems. I would have thought that still-active 
statements would be closed implicitly when closing the Connection, but 
that seems not to be the case (in network server mode).

For me this "minor" bug in my own code is not so bad as the uploaded program 
is only used once (by myself) to insert all data in the database. But that 
Derby can run out of memory if somebody forgets to close created statements is
potentially a more serious bug. Of course it requires an attacker to first be 
able to create a connection to the database server, and in that case he can do
more interesting things than just making the server crash.


Thanks for the replies
- Ture Munter





Re: Inserting data into a database on a Derby network server

Posted by Ture Munter <tu...@fysik.dtu.dk>.
Kristian Waagan <Kr...@...> writes:

> As Bryan says, you should close your statements. Even better would be
> to use a PreparedStatement for all your queries (including those
> without any variables).
> 
> Unless it is an application requirement, you could also make do with
> just one table and use (Prepared)Statement.getGeneratedKeys() if you
> need to obtain the unique identifier after insertion. Something like
> this:
>
> [SOURCE CODE]
> 
> Don't know how important it is anymore (with newer Java versions), but
> you might see a little improvement by using 'Float.valueOf(strings[3])'
> instead of 'new Float(strings[3])'.

Thank you for your suggestions; I'll keep them in mind when I work on 
other programs. The two tables are an application requirement: there 
are potentially many tables containing data (currently only around 
15), so we need to know where the data with a specific unique id ended 
up. It is certainly a better idea to use the built-in methods to 
obtain the last auto-generated key. Maybe the slowdown observed is 
caused by more and more rows needing to be sorted(?). I will try 
tomorrow to rewrite parts of it to use the Java methods to obtain the 
auto-generated keys.


> Further, you should not have to reconnect after a few hundred
> insertions. If you have to, it probably means one of two things: the
> application code is not optimal, or there is a bug in Derby.
> In this case, I *guess* that not closing the statements caused the
> heap to fill up.

See my other post to Bryan. I agree with you.


> Also, the reduced insertion rate can easily be observed with the repro.
> The number of rows is accumulated, the duration is not (i.e. the
> durations printed are all for inserting 5000 rows). The numbers below
> are from a run where a commit is done every 5000 rows, which turned
> out to be slightly worse than every 50 rows (clocked in at 28m36s):
> Connecting to the database
> [TIMING RESULTS CUT]
> 
> As can be seen, the time it takes to insert 5000 rows rises from 23
> seconds at startup to nearly 300 seconds (5 minutes). This has to
> be investigated as well. I'll see if I can have a look soon, but anyone
> else is free to check it out. If I get around to it, I'll modify the
> script slightly and create a Jira.
> 
> Thanks Ture for reporting this.
> 
> regards,

Thanks Kristian and Bryan for so quickly helping me out with the problem.

regards,

Ture



Re: Inserting data into a database on a Derby network server

Posted by Kristian Waagan <Kr...@Sun.COM>.
Bryan Pendleton wrote:
>> database. As the data are published scientific data I can share my 
>> source code
>> and the data. Everything is available on
>> www.fysik.dtu.dk/~munter/derby/
> 
> Thanks for sharing your sample program. It looks like you are creating
> statements, but never closing them.
> 
> Each call to cDerby.createStatement() should be matched with a call
> to close() on the statement.

Ture,

As Bryan says, you should close your statements. Even better would be 
to use a PreparedStatement for all your queries (including those 
without any variables).

Unless it is an application requirement, you could also make do with 
just one table and use (Prepared)Statement.getGeneratedKeys() if you 
need to obtain the unique identifier after insertion. Something like 
this:
   // Note that the first marker (?) is replaced with DEFAULT.
   dataInsert = cDerby.prepareStatement(
       "INSERT INTO Quaternary VALUES " +
       "(DEFAULT, ?, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?, ?)",
       Statement.RETURN_GENERATED_KEYS);
   ...
   affectedRows = dataInsert.executeUpdate();
   // Fetch the key generated for the identity column.
   ResultSet rs = dataInsert.getGeneratedKeys();
   rs.next();
   int sysid = rs.getInt(1);
   rs.close();

Don't know how important it is anymore (with newer Java versions), but 
you might see a little improvement by using 'Float.valueOf(strings[3])' 
instead of 'new Float(strings[3])'.

Further, you should not have to reconnect after a few hundred 
insertions. If you have to, it probably means one of two things: the 
application code is not optimal, or there is a bug in Derby.
In this case, I *guess* that not closing the statements caused the 
heap to fill up. In particular, the query sorts the data with an ORDER 
BY. The following excerpt of a heap histogram tells the story:

num   #instances    #bytes  class name
--------------------------------------
   1:   1360547    54421880  org.apache.derby.impl.store.access.sort.Node
   2:       293     6282608  [Lorg.apache.derby.impl.store.access.sort.Node;
   3:     33641     3463176  <constMethodKlass>
   4:     33641     2696744  <methodKlass>
   5:      4919     2097344  [B
   6:     47123     1986192  <symbolKlass>
   7:      2547     1468640  <constantPoolKlass>

But, where are all the statement objects?
I think this case deserves a little more investigation to make sure we 
don't have a leak related to the Node-objects in Derby.
Can anyone shed some light on this based on previous experience?


Also, the reduced insertion rate can easily be observed with the repro.
The number of rows is accumulated, the duration is not (i.e. the 
durations printed are all for inserting 5000 rows). The numbers below 
are from a run where a commit is done every 5000 rows, which turned 
out to be slightly worse than every 50 rows (clocked in at 28m36s):
Connecting to the database
5000 rows inserted (23s)
10000 rows inserted (38s)
15000 rows inserted (54s)
20000 rows inserted (77s)
25000 rows inserted (93s)
30000 rows inserted (104s)
35000 rows inserted (127s)
40000 rows inserted (162s)
45000 rows inserted (186s)
50000 rows inserted (208s)
55000 rows inserted (234s)
60000 rows inserted (279s)

real    30m38.104s
user    0m20.898s
sys     0m4.452s

As can be seen, the time it takes to insert 5000 rows rises from 23 
seconds at startup to nearly 300 seconds (5 minutes). This has to 
be investigated as well. I'll see if I can have a look soon, but anyone 
else is free to check it out. If I get around to it, I'll modify the 
script slightly and create a Jira.


Thanks Ture for reporting this.


regards,
-- 
Kristian


> 
> thanks,
> 
> bryan
> 


Re: Inserting data into a database on a Derby network server

Posted by Bryan Pendleton <bp...@amberpoint.com>.
> database. As the data are published scientific data I can share my source code
> and the data. Everything is available on 
> 
> www.fysik.dtu.dk/~munter/derby/

Thanks for sharing your sample program. It looks like you are creating
statements, but never closing them.

Each call to cDerby.createStatement() should be matched with a call to close()
on the statement.
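
For example (a sketch only; wrapping the work in try/finally makes
sure the statement is closed even if the update fails):

   Statement stmt = cDerby.createStatement();
   try {
       stmt.executeUpdate(sql);   // 'sql' is whatever you execute
   } finally {
       stmt.close();   // releases the server-side resources as well
   }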

thanks,

bryan


Re: Inserting data into a database on a Derby network server

Posted by Ture Munter <tu...@fysik.dtu.dk>.
Kristian Waagan <Kr...@...> writes:

> 
> Ture Munter wrote:
> > I'm using embedded Derby in a Java SE application, but I run Derby
> > as a network server while building the database. In that connection
> > I ran into problems.
> > I have a data set of 65000 rows that I wanted to add to a Derby
> > database (total size before the insert around 10000 records, 30MB).
> > Each row consists of some strings and some floats (6 fields in
> > total), and the operation requires two inserts and one select.
> > During the insert I run Derby as a network server, and the problem
> > is that no matter what I do I get OutOfMemory exceptions (Derby
> > runs out of heap space) every time. Setting the max memory use of
> > the JVM to 1GB allowed me to add around 17000 rows, but no more
> > than that.
> 
> Hello Ture,

Hello Kristian,

Thanks for the quick response, I'm very impressed :-)


> 
> Is this an out-of-the-box configuration, or have you configured Derby
> by specifying properties?
> If you have tuned it, have you set the Derby page cache size
> (derby.system.pageCacheSize)? The page size?
> Is there only one connection inserting data?

I use Derby in the out-of-the-box configuration; it worked fine and 
performance was good until this problem showed up. I'm going to use 
the database in read-only mode and in the embedded configuration. I 
forgot to mention that I'm using Derby 10.3.1.4 and Java SE JDK 
1.5.0_12.


> Sounds to me as if you have a program generating the data. Is this
> something you could share with the community?
> That would be great; then people could run it and see if they see the
> same behavior. Also, if there's a bug it will be easier to find if we
> have a running repro.

All the data are read in from a text file, parsed and inserted into 
the database. As the data are published scientific data, I can share 
my source code and the data. Everything is available on 

www.fysik.dtu.dk/~munter/derby/

There are a few comments in the source code that should explain what 
happens, including the SQL to create the needed tables. Everything 
should be pretty self-explanatory, as it only reads the text file 
line by line and inserts the rows into the database. 

In order to get it to work, I am constantly closing the connection 
and reconnecting to the network server; this is done every 300 rows. 
This happens around line 215. 
If this closing/reopening of the connection is disabled and Derby runs 
as a network server (and the database is empty), it runs out of memory 
after 6956 rows (on my machine with 2GB of memory). If the 
closing/reopening is enabled, it only runs out of memory after 16198 
rows (when restarting with the existing database of 6956 rows).

Maybe it would work better/without problems if I used the methods to 
get the last auto-generated key - but it shouldn't be necessary to be 
so careful in order not to "break" anything :-) 


> If that is not possible, can you run with Java SE 6 and tell it to
> dump the heap on OOME?
> Another possibility is to use jmap to get a histogram of the 
> heap-content (add the option to only get live objects as well), and 
> this would be easy to post.


I haven't tried any of those things yet; maybe I can find time for 
that tomorrow, but for now have fun with the program I uploaded :-)


> > The problem was "solved" by running Derby embedded in the program that
> > inserted data into the database and only commit every 50th row. When
> > running Derby embedded I didn't run out of memory. After adding all
> > 65000 records I can still run Derby as network server and do queries,
> > but there are problems when trying to insert data. There are no problems
> > when running Derby embedded. Have anybody experienced the same
> > problem/features?
> > 
> > I also experience a decrease in performance after having inserted many
> > rows, but the slow-down is not as bad as described in the thread
> > "exponential increase in insert time".
> 
> Just out of curiosity, are you using Statement.executeUpdate, prepared 
> statements or batching when inserting the data?


A little executeUpdate and some PreparedStatements, committing only 
after every 50th row has been inserted.

Thanks,
Ture Munter





Re: Inserting data into a database on a Derby network server

Posted by Kristian Waagan <Kr...@Sun.COM>.
Ture Munter wrote:
> I'm using embedded Derby in a Java SE application, but I run Derby as
> a network server while building the database. In that connection I
> ran into problems.
> I have a data set of 65000 rows that I wanted to add to a Derby
> database (total size before the insert around 10000 records, 30MB).
> Each row consists of some strings and some floats (6 fields in total),
> and the operation requires two inserts and one select. During the
> insert I run Derby as a network server, and the problem is that no
> matter what I do I get OutOfMemory exceptions (Derby runs out of heap
> space) every time. Setting the max memory use of the JVM to 1GB
> allowed me to add around 17000 rows, but no more than that.

Hello Ture,

Is this an out-of-the-box configuration, or have you configured Derby 
by specifying properties?
If you have tuned it, have you set the Derby page cache size 
(derby.system.pageCacheSize)? The page size?
Is there only one connection inserting data?

Sounds to me as if you have a program generating the data. Is this 
something you could share with the community?
That would be great; then people could run it and see if they see the 
same behavior. Also, if there's a bug it will be easier to find if we 
have a running repro.

If that is not possible, can you run with Java SE 6 and tell it to dump 
the heap on OOME?
Another possibility is to use jmap to get a histogram of the 
heap-content (add the option to only get live objects as well), and this 
would be easy to post.

> 
> The problem was "solved" by running Derby embedded in the program that
> inserted the data into the database and committing only after every
> 50th row. When running Derby embedded I didn't run out of memory.
> After adding all 65000 records I can still run Derby as a network
> server and do queries, but there are problems when trying to insert
> data. There are no problems when running Derby embedded. Has anybody
> experienced the same problem/features?
> 
> I also experience a decrease in performance after having inserted many
> rows, but the slow-down is not as bad as described in the thread
> "exponential increase in insert time".

Just out of curiosity, are you using Statement.executeUpdate, prepared 
statements or batching when inserting the data?
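
For reference, JDBC batching looks roughly like this (the column names
and types here are placeholders, not your actual schema):

   PreparedStatement ps = cDerby.prepareStatement(
       "INSERT INTO Quaternary (name, val) VALUES (?, ?)");
   for (DataRow r : parsedRows) {  // 'DataRow'/'parsedRows' are mine
       ps.setString(1, r.name);
       ps.setFloat(2, r.val);
       ps.addBatch();
   }
   ps.executeBatch();              // one round trip for all inserts
   ps.close();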

> 
> Apart from the problems when inserting data in bulk, embedded Derby
> works great in my application.

That is nice to hear :)


regards,
-- 
Kristian


> 
> Thanks
> - Ture
>