Posted to derby-user@db.apache.org by Lars Clausen <lc...@statsbiblioteket.dk> on 2005/11/25 09:14:54 UTC

Workarounds for too many open files?

Trying to import a 10GB text file (about 50x10^6 entries) into a single
Derby table, I got the following error:

ij> connect 'jdbc:derby:cdxdb'; ij> elapsedtime on; ij> CALL
SYSCS_UTIL.SYSCS_IMPORT_DATA ( null, 'CDX',
'URL,IP,MIMETYPE,LENGTH,ARCFILE,OFFSET', '1,2,4,5,6,7',
'/home/lc/index-backping.cdx', '`', null, null, 1);
ERROR 38000: The exception 'SQL Exception: Exception during creation of
file
/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp
for container' was thrown while evaluating an expression.
ERROR XSDF1: Exception during creation of
file
/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp
for container
ERROR XJ001: Java exception:
'/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp (Too many open files): java.io.FileNotFoundException'.

Has anybody seen this and found a workaround?  The table has no external
dependencies (in fact it is the only one in the database) and has two
indices on it.  The schema is

url varchar(3000), ip char(16), ingestdate date, mimetype varchar(256),
length bigint, arcfile varchar(256), offset bigint, fullmd5 char(20),
bodymd5 char(20), etag varchar(256), lastmodified date
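
In DDL form, that is roughly the following (the index name here is just an
example, and the second index is omitted):

create table cdx (
    url varchar(3000),
    ip char(16),
    ingestdate date,
    mimetype varchar(256),
    length bigint,
    arcfile varchar(256),
    offset bigint,
    fullmd5 char(20),
    bodymd5 char(20),
    etag varchar(256),
    lastmodified date
);
create index cdxurl on cdx(url);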

I guess I'll have to go file a bug on it, too, even if there is a
workaround.

-Lars


RE: Workarounds for too many open files?

Posted by Michael Segel <ms...@segel.com>.

-----Original Message-----
From: Lars Clausen [mailto:lc@statsbiblioteket.dk] 
Sent: Tuesday, November 29, 2005 9:29 AM
To: Derby Discussion
Subject: Re: Workarounds for too many open files?

On Mon, 2005-11-28 at 15:17, Michael Segel wrote:
> On Monday 28 November 2005 04:10, Lars Clausen wrote:
> 
> I guess anyone who's going to try and create a 100GB database will also
> run into this problem. (Remember the discussions on scalability?)
> 
> Since I haven't seen the code on how Derby stores data to the disk, why
> so many files per index?
> 
> A possible, but ugly workaround is to create the index as you load the
> table. (Ugly in that you'll take a performance hit. However, this should
> work.)

That's what I did first, later I split it into two to see why it
crashed.

-Lars

Ah. Ok.

So the temporary files are going to be an issue regardless, whenever you
build your index.

Geez, trying to migrate to a tablespace arrangement (either cooked or raw)
seems awfully appealing.

I guess this goes back to an earlier point I had been trying to make:
you have to make some basic design decisions that impact not only the
small portion of the code you are trying to fix or enhance, but the
overall, underlying structure of Derby.

I mean, on one hand it's very feasible for someone to say that for your
application Derby is the wrong database, because it wasn't designed to
function that way. On the other hand, it's quite possible to say "Well, we
want to see Derby succeed as an all-around, 100% Java RDBMS", and then you
have to keep that perspective when you make design decisions.

I guess your experience will dampen any large-database tests. :-(



Re: Workarounds for too many open files?

Posted by Lars Clausen <lc...@statsbiblioteket.dk>.
On Mon, 2005-11-28 at 15:17, Michael Segel wrote:
> On Monday 28 November 2005 04:10, Lars Clausen wrote:
> 
> I guess anyone who's going to try and create a 100GB database will also run
> into this problem. (Remember the discussions on scalability?)
> 
> Since I haven't seen the code on how Derby stores data to the disk, why so 
> many files per index?
> 
> A possible, but ugly workaround is to create the index as you load the table.
> (Ugly in that you'll take a performance hit. However, this should work.)

That's what I did first, later I split it into two to see why it
crashed.

-Lars



Re: Workarounds for too many open files?

Posted by Michael Segel <ms...@segel.com>.
On Monday 28 November 2005 04:10, Lars Clausen wrote:

I guess anyone who's going to try and create a 100GB database will also run
into this problem. (Remember the discussions on scalability?)

Since I haven't seen the code on how Derby stores data to the disk, why so 
many files per index?

A possible, but ugly workaround is to create the index as you load the table.
(Ugly in that you'll take a performance hit. However, this should work.)
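
In ij terms, on a freshly created (empty) table, that would look something
like this (just a sketch, reusing the statements from Lars' mail; I haven't
tried it on a table this size):

ij> create index cdxurl on cdx(url);
ij> CALL SYSCS_UTIL.SYSCS_IMPORT_DATA ( null, 'CDX',
'URL,IP,MIMETYPE,LENGTH,ARCFILE,OFFSET', '1,2,4,5,6,7',
'/home/lc/index-backping.cdx', '`', null, null, 1);

With the index already in place it gets maintained row by row during the
load, so the big external sort (and its pile of temp files) never happens.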

-HTH...

>
> It turns out that this happens during index creation.  I was able to
> import the text file and run selects on it, but when I try to create an
> index:
>
> Derby creates files in the tmp directory at a rate of about 8 per
> second.  If it doesn't close all of these, it would run out of FDs
> (ulimit 1024) before long.
>
> I would file a bug report, but db.apache.org isn't responding.
>
> -Lars

-- 
Michael Segel
Principal 
MSCC
312 952-8175 [M]

Re: Workarounds for too many open files?

Posted by Mike Matrigali <mi...@sbcglobal.net>.
The easiest workaround is to up the open file count - these files should
only be needed during the sort.  How to do this is very OS-specific (on
Linux, for example, via ulimit -n).  To my knowledge Java gives no
visibility into this resource.

I believe the problem is the sort algorithm used to create the index, not
the index itself.  It uses a multi-level merge strategy where each merge
group is a separate file (once Derby has determined that it is going to do
a disk-based sort rather than an in-memory sort).

I have not debugged this, other than verifying that upping the open
file count allows the index to be created and the temp files to be
cleaned up.  Unlike the normal open files, which go through a cache
that limits how many are open at one time, my guess is that the sort
just keeps all of its files open, as they tend to be filled and then
drained almost immediately.  Caching the opens would slow down the
sort while conserving the open file resource.  Alternate sort/merge
strategies could be used that do not need one file per merge group.
It may also be that we should up the size of each merge pass when
dealing with such a big sort.

Not very much work has been done on the performance of the disk-based
sort, and especially not on 100 GB sorts.  Anyone interested in doing
some development work on Derby may want to look at the sorter.  It is a
module where one could write a completely separate implementation and
easily compare and test one's changes without worrying about any other
part of the code.

This is definitely an area that can probably be improved, as it has not
changed much since its original implementation, when 100 GB databases
were just a dream (I can verify it was developed on machines where 1 GB
of disk space was a luxury).

Lars Clausen wrote:
> On Fri, 2005-11-25 at 09:14, Lars Clausen wrote:
> 
>>Trying to import a 10GB text file (about 50x10^6 entries) into a single
>>Derby table, I got the following error:
>>
>>ij> connect 'jdbc:derby:cdxdb'; ij> elapsedtime on; ij> CALL
>>SYSCS_UTIL.SYSCS_IMPORT_DATA ( null, 'CDX',
>>'URL,IP,MIMETYPE,LENGTH,ARCFILE,OFFSET', '1,2,4,5,6,7',
>>'/home/lc/index-backping.cdx', '`', null, null, 1);
>>ERROR 38000: The exception 'SQL Exception: Exception during creation of
>>file
>>/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp
>>for container' was thrown while evaluating an expression.
>>ERROR XSDF1: Exception during creation of
>>file
>>/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp
>>for container
>>ERROR XJ001: Java exception:
>>'/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp (Too many open files): java.io.FileNotFoundException'.
> 
> 
> It turns out that this happens during index creation.  I was able to
> import the text file and run selects on it, but when I try to create an
> index:
> 
> ij> select count(*) from cdx;
> 1
> -----------
> 50000000
>  
> 1 row selected
> ELAPSED TIME = 320818 milliseconds
> ij> create index cdxurl on cdx(url);
> ERROR XSDF1: Exception during creation of file
> /home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132927896412.tmp
> for container
> ERROR XJ001: Java exception:
> '/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132927896412.tmp (Too many open files): java.io.FileNotFoundException'.
> ij>
> 
> Derby creates files in the tmp directory at a rate of about 8 per
> second.  If it doesn't close all of these, it would run out of FDs
> (ulimit 1024) before long.
> 
> I would file a bug report, but db.apache.org isn't responding.
> 
> -Lars
> 
> 
> 


Re: Workarounds for too many open files?

Posted by Lars Clausen <lc...@statsbiblioteket.dk>.
On Fri, 2005-11-25 at 09:14, Lars Clausen wrote:
> Trying to import a 10GB text file (about 50x10^6 entries) into a single
> Derby table, I got the following error:
> 
> ij> connect 'jdbc:derby:cdxdb'; ij> elapsedtime on; ij> CALL
> SYSCS_UTIL.SYSCS_IMPORT_DATA ( null, 'CDX',
> 'URL,IP,MIMETYPE,LENGTH,ARCFILE,OFFSET', '1,2,4,5,6,7',
> '/home/lc/index-backping.cdx', '`', null, null, 1);
> ERROR 38000: The exception 'SQL Exception: Exception during creation of
> file
> /home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp
> for container' was thrown while evaluating an expression.
> ERROR XSDF1: Exception during creation of
> file
> /home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp
> for container
> ERROR XJ001: Java exception:
> '/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132842374093.tmp (Too many open files): java.io.FileNotFoundException'.

It turns out that this happens during index creation.  I was able to
import the text file and run selects on it, but when I try to create an
index:

ij> select count(*) from cdx;
1
-----------
50000000
 
1 row selected
ELAPSED TIME = 320818 milliseconds
ij> create index cdxurl on cdx(url);
ERROR XSDF1: Exception during creation of file
/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132927896412.tmp
for container
ERROR XJ001: Java exception:
'/home/lc/projects/webarkivering/scripts/sql/cdxdb/tmp/T1132927896412.tmp (Too many open files): java.io.FileNotFoundException'.
ij>

Derby creates files in the tmp directory at a rate of about 8 per
second.  If it doesn't close them again, it will run out of file
descriptors (the ulimit here is 1024) in roughly two minutes.

I would file a bug report, but db.apache.org isn't responding.

-Lars