Posted to derby-user@db.apache.org by Jeffrey Lichtman <sw...@rcn.com> on 2006/03/01 05:25:34 UTC

Re: order by

>I have made a test and dropped the composite index, and now the
>sorting is ok, regenerated the index and sorting is as described
>in my first problem report.
>
>tom

It seems likely that the optimizer is using the index as a 
sort-avoidance strategy. That is, it is recognizing that the index is 
in the same order as the "order by" clause, so it's doing an index 
scan instead of using the sorter. Now, I don't know why this in 
itself would cause the problem you're seeing, since as you've 
described it the index should be in the right order. Maybe there's 
something wrong with the index.
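
To see which access path the optimizer actually chose, Derby can dump runtime
statistics for a statement. A minimal JDBC sketch (the connection URL, table
and column names are only placeholders, not taken from this thread):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckPlan {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; adjust for your server and database.
        Class.forName("org.apache.derby.jdbc.ClientDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/mydb");

        Statement s = conn.createStatement();
        // Turn on runtime statistics collection for this connection.
        s.execute("CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1)");

        // Run the ORDER BY query whose ordering looks wrong.
        ResultSet rs = s.executeQuery(
                "SELECT col1, col2 FROM mytable ORDER BY col1, col2");
        while (rs.next()) { /* drain the result set */ }
        rs.close();

        // Fetch the plan text; look for an index scan vs. a sort node.
        rs = s.executeQuery("VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS()");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        s.close();
        conn.close();
    }
}

If the output shows an index scan on the composite index and no separate sort,
the optimizer is indeed avoiding the sort, and a damaged index would then show
up directly as wrong ordering.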


                        -        Jeff Lichtman
                                 swazoo@rcn.com
                                 Check out Swazoo Koolak's Web Jukebox at
                                 http://swazoo.com/ 


Re: order by

Posted by Michael Segel <ms...@segel.com>.
On Friday 03 March 2006 5:04 am, Thomas Vatter wrote:
> I have shut down derby, dropped the database by removing its folder,
> started derby (no large memory consumption, stays low), recreated
> the database, imported 1300 records and showed them in my application.
> Memory stays low, but the ordering is not ok. This is a mystery for me.

Well...

Since you mentioned Linux/Unix... try looking at the data files using od.
od (octal dump) is a great tool for looking at the low-level data.
You can look up the man page on it, but I'd suggest using the -a option so
that the output is in ASCII.
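
If staying in Java is easier than reaching for od, a quick sketch that flags
anything outside printable ASCII (the file name is only a placeholder) can do
much the same job:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FindGarbage {
    public static void main(String[] args) throws IOException {
        // Scan the exported file byte by byte and report anything that is
        // not printable ASCII, a tab, or a line ending.
        InputStream in = new BufferedInputStream(new FileInputStream("export.csv"));
        int b;
        long offset = 0;
        while ((b = in.read()) != -1) {
            boolean ok = (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\r' || b == '\n';
            if (!ok) {
                System.out.printf("suspicious byte 0x%02X at offset %d%n", b, offset);
            }
            offset++;
        }
        in.close();
    }
}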

There's definitely a problem with your insert/load code.
You should not be consuming that much memory.

At the start of your load program, when you establish your connection,
you should instantiate a single prepared statement to perform the insert.
Then you should open the file as a buffered reader so you can read it line
by line.

Then you'll have to write your own parsing routine, since the split() method
of String doesn't take into consideration whether a comma occurs within a
quoted value: "Smith, John" will split into two fields, "Smith and John".

The other thing that you need to do is to remove any leading/trailing white
space. (This may remove any non-printable garbage too, but I'm not sure. Of
course, when in doubt, write your own....)

I would also suggest that you create your own container class and put each
data element onto a vector as you parse the row. (Note: you could just put
the string onto the vector, but depending on the complexity of the load
program, you may want to capture additional metadata like mapping position,
data types, etc.)

The bottom line: a load program should be an extremely memory-efficient program.
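
A minimal sketch of such a loader (the URL, table and column names are
invented, and the quote-aware splitter is deliberately simplistic) could look
like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.derby.jdbc.ClientDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/mydb");
        conn.setAutoCommit(false);

        // One PreparedStatement for the whole load, created once.
        PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO mytable (col1, col2, col3) VALUES (?, ?, ?)");

        BufferedReader in = new BufferedReader(new FileReader("export.csv"));
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            lineNo++;
            List<String> fields = parseLine(line);
            if (fields.size() != 3) {
                System.err.println("skipping line " + lineNo + ": " + line);
                continue;
            }
            for (int i = 0; i < 3; i++) {
                ins.setString(i + 1, fields.get(i).trim());   // strip white space
            }
            ins.executeUpdate();
            if (lineNo % 1000 == 0) {
                conn.commit();   // keep the transaction (and its memory) bounded
            }
        }
        conn.commit();
        in.close();
        ins.close();
        conn.close();
    }

    // Naive comma splitter that respects double-quoted values, so
    // "Smith, John" stays one field. Not a complete CSV parser.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;            // toggle; drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}

Nothing outside the loop holds on to the parsed rows, so the heap stays flat
no matter how many lines the file has.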

HTH,

-Mike


Re: order by

Posted by Kristian Waagan <Kr...@Sun.COM>.
Thomas Vatter wrote:

>The result of distributing application and database over two machines is:
>Import of the critical data now works, no hanging at 1200 records
>(I need to start derby with -Xmx1024m, else an OutOfMemoryError
>is thrown). My import algorithm has to be reviewed; it allocates
>memory in a loop, 100 MB when looping 1300 times. The original ordering
>problem does not appear at this point.
>Now the less good news. After work, I turned off the computer where
>derby is installed. I know derby expects a shutdown, but I just
>pressed Ctrl-C in its terminal window and shut down the OS.
>This morning I booted the computer and started derby. I was surprised
>about the memory consumption in the task list. The derby process was
>consuming the complete memory, 1024 MB, resident part 30 MB (the OS is
>SUSE 9.3). I tried an import of another spreadsheet with 1220 records
>in this situation. The import worked. I displayed the newly imported
>data in my application; the data are listed slowly - and the
>original ordering problem appears. To check if the composite index on
>the sorting columns was corrupted, I dropped and recreated the index.
>Same ordering problem. Perhaps the cancelling of the derby server in
>the evening caused the problem; I will redo the whole thing without
>stopping the server.
>
>  
>

Just to experiment a little, I created a program that imported data into 
Derby.
Only one table with three CHAR(100) columns. I inserted approximately 5 
GB (of text) with the default Derby 10.1 and JVM settings. A count(*) 
reported this:

ij> select count(*) from biginsert;
1
-----------
17476267

1 row selected

Memory usage was low at all times, and as you can see there are over 17
million rows in the database. So data volume per se should not be a
problem here; it must be something else causing your trouble (schema
layout? column types?).

For us to be able to help you, could you give us more details about the
database schema (tables, columns, types, indexes), the data itself and/or
the import program?

The fact that you need to start the JVM with -Xmx1024m indicates
something is wrong. It could be Derby's fault, but it is very hard to tell
without more information.




thanks,
--
Kristian

Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.


I have shut down derby, dropped the database by removing its folder,
started derby (no large memory consumption, stays low), recreated
the database, imported 1300 records and showed them in my application.
Memory stays low, but the ordering is not ok. This is a mystery for me.




Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.
The result of distributing application and database over two machines is:
Import of the critical data now works, no hanging at 1200 records
(I need to start derby with -Xmx1024m, else an OutOfMemoryError
is thrown). My import algorithm has to be reviewed; it allocates
memory in a loop, 100 MB when looping 1300 times. The original ordering
problem does not appear at this point.
Now the less good news. After work, I turned off the computer where
derby is installed. I know derby expects a shutdown, but I just
pressed Ctrl-C in its terminal window and shut down the OS.
This morning I booted the computer and started derby. I was surprised
about the memory consumption in the task list. The derby process was
consuming the complete memory, 1024 MB, resident part 30 MB (the OS is
SUSE 9.3). I tried an import of another spreadsheet with 1220 records
in this situation. The import worked. I displayed the newly imported
data in my application; the data are listed slowly - and the
original ordering problem appears. To check if the composite index on
the sorting columns was corrupted, I dropped and recreated the index.
Same ordering problem. Perhaps the cancelling of the derby server in
the evening caused the problem; I will redo the whole thing without
stopping the server.


Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.
Michael Segel <de...@...> writes:

> 
> On Thursday 02 March 2006 5:00 am, Thomas Vatter wrote:
> 
> This doesn't make any sense.
> You keep saying "spreadsheet".
> 
> Are you reading directly from an excel spreadsheet? .XLS or did you convert 
> the spreadsheet to a .CSV format?
> 
> The other issue of concern is that you say that your import continues to die
> at approximately 1200 rows.
> 
> The first thought is that it's a data set problem, because you indicate that
> it occurs with a single data set.
> 
> The second thought is that it's a programming issue on your part, because you
> now say that you're using different portions of the data set and it's still
> hanging at 1200 rows.
> 
> A simple read/parse/insert program should not eat memory.
> 
> Are you using a prepared statement?
> 
> As you parse each line of data, how are you storing the fields, and when you
> do your insert, are you properly "flushing" the variables? (Releasing the
> object references so they can be garbage collected...)
> 
> The only negative thing is that you can't use the split() method. You have to 
> manually parse the string to pull out the data fields, strip the quotes and 
> white space.... 
> (Which is why ESQL/C  or C is a better choice for this type of work...)
> 
> HTH,
> 

spreadsheet:
I am not reading the .XLS file, just reading the ASCII export.

should not eat memory:
I absolutely agree. I have to review the code; this will take some time. I will
do the separation of database and application over two machines first, for
easier analysis.

prepared statement:
Usually I do; I will check this in the review.

tom

 





Re: order by

Posted by Michael Segel <de...@segel.com>.
On Thursday 02 March 2006 5:00 am, Thomas Vatter wrote:

This doesn't make any sense.
You keep saying "spreadsheet".

Are you reading directly from an excel spreadsheet? .XLS or did you convert 
the spreadsheet to a .CSV format?

The other issue of concern is that you say that your import continues to die
at approximately 1200 rows.

The first thought is that it's a data set problem, because you indicate that
it occurs with a single data set.

The second thought is that it's a programming issue on your part, because you
now say that you're using different portions of the data set and it's still
hanging at 1200 rows.

A simple read/parse/insert program should not eat memory.

Are you using a prepared statement?

As you parse each line of data, how are you storing the fields, and when you
do your insert, are you properly "flushing" the variables? (Releasing the
object references so they can be garbage collected...)


The only negative thing is that you can't use the split() method. You have to 
manually parse the string to pull out the data fields, strip the quotes and 
white space.... 
(Which is why ESQL/C  or C is a better choice for this type of work...)


HTH,

-Mike

> Kristian Waagan <Kr...@...> writes:
>
> ...
>
>
> Hello Kristian,
>
> I'm doing this on a Windows XP system. I start all java processes with
> -Xmx1024m; I had a try with 2048, but the process did not start, so I stayed
> with 1024. "Between 912 and 960" is the total usage for the machine,
> read from the task-manager. I have thought about it: Derby's pageSize
> is 4k, and I have increased pageCacheSize from 1,000 to 10,000, resulting
> in 40 MB steps; this would explain that memory "ends" before 1024 is
> reached. Yes, I was imprecise, for brevity, with the word "record".
> I should have said "record in the main database table" or "dataset".
> But you are right, a dataset (or line) in my spreadsheet has around
> 30 fields only. I had the network server running on the described machine,
> because I am testing how derby performs in the network situation,
> because in the single user situation I have typically less data.
>
> tom

-- 
Michael Segel
Principal 
Michael Segel Consulting Corp.
derby@segel.com
(312) 952-8175 [mobile]

Re: order by

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Please note that your memory estimates for cache size could be way off;
I haven't seen the DDL for the table, so I am not sure.  In some cases derby
will default to 32k pages rather than 4k (basically if there are blob or
clob columns).  The page size is also user settable, but I assume that is
not a factor here.  Also, there is more overhead per page than just the page
itself; most of the overhead is per row per page, so if rows are very small
compared to the page size it is significant (on the order of one object per
row).

With memory leaks I usually recommend setting the cache size even 
smaller than 1000, say 100.  That way it is even quicker to tell the
difference between expected cache memory growth and some real problem.
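
The page cache size is controlled by the derby.storage.pageCacheSize property.
One place to set it for the network server is a derby.properties file in the
server's derby.system.home (100 is just the experimental value suggested
above):

# derby.properties on the machine running the network server
# shrink the page cache so any further growth is clearly not the cache
derby.storage.pageCacheSize=100

Setting it as a JVM system property (-Dderby.storage.pageCacheSize=100) on the
server's java command line should work as well.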

Kristian Waagan wrote:
> Thomas Vatter wrote:
> 
>> Kristian Waagan <Kr...@...> writes:
>>
>> ...
>>
>>
>> Hello Kristian,
>>
>> I'm doing this on a Windows XP system. I start all java processes with
>> -Xmx1024m; I had a try with 2048, but the process did not start, so I
>> stayed with 1024. "Between 912 and 960" is the total usage for the
>> machine, read from the task-manager. I have thought about it: Derby's
>> pageSize is 4k, and I have increased pageCacheSize from 1,000 to 10,000,
>> resulting in 40 MB steps; this would explain that memory "ends" before
>> 1024 is reached. Yes, I was imprecise, for brevity, with the word
>> "record". I should have said "record in the main database table" or
>> "dataset".
>> But you are right, a dataset (or line) in my spreadsheet has around 30
>> fields only. I had the network server running on the described
>> machine, because I am testing how derby performs in the network
>> situation, because in the single user situation I have typically less
>> data.
>>
>> tom
>>
>>
> 
> Okay, thank you for the answers.
> 
> Just to be clear, I would have tried running the Derby network server on
> a separate machine, and seen where the memory usage goes sky-high:
> on the database machine or on the application code machine.
> 
> You could also try running Derby with the default page cache size (1000
> pages ~ 4 MB) and *not* altering the JVM memory options. Performance could
> go down, but if you are having problems with a memory leak it would show up
> pretty fast as an OutOfMemoryError.
> 
> 
> 
> 
> -- 
> Kristian
> 
> 


Re: order by

Posted by Kristian Waagan <Kr...@Sun.COM>.
Thomas Vatter wrote:
> Kristian Waagan <Kr...@...> writes:
> 
> ...
> 
> 
> Hello Kristian,
> 
> I'm doing this on a Windows XP system. I start all java processes with
> -Xmx1024m; I had a try with 2048, but the process did not start, so I stayed
> with 1024. "Between 912 and 960" is the total usage for the machine,
> read from the task-manager. I have thought about it: Derby's pageSize
> is 4k, and I have increased pageCacheSize from 1,000 to 10,000, resulting
> in 40 MB steps; this would explain that memory "ends" before 1024 is
> reached. Yes, I was imprecise, for brevity, with the word "record".
> I should have said "record in the main database table" or "dataset".
> But you are right, a dataset (or line) in my spreadsheet has around
> 30 fields only. I had the network server running on the described machine,
> because I am testing how derby performs in the network situation,
> because in the single user situation I have typically less data.
> 
> tom
> 
> 

Okay, thank you for the answers.

Just to be clear, I would have tried running the Derby network server on
a separate machine, and seen where the memory usage goes sky-high:
on the database machine or on the application code machine.

You could also try running Derby with the default page cache size (1000
pages ~ 4 MB) and *not* altering the JVM memory options. Performance could
go down, but if you are having problems with a memory leak it would show up
pretty fast as an OutOfMemoryError.




--
Kristian

Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.
Kristian Waagan <Kr...@...> writes:

...


Hello Kristian,

I'm doing this on a Windows XP system. I start all java processes with
-Xmx1024m; I had a try with 2048, but the process did not start, so I stayed
with 1024. "Between 912 and 960" is the total usage for the machine,
read from the task-manager. I have thought about it: Derby's pageSize
is 4k, and I have increased pageCacheSize from 1,000 to 10,000, resulting
in 40 MB steps; this would explain that memory "ends" before 1024 is
reached. Yes, I was imprecise, for brevity, with the word "record".
I should have said "record in the main database table" or "dataset".
But you are right, a dataset (or line) in my spreadsheet has around
30 fields only. I had the network server running on the described machine,
because I am testing how derby performs in the network situation,
because in the single user situation I have typically less data.

tom



Re: order by

Posted by Kristian Waagan <Kr...@Sun.COM>.
Thomas Vatter wrote:
> 
> I have analysed the situation by importing multiplied copies of the small,
> working spreadsheet. The import again slowed down and then stopped at
> 1200 records (the UI goes on with its progress indicator; it does not catch
> the problem). The same slowdown and stop at 1200 records happens when I show
> the data in the application. When this happens, memory usage is always
> between 912 and 960 MB. The machine has 1 GB. This prevented me from
> understanding that it is a memory problem. Now I am 'sure' it is one.
> I have several processes running at once: the derby network server,
> my application's RMI server, my application's client and the IDE. I
> have to do the data testing with 40,000 records in the database, not
> only 1,200, so perhaps I should install the database server on a
> remote machine to have a chance to succeed.
> 
> 

Hello Thomas,

I agree that it would be a good test to run the Derby network server on 
a different machine than your application code/processes. That way, you 
should be able to monitor the memory usage on the machines better.

I do have a few questions.
* What operating system are you running?

* What parameters, if any, do you pass to the Java VM? I'm thinking of 
those that determine how much memory the VM is allowed to use.

* You say "memory usage is always between 912 and 960mb". It is your 
Java process that uses this memory, or is this the total usage for the 
machine?

* How big is a record? Since you are talking about a spreadsheet here, I 
imagine a record is a line in it, and thus not very big.

* It would also be helpful if you are able to import the data somehow, 
and then try to do some queries against the database from ij.



If you run into problems getting the network server running, don't
hesitate to ask for help. You also have a search box on the Derby
website and the various manuals that you can use.
Don't forget to specify the host argument when starting the network 
server (or else it will only accept connections from localhost):
'java -classpath derbynet.jar org.apache.derby.drda.NetworkServerControl 
-h my.hostname.or.ip start'

(or use the scripts as described in the documentation)




--
Kristian

Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.

I have analysed the situation by importing multiplied copies of the small,
working spreadsheet. The import again slowed down and then stopped at
1200 records (the UI goes on with its progress indicator; it does not catch
the problem). The same slowdown and stop at 1200 records happens when I show
the data in the application. When this happens, memory usage is always
between 912 and 960 MB. The machine has 1 GB. This prevented me from
understanding that it is a memory problem. Now I am 'sure' it is one.
I have several processes running at once: the derby network server,
my application's RMI server, my application's client and the IDE. I
have to do the data testing with 40,000 records in the database, not
only 1,200, so perhaps I should install the database server on a
remote machine to have a chance to succeed.



Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.
Michael Segel <ms...@...> writes:

> 
> Ok...
> 
> Maybe I'm a tad slow... (Lack of sleep and not enough coffee...)
> 
> You said that the problem only occurs with your one result set.
> 
> When you say that you're getting the data from a spreadsheet, how are you
> reading it in? Are you trying to access the raw XLS dataset? Or did you
> export the data as a .csv file?
> 
> I don't believe it's a bug, because it only occurs in one data set, and
> you could be introducing an error in a couple of places (the data set
> itself, for example, or the load program).
> 
> So what am I missing?
> 



You are right, data importing could be the problem. I am just coding
table-level locking to make the import faster and to find out why the import
sometimes produces errors; it has debug output that shows me what it does, so
I can see whether it gets the right data.
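
For reference, Derby already has an explicit LOCK TABLE statement, so the
table can be locked for the duration of the load transaction instead of
hand-rolling the locking. A sketch (the table name is invented, and conn is
assumed to be an open connection):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class LockedLoad {
    static void loadWithTableLock(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        Statement s = conn.createStatement();
        // The exclusive lock is held until the transaction commits or rolls back.
        s.execute("LOCK TABLE mytable IN EXCLUSIVE MODE");
        s.close();

        // ... run the inserts here with the usual PreparedStatement ...

        conn.commit();   // commit ends the transaction and releases the lock
    }
}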




Re: order by

Posted by Thomas Vatter <th...@network-inventory.de>.

Yes, I will analyse the data and make further tests (I have some more example
spreadsheets), and I can create a bigger one from the smaller one (that works)
by multiplying its rows.



Re: order by

Posted by Myrna van Lunteren <m....@gmail.com>.
On 3/1/06, Michael Segel <ms...@segel.com> wrote:
>
> Ok...
>
> You said that the problem only occurs with your one result set.
>
> When you say that you're getting the data from a spreadsheet, how are you
> reading it in? Are you trying to access the raw XLS dataset? Or did you
> export the data as a .csv file?
>
> I don't believe it's a bug, because it only occurs in one data set, and
> you could be introducing an error in a couple of places (the data set
> itself, for example, or the load program).
>
> > -----Original Message-----
> > From: news [mailto:news@sea.gmane.org] On Behalf Of tom
> > Sent: Wednesday, March 01, 2006 5:01 AM
> > To: derby-user@db.apache.org
> > Subject: Re: order by
> >
> >
> >
> > There is still something wrong. I recreated the database without the
> > composite index, imported 1200 records without error, and viewed
> > them; the program was hanging at record 1100 without messages.
> > I killed the process after waiting 5 minutes and restarted the
> > derby network server. I created a composite index and viewed the
> > data; now the sorting problem that I described at the beginning of this
> > thread appears again. pooh
> >


I agree with Michael; it still looks like there is some odd stuff in your
spreadsheet or your import program.

Maybe spurious characters (i.e. characters that are not interpreted by your
encoding and don't actually show up when you look at your spreadsheet), a
tab in a data field that gets interpreted as a new column/field by the
import program, so you have an extra field the import program can't
handle... or an occurrence of your field delimiter (if you have such a thing)
in a text string (same effect)...

In a past life in tech support I once had to take the 'binary' approach to
find exactly such a situation - there were no errors that helped... It was
labor intensive and bothersome, but it got us through the hurdle. I think
there were a couple of thousand rows.

I had to cut the dataset into sections, ignore the bits that worked, and cut
up the bits that didn't work further, until I found the troublesome
record/records/entries. Then I fixed those up in the original, and then the
whole process worked.
Then I informed the customer how to prevent whatever it was that caused the
problem.
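
The same cut-in-half idea can be automated if there is some check that says
whether a given subset of the rows imports cleanly. A rough sketch (the Check
callback is hypothetical, standing in for "reload just these rows and see
whether the problem shows up"):

import java.util.List;

public class BisectRows {
    /** Hypothetical check: re-create the table, import only these rows,
     *  and report whether everything (ordering, row count) looks right. */
    interface Check {
        boolean works(List<String> rows);
    }

    /**
     * Binary search over prefixes of the input. Assumes the whole file fails
     * the check, an empty import passes it, and a single bad row is to blame.
     * Returns the index of that row.
     */
    static int findBadRow(List<String> rows, Check check) {
        int lo = 0;                 // prefix [0, lo) is known to work
        int hi = rows.size();       // prefix [0, hi) is known to fail
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (check.works(rows.subList(0, mid))) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi - 1;              // first row whose inclusion breaks the import
    }
}

For a couple of thousand rows that is roughly eleven reload-and-check runs
instead of eyeballing every record.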

Unless someone knows another way to do this?

Myrna

RE: order by

Posted by Michael Segel <ms...@segel.com>.
Ok...

Maybe I'm a tad slow... (Lack of sleep and not enough coffee...)

You said that the problem only occurs with your one result set.

When you say that you're getting the data from a spreadsheet, how are you
reading it in? Are you trying to access the raw XLS dataset? Or did you
export the data as a .csv file?

I don't believe it's a bug, because it only occurs in one data set, and
you could be introducing an error in a couple of places (the data set
itself, for example, or the load program).

So what am I missing?

> -----Original Message-----
> From: news [mailto:news@sea.gmane.org] On Behalf Of tom
> Sent: Wednesday, March 01, 2006 5:01 AM
> To: derby-user@db.apache.org
> Subject: Re: order by
> 
> 
> 
> There is still something wrong. I recreated the database without the
> composite index, imported 1200 records without error, and viewed
> them; the program was hanging at record 1100 without messages.
> I killed the process after waiting 5 minutes and restarted the
> derby network server. I created a composite index and viewed the
> data; now the sorting problem that I described at the beginning of this
> thread appears again. pooh
> 




Re: order by

Posted by tom <th...@network-inventory.de>.

There is still something wrong. I recreated the database without the
composite index, imported 1200 records without error, and viewed
them; the program was hanging at record 1100 without messages.
I killed the process after waiting 5 minutes and restarted the
derby network server. I created a composite index and viewed the
data; now the sorting problem that I described at the beginning of this
thread appears again. pooh



Re: order by

Posted by tom <th...@network-inventory.de>.
Jeffrey Lichtman <sw...@...> writes:

> 
> 
> >I have made a test and dropped the composite index, and now the
> >sorting is ok, regenerated the index and sorting is as described
> >in my first problem report.
> >
> >tom
> 
> It seems likely that the optimizer is using the index as a 
> sort-avoidance strategy. That is, it is recognizing that the index is 
> in the same order as the "order by" clause, so it's doing an index 
> scan instead of using the sorter. Now, I don't know why this in 
> itself would cause the problem you're seeing, since as you've 
> described it the index should be in the right order. Maybe there's 
> something wrong with the index.
> 
>                         -        Jeff Lichtman
>                                  swazoo@...
>                                  Check out Swazoo Koolak's Web Jukebox at
>                                  http://swazoo.com/ 
> 
> 


Yes, you are right, the index is corrupted. I just found that the problem is
not reproducible with other datasets. I create the datasets by importing
spreadsheet data with an import program. The problematic dataset has 1300
records. I now found that near the end of this import, error messages are
listed in the background. I tried with a smaller spreadsheet with 108 records
and there everything is ok. Now I'll find out why the import of the larger
spreadsheet has problems.