Posted to derby-user@db.apache.org by pu...@verizon.net on 2008/11/25 22:33:27 UTC

faster inserts in big tables

I have tens of thousands of rows to add to a table, possibly hundreds of 
thousands, and I'm wondering if there's anything else I can do to speed 
this up. The table could end up having a couple of million rows.

This is what I've done so far:
* Using a PreparedStatement that gets reused with each insert.
* Set locking level to TABLE for that table.
* Turned off autocommit.
* Set the connection isolation level to READ_COMMITTED.

In addition to that, I'm also setting these system parameters, though not 
necessarily to improve insert performance (a rough sketch of the whole 
setup follows below):
* derby.system.durability=test
* derby.storage.pageSize=32768
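
In case it's useful, here is a rough sketch of that setup. MY_TABLE, the 
column names, and the data source are placeholders, not my real schema, 
and the LOCK TABLE statement is just one way to get table-level locking:

    import java.sql.*;

    public class InsertLoopSketch {
        // Rough sketch only; names and the data source are placeholders.
        public static void run(Iterable<long[]> keys, byte[] payload) throws SQLException {
            // System-wide Derby properties: must be set before the engine boots.
            System.setProperty("derby.system.durability", "test"); // skips log syncs; unsafe if the JVM crashes
            System.setProperty("derby.storage.pageSize", "32768");

            // On some driver/JDK combinations you may need
            // Class.forName("org.apache.derby.jdbc.EmbeddedDriver") first.
            Connection conn = DriverManager.getConnection("jdbc:derby:myDB;create=true");
            conn.setAutoCommit(false);
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);

            Statement s = conn.createStatement();
            s.executeUpdate("LOCK TABLE MY_TABLE IN EXCLUSIVE MODE"); // table-level locking

            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO MY_TABLE (COL_A, COL_B, PAYLOAD) VALUES (?, ?, ?)");
            for (long[] k : keys) {          // the same PreparedStatement is reused for every row
                ps.setLong(1, k[0]);
                ps.setLong(2, k[1]);
                ps.setBytes(3, payload);
                ps.executeUpdate();
            }
            conn.commit();
            ps.close(); s.close(); conn.close();
        }
    }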

The table has one odd feature: The last column is a VARCHAR(32672) FOR 
BIT DATA. I've tried setting the length to something smaller, but it 
didn't really seem to matter.

The primary key is an auto-generated int, and there is another 2-column 
index on two BIGINT columns. Something I found interesting is that the 
inserts seem to go 2x faster if I have the 2-column index in place than 
if I have just the primary-key index.

I'm running
    Derby 10.2.2
    JRE 1.6.0_07
    Windows XP SP2

Is there anything else I can do to speed up row inserts?

Thanks,
Brian

Re: faster inserts in big tables

Posted by Rick Hillegas <Ri...@Sun.COM>.
Knut Anders Hatlen wrote:
> Rick Hillegas <Ri...@Sun.COM> writes:
>
>   
>> Knut Anders Hatlen wrote:
>>     
>>> Rick Hillegas <Ri...@Sun.COM> writes:
>>>
>>>   
>>>       
>>>> Hi Brian,
>>>>
>>>> In a previous response Peter Ondruška noted that you could use bulk
>>>> import to speed up your inserts if your source data lived in a
>>>> properly formatted file.
>>>>
>>>> Even if your source data does not live in a file, you can still get
>>>> bulk import speed (and the benefits of your generated key) by using
>>>> table functions--provided that you upgrade to 10.4. If you wrap your
>>>> source data in a table function, then you can bulk insert your data as
>>>> follows:
>>>>
>>>> insert into MY_TABLE( NON_GEN_COL1, ... NON_GEN_COL_N )
>>>> select * from table( MY_TABLE_FUNCTION() ) s
>>>>     
>>>>         
>>> The internal SQL syntax allows you to add an optimizer override here
>>> (--DERBY-PROPERTIES insertMode=bulkInsert) which I think speeds it up
>>> even more, though I'm not sure exactly what it buys you. This override
>>> is used in the import code and in the SYSCS_UTIL.SYSCS_BULK_INSERT
>>> system procedure, but you're not allowed to use it in your own SQL
>>> queries, it seems. The (undocumented) SYSCS_BULK_INSERT procedure does
>>> exactly what Rick's example does, except that it also adds the optimizer
>>> override.
>>>
>>> The procedure only works on old-style table functions, though. Does
>>> anyone know what performance gains one can expect by using
>>> SYSCS_BULK_INSERT instead of INSERT INTO ... SELECT FROM? Would it be
>>> worth the effort to create a similar mechanism for the new-style table
>>> functions?
>>>
>>>   
>>>       
>> Hi Knut,
>>
>> It appears to me that this setting (insertMode=bulkInsert) is
>> inspected when binding an INSERT statement. If the bind() logic finds
>> this setting, then it escalates the lock mode for the target table to
>> be a table level lock. Other than that, I cannot find any other use of
>> this setting. Comments around this code, however, indicate that there
>> might once have been other implications and that this setting, which
>> is now unconditionally accepted, used to be silently rejected in
>> certain situations, including insertion into a synchronized client
>> database, insertion into a table which has triggers, and deferred mode
>> insertion (see the comments on InsertNode.verifyBulkInsert).
>>
>> It appears to me that the benefit now conferred by this setting can be
>> achieved by setting the lock mode on the table to be table level. So
>> you would preface the INSERT statement above with the following
>> statement:
>>
>> lock MY_TABLE in exclusive mode
>>
>> That, at least, is what I have gleaned from a code inspection.
>>     
>
> Thanks Rick,
>
> Did you also look at the code in InsertResultSet.open()? It looks like
> it has some special handling of triggers, foreign keys and checking of
> constraints when running in bulk insert mode.
>
>   
Hi Knut,

Thanks for finding that. It looks like InsertResultSet optimizes the 
performance of bulk inserts as follows:

1) Claims to turn off row logging. That is, performs the insert via 
TransactionController.createAndLoadConglomerate() and 
recreateAndLoadConglomerate(). Perhaps you could take a look at those 
methods and verify what they really do.

2) Bulk-enforces foreign keys defined on the table. This is an optimized 
zippering together of the indexes which support a foreign/primary key 
pair. As I look at this code, however, it seems to me that you may 
actually get worse performance if you are bulk inserting (not replacing) 
a small number of rows into a table which already has a lot of rows in it.

There is some additional trickiness dealing with the interaction of 
BEFORE triggers and CHECK constraints, but (famous last words) it appears 
to me that the processing is still row-by-row in the language layer, so I 
don't see any optimization involved here.

I guess the bottom line is that with bulk insert, you get better 
performance because logging is disabled. The bulk enforcement of foreign 
keys may improve or degrade your performance.

So those are a few more clues.

Cheers,
-Rick

Re: faster inserts in big tables

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Rick Hillegas <Ri...@Sun.COM> writes:

> Knut Anders Hatlen wrote:
>> Rick Hillegas <Ri...@Sun.COM> writes:
>>
>>   
>>> Hi Brian,
>>>
>>> In a previous response Peter Ondruška noted that you could use bulk
>>> import to speed up your inserts if your source data lived in a
>>> properly formatted file.
>>>
>>> Even if your source data does not live in a file, you can still get
>>> bulk import speed (and the benefits of your generated key) by using
>>> table functions--provided that you upgrade to 10.4. If you wrap your
>>> source data in a table function, then you can bulk insert your data as
>>> follows:
>>>
>>> insert into MY_TABLE( NON_GEN_COL1, ... NON_GEN_COL_N )
>>> select * from table( MY_TABLE_FUNCTION() ) s
>>>     
>>
>> The internal SQL syntax allows you to add an optimizer override here
>> (--DERBY-PROPERTIES insertMode=bulkInsert) which I think speeds it up
>> even more, though I'm not sure exactly what it buys you. This override
>> is used in the import code and in the SYSCS_UTIL.SYSCS_BULK_INSERT
>> system procedure, but you're not allowed to use it in your own SQL
>> queries, it seems. The (undocumented) SYSCS_BULK_INSERT procedure does
>> exactly what Rick's example does, except that it also adds the optimizer
>> override.
>>
>> The procedure only works on old-style table functions, though. Does
>> anyone know what performance gains one can expect by using
>> SYSCS_BULK_INSERT instead of INSERT INTO ... SELECT FROM? Would it be
>> worth the effort to create a similar mechanism for the new-style table
>> functions?
>>
>>   
> Hi Knut,
>
> It appears to me that this setting (insertMode=bulkInsert) is
> inspected when binding an INSERT statement. If the bind() logic finds
> this setting, then it escalates the lock mode for the target table to
> be a table level lock. Other than that, I cannot find any other use of
> this setting. Comments around this code, however, indicate that there
> might once have been other implications and that this setting, which
> is now unconditionally accepted, used to be silently rejected in
> certain situations, including insertion into a synchronized client
> database, insertion into a table which has triggers, and deferred mode
> insertion (see the comments on InsertNode.verifyBulkInsert).
>
> It appears to me that the benefit now conferred by this setting can be
> achieved by setting the lock mode on the table to be table level. So
> you would preface the INSERT statement above with the following
> statement:
>
> lock MY_TABLE in exclusive mode
>
> That, at least, is what I have gleaned from a code inspection.

Thanks Rick,

Did you also look at the code in InsertResultSet.open()? It looks like
it has some special handling of triggers, foreign keys and checking of
constraints when running in bulk insert mode.

-- 
Knut Anders

Re: faster inserts in big tables

Posted by Rick Hillegas <Ri...@Sun.COM>.
Knut Anders Hatlen wrote:
> Rick Hillegas <Ri...@Sun.COM> writes:
>
>   
>> Hi Brian,
>>
>> In a previous response Peter Ondruška noted that you could use bulk
>> import to speed up your inserts if your source data lived in a
>> properly formatted file.
>>
>> Even if your source data does not live in a file, you can still get
>> bulk import speed (and the benefits of your generated key) by using
>> table functions--provided that you upgrade to 10.4. If you wrap your
>> source data in a table function, then you can bulk insert your data as
>> follows:
>>
>> insert into MY_TABLE( NON_GEN_COL1, ... NON_GEN_COL_N )
>> select * from table( MY_TABLE_FUNCTION() ) s
>>     
>
> The internal SQL syntax allows you to add an optimizer override here
> (--DERBY-PROPERTIES insertMode=bulkInsert) which I think speeds it up
> even more, though I'm not sure exactly what it buys you. This override
> is used in the import code and in the SYSCS_UTIL.SYSCS_BULK_INSERT
> system procedure, but you're not allowed to use it in your own SQL
> queries, it seems. The (undocumented) SYSCS_BULK_INSERT procedure does
> exactly what Rick's example does, except that it also adds the optimizer
> override.
>
> The procedure only works on old-style table functions, though. Does
> anyone know what performance gains one can expect by using
> SYSCS_BULK_INSERT instead of INSERT INTO ... SELECT FROM? Would it be
> worth the effort to create a similar mechanism for the new-style table
> functions?
>
>   
Hi Knut,

It appears to me that this setting (insertMode=bulkInsert) is inspected 
when binding an INSERT statement. If the bind() logic finds this 
setting, then it escalates the lock mode for the target table to be a 
table level lock. Other than that, I cannot find any other use of this 
setting. Comments around this code, however, indicate that there might 
once have been other implications and that this setting, which is now 
unconditionally accepted, used to be silently rejected in certain 
situations, including insertion into a synchronized client database, 
insertion into a table which has triggers, and deferred mode insertion 
(see the comments on InsertNode.verifyBulkInsert).

It appears to me that the benefit now conferred by this setting can be 
achieved by setting the lock mode on the table to be table level. So you 
would preface the INSERT statement above with the following statement:

lock table MY_TABLE in exclusive mode

That, at least, is what I have gleaned from a code inspection.
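
A minimal JDBC sketch of that combination (MY_TABLE and MY_TABLE_FUNCTION 
are the placeholder names from above, and conn is a connection with 
autocommit disabled, so the lock is still held when the INSERT runs):

    Statement s = conn.createStatement();
    s.executeUpdate("lock table MY_TABLE in exclusive mode");
    s.executeUpdate(
        "insert into MY_TABLE( NON_GEN_COL1, NON_GEN_COL2 ) " +  // your non-generated columns
        "select * from table( MY_TABLE_FUNCTION() ) s");
    conn.commit();   // committing ends the transaction and releases the table lock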

Regards,
-Rick

Re: faster inserts in big tables

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Rick Hillegas <Ri...@Sun.COM> writes:

> Hi Brian,
>
> In a previous response Peter Ondruška noted that you could use bulk
> import to speed up your inserts if your source data lived in a
> properly formatted file.
>
> Even if your source data does not live in a file, you can still get
> bulk import speed (and the benefits of your generated key) by using
> table functions--provided that you upgrade to 10.4. If you wrap your
> source data in a table function, then you can bulk insert your data as
> follows:
>
> insert into MY_TABLE( NON_GEN_COL1, ... NON_GEN_COL_N )
> select * from table( MY_TABLE_FUNCTION() ) s

The internal SQL syntax allows you to add an optimizer override here
(--DERBY-PROPERTIES insertMode=bulkInsert) which I think speeds it up
even more, though I'm not sure exactly what it buys you. This override
is used in the import code and in the SYSCS_UTIL.SYSCS_BULK_INSERT
system procedure, but you're not allowed to use it in your own SQL
queries, it seems. The (undocumented) SYSCS_BULK_INSERT procedure does
exactly what Rick's example does, except that it also adds the optimizer
override.

The procedure only works on old-style table functions, though. Does
anyone know what performance gains one can expect by using
SYSCS_BULK_INSERT instead of INSERT INTO ... SELECT FROM? Would it be
worth the effort to create a similar mechanism for the new-style table
functions?

-- 
Knut Anders

Re: faster inserts in big tables

Posted by Rick Hillegas <Ri...@Sun.COM>.
Hi Brian,

In a previous response Peter Ondruška noted that you could use bulk 
import to speed up your inserts if your source data lived in a properly 
formatted file.

Even if your source data does not live in a file, you can still get bulk 
import speed (and the benefits of your generated key) by using table 
functions--provided that you upgrade to 10.4. If you wrap your source 
data in a table function, then you can bulk insert your data as follows:

insert into MY_TABLE( NON_GEN_COL1, ... NON_GEN_COL_N )
select * from table( MY_TABLE_FUNCTION() ) s
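
As a rough illustration (all names here are placeholders), declaring such 
a table function might look something like this, where the EXTERNAL NAME 
points to a public static Java method that returns a java.sql.ResultSet 
over your source data:

    Statement s = conn.createStatement();
    s.executeUpdate(
        "create function MY_TABLE_FUNCTION() " +
        "returns table( COL_A bigint, COL_B bigint, PAYLOAD varchar(32672) for bit data ) " +
        "language java parameter style DERBY_JDBC_RESULT_SET reads sql data " +
        "external name 'com.example.MySource.read'");   // public static ResultSet read()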

For more information on table functions, see the white paper at 
http://developers.sun.com/javadb/reference/whitepapers/index.jsp

Hope this helps,
-Rick


>
> publicayers wrote:
>   
>> I have 10's of thousands of rows to add to a table, possibly 100's of 
>> thousands, and I'm wondering if there's anything else I can do to speed 
>> this up. The table could end up having a couple of million rows.
>>
>> This is what I've done so far:
>> * Using a PreparedStatement that gets reused with each insert.
>> * Set locking level to TABLE for that table.
>> * Turned off autocommit.
>> * Set the connection to READ_COMMIT.
>>
>> In addition to that, I'm also setting these system parameters, though 
>> not
>> necessarily to improve insert performance:
>> * derby.system.durability=test
>> * derby.storage.pageSize=32768
>>
>> The table has one odd feature: The last column is a VARCHAR(32672) FOR 
>> BIT DATA. I've tried setting the length to something smaller, but it 
>> didn't really seem to matter.
>>
>> The primary key is an auto generated int with another 2-column index on 
>> two BIGINT columns. Something I found interesting is that the inserts 
>> seem to go 2x faster if I have the 2-column index in place than if I 
>> have just the primary-key index.
>>
>> I'm running
>>     Derby 10.2.2
>>     JRE 1.6.0_07
>>     Windows XP SP2
>>
>> Is there anything else I can do to speed up row inserts?
>>
>> Thanks,
>> Brian
>>
>>
>>     
>
>   


Re: faster inserts in big tables

Posted by Bocete <ko...@gmail.com>.
You could make a helper class that will bundle several inserts into a single
query. 

sketch (rough Java; assumes imports of java.sql.Statement, java.sql.SQLException,
java.util.Queue, and java.util.ArrayDeque):

class InsertQueryBundle {
    private static final int MAX_ROWS = 100;               // rows per bundled INSERT
    private final Queue<String> valuesQueue = new ArrayDeque<String>();
    private final Statement stmt;
    private final String tableName, columns;

    InsertQueryBundle(Statement stmt, String tableName, String columns) {
        this.stmt = stmt; this.tableName = tableName; this.columns = columns;
    }

    void insertRow(String values) throws SQLException {
        if (valuesQueue.size() >= MAX_ROWS) flushValuesQueue();
        valuesQueue.add(values);
    }

    void flushValuesQueue() throws SQLException {
        if (valuesQueue.isEmpty()) return;
        StringBuilder builder = new StringBuilder(
            "INSERT INTO " + tableName + " (" + columns + ") VALUES ");
        String values;
        boolean first = true;
        while ((values = valuesQueue.poll()) != null) {
            if (!first) builder.append(", ");
            builder.append('(').append(values).append(')');
            first = false;
        }
        stmt.executeUpdate(builder.toString());     // INSERT needs executeUpdate, not executeQuery
    }
}
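
For example, assuming conn is an open java.sql.Connection with autocommit 
disabled (the table, columns, and literal values here are hypothetical, 
and real code would need to format/escape the values carefully):

    Statement stmt = conn.createStatement();
    InsertQueryBundle bundle = new InsertQueryBundle(stmt, "MY_TABLE", "COL_A, COL_B");
    bundle.insertRow("1, 100");
    bundle.insertRow("2, 200");
    bundle.flushValuesQueue();   // send whatever is still queued
    conn.commit();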


publicayers wrote:
> 
> I have 10's of thousands of rows to add to a table, possibly 100's of 
> thousands, and I'm wondering if there's anything else I can do to speed 
> this up. The table could end up having a couple of million rows.
> 
> This is what I've done so far:
> * Using a PreparedStatement that gets reused with each insert.
> * Set locking level to TABLE for that table.
> * Turned off autocommit.
> * Set the connection to READ_COMMIT.
> 
> In addition to that, I'm also setting these system parameters, though 
> not
> necessarily to improve insert performance:
> * derby.system.durability=test
> * derby.storage.pageSize=32768
> 
> The table has one odd feature: The last column is a VARCHAR(32672) FOR 
> BIT DATA. I've tried setting the length to something smaller, but it 
> didn't really seem to matter.
> 
> The primary key is an auto generated int with another 2-column index on 
> two BIGINT columns. Something I found interesting is that the inserts 
> seem to go 2x faster if I have the 2-column index in place than if I 
> have just the primary-key index.
> 
> I'm running
>     Derby 10.2.2
>     JRE 1.6.0_07
>     Windows XP SP2
> 
> Is there anything else I can do to speed up row inserts?
> 
> Thanks,
> Brian
> 
> 

-- 
View this message in context: http://www.nabble.com/faster-inserts-in-big-tables-tp20690051p20957504.html
Sent from the Apache Derby Users mailing list archive at Nabble.com.


Re: faster inserts in big tables

Posted by Peter Ondruška <pe...@gmail.com>.
Go with the latest Derby. If you load data from an external file, use
bulk import. Using a large log file may help.
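
For reference, a sketch of what the bulk import call looks like from JDBC 
(the table name, file name, and delimiter arguments below are placeholders):

    CallableStatement cs = conn.prepareCall(
        "CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE(?, ?, ?, ?, ?, ?, ?)");
    cs.setString(1, null);         // schema name (null = current schema)
    cs.setString(2, "MY_TABLE");
    cs.setString(3, "rows.del");   // placeholder data file
    cs.setString(4, null);         // column delimiter (null = comma)
    cs.setString(5, null);         // character delimiter (null = double quote)
    cs.setString(6, null);         // codeset (null = platform default)
    cs.setShort(7, (short) 0);     // 0 = append, non-zero = replace existing rows
    cs.execute();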

2008/11/25, publicayers@verizon.net <pu...@verizon.net>:
> I have 10's of thousands of rows to add to a table, possibly 100's of
> thousands, and I'm wondering if there's anything else I can do to speed
> this up. The table could end up having a couple of million rows.
>
> This is what I've done so far:
> * Using a PreparedStatement that gets reused with each insert.
> * Set locking level to TABLE for that table.
> * Turned off autocommit.
> * Set the connection to READ_COMMIT.
>
> In addition to that, I'm also setting these system parameters, though
> not
> necessarily to improve insert performance:
> * derby.system.durability=test
> * derby.storage.pageSize=32768
>
> The table has one odd feature: The last column is a VARCHAR(32672) FOR
> BIT DATA. I've tried setting the length to something smaller, but it
> didn't really seem to matter.
>
> The primary key is an auto generated int with another 2-column index on
> two BIGINT columns. Something I found interesting is that the inserts
> seem to go 2x faster if I have the 2-column index in place than if I
> have just the primary-key index.
>
> I'm running
>     Derby 10.2.2
>     JRE 1.6.0_07
>     Windows XP SP2
>
> Is there anything else I can do to speed up row inserts?
>
> Thanks,
> Brian
>