Posted to derby-dev@db.apache.org by Mike Matrigali <mi...@sbcglobal.net> on 2010/08/31 20:28:47 UTC

is there some way to support bulk insert without new syntax, in a zero admin way?

bulk insert used by import is actually just a hint to the
underlying system to use a different query execution technique for the
inserts.
Rather than add syntax and force users to direct the system it
would be nice as a zero admin db if the system could figure
out itself whether or not to do the work, when doing an insert
that results in more than one row.

The cost of the bulk insert execution is:
o it has to do a scan of the table to determine if the table
   is empty.  If it is not empty then none of the bulk insert
   optimization is possible, so none of the following work is
   done.
o It has to create a new table and new indexes for each of the
   existing indexes.
o it has to drop the old table and indexes on completion.
o it has to do system catalog updates to reflect the swapping of
   new table and indexes for old.
o it needs to force all datapages of the new table and indexes
   to disk and wait for all the I/O's to finish before returning.
o it benefits from not logging the actual data of all the inserts
   done.
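The cost/benefit steps above could be sketched roughly as follows. This is a hypothetical illustration, not Derby's real store API: the conglomerates are modeled as plain lists, and the page-sync and catalog-swap steps appear only as comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the bulk-insert path described above; conglomerates
// are modeled as lists, and logging/syncing/catalog work appears as comments.
public class BulkInsertSketch {
    /** Returns true if the bulk path was taken, false if we must fall back. */
    static boolean bulkInsert(List<Object[]> targetTable, List<Object[]> sourceRows) {
        // 1. Scan just far enough to see whether the table holds any row.
        if (!targetTable.isEmpty()) {
            return false; // not empty: none of the bulk-insert work is done
        }
        // 2. Create a new table (and, per existing index, a new index) and
        //    fill it without logging the inserted data itself.
        List<Object[]> newConglomerate = new ArrayList<>(sourceRows);
        // 3. Force all data pages of the new table and indexes to disk and
        //    wait for the I/O to finish (elided here).
        // 4. Update the system catalogs to swap new for old, then drop the
        //    old table and indexes (modeled as replacing the list contents).
        targetTable.clear();
        targetTable.addAll(newConglomerate);
        return true;
    }
}
```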


If handling this in the optimizer, one could easily estimate whether
there are rows in the table or not, and only build a bulk insert
plan if there are none.

Also there should be some threshold on the number of rows being
inserted before choosing a bulk insert plan.  One really only
wants to do this if there is a lot of data being inserted into
an empty table.  In the case of insert ... as select from ...
the optimizer definitely has an estimate for the number of rows
being inserted.  I am not sure what state this estimate is in
for table functions.
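A minimal sketch of such a cost decision, assuming the optimizer can supply a row-count estimate and an average row width; the threshold constant and method names are invented for illustration, not anything Derby exposes:

```java
// Hypothetical decision rule: use the bulk-insert plan only when the target
// table is (estimated to be) empty and the insert is large enough to pay for
// the table/index rebuild. The threshold value is an invented placeholder.
public class BulkInsertHeuristic {
    static final long MIN_BULK_BYTES = 1_000_000L;

    static boolean chooseBulkInsertPlan(long estimatedTargetRows,
                                        long estimatedInsertRows,
                                        long estimatedBytesPerRow) {
        if (estimatedTargetRows > 0) {
            return false; // optimization only applies to an empty table
        }
        return estimatedInsertRows * estimatedBytesPerRow >= MIN_BULK_BYTES;
    }
}
```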

If the system is not doing so already, any time you do a create table as 
select we could be using the "bulk" optimization.   I don't know if
this is the case.  Here one can avoid the extra system catalog 
overhead, and I would argue it is reasonable to assume a large amount
of data is being inserted, so that the "bulk" optimization is appropriate.


Re: is there some way to support bulk insert without new syntax, in a zero admin way?

Posted by Rick Hillegas <ri...@oracle.com>.
Thanks to everyone for the additional discussion. I have logged two 
issues based on our consensus:

DERBY-4788 - Master issue for automatically applying the bulk-insert 
optimization when inserting large data sets into empty tables.
DERBY-4789 - Always apply the bulk-insert optimization when the source 
stream involves a table function.

Thanks,
-Rick

Bryan Pendleton wrote:
>>> I think that I would be satisfied with something less ambitious: If
>>> the source stream for the INSERT involves a table function, then
>>> always attempt a bulk insert--naturally, the bulk insert optimization
>>> would be applied only if the target table was empty. This wouldn't
>
> +1 to improved performance with no non-standard language changes!
>
> +1 also to the suggestion of Knut Anders to enable a standards-compliant
> TRUNCATE TABLE and then gain import speed that way.
>
> The entire discussion has been interesting and instructive, thank you
> for helping us understand the issues, even though the rest of us aren't
> commenting :)
>
> thanks,
>
> bryan
>


Re: is there some way to support bulk insert without new syntax, in a zero admin way?

Posted by "Dag H. Wanvik" <da...@Oracle.com>.
Bryan Pendleton <bp...@gmail.com> writes:

> The entire discussion has been interesting and instructive, thank you
> for helping us understand the issues, even though the rest of us aren't
> commenting :)

+1

Dag

Re: is there some way to support bulk insert without new syntax, in a zero admin way?

Posted by Bryan Pendleton <bp...@gmail.com>.
>> I think that I would be satisfied with something less ambitious: If
>> the source stream for the INSERT involves a table function, then
>> always attempt a bulk insert--naturally, the bulk insert optimization
>> would be applied only if the target table was empty. This wouldn't

+1 to improved performance with no non-standard language changes!

+1 also to the suggestion of Knut Anders to enable a standards-compliant
TRUNCATE TABLE and then gain import speed that way.

The entire discussion has been interesting and instructive, thank you
for helping us understand the issues, even though the rest of us aren't
commenting :)

thanks,

bryan

Re: is there some way to support bulk insert without new syntax, in a zero admin way?

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Rick Hillegas wrote:
> Thanks for starting this new thread, Mike. For people who may have 
> missed the earlier conversation about bulk insert, this is a 
> continuation of a discussion started here: 
> http://old.nabble.com/bulk-insert-to29554227.html#a29554227
> 
> I think that I would be satisfied with something less ambitious: If the 
> source stream for the INSERT involves a table function, then always 
> attempt a bulk insert--naturally, the bulk insert optimization would be 
> applied only if the target table was empty. This wouldn't give us 
> simple, standard syntax for the replace functionality enjoyed by the 
> import procedures, but the incremental improvement would be significant.
sounds like a good first step, and maybe with the implementation the 
following increments could be costed and laid out for further
enhancement by others.
1) make it a compile-time decision (not sure if this is being proposed 
above or not).
2) once it is a compile-time decision, invalidate the plan (or disable the 
optimization) after it is executed once.
3) enhance it to also include other source streams when appropriate. 
Something like when the estimated number of rows * bytes per row is
bigger than X (where X could be determined by some one-time test 
comparing performance with bulk insert on and off for various numbers of rows).
I think the rows * size of row metric is probably reasonable, as what is
being saved is the overhead of logging that amount of data.   The estimates
probably want to be pushed down to and pulled back from the store, which 
already owns all the other estimates.

It is nice that this approach gets all the users
the improvement automatically - and we just need to keep an eye out
in case it slows anyone down noticeably.  My take is that if we make it
a compile-time decision, doing the empty table check at compile time
to decide whether the plan should consider a bulk insert,
then probably no one will see any slower performance.  I think the
optimizer check can just use the existing very quick row count
checks, but we should verify that those do sometimes return 0 (in the past
I know there have been optimizer divide-by-zero problems on row counts, and I 
don't remember how they were fixed).  We will still
have to do a real serializable empty check again at execute time, but 
only if the table was also empty at compile time.  I am thinking this helps 
if there is some application out there that in the worst case does insert as 
select from a table function once for each row in the table - or any
application that uses this paradigm to incrementally load rows into
the table rather than once as an initial load.
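Increments 1 and 2 above could be sketched like this, assuming a plan object that records the compile-time choice; the class and its fields are hypothetical, not Derby's actual plan representation:

```java
// Hypothetical sketch of a compile-time bulk-insert decision that is
// re-verified at execute time and disabled after one bulk execution.
public class CompiledInsertPlan {
    private boolean bulkPlan;

    CompiledInsertPlan(long compileTimeRowEstimate) {
        // Only consider the bulk plan if the table looked empty at compile
        // time (any positive row-count estimate means "not empty").
        this.bulkPlan = compileTimeRowEstimate <= 0;
    }

    /** Returns true if the bulk path was actually taken. */
    boolean execute(boolean tableEmptyAtExecuteTime) {
        // The real execute-time check must be a serializable empty check.
        if (bulkPlan && tableEmptyAtExecuteTime) {
            bulkPlan = false; // increment 2: disable after one bulk execution
            return true;
        }
        return false; // ordinary insert path
    }
}
```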

> 
> More comments inline...
> 
> Mike Matrigali wrote:
>> bulk insert used by import is actually just a hint to the
>> underlying system to use a different query execution technique for the
>> inserts.
>> Rather than add syntax and force users to direct the system it
>> would be nice as a zero admin db if the system could figure
>> out itself whether or not to do the work, when doing an insert
>> that results in more than one row.
>>
>> The cost of the bulk insert execution is:
>> o it has to do a scan of the table to determine if the table
>>   is empty.  If it is not empty then none of the bulk insert
>>   optimization is possible, so none of the following work is
>>   done.
> Just to be clear here, we don't do a full table scan. We just read one row.
Right, and to be even clearer we scan the table until we see one row. 
At the language layer I think this just means we do a next and it either
returns no rows or one row.  At the store layer, in the worst case this
could mean traversing many pages and doing many I/O's (this would be a weird 
case where the table has been populated and then all the rows have been
deleted, but the deleted rows and pages have not been reclaimed yet).
Note that the check for an empty table can also cause a thread to block
where doing the insert without the optimization may not.
The usual case would be just one page read and a scan of that page.
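The "scan until we see one row" check can be sketched as below, modeling each row as just a deleted flag; with deleted-but-unreclaimed rows the loop may visit many entries before concluding the table is empty (all names here are illustrative only):

```java
import java.util.List;

// Hypothetical model of the empty-table check: scan until the first live
// (non-deleted) row. Worst case we visit every deleted-but-unreclaimed row.
public class EmptyCheck {
    static boolean isEmpty(List<Boolean> rowDeletedFlags) {
        for (boolean deleted : rowDeletedFlags) {
            if (!deleted) {
                return false; // found a live row: the table is not empty
            }
        }
        return true; // only deleted rows (or none) seen
    }
}
```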
>> o It has to create a new table and new indexes for each of the
>>   existing indexes.
>> o it has to drop the old table and indexes on completion.
>> o it has to do system catalog updates to reflect the swapping of
>>   new table and indexes for old.
>> o it needs to force all datapages of the new table and indexes
>>   to disk and wait for all the I/O's to finish before returning.
>> o it benefits from not logging the actual data of all the inserts
>>   done.
>>
>>
>> If handling this in the optimizer, one could easily estimate if
>> there are rows in the table or not.  And only build a bulk insert
>> plan if there are none.
>>
>> Also there should be some threshold on the number of rows being
>> inserted before choosing a bulk insert plan.  One really only
>> wants to do this if there is a lot of data being inserted into
>> an empty table.  In the case of insert ... as select from ...
>> the optimizer definitely has an estimate for the number of rows
>> being inserted.  I am not sure what state this estimate is in
>> for table functions.
> The VTICosting interface provides a mechanism for declaring how many 
> rows a table function returns. It's an awkward mechanism, and in order to 
> get the proposed scheme to work, we would have to expose the magic 
> threshold which triggers the optimizer to select a bulk insert 
> strategy. I'm not keen on exposing optimizer magic in Derby's public api.
I agree, we have never really done a good job with the costing stuff. 
At this point about the only thing I was hoping we could rely on is
a row estimate.
> 
> I think it's important to note a significant characteristic shared by 
> external data sources, whether they are files we import or table 
> functions which drive inserts: the Derby optimizer has limited 
> visibility into external data sources and heuristics in this area are 
> likely to be crude.
>>
>> If the system is not doing so already any time you do a create table 
>> as select we could be using the "bulk" optimization.   I don't know if
>> this is the case.  In this case one can avoid the extra system catalog 
>> overhead, and would argue that it is reasonable to assume a large amount
>> of data is being inserted so that "bulk" optimization is reasonable.
> Note that right now the CREATE TABLE AS SELECT statement requires that 
> you specify WITH NO DATA. That means that all you can do with this 
> statement is create the shape of an empty table. You cannot use this 
> statement to populate a table. This part of your proposal will be a 
> welcome improvement, however, when we allow this statement to populate 
> tables.
ok, I had just glanced at the doc since I vaguely remembered someone had 
added the feature, and didn't catch that we don't support
the WITH DATA option.  Too bad, as this seems like exactly the case 
we are looking to solve here.
> 
> Thanks,
> -Rick
>>
>>
> 
> 


Re: is there some way to support bulk insert without new syntax, in a zero admin way?

Posted by Knut Anders Hatlen <kn...@oracle.com>.
Rick Hillegas <ri...@oracle.com> writes:

> I think that I would be satisfied with something less ambitious: If
> the source stream for the INSERT involves a table function, then
> always attempt a bulk insert--naturally, the bulk insert optimization
> would be applied only if the target table was empty. This wouldn't
> give us simple, standard syntax for the replace functionality enjoyed
> by the import procedures, but the incremental improvement would be
> significant.

If we can live with two statements instead of one, we could use this
standard syntax:

  truncate table t;    -- SQL:2008 / DERBY-268
  insert into t select * from table ( my_vti () ) as tt;

Since the table is empty after TRUNCATE TABLE, the suggested
optimization for inserts from table functions should kick in here.

This would however require someone to verify that our TRUNCATE TABLE
implementation is standards compliant and enable it in production
builds.

-- 
Knut Anders

Re: is there some way to support bulk insert without new syntax, in a zero admin way?

Posted by Rick Hillegas <ri...@oracle.com>.
Thanks for starting this new thread, Mike. For people who may have 
missed the earlier conversation about bulk insert, this is a 
continuation of a discussion started here: 
http://old.nabble.com/bulk-insert-to29554227.html#a29554227

I think that I would be satisfied with something less ambitious: If the 
source stream for the INSERT involves a table function, then always 
attempt a bulk insert--naturally, the bulk insert optimization would be 
applied only if the target table was empty. This wouldn't give us 
simple, standard syntax for the replace functionality enjoyed by the 
import procedures, but the incremental improvement would be significant.

More comments inline...

Mike Matrigali wrote:
> bulk insert used by import is actually just a hint to the
> underlying system to use a different query execution technique for the
> inserts.
> Rather than add syntax and force users to direct the system it
> would be nice as a zero admin db if the system could figure
> out itself whether or not to do the work, when doing an insert
> that results in more than one row.
>
> The cost of the bulk insert execution is:
> o it has to do a scan of the table to determine if the table
>   is empty.  If it is not empty then none of the bulk insert
>   optimization is possible, so none of the following work is
>   done.
Just to be clear here, we don't do a full table scan. We just read one row.
> o It has to create a new table and new indexes for each of the
>   existing indexes.
> o it has to drop the old table and indexes on completion.
> o it has to do system catalog updates to reflect the swapping of
>   new table and indexes for old.
> o it needs to force all datapages of the new table and indexes
>   to disk and wait for all the I/O's to finish before returning.
> o it benefits from not logging the actual data of all the inserts
>   done.
>
>
> If handling this in the optimizer, one could easily estimate if
> there are rows in the table or not.  And only build a bulk insert
> plan if there are none.
>
> Also there should be some threshold on the number of rows being
> inserted before choosing a bulk insert plan.  One really only
> wants to do this if there is a lot of data being inserted into
> an empty table.  In the case of insert ... as select from ...
> the optimizer definitely has an estimate for the number of rows
> being inserted.  I am not sure what state this estimate is in
> for table functions.
The VTICosting interface provides a mechanism for declaring how many 
rows a table function returns. It's an awkward mechanism, and in order to 
get the proposed scheme to work, we would have to expose the magic 
threshold which triggers the optimizer to select a bulk insert 
strategy. I'm not keen on exposing optimizer magic in Derby's public api.

I think it's important to note a significant characteristic shared by 
external data sources, whether they are files we import or table 
functions which drive inserts: the Derby optimizer has limited 
visibility into external data sources and heuristics in this area are 
likely to be crude.
>
> If the system is not doing so already any time you do a create table 
> as select we could be using the "bulk" optimization.   I don't know if
> this is the case.  In this case one can avoid the extra system catalog 
> overhead, and would argue that it is reasonable to assume a large amount
> of data is being inserted so that "bulk" optimization is reasonable.
Note that right now the CREATE TABLE AS SELECT statement requires that 
you specify WITH NO DATA. That means that all you can do with this 
statement is create the shape of an empty table. You cannot use this 
statement to populate a table. This part of your proposal will be a 
welcome improvement, however, when we allow this statement to populate 
tables.

Thanks,
-Rick
>
>