Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/11/06 15:43:00 UTC

Batched update

JENA-528

== History

The BulkUpdateHandler interface is supposed to help systems optimize 
large(r) scale addition and deletion of triples (it's graph-centric so 
datasets don't feature here; it predates datasets).

The API contract is that a collection of triples can be added or 
deleted in a single operation.  There is no possibility of looking at 
the data while a bulk operation is in progress (it's a single API 
call).  The default implementation, SimpleBulkUpdateHandler, turned 
these operations into a loop.
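
For illustration, a minimal sketch of that contract, assuming the 
Jena 2.x packages of the time (the helper class and method are mine, 
not part of the API):

    import java.util.List;

    import com.hp.hpl.jena.graph.Graph;
    import com.hp.hpl.jena.graph.Triple;

    class BulkAddExample {
        // One API call covers the whole collection; nothing can
        // observe the graph part-way through the operation.
        static void addAll(Graph graph, List<Triple> triples) {
            graph.getBulkUpdateHandler().add(triples);
            // Deletion has the same shape:
            //   graph.getBulkUpdateHandler().delete(triples);
        }
    }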

This contract has problems: how big is a batch? Does it have to reside 
in memory? (The nearest to avoiding that is pull-style passing in of 
an Iterator, but that is hard to use.)

ARP (the RDF/XML parser) did use batching: it built an array of 1000 
triples and sent them into a model in such batches.  Other parsers did 
not, though they can signal the start/end of a parse run via a 
different mechanism.
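
The batch-and-flush pattern looks roughly like this (an illustrative 
sketch, not ARP's actual code):

    import java.util.ArrayList;
    import java.util.List;

    import com.hp.hpl.jena.graph.Graph;
    import com.hp.hpl.jena.graph.Triple;

    class BatchingSink {
        private static final int BATCH_SIZE = 1000;
        private final Graph graph;
        private final List<Triple> buffer =
            new ArrayList<Triple>(BATCH_SIZE);

        BatchingSink(Graph graph) { this.graph = graph; }

        // Called once per parsed triple; each full buffer is pushed
        // into the graph as a single bulk operation.
        void triple(Triple t) {
            buffer.add(t);
            if (buffer.size() >= BATCH_SIZE)
                flush();
        }

        // Call at end of parse to push the final partial batch.
        void flush() {
            if (!buffer.isEmpty()) {
                graph.getBulkUpdateHandler().add(buffer);
                buffer.clear();
            }
        }
    }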

= RDB

The original work was triggered by RDB.

The gain was because a single DB transaction could be used for multiple 
additions.  Without any other information, the current state of the 
JDBC connection would be used, and that could be autocommit, which has 
severe limitations.
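
In generic JDBC terms (not RDB's actual code; the table layout here is 
made up), the win looks like this:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class TxBatchInsert {
        static void insertAll(Connection conn, String[][] rows)
                throws SQLException {
            conn.setAutoCommit(false);  // one transaction for the batch
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO triples(s, p, o) VALUES (?, ?, ?)");
            try {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.setString(3, row[2]);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();          // one commit, not one per row
            } finally {
                ps.close();
            }
        }
    }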

= SDB

SDB does have a bulk loader - it uses temporary tables in the database 
to accumulate and prepare the data to insert (e.g. de-duplication). 
It also gets away from the JDBC/autocommit issue.

The temporary tables are not always transaction safe (depends on the 
DB).  SDB even copes with bulk deletion (though I'm unclear about what 
happens if a triple is inserted and deleted in the same batch: inserts 
and deletes are treated as two separate groups, so interleaving is 
lost).

A batch is ideally in the 10k-20k range.  These batches accumulate 
over multiple calls - it is not a single API operation.

Batching is done internally (even if the client application batches as 
well).  It is triggered by explicit SDB-specific calls or by the graph 
events GraphEvents.startRead and GraphEvents.finishRead.  These can be 
sent manually by the application; they cause the appropriate GraphSDB 
operations to be called, and those can be called explicitly as well.

Batched additions can't be seen by Graph.find during the batch update.

http://jena.apache.org/documentation/sdb/loading_data.html
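
Signalling via graph events might look like this (a sketch; the loop 
body is ordinary Graph.add):

    import java.util.Iterator;

    import com.hp.hpl.jena.graph.Graph;
    import com.hp.hpl.jena.graph.GraphEvents;
    import com.hp.hpl.jena.graph.Triple;

    class SdbBulkSignal {
        // graph is assumed to be a GraphSDB; the events switch its
        // internal batching on and off.
        static void loadBatched(Graph graph, Iterator<Triple> triples) {
            graph.getEventManager().notifyEvent(graph,
                                                GraphEvents.startRead);
            try {
                while (triples.hasNext())
                    graph.add(triples.next());  // accumulated by SDB
            } finally {
                graph.getEventManager().notifyEvent(graph,
                                                    GraphEvents.finishRead);
            }
        }
    }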

The bulk triple API operations were overlaid on this mechanism.

= TDB

TDB has a bulk loader - it loads empty databases.  Its idea of a batch 
is the whole data stream to be loaded - millions of triples, beyond 
what can be buffered in RAM.  It does not need to know the end of the 
batch when it starts.

Bulk upload is not transactional.  Typically, it's a separate step using 
one of the bulk loader utilities. It handles triples and quads.  It does 
not use the BulkUpdateHandler.
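
As a sketch (the Java entry point here is my assumption from the TDB 
APIs of the era; the usual route is the command-line tdbloader 
utility):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDBFactory;
    import com.hp.hpl.jena.tdb.TDBLoader;

    class TdbBulkLoad {
        public static void main(String[] args) {
            // An empty location - the bulk loader targets fresh
            // databases.
            Dataset ds = TDBFactory.createDataset("/path/to/DB");
            // The whole stream is the "batch"; no end-of-batch
            // knowledge is needed up front.
            TDBLoader.loadModel(ds.getDefaultModel(), "file:data.nt");
        }
    }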

For TDB, batches into a non-empty database are not special - they 
could be, and it might be advantageous in some situations, but 
currently TDB does nothing special in this case.  In the future, 
Lizard (which is derived from TDB) would like to see batches for 
insertion; it could make use of bulk insertion even on non-empty 
databases.

TDB does nothing special about deletes (it does, separately, have an 
optimized path for Graph.remove and Graph.clear).

When active, the bulk loader assumes total control over the database - 
any other operation (e.g. looking at the data) is likely to go wrong 
(very, very wrong!) - and it manipulates database details at a very 
low level.

For hundreds of millions of triples, bulk loading is the only way to go.

== Towards Requirements

So we have related-but-different mechanisms in different places.

* Is bulk deletion an issue worth addressing?

Do any other systems have bulk optimizations for deletion (other than 
"delete all")?

* What about a mixture of adds/deletes?

* What is the contract, e.g. for parallel use of Graph.find?

* What's the unit?  Graphs, datasets, other?

Separately, there is a graph operation Graph.remove(S,P,O) where S,P,O 
can be Node.ANY so it's a pattern.
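
For instance (a sketch; the predicate URI is made up):

    import com.hp.hpl.jena.graph.Graph;
    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.graph.NodeFactory;

    class PatternRemove {
        // Removes every triple matching the pattern; each slot may
        // be Node.ANY.
        static void dropPredicate(Graph graph, String predicateUri) {
            graph.remove(Node.ANY,
                         NodeFactory.createURI(predicateUri),
                         Node.ANY);
        }
    }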

	Andy