Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/11/06 15:43:00 UTC
Batched update
JENA-528
== History
The BulkUpdateHandler interface is supposed to help systems optimize
large(r) scale addition and deletion of triples (it's graph-centric, so
datasets don't feature here; it predates datasets).
The API contract is that a collection of triples can be added or
deleted in a single operation. There is no possibility of looking at
the data while a bulk operation is in progress (it's a single API call).
The default implementation, SimpleBulkUpdateHandler, turned these
into a loop.
This contract has problems: how big is a batch? Does it have to reside
in memory? (The nearest to avoiding that is pull-style passing in an
Iterator, but that is hard to use.)
ARP (the RDF/XML parser) did use batching. It built an array of 1000
triples and sent them into a model in such batches. Other parsers did
not; they can signal start/end of a parse run, though, via a different
mechanism.
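ARP's approach amounts to a fixed-size buffer flushed as whole batches. A plain-Java illustration (the sink and its callback are hypothetical stand-ins, not ARP's actual classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// ARP-style batching: accumulate triples in a fixed-size buffer and
// hand each full batch to the model in a single call.
class BatchingSink {
    static final int BATCH_SIZE = 1000;   // ARP used batches of 1000 triples

    private final List<String> buffer = new ArrayList<>(BATCH_SIZE);
    private final Consumer<List<String>> model;  // stand-in for the model's bulk add

    BatchingSink(Consumer<List<String>> model) { this.model = model; }

    void triple(String t) {
        buffer.add(t);
        if (buffer.size() == BATCH_SIZE) flush();
    }

    // Called at end-of-parse so a partial final batch is not lost.
    void close() { if (!buffer.isEmpty()) flush(); }

    private void flush() {
        model.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```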
= RDB
The original work was triggered by RDB.
The gain came from using a single DB transaction for multiple
additions. Without any other information, the current state of the JDBC
connection would be used, and that could be autocommit, which has
severe limitations.
= SDB
SDB does have a bulk loader - it manipulates temporary tables in the
database to accumulate and manipulate the data to insert (e.g.
de-duplication). It also gets away from the JDBC/autocommit issue.
The temporary tables are not always transaction-safe (depends on the
DB). SDB even copes with bulk deletion (but I'm unclear about what
happens if a triple is inserted and deleted in the same batch, as
inserts and deletes are treated as two separate groups so interleaving
is lost).
A batch is ideally in the 10k-20k range. These batches accumulate - it
is not a single API operation.
Batching is done internally (even if the client app batched as well).
It is triggered by explicit SDB-specific calls or by the graph events
GraphEvents.startRead and GraphEvents.finishRead. These can be sent
manually by the application; they cause the appropriate GraphSDB
operations to be called, and those can also be called explicitly.
Batched additions can't be seen by Graph.find during the batch update.
http://jena.apache.org/documentation/sdb/loading_data.html
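The event-driven shape of this can be sketched in plain Java. This is a simplified stand-in, not GraphSDB; the real triggers are the GraphEvents.startRead / GraphEvents.finishRead events, modelled here as two plain methods:

```java
import java.util.ArrayList;
import java.util.List;

// SDB-style batching sketch: between a start event and a finish event,
// added triples accumulate in a buffer; the finish event flushes them
// to the store in one bulk step. find() during the batch does not see
// the buffered triples - matching the SDB behaviour described above.
class EventBatchedGraph {
    private final List<String> store   = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean batching = false;

    void startBulk()  { batching = true; }                      // cf. GraphEvents.startRead
    void finishBulk() {                                         // cf. GraphEvents.finishRead
        store.addAll(pending);
        pending.clear();
        batching = false;
    }

    void add(String triple) {
        if (batching) pending.add(triple);   // buffered, invisible to find()
        else store.add(triple);
    }

    boolean find(String triple) { return store.contains(triple); }
}
```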
The bulk triple API operations were overlaid on this mechanism.
= TDB
TDB has a bulk loader - it loads empty databases. Its idea of a batch
is the whole data stream to be loaded - millions of triples, beyond
what can be buffered in RAM. It does not need to know the end of the
batch when it starts one.
Bulk upload is not transactional. Typically, it's a separate step using
one of the bulk loader utilities. It handles triples and quads. It does
not use the BulkUpdateHandler.
For TDB, batches into a non-empty database are not special - they could
be, and it might be advantageous in some situations, but currently TDB
does nothing special in this case. In the future, Lizard (which is
derived from TDB) would like to see batches for insertion; it could
make use of bulk insertion even on non-empty databases.
TDB does nothing special about deletes (it does, separately, have an
optimized path for Graph.remove and Graph.clear).
When active, the bulk loader assumes total control over the database -
any other operation (e.g. looking at the data) is likely to go wrong
(very, very wrong!) - and it manipulates database details at a very low
level.
For hundreds of millions of triples, bulk loading is the only way to go.
== Towards Requirements
So we have related-but-different mechanisms in different places.
* Is bulk deletion an issue worth addressing?
Do any other systems have bulk optimizations for deletion (other than
"delete all")?
* What about a mixture of adds/deletes?
* What is the contract e.g. parallel uses of Graph.find?
* What's the unit? Graphs, datasets, something else?
Separately, there is a graph operation Graph.remove(S,P,O) where S, P
and O can be Node.ANY, so it's a pattern.
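A wildcard remove of that shape can be sketched in plain Java. ANY here plays the role of Node.ANY, and PatternGraph is an illustration of the semantics, not Jena's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Graph.remove(S,P,O) where any slot may be a wildcard:
// a triple matches when every non-wildcard slot is equal.
class PatternGraph {
    static final String ANY = null;  // stand-in for Node.ANY

    record Triple(String s, String p, String o) {}

    private final List<Triple> triples = new ArrayList<>();

    void add(String s, String p, String o) { triples.add(new Triple(s, p, o)); }

    // Remove every triple matching the (possibly wildcarded) pattern.
    void remove(String s, String p, String o) {
        triples.removeIf(t -> matches(s, t.s()) && matches(p, t.p()) && matches(o, t.o()));
    }

    private static boolean matches(String pattern, String value) {
        return pattern == ANY || pattern.equals(value);
    }

    int size() { return triples.size(); }
}
```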
Andy