Posted to dev@jena.apache.org by "Andy Seaborne (JIRA)" <ji...@apache.org> on 2018/05/18 19:37:00 UTC

[jira] [Commented] (JENA-1550) Bulk loader for TDB2.

    [ https://issues.apache.org/jira/browse/JENA-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481097#comment-16481097 ] 

Andy Seaborne commented on JENA-1550:
-------------------------------------

This ticket comment describes work-in-progress (2018-05).

A system of 3 loaders:
 * basic - same algorithm as the current (Jena 3.7.0) {{tdb2.tdbloader}}, a transactional parse and insert into the database on a single thread.
 * sequential - the same algorithm as TDB tdbloader1, parse/load and create primary indexes in one pass then create secondary indexes one at a time. Uses one thread.
 * parallel - a new loader that is multi-threaded. Parse, node table building, indexing are all on separate threads.
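
As a rough illustration of the sequential strategy (primary index first, then secondary indexes one at a time), here is a minimal single-threaded sketch. It is not the TDB code: the class and method names are hypothetical, an in-memory {{TreeSet}} stands in for an on-disk index, and the index names SPO/POS/OSP are assumed for the example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch only: in-memory sorted sets stand in for disk indexes.
class SequentialIndexSketch {
    // Lexicographic order over fixed-length tuples of node ids.
    static final Comparator<long[]> ORDER = (a, b) -> Arrays.compare(a, b);

    // Rearrange an (s, p, o) tuple into the column order of another index.
    static long[] reorder(long[] spo, int[] cols) {
        return new long[] { spo[cols[0]], spo[cols[1]], spo[cols[2]] };
    }

    // Build one secondary index by a full scan of the primary index.
    static TreeSet<long[]> build(TreeSet<long[]> primary, int[] cols) {
        TreeSet<long[]> index = new TreeSet<>(ORDER);
        for (long[] spo : primary) {
            index.add(reorder(spo, cols));
        }
        return index;
    }

    static Map<String, TreeSet<long[]>> load(List<long[]> triples) {
        Map<String, TreeSet<long[]>> indexes = new LinkedHashMap<>();
        // Pass 1: load the data into the primary index only.
        TreeSet<long[]> spo = new TreeSet<>(ORDER);
        spo.addAll(triples);
        indexes.put("SPO", spo);
        // Later passes: one secondary index at a time, each from the primary.
        indexes.put("POS", build(spo, new int[] { 1, 2, 0 }));
        indexes.put("OSP", build(spo, new int[] { 2, 0, 1 }));
        return indexes;
    }
}
```

Building one index at a time means only the primary index and the index under construction are active at once, which is the point of contrast with the parallel loader.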

The basic loader works quite well with TDB2 because changes made in a transaction go directly to the storage tables. The OS schedules those writes asynchronously during the lifetime of the transaction, so less data remains to be sync'ed to disk at the end of the transaction.

The sequential loader is not much better than the basic loader, at least for an SSD. It remains for comparison.

The parallel loader currently attempts to execute all work at once using multiple threads. Doing all the work at the same time will reduce the OS file system cache efficiency of the node table and each index.

A phased approach may benefit large data in constrained environments and when writing to a rotating disk rather than an SSD (the cost of a cache miss is much higher for a disk than for an SSD). Experimentation is needed.

There are several things that may improve performance of loading such as NVMe SSDs and multiple physical devices for storage (especially when loading to disk - I/O bandwidth is a limiting factor).

Some results (figures obtained 2018-05-18):
{quote}machine: Ubuntu 18.04, SATA SSD, 32G RAM, quad core i7
 Data loaded: BSBM 200 million (200,031,975 triples), read from disk as "nt.gz".
{quote}
TPS = Triples per second

tdb2.tdbloader (writing to disk)
 {{Time = 6,045.02 seconds : Rate = 33,090 TPS}}

Parallel:
 {{Time = 1,180.57 seconds : Rate = 169,437 TPS}}

Sequential:
 {{Time = 3,227.84 seconds : Rate = 61,971 TPS}}

Basic:
 {{Time = 3,507.56 seconds : Rate = 57,029 TPS}}
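
The rates above are the triple count divided by the elapsed time. A small sketch of that arithmetic, using the figures reported in this comment (the class and method names are made up for the example):

```java
// Hypothetical helper: reproduces the rate arithmetic from the reported figures.
class LoadRateSketch {
    // TPS = triples loaded / elapsed seconds.
    static double triplesPerSecond(long triples, double seconds) {
        return triples / seconds;
    }

    // How many times faster a run is than a baseline run on the same data.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    public static void main(String[] args) {
        long triples = 200_031_975L; // BSBM 200 million
        System.out.printf("parallel:   %,.0f TPS%n", triplesPerSecond(triples, 1180.57));
        System.out.printf("sequential: %,.0f TPS%n", triplesPerSecond(triples, 3227.84));
        System.out.printf("basic:      %,.0f TPS%n", triplesPerSecond(triples, 3507.56));
        System.out.printf("parallel vs basic: %.2fx%n", speedup(3507.56, 1180.57));
    }
}
```

On these numbers the parallel loader comes out at roughly three times the rate of the basic loader.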

  
For comparison: TDB1 loaders, same data, same hardware:
 TDB1: loader1/SSD
 {{Time = 3,333.00 seconds : Rate = 60,015.54 TPS}}

TDB1: loader2/SSD
 {{Time = 3,078.00 seconds : Rate = 64,987 TPS}}

TDB1 loader2 starts to be significantly faster than loader1 around 500 million triples on this 32G machine.

Other learnings:

A single thread to parse, load the node table and produce index tuples runs at around 100K TPS.
 Splitting parsing and node table loading into two threads runs at 195K TPS.

Single thread:
{noformat}
DataToTuplesInline: Triples = 200,031,975 : time = 2,017.56 : rate = 99,145.294
{noformat}
It peaks at 115 kTPS around 5 million, then trails off around 130 million to around 80 kTPS.

Multi-thread:
{noformat}
Triples = 200,031,975 : time = 1,028.29 : rate = 194,528.378
{noformat}
It starts at 170 kTPS, jumps at 2.5 million to 190 kTPS, then jumps again at 22.5 million to 240+ kTPS. The overall average rate was still going up at the end of the run.
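
To make the two-thread split concrete, here is a minimal sketch of a parser thread handing batches to a node table thread over a bounded queue. This is not the loader code: all names are hypothetical, the "parser" just splits whitespace-separated terms (a toy format, not N-Triples), and a {{HashMap}} stands in for the node table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch only: a two-stage pipeline, not the TDB2 loader.
class TwoStageLoadSketch {
    // Marker batch telling the consumer that no more data is coming.
    private static final List<String[]> END = new ArrayList<>();

    // Stand-in for a node table: assigns each distinct term a numeric id.
    static final class NodeTable {
        private final Map<String, Long> ids = new HashMap<>();

        long idFor(String term) {
            Long id = ids.get(term);
            if (id == null) {
                id = (long) ids.size();
                ids.put(term, id);
            }
            return id;
        }
    }

    // Returns index tuples (s, p, o as node ids) in input order.
    static List<long[]> load(List<String> lines, int batchSize) {
        BlockingQueue<List<String[]>> queue = new ArrayBlockingQueue<>(4);
        NodeTable nodes = new NodeTable();
        List<long[]> tuples = new ArrayList<>();

        // Thread 1: "parse" lines into terms and pass them on in batches.
        Thread parser = new Thread(() -> {
            try {
                List<String[]> batch = new ArrayList<>();
                for (String line : lines) {
                    batch.add(line.trim().split("\\s+", 3));
                    if (batch.size() == batchSize) {
                        queue.put(batch);
                        batch = new ArrayList<>();
                    }
                }
                if (!batch.isEmpty()) {
                    queue.put(batch);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Thread 2: allocate node ids and produce index tuples.
        Thread nodeLoader = new Thread(() -> {
            try {
                for (List<String[]> b = queue.take(); b != END; b = queue.take()) {
                    for (String[] t : b) {
                        tuples.add(new long[] {
                            nodes.idFor(t[0]), nodes.idFor(t[1]), nodes.idFor(t[2]) });
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        parser.start();
        nodeLoader.start();
        try {
            parser.join();
            nodeLoader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return tuples;
    }
}
```

Passing batches rather than single triples keeps queue overhead low, and the bounded queue stops the parser from running far ahead of the node table thread. The batch size and queue capacity here are arbitrary choices for the sketch.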

> Bulk loader for TDB2.
> ---------------------
>
>                 Key: JENA-1550
>                 URL: https://issues.apache.org/jira/browse/JENA-1550
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB2
>    Affects Versions: Jena 3.7.0
>            Reporter: Andy Seaborne
>            Assignee: Andy Seaborne
>            Priority: Major
>
> Provide a bulk loader for TDB2.


