Posted to users@jena.apache.org by Isroel Kogan <is...@gmail.com> on 2020/06/20 15:10:15 UTC

TDB2 parallel load on cloud SSD and other observations/questions

Hi,

I am also a newcomer to the RDF world - and particularly Jena, which I started using this week.

Here are a couple of observations I have made over the last few days exploring different options.

Local Machine (specs):

Ubuntu 18.04
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
16GB RAM
512GB SSD (NVMe).


The following compares loading a file in compressed vs decompressed format - both with the parallel loader.

file:
docstrings_triples.nq
size: 28GB

cmd:
time tdb2.tdbloader --loader=parallel --loc=test1graphdb docstrings_triples.nq > tdb2.log1 2>&1

:: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s

real    22m46.346s
user    120m46.591s
sys    3m22.698s


file:
docstrings_triples.nq.bz2
size: 542M

cmd:

time tdb2.tdbloader --loader=parallel --loc=test2graphdb docstrings_triples.nq.bz2 > tdb2.log2 2>&1

:: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s


real    37m8.182s
user    109m42.970s
sys    6m27.426s

resulting DB size (both runs): 30GB

confirmed the two databases are identical via diff.

pbzip2 ran in 84s

Less rigorously I noticed a similar gain in speed for other files.
Is this expected behaviour? What factors influence this?
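
(A sketch of what I may try next, assuming pbzip2 is installed and that the loader accepts "-" for stdin after "--", which I have not verified; "test3graphdb" is just a fresh placeholder location:)

$ pbzip2 -dk docstrings_triples.nq.bz2   # parallel decompress, keep the .bz2
$ time tdb2.tdbloader --loader=parallel --loc=test3graphdb docstrings_triples.nq > tdb2.log3 2>&1

# or, as a single pipeline:
$ pbzip2 -dc docstrings_triples.nq.bz2 | tdb2.tdbloader --loader=parallel --loc=test3graphdb -- - > tdb2.log3 2>&1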

SSD - local vs cloud.

On my local machine, when running the parallel loader, cores were working at over 70% capacity and there was little IO-induced downtime.

GCP instance specs:

20 CPU
32GB RAM
6TB "local SSD" storage
the local SSD storage offers the best performance for reducing IO latency - it is physically attached to the instance - as per GCP.

A few cores were working at near capacity, while the vast majority sat idle (near 0%) with occasional spikes; the average load translates to about 20% utilization. As I've seen others write here, this is a difference others have noted.
How can this be addressed? Buffer size? (I don't have a deep enough understanding.)
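
(Two knobs I can think of, as a rough sketch: the JVM heap, since the stock Jena wrapper scripts appear to read the JVM_ARGS environment variable, and per-core monitoring to see which threads are actually busy. The -Xmx value and file/location names below are only examples, not recommendations.)

$ JVM_ARGS="-Xmx8G" tdb2.tdbloader --loader=parallel --loc=testgraphdb data.nq
$ mpstat -P ALL 5     # per-core utilization every 5 seconds (sysstat package)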


Another recurring pattern is the reduction in batch size.
I've been running a load job on my GCP instance for almost a day (23+ hours).

file size: 93GB
triples: 472m

batch size decreased from 160k range to under 1k, while processing time per batch increased from a few seconds to over 10 min. All this time average CPU usage has remained steady, as has RAM usage.
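
(A crude way to quantify the slowdown from the loader log, assuming the standard progress lines of the form "HH:MM:SS INFO  loader  :: Add: N file (Batch: ... / Avg: ...)" and one line per 500,000 quads; the midnight-wrap handling below is naive.)

$ awk '/:: Add:/ {
    split($1, t, ":"); s = t[1]*3600 + t[2]*60 + t[3]
    if (prev != "") {
        d = s - prev; if (d < 0) d += 86400
        printf "%s  %d s per 500k\n", $6, d
    }
    prev = s
}' tdb2.log1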

I don't understand how all of this works with indexing. Is this expected behaviour? Besides a locally attached SSD, I've thrown an overkill of hardware at it.

thanks


Re: TDB2 parallel load on cloud SSD and other observations/questions

Posted by Andy Seaborne <an...@apache.org>.
Hi Isroel,

It is worth trying each of the loaders to see how they perform on your 
machine.  There is a lot of parallel work when unlimited parallel 
loading of quads happens, so maybe it is a mismatch to the machine setup. 
Databases created by any loader are compatible.
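
For example (loader names as listed in the TDB2 documentation; DB1-DB3 
are just placeholder locations, each run starting from an empty 
location so the results are comparable):

  tdb2.tdbloader --loader=phased     --loc=DB1 data.nq
  tdb2.tdbloader --loader=sequential --loc=DB2 data.nq
  tdb2.tdbloader --loader=parallel   --loc=DB3 data.nq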

On 22/06/2020 16:57, Isroel Kogan wrote:
> thank you Rob - I have confused the terminology. Indeed each run processes 0.5m quads.
> what is UOM of the batch loading? M/s?
> 
> Looking at the output of iotop - the 3 main threads - which comprise the lion's share of the activity - have pretty steady reads of about 900-950 M/s with little variation beyond that. Write varies a little more - but 2-6 M/s. What is puzzling to me is that in a configuration with twice the CPU and twice the RAM, the load average remained the same in absolute terms.
> RAM usage was steady at about 8GB in the 32GB configuration - now it's been steady at 4.8GB (16GB configuration) -
> in both cases it is severely underutilized.

That might be due to how much mmap memory the process can have.  There 
is a per-process restriction.
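
(One Linux limit that can affect heavily memory-mapped stores is 
vm.max_map_count - whether that is the specific restriction meant here 
is an assumption, but it is cheap to check and raise:)

  sysctl vm.max_map_count                      # show the current limit
  sudo sysctl -w vm.max_map_count=1048576      # raise it (example value only)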

----
I'm not clear if you are loading an empty database or an existing one. 
It makes a difference.

The indexes are sorted (they are B+Trees) so as the database grows, the 
depth of the tree (size of sorted data) grows and insert rates drop.
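
As a rough illustration only (the fanout figure is a guess, not the 
actual TDB2 node size): with ~200 keys per B+Tree node, 472 million 
quads give a tree of depth about log_200(472,000,000) ~= 4, so each 
insert touches ~4 nodes per index; once those nodes stop fitting in the 
OS file cache, many of them become random reads from disk, which is 
where the insert rate falls off.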

----

Another factor is the nature of the data: if there are many "large 
literals" such as long strings, there is more byte-shifting needed, which 
slows the first phase, not the indexing steps.
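
A quick, crude way to gauge that is the average line length of the 
N-Quads file (long literals inflate it); for example:

  awk '{ total += length($0) } END { print total/NR, "bytes per line (average)" }' docstrings_triples.nq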

     Andy

> 
> I'm not conversant in Java - I've been trying jstat and jmap - the latter hasn't been working. How best to verify the heap size? What else can affect things?
> 
> On 2020/06/22 09:01:36, Rob Vesse <rv...@dotnetrdf.org> wrote:
>> Isroel
>>
>> I think there might be a fundamental misunderstanding happening about batch sizes here.  The batch sizes are fixed for a run and never change; the "batch size" you refer to is a speed calculation, e.g.
>>
>>      19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
>>      19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
>>      19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
>>      19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
>>      19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
>>      19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
>>      19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)
>>
>> Note that each batch is 500,000 quads as the number after "Add:" increases by 500,000 each time.  The "Batch" in brackets is referring to the calculated loading speed for the current batch, where "Avg" is the calculated loading speed over the entire load.
>>
>> So yes the speed is decreasing over time, this is a commonly reported issue on these lists but there is no simple fix for this because it depends both on hardware and data.  The output you have provided suggests that you are stuck on IO, Andy is the primary developer on TDB2 so he may be able to shed more light on what might be going on.
>>
>> Rob
>>
>> On 22/06/2020, 06:32, "Isroel Kogan" <is...@gmail.com> wrote:
>>
>>      Hi Andy - thanks for your comments.
>>      
>>      Instead of responding point by point, its best if I present a clearer picture -as  I also have a better understanding of the factors so far.
>>      
>>      
>>      GCP instance stats:
>>      $ lscpu
>>      
>>      Architecture:        x86_64
>>      CPU op-mode(s):      32-bit, 64-bit
>>      Byte Order:          Little Endian
>>      Address sizes:       46 bits physical, 48 bits virtual
>>      CPU(s):              10
>>      On-line CPU(s) list: 0-9
>>      Thread(s) per core:  2
>>      Core(s) per socket:  5
>>      Socket(s):           1
>>      NUMA node(s):        1
>>      Vendor ID:           GenuineIntel
>>      CPU family:          6
>>      Model:               63
>>      Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
>>      Stepping:            0
>>      CPU MHz:             2300.000
>>      BogoMIPS:            4600.00
>>      Hypervisor vendor:   KVM
>>      Virtualization type: full
>>      L1d cache:           32K
>>      L1i cache:           32K
>>      L2 cache:            256K
>>      L3 cache:            46080K
>>      NUMA node0 CPU(s):   0-9
>>      
>>      16GB RAM
>>      (I configured less RAM because on a prior iteration - out of 32GB - only 8 was being used)
>>      
>>      3TB local SSD
>>      
>>      according to google cloud - max performance for this size is as follows:
>>      Storage space   Partitions   IOPS (Read)   IOPS (Write)   Throughput Read (MB/s)   Throughput Write (MB/s)
>>      3 TB            8            680,000       360,000        2,650                     1,400
>>      
>>      
>>      https://cloud.google.com/compute/docs/disks
>>      
>>      I'm not getting that - but performance is an order of magnitude or more better than on my machine - as you'll see below -
>>      
>>      
>>      
>>      My data is as follows -15 files of varying size (based on the original bz2 files provided by the authors). files range from 2GB (a few million triples) to 129GB. 4 files have close to a half a billion triples each.
>>      The files are .nq - as you noted - quads. (unzipped as already mentioned).
>>      
>>      Currently I have a process running initiated by tdb2.tdbloader command in parallel mode. DB empty initially.
>>      
>>      I will include some information here that may help you get a sense of the performance issues I alluded to
>>      
>>      
>>      beginning of log output:
>>      
>>      
>>      17:14:42 INFO  loader     :: Loader = LoaderParallel
>>      17:14:42 INFO  loader     :: Start: 15 files
>>      17:14:49 INFO  loader     :: Add: 500,000 ai_stackexchange_triples.nq (Batch: 77,339 / Avg: 77,339)
>>      17:14:55 INFO  loader     :: Add: 1,000,000 ai_stackexchange_triples.nq (Batch: 89,285 / Avg: 82,884)
>>      17:15:00 INFO  loader     :: Add: 1,500,000 ai_stackexchange_triples.nq (Batch: 90,090 / Avg: 85,154)
>>      17:15:06 INFO  loader     :: Add: 2,000,000 ai_stackexchange_triples.nq (Batch: 87,382 / Avg: 85,700)
>>      17:15:12 INFO  loader     :: Add: 2,500,000 ai_stackexchange_triples.nq (Batch: 82,304 / Avg: 84,999)
>>      17:15:18 INFO  loader     :: Add: 3,000,000 ai_stackexchange_triples.nq (Batch: 87,032 / Avg: 85,331)
>>      17:15:22 INFO  loader     ::   End file: ai_stackexchange_triples.nq (triples/quads = 3,402,753)
>>      17:15:23 INFO  loader     :: Add: 3,500,000 datascience_stackexchange_triples.nq (Batch: 89,493 / Avg: 85,902)
>>      17:15:29 INFO  loader     :: Add: 4,000,000 datascience_stackexchange_triples.nq (Batch: 86,475 / Avg: 85,973)
>>      17:15:35 INFO  loader     :: Add: 4,500,000 datascience_stackexchange_triples.nq (Batch: 88,136 / Avg: 86,208)
>>      17:15:40 INFO  loader     :: Add: 5,000,000 datascience_stackexchange_triples.nq (Batch: 93,562 / Avg: 86,891)
>>      
>>      
>>      the batch size increases for the next file - but the third file is one of those w nearly a half a billion triples.
>>      
>>      the batch size keeps steadily decreasing
>>      
>>      here is the output at the 250m triple mark
>>      
>>      19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
>>      19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
>>      19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
>>      19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
>>      19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
>>      19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
>>      19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)
>>      19:16:04 INFO  loader     :: Add: 251,000,000 github_1_fixed.nq (Batch: 3,976 / Avg: 34,473)
>>      19:18:41 INFO  loader     :: Add: 251,500,000 github_1_fixed.nq (Batch: 3,171 / Avg: 33,810)
>>      19:21:03 INFO  loader     :: Add: 252,000,000 github_1_fixed.nq (Batch: 3,512 / Avg: 33,241)
>>      
>>      The batch size has decreased (I don't understand exactly what the batch size refers to) and conversely the time for a half million triples to be processed has gone from 5s to 2.5m.
>>      
>>      This is the tail of the log currently:
>>      
>>      03:01:16 INFO  loader     :: Add: 305,500,000 github_1_fixed.nq (Batch: 1,468 / Avg: 8,680)
>>      03:06:47 INFO  loader     :: Add: 306,000,000 github_1_fixed.nq (Batch: 1,512 / Avg: 8,613)
>>      03:11:10 INFO  loader     :: Add: 306,500,000 github_1_fixed.nq (Batch: 1,897 / Avg: 8,564)
>>      03:16:00 INFO  loader     :: Add: 307,000,000 github_1_fixed.nq (Batch: 1,723 / Avg: 8,509)
>>      03:21:36 INFO  loader     :: Add: 307,500,000 github_1_fixed.nq (Batch: 1,489 / Avg: 8,444)
>>      03:26:07 INFO  loader     :: Add: 308,000,000 github_1_fixed.nq (Batch: 1,842 / Avg: 8,395)
>>      03:31:43 INFO  loader     :: Add: 308,500,000 github_1_fixed.nq (Batch: 1,491 / Avg: 8,333)
>>      03:36:08 INFO  loader     :: Add: 309,000,000 github_1_fixed.nq (Batch: 1,882 / Avg: 8,287)
>>      03:39:37 INFO  loader     :: Add: 309,500,000 github_1_fixed.nq (Batch: 2,391 / Avg: 8,254)
>>      03:44:02 INFO  loader     :: Add: 310,000,000 github_1_fixed.nq (Batch: 1,887 / Avg: 8,209)
>>      03:44:02 INFO  loader     ::   Elapsed: 37,759.63 seconds [2020/06/22 03:44:02 UTC]
>>      03:47:07 INFO  loader     :: Add: 310,500,000 github_1_fixed.nq (Batch: 2,711 / Avg: 8,183)
>>      03:51:58 INFO  loader     :: Add: 311,000,000 github_1_fixed.nq (Batch: 1,717 / Avg: 8,133)
>>      
>>      batch size down below 2k and processing time per 0.5m triples up to around 5 min.
>>      
>>      What has puzzled me is that in the beginning, CPU utilization was high. At some point the Load Average was over 7 and even 8 (so 70-80% for 10 cores).
>>      Whereas now Load average sits near 3.5 not crossing 4 for several hours. This is based on htop - according to the gcp console dashboard - CPU utilization did indeed peak near 80% - but has been steady around 13-14% for the last hours - I don't know how to match these two metrics - but its heavily underutilized.
>>      
>>      Here's a flavor of the output (as I write now -when the batch size has decreased) of
>>      
>>      $ iotop -ob
>>      
>>      
>>      Current DISK READ:       2.76 G/s | Current DISK WRITE:       0.00 B/s
>>        TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>>      12184 be/4 isroelko  936.79 M/s    5.72 M/s  0.00 % 80.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12186 be/4 isroelko  961.04 M/s    7.28 M/s  0.00 % 78.52 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12185 be/4 isroelko  932.44 M/s    9.45 M/s  0.00 % 77.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      
>>      
>>      The relevant output of running (for 124s) - this is the last accumulation:
>>      
>>      $ iotop -ab
>>      
>>      Total DISK READ:         2.76 G/s | Total DISK WRITE:        16.87 M/s
>>      Current DISK READ:       2.74 G/s | Current DISK WRITE:       0.00 B/s
>>        TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>>      12184 be/4 isroelko     88.11 G    414.80 M  0.00 % 79.31 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12185 be/4 isroelko     87.82 G    498.31 M  0.00 % 78.74 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12186 be/4 isroelko     88.17 G    435.88 M  0.00 % 78.66 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12187 be/4 isroelko      3.07 G    206.02 M  0.00 %  4.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12156 be/4 isroelko     68.29 M      8.00 K  0.00 %  0.63 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12171 be/4 isroelko   1048.00 K      0.00 B  0.00 %  0.13 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12160 be/4 isroelko   1016.00 K      0.00 B  0.00 %  0.10 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12157 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12192 be/4 isroelko    388.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12162 be/4 isroelko   1704.00 K      8.00 K  0.00 %  0.06 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12181 be/4 isroelko     33.52 M      9.52 M  0.00 %  0.05 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12182 be/4 isroelko     33.29 M      9.69 M  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12177 be/4 isroelko    404.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12176 be/4 isroelko    140.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12191 be/4 isroelko    176.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12173 be/4 isroelko     48.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12175 be/4 isroelko    112.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12190 be/4 isroelko     44.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena
>>      12174 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12183 be/4 isroelko     33.54 M      9.70 M  0.00 %  0.02 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12159 be/4 isroelko     84.00 K      0.00 B  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      12170 be/4 isroelko      0.00 B    100.00 K  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>>      
>>      
>>      
>>      and now the output of for pid 12155 - the subprocess running all the threads
>>      $ pidstat -p 12155 -ut
>>      
>>      
>>      Linux 4.19.0-9-cloud-amd64 (graph4codev1)   06/22/20    _x86_64_    (10 CPU)
>>      
>>      04:38:32      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
>>      04:38:32     1001     12155         -   44.41   47.13    0.00    0.00   91.54     4  java
>>      04:38:32     1001         -     12155    0.00    0.00    0.00    0.00    0.00     4  |__java
>>      04:38:32     1001         -     12156    4.88    0.11    0.00    0.00    4.99     7  |__java
>>      04:38:32     1001         -     12157    0.52    0.00    0.00    0.00    0.52     7  |__GC Thread#0
>>      04:38:32     1001         -     12158    0.00    0.00    0.00    0.00    0.00     1  |__G1 Main Marker
>>      04:38:32     1001         -     12159    0.27    0.00    0.00    0.00    0.27     2  |__G1 Conc#0
>>      04:38:32     1001         -     12160    0.34    0.00    0.00    0.01    0.34     8  |__G1 Refine#0
>>      04:38:32     1001         -     12161    0.28    0.00    0.00    0.00    0.28     6  |__G1 Young RemSet
>>      04:38:32     1001         -     12162    0.02    0.01    0.00    0.00    0.03     1  |__VM Thread
>>      04:38:32     1001         -     12163    0.00    0.00    0.00    0.00    0.00     4  |__Reference Handl
>>      04:38:32     1001         -     12164    0.00    0.00    0.00    0.00    0.00     7  |__Finalizer
>>      04:38:32     1001         -     12165    0.00    0.00    0.00    0.00    0.00     6  |__Signal Dispatch
>>      04:38:32     1001         -     12166    0.04    0.00    0.00    0.00    0.04     1  |__C2 CompilerThre
>>      04:38:32     1001         -     12167    0.00    0.00    0.00    0.00    0.00     1  |__C1 CompilerThre
>>      04:38:32     1001         -     12168    0.00    0.00    0.00    0.00    0.00     1  |__Sweeper thread
>>      04:38:32     1001         -     12169    0.00    0.00    0.00    0.00    0.00     1  |__Service Thread
>>      04:38:32     1001         -     12170    0.03    0.02    0.00    0.01    0.05     1  |__VM Periodic Tas
>>      04:38:32     1001         -     12171    0.00    0.00    0.00    0.00    0.00     6  |__Common-Cleaner
>>      04:38:32     1001         -     12173    0.51    0.00    0.00    0.00    0.51     8  |__GC Thread#1
>>      04:38:32     1001         -     12174    0.51    0.00    0.00    0.00    0.51     3  |__GC Thread#2
>>      04:38:32     1001         -     12175    0.50    0.00    0.00    0.00    0.50     1  |__GC Thread#3
>>      04:38:32     1001         -     12176    0.51    0.00    0.00    0.00    0.52     5  |__GC Thread#4
>>      04:38:32     1001         -     12177    0.51    0.00    0.00    0.00    0.51     2  |__GC Thread#5
>>      04:38:32     1001         -     12178    0.00    0.00    0.00    0.00    0.00     5  |__Thread-0
>>      04:38:32     1001         -     12179    0.00    0.00    0.00    0.00    0.00     6  |__Thread-1
>>      04:38:32     1001         -     12180    0.00    0.00    0.00    0.00    0.00     7  |__Thread-2
>>      04:38:32     1001         -     12181    4.33    0.09    0.00    0.00    4.42     2  |__Thread-3
>>      04:38:32     1001         -     12182    5.37    0.17    0.00    0.00    5.54     3  |__Thread-4
>>      04:38:32     1001         -     12183    5.12    0.17    0.00    0.00    5.29     4  |__Thread-5
>>      04:38:32     1001         -     12184    4.44   14.90    0.00    0.13   19.34     3  |__Thread-6
>>      04:38:32     1001         -     12185    5.53   15.30    0.00    0.14   20.83     2  |__Thread-7
>>      04:38:32     1001         -     12186    5.08   15.00    0.00    0.14   20.07     9  |__Thread-8
>>      04:38:32     1001         -     12187    3.57    1.51    0.00    0.02    5.08     0  |__Thread-9
>>      04:38:32     1001         -     12190    0.52    0.00    0.00    0.00    0.52     6  |__GC Thread#6
>>      04:38:32     1001         -     12191    0.51    0.00    0.00    0.00    0.51     0  |__GC Thread#7
>>      04:38:32     1001         -     12192    0.50    0.00    0.00    0.00    0.51     4  |__GC Thread#8
>>      04:38:32     1001         -     12193    0.02    0.00    0.00    0.00    0.02     5  |__G1 Refine#1
>>      04:38:32     1001         -     12194    0.01    0.00    0.00    0.00    0.01     9  |__G1 Refine#2
>>      04:38:32     1001         -     12195    0.01    0.00    0.00    0.00    0.01     0  |__G1 Refine#3
>>      04:38:32     1001         -     12196    0.01    0.00    0.00    0.00    0.01     7  |__G1 Refine#4
>>      04:38:32     1001         -     12197    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#5
>>      04:38:32     1001         -     12198    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#6
>>      04:38:32     1001         -     12199    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#7
>>      04:38:32     1001         -     12200    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#8
>>      04:38:32     1001         -     12205    0.27    0.00    0.00    0.00    0.27     9  |__G1 Conc#1
>>      
>>      
>>      you can see threads, 12184, 12185, 12186 have the highest CPU utilization, albeit still at 20% only. they are the same threads in the output of iotop - doing a lot of reading mostly. As you can see in iotop output, the read is at over 900 M/s - which is excellent. All the other threads are idle the overwhelming majority of the time
>>      
>>      Finally htop gives a similar picture: the 3 threads have the highest CPU utilization (around 20%) now, and so 7/10 cores are practically idle with short spikes when the other threads pick up, which isn't much, as the above shows. Also those threads are either in R or D status (rarely in S) - while the rest spend most of the time in S.
>>      
>>      I didn't understand what you wrote about the heap size, RAM and read threads - I don't use Java - so I'm flying by the seat of my pants.
>>      Here is the output of
>>      
>>      $ jstat -gc 12155
>>      S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
>>       0.0   184320.0  0.0   184320.0 2404352.0 102400.0 1519616.0   642119.0  20224.0 19386.6 2560.0 2339.3   2974 1039.542   0      0.000  340     2.989 1042.531
>>      
>>      which may be relevant.
>>      
>>      My very crudely formulated question is - how come it seems that with smaller batch sizes, the processing time increased so much - while at the same time there is actually much less processing going on (much lower CPU utilization) - it seems like everything is waiting for whatever those 3 threads are doing.
>>      And is there any way to address it?
>>      
>>      thanks for your help!
>>      
>>      best,
>>      
>>      Isroel
>>      
>>      On 2020/06/21 21:38:50, Andy Seaborne <an...@apache.org> wrote:
>>      >  > And when it says "nq" is really quads or all data for the default
>>      >  > graph? (there is more indexing work for named graphs).
>>      >
>>      >  >> : Quads = 127,206,280
>>      >
>>      >
>>      > OK - it's quads. There are 6 quad indexes and in full parallel mode it
>>      > will use 2 more threads to parse and to build the node table.
>>      >
>>      > Full parallel loading is going to use up all the cores and HT threads
>>      > aren't full threads for this purpose.
>>      >
>>      > The phased loader (default) uses less threads.
>>      >
>>      > Phase 1:
>>      > one thread to decompress and parse
>>      > one thread to build the node table.
>>      > one thread to for the GSPO
>>      > (and one for SPO but you seem to have no triples)
>>      > =3
>>      >
>>      > Phase 2:
>>      > two threads
>>      > =2
>>      >
>>      > Phase 3:
>>      > three threads
>>      > =3
>>      >
>>      >      Andy
>>      >
>>      >
>>      > On 21/06/2020 22:11, Andy Seaborne wrote:
>>      > > Hi there,
>>      > >
>>      > > Thanks for reporting the findings.
>>      > >
>>      > > On 20/06/2020 16:10, Isroel Kogan wrote:
>>      > >> Hi,
>>      > >>
>>      > >> I am also a newcomer to the RDF world - and particularly Jena, which I
>>      > >> started using this week.
>>      > >>
>>      > >> A couple of observations I have made over the last few days exploring
>>      > >> different options.
>>      > >>
>>      > >> Local Machine (specs):
>>      > >>
>>      > >> Ubuntu 18.04
>>      > >> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
>>      > >
>>      > > which is 4 core and hyper threading. for this workload that is more like
>>      > > 4 threads. HT is not a completely x2 for this sort of continuous
>>      > > processing threading.
>>      > >
>>      > > And pre-emptive timeslicing is not nice!
>>      > >
>>      > >> 16GB RAM
>>      > >> 512 SSD (NVMe).
>>      > >>
>>      > >>
>>      > >> the following compares loading a file in compressed vs decompressed
>>      > >> format -both w parallel loader.
>>      > >>
>>      > >> file:
>>      > >> docstrings_triples.nq
>>      > >> size: 28GB
>>      > >>
>>      > >> cmd:
>>      > >> time tdb2.tdbloader --loader=parallel --loc=test1graphdb
>>      > >> docstrings_triples.nq > tdb2.log1 2>&1
>>      > >>
>>      > >> :: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
>>      > >>
>>      > >> real    22m46.346s
>>      > >> user    120m46.591s
>>      > >> sys    3m22.698s
>>      > >>
>>      > >>
>>      > >> file:
>>      > >> docstrings_triples.nq.bz2
>>      > >> size: 542M
>>      > >>
>>      > >> cmd:
>>      > >>
>>      > >> time tdb2.tdbloader --loader=parallel --loc=test2graphdb
>>      > >> docstrings_triples.nq.bz2 > tdb2.log2 2>&1
>>      > >>
>>      > >> :: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
>>      > >>
>>      > >>
>>      > >> real    37m8.182s
>>      > >> user    109m42.970s
>>      > >> sys    6m27.426s
>>      > >>
>>      > >> resulting DB size
>>      > >> 30GB
>>      > >>
>>      > >> confirmed equal via diff.
>>      > >>
>>      > >> pbzip2 ran in 84s
>>      > >>
>>      > >> Less rigorously I noticed a similar gain in speed for other files.
>>      > >
>>      > > For gz files, the speed of loading of compressed vs uncompressed is
>>      > > usually not very much.  It does look like bz2
>>      > >
>>      > > Using a separate process and faster decompressor may help:
>>      > >
>>      > > bzip2 -d < docstrings_triples.nq.bz2 | \
>>      > > time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
>>      > >      -- - > tdb2.log2 2>&1
>>      > >
>>      > > When Jena decompresses a bz2 file, it uses Apache Commons Compress, so 
>>      > > it is a Java decompressor which will take time to get optimized by the 
>>      > > JIT and is likely slower than a specialized tool like bzip2.
>>      > >
>>      > > But with 4 core, it may have the opposite effect - using more processes
>>      > > causes preemption timeslicing.
>>      > >
>>      > > It maybe one of the other loaders is faster because it is a better match
>>      > > to the hardware.
>>      > >
>>      > >> Is this expected behaviour? What factors influence this?
>>      > >>
>>      > >> SSD - local vs cloud.
>>      > >>
>>      > >> on my local machine, when running parallel loader, cores were working
>>      > >> at over 70% capacity and there was little IO induced down time.
>>      > >
>>      > > How many cores were active?
>>      > > And when it says "nq" is really quads or all data for the default graph?
>>      > > (there is more indexing work for named graphs).
>>      > >
>>      > > Some of that will be the bz2 decompression, but it looks to me like it's 
>>      > > "more threads than cores" causing timeslicing.
>>      > >
>>      > >>
>>      > >> GCP instance specs:
>>      > >>
>>      > >> 20 CPU
>>      > >> 32GB RAM
>>      > >
>>      > > And same heap size?
>>      > >
>>      > > While the parallel loader is using multiple threads it is a fixed number
>>      > > so more CPU will help only if
>>      > >
>>      > > More RAM is going to help because the OS will use it for file system
>>      > > cache, delaying writes.
>>      > >
>>      > > But with more read threads, it could be there is less preemptive
>>      > > scheduling and that could be a big gain.
>>      > >
>>      > >> 6TB "local SSD" storage
>>      > >> the local SSD storage offers the best performance to reduce IO latency
>>      > >> - it has physical proximity to instance - as per GCP.
>>      > >>
>>      > >> a few cores were working at near capacity, while the vast majority
>>      > >> idle (near 0%) w occasional spikes. average load translates to 20%
>>      > >> utilization. As I've seen others write here, this is a difference
>>      > >> others have noted.
>>      > >> How can this be addressed? buffer size? (I don't have a deep enough
>>      > >> understanding).
>>      > >
>>      > > My guess is that on the GCP instance it is one thread-one core.
>>      > >
>>      > >>
>>      > >>
>>      > >> Another recurring pattern is the reduction in batch size.
>>      > >> I've been running a load job on my gcp instance for almost a day (23+h).
>>      > >>
>>      > >> file size: 93GB
>>      > >> triples: 472m
>>      > >>
>>      > >> batch size decreased from 160k range to under 1k, while processing
>>      > >> time per batch increased from a few seconds to over 10 min. All this
>>      > >> time average CPU usage has remained steady, as has RAM usage.
>>      > >
>>      > > Not sure I quite understand - this is adding more data to an existing
>>      > > database? And 10mins for 1k? While it will be slower, that does sound
>>      > > rather extreme.
>>      > >
>>      > >>
>>      > >> I don't understand how all of this works with indexing. Is this
>>      > >> expected behaviour? besides a locally proximate SSD, I've thrown an
>>      > >> overkill of hardware at it.
>>      > >>
>>      > >> thanks
>>      > >>
>>      > >
>>      > >      Andy
>>      > >
>>      >
>>      
>>
>>
>>
>>
>>

Re: TDB2 parallel load on cloud SSD and other observations/questions

Posted by Isroel Kogan <is...@gmail.com>.
thank you Rob - I have confused the terminology. Indeed each run processes 0.5m quads. 
what is UOM of the batch loading? M/s?

Looking at the output of iotop - the 3 main threads - which comprise the lionshare of the activity - have pretty steady reads - of about 900-950 M/s w little variation beyond that. write varies a little more - but 2-6 M/s. What is puzzling to me - is that in a configuration with twice the CPU and twice the RAM - the load average remained the same in absolute terms. 
RAM usage was steady at about 8GB in the 32GB configuration - now its been steady at 4.8 (16GB configuration) -
in both cases it is severely underutilized. 

I'm not conversant in Java - I've been trying jstat and jmap - the latter hasnt been working. how best to verify the heap size? what else can affect things?

On 2020/06/22 09:01:36, Rob Vesse <rv...@dotnetrdf.org> wrote: 
> Isabel
> 
> I think there might be a fundamental misunderstanding happening about batch sizes here.  The batch sizes are fixed for a run and never changes, the "batch size" you refer to is a speed calculation e.g
> 
>     19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
>     19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
>     19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
>     19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
>     19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
>     19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
>     19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)
> 
> Note that each batch is 500,000 quads as the number after "Add:" increases by 500,000 each time.  The "Batch" in brackets is referring to the calculated loading speed for the current batch, where "Avg" is the calculated loading speed over the entire load.
> 
> So yes the speed is decreasing over time, this is a commonly reported issue on these lists but there is no simple fix for this because it depends both on hardware and data.  The output you have provided suggests that you are stuck on IO, Andy is the primary developer on TDB2 so he may be able to shed more light on what might be going on.
> 
> Rob
> 
> On 22/06/2020, 06:32, "Isroel Kogan" <is...@gmail.com> wrote:
> 
>     Hi Andy - thanks for your comments.
>     
>     Instead of responding point by point, its best if I present a clearer picture -as  I also have a better understanding of the factors so far.
>     
>     
>     GCP instance stats:
>     $ lscpu
>     
>     Architecture:        x86_64
>     CPU op-mode(s):      32-bit, 64-bit
>     Byte Order:          Little Endian
>     Address sizes:       46 bits physical, 48 bits virtual
>     CPU(s):              10
>     On-line CPU(s) list: 0-9
>     Thread(s) per core:  2
>     Core(s) per socket:  5
>     Socket(s):           1
>     NUMA node(s):        1
>     Vendor ID:           GenuineIntel
>     CPU family:          6
>     Model:               63
>     Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
>     Stepping:            0
>     CPU MHz:             2300.000
>     BogoMIPS:            4600.00
>     Hypervisor vendor:   KVM
>     Virtualization type: full
>     L1d cache:           32K
>     L1i cache:           32K
>     L2 cache:            256K
>     L3 cache:            46080K
>     NUMA node0 CPU(s):   0-9
>     
>     16GB RAM
>     (I configured less RAM because on a prior iteration - out of 32GB - only 8 was being used)
>     
>     3TB local SSD
>     
>     according to google cloud - max performance for this size is as follows:
>     Storage space Partitions     IOPS                          Throughput (MB/s)
>                                                 Read      Write            Read         Write
>     
>     3 TB                 8                 680,000  360,000       2,650         1,400
>     
>     
>     https://cloud.google.com/compute/docs/disks
>     
>     I'm not getting that - but performance is an order of magnitude or more better than on my machine - as youll see below -
>     
>     
>     
>     My data is as follows -15 files of varying size (based on the original bz2 files provided by the authors). files range from 2GB (a few million triples) to 129GB. 4 files have close to a half a billion triples each.
>     The files are .nq - as you noted - quads. (unzipped as already mentioned).
>     
>     Currently I have a process running initiated by tdb2.tdbloader command in parallel mode. DB empty initially.
>     
>     I will include some information here that may help you get a sense of the performance issues I alluded to
>     
>     
>     beginning of log output:
>     
>     
>     17:14:42 INFO  loader     :: Loader = LoaderParallel
>     17:14:42 INFO  loader     :: Start: 15 files
>     17:14:49 INFO  loader     :: Add: 500,000 ai_stackexchange_triples.nq (Batch: 77,339 / Avg: 77,339)
>     17:14:55 INFO  loader     :: Add: 1,000,000 ai_stackexchange_triples.nq (Batch: 89,285 / Avg: 82,884)
>     17:15:00 INFO  loader     :: Add: 1,500,000 ai_stackexchange_triples.nq (Batch: 90,090 / Avg: 85,154)
>     17:15:06 INFO  loader     :: Add: 2,000,000 ai_stackexchange_triples.nq (Batch: 87,382 / Avg: 85,700)
>     17:15:12 INFO  loader     :: Add: 2,500,000 ai_stackexchange_triples.nq (Batch: 82,304 / Avg: 84,999)
>     17:15:18 INFO  loader     :: Add: 3,000,000 ai_stackexchange_triples.nq (Batch: 87,032 / Avg: 85,331)
>     17:15:22 INFO  loader     ::   End file: ai_stackexchange_triples.nq (triples/quads = 3,402,753)
>     17:15:23 INFO  loader     :: Add: 3,500,000 datascience_stackexchange_triples.nq (Batch: 89,493 / Avg: 85,902)
>     17:15:29 INFO  loader     :: Add: 4,000,000 datascience_stackexchange_triples.nq (Batch: 86,475 / Avg: 85,973)
>     17:15:35 INFO  loader     :: Add: 4,500,000 datascience_stackexchange_triples.nq (Batch: 88,136 / Avg: 86,208)
>     17:15:40 INFO  loader     :: Add: 5,000,000 datascience_stackexchange_triples.nq (Batch: 93,562 / Avg: 86,891)
>     
>     
>     the batch size increases for the next file - but the third file is one of those w nearly a half a billion triples.
>     
>     the batch size keeps steadily decreasing
>     
>     here is the output at the 250m triple mark
>     
>     19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
>     19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
>     19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
>     19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
>     19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
>     19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
>     19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)
>     19:16:04 INFO  loader     :: Add: 251,000,000 github_1_fixed.nq (Batch: 3,976 / Avg: 34,473)
>     19:18:41 INFO  loader     :: Add: 251,500,000 github_1_fixed.nq (Batch: 3,171 / Avg: 33,810)
>     19:21:03 INFO  loader     :: Add: 252,000,000 github_1_fixed.nq (Batch: 3,512 / Avg: 33,241)
>     
>     The batch size has decreased (I don't understand exactly what the batch size refers to) and conversely the time for a half million triples to be processed has gone from 5s to 2.5m.
>     
>     This is the tail of the log currently:
>     
>     03:01:16 INFO  loader     :: Add: 305,500,000 github_1_fixed.nq (Batch: 1,468 / Avg: 8,680)
>     03:06:47 INFO  loader     :: Add: 306,000,000 github_1_fixed.nq (Batch: 1,512 / Avg: 8,613)
>     03:11:10 INFO  loader     :: Add: 306,500,000 github_1_fixed.nq (Batch: 1,897 / Avg: 8,564)
>     03:16:00 INFO  loader     :: Add: 307,000,000 github_1_fixed.nq (Batch: 1,723 / Avg: 8,509)
>     03:21:36 INFO  loader     :: Add: 307,500,000 github_1_fixed.nq (Batch: 1,489 / Avg: 8,444)
>     03:26:07 INFO  loader     :: Add: 308,000,000 github_1_fixed.nq (Batch: 1,842 / Avg: 8,395)
>     03:31:43 INFO  loader     :: Add: 308,500,000 github_1_fixed.nq (Batch: 1,491 / Avg: 8,333)
>     03:36:08 INFO  loader     :: Add: 309,000,000 github_1_fixed.nq (Batch: 1,882 / Avg: 8,287)
>     03:39:37 INFO  loader     :: Add: 309,500,000 github_1_fixed.nq (Batch: 2,391 / Avg: 8,254)
>     03:44:02 INFO  loader     :: Add: 310,000,000 github_1_fixed.nq (Batch: 1,887 / Avg: 8,209)
>     03:44:02 INFO  loader     ::   Elapsed: 37,759.63 seconds [2020/06/22 03:44:02 UTC]
>     03:47:07 INFO  loader     :: Add: 310,500,000 github_1_fixed.nq (Batch: 2,711 / Avg: 8,183)
>     03:51:58 INFO  loader     :: Add: 311,000,000 github_1_fixed.nq (Batch: 1,717 / Avg: 8,133)
>     
>     batch size down below 2k and processing time per 0.5m triples up to around 5 min.
>     
>     What has puzzled me is that in the beginning, CPU utilization was high. At some point the Load Average was over 7 and even 8 (so 70-80% for 10 cores).
>     Whereas now Load average sits near 3.5 not crossing 4 for several hours. This is based on htop - according to the gcp console dashboard - CPU utilization did indeed peak near 80% - but has been steady around 13-14% for the last hours - I don't know how to match these two metrics - but its heavily underutilized.
>     
>     Here's a flavor of the output (as I write now -when the batch size has decreased) of
>     
>     $ iotop -ob
>     
>     
>     Current DISK READ:       2.76 G/s | Current DISK WRITE:       0.00 B/s
>       TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>     12184 be/4 isroelko  936.79 M/s    5.72 M/s  0.00 % 80.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12186 be/4 isroelko  961.04 M/s    7.28 M/s  0.00 % 78.52 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12185 be/4 isroelko  932.44 M/s    9.45 M/s  0.00 % 77.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     
>     
>     The relevant output of running (for 124s) - this is the last accumulation:
>     
>     $ iotop -ab
>     
>     Total DISK READ:         2.76 G/s | Total DISK WRITE:        16.87 M/s
>     Current DISK READ:       2.74 G/s | Current DISK WRITE:       0.00 B/s
>       TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>     12184 be/4 isroelko     88.11 G    414.80 M  0.00 % 79.31 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12185 be/4 isroelko     87.82 G    498.31 M  0.00 % 78.74 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12186 be/4 isroelko     88.17 G    435.88 M  0.00 % 78.66 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12187 be/4 isroelko      3.07 G    206.02 M  0.00 %  4.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12156 be/4 isroelko     68.29 M      8.00 K  0.00 %  0.63 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12171 be/4 isroelko   1048.00 K      0.00 B  0.00 %  0.13 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12160 be/4 isroelko   1016.00 K      0.00 B  0.00 %  0.10 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12157 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12192 be/4 isroelko    388.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12162 be/4 isroelko   1704.00 K      8.00 K  0.00 %  0.06 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12181 be/4 isroelko     33.52 M      9.52 M  0.00 %  0.05 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12182 be/4 isroelko     33.29 M      9.69 M  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12177 be/4 isroelko    404.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12176 be/4 isroelko    140.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12191 be/4 isroelko    176.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12173 be/4 isroelko     48.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12175 be/4 isroelko    112.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12190 be/4 isroelko     44.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena
>     12174 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12183 be/4 isroelko     33.54 M      9.70 M  0.00 %  0.02 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12159 be/4 isroelko     84.00 K      0.00 B  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12170 be/4 isroelko      0.00 B    100.00 K  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
>     
>     
>     
>     and now the output of for pid 12155 - the subprocess running all the threads
>     $ pidstat -p 12155 -ut
>     
>     
>     Linux 4.19.0-9-cloud-amd64 (graph4codev1)   06/22/20    _x86_64_    (10 CPU)
>     
>     04:38:32      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
>     04:38:32     1001     12155         -   44.41   47.13    0.00    0.00   91.54     4  java
>     04:38:32     1001         -     12155    0.00    0.00    0.00    0.00    0.00     4  |__java
>     04:38:32     1001         -     12156    4.88    0.11    0.00    0.00    4.99     7  |__java
>     04:38:32     1001         -     12157    0.52    0.00    0.00    0.00    0.52     7  |__GC Thread#0
>     04:38:32     1001         -     12158    0.00    0.00    0.00    0.00    0.00     1  |__G1 Main Marker
>     04:38:32     1001         -     12159    0.27    0.00    0.00    0.00    0.27     2  |__G1 Conc#0
>     04:38:32     1001         -     12160    0.34    0.00    0.00    0.01    0.34     8  |__G1 Refine#0
>     04:38:32     1001         -     12161    0.28    0.00    0.00    0.00    0.28     6  |__G1 Young RemSet
>     04:38:32     1001         -     12162    0.02    0.01    0.00    0.00    0.03     1  |__VM Thread
>     04:38:32     1001         -     12163    0.00    0.00    0.00    0.00    0.00     4  |__Reference Handl
>     04:38:32     1001         -     12164    0.00    0.00    0.00    0.00    0.00     7  |__Finalizer
>     04:38:32     1001         -     12165    0.00    0.00    0.00    0.00    0.00     6  |__Signal Dispatch
>     04:38:32     1001         -     12166    0.04    0.00    0.00    0.00    0.04     1  |__C2 CompilerThre
>     04:38:32     1001         -     12167    0.00    0.00    0.00    0.00    0.00     1  |__C1 CompilerThre
>     04:38:32     1001         -     12168    0.00    0.00    0.00    0.00    0.00     1  |__Sweeper thread
>     04:38:32     1001         -     12169    0.00    0.00    0.00    0.00    0.00     1  |__Service Thread
>     04:38:32     1001         -     12170    0.03    0.02    0.00    0.01    0.05     1  |__VM Periodic Tas
>     04:38:32     1001         -     12171    0.00    0.00    0.00    0.00    0.00     6  |__Common-Cleaner
>     04:38:32     1001         -     12173    0.51    0.00    0.00    0.00    0.51     8  |__GC Thread#1
>     04:38:32     1001         -     12174    0.51    0.00    0.00    0.00    0.51     3  |__GC Thread#2
>     04:38:32     1001         -     12175    0.50    0.00    0.00    0.00    0.50     1  |__GC Thread#3
>     04:38:32     1001         -     12176    0.51    0.00    0.00    0.00    0.52     5  |__GC Thread#4
>     04:38:32     1001         -     12177    0.51    0.00    0.00    0.00    0.51     2  |__GC Thread#5
>     04:38:32     1001         -     12178    0.00    0.00    0.00    0.00    0.00     5  |__Thread-0
>     04:38:32     1001         -     12179    0.00    0.00    0.00    0.00    0.00     6  |__Thread-1
>     04:38:32     1001         -     12180    0.00    0.00    0.00    0.00    0.00     7  |__Thread-2
>     04:38:32     1001         -     12181    4.33    0.09    0.00    0.00    4.42     2  |__Thread-3
>     04:38:32     1001         -     12182    5.37    0.17    0.00    0.00    5.54     3  |__Thread-4
>     04:38:32     1001         -     12183    5.12    0.17    0.00    0.00    5.29     4  |__Thread-5
>     04:38:32     1001         -     12184    4.44   14.90    0.00    0.13   19.34     3  |__Thread-6
>     04:38:32     1001         -     12185    5.53   15.30    0.00    0.14   20.83     2  |__Thread-7
>     04:38:32     1001         -     12186    5.08   15.00    0.00    0.14   20.07     9  |__Thread-8
>     04:38:32     1001         -     12187    3.57    1.51    0.00    0.02    5.08     0  |__Thread-9
>     04:38:32     1001         -     12190    0.52    0.00    0.00    0.00    0.52     6  |__GC Thread#6
>     04:38:32     1001         -     12191    0.51    0.00    0.00    0.00    0.51     0  |__GC Thread#7
>     04:38:32     1001         -     12192    0.50    0.00    0.00    0.00    0.51     4  |__GC Thread#8
>     04:38:32     1001         -     12193    0.02    0.00    0.00    0.00    0.02     5  |__G1 Refine#1
>     04:38:32     1001         -     12194    0.01    0.00    0.00    0.00    0.01     9  |__G1 Refine#2
>     04:38:32     1001         -     12195    0.01    0.00    0.00    0.00    0.01     0  |__G1 Refine#3
>     04:38:32     1001         -     12196    0.01    0.00    0.00    0.00    0.01     7  |__G1 Refine#4
>     04:38:32     1001         -     12197    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#5
>     04:38:32     1001         -     12198    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#6
>     04:38:32     1001         -     12199    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#7
>     04:38:32     1001         -     12200    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#8
>     04:38:32     1001         -     12205    0.27    0.00    0.00    0.00    0.27     9  |__G1 Conc#1
>     
>     
>     you can see threads, 12184, 12185, 12186 have the highest CPU utilization, albeit still at 20% only. they are the same threads in the output of iotop - doing a lot of reading mostly. As you can see in iotop output, the read is at over 900 M/s - which is excellent. All the other threads are idle the overwhelming majority of the time
>     
>     Finally htop gives a similar picture, the 3 threads have the highest CPU utilization (around 20%) now, and so 7/10 cores are practically idle with short spikes when the other threads pick up. which isnt much as the above shows. Also those threads are either in R or D status (rarely in S) - while the rest spend most of the time in S.
>     
>     I didnt understand what you wrote about the heap size, RAM and read threads - I don't use java - so I'm flying by the seat of my pants. 
>     Here is the output of 
>     
>     $ jstat -gc 12155 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT   
>      0.0   184320.0  0.0   184320.0 2404352.0 102400.0 1519616.0   642119.0  20224.0 19386.6 2560.0 2339.3   2974 1039.542   0      0.000  340     2.989 1042.531
>     
>     which may be relevant. 
>     
>     My very crudely formulated question is - how come it seems that with smaller batch sizes, the processing time increased so much - while at the same time there is actually much less processing going on (much lower CPU utilization) - it seems like everything is waiting for whatever those 3 threads are doing. 
>     And is three anyway to address it? 
>     
>     thanks for your help!
>     
>     best,
>     
>     Isroel
>     
>     On 2020/06/21 21:38:50, Andy Seaborne <an...@apache.org> wrote: 
>     >  > And when it says "nq" is really quads or all data for the default
>     >  > graph? (there is more indexing work for named graphs).
>     > 
>     >  >> : Quads = 127,206,280
>     > 
>     > 
>     > OK - it's quads. There are 6 quad indexes and in full parallel mode it 
>     > will use 2 more threads to parse and to build the node table.
>     > 
>     > Full parallel loading is going to use up all the cores and HT threads 
>     > aren't full threads for this purpose.
>     > 
>     > The phased loader (default) uses less threads.
>     > 
>     > Phase 1:
>     > one thread to decompress and parse
>     > one thread to build the node table.
>     > one thread to for the GSPO
>     > (and one for SPO but you seem to have no triples)
>     > =3
>     > 
>     > Phase 2:
>     > two threads
>     > =2
>     > 
>     > Phase 3:
>     > three threads
>     > =3
>     > 
>     >      Andy
>     > 
>     > 
>     > On 21/06/2020 22:11, Andy Seaborne wrote:
>     > > Hi there,
>     > > 
>     > > Thanks for reporting the findings.
>     > > 
>     > > On 20/06/2020 16:10, Isroel Kogan wrote:
>     > >> Hi,
>     > >>
>     > >> I am also a newcomer to the RDF world - and particularly Jena, which I 
>     > >> started using this week.
>     > >>
>     > >> A couple of observations I have made over the last few days exploring 
>     > >> different options.
>     > >>
>     > >> Local Machine (specs):
>     > >>
>     > >> Ubuntu 18.04
>     > >> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
>     > > 
>     > > which is 4 core and hyper threading. for this workload that is more like 
>     > > 4 threads. HT is not a completely x2 for this sort of continuous 
>     > > processing threading.
>     > > 
>     > > And pre-emtptive timeslicing is not nice!
>     > > 

Re: TDB2 parallel load on cloud SSD and other observations/questions

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Isroel

I think there might be a fundamental misunderstanding about batch sizes here.  The batch size is fixed for a run and never changes; the "batch size" you refer to is a speed calculation, e.g.:

    19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
    19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
    19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
    19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
    19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
    19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
    19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)

Note that each batch is 500,000 quads, as the number after "Add:" increases by 500,000 each time.  The "Batch" figure in brackets is the calculated loading speed (quads/s) for the current batch, while "Avg" is the calculated loading speed over the entire load so far.
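As a quick sanity check on reading those numbers (my arithmetic, not loader output), the batch rate and the timestamps line up:

    # at the 4,322 quads/s batch rate reported at 19:05:19, one 500,000-quad batch takes
    $ echo $((500000 / 4322))
    115

i.e. 115 seconds, which is exactly the gap between the 19:03:24 and 19:05:19 lines above.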

So yes, the speed is decreasing over time. This is a commonly reported issue on these lists, but there is no simple fix because it depends on both hardware and data.  The output you have provided suggests that you are stuck on IO; Andy is the primary developer on TDB2, so he may be able to shed more light on what might be going on.
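If you want to confirm the IO-bound reading, one generic option (plain sysstat tooling, nothing TDB2-specific - treat it as a sketch) is to watch device utilisation and latency on the database volume while the load runs:

    # extended per-device stats in MB, sampled every 5 seconds
    $ iostat -xm 5

Sustained %util near 100% together with rising await values on that device would point at the disk rather than the CPUs.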

Rob


Re: TDB2 parallel load on cloud SSD and other observations/questions

Posted by Isroel Kogan <is...@gmail.com>.
Hi Andy - thanks for your comments.

Instead of responding point by point, it's best if I present a clearer picture, as I also have a better understanding of the factors now.


GCP instance stats:
$ lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              10
On-line CPU(s) list: 0-9
Thread(s) per core:  2
Core(s) per socket:  5
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2300.000
BogoMIPS:            4600.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-9

16GB RAM
(I configured less RAM because on a prior iteration - out of 32GB - only 8 was being used)

3TB local SSD

according to google cloud - max performance for this size is as follows:
Storage space   Partitions   Read IOPS   Write IOPS   Read throughput (MB/s)   Write throughput (MB/s)
3 TB            8            680,000     360,000      2,650                    1,400


https://cloud.google.com/compute/docs/disks

I'm not getting that - but performance is an order of magnitude or more better than on my machine, as you'll see below.
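One generic way to check what the disk actually delivers, independent of the loader, would be a quick fio run - a sketch only, I haven't benchmarked with it here, and the path, size and job parameters are placeholders:

$ fio --name=ssdtest --filename=/mnt/disks/localssd/fio.tmp --size=4G \
      --rw=randread --bs=8k --direct=1 --numjobs=4 --runtime=60 \
      --time_based --group_reporting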



My data is as follows: 15 files of varying size (based on the original bz2 files provided by the authors). Files range from 2GB (a few million triples) to 129GB; 4 files have close to half a billion triples each.
The files are .nq - as you noted, quads (unzipped, as already mentioned).

Currently I have a process running, initiated by the tdb2.tdbloader command in parallel mode; the DB was empty initially.

I will include some information here that may help you get a sense of the performance issues I alluded to.


beginning of log output:


17:14:42 INFO  loader     :: Loader = LoaderParallel
17:14:42 INFO  loader     :: Start: 15 files
17:14:49 INFO  loader     :: Add: 500,000 ai_stackexchange_triples.nq (Batch: 77,339 / Avg: 77,339)
17:14:55 INFO  loader     :: Add: 1,000,000 ai_stackexchange_triples.nq (Batch: 89,285 / Avg: 82,884)
17:15:00 INFO  loader     :: Add: 1,500,000 ai_stackexchange_triples.nq (Batch: 90,090 / Avg: 85,154)
17:15:06 INFO  loader     :: Add: 2,000,000 ai_stackexchange_triples.nq (Batch: 87,382 / Avg: 85,700)
17:15:12 INFO  loader     :: Add: 2,500,000 ai_stackexchange_triples.nq (Batch: 82,304 / Avg: 84,999)
17:15:18 INFO  loader     :: Add: 3,000,000 ai_stackexchange_triples.nq (Batch: 87,032 / Avg: 85,331)
17:15:22 INFO  loader     ::   End file: ai_stackexchange_triples.nq (triples/quads = 3,402,753)
17:15:23 INFO  loader     :: Add: 3,500,000 datascience_stackexchange_triples.nq (Batch: 89,493 / Avg: 85,902)
17:15:29 INFO  loader     :: Add: 4,000,000 datascience_stackexchange_triples.nq (Batch: 86,475 / Avg: 85,973)
17:15:35 INFO  loader     :: Add: 4,500,000 datascience_stackexchange_triples.nq (Batch: 88,136 / Avg: 86,208)
17:15:40 INFO  loader     :: Add: 5,000,000 datascience_stackexchange_triples.nq (Batch: 93,562 / Avg: 86,891)


The batch size increases for the next file - but the third file is one of those with nearly half a billion triples.

The batch size keeps steadily decreasing.

Here is the output at the 250m triple mark:

19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)
19:16:04 INFO  loader     :: Add: 251,000,000 github_1_fixed.nq (Batch: 3,976 / Avg: 34,473)
19:18:41 INFO  loader     :: Add: 251,500,000 github_1_fixed.nq (Batch: 3,171 / Avg: 33,810)
19:21:03 INFO  loader     :: Add: 252,000,000 github_1_fixed.nq (Batch: 3,512 / Avg: 33,241)

The batch size has decreased (I don't understand exactly what the batch size refers to) and, conversely, the time for half a million triples to be processed has gone from 5s to 2.5m.

This is the tail of the log currently:

03:01:16 INFO  loader     :: Add: 305,500,000 github_1_fixed.nq (Batch: 1,468 / Avg: 8,680)
03:06:47 INFO  loader     :: Add: 306,000,000 github_1_fixed.nq (Batch: 1,512 / Avg: 8,613)
03:11:10 INFO  loader     :: Add: 306,500,000 github_1_fixed.nq (Batch: 1,897 / Avg: 8,564)
03:16:00 INFO  loader     :: Add: 307,000,000 github_1_fixed.nq (Batch: 1,723 / Avg: 8,509)
03:21:36 INFO  loader     :: Add: 307,500,000 github_1_fixed.nq (Batch: 1,489 / Avg: 8,444)
03:26:07 INFO  loader     :: Add: 308,000,000 github_1_fixed.nq (Batch: 1,842 / Avg: 8,395)
03:31:43 INFO  loader     :: Add: 308,500,000 github_1_fixed.nq (Batch: 1,491 / Avg: 8,333)
03:36:08 INFO  loader     :: Add: 309,000,000 github_1_fixed.nq (Batch: 1,882 / Avg: 8,287)
03:39:37 INFO  loader     :: Add: 309,500,000 github_1_fixed.nq (Batch: 2,391 / Avg: 8,254)
03:44:02 INFO  loader     :: Add: 310,000,000 github_1_fixed.nq (Batch: 1,887 / Avg: 8,209)
03:44:02 INFO  loader     ::   Elapsed: 37,759.63 seconds [2020/06/22 03:44:02 UTC]
03:47:07 INFO  loader     :: Add: 310,500,000 github_1_fixed.nq (Batch: 2,711 / Avg: 8,183)
03:51:58 INFO  loader     :: Add: 311,000,000 github_1_fixed.nq (Batch: 1,717 / Avg: 8,133)

Batch size is down below 2k, and processing time per 0.5m triples is up to around 5 min.

What has puzzled me is that in the beginning, CPU utilization was high. At some point the load average was over 7 and even 8 (so 70-80% for 10 cores).
Whereas now the load average sits near 3.5, not crossing 4 for several hours. This is based on htop; according to the GCP console dashboard, CPU utilization did indeed peak near 80% but has been steady around 13-14% for the last few hours. I don't know how to reconcile these two metrics, but the machine is heavily underutilized.
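One way to square those two numbers (my reading of the metrics, not anything the loader reports): on Linux the load average counts tasks that are runnable or stuck in uninterruptible IO wait (D state), while the dashboard only counts CPU time actually used. So:

load average ~3.5 on 10 vCPUs  -> about 3.5 threads runnable or waiting on IO at any moment
CPU at 13-14% of 10 vCPUs      -> only ~1.3-1.4 cores' worth of actual CPU work

The difference would be threads sitting in D state waiting on the disk, which matches the R/D thread states mentioned further down.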

Here's a flavor of the output (as I write now, when the batch size has decreased) of:

$ iotop -ob


Current DISK READ:       2.76 G/s | Current DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
12184 be/4 isroelko  936.79 M/s    5.72 M/s  0.00 % 80.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12186 be/4 isroelko  961.04 M/s    7.28 M/s  0.00 % 78.52 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12185 be/4 isroelko  932.44 M/s    9.45 M/s  0.00 % 77.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-


The relevant output of running the following for 124s - this is the last accumulation:

$ iotop -ab

Total DISK READ:         2.76 G/s | Total DISK WRITE:        16.87 M/s
Current DISK READ:       2.74 G/s | Current DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
12184 be/4 isroelko     88.11 G    414.80 M  0.00 % 79.31 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12185 be/4 isroelko     87.82 G    498.31 M  0.00 % 78.74 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12186 be/4 isroelko     88.17 G    435.88 M  0.00 % 78.66 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12187 be/4 isroelko      3.07 G    206.02 M  0.00 %  4.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12156 be/4 isroelko     68.29 M      8.00 K  0.00 %  0.63 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12171 be/4 isroelko   1048.00 K      0.00 B  0.00 %  0.13 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12160 be/4 isroelko   1016.00 K      0.00 B  0.00 %  0.10 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12157 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12192 be/4 isroelko    388.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12162 be/4 isroelko   1704.00 K      8.00 K  0.00 %  0.06 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12181 be/4 isroelko     33.52 M      9.52 M  0.00 %  0.05 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12182 be/4 isroelko     33.29 M      9.69 M  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12177 be/4 isroelko    404.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12176 be/4 isroelko    140.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12191 be/4 isroelko    176.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12173 be/4 isroelko     48.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12175 be/4 isroelko    112.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12190 be/4 isroelko     44.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena
12174 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12183 be/4 isroelko     33.54 M      9.70 M  0.00 %  0.02 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12159 be/4 isroelko     84.00 K      0.00 B  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
12170 be/4 isroelko      0.00 B    100.00 K  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-



And now the output for pid 12155 - the process running all the threads:
$ pidstat -p 12155 -ut


Linux 4.19.0-9-cloud-amd64 (graph4codev1)   06/22/20    _x86_64_    (10 CPU)

04:38:32      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
04:38:32     1001     12155         -   44.41   47.13    0.00    0.00   91.54     4  java
04:38:32     1001         -     12155    0.00    0.00    0.00    0.00    0.00     4  |__java
04:38:32     1001         -     12156    4.88    0.11    0.00    0.00    4.99     7  |__java
04:38:32     1001         -     12157    0.52    0.00    0.00    0.00    0.52     7  |__GC Thread#0
04:38:32     1001         -     12158    0.00    0.00    0.00    0.00    0.00     1  |__G1 Main Marker
04:38:32     1001         -     12159    0.27    0.00    0.00    0.00    0.27     2  |__G1 Conc#0
04:38:32     1001         -     12160    0.34    0.00    0.00    0.01    0.34     8  |__G1 Refine#0
04:38:32     1001         -     12161    0.28    0.00    0.00    0.00    0.28     6  |__G1 Young RemSet
04:38:32     1001         -     12162    0.02    0.01    0.00    0.00    0.03     1  |__VM Thread
04:38:32     1001         -     12163    0.00    0.00    0.00    0.00    0.00     4  |__Reference Handl
04:38:32     1001         -     12164    0.00    0.00    0.00    0.00    0.00     7  |__Finalizer
04:38:32     1001         -     12165    0.00    0.00    0.00    0.00    0.00     6  |__Signal Dispatch
04:38:32     1001         -     12166    0.04    0.00    0.00    0.00    0.04     1  |__C2 CompilerThre
04:38:32     1001         -     12167    0.00    0.00    0.00    0.00    0.00     1  |__C1 CompilerThre
04:38:32     1001         -     12168    0.00    0.00    0.00    0.00    0.00     1  |__Sweeper thread
04:38:32     1001         -     12169    0.00    0.00    0.00    0.00    0.00     1  |__Service Thread
04:38:32     1001         -     12170    0.03    0.02    0.00    0.01    0.05     1  |__VM Periodic Tas
04:38:32     1001         -     12171    0.00    0.00    0.00    0.00    0.00     6  |__Common-Cleaner
04:38:32     1001         -     12173    0.51    0.00    0.00    0.00    0.51     8  |__GC Thread#1
04:38:32     1001         -     12174    0.51    0.00    0.00    0.00    0.51     3  |__GC Thread#2
04:38:32     1001         -     12175    0.50    0.00    0.00    0.00    0.50     1  |__GC Thread#3
04:38:32     1001         -     12176    0.51    0.00    0.00    0.00    0.52     5  |__GC Thread#4
04:38:32     1001         -     12177    0.51    0.00    0.00    0.00    0.51     2  |__GC Thread#5
04:38:32     1001         -     12178    0.00    0.00    0.00    0.00    0.00     5  |__Thread-0
04:38:32     1001         -     12179    0.00    0.00    0.00    0.00    0.00     6  |__Thread-1
04:38:32     1001         -     12180    0.00    0.00    0.00    0.00    0.00     7  |__Thread-2
04:38:32     1001         -     12181    4.33    0.09    0.00    0.00    4.42     2  |__Thread-3
04:38:32     1001         -     12182    5.37    0.17    0.00    0.00    5.54     3  |__Thread-4
04:38:32     1001         -     12183    5.12    0.17    0.00    0.00    5.29     4  |__Thread-5
04:38:32     1001         -     12184    4.44   14.90    0.00    0.13   19.34     3  |__Thread-6
04:38:32     1001         -     12185    5.53   15.30    0.00    0.14   20.83     2  |__Thread-7
04:38:32     1001         -     12186    5.08   15.00    0.00    0.14   20.07     9  |__Thread-8
04:38:32     1001         -     12187    3.57    1.51    0.00    0.02    5.08     0  |__Thread-9
04:38:32     1001         -     12190    0.52    0.00    0.00    0.00    0.52     6  |__GC Thread#6
04:38:32     1001         -     12191    0.51    0.00    0.00    0.00    0.51     0  |__GC Thread#7
04:38:32     1001         -     12192    0.50    0.00    0.00    0.00    0.51     4  |__GC Thread#8
04:38:32     1001         -     12193    0.02    0.00    0.00    0.00    0.02     5  |__G1 Refine#1
04:38:32     1001         -     12194    0.01    0.00    0.00    0.00    0.01     9  |__G1 Refine#2
04:38:32     1001         -     12195    0.01    0.00    0.00    0.00    0.01     0  |__G1 Refine#3
04:38:32     1001         -     12196    0.01    0.00    0.00    0.00    0.01     7  |__G1 Refine#4
04:38:32     1001         -     12197    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#5
04:38:32     1001         -     12198    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#6
04:38:32     1001         -     12199    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#7
04:38:32     1001         -     12200    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#8
04:38:32     1001         -     12205    0.27    0.00    0.00    0.00    0.27     9  |__G1 Conc#1


You can see that threads 12184, 12185 and 12186 have the highest CPU utilization, albeit still only around 20%. They are the same threads seen in the iotop output, mostly doing a lot of reading. As the iotop output shows, the read rate is over 900 M/s per thread, which is excellent. All the other threads are idle the overwhelming majority of the time.

Finally, htop gives a similar picture: the 3 threads have the highest CPU utilization (around 20%) now, so 7/10 cores are practically idle, with short spikes when the other threads pick up - which isn't much, as the above shows. Also, those 3 threads are usually in R or D state (rarely S), while the rest spend most of their time in S.
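To pin down what those three busy threads actually are, one standard JVM trick (a sketch - 12155 and 12184 are just the ids from this run) is to take a thread dump and match the OS thread id, which shows up in the dump as a hex nid:

$ printf '%x\n' 12184        # OS thread id in hex
2f98
$ jstack 12155 > threads.txt
$ grep 'nid=0x2f98' threads.txt

The stack trace for that thread should show which part of the loader it is running.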

I didn't understand what you wrote about the heap size, RAM and read threads - I don't use Java - so I'm flying by the seat of my pants.
Here is the output of

$ jstat -gc 12155
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   184320.0  0.0   184320.0 2404352.0 102400.0 1519616.0   642119.0  20224.0 19386.6 2560.0 2339.3   2974 1039.542   0      0.000  340     2.989 1042.531

which may be relevant. 
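Reading that back (my interpretation - jstat -gc reports these columns in KB):

OU (old gen used)  642,119 KB  ~ 0.6 GB
EU (eden used)     102,400 KB  ~ 0.1 GB

So the Java heap itself is using well under 1 GB; the rest of the 16 GB presumably ends up as OS file-system cache over the database files.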

My very crudely formulated question is: how come, with smaller batch sizes, the processing time increased so much while at the same time there is actually much less processing going on (much lower CPU utilization)? It seems like everything is waiting for whatever those 3 threads are doing.
And is there any way to address it?

thanks for your help!

best,

Isroel


Re: TDB2 parallel load on cloud SSD and other observations/questions

Posted by Andy Seaborne <an...@apache.org>.
 > And when it says "nq" is really quads or all data for the default
 > graph? (there is more indexing work for named graphs).

 >> : Quads = 127,206,280


OK - it's quads. There are 6 quad indexes and in full parallel mode it 
will use 2 more threads to parse and to build the node table.

Full parallel loading is going to use up all the cores and HT threads 
aren't full threads for this purpose.

The phased loader (the default) uses fewer threads - see the command sketch after the phase breakdown below.

Phase 1:
one thread to decompress and parse
one thread to build the node table.
one thread for the GSPO index
(and one for SPO but you seem to have no triples)
=3

Phase 2:
two threads
=2

Phase 3:
three threads
=3

     Andy



Re: TDB2 parallel load on cloud SSD and other observations/questions

Posted by Andy Seaborne <an...@apache.org>.
Hi there,

Thanks for reporting the findings.

On 20/06/2020 16:10, Isroel Kogan wrote:
> Hi,
> 
> I am also a newcomer to the RDF world - and particularly Jena, which I started using this week.
> 
> A couple of observations I have made over the last few days exploring different options.
> 
> Local Machine (specs):
> 
> Ubuntu 18.04
> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)

which is 4 cores with hyperthreading. For this workload that is more like 
4 threads; HT is not a complete x2 for this sort of continuous processing.

And pre-emptive timeslicing is not nice!

> 16GB RAM
> 512 SSD (NVMe).
> 
> 
> the following compares loading a file in compressed vs decompressed format -both w parallel loader.
> 
> file:
> docstrings_triples.nq
> size: 28GB
> 
> cmd:
> time tdb2.tdbloader --loader=parallel --loc=test1graphdb docstrings_triples.nq > tdb2.log1 2>&1
> 
> :: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
> 
> real    22m46.346s
> user    120m46.591s
> sys    3m22.698s
> 
> 
> file:
> docstrings_triples.nq.bz2
> size: 542M
> 
> cmd:
> 
> time tdb2.tdbloader --loader=parallel --loc=test2graphdb docstrings_triples.nq.bz2 > tdb2.log2 2>&1
> 
> :: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
> 
> 
> real    37m8.182s
> user    109m42.970s
> sys    6m27.426s
> 
> resulting DB size
> 30GB
> 
> confirmed equal via diff.
> 
> pbzip2 ran in 84s
> 
> Less rigorously I noticed a similar gain in speed for other files.

For gz files, the difference in loading speed between compressed and 
uncompressed is usually not very much.  It does look like bz2 is different.

Using a separate process and faster decompressor may help:

bzip2 -d < docstrings_triples.nq.bz2 | \
time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
     -- - > tdb2.log2 2>&1
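
Since pbzip2 decompressed the same file in 84s, a variant of that pipeline 
using it may be worth trying (a sketch reusing the same stdin form, untested here):

pbzip2 -dc docstrings_triples.nq.bz2 | \
tdb2.tdbloader --loader=parallel --loc=test2graphdb -- - > tdb2.log2 2>&1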

When Jena decompresses a bz2 file, it uses Apache Commons Compress, so 
it is a Java decompressor which will take time to get optimized by the 
JIT and is likely slower than a specialized tool like bzip2.

But with 4 cores, it may have the opposite effect - using more processes 
causes pre-emptive timeslicing.

It may be that one of the other loaders is faster because it is a better 
match to the hardware.

> Is this expected behaviour? What factors influence this?
> 
> SSD - local vs cloud.
> 
> on my local machine, when running parallel loader, cores were working at over 70% capacity and there was little IO induced down time.

How many cores were active?
And when it says "nq", is it really quads or all data for the default graph? 
(there is more indexing work for named graphs).

Some of that will be the bz2 decompression, but it looks to me like it's 
"more threads than cores" causing timeslicing.

> 
> GCP instance specs:
> 
> 20 CPU
> 32GB RAM

And same heap size?

While the parallel loader is using multiple threads, it is a fixed number, 
so more CPUs will only help up to that number of threads.

More RAM is going to help because the OS will use it for file system 
cache, delaying writes.
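
To see that effect while a load runs, a generic check (just a sketch, nothing TDB2-specific):

free -h      # the buff/cache figure is the file-system cache
vmstat 5     # the bi/bo columns show block IO per interval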

But with more real threads (cores), it could be there is less preemptive 
scheduling, and that could be a big gain.

> 6TB "local SSD" storage
> the local SSD storage offers the best performance to reduce IO latency - it has physical proximity to instance - as per GCP.
> 
> a few cores were working at near capacity, while the vast majority idle (near 0%) w occasional spikes. average load translates to 20% utilization. As I've seen others write here, this is a difference others have noted.
> How can this be addressed? buffer size? (I don't have a deep enough understanding).

My guess is that on the GCP instance it is one thread-one core.

> 
> 
> Another recurring pattern is the reduction in batch size.
> I've been running a load job on my gcp instance for almost a day (23+h).
> 
> file size: 93GB
> triples: 472m
> 
> batch size decreased from 160k range to under 1k, while processing time per batch increased from a few seconds to over 10 min. All this time average CPU usage has remained steady, as has RAM usage.

Not sure I quite understand - this is adding more data to an existing 
database? And 10mins for 1k? While it will be slower, that does sound 
rather extreme.

> 
> I don't understand how all of this works with indexing. Is this expected behaviour? besides a locally proximate SSD, I've thrown an overkill of hardware at it.
> 
> thanks
> 

     Andy