Posted to users@jena.apache.org by Neubert Joachim <J....@zbw.eu> on 2013/10/28 13:01:08 UTC

Declining TDB load performance with larger files

I'm loading a 111 million triples file (GND German Authority files). For the first roughly 70 million triples it's really fast (an average of more than 60,000 triples per second), but then throughput declines continuously to a thousand or just a few hundred triples per second (which brings the average down to less than 7,000). During the last part of the triples data phase, Java is down to 1-2% CPU usage, while disk usage goes up to 100%.

As TDB writes to disk, I'd expect rather linear loading times. The CentOS 6 64-bit machine (11.5 GB memory) runs on a VMware vSphere cluster, with SAN storage underneath. As I observed the same behavior at different times of day, with certainly different load situations, there is no indication that it depends on parallel activity on the cluster.

Perhaps there is something wrong in my configuration, but I could not figure out what it might be. I've added an extract of the log below - it would be great if somebody could help me with hints.

Cheers, Joachim

---------------

2013-10-25 13:33:33 start run

Configuration:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
JAVA_OPTS: -d64 -Xms6g -Xmx10g
Jena:       VERSION: 2.11.0
Jena:       BUILD_DATE: 2013-09-12T10:49:49+0100
ARQ:        VERSION: 2.11.0
ARQ:        BUILD_DATE: 2013-09-12T10:49:49+0100
RIOT:       VERSION: 2.11.0
RIOT:       BUILD_DATE: 2013-09-12T10:49:49+0100
TDB:        VERSION: 1.0.0
TDB:        BUILD_DATE: 2013-09-12T10:49:49+0100

Use fuseki tdb.tdbloader on file /opt/thes/var/gnd/latest/src/GND.ttl.gz
INFO  -- Start triples data phase
INFO  ** Load empty triples table
INFO  Load: /opt/thes/var/gnd/latest/src/GND.ttl.gz -- 2013/10/25 13:33:35 MESZ
INFO  Add: 10.000.000 triples (Batch: 64.766 / Avg: 59.984)
INFO    Elapsed: 166,71 seconds [2013/10/25 13:36:21 MESZ]
INFO  Add: 20.000.000 triples (Batch: 71.839 / Avg: 58.653)
INFO    Elapsed: 340,99 seconds [2013/10/25 13:39:16 MESZ]
INFO  Add: 30.000.000 triples (Batch: 67.750 / Avg: 60.271)
INFO    Elapsed: 497,75 seconds [2013/10/25 13:41:52 MESZ]
INFO  Add: 40.000.000 triples (Batch: 68.212 / Avg: 60.422)
INFO    Elapsed: 662,01 seconds [2013/10/25 13:44:37 MESZ]
INFO  Add: 50.000.000 triples (Batch: 54.171 / Avg: 60.645)
INFO    Elapsed: 824,47 seconds [2013/10/25 13:47:19 MESZ]
INFO  Add: 60.000.000 triples (Batch: 58.823 / Avg: 60.569)
INFO    Elapsed: 990,60 seconds [2013/10/25 13:50:05 MESZ]
INFO  Add: 70.000.000 triples (Batch: 45.495 / Avg: 60.468)
INFO    Elapsed: 1.157,63 seconds [2013/10/25 13:52:52 MESZ]
INFO  Add: 80.000.000 triples (Batch: 50.050 / Avg: 57.998)
INFO    Elapsed: 1.379,36 seconds [2013/10/25 13:56:34 MESZ]
INFO  Add: 90.000.000 triples (Batch: 13.954 / Avg: 52.447)
INFO    Elapsed: 1.716,02 seconds [2013/10/25 14:02:11 MESZ]
INFO  Add: 100.000.000 triples (Batch: 1.134 / Avg: 19.024)
INFO    Elapsed: 5.256,29 seconds [2013/10/25 15:01:11 MESZ]
INFO  Add: 110.000.000 triples (Batch: 944 / Avg: 7.643)
INFO    Elapsed: 15.942,27 seconds [2013/10/25 17:59:17 MESZ]
INFO  -- Finish triples data phase
INFO  111.813.447 triples loaded in 16.288,16 seconds [Rate: 6.864,71 per second]

The indexing phase also takes its time, and shows some decrease in performance too, but no sharp drop.

INFO  -- Start triples index phase
INFO    Elapsed: 20.563,36 seconds [2013/10/25 19:16:18 MESZ]
INFO  ** Index SPO->POS: 111.786.233 slots indexed in 4.371,67 seconds [Rate: 25.570,57 per second]
...
INFO  -- Finish triples index phase
INFO  ** 111.786.233 triples indexed in 19.973,81 seconds [Rate: 5.596,64 per second]
INFO  -- Finish triples load
INFO  ** Completed: 111.813.447 triples loaded in 36.261,98 seconds [Rate: 3.083,49 per second]


AW: Declining TDB load performance with larger files

Posted by Neubert Joachim <J....@zbw.eu>.
Hi Andy,

Thank you so much. I fixed the heap issue, and will check and report my findings.

Cheers, Joachim

-----Original Message-----
From: Andy Seaborne [mailto:andy@apache.org] 
Sent: Monday, 28 October 2013 16:58
To: users@jena.apache.org
Subject: Re: Declining TDB load performance with larger files

Hi Joachim,

What is happening is that the system is running out of working space and the disk is being used for real.

 > JAVA_OPTS: -d64 -Xms6g -Xmx10g

Don't set -Xmx10g.  Try a 2G heap.  Don't bother with -Xms.

More heap does not help - in fact, it can make things worse.  TDB uses memory-mapped files - these are not in Java heap space.  The operating system manages how much real RAM is devoted to the virtual address space for the files.  As your JVM heap grows, it reduces the space available for file caching.
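A minimal command sketch of that advice (the `JVM_ARGS` variable name used by Jena's bin scripts and the dataset location `/opt/thes/var/tdb` are assumptions; adjust to your installation):

```shell
# Keep the heap small: no -Xms, no -d64, just a modest -Xmx. A small heap
# leaves the remaining RAM to the OS page cache, which backs TDB's
# memory-mapped files.
export JVM_ARGS="-Xmx2g"

# Hypothetical invocation of the command-line loader (dataset location
# /opt/thes/var/tdb is an assumed path):
tdbloader --loc=/opt/thes/var/tdb /opt/thes/var/gnd/latest/src/GND.ttl.gz
```

This is a sketch of the loader invocation only, not a verified command line; check the script name and options against the Jena distribution you have installed.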


There is another effect.  The OS is managing memory, but sometimes it 
gets its policy wrong.   Oddly, the faster the initial part of the load, 
the lower the speed it drops to when workspace RAM runs out.  My guess is that the OS guesses some access style and the code then breaks that assumption.  It can even differ from run to run on the same machine.

There is also tdbloader2 - it may be faster, it may not.  It is vulnerable to the OS in different ways.

As it is so system-specific, try each and see what happens, after fixing the heap issue.

	Andy



AW: AW: Declining TDB load performance with larger files

Posted by Neubert Joachim <J....@zbw.eu>.
Hi Andy,

With the kind support of an IT infrastructure colleague, I'll temporarily have 32 GB of memory for the next run. 

Are you aware of any possibility to check how much of the OS-managed memory is used for TDB? I just wonder how much would be enough ...
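One rough way to watch this on Linux (a sketch: the "Cached" figure in /proc/meminfo counts all file-backed pages, not only TDB's, so on a shared machine it is only a proxy):

```shell
# System-wide view of free memory vs. page cache. On a machine dedicated
# to the load, "Cached" is a usable proxy for how much RAM the OS is
# devoting to file-backed pages, including TDB's memory-mapped files.
grep -E '^(MemFree|Cached):' /proc/meminfo
```

For a per-process view, `pmap -x <pid>` on the loader JVM shows the resident portion of each mapped file.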

Cheers, Joachim

-----Original Message-----
From: Andy Seaborne [mailto:andy@apache.org] 
Sent: Saturday, 2 November 2013 13:40
To: users@jena.apache.org
Subject: Re: AW: Declining TDB load performance with larger files

On 01/11/13 12:42, Neubert Joachim wrote:
> I did a comparison of tdbloader vs. tdbloader2. The results are not reliable (machine-dependent, and perhaps even influenced by different background load on the VM cluster), but perhaps even then they may be interesting to others:
>

Thanks - I'm interested - and if anyone else has examples, that would be useful information as well.

Which OS is this? And specifically, does sort(1) have the --parallel option? (Ubuntu does; CentOS does not at the moment.)
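A quick way to probe this on a given machine (a sketch; --parallel needs GNU coreutils 8.6 or later):

```shell
# Probe whether the installed sort(1) accepts --parallel by sorting an
# empty input; older coreutils builds reject the unknown option.
if sort --parallel=2 /dev/null >/dev/null 2>&1; then
    echo "sort supports --parallel"
else
    echo "sort does not support --parallel"
fi
```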

> tdbloader w/ 2G heap
> 4:15 Data phase
> 4:30 Index phase
>
> tdbloader2 w/ 2G heap
> 1:30 Data phase
> 6:30 Index phase
>
> So in sum tdbloader2 shows a slight advantage in my current configuration.

This is what I've been finding when the load is a bit more than the machine has RAM for its current working set.

The sorting in the index phase of tdbloader2 sometimes, on some systems, takes much longer than I think it should.  I even had an example last week where giving sort(1) less space sped it up, which I find bizarre.

> The reduction of heap space had indeed brought an improvement:
>
> tdbloader w/ 10G heap
> 4:30 Data phase
> 5:45 Index phase
>
> Could I expect a larger improvement by adding more memory (for example upgrading from 11 to 32 GB)? Are there any experiences for estimating an optimal memory size for tdb loading?

Most likely.  The loading rate is susceptible to system issues, as we've all been finding, and at the moment I don't have a good model for what the factors are.  But for a given setup, more RAM is good, especially for the all-Java tdbloader.

	Andy



AW: AW: Declining TDB load performance with larger files

Posted by Neubert Joachim <J....@zbw.eu>.
Hi Andy,

On this machine it's Red Hat EL 5.9 and, as you suspected, sort has no "--parallel" option.

Thank you very much for the explanations. I just filed an improvement suggestion to include tdbloader2 in the Fuseki distribution, to make it easier to experiment with the different options on large datasets.

Cheers, Joachim



Re: AW: Declining TDB load performance with larger files

Posted by Andy Seaborne <an...@apache.org>.


Re: WG: Declining TDB load performance with larger files

Posted by Andy Seaborne <an...@apache.org>.
Thank you very much for the figures.

I can understand why tdbloader is affected by the change in RAM size.

But your (and mine, recently) experience with tdbloader2 is currently 
mysterious to me.

	Andy

On 05/11/13 06:06, Neubert Joachim wrote:
> I repeated the data loading with 32 G mem (instead of 11 G, see below). tdbloader2 showed a slight improvement:
>
> tdbloader2
> 0:30 Data phase
> 5:45 Index phase
>
> Almost all of the index phase (which seems to have taken longer with more memory, though that could have been caused by parallel activity on the VM cluster) was consumed by the OS sort.
>
> Yet, tdbloader took off:
>
> tdbloader
> 0:30 Data phase
> 0:30 Index phase
>
> Exactly 59 minutes for the whole run! Rough figures from atop showed, near the end of the index phase, the tdbloader Java process with 20.3 G rsize, 24.5 G vsize and 8 G vdata. The disk was under full load at this time, and about 9 G of temporary disk space (in addition to the 18 G of the TDB directory) was used.
>
> Thank you very much for your help, Andy, and for this great piece of software.
>
> Joachim
>
>


WG: Declining TDB load performance with larger files

Posted by Neubert Joachim <J....@zbw.eu>.
I repeated the data loading with 32 G mem (instead of 11 G, see below). tdbloader2 showed a slight improvement:

tdbloader2
0:30 Data phase
5:45 Index phase

Almost all of the index phase (which seems to have taken longer with more memory, though that could have been caused by parallel activity on the VM cluster) was consumed by the OS sort.

Yet, tdbloader took off:

tdbloader
0:30 Data phase
0:30 Index phase

Exactly 59 minutes for the whole run! Rough figures from atop showed, near the end of the index phase, the tdbloader Java process with 20.3 G rsize, 24.5 G vsize and 8 G vdata. The disk was under full load at this time, and about 9 G of temporary disk space (in addition to the 18 G of the TDB directory) was used.

Thank you very much for your help, Andy, and for this great piece of software.

Joachim

-----Original Message-----
From: Neubert Joachim [mailto:J.Neubert@zbw.eu]
Sent: Friday, 1 November 2013 13:43
To: users@jena.apache.org
Subject: AW: Declining TDB load performance with larger files

I did a comparison of tdbloader vs. tdbloader2. The results are not reliable (machine-dependent, and perhaps even influenced by different background load on the VM cluster), but they may still be interesting to others:

tdbloader w/ 2G heap
4:15 Data phase
4:30 Index phase

tdbloader2 w/ 2G heap
1:30 Data phase
6:30 Index phase

So in sum tdbloader2 shows a slight advantage in my current configuration. 

The reduction of heap space had indeed brought an improvement:

tdbloader w/ 10G heap
4:30 Data phase
5:45 Index phase

Could I expect a larger improvement by adding more memory (for example, upgrading from 11 to 32 GB)? Is there any experience to draw on for estimating an optimal memory size for TDB loading?

Cheers, Joachim


AW: Declining TDB load performance with larger files

Posted by Neubert Joachim <J....@zbw.eu>.
I did a comparison of tdbloader vs. tdbloader2. The results are not reliable (machine-dependent, and perhaps even influenced by different background load on the VM cluster), but they may still be interesting to others:

tdbloader w/ 2G heap
4:15 Data phase
4:30 Index phase

tdbloader2 w/ 2G heap
1:30 Data phase
6:30 Index phase

So in sum tdbloader2 shows a slight advantage in my current configuration. 
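Read as h:mm, the phase times above can be totalled to make the comparison concrete; a quick sketch (the figures are copied from the two runs above):

```shell
# Sum h:mm phase times and print the total wall-clock time per loader run.
total() { awk -F: '{ m += $1 * 60 + $2 } END { printf "%d:%02d\n", m / 60, m % 60 }'; }
printf '4:15\n4:30\n' | total   # tdbloader  with 2G heap: 8:45 total
printf '1:30\n6:30\n' | total   # tdbloader2 with 2G heap: 8:00 total
```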

The reduction of heap space had indeed brought an improvement:

tdbloader w/ 10G heap
4:30 Data phase
5:45 Index phase

Could I expect a larger improvement by adding more memory (for example, upgrading from 11 to 32 GB)? Is there any experience to draw on for estimating an optimal memory size for TDB loading?

Cheers, Joachim

-----Original Message-----
From: Andy Seaborne [mailto:andy@apache.org]
Sent: Monday, 28 October 2013 16:58
To: users@jena.apache.org
Subject: Re: Declining TDB load performance with larger files

Hi Joachim,

What is happening is that the system is running out of working space and the disk is being used for real.

 > JAVA_OPTS: -d64 -Xms6g -Xmx10g

Don't set -Xmx10g.  Try a 2G heap.  Don't bother with -Xms.

More heap does not help - in fact, it can make it worse.  TDB uses memory mapped files - these are not in Java heap space.  The operating system manages how much real RAM is devoted to the virtual address space for the file.  As your JVM grows, it is reducing the space for file caching.


There is another effect.  The OS is managing memory but sometimes it 
gets its policy wrong.  Oddly, the faster the initial part of the load, 
the lower the speed it drops to when workspace RAM runs out.  My guess is that the OS guesses some access pattern and the code then breaks that assumption.  It can even differ from run to run on the same machine.

There is also tdbloader2 - it may be faster, it may not.  It is vulnerable to the OS in different ways.

As it is so system-specific, try each and see what happens, after fixing the heap issue.

	Andy



Re: Declining TDB load performance with larger files

Posted by Andy Seaborne <an...@apache.org>.
Hi Joachim,

What is happening is that the system is running out of working space and 
the disk is being used for real.

 > JAVA_OPTS: -d64 -Xms6g -Xmx10g

Don't set -Xmx10g.  Try a 2G heap.  Don't bother with -Xms.

More heap does not help - in fact, it can make it worse.  TDB uses 
memory mapped files - these are not in Java heap space.  The operating 
system manages how much real RAM is devoted to the virtual address space 
for the file.  As your JVM grows, it is reducing the space for file 
caching.
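
A minimal sketch of that advice as environment configuration (the variable name and flags are those shown in the configuration dump in the original post; the 2G figure is the suggestion above, not a measured optimum):

```shell
# Cap the Java heap at 2G and drop -Xms entirely, leaving the bulk
# of RAM to the OS page cache for TDB's memory-mapped files.
export JAVA_OPTS="-d64 -Xmx2g"
```

The Jena command-line scripts, including tdbloader, pick JAVA_OPTS up from the environment, as the configuration dump at the start of the original post shows.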


There is another effect.  The OS is managing memory but sometimes it 
gets its policy wrong.  Oddly, the faster the initial part of the load, 
the lower the speed it drops to when workspace RAM runs out.  My guess 
is that the OS guesses some access pattern and the code then breaks that 
assumption.  It can even differ from run to run on the same machine.

There is also tdbloader2 - it may be faster, it may not.  It is 
vulnerable to the OS in different ways.

As it is so system-specific, try each and see what happens, after 
fixing the heap issue.

	Andy


On 28/10/13 12:01, Neubert Joachim wrote:
> I'm loading a file of 111 million triples (the GND German authority files).
> For the first roughly 70 million triples it's really fast (more than
> 60,000 triples/s on average), but then throughput declines continuously
> to a thousand or just a few hundred triples per second (which brings the
> average down to less than 7,000). During the last part of the triples
> data phase, Java is down to 1-2% CPU usage, while disk usage goes up to 100%.
>
> As TDB writes to disk, I'd expect rather linear loading times. The
> CentOS 6 64-bit machine (11.5 GB memory) runs on a VMware vSphere
> cluster with underlying SAN hardware. As I observed the same behavior
> at different times of day, with certainly different load situations,
> there is no indication that it depended on parallel activity on the
> cluster.
>
> Perhaps there is something wrong in my config, but I could not figure
> out what it might be. I've added an extract of the log below - it would
> be great if somebody could help me with hints.
>
> Cheers, Joachim
>

 > ---------------
 >
 > 2013-10-25 13:33:33 start run
 >
 > Configuration:
 > java version "1.6.0_24"
 > Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
 > Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
 > JAVA_OPTS: -d64 -Xms6g -Xmx10g
 > Jena:       VERSION: 2.11.0
 > Jena:       BUILD_DATE: 2013-09-12T10:49:49+0100
 > ARQ:        VERSION: 2.11.0
 > ARQ:        BUILD_DATE: 2013-09-12T10:49:49+0100
 > RIOT:       VERSION: 2.11.0
 > RIOT:       BUILD_DATE: 2013-09-12T10:49:49+0100
 > TDB:        VERSION: 1.0.0
 > TDB:        BUILD_DATE: 2013-09-12T10:49:49+0100
 >
 > Use fuseki tdb.tdbloader on file /opt/thes/var/gnd/latest/src/GND.ttl.gz
 > INFO  -- Start triples data phase
 > INFO  ** Load empty triples table
 > INFO  Load: /opt/thes/var/gnd/latest/src/GND.ttl.gz -- 2013/10/25 13:33:35 MESZ
 > INFO  Add: 10.000.000 triples (Batch: 64.766 / Avg: 59.984)
 > INFO    Elapsed: 166,71 seconds [2013/10/25 13:36:21 MESZ]
 > INFO  Add: 20.000.000 triples (Batch: 71.839 / Avg: 58.653)
 > INFO    Elapsed: 340,99 seconds [2013/10/25 13:39:16 MESZ]
 > INFO  Add: 30.000.000 triples (Batch: 67.750 / Avg: 60.271)
 > INFO    Elapsed: 497,75 seconds [2013/10/25 13:41:52 MESZ]
 > INFO  Add: 40.000.000 triples (Batch: 68.212 / Avg: 60.422)
 > INFO    Elapsed: 662,01 seconds [2013/10/25 13:44:37 MESZ]
 > INFO  Add: 50.000.000 triples (Batch: 54.171 / Avg: 60.645)
 > INFO    Elapsed: 824,47 seconds [2013/10/25 13:47:19 MESZ]
 > INFO  Add: 60.000.000 triples (Batch: 58.823 / Avg: 60.569)
 > INFO    Elapsed: 990,60 seconds [2013/10/25 13:50:05 MESZ]
 > INFO  Add: 70.000.000 triples (Batch: 45.495 / Avg: 60.468)
 > INFO    Elapsed: 1.157,63 seconds [2013/10/25 13:52:52 MESZ]
 > INFO  Add: 80.000.000 triples (Batch: 50.050 / Avg: 57.998)
 > INFO    Elapsed: 1.379,36 seconds [2013/10/25 13:56:34 MESZ]
 > INFO  Add: 90.000.000 triples (Batch: 13.954 / Avg: 52.447)
 > INFO    Elapsed: 1.716,02 seconds [2013/10/25 14:02:11 MESZ]
 > INFO  Add: 100.000.000 triples (Batch: 1.134 / Avg: 19.024)
 > INFO    Elapsed: 5.256,29 seconds [2013/10/25 15:01:11 MESZ]
 > INFO  Add: 110.000.000 triples (Batch: 944 / Avg: 7.643)
 > INFO    Elapsed: 15.942,27 seconds [2013/10/25 17:59:17 MESZ]
 > INFO  -- Finish triples data phase
 > INFO  111.813.447 triples loaded in 16.288,16 seconds [Rate: 6.864,71 per second]
 >
 > Indexing phase also takes its time, and has some decrease in performance too, but does not show a sharp drop.
 >
 > INFO  -- Start triples index phase
 > INFO    Elapsed: 20.563,36 seconds [2013/10/25 19:16:18 MESZ]
 > INFO  ** Index SPO->POS: 111.786.233 slots indexed in 4.371,67 seconds [Rate: 25.570,57 per second]
 > ...
 > INFO  -- Finish triples index phase
 > INFO  ** 111.786.233 triples indexed in 19.973,81 seconds [Rate: 5.596,64 per second]
 > INFO  -- Finish triples load
 > INFO  ** Completed: 111.813.447 triples loaded in 36.261,98 seconds [Rate: 3.083,49 per second]
 >
 >
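
As a sanity check, the rates in the summary lines above can be recomputed from the log's triple counts and elapsed seconds (the log uses German number formatting: '.' as thousands separator, ',' as decimal separator):

```shell
# Recompute the reported loader rates from the log's totals.
awk 'BEGIN {
  printf "data phase: %.2f triples/s\n", 111813447 / 16288.16   # log says 6.864,71
  printf "full load:  %.2f triples/s\n", 111813447 / 36261.98   # log says 3.083,49
}'
```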