Posted to users@jena.apache.org by Steven Blanchard <st...@microbiome.studio> on 2023/05/22 09:49:25 UTC

xloader on large dataset : Data Task with poor load average

Hello,

I am currently trying to load a very large dataset (54 billion 
triples) with the tdb2.xloader command.

The first two steps (Nodes and Terms) completed with an average 
load speed of ~120,000 tuples/s.
The third stage (Data) has an average load speed of only 800 tuples/s. This 
average load speed is incompatible with the amount of data to be loaded.
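
As a rough sense of scale (back-of-envelope arithmetic, not figures from 
the loader output):
```
# 54 billion triples at the two observed rates, converted to days:
echo $(( 54000000000 / 120000 / 86400 ))   # ~5 days at 120,000 tuples/s
echo $(( 54000000000 / 800 / 86400 ))      # ~781 days (over two years) at 800 tuples/s
```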

Looking at the status of the job, it is possible that there is 
excessive demand on memory which is slowing the process down severely.

We saw with top that the Java process appears to claim a lot of memory:
```
top
#    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
# 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
```

But with a free -g, we see that it actually uses very little memory.
```
free -g
#        total  used  free  shared  buff/cache  available
# Mem:     125     3     0       0         121        120
```
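
One way to check where that resident memory actually goes - Java heap 
versus memory-mapped database files (pmap and jcmd are standard 
Linux/JDK tools; the PID is the one from the top output above):
```
pmap -x 867362 | sort -k3 -n | tail -n 15   # largest resident mappings (data files vs [heap])
jcmd 867362 GC.heap_info                    # current Java heap usage (JDK 9+)
```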

Are there any ways to speed up this step? (Give -Xms to Java?)
Can this significant drop in loading speed for this step be due to 
memory usage? Do you know of any other limiting causes in this loading 
stage?

For previous insertions on smaller datasets, this Data step was not 
limiting and the average speed was even slightly higher than the Nodes 
and Terms steps.

For information, the machine used has 32 CPUs and 128 GB of RAM.

Thanks for your help,
Regards,

Steven



Re: xloader on large dataset : Data Task with poor load average

Posted by Steven Blanchard <st...@microbiome.studio>.

On Thu., May 25 2023 at 08:46:31 +0100, Andy Seaborne <an...@apache.org> 
wrote:
> 
> 
> On 24/05/2023 10:22, Steven Blanchard wrote:
>> Hi Andy,
>> 
>> I tried it on a local disk and it had no impact on the average speed 
>> for the Data stage.
> 
> SSD or rotating disk? (It shouldn't make an extreme difference for 
> xloader, because that's part of the point of the xloader.)
On an SSD. The average speeds are the same on the SSD and on the Block 
Storage.
> 
>> I checked with iostat, there was indeed an increase in the speed of 
>> reading the input files. This step writes very little data so there 
>> was no difference in the writing speed.
>> 
>> I also did a test with only 1 of the uniprot files (291 million 
>> tuples) and the average speed was about 160,000 tuples/s. This 
>> value corresponds to speeds obtained on other insertions.
> 
> On the exact same hardware?

Yes, same hardware, same folder, same time. Only the quantity of data is 
different.

> 
>> Could this decrease of average speed be related to the amount of 
>> total data?
>> Is it possible to run this Data step only file by file and all the 
>> other steps with all files?
> 
> Not sure - there is a shared node table being built. The slowness is 
> presumably a consequence of the previous stages. Every use of the same 
> URI needs to have the same internal NodeId everywhere - i.e. it requires 
> seeing all the data.

During our tests, we resumed the insertion at the Data stage and we 
noticed that the decrease in average speed is related to the previous 
steps. If we pass as an argument the existing directory, with the Nodes 
and Terms steps already completed, the insertion speed of the Data step 
is 800 tuples/s. If we pass as an argument an empty directory, the 
insertion speed of the Data step is 190,000 tuples/s.

The decrease in speed therefore seems to be related to the amount of 
data and to the results of the previous steps. When this step ingests 
the data, is there an additional step that uses the previously created 
files and that could be very slow because of the total amount of data?
What is the link between these three steps? What does each of the three 
steps do for the data insertion?

> I'm still not seeing why the data stage starts at a slow rate - I 
> will need to find time to explore the code.
> 
> (This is an argument for having NodeIds be hashes because that can be 
> computed without reference to the table unique ids and representation 
> storage. Downside - the NodeIds would be longer, 96 or 128 bits and 
> hashes have bad locality (i.e. none whatsoever)).
> 
>     Andy
> 
>> 
>> Thank you,
>> 
>> Steven
>> 
>> On Tue., May 23 2023 at 11:30:36 +0100, Andy Seaborne 
>> <andy@apache.org> wrote:
>>> 
>>> 
>>> On 22/05/2023 16:38, Steven Blanchard wrote:
>>>> 
>>>> 
>>>> On Mon., May 22 2023 at 16:18:21 +0100, Andy Seaborne 
>>>> <andy@apache.org> wrote:
>>>> Hello Andy,
>>>> 
>>>>> Hi Steven,
>>>>> 
>>>>> How are you running xloader? Default settings?
>>>>> 
>>>> Yes, we use your default settings.
>>>> The command line used is the following line :
>>>> tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ --tmpdir 
>>>> /nfs/uniprot_tmp/ --threads 30 
>>>> /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
>>> 
>>> Just looking at that, the use of NFS may be related.
>>> 
>>> NFS is a shared, remote filing system, so it has comparatively high 
>>> overheads on every operation to give the semantics of sharing 
>>> (visibility on write).
>>> 
>>> Could you try using local disk to see if that makes a difference?
>>> 
>>>     Andy
>>> 
>>>> 
>>>>> What's the storage being used?
>>>> We use Block Storage from a cloud provider, with SSDs, on a mounted 
>>>> NFS volume.
>>>> 
>>>>> On 22/05/2023 10:49, Steven Blanchard wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I am currently trying to load a very large dataset ( 54 billion 
>>>>>> triples) with the tdb2.xloader command.
>>>>>> 
>>>>>> The first two steps (Nodes and Terms) are completed with an 
>>>>>> average load speed of ~ 120,000.
>>>>>> The third stage (Data) has an average load speed of only 800.
>>>>> 
>>>>> Is the "Avg" 800 from the start of the phase, or is it that "the 
>>>>> average drops to 800" during the phase?
>>>> The Avg is 800 from the start of the phase and it stays at 800.
>>>> 
>>>>>> This average load speed is incompatible with the amount of data 
>>>>>> to be loaded.
>>>>>> 
>>>>>> Looking at the status of the job, it is possible that there is 
>>>>>> an excessive demand on memory which slows down the 
>>>>>> process extremely.
>>>>>> 
>>>>>> We saw with top that the Java process appears to claim a lot of memory:
>>>>>> ```
>>>>>> top
>>>>>> #    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>>> # 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
>>>>>> ```
>>>>> 
>>>>> xloader does not have much requirement for java heap memory.
>>>> OK, since our email we have tried increasing -Xmx and we did not 
>>>> see any increase in performance.
>>>>> That space may be mapped files.
>>>>> 
>>>>>> But with a free -g, we see that it actually uses very little 
>>>>>> memory.
>>>>>> ```
>>>>>> free -g
>>>>>> #        total  used  free  shared  buff/cache  available
>>>>>> # Mem:     125     3     0       0         121        120
>>>>>> ```
>>>>>> 
>>>>>> Are there any possibilities to speed up this step?  (Give a -xms 
>>>>>> to java?)
>>>>>> Can this significant drop in loading speed for this step be due 
>>>>>> to memory usage? Do you know of any other limiting 
>>>>>> causes in this loading stage?
>>>>>> 
>>>>>> For previous insertions on smaller datasets, this Data step was 
>>>>>> not limiting and the average speed was even slightly 
>>>>>> higher than the Nodes and Terms steps.
>>>>> 
>>>>> How small is "smaller"?
>>>> For example, we have loaded the UniRef RDF database (same provider 
>>>> as UniProt), with 12 billion triples, with an average for the Data 
>>>> task of 230,000 tuples/s.
>>>> 
>>>>> That sounds like what I see when loading.
>>>>> 
>>>>>> 
>>>>>> For information, the machine used has 32 CPUs and 128 Giga of 
>>>>>> Ram.
>>>>>> 
>>>>>> Thanks for your help,
>>>>>> Regards,
>>>>>> 
>>>>>> Steven
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: xloader on large dataset : Data Task with poor load average

Posted by Andy Seaborne <an...@apache.org>.

On 24/05/2023 10:22, Steven Blanchard wrote:
> Hi Andy,
> 
> I tried it on a local disk and it had no impact on the average speed for 
> the Data stage.

SSD or rotating disk? (It shouldn't make an extreme difference for 
xloader, because that's part of the point of the xloader.)

> I checked with iostat, there was indeed an increase in the speed of 
> reading the input files. This step writes very little data so there was 
> no difference in the writing speed.
> 
> I also did a test with only 1 of the uniprot files (291 million tuples) 
> and the average speed was about 160,000 tuples/s. This value corresponds 
> to speeds obtained on other insertions.

On the exact same hardware?

> Could this decrease of average speed be related to the amount of total 
> data?
> Is it possible to run this Data step only file by file and all the other 
> steps with all files?

Not sure - there is a shared node table being built. The slowness is 
presumably a consequence of the previous stages. Every use of the same URI 
needs to have the same internal NodeId everywhere - i.e. it requires 
seeing all the data.

I'm still not seeing why the data stage starts at a slow rate - I will 
need to find time to explore the code.

(This is an argument for having NodeIds be hashes because that can be 
computed without reference to the table unique ids and representation 
storage. Downside - the NodeIds would be longer, 96 or 128 bits and 
hashes have bad locality (i.e. none whatsoever)).
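
To illustrate the idea with a throwaway example (not the actual TDB2 
scheme): such an id could be derived from the term alone, with no lookup 
in a shared table, e.g.
```
# first 128 bits of a digest of the term's lexical form, usable directly as an id
echo -n '<http://purl.uniprot.org/uniprot/P12345>' | sha256sum | cut -c1-32
```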

     Andy

> 
> Thank you,
> 
> Steven
> 
> On Tue., May 23 2023 at 11:30:36 +0100, Andy Seaborne <an...@apache.org> 
> wrote:
>>
>>
>> On 22/05/2023 16:38, Steven Blanchard wrote:
>>>
>>>
>>> On Mon., May 22 2023 at 16:18:21 +0100, Andy Seaborne 
>>> <andy@apache.org> wrote:
>>> Hello Andy,
>>>
>>>> Hi Steven,
>>>>
>>>> How are you running xloader? Default settings?
>>>>
>>> Yes, we use your default settings.
>>> The command line used is the following line :
>>> tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ --tmpdir 
>>> /nfs/uniprot_tmp/ --threads 30 
>>> /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
>>
>> Just looking at that, the use of NFS may be related.
>>
>> NFS is a shared, remote filing system, so it has comparatively high 
>> overheads on every operation to give the semantics of sharing 
>> (visibility on write).
>>
>> Could you try using local disk to see if that makes a difference?
>>
>>     Andy
>>
>>>
>>>> What's the storage being used?
>>> We use Block Storage from a cloud provider, with SSDs, on a mounted 
>>> NFS volume.
>>>
>>>> On 22/05/2023 10:49, Steven Blanchard wrote:
>>>>> Hello,
>>>>>
>>>>> I am currently trying to load a very large dataset ( 54 billion 
>>>>> triples) with the tdb2.xloader command.
>>>>>
>>>>> The first two steps (Nodes and Terms) are completed with an average 
>>>>> load speed of ~ 120,000.
>>>>> The third stage (Data) has an average load speed of only 800.
>>>>
>>>> Is the "Avg" 800 from the start of the phase, or is it that "the 
>>>> average drops to 800" during the phase?
>>> The Avg is 800 from the start of the phase and it stays at 800.
>>>
>>>>> This average load speed is incompatible with the amount of data to 
>>>>> be loaded.
>>>>>
>>>>> Looking at the status of the job, it is possible that there is an 
>>>>> excessive demand on memory which slows down the process extremely.
>>>>>
>>>>> We saw with top that the Java process appears to claim a lot of memory:
>>>>> ```
>>>>> top
>>>>> #    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>> # 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
>>>>> ```
>>>>
>>>> xloader does not have much requirement for java heap memory.
>>> OK, since our email we have tried increasing -Xmx and we did not 
>>> see any increase in performance.
>>>> That space may be mapped files.
>>>>
>>>>> But with a free -g, we see that it actually uses very little memory.
>>>>> ```
>>>>> free -g
>>>>> #        total  used  free  shared  buff/cache  available
>>>>> # Mem:     125     3     0       0         121        120
>>>>> ```
>>>>>
>>>>> Are there any possibilities to speed up this step?  (Give a -xms to 
>>>>> java?)
>>>>> Can this significant drop in loading speed for this step be due to 
>>>>> memory usage? Do you know of any other limiting causes in this 
>>>>> loading stage?
>>>>>
>>>>> For previous insertions on smaller datasets, this Data step was not 
>>>>> limiting and the average speed was even slightly higher than 
>>>>> the Nodes and Terms steps.
>>>>
>>>> How small is "smaller"?
>>> For example, we have loaded the UniRef RDF database (same provider 
>>> as UniProt), with 12 billion triples, with an average for the Data 
>>> task of 230,000 tuples/s.
>>>
>>>> That sounds like what I see when loading.
>>>>
>>>>>
>>>>> For information, the machine used has 32 CPUs and 128 Giga of Ram.
>>>>>
>>>>> Thanks for your help,
>>>>> Regards,
>>>>>
>>>>> Steven
>>>>>
>>>>>
>>>>>
>>>
>>>
> 
> 

Re: xloader on large dataset : Data Task with poor load average

Posted by Steven Blanchard <st...@microbiome.studio>.
Hi Andy,

I tried it on a local disk and it had no impact on the average speed 
for the Data stage.
I checked with iostat, there was indeed an increase in the speed of 
reading the input files. This step writes very little data so there was 
no difference in the writing speed.
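
For reference, the check was along these lines (the exact iostat options 
here are illustrative, not copied from the original run):
```
# watch device read/write throughput and utilisation while the Data phase runs
iostat -xm 5
```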

I also did a test with only 1 of the uniprot files (291 million tuples) 
and the average speed was about 160,000 tuples/s. This value 
corresponds to speeds obtained on other insertions.

Could this decrease in average speed be related to the total amount of 
data?
Is it possible to run only this Data step file by file, and all the 
other steps with all the files?

Thank you,

Steven

On Tue., May 23 2023 at 11:30:36 +0100, Andy Seaborne <an...@apache.org> 
wrote:
> 
> 
> On 22/05/2023 16:38, Steven Blanchard wrote:
>> 
>> 
>> On Mon., May 22 2023 at 16:18:21 +0100, Andy Seaborne 
>> <andy@apache.org> wrote:
>> Hello Andy,
>> 
>>> Hi Steven,
>>> 
>>> How are you running xloader? Default settings?
>>> 
>> Yes, we use your default settings.
>> The command line used is the following line :
>> tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ --tmpdir 
>> /nfs/uniprot_tmp/ --threads 30 
>> /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
> 
> Just looking at that, the use of NFS may be related.
> 
> NFS is a shared, remote filing system, so it has comparatively high 
> overheads on every operation to give the semantics of sharing 
> (visibility on write).
> 
> Could you try using local disk to see if that makes a difference?
> 
>     Andy
> 
>> 
>>> What's the storage being used?
>> We use Block Storage from a cloud provider, with SSDs, on a mounted 
>> NFS volume.
>> 
>>> On 22/05/2023 10:49, Steven Blanchard wrote:
>>>> Hello,
>>>> 
>>>> I am currently trying to load a very large dataset ( 54 billion 
>>>> triples) with the tdb2.xloader command.
>>>> 
>>>> The first two steps (Nodes and Terms) are completed with an 
>>>> average load speed of ~ 120,000.
>>>> The third stage (Data) has an average load speed of only 800.
>>> 
>>> Is the "Avg" 800 from the start of the phase, or is it that "the 
>>> average drops to 800" during the phase?
>> The Avg is 800 from the start of the phase and it stays at 800.
>> 
>>>> This average load speed is incompatible with the amount of data 
>>>> to be loaded.
>>>> 
>>>> Looking at the status of the job, it is possible that there is an 
>>>> excessive demand on memory which slows down the process 
>>>> extremely.
>>>> 
>>>> We saw with top that the Java process appears to claim a lot of memory:
>>>> ```
>>>> top
>>>> #    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>> # 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
>>>> ```
>>> 
>>> xloader does not have much requirement for java heap memory.
>> OK, since our email we have tried increasing -Xmx and we did not 
>> see any increase in performance.
>>> That space may be mapped files.
>>> 
>>>> But with a free -g, we see that it actually uses very little 
>>>> memory.
>>>> ```
>>>> free -g
>>>> #        total  used  free  shared  buff/cache  available
>>>> # Mem:     125     3     0       0         121        120
>>>> ```
>>>> 
>>>> Are there any possibilities to speed up this step?  (Give a -xms 
>>>> to java?)
>>>> Can this significant drop in loading speed for this step be due to 
>>>> memory usage? Do you know of any other limiting causes in this 
>>>> loading stage?
>>>> 
>>>> For previous insertions on smaller datasets, this Data step was 
>>>> not limiting and the average speed was even slightly higher 
>>>> than the Nodes and Terms steps.
>>> 
>>> How small is "smaller"?
>> For example, we have loaded the UniRef RDF database (same provider 
>> as UniProt), with 12 billion triples, with an average for the Data 
>> task of 230,000 tuples/s.
>> 
>>> That sounds like what I see when loading.
>>> 
>>>> 
>>>> For information, the machine used has 32 CPUs and 128 Giga of Ram.
>>>> 
>>>> Thanks for your help,
>>>> Regards,
>>>> 
>>>> Steven
>>>> 
>>>> 
>>>> 
>> 
>> 


Re: xloader on large dataset : Data Task with poor load average

Posted by Andy Seaborne <an...@apache.org>.

On 22/05/2023 16:38, Steven Blanchard wrote:
> 
> 
> On Mon., May 22 2023 at 16:18:21 +0100, Andy Seaborne <an...@apache.org> 
> wrote:
> Hello Andy,
> 
>> Hi Steven,
>>
>> How are you running xloader? Default settings?
>>
> Yes, we use your default settings.
> The command line used is the following line :
> tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ --tmpdir 
> /nfs/uniprot_tmp/ --threads 30 
> /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf

Just looking at that, the use of NFS may be related.

NFS is a shared, remote filing system, so it has comparatively high overheads 
on every operation to give the semantics of sharing (visibility on write).

Could you try using local disk to see if that makes a difference?
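
For example (directory names below are purely illustrative - the point is 
only that --tmpdir, and ideally --loc, sit on a local filesystem rather 
than on the NFS mount):
```
df -T /nfs/uniprot_tmp/                    # confirm the filesystem type of the current tmpdir
tdb2.xloader --loc /local/scratch/tdb2/UniProt_04_2022/ \
             --tmpdir /local/scratch/xloader_tmp/ \
             --threads 30 \
             /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
```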

     Andy

> 
>> What's the storage being used?
> We use Block Storage from a cloud provider, with SSDs, on a mounted NFS 
> volume.
> 
>> On 22/05/2023 10:49, Steven Blanchard wrote:
>>> Hello,
>>>
>>> I am currently trying to load a very large dataset ( 54 billion 
>>> triples) with the tdb2.xloader command.
>>>
>>> The first two steps (Nodes and Terms) are completed with an average 
>>> load speed of ~ 120,000.
>>> The third stage (Data) has an average load speed of only 800.
>>
>> Is the "Avg" 800 from the start of the phase, or is it that "the 
>> average drops to 800" during the phase?
> The Avg is 800 from the start of the phase and it stays at 800.
> 
>>> This average load speed is incompatible with the amount of data to 
>>> be loaded.
>>>
>>> Looking at the status of the job, it is possible that there is an 
>>> excessive demand on memory which slows down the process extremely.
>>>
>>> We saw with top that the Java process appears to claim a lot of memory:
>>> ```
>>> top
>>> #    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>> # 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
>>> ```
>>
>> xloader does not have much requirement for java heap memory.
> OK, since our email we have tried increasing -Xmx and we did not see 
> any increase in performance.
>> That space may be mapped files.
>>
>>> But with a free -g, we see that it actually uses very little memory.
>>> ```
>>> free -g
>>> #        total  used  free  shared  buff/cache  available
>>> # Mem:     125     3     0       0         121        120
>>> ```
>>>
>>> Are there any possibilities to speed up this step?  (Give a -xms to 
>>> java?)
>>> Can this significant drop in loading speed for this step be due to 
>>> memory usage? Do you know of any other limiting causes in this 
>>> loading stage?
>>>
>>> For previous insertions on smaller datasets, this Data step was not 
>>> limiting and the average speed was even slightly higher than the 
>>> Nodes and Terms steps.
>>
>> How small is "smaller"?
> For example, we have loaded the UniRef RDF database (same provider as 
> UniProt), with 12 billion triples, with an average for the Data task of 
> 230,000 tuples/s.
> 
>> That sounds like what I see when loading.
>>
>>>
>>> For information, the machine used has 32 CPUs and 128 Giga of Ram.
>>>
>>> Thanks for your help,
>>> Regards,
>>>
>>> Steven
>>>
>>>
>>>
> 
> 

Re: xloader on large dataset : Data Task with poor load average

Posted by Steven Blanchard <st...@microbiome.studio>.

On Mon., May 22 2023 at 16:18:21 +0100, Andy Seaborne <an...@apache.org> 
wrote:
Hello Andy,

> Hi Steven,
> 
> How are you running xloader? Default settings?
> 
Yes, we use the default settings.
The command line used is the following:
tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ --tmpdir 
/nfs/uniprot_tmp/ --threads 30 
/nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf

> What's the storage being used?
We use Block Storage from a cloud provider, with SSDs, on a mounted NFS 
volume.

> On 22/05/2023 10:49, Steven Blanchard wrote:
>> Hello,
>> 
>> I am currently trying to load a very large dataset ( 54 billion 
>> triples) with the tdb2.xloader command.
>> 
>> The first two steps (Nodes and Terms) are completed with an average 
>> load speed of ~ 120,000.
>> The third stage (Data) has an average load speed of only 800.
> 
> Is the "Avg" 800 from the start of the phase, or is it that "the 
> average drops to 800" during the phase?
The Avg is 800 from the start of the phase and it stays at 800.

>> This average load speed is incompatible with the amount of data to 
>> be loaded.
>> 
>> Looking at the status of the job, it is possible that there is an 
>> excessive demand on memory which slows down the process extremely.
>> 
>> We saw with top that the Java process appears to claim a lot of memory:
>> ```
>> top
>> #    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>> # 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
>> ```
> 
> xloader does not have much requirement for java heap memory.
OK, since our email we have tried increasing -Xmx and we did not see any 
increase in performance.
> That space may be mapped files.
> 
>> But with a free -g, we see that it actually uses very little memory.
>> ```
>> free -g
>> #        total  used  free  shared  buff/cache  available
>> # Mem:     125     3     0       0         121        120
>> ```
>> 
>> Are there any possibilities to speed up this step?  (Give a -xms to 
>> java?)
>> Can this significant drop in loading speed for this step be due to 
>> memory usage? Do you know of any other limiting causes in this 
>> loading stage?
>> 
>> For previous insertions on smaller datasets, this Data step was not 
>> limiting and the average speed was even slightly higher than the 
>> Nodes and Terms steps.
> 
> How small is "smaller"?
For example, we have loaded the UniRef RDF database (same provider as 
UniProt), with 12 billion triples, with an average for the Data task of 
230,000 tuples/s.
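
For scale (rough arithmetic on our side, not a figure reported by the loader):
```
echo $(( 12000000000 / 230000 / 3600 ))   # ~14 hours of Data phase for UniRef at that rate
```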

> That sounds like what I see when loading.
> 
>> 
>> For information, the machine used has 32 CPUs and 128 Giga of Ram.
>> 
>> Thanks for your help,
>> Regards,
>> 
>> Steven
>> 
>> 
>> 


Re: xloader on large dataset : Data Task with poor load average

Posted by Andy Seaborne <an...@apache.org>.
Hi Steven,

How are you running xloader? Default settings?

What's the storage being used?

On 22/05/2023 10:49, Steven Blanchard wrote:
> Hello,
> 
> I am currently trying to load a very large dataset ( 54 billion triples) 
> with the tdb2.xloader command.
> 
> The first two steps (Nodes and Terms) are completed with an average load 
> speed of ~ 120,000.
> The third stage (Data) has an average load speed of only 800.

Is the "Avg" 800 from the start of the phase, or is it that "the average 
drops to 800" during the phase?

> This 
> average load speed is incompatible with the amount of data to be loaded.
> 
> Looking at the status of the job, it is possible that there is an 
> excessive demand on memory which slows down the process extremely.
> 
> We saw with top that the Java process appears to claim a lot of memory:
> ```
> top
> #    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> # 867362 sblanch+  20   0  289,0g  90,2g  88,4g S   3,3   72,1   1102:32 java
> ```

xloader does not have much requirement for java heap memory.

That space may be mapped files.

> But with a free -g, we see that it actually uses very little memory.
> ```
> free -g
> #        total  used  free  shared  buff/cache  available
> # Mem:     125     3     0       0         121        120
> ```
> 
> Are there any possibilities to speed up this step?  (Give a -xms to java?)
> Can this significant drop in loading speed for this step be due to 
> memory usage? Do you know of any other limiting causes in this loading 
> stage?
> 
> For previous insertions on smaller datasets, this Data step was not 
> limiting and the average speed was even slightly higher than the Nodes 
> and Terms steps.

How small is "smaller"?

That sounds like what I see when loading.

> 
> For information, the machine used has 32 CPUs and 128 Giga of Ram.
> 
> Thanks for your help,
> Regards,
> 
> Steven
> 
> 
>