Posted to users@jena.apache.org by Steven Blanchard <st...@microbiome.studio> on 2023/03/13 15:08:58 UTC

Significant decrease in indexing speed.

Hello,

I am currently loading data in ttl format with the command line
tdb2.tdbloader (option --loader=phased). This data consists of 1.25
billion tuples. The first step of loading has finished and the indexing
has begun. Since then, the indexing speed has been steadily decreasing,
reaching an average speed of 651 when only 474 million triples
have been indexed. At this speed, the indexing will take several months.

Do you have any tips or solutions to speed up the indexing? Could
this be due to the input files?

Thanks,

Steven
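
To put the figures above in perspective, here is a rough back-of-the-envelope estimate. It assumes the reported "651" is an average in triples per second (the message does not state the unit) and that the rate stays constant from here on. Since the rate was still falling and more index work may remain afterwards, the result, roughly two weeks, is best read as a lower bound rather than a prediction.

```python
def remaining_days(total_triples: int, done_triples: int, rate_per_second: float) -> float:
    """Days needed to index the remaining triples at a constant rate."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    remaining = max(total_triples - done_triples, 0)
    return remaining / rate_per_second / 86_400  # 86,400 seconds per day


# Figures taken from the message above.
# ASSUMPTION: "651" means triples per second; the message gives no unit.
days = remaining_days(1_250_000_000, 474_000_000, 651)
print(f"at least {days:.1f} days at a constant rate")
```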



Re: Significant decrease in indexing speed.

Posted by Andy Seaborne <an...@apache.org>.

On 15/03/2023 18:02, Steven Blanchard wrote:
> Hello,
> 
> I tried xloader and the upload was completed correctly in about 14
> hours. It's perfect! Thanks for your help.
> 
> How can we know when xloader is more efficient than tdbloader?

Try it. It isn't easy to give a general rule.

Factors include:

* disk vs SSD

On SSD, the data needs to be quite large ("billion minimum") before
xloader pays off. For disk, the threshold is lower.

Disk throughput matters.
Local NVMe SSD is better than SATA and remote SSD.

* Free RAM available. This is not Java heap space; it is the RAM the OS
uses as file system cache, which tdb2.tdbloader (including the parallel
loader) relies on.

A machine with a lot of RAM and a local SSD can easily be faster with
loader=parallel, up to a billion triples or more.

You can load on one server and then move the database to another
machine. Ensure no process is using the database and do a disk-level copy.

     Andy
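
As a rough way to check the "free RAM, not heap" factor on a Linux machine, the sketch below reads the MemAvailable field from /proc/meminfo. It is only an approximation of how much RAM the OS could devote to the file system cache, and the helper names (parse_meminfo, cache_headroom_gib) are made up for this illustration; nothing here is part of Jena.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])  # memory fields are in kB
    return values


def cache_headroom_gib(meminfo_text: str) -> float:
    """Approximate RAM (GiB) the OS could use for file system cache.

    ASSUMPTION: MemAvailable is used as a proxy; it is an estimate made
    by the kernel, not an exact measure of cache capacity.
    """
    info = parse_meminfo(meminfo_text)
    return info.get("MemAvailable", 0) / (1024 * 1024)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(f"{cache_headroom_gib(f.read()):.1f} GiB available")
    except OSError:
        print("/proc/meminfo not found (not a Linux system?)")
```

Comparing this figure with the expected on-disk size of the indexes gives a feel for whether the cache can keep up, but as said above, the reliable answer is to try it.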


Re: Significant decrease in indexing speed.

Posted by Steven Blanchard <st...@microbiome.studio>.
Hello,

I tried xloader and the upload was completed correctly in about 14
hours. It's perfect! Thanks for your help.

How can we know when xloader is more efficient than tdbloader?
Is there a ratio of RAM to number of triples to aim for?

Regards,

Steven



Re: Significant decrease in indexing speed.

Posted by Andy Seaborne <an...@apache.org>.

On 14/03/2023 08:56, Steven Blanchard wrote:
> Hello,
> 
> Yes, I am doing a one-time operation on a nonexistent folder. Is it
> faster to do this than to split the input file and do multiple uploads?
> 
> Yes, the loading is to a disk.
> 
> Why does the performance of indexing decrease over time?

As the indexes grow in size, the efficiency of the OS file system cache 
drops.

In loader=phased, one index is created as the data is loaded, then the
later phases create the other indexes by copying the first index to the
others, which is in effect a sort. When the OS file system cache becomes
less effective, there is more disk I/O.

xloader is designed to avoid (or at least reduce) this. It uses an
external sort program (Linux sort(1), which sorts in batches) to put the
information in order so that it can be loaded sequentially.

     Andy
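
For readers unfamiliar with batched external sorting, here is a minimal Python sketch of the general technique: sort batches that fit in memory, write each one out as a sorted run, then merge the runs sequentially. It is an illustration only, not xloader's implementation and not how sort(1) is written; the function names, the batch size, and the sample keys are all made up for the example.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator


def _write_run(sorted_lines: list[str], tmp_dir: str) -> str:
    """Write one sorted batch to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(line + "\n" for line in sorted_lines)
    return path


def _read_run(path: str) -> Iterator[str]:
    """Stream a sorted run back from disk, one line at a time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def external_sort(lines: Iterable[str], batch_size: int, tmp_dir: str) -> Iterator[str]:
    """Sort lines using bounded memory: sorted runs on disk, then a k-way merge.

    ASSUMPTION: lines contain no embedded newlines.
    """
    run_paths = []
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            run_paths.append(_write_run(sorted(batch), tmp_dir))
            batch = []
    if batch:
        run_paths.append(_write_run(sorted(batch), tmp_dir))
    # Each run is read sequentially; only one line per run is held in memory.
    yield from heapq.merge(*(_read_run(p) for p in run_paths))


# Hypothetical index keys, small enough to check by eye.
keys = ["s3 p1 o2", "s1 p2 o1", "s2 p1 o3", "s1 p1 o9", "s3 p0 o0"]
with tempfile.TemporaryDirectory() as tmp:
    result = list(external_sort(keys, batch_size=2, tmp_dir=tmp))
print(result)
```

The point of the design is that both the runs and the final merged output are written and read sequentially, so the work does not depend on the OS file system cache holding a large, randomly accessed index.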


Re: Significant decrease in indexing speed.

Posted by Steven Blanchard <st...@microbiome.studio>.
Hello,

Yes, I am doing a one-time operation on a nonexistent folder. Is it
faster to do this than to split the input file and do multiple uploads?

Yes, the loading is to a disk.

Why does the performance of indexing decrease over time?

Thanks,

Steven



Re: Significant decrease in indexing speed.

Posted by Andy Seaborne <an...@apache.org>.

On 13/03/2023 18:35, Simon Bin wrote:
> If you are doing a one-time load into a nonexistent (new) TDB
> database, try xloader.

xloader is good when loading to a disk (HDD), rather than to an SSD.

Steven - is that what you are using?

     Andy


Re: Significant decrease in indexing speed.

Posted by Simon Bin <sb...@informatik.uni-leipzig.de>.
If you are doing a one-time load into a nonexistent (new) TDB database, try xloader.
