Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2022/11/13 18:26:37 UTC

Loader performance test

Trying out a specific machine:

1 billion triples : BSBM-1000 (1,000,253,325 triples)

tdb2.tdbloader --loc DB2 bsbm-1000m.nt.gz
Time: 3,218.82 seconds (53mins 39secs)
Rate: 310,751 triples/s
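(Cross-check: 1,000,253,325 / 3,218.82 ≈ 310,751, i.e. the reported rate is 
simply the triple count divided by the elapsed seconds.)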

The machine:

Dell 8950, Intel® Core™ i7-12700K Processor
   8 performance cores with hyper threading
   4 Efficient-cores
   Total : 16+4 threads

64G RAM DDR5, 2 memory channels
m2 SSD (1TB)

The database is 191GBytes

4 threads were running at 100% and they were spread across cores (other 
threads were doing I/O and general housekeeping).

The OS didn't apply any thermal controls - the active threads weren't 
being moved across cores, the CPU temperatures were only around 44C, and 
the processor fan speed wasn't elevated.

The machine was usable during the load.

----

On the same hardware tdb2.xloader achieved 87kTPS and a database of 
132Gbytes
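
For reference, an xloader run of this shape is invoked along these lines. 
A sketch only -- the database location and temp directory shown here are 
illustrative, not the ones actually used:

   tdb2.xloader --loc DB2-x --tmpdir /data/tmp bsbm-1000m.nt.gz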

Re: Loader performance test

Posted by LB <co...@googlemail.com.INVALID>.
Hi Andy,

in the meantime I ran the parallel loader on a different server with no 
ZFS RAID and an NVMe disk instead of SATA, and got much better results:

Server:

AMD 5950X, 128GB RAM, 2 x 3.84TB NVMe in RAID1

Time = 5,454.383 seconds : Quads = 958,530,116 : Rate = 175,736 /s

So it's ~90 minutes for 1 billion triples, which I would think is quite nice.
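(Check: 958,530,116 quads / 5,454.383 s ≈ 175,736 quads/s, and 5,454 seconds 
is about 91 minutes.)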


Regards,

Lorenz

On 18.11.22 22:24, Andy Seaborne wrote:
> How does this compare with your group's previous loader performance 
> investigations? Did any use PCIe/m2?
>
> On 18/11/2022 19:52, Simon Bin wrote:
>> Hi,
>>
>> we're trying to load our project internal data set
>>
>>
>> with currently 959,170,877 quads (still growing)
>>
>> on a
>>
>> 24-core AMD EPYC 7443P with 2.85-4.00GHz
>> 256GB RAM
>> and Samsung SSD 870 QVO 8TB SATA SSDs in a RAIDZ1
>>
>> tdb2.tdbloader --loader=parallel
>> 21,450.519 seconds
>>
>> especially noticeable towards the end, it stalls massively (Batch:
>> 1,169). Avg: 44,752
>
> 1/ Does the process have limits on the amount of memory mapped file
> area? If it's limited, the resident address space is small and mmap 
> files don't cache.
>
> 2/ I'm not familiar with RAIDZ1 but it seems to require 2 writes per 
> block to maintain parity.
>
> 3/ Try the other loaders 'phased' and 'sequential' to see if their 
> less I/O-intensive requirements and less overlapping use of the file 
> system cache do better than "parallel".
>
>> The produced tdb2 files are 297G
>>
>>
>> tdb2.xloader --threads 11
>> 25,295 seconds
>> Overall Rate     37,919 tuples per second
>>
>> the xloader is a bit slower (~+1 hour) but seems to put much less
>> strain on the system. Also the tdb2 is much more compact -- 173G
>
> It does more sequential I/O which is SATA friendly.
>
>     Andy
>
>>
>>
>> Curious if you have any advice to improve performance?
>
> Experiment!
>
>>
>> Cheers,
>>
>> On 2022/11/16 12:37:19 Andy Seaborne wrote:
>>>
>>>
>>> On 16/11/2022 07:54, LB wrote:
>>>> Andy got a new computer? Nice.
>>>>
>>>> I'm wondering if higher bandwidth of DDR5 already has an impact.
>>>>
>>>> Performance with xloader was ~ 4x lower than tdbloader? Any ideas
>> why?
>>>
>>> xloader does more work (sorting is a separate step) with less
>> resources.
>>>
>>> tdb2.tdbloader --loader=parallel is slower if the I/O bandwidth isn't
>>> there and also performs parallel random I/O operations, hence it is
>>> bad on HDD (and to some extent on SATA SSDs).
>>>
>>> xloader is disk friendly and uses (roughly speaking) only a single
>> write
>>> channel.
>>>
>>>        Andy
>>>
>>>> Can you try a real world dataset like Wikidata truthy as well?
>>>>
>>>> I could also give it another try if we agree on timestamp of the
>> dump as
>>>> well as the Jena version for better comparison. Collecting those
>> runs on
>>>> the Jena site would be good material for interested people.
>>>>
>>>> On 13.11.22 19:26, Andy Seaborne wrote:
>>>>> Trying out a specific machine:
>>>>>
>>>>> 1 billion triples : BSBM-1000 (1,000,253,325 triples)
>>>>>
>>>>> tdb2.tdbloader --loc DB2 bsbm-1000m.nt.gz
>>>>> Time: 3,218.82 seconds (53mins 39secs)
>>>>> Rate: 310,751 triples/s
>>>>>
>>>>> The machine:
>>>>>
>>>>> Dell 8950, Intel® Core™ i7-12700K Processor
>>>>>    8 performance cores with hyper threading
>>>>>    4 Efficient-cores
>>>>>    Total : 16+4 threads
>>>>>
>>>>> 64G RAM DDR5, 2 memory channels
>>>>> m2 SSD (1TB)
>>>>>
>>>>> The database is 191GBytes
>>>>>
>>>>> 4 threads were running at 100% and they were spread across cores
>>>>> (other threads were doing I/O and general housekeeping).
>>>>>
>>>>> The OS didn't apply any thermal controls - the active threads weren't
>>>>> being moved across cores, the CPU temperatures were only around 44C,
>>>>> and the processor fan speed wasn't elevated.
>>>>>
>>>>> The machine was usable during the load.
>>>>>
>>>>> ----
>>>>>
>>>>> On the same hardware tdb2.xloader achieved 87kTPS and a database
>> of
>>>>> 132Gbytes
>>>
>>

Re: Re: Loader performance test

Posted by Simon Bin <sb...@informatik.uni-leipzig.de>.
On Mon, 2022-11-21 at 15:57 +0000, Andy Seaborne wrote:
> ulimit -something

I guess it's

ulimit -m: resident set size (kbytes)      unlimited

thanks (it was unlimited already)

Re: Loader performance test

Posted by Andy Seaborne <an...@apache.org>.
ulimit -something

/proc/sys/vm/max_map_count is the number of memory-mapped areas a process 
can have, not a control on the resident size.
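
A quick way to look at both kinds of limit (a sketch; ulimit option letters 
can differ slightly between shells):

     ulimit -a                          # all per-process limits in this shell
     ulimit -v                          # virtual address space limit (kbytes)
     cat /proc/sys/vm/max_map_count     # max number of mmap regions per process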

     Andy

On 21/11/2022 10:44, Simon Bin wrote:
> On Fri, 2022-11-18 at 21:24 +0000, Andy Seaborne wrote:
>> 1/ Does the process have limits on the amount of memory mapped file
>> area? If it's limited, the resident address space is small and mmap
>> files
>> don't cache.
> 
> Hi, thanks for this pointer I have to admit I don't know how to check
> it, do you have any details for me? /proc/sys/vm/max_map_count is set
> to 65530 which I think should be high enough.

Re: Re: Loader performance test

Posted by Simon Bin <sb...@informatik.uni-leipzig.de>.
On Fri, 2022-11-18 at 21:24 +0000, Andy Seaborne wrote:
> 1/ Does the process have limits on the amount of memory mapped file
> area? If it's limited, the resident address space is small and mmap
> files 
> don't cache.

Hi, thanks for this pointer I have to admit I don't know how to check
it, do you have any details for me? /proc/sys/vm/max_map_count is set
to 65530 which I think should be high enough.

Re: Loader performance test

Posted by Andy Seaborne <an...@apache.org>.
How does this compare with your group's previous loader performance 
investigations? Did any use PCIe/m2?

On 18/11/2022 19:52, Simon Bin wrote:
> Hi,
> 
> we're trying to load our project internal data set
> 
> 
> with currently 959,170,877 quads (still growing)
> 
> on a
> 
> 24-core AMD EPYC 7443P with 2.85-4.00GHz
> 256GB RAM
> and Samsung SSD 870 QVO 8TB SATA SSDs in a RAIDZ1
> 
> tdb2.tdbloader --loader=parallel
> 21,450.519 seconds
> 
> especially noticeable towards the end, it stalls massively (Batch:
> 1,169). Avg: 44,752

1/ Does the process have limits on the amount of memory mapped file
area? If it's limited, the resident address space is small and mmap files 
don't cache.

2/ I'm not familiar with RAIDZ1 but it seems to require 2 writes per 
block to maintain parity.

3/ Try the other loaders 'phased' and 'sequential' to see if their less 
I/O-intensive requirements and less overlapping use of the file system 
cache do better than "parallel" (see the sketch below).
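
On the command line that is just a change of --loader; a sketch (the input 
file name is a placeholder -- check tdb2.tdbloader --help for the accepted 
loader names):

    tdb2.tdbloader --loader=phased     --loc DB data.nq.gz
    tdb2.tdbloader --loader=sequential --loc DB data.nq.gz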

> The produced tdb2 files are 297G
> 
> 
> tdb2.xloader --threads 11
> 25,295 seconds
> Overall Rate     37,919 tuples per second
> 
> the xloader is a bit slower (~+1 hour) but seems to put much less
> strain on the system. Also the tdb2 is much more compact -- 173G

It does more sequential I/O which is SATA friendly.

     Andy

> 
> 
> Curious if you have any advice to improve performance?

Experiment!

> 
> Cheers,
> 
> On 2022/11/16 12:37:19 Andy Seaborne wrote:
>>
>>
>> On 16/11/2022 07:54, LB wrote:
>>> Andy got a new computer? Nice.
>>>
>>> I'm wondering if higher bandwidth of DDR5 already has an impact.
>>>
>>> Performance with xloader was ~ 4x lower than tdbloader? Any ideas
> why?
>>
>> xloader does more work (sorting is a separate step) with less
> resources.
>>
>> tdb2.tdbloader --loader=parallel is slower if the I/O bandwidth isn't
>> there and also performs parallel random I/O operations, hence it is
>> bad on HDD (and to some extent on SATA SSDs).
>>
>> xloader is disk friendly and uses (roughly speaking) only a single
> write
>> channel.
>>
>>        Andy
>>
>>> Can you try a real world dataset like Wikidata truthy as well?
>>>
>>> I could also give it another try if we agree on timestamp of the
> dump as
>>> well as the Jena version for better comparison. Collecting those
> runs on
>>> the Jena site would be good material for interested people.
>>>
>>> On 13.11.22 19:26, Andy Seaborne wrote:
>>>> Trying out a specific machine:
>>>>
>>>> 1 billion triples : BSBM-1000 (1,000,253,325 triples)
>>>>
>>>> tdb2.tdbloader --loc DB2 bsbm-1000m.nt.gz
>>>> Time: 3,218.82 seconds (53mins 39secs)
>>>> Rate: 310,751 triples/s
>>>>
>>>> The machine:
>>>>
>>>> Dell 8950, Intel® Core™ i7-12700K Processor
>>>>    8 performance cores with hyper threading
>>>>    4 Efficient-cores
>>>>    Total : 16+4 threads
>>>>
>>>> 64G RAM DDR5, 2 memory channels
>>>> m2 SSD (1TB)
>>>>
>>>> The database is 191GBytes
>>>>
>>>> 4 threads were running at 100% and they were spread across cores
>>>> (other threads were doing I/O and general housekeeping).
>>>>
>>>> The OS didn't apply any thermal controls - the active threads weren't
>>>> being moved across cores, the CPU temperatures were only around 44C,
>>>> and the processor fan speed wasn't elevated.
>>>>
>>>> The machine was usable during the load.
>>>>
>>>> ----
>>>>
>>>> On the same hardware tdb2.xloader achieved 87kTPS and a database
> of
>>>> 132Gbytes
>>
> 

RE: Re: Loader performance test

Posted by Simon Bin <sb...@informatik.uni-leipzig.de>.
Hi,

we're trying to load our project internal data set


with currently 959,170,877 quads (still growing)

on a 

24-core AMD EPYC 7443P with 2.85-4.00GHz
256GB RAM
and Samsung SSD 870 QVO 8TB SATA SSDs in a RAIDZ1

tdb2.tdbloader --loader=parallel 
21,450.519 seconds

especially noticeable towards the end, it stalls massively (Batch:
1,169). Avg: 44,752
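(For scale: 959,170,877 quads / 21,450.5 s ≈ 44,700 quads/s end-to-end, so 
the reported Avg figure matches the overall rate.)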

The produced tdb2 files are 297G


tdb2.xloader --threads 11
25,295 seconds
Overall Rate     37,919 tuples per second

the xloader is a bit slower (~+1 hour) but seems to put much less
strain on the system. Also the tdb2 is much more compact -- 173G


Curious if you have any advice to improve performance?

Cheers,

On 2022/11/16 12:37:19 Andy Seaborne wrote:
> 
> 
> On 16/11/2022 07:54, LB wrote:
> > Andy got a new computer? Nice.
> > 
> > I'm wondering if higher bandwidth of DDR5 already has an impact.
> > 
> > Performance with xloader was ~ 4x lower than tdbloader? Any ideas
why?
> 
> xloader does more work (sorting is a separate step) with less
resources.
> 
> tdb2.tdbloader --loader=parallel is slower if the I/O bandwidth isn't
> there and also performs parallel random I/O operations, hence it is
> bad on HDD (and to some extent on SATA SSDs).
> 
> xloader is disk friendly and uses (roughly speaking) only a single
write 
> channel.
> 
>      Andy
> 
> > Can you try a real world dataset like Wikidata truthy as well?
> > 
> > I could also give it another try if we agree on timestamp of the
dump as 
> > well as the Jena version for better comparison. Collecting those
runs on 
> > the Jena site would be good material for interested people.
> > 
> > On 13.11.22 19:26, Andy Seaborne wrote:
> >> Trying out a specific machine:
> >>
> >> 1 billion triples : BSBM-1000 (1,000,253,325 triples)
> >>
> >> tdb2.tdbloader --loc DB2 bsbm-1000m.nt.gz
> >> Time: 3,218.82 seconds (53mins 39secs)
> >> Rate: 310,751 triples/s
> >>
> >> The machine:
> >>
> >> Dell 8950, Intel® Core™ i7-12700K Processor
> >>   8 performance cores with hyper threading
> >>   4 Efficient-cores
> >>   Total : 16+4 threads
> >>
> >> 64G RAM DDR5, 2 memory channels
> >> m2 SSD (1TB)
> >>
> >> The database is 191GBytes
> >>
> >> 4 threads were running at 100% and they were spread across cores 
> >> (other threads were doing I/O and general housekeeping).
> >>
> >> The OS didn't apply any thermal controls - the active threads weren't
> >> being moved across cores, the CPU temperatures were only around 44C,
> >> and the processor fan speed wasn't elevated.
> >>
> >> The machine was usable during the load.
> >>
> >> ----
> >>
> >> On the same hardware tdb2.xloader achieved 87kTPS and a database
of 
> >> 132Gbytes
> 


Re: Loader performance test

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/2022 07:54, LB wrote:
> Andy got a new computer? Nice.
> 
> I'm wondering if higher bandwidth of DDR5 already has an impact.
> 
> Performance with xloader was ~ 4x lower than tdbloader? Any ideas why?

xloader does more work (sorting is a separate step) with less resources.

tdb2.tdbloader --loader=parallel is slower if the I/O bandwidth isn't 
there and also performs parallel random I/O operations, hence it is bad 
on HDD (and to some extent on SATA SSDs).

xloader is disk friendly and uses (roughly speaking) only a single write 
channel.

     Andy

> Can you try a real world dataset like Wikidata truthy as well?
> 
> I could also give it another try if we agree on timestamp of the dump as 
> well as the Jena version for better comparison. Collecting those runs on 
> the Jena site would be good material for interested people.
> 
> On 13.11.22 19:26, Andy Seaborne wrote:
>> Trying out a specific machine:
>>
>> 1 billion triples : BSBM-1000 (1,000,253,325 triples)
>>
>> tdb2.tdbloader --loc DB2 bsbm-1000m.nt.gz
>> Time: 3,218.82 seconds (53mins 39secs)
>> Rate: 310,751 triples/s
>>
>> The machine:
>>
>> Dell 8950, Intel® Core™ i7-12700K Processor
>>   8 performance cores with hyper threading
>>   4 Efficient-cores
>>   Total : 16+4 threads
>>
>> 64G RAM DDR5, 2 memory channels
>> m2 SSD (1TB)
>>
>> The database is 191GBytes
>>
>> 4 threads were running at 100% and they were spread across cores 
>> (other threads were doing I/O and general housekeeping).
>>
>> The OS didn't apply any thermal controls - the active threads weren't 
>> being moved across cores, the CPU temperatures were only around 44C, 
>> and the processor fan speed wasn't elevated.
>>
>> The machine was usable during the load.
>>
>> ----
>>
>> On the same hardware tdb2.xloader achieved 87kTPS and a database of 
>> 132Gbytes

Re: Loader performance test

Posted by LB <co...@googlemail.com.INVALID>.
Andy got a new computer? Nice.

I'm wondering if higher bandwidth of DDR5 already has an impact.

Performance with xloader was ~ 4x lower than tdbloader? Any ideas why?

Can you try a real world dataset like Wikidata truthy as well?

I could also give it another try if we agree on timestamp of the dump as 
well as the Jena version for better comparison. Collecting those runs on 
the Jena site would be good material for interested people.

On 13.11.22 19:26, Andy Seaborne wrote:
> Trying out a specific machine:
>
> 1 billion triples : BSBM-1000 (1,000,253,325 triples)
>
> tdb2.tdbloader --loc DB2 bsbm-1000m.nt.gz
> Time: 3,218.82 seconds (53mins 39secs)
> Rate: 310,751 triples/s
>
> The machine:
>
> Dell 8950, Intel® Core™ i7-12700K Processor
>   8 performance cores with hyper threading
>   4 Efficient-cores
>   Total : 16+4 threads
>
> 64G RAM DDR5, 2 memory channels
> m2 SSD (1TB)
>
> The database is 191GBytes
>
> 4 threads were running at 100% and they were spread across cores 
> (other threads were doing I/O and general housekeeping).
>
> The OS didn't apply any thermal controls - the active threads weren't 
> being moved across cores, the CPU temperatures were only around 44C, 
> and the processor fan speed wasn't elevated.
>
> The machine was usable during the load.
>
> ----
>
> On the same hardware tdb2.xloader achieved 87kTPS and a database of 
> 132Gbytes