Posted to users@jena.apache.org by Marco Neumann <ma...@gmail.com> on 2020/06/12 09:22:36 UTC

Re: JENA Loader Benchmarks

Just reporting a performance regression of 20%+ on Ubuntu 20.04 LTS with
JDK 13.02, compared to Ubuntu 19.04 with JDK 12.0.1. The performance
difference between Jena 3.13 and 3.15, on the other hand, is marginal.

So far everything indicates that the regression on my end is related to the
Ubuntu 20.04 distro and JDK 13.02.
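
For anyone who wants to put a number on a difference like this, here is a
minimal sketch. It is not part of Jena; the function names are my own and
purely illustrative. It assumes the final summary line of a tdb2.tdbloader
log has the "Time = ... : Triples = ... : Rate = ... /s" shape seen in the
loader output quoted further down this thread, and it computes the
throughput drop between a baseline run and a newer run.

```python
import re

# Assumed shape of the loader's final summary line, e.g.
# "INFO  Time = 35.572 seconds : Triples = 5,000,599 : Rate = 140,577 /s"
SUMMARY = re.compile(
    r"Time = ([\d.,]+) seconds : Triples = ([\d,]+) : Rate = ([\d,]+) /s"
)


def parse_summary(line):
    """Return (seconds, triples, rate) from a summary line, or None."""
    m = SUMMARY.search(line)
    if m is None:
        return None
    seconds = float(m.group(1).replace(",", ""))
    triples = int(m.group(2).replace(",", ""))
    rate = int(m.group(3).replace(",", ""))
    return seconds, triples, rate


def slowdown_percent(baseline_rate, new_rate):
    """Percentage drop in throughput relative to the baseline run."""
    return (baseline_rate - new_rate) / baseline_rate * 100.0
```

Feed parse_summary() the last "Time = ..." line from each run's log and pass
the two rates to slowdown_percent(); a positive result means the newer run
is slower than the baseline.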



On Mon, Jun 24, 2019 at 12:05 AM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 23/06/2019 10:29, Marco Neumann wrote:
> > Yes, I'd say the local NVMe SSDs make the difference here. In my case, for
> > zones US East and US East 2, the VMs only show a premium SSD option. The
> > so-called ultra SSDs seem to be in high demand and are currently not
> > available in my profile. They also come at a very high estimated price point.
> >
> > which dataset do you use to run the load test above?
>
> It was a synthetic one, made to mimic some work-related data.
>
>      Andy
>
> >
> >
> >
> > On Sat, Jun 22, 2019 at 11:47 PM Andy Seaborne <an...@apache.org> wrote:
> >
> >>
> >>
> >> On 20/06/2019 16:01, Marco Neumann wrote:
> >>> Quick update here on loader performance. Did a modest (in terms of cost)
> >>> hardware upgrade of one of the dedicated data processors with a faster
> >>> CPU and a faster NVMe SSD drive, and was able to almost halve our load
> >>> times. Very satisfied with the HW upgrade and TDB2 loader performance.
> >>> VMs don't seem to work well for us in combination with TDB.
> >>
> >> My experience has been significant variation across different VM types.
> >> My assumption is the form of virtualization matters.
> >>
> >> I had access to an AWS i3.8xlarge for a short while which had local NVMe
> >> SSDs and got very good performance:
> >>
> >> Triples         Store   Seconds Elapsed         Rate
> >> 500m            TDB2    2,362s  39m 22s         218,460 TPS
> >> 1 billion       TDB2    5,164s  1h 26m 04s      200,100 TPS
> >>
> >> (this is a single graph dataset)
> >>
> >> i3 are "Storage optimized"
> >>
> >> The TDB2 loader is multithreaded and each thread works on a different
> >> index, so the access patterns jump around all over the place, both
> >> because a non-primary index is, in effect, randomly accessed at scale
> >> and because multiple indexes are being updated at the same time.
> >>
> >>       Andy
> >>
> >>>
> >>> On Fri, Jun 14, 2019 at 11:56 PM Marco Neumann <
> marco.neumann@gmail.com>
> >>> wrote:
> >>>
> >>>> Absolutely it does, preferably NVMe SSD. The tdbloaders are almost a
> >>>> showcase in themselves for good, up-to-date hardware.
> >>>>
> >>>> If possible I'd like to load the Wikidata dataset* at some point to see
> >>>> where 57GB fits in terms of TDB. The Wikidata team is currently looking
> >>>> at new solutions that can go beyond Blazegraph, and I get the impression
> >>>> that they have not yet actively considered giving Jena TDB a try.
> >>>>
> >>>> https://dumps.wikimedia.org/wikidatawiki/entities/
> >>>>
> >>>>
> >>>> On Fri, Jun 14, 2019 at 11:47 PM Martynas Jusevičius <
> >>>> martynas@atomgraph.com> wrote:
> >>>>
> >>>>> What about SSD disks, don't they make a difference?
> >>>>>
> >>>>> On Sat, Jun 15, 2019 at 12:36 AM Marco Neumann <
> >> marco.neumann@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> That did the trick, Andy, very good. It might be a good idea to add
> >>>>>> this to the distribution in jena-log4j.properties.
> >>>>>>
> >>>>>> I am getting these numbers on a midsize dedicated server. Very nice
> >>>>>> numbers indeed, Andy. Well done!
> >>>>>>
> >>>>>> 00:24:53 INFO  loader               :: Loader = LoaderPhased
> >>>>>> 00:24:53 INFO  loader               :: Start: ../../public_html/lotico.ttl.gz
> >>>>>> 00:24:55 INFO  loader               :: Add: 500,000 lotico.ttl.gz (Batch: 237,755 / Avg: 237,755)
> >>>>>> 00:24:56 INFO  loader               :: Add: 1,000,000 lotico.ttl.gz (Batch: 305,250 / Avg: 267,308)
> >>>>>> 00:24:58 INFO  loader               :: Add: 1,500,000 lotico.ttl.gz (Batch: 313,087 / Avg: 281,004)
> >>>>>> 00:25:00 INFO  loader               :: Add: 2,000,000 lotico.ttl.gz (Batch: 328,299 / Avg: 291,502)
> >>>>>> 00:25:01 INFO  loader               :: Add: 2,500,000 lotico.ttl.gz (Batch: 341,763 / Avg: 300,336)
> >>>>>> 00:25:03 INFO  loader               :: Add: 3,000,000 lotico.ttl.gz (Batch: 337,381 / Avg: 305,935)
> >>>>>> 00:25:04 INFO  loader               :: Add: 3,500,000 lotico.ttl.gz (Batch: 318,877 / Avg: 307,719)
> >>>>>> 00:25:06 INFO  loader               :: Add: 4,000,000 lotico.ttl.gz (Batch: 295,857 / Avg: 306,184)
> >>>>>> 00:25:07 INFO  loader               :: Add: 4,500,000 lotico.ttl.gz (Batch: 327,225 / Avg: 308,388)
> >>>>>> 00:25:09 INFO  loader               :: Add: 5,000,000 lotico.ttl.gz (Batch: 349,406 / Avg: 312,051)
> >>>>>> 00:25:09 INFO  loader               ::   Elapsed: 16.02 seconds [2019/06/15 00:25:09 CEST]
> >>>>>> 00:25:11 INFO  loader               :: Add: 5,500,000 lotico.ttl.gz (Batch: 285,062 / Avg: 309,388)
> >>>>>> 00:25:13 INFO  loader               :: Add: 6,000,000 lotico.ttl.gz (Batch: 203,665 / Avg: 296,559)
> >>>>>> 00:25:16 INFO  loader               :: Add: 6,500,000 lotico.ttl.gz (Batch: 189,393 / Avg: 284,190)
> >>>>>>
> >>>>>> On another machine that sits somewhere in the Azure infrastructure,
> >>>>>> tdbloader doesn't look as good. Even with decent hardware it seems to
> >>>>>> die a slow death of memory exhaustion at 16GB. It started off at
> >>>>>> 70kT/s and is now down to 17kT/s and still going.
> >>>>>>
> >>>>>> Lesson learned: big iron and big memory is the way to go with Jena
> >>>>>> tdbloaders.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jun 14, 2019 at 10:53 PM Andy Seaborne <an...@apache.org>
> >> wrote:
> >>>>>>
> >>>>>>> These messages are logged (to logger "org.apache.jena.tdb2.loader").
> >>>>>>> Do you have log4j.properties in the current working directory?
> >>>>>>>
> >>>>>>> Do you get any output?
> >>>>>>>
> >>>>>>> INFO  Loader = LoaderParallel
> >>>>>>> INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>>>> INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
> >>>>>>> INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
> >>>>>>> INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
> >>>>>>> INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 204,248 / Avg: 178,189)
> >>>>>>> INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 202,101 / Avg: 182,508)
> >>>>>>> INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
> >>>>>>> INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
> >>>>>>> INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
> >>>>>>> INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 152,765 / Avg: 177,081)
> >>>>>>> INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
> >>>>>>> INFO    Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]
> >>>>>>> INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.63s (Avg: 174,644)
> >>>>>>> INFO  Finish - index SPO
> >>>>>>> INFO  Finish - index POS
> >>>>>>> INFO  Finish - index OSP
> >>>>>>> INFO  Time = 35.572 seconds : Triples = 5,000,599 : Rate = 140,577 /s
> >>>>>>>
> >>>>>>>
> >>>>>>> There is a pause after the first "Finished:". That marks the end of
> >>>>>>> data input; the index threads are still running, and the pause comes
> >>>>>>> from the flush to disk.
> >>>>>>>
> >>>>>>>        Andy
> >>>>>>>
> >>>>>>> On 14/06/2019 20:16, Marco Neumann wrote:
> >>>>>>>> Let me fire up one of the big machines to see what I get there.
> >>>>>>>> Currently I get no info displayed during a load with tdb2.tdbloader.
> >>>>>>>> If -v is specified I get some extra info, but no load info.
> >>>>>>>>
> >>>>>>>> On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <an...@apache.org>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 14/06/2019 18:13, Marco Neumann wrote:
> >>>>>>>>>> I am collecting Jena loader benchmarks. If you have results,
> >>>>>>>>>> please post them directly.
> >>>>>>>>>>
> >>>>>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >>>>>>>>>
> >>>>>>>>> tdb2.tdbloader has variations controlled by --loader.
> >>>>>>>>>
> >>>>>>>>> --loader=
> >>>>>>>>> Loader to use: 'basic', 'phased' (default), 'sequential',
> >>>>>>>>> 'parallel' or 'light'
> >>>>>>>>>
> >>>>>>>>> "basic" is a super naive parser-add triple loop - it is used if a
> >>>>>>>>> loader can't cope with an already loaded database.
> >>>>>>>>>
> >>>>>>>>> "phased" is a balanced loader that does not saturate the machine.
> >>>>>>>>> Some parallelism.
> >>>>>>>>>
> >>>>>>>>> "sequential" is the tdbloader algorithm for TDB2, more for
> >>>>>>>>> reference.
> >>>>>>>>>
> >>>>>>>>> "parallel" is as much parallelism as it wants. (5 for triples,
> >>>>>>>>> more for quads)
> >>>>>>>>>
> >>>>>>>>> "light" is two threaded. Slightly lighter than "phased".
> >>>>>>>>>
> >>>>>>>>> See LoaderPlans.
> >>>>>>>>>
> >>>>>>>>>> On a Linux machine I am using "time" to collect data.
> >>>>>>>>>>
> >>>>>>>>>> Is there a flag on tdb2.tdbloader to report time and triples per
> >>>>>>>>>> second?
> >>>>>>>>>>
> >>>>>>>>>> I have noticed that storage space use for tdbloader2 is
> >>>>>>>>>> significantly smaller on disk compared to tdbloader and
> >>>>>>>>>> tdb2.tdbloader. Is there a straightforward explanation here?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>>
> >>>>>> ---
> >>>>>> Marco Neumann
> >>>>>> KONA
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>>
> >>>> ---
> >>>> Marco Neumann
> >>>> KONA
> >>>>
> >>>>
> >>>
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA