Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2021/11/20 14:21:22 UTC

Wikidata evolution

Wikidata are looking for a replacement for BlazeGraph

About WDQS, current scale and current challenges
   https://youtu.be/wn2BrQomvFU?t=9148

And in the process of appointing a graph consultant: (5 month contract):
https://boards.greenhouse.io/wikimedia/jobs/3546920

and Apache Jena came up:
https://phabricator.wikimedia.org/T206560#7517212

Realistically?

Full wikidata is 16B triples. Very hard to load - xloader may help 
though the goal for that was to make loading the truthy subset (5B) 
easier. 5B -> 16B is not a trivial step.
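
(For context, a minimal sketch of loading through the Java API - hypothetical
paths and file name; this in-transaction route is only workable for small data,
which is why the bulk loaders (tdbloader/xloader) are the real topic here:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class LoadSketch {
    public static void main(String[] args) {
        Dataset ds = TDB2Factory.connectDataset("/data/tdb2-wikidata"); // hypothetical path
        ds.begin(ReadWrite.WRITE);
        try {
            // Parse and add the dump inside one write transaction.
            RDFDataMgr.read(ds, "wikidata-truthy.nt.gz");               // hypothetical file
            ds.commit();
        } finally {
            ds.end();
        }
    }
}
)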

Even if wikidata loads, it would be impractically slow as TDB is today.
(yes, that's fixable; not practical in their timescales.)

The current discussions feel more like they are looking for a "product" 
- a triplestore that they can use - rather than a collaboration.

     Andy

Re: Wikidata evolution

Posted by Martynas Jusevičius <ma...@atomgraph.com>.
I have followed this for a bit but hadn’t suggested Jena for the reasons
you mentioned.

AFAIK they are looking for a triplestore that is
- open-source
- actively maintained
- scales to their data volume

To my knowledge, there isn’t any that satisfies all 3 requirements :/

On Sat, 20 Nov 2021 at 15.21, Andy Seaborne <an...@apache.org> wrote:

> Wikidata are looking for a replace for BlazeGraph
>
> About WDQS, current scale and current challenges
>    https://youtu.be/wn2BrQomvFU?t=9148
>
> And in the process of appointing a graph consultant: (5 month contract):
> https://boards.greenhouse.io/wikimedia/jobs/3546920
>
> and Apache Jena came up:
> https://phabricator.wikimedia.org/T206560#7517212
>
> Realistically?
>
> Full wikidata is 16B triples. Very hard to load - xloader may help
> though the goal for that was to make loading the truthy subset (5B)
> easier. 5B -> 16B is not a trivial step.
>
> Even if wikidata loads, it would be impractically slow as TDB is today.
> (yes, that's fixable; not practical in their timescales.)
>
> The current discussions feel more like they are looking for a "product"
> - a triplestore that they are use - rather than a collaboration.
>
>      Andy
>

Re: Wikidata evolution

Posted by Marco Neumann <ma...@gmail.com>.
I see, this all makes sense Andy, but I think I would like an option for
turning the replication / versioning off, or at least directing that data
into another location.

I believe this is only the case for TDB/TDB2-backed datasets?

On Tue, Nov 23, 2021 at 10:45 AM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 23/11/2021 09:40, Rob Vesse wrote:
> > Marco
> >
> > So there's a couple of things going on.
> >
> > Firstly the Node Table, the mapping of RDF Terms to the internal Node
> IDs used in the indexes can only ever grow.  TDB2 doesn't do reference
> counting so it doesn't ever remove entries from the table as it doesn't
> know when a Node ID is no longer needed.  Also for RDF Terms that aren't
> directly interned (e.g. some numerics, booleans, dates etc), so primarily
> URIs, Blank Nodes and larger/arbitrarily typed literals, the Node ID
> actually encodes the offset into the Node Table to make Node ID to RDF Term
> decoding fast so you can’t just arbitrarily rewrite the Node Table.  And
> even if rewriting the Node Table were supported it would require rewriting
> all the indexes since those use the Node IDs.
> >
> > TL;DR the Node Table only grows because the cost of compacting it
> outweighs the benefits.  This is also why you may have seen advice in the
> past that if your database has a lot of DELETE operations made against it
> then in periodically dumping all the data and reloading it into a new
> database is recommended since that generates a fresh Node Table with only
> the RDF Terms currently in use.
> >
> > Secondly the indexes are themselves versioned storage, so when you
> modify the database a new state is created (potentially pointing to
> some/all of the existing data) but the old data is still there as well.
> This is done for two reasons:
> >
> > 1) It allows writes to overlap with ongoing reads to improve
> concurrency.  Essentially each read/write transaction operates on a
> snapshot of the data, a write creates a new snapshot but an ongoing read
> can continue to read the old snapshot it was working against
> > 2) It provides for strong fault tolerance since a crash/exit during a
> write doesn't affect old data
>
> 3) Arbitrarily large transactions.
>
> > Note that you can perform a compact operation on a TDB2 database which
> essentially discards all but the latest snapshot and should reclaim the
> index data that is no longer needed.  This is a blocking exclusive write
> operation so doesn't allow for concurrent reads as a normal write would.
>
> Nowadays, reads continue during compaction; it's only writes that get
> held up (I'd like to add delta-technology to fix that).
>
> There is a short period of pointer swapping with some disk sync at the
> end to switch the database in-use; it is milliseconds.
>
>         Andy
>
> >
> > Cheers,
> >
> > Rob
> >
> > PS. I'm sure Andy will chime in if I've misrepresented/misstated
> anything above
> >
> > On 22/11/2021, 21:15, "Marco Neumann" <ma...@gmail.com> wrote:
> >
> >      Yes I just had a look at one of my own datasets with 180mt and a
> footprint
> >      of 28G. The overhead is not too bad at 10-20%. vs raw nt files
> >
> >      I was surprised that the CLEAR ALL directive doesn't remove/release
> disk
> >      memory. Does TDB2 require a commit to release disk space?
> >
> >      impressed to see that load times went up to 250k/s with 4.2. more
> than
> >      twice the speed I have seen with 3.15. Not sure if this is OS
> (Ubuntu
> >      20.04.3 LTS) related.
> >
> >      Maybe we should make a recommendation to the wikidata team to
> provide us
> >      with a production environment type machine to run some load and
> query tests.
> >
> >
> >
> >
> >
> >
> >      On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org>
> wrote:
> >
> >      >
> >      >
> >      > On 21/11/2021 21:03, Marco Neumann wrote:
> >      > > What's the disk footprint these days for 1b on tdb2?
> >      >
> >      > Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on
> significant
> >      > sized literals - the node themselves are 50G). Obvious for
> current WD
> >      > scale usage a sprinkling of compression would be good!
> >      >
> >      > One thing xloader gives us is that it makes it possible to load
> on a
> >      > spinning disk. (it also has lower peak intermediate file space and
> >      > faster because it does not fall into a slow loading mode for the
> node
> >      > table that tdbloader2 did sometimes.)
> >      >
> >      >      Andy
> >      >
> >      > >
> >      > > On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org>
> wrote:
> >      > >
> >      > >>
> >      > >>
> >      > >> On 20/11/2021 14:21, Andy Seaborne wrote:
> >      > >>> Wikidata are looking for a replace for BlazeGraph
> >      > >>>
> >      > >>> About WDQS, current scale and current challenges
> >      > >>>     https://youtu.be/wn2BrQomvFU?t=9148
> >      > >>>
> >      > >>> And in the process of appointing a graph consultant: (5 month
> >      > contract):
> >      > >>> https://boards.greenhouse.io/wikimedia/jobs/3546920
> >      > >>>
> >      > >>> and Apache Jena came up:
> >      > >>> https://phabricator.wikimedia.org/T206560#7517212
> >      > >>>
> >      > >>> Realistically?
> >      > >>>
> >      > >>> Full wikidata is 16B triples. Very hard to load - xloader may
> help
> >      > >>> though the goal for that was to make loading the truthy
> subset (5B)
> >      > >>> easier. 5B -> 16B is not a trivial step.
> >      > >>
> >      > >> And it's growing at about 1B per quarter.
> >      > >>
> >      > >>
> >      >
> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
> >      > >>
> >      > >>>
> >      > >>> Even if wikidata loads, it would be impractically slow as TDB
> is today.
> >      > >>> (yes, that's fixable; not practical in their timescales.)
> >      > >>>
> >      > >>> The current discussions feel more like they are looking for a
> "product"
> >      > >>> - a triplestore that they are use - rather than a
> collaboration.
> >      > >>>
> >      > >>>       Andy
> >      > >>
> >      > >
> >      > >
> >      >
> >
> >
> >      --
> >
> >
> >      ---
> >      Marco Neumann
> >      KONA
> >
> >
> >
> >
>


-- 


---
Marco Neumann
KONA

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.

On 23/11/2021 09:40, Rob Vesse wrote:
> Marco
> 
> So there's a couple of things going on.
> 
> Firstly the Node Table, the mapping of RDF Terms to the internal Node IDs used in the indexes can only ever grow.  TDB2 doesn't do reference counting so it doesn't ever remove entries from the table as it doesn't know when a Node ID is no longer needed.  Also for RDF Terms that aren't directly interned (e.g. some numerics, booleans, dates etc), so primarily URIs, Blank Nodes and larger/arbitrarily typed literals, the Node ID actually encodes the offset into the Node Table to make Node ID to RDF Term decoding fast so you can’t just arbitrarily rewrite the Node Table.  And even if rewriting the Node Table were supported it would require rewriting all the indexes since those use the Node IDs.
> 
> TL;DR the Node Table only grows because the cost of compacting it outweighs the benefits.  This is also why you may have seen advice in the past that if your database has a lot of DELETE operations made against it then in periodically dumping all the data and reloading it into a new database is recommended since that generates a fresh Node Table with only the RDF Terms currently in use.
> 
> Secondly the indexes are themselves versioned storage, so when you modify the database a new state is created (potentially pointing to some/all of the existing data) but the old data is still there as well.  This is done for two reasons:
> 
> 1) It allows writes to overlap with ongoing reads to improve concurrency.  Essentially each read/write transaction operates on a snapshot of the data, a write creates a new snapshot but an ongoing read can continue to read the old snapshot it was working against
> 2) It provides for strong fault tolerance since a crash/exit during a write doesn't affect old data

3) Arbitrarily large transactions.

> Note that you can perform a compact operation on a TDB2 database which essentially discards all but the latest snapshot and should reclaim the index data that is no longer needed.  This is a blocking exclusive write operation so doesn't allow for concurrent reads as a normal write would.

Nowadays, reads continue during compaction; it's only writes that get 
held up (I'd like to add delta-technology to fix that).

There is a short period of pointer swapping with some disk sync at the 
end to switch the database in-use; it is milliseconds.

	Andy

> 
> Cheers,
> 
> Rob
> 
> PS. I'm sure Andy will chime in if I've misrepresented/misstated anything above
> 
> On 22/11/2021, 21:15, "Marco Neumann" <ma...@gmail.com> wrote:
> 
>      Yes I just had a look at one of my own datasets with 180mt and a footprint
>      of 28G. The overhead is not too bad at 10-20%. vs raw nt files
> 
>      I was surprised that the CLEAR ALL directive doesn't remove/release disk
>      memory. Does TDB2 require a commit to release disk space?
> 
>      impressed to see that load times went up to 250k/s with 4.2. more than
>      twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
>      20.04.3 LTS) related.
> 
>      Maybe we should make a recommendation to the wikidata team to provide us
>      with a production environment type machine to run some load and query tests.
> 
> 
> 
> 
> 
> 
>      On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
> 
>      >
>      >
>      > On 21/11/2021 21:03, Marco Neumann wrote:
>      > > What's the disk footprint these days for 1b on tdb2?
>      >
>      > Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
>      > sized literals - the node themselves are 50G). Obvious for current WD
>      > scale usage a sprinkling of compression would be good!
>      >
>      > One thing xloader gives us is that it makes it possible to load on a
>      > spinning disk. (it also has lower peak intermediate file space and
>      > faster because it does not fall into a slow loading mode for the node
>      > table that tdbloader2 did sometimes.)
>      >
>      >      Andy
>      >
>      > >
>      > > On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
>      > >
>      > >>
>      > >>
>      > >> On 20/11/2021 14:21, Andy Seaborne wrote:
>      > >>> Wikidata are looking for a replace for BlazeGraph
>      > >>>
>      > >>> About WDQS, current scale and current challenges
>      > >>>     https://youtu.be/wn2BrQomvFU?t=9148
>      > >>>
>      > >>> And in the process of appointing a graph consultant: (5 month
>      > contract):
>      > >>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>      > >>>
>      > >>> and Apache Jena came up:
>      > >>> https://phabricator.wikimedia.org/T206560#7517212
>      > >>>
>      > >>> Realistically?
>      > >>>
>      > >>> Full wikidata is 16B triples. Very hard to load - xloader may help
>      > >>> though the goal for that was to make loading the truthy subset (5B)
>      > >>> easier. 5B -> 16B is not a trivial step.
>      > >>
>      > >> And it's growing at about 1B per quarter.
>      > >>
>      > >>
>      > https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
>      > >>
>      > >>>
>      > >>> Even if wikidata loads, it would be impractically slow as TDB is today.
>      > >>> (yes, that's fixable; not practical in their timescales.)
>      > >>>
>      > >>> The current discussions feel more like they are looking for a "product"
>      > >>> - a triplestore that they are use - rather than a collaboration.
>      > >>>
>      > >>>       Andy
>      > >>
>      > >
>      > >
>      >
> 
> 
>      --
> 
> 
>      ---
>      Marco Neumann
>      KONA
> 
> 
> 
> 

Re: Wikidata evolution

Posted by Marco Neumann <ma...@gmail.com>.
Thank you for the clarification Rob. I presume this happens at an atomic
(triple) level for updates as well?


On Tue, Nov 23, 2021 at 9:42 AM Rob Vesse <rv...@dotnetrdf.org> wrote:

> Marco
>
> So there's a couple of things going on.
>
> Firstly the Node Table, the mapping of RDF Terms to the internal Node IDs
> used in the indexes can only ever grow.  TDB2 doesn't do reference counting
> so it doesn't ever remove entries from the table as it doesn't know when a
> Node ID is no longer needed.  Also for RDF Terms that aren't directly
> interned (e.g. some numerics, booleans, dates etc), so primarily URIs,
> Blank Nodes and larger/arbitrarily typed literals, the Node ID actually
> encodes the offset into the Node Table to make Node ID to RDF Term decoding
> fast so you can’t just arbitrarily rewrite the Node Table.  And even if
> rewriting the Node Table were supported it would require rewriting all the
> indexes since those use the Node IDs.
>
> TL;DR the Node Table only grows because the cost of compacting it
> outweighs the benefits.  This is also why you may have seen advice in the
> past that if your database has a lot of DELETE operations made against it
> then in periodically dumping all the data and reloading it into a new
> database is recommended since that generates a fresh Node Table with only
> the RDF Terms currently in use.
>
> Secondly the indexes are themselves versioned storage, so when you modify
> the database a new state is created (potentially pointing to some/all of
> the existing data) but the old data is still there as well.  This is done
> for two reasons:
>
> 1) It allows writes to overlap with ongoing reads to improve concurrency.
> Essentially each read/write transaction operates on a snapshot of the data,
> a write creates a new snapshot but an ongoing read can continue to read the
> old snapshot it was working against
> 2) It provides for strong fault tolerance since a crash/exit during a
> write doesn't affect old data
>
> Note that you can perform a compact operation on a TDB2 database which
> essentially discards all but the latest snapshot and should reclaim the
> index data that is no longer needed.  This is a blocking exclusive write
> operation so doesn't allow for concurrent reads as a normal write would.
>
> Cheers,
>
> Rob
>
> PS. I'm sure Andy will chime in if I've misrepresented/misstated anything
> above
>
> On 22/11/2021, 21:15, "Marco Neumann" <ma...@gmail.com> wrote:
>
>     Yes I just had a look at one of my own datasets with 180mt and a
> footprint
>     of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>
>     I was surprised that the CLEAR ALL directive doesn't remove/release
> disk
>     memory. Does TDB2 require a commit to release disk space?
>
>     impressed to see that load times went up to 250k/s with 4.2. more than
>     twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
>     20.04.3 LTS) related.
>
>     Maybe we should make a recommendation to the wikidata team to provide
> us
>     with a production environment type machine to run some load and query
> tests.
>
>
>
>
>
>
>     On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
>
>     >
>     >
>     > On 21/11/2021 21:03, Marco Neumann wrote:
>     > > What's the disk footprint these days for 1b on tdb2?
>     >
>     > Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
>     > sized literals - the node themselves are 50G). Obvious for current WD
>     > scale usage a sprinkling of compression would be good!
>     >
>     > One thing xloader gives us is that it makes it possible to load on a
>     > spinning disk. (it also has lower peak intermediate file space and
>     > faster because it does not fall into a slow loading mode for the node
>     > table that tdbloader2 did sometimes.)
>     >
>     >      Andy
>     >
>     > >
>     > > On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org>
> wrote:
>     > >
>     > >>
>     > >>
>     > >> On 20/11/2021 14:21, Andy Seaborne wrote:
>     > >>> Wikidata are looking for a replace for BlazeGraph
>     > >>>
>     > >>> About WDQS, current scale and current challenges
>     > >>>     https://youtu.be/wn2BrQomvFU?t=9148
>     > >>>
>     > >>> And in the process of appointing a graph consultant: (5 month
>     > contract):
>     > >>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>     > >>>
>     > >>> and Apache Jena came up:
>     > >>> https://phabricator.wikimedia.org/T206560#7517212
>     > >>>
>     > >>> Realistically?
>     > >>>
>     > >>> Full wikidata is 16B triples. Very hard to load - xloader may
> help
>     > >>> though the goal for that was to make loading the truthy subset
> (5B)
>     > >>> easier. 5B -> 16B is not a trivial step.
>     > >>
>     > >> And it's growing at about 1B per quarter.
>     > >>
>     > >>
>     >
> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
>     > >>
>     > >>>
>     > >>> Even if wikidata loads, it would be impractically slow as TDB is
> today.
>     > >>> (yes, that's fixable; not practical in their timescales.)
>     > >>>
>     > >>> The current discussions feel more like they are looking for a
> "product"
>     > >>> - a triplestore that they are use - rather than a collaboration.
>     > >>>
>     > >>>       Andy
>     > >>
>     > >
>     > >
>     >
>
>
>     --
>
>
>     ---
>     Marco Neumann
>     KONA
>
>
>
>
>

-- 


---
Marco Neumann
KONA

Re: Wikidata evolution

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Marco

So there's a couple of things going on.

Firstly the Node Table, the mapping of RDF Terms to the internal Node IDs used in the indexes can only ever grow.  TDB2 doesn't do reference counting so it doesn't ever remove entries from the table as it doesn't know when a Node ID is no longer needed.  Also for RDF Terms that aren't directly interned (e.g. some numerics, booleans, dates etc), so primarily URIs, Blank Nodes and larger/arbitrarily typed literals, the Node ID actually encodes the offset into the Node Table to make Node ID to RDF Term decoding fast so you can’t just arbitrarily rewrite the Node Table.  And even if rewriting the Node Table were supported it would require rewriting all the indexes since those use the Node IDs.  
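
To make the offset-as-id idea concrete, here is a toy sketch (plain Java,
invented names - not Jena's actual node table classes or on-disk format):

import java.util.HashMap;
import java.util.Map;

// Toy append-only "node table": the id handed back for a term is simply the
// offset at which the term's record starts, so id -> term is a direct seek
// with no extra lookup structure. Rewriting/compacting the table would move
// every offset and so invalidate every id already stored in the indexes.
class ToyNodeTable {
    private final StringBuilder file = new StringBuilder(); // stand-in for the node data file
    private final Map<String, Long> termToId = new HashMap<>();

    long getOrAllocateId(String term) {
        Long existing = termToId.get(term);
        if (existing != null)
            return existing;
        long id = file.length();            // id == offset of the record
        file.append(term).append('\n');     // append-only: entries are never removed
        termToId.put(term, id);
        return id;
    }

    String getTerm(long id) {
        int end = file.indexOf("\n", (int) id);
        return file.substring((int) id, end);
    }
}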

TL;DR the Node Table only grows because the cost of compacting it outweighs the benefits.  This is also why you may have seen advice in the past that if your database has a lot of DELETE operations made against it then periodically dumping all the data and reloading it into a new database is recommended, since that generates a fresh Node Table with only the RDF Terms currently in use.

Secondly the indexes are themselves versioned storage, so when you modify the database a new state is created (potentially pointing to some/all of the existing data) but the old data is still there as well.  This is done for two reasons:

1) It allows writes to overlap with ongoing reads to improve concurrency.  Essentially each read/write transaction operates on a snapshot of the data, a write creates a new snapshot but an ongoing read can continue to read the old snapshot it was working against
2) It provides for strong fault tolerance since a crash/exit during a write doesn't affect old data
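
A minimal sketch of that snapshot behaviour through the Jena API (the dataset
location and the thread choreography are illustrative only): a reader that
began before the write keeps seeing the pre-write state.

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.tdb2.TDB2Factory;
import org.apache.jena.vocabulary.RDFS;

public class SnapshotSketch {
    public static void main(String[] args) throws InterruptedException {
        Dataset ds = TDB2Factory.connectDataset("/data/tdb2-demo"); // hypothetical location

        Thread reader = new Thread(() -> {
            ds.begin(ReadWrite.READ);                    // snapshot taken here
            long before = ds.getDefaultModel().size();
            sleepQuietly(1000);                          // the writer below commits meanwhile
            long after = ds.getDefaultModel().size();    // still the old snapshot
            ds.end();
            System.out.println("reader saw " + before + " then " + after); // same number
        });
        reader.start();

        Thread.sleep(200);
        ds.begin(ReadWrite.WRITE);                       // creates a new snapshot
        ds.getDefaultModel()
          .createResource("urn:example:s")
          .addProperty(RDFS.label, "added while the reader was active");
        ds.commit();
        ds.end();

        reader.join();
    }

    private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}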

Note that you can perform a compact operation on a TDB2 database which essentially discards all but the latest snapshot and should reclaim the index data that is no longer needed.  This is a blocking exclusive write operation so doesn't allow for concurrent reads as a normal write would.
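
For reference, compaction can be triggered from the command line
(tdb2.tdbcompact) or from Java; a minimal sketch of the latter, with an
illustrative database path:

import org.apache.jena.query.Dataset;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.tdb2.DatabaseMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class CompactSketch {
    public static void main(String[] args) {
        Dataset ds = TDB2Factory.connectDataset("/data/tdb2-demo"); // hypothetical path
        DatasetGraph dsg = ds.asDatasetGraph();
        // Writes a fresh Data-NNNN generation holding only the latest state;
        // the previous generation stays on disk until it is removed.
        DatabaseMgr.compactDatasetGraph(dsg);
    }
}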

Cheers,

Rob

PS. I'm sure Andy will chime in if I've misrepresented/misstated anything above

On 22/11/2021, 21:15, "Marco Neumann" <ma...@gmail.com> wrote:

    Yes I just had a look at one of my own datasets with 180mt and a footprint
    of 28G. The overhead is not too bad at 10-20%. vs raw nt files

    I was surprised that the CLEAR ALL directive doesn't remove/release disk
    memory. Does TDB2 require a commit to release disk space?

    impressed to see that load times went up to 250k/s with 4.2. more than
    twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
    20.04.3 LTS) related.

    Maybe we should make a recommendation to the wikidata team to provide us
    with a production environment type machine to run some load and query tests.






    On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:

    >
    >
    > On 21/11/2021 21:03, Marco Neumann wrote:
    > > What's the disk footprint these days for 1b on tdb2?
    >
    > Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
    > sized literals - the node themselves are 50G). Obvious for current WD
    > scale usage a sprinkling of compression would be good!
    >
    > One thing xloader gives us is that it makes it possible to load on a
    > spinning disk. (it also has lower peak intermediate file space and
    > faster because it does not fall into a slow loading mode for the node
    > table that tdbloader2 did sometimes.)
    >
    >      Andy
    >
    > >
    > > On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
    > >
    > >>
    > >>
    > >> On 20/11/2021 14:21, Andy Seaborne wrote:
    > >>> Wikidata are looking for a replace for BlazeGraph
    > >>>
    > >>> About WDQS, current scale and current challenges
    > >>>     https://youtu.be/wn2BrQomvFU?t=9148
    > >>>
    > >>> And in the process of appointing a graph consultant: (5 month
    > contract):
    > >>> https://boards.greenhouse.io/wikimedia/jobs/3546920
    > >>>
    > >>> and Apache Jena came up:
    > >>> https://phabricator.wikimedia.org/T206560#7517212
    > >>>
    > >>> Realistically?
    > >>>
    > >>> Full wikidata is 16B triples. Very hard to load - xloader may help
    > >>> though the goal for that was to make loading the truthy subset (5B)
    > >>> easier. 5B -> 16B is not a trivial step.
    > >>
    > >> And it's growing at about 1B per quarter.
    > >>
    > >>
    > https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
    > >>
    > >>>
    > >>> Even if wikidata loads, it would be impractically slow as TDB is today.
    > >>> (yes, that's fixable; not practical in their timescales.)
    > >>>
    > >>> The current discussions feel more like they are looking for a "product"
    > >>> - a triplestore that they are use - rather than a collaboration.
    > >>>
    > >>>       Andy
    > >>
    > >
    > >
    >


    -- 


    ---
    Marco Neumann
    KONA





Re: Wikidata evolution

Posted by LB <co...@googlemail.com.INVALID>.
On 06.03.22 23:21, Andy Seaborne wrote:
>
>
> On 06/03/2022 09:08, LB wrote:
>> Hi Andy,
>>
>> yes I also did with rewriting which indeed performed faster. Indeed 
>> the issue here was that TDB2 didn't use the stats.opt file because it 
>> was in the wrong location. I'm still not convinced that it should be 
>> located in $TDB2_LOCATION/Data-XXX instead of $TDB2_LOCATION/ - 
>> especially when you would do a compact or some other operation you 
>> will have to move the stats file to the newer data directory.
>
> Compaction could copy it over.
>
> Actually, compact could update it if it is generated by stats.
>
> The reason it's in Data-XXXX is that it is related to the storage not 
> the switchable overlay.  Not immovable but the reason it is where it 
> is. History.

Oh, didn't know this. Did we document this somewhere? The only section 
about the stats is more or less for TDB(1). While technically I think 
still correct to say

" The output should first go to a temporary file, then that file moved 
into the database location."

users might be a bit confused, especially given that the parameter 
"location" mostly holds for the TDB2 base directory.

>
>> Moved the TDB2 instance now to the faster server Ryzen 5950X, 
>> 16C/32C, 128GB RAM, 3.4GB NVMe RAID 1
>>
>> I did few more queries which take a lot of time, either my machine is 
>> too slow or it is just as it is:
>>
>> The count query to compute the dataset size
>>
>> SELECT (count(*) as ?cnt) {
>>    ?s ?p ?o
>> }
>>
>> Runtime: 2489.115 seconds
>
> That will use the SPO.* files.
>
> Massively sensitive to warm up!
I see - I'm still not sure what is limiting here. Reading that file 
should be fast, no? Is it the Java deserialization? I don't get why 
iotop shows 150M/s on an NVMe. When I do benchmarks with dd or fio, the 
read performance is of course way faster, in GB/s scale.
>
> I am suspicious that OS caching could be better informed about access 
> patterns - that would take some native code (a lot more practical in 
> Java17).
I'm wondering, did you ever consider having those larger files split 
into smaller ones to maybe make use of the IO of modern hardware? I know 
partitioning always comes with pros/cons, but I'm sure you have also 
looked into that direction.
>
>> Observations are rather hard, iotop showed a read speed of 150M/s - I 
>> don't know how to interpret this, sounds rather slow for an NVMe SSD. 
>> I also don't yet understand which files are touched. If those are 
>> just the index files, then I don't get why it takes so much time 
>> given that the index files are rather small with ~1G
>>
>>
>> Some counting with a join and a filter:
>>
>> SELECT (count(*) as ?cnt) WHERE {
>>    ?s wdt:P31 wd:Q5 ;
>>       rdfs:label ?l
>>    filter(lang(?l)='en')
>> }
>>
>> Runtime: 4817.022 seconds
>>
>>
>> I compared those queries with (public) QLever triple store, the 
>> latter query takes 2s - indeed as this is on their public server the 
>> comparison is not fair, and maybe there init process does more 
>> caching in advance.
>>
>> I'm also trying to set it up locally on the same server as the TDB2 
>> instance and will compare again - just learned that in future we 
>> should rent servers with way more disk space ... "lesson learned"
>
> Great.
>
> From the public server, QLever doesn't support much of SPARQL functions.
True. They are currently in the process of supporting full SPARQL 1.1 
and are aware of their current limitations, afaik, according to GitHub 
discussions.
>
>     Andy
>
>>
>>
>>
>> On 05.03.22 11:57, Andy Seaborne wrote:
>>> Two comments inline:
>>>
>>> On 02/03/2022 15:41, LB wrote:
>>>> Hm,
>>>>
>>>> coming back to this query
>>>>
>>>> SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>>
>>>> I calculated the triple pattern sizes:
>>>>
>>>> p:P625: ~9M
>>>>
>>>> ps:P625: ~9M
>>>>
>>>> pq:P376: ~1K
>>>
>>> Also try to rewrite with :P376 first.
>>> SELECT * WHERE
>>>   { _:x ps:P625 ?o; pq:P376 ?body. ?s p:P625 _:x . }
>>> LIMIT 100
>>>
>>>
>>> which is:
>>>
>>>
>>> SELECT  *
>>> WHERE
>>>   { ?s    p:P625   _:b0 .
>>>     _:b0  ps:P625  ?o .
>>>     _:b0  pq:P376  ?body
>>>   }
>>> LIMIT   100
>>>
>>> ==>
>>>
>>>
>>> SELECT  *
>>> WHERE
>>>   { _:b0  pq:P376  ?body .
>>>     _:b0  ps:P625  ?o .
>>>     ?s    p:P625   _:b0 .
>>>   }
>>> LIMIT   100
>>>
>>> (
>>>
>>>>
>>>> Even with computing TDB stats it doesn't seem to perform well (not 
>>>> sure if those steps have been taken into account, as usual I put 
>>>> stats.opt into TDB dir). Took 180s even after I did a full count of 
>>>> all 18.8B triples in advance to warm cache. 
>>>
>>> Counting by itself only warm triple indexes, not the node table, nor 
>>> it's indexes.
>>>
>>> COUNT(*) or COUNT(?x) does not need the details of the RDF term 
>>> itself. Term results out of TDB are lazily computed and COUNT, by 
>>> design, does not trigger pulling from the node table.
>>>
>>>     Andy
>>>
>>>> I guess the files are rather larger
>>>>
>>>>> 373G    OSP.dat
>>>>> 373G    POS.dat
>>>>> 373G    SPO.dat
>>>>> 186G    nodes-data.obj
>>>>> 85G    nodes.dat
>>>>> 1,3G    OSP.idn
>>>>> 1,3G    POS.idn
>>>>> 1,3G    SPO.idn
>>>>> 720M    nodes.idn
>>>> for computation it would touch which files first?
>>>>
>>>> By the way, counting all 18.8B triples took ~6000s - HDD read speed 
>>>> was ~70M/s and given that we have 1.4TB disk size ...
>>>>
>>>> Long story short, with that slow HDD setup it takes ages or I'm 
>>>> doing something fundamentally wrong. Will copy over the TDB image 
>>>> to another server with SSD to see how things will change,.
>>>>
>>>>
>>>> On 02.03.22 14:12, Andy Seaborne wrote:
>>>>> > iotops showed ~400M/s while executing the last time. Does this
>>>>> > performance drop really come from HDD vs SSD?
>>>>>
>>>>> Yes - it could well do.
>>>>>
>>>>> Try running the queries twice in the same server.
>>>>>
>>>>> TDB does no pre-work whatsoever so file system caching is 
>>>>> significant.
>>>>>
>>>>> > Especially the last two
>>>>> > queries just have different limits, so I assume the joins are 
>>>>> just too
>>>>> > heavy?
>>>>>
>>>>>     Andy
>>>>>
>>>>> On 02/03/2022 08:22, LB wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> just as a follow up I loaded Wikidata latest full into TDB2 via 
>>>>>> xloader on a different less powerful server:
>>>>>>
>>>>>> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 
>>>>>> 2 threads per core, -> 16C/32T)
>>>>>> - 128GB RAM
>>>>>> - non SSD RAID
>>>>>>
>>>>>> it took about 93h with  --threads 28; again I lost the logs 
>>>>>> because somebody rebootet the server yesterday, will restart it 
>>>>>> soon to keep logs on disk this time instead of terminal
>>>>>>
>>>>>> Afterwards I started querying a bit via Fuseki, and surprisingly 
>>>>>> for a very common Wikidata query making use of qualifiers the 
>>>>>> performance was rather low:
>>>>>>
>>>>>>> 16:22:29 INFO  Server :: Started 2022/03/01 16:22:29 CET on port 
>>>>>>> 3031
>>>>>>> 16:24:54 INFO  Fuseki          :: [1] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>>>>>>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s 
>>>>>>> wdt:P625 ?o } LIMIT 10
>>>>>>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>>>>>>> 16:25:57 INFO  Fuseki          :: [2] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>>> <http://www.wikidata.org/prop/statement/> PREFIX
>>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>>>>>>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>>>>>>> 16:26:51 INFO  Fuseki          :: [3] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>>> <http://www.wikidata.org/prop/statement/> PREFIX
>>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>>>>>>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>>>>>>> 16:27:21 INFO  Fuseki          :: [4] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>>> <http://www.wikidata.org/prop/statement/> PREFIX
>>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>>>>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
>>>>>>
>>>>>> iotops showed ~400M/s while executing the last time. Does this 
>>>>>> performance drop really come from HDD vs SSD? Especially the last 
>>>>>> two queries just have different limits, so I assume the joins are 
>>>>>> just too heavy?
>>>>>>
>>>>>>
>>>>>> On 23.11.21 13:10, Andy Seaborne wrote:
>>>>>>> Try loading truthy:
>>>>>>>
>>>>>>>
>>>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>>>>>>
>>>>>>>
>>>>>>> (it always has "BETA" in the name)
>>>>>>>
>>>>>>> which the current latest:
>>>>>>>
>>>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2 
>>>>>>>
>>>>>>>
>>>>>>>     Andy
>>>>>>>
>>>>>>> On 23/11/2021 11:12, Marco Neumann wrote:
>>>>>>>> that's on commodity hardware
>>>>>>>>
>>>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>>>>>>
>>>>>>>> load times are just load times. Including indexing I'm down to 
>>>>>>>> 137,217 t/s
>>>>>>>>
>>>>>>>> sure with a billion triples I am down to 87kt/s
>>>>>>>>
>>>>>>>> but still reasonable for most of my use cases.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne 
>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>>>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>>>>>>> footprint
>>>>>>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>>>>>>
>>>>>>>>>> I was surprised that the CLEAR ALL directive doesn't 
>>>>>>>>>> remove/release disk
>>>>>>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>>>>>>
>>>>>>>>> Any active read transactions can still see the old data. You 
>>>>>>>>> can't
>>>>>>>>> delete it for real.
>>>>>>>>>
>>>>>>>>> Run compact.
>>>>>>>>>
>>>>>>>>>> impressed to see that load times went up to 250k/s
>>>>>>>>>
>>>>>>>>> What was the hardware?
>>>>>>>>>
>>>>>>>>>> with 4.2. more than
>>>>>>>>>> twice the speed I have seen with 3.15. Not sure if this is OS 
>>>>>>>>>> (Ubuntu
>>>>>>>>>> 20.04.3 LTS) related.
>>>>>>>>>
>>>>>>>>> You won't get 250k at scale. Loading rate slows for 
>>>>>>>>> algorithmic reasons
>>>>>>>>> and system reasons.
>>>>>>>>>
>>>>>>>>> Now try 500m!
>>>>>>>>>
>>>>>>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>>>>>>> provide us
>>>>>>>>>> with a production environment type machine to run some load 
>>>>>>>>>> and query
>>>>>>>>> tests.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne 
>>>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>>>>>>
>>>>>>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on 
>>>>>>>>>>> significant
>>>>>>>>>>> sized literals - the node themselves are 50G). Obvious for 
>>>>>>>>>>> current WD
>>>>>>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>>>>>>
>>>>>>>>>>> One thing xloader gives us is that it makes it possible to 
>>>>>>>>>>> load on a
>>>>>>>>>>> spinning disk. (it also has lower peak intermediate file 
>>>>>>>>>>> space and
>>>>>>>>>>> faster because it does not fall into a slow loading mode for 
>>>>>>>>>>> the node
>>>>>>>>>>> table that tdbloader2 did sometimes.)
>>>>>>>>>>>
>>>>>>>>>>>        Andy
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne 
>>>>>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>>>>>> https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And in the process of appointing a graph consultant: (5 
>>>>>>>>>>>>>> month
>>>>>>>>>>> contract):
>>>>>>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and Apache Jena came up:
>>>>>>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Realistically?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader 
>>>>>>>>>>>>>> may help
>>>>>>>>>>>>>> though the goal for that was to make loading the truthy 
>>>>>>>>>>>>>> subset (5B)
>>>>>>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even if wikidata loads, it would be impractically slow as 
>>>>>>>>>>>>>> TDB is
>>>>>>>>> today.
>>>>>>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current discussions feel more like they are looking 
>>>>>>>>>>>>>> for a
>>>>>>>>> "product"
>>>>>>>>>>>>>> - a triplestore that they are use - rather than a 
>>>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.

>> Some counting with a join and a filter:
>>
>> SELECT (count(*) as ?cnt) WHERE {
>>    ?s wdt:P31 wd:Q5 ;
>>       rdfs:label ?l
>>    filter(lang(?l)='en')
>> }
>>
>> Runtime: 4817.022 seconds
>>
>>
>> I compared those queries with (public) QLever triple store, the latter 
>> query takes 2s - indeed as this is on their public server the 
>> comparison is not fair, and maybe there init process does more caching 
>> in advance.

And it uses a different style of join - it will do either a (parallel) 
sorted merge join or maybe a combined scan of ?s wdt:P31 wd:Q5, picking 
out rdfs:label as it goes.

QLever has 6 indexes for triples - every sort order of SPO. It's 
configurable; they don't all have to be built. Most of the time it needs 
two -- PSO and POS -- and here maybe OPS.

?s wdt:P31 wd:Q5 .
?s rdfs:label ?l .

As it is highly text centric it might apply (lang(?l)='en') quite low 
down but I think that would be a secondary benefit.
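
As a purely generic illustration of the sorted merge join mentioned above
(not QLever's or TDB's actual code): both patterns yield subject ids in
sorted order, so the join is one linear pass.

import java.util.ArrayList;
import java.util.List;

class MergeJoinSketch {
    // left  : sorted subject ids matching  ?s wdt:P31 wd:Q5
    // right : sorted subject ids matching  ?s rdfs:label ?l
    // (multiple rows per subject are ignored here for brevity)
    static List<Long> mergeJoin(long[] left, long[] right) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j])       i++;
            else if (left[i] > right[j])  j++;
            else { out.add(left[i]); i++; j++; }
        }
        return out;
    }
}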

>> I'm also trying to set it up locally on the same server as the TDB2 
>> instance and will compare again - just learned that in future we 
>> should rent servers with way more disk space ... "lesson learned"
> 
> Great.
> 
>  From the public server, QLever doesn't support much of SPARQL functions.
> 
>      Andy
> 
>>
>>
>>
>> On 05.03.22 11:57, Andy Seaborne wrote:
>>> Two comments inline:
>>>
>>> On 02/03/2022 15:41, LB wrote:
>>>> Hm,
>>>>
>>>> coming back to this query
>>>>
>>>> SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>>
>>>> I calculated the triple pattern sizes:
>>>>
>>>> p:P625: ~9M
>>>>
>>>> ps:P625: ~9M
>>>>
>>>> pq:P376: ~1K
>>>
>>> Also try to rewrite with :P376 first.
>>> SELECT * WHERE
>>>   { _:x ps:P625 ?o; pq:P376 ?body. ?s p:P625 _:x . }
>>> LIMIT 100
>>>
>>>
>>> which is:
>>>
>>>
>>> SELECT  *
>>> WHERE
>>>   { ?s    p:P625   _:b0 .
>>>     _:b0  ps:P625  ?o .
>>>     _:b0  pq:P376  ?body
>>>   }
>>> LIMIT   100
>>>
>>> ==>
>>>
>>>
>>> SELECT  *
>>> WHERE
>>>   { _:b0  pq:P376  ?body .
>>>     _:b0  ps:P625  ?o .
>>>     ?s    p:P625   _:b0 .
>>>   }
>>> LIMIT   100
>>>
>>> (
>>>
>>>>
>>>> Even with computing TDB stats it doesn't seem to perform well (not 
>>>> sure if those steps have been taken into account, as usual I put 
>>>> stats.opt into TDB dir). Took 180s even after I did a full count of 
>>>> all 18.8B triples in advance to warm cache. 
>>>
>>> Counting by itself only warm triple indexes, not the node table, nor 
>>> it's indexes.
>>>
>>> COUNT(*) or COUNT(?x) does not need the details of the RDF term 
>>> itself. Term results out of TDB are lazily computed and COUNT, by 
>>> design, does not trigger pulling from the node table.
>>>
>>>     Andy
>>>
>>>> I guess the files are rather larger
>>>>
>>>>> 373G    OSP.dat
>>>>> 373G    POS.dat
>>>>> 373G    SPO.dat
>>>>> 186G    nodes-data.obj
>>>>> 85G    nodes.dat
>>>>> 1,3G    OSP.idn
>>>>> 1,3G    POS.idn
>>>>> 1,3G    SPO.idn
>>>>> 720M    nodes.idn
>>>> for computation it would touch which files first?
>>>>
>>>> By the way, counting all 18.8B triples took ~6000s - HDD read speed 
>>>> was ~70M/s and given that we have 1.4TB disk size ...
>>>>
>>>> Long story short, with that slow HDD setup it takes ages or I'm 
>>>> doing something fundamentally wrong. Will copy over the TDB image to 
>>>> another server with SSD to see how things will change,.
>>>>
>>>>
>>>> On 02.03.22 14:12, Andy Seaborne wrote:
>>>>> > iotops showed ~400M/s while executing the last time. Does this
>>>>> > performance drop really come from HDD vs SSD?
>>>>>
>>>>> Yes - it could well do.
>>>>>
>>>>> Try running the queries twice in the same server.
>>>>>
>>>>> TDB does no pre-work whatsoever so file system caching is significant.
>>>>>
>>>>> > Especially the last two
>>>>> > queries just have different limits, so I assume the joins are 
>>>>> just too
>>>>> > heavy?
>>>>>
>>>>>     Andy
>>>>>
>>>>> On 02/03/2022 08:22, LB wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> just as a follow up I loaded Wikidata latest full into TDB2 via 
>>>>>> xloader on a different less powerful server:
>>>>>>
>>>>>> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
>>>>>> threads per core, -> 16C/32T)
>>>>>> - 128GB RAM
>>>>>> - non SSD RAID
>>>>>>
>>>>>> it took about 93h with  --threads 28; again I lost the logs 
>>>>>> because somebody rebootet the server yesterday, will restart it 
>>>>>> soon to keep logs on disk this time instead of terminal
>>>>>>
>>>>>> Afterwards I started querying a bit via Fuseki, and surprisingly 
>>>>>> for a very common Wikidata query making use of qualifiers the 
>>>>>> performance was rather low:
>>>>>>
>>>>>>> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET 
>>>>>>> on port 3031
>>>>>>> 16:24:54 INFO  Fuseki          :: [1] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>>>>>>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s 
>>>>>>> wdt:P625 ?o } LIMIT 10
>>>>>>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>>>>>>> 16:25:57 INFO  Fuseki          :: [2] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>>>>>>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>>>>>>> 16:26:51 INFO  Fuseki          :: [3] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>>>>>>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>>>>>>> 16:27:21 INFO  Fuseki          :: [4] POST 
>>>>>>> http://localhost:3031/ds/sparql
>>>>>>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>>>>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
>>>>>>
>>>>>> iotops showed ~400M/s while executing the last time. Does this 
>>>>>> performance drop really come from HDD vs SSD? Especially the last 
>>>>>> two queries just have different limits, so I assume the joins are 
>>>>>> just too heavy?
>>>>>>
>>>>>>
>>>>>> On 23.11.21 13:10, Andy Seaborne wrote:
>>>>>>> Try loading truthy:
>>>>>>>
>>>>>>>
>>>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>>>>>>
>>>>>>>
>>>>>>> (it always has "BETA" in the name)
>>>>>>>
>>>>>>> which the current latest:
>>>>>>>
>>>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2 
>>>>>>>
>>>>>>>
>>>>>>>     Andy
>>>>>>>
>>>>>>> On 23/11/2021 11:12, Marco Neumann wrote:
>>>>>>>> that's on commodity hardware
>>>>>>>>
>>>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>>>>>>
>>>>>>>> load times are just load times. Including indexing I'm down to 
>>>>>>>> 137,217 t/s
>>>>>>>>
>>>>>>>> sure with a billion triples I am down to 87kt/s
>>>>>>>>
>>>>>>>> but still reasonable for most of my use cases.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>>>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>>>>>>> footprint
>>>>>>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>>>>>>
>>>>>>>>>> I was surprised that the CLEAR ALL directive doesn't 
>>>>>>>>>> remove/release disk
>>>>>>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>>>>>>
>>>>>>>>> Any active read transactions can still see the old data. You can't
>>>>>>>>> delete it for real.
>>>>>>>>>
>>>>>>>>> Run compact.
>>>>>>>>>
>>>>>>>>>> impressed to see that load times went up to 250k/s
>>>>>>>>>
>>>>>>>>> What was the hardware?
>>>>>>>>>
>>>>>>>>>> with 4.2. more than
>>>>>>>>>> twice the speed I have seen with 3.15. Not sure if this is OS 
>>>>>>>>>> (Ubuntu
>>>>>>>>>> 20.04.3 LTS) related.
>>>>>>>>>
>>>>>>>>> You won't get 250k at scale. Loading rate slows for algorithmic 
>>>>>>>>> reasons
>>>>>>>>> and system reasons.
>>>>>>>>>
>>>>>>>>> Now try 500m!
>>>>>>>>>
>>>>>>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>>>>>>> provide us
>>>>>>>>>> with a production environment type machine to run some load 
>>>>>>>>>> and query
>>>>>>>>> tests.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne 
>>>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>>>>>>
>>>>>>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on 
>>>>>>>>>>> significant
>>>>>>>>>>> sized literals - the node themselves are 50G). Obvious for 
>>>>>>>>>>> current WD
>>>>>>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>>>>>>
>>>>>>>>>>> One thing xloader gives us is that it makes it possible to 
>>>>>>>>>>> load on a
>>>>>>>>>>> spinning disk. (it also has lower peak intermediate file 
>>>>>>>>>>> space and
>>>>>>>>>>> faster because it does not fall into a slow loading mode for 
>>>>>>>>>>> the node
>>>>>>>>>>> table that tdbloader2 did sometimes.)
>>>>>>>>>>>
>>>>>>>>>>>        Andy
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne 
>>>>>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>>>>>>>> contract):
>>>>>>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and Apache Jena came up:
>>>>>>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Realistically?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader 
>>>>>>>>>>>>>> may help
>>>>>>>>>>>>>> though the goal for that was to make loading the truthy 
>>>>>>>>>>>>>> subset (5B)
>>>>>>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even if wikidata loads, it would be impractically slow as 
>>>>>>>>>>>>>> TDB is
>>>>>>>>> today.
>>>>>>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current discussions feel more like they are looking for a
>>>>>>>>> "product"
>>>>>>>>>>>>>> - a triplestore that they are use - rather than a 
>>>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.

On 06/03/2022 09:08, LB wrote:
> Hi Andy,
> 
> yes I also did with rewriting which indeed performed faster. Indeed the 
> issue here was that TDB2 didn't use the stats.opt file because it was in 
> the wrong location. I'm still not convinced that it should be located in 
> $TDB2_LOCATION/Data-XXX instead of $TDB2_LOCATION/ - especially when you 
> would do a compact or some other operation you will have to move the 
> stats file to the newer data directory.

Compaction could copy it over.

Actually, compact could update it if it is generated by stats.

The reason it's in Data-XXXX is that it is related to the storage not 
the switchable overlay.  Not immovable but the reason it is where it is. 
History.

> Moved the TDB2 instance now to the faster server Ryzen 5950X, 16C/32C, 
> 128GB RAM, 3.4GB NVMe RAID 1
> 
> I did few more queries which take a lot of time, either my machine is 
> too slow or it is just as it is:
> 
> The count query to compute the dataset size
> 
> SELECT (count(*) as ?cnt) {
>    ?s ?p ?o
> }
> 
> Runtime: 2489.115 seconds

That will use the SPO.* files.

Massively sensitive to warm up!

I am suspicious that OS caching could be better informed about access 
patterns - that would take some native code (a lot more practical in 
Java17).
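
A minimal way to see the warm-up effect (a sketch against the Jena API;
the database location is illustrative): run the same count twice in one
JVM and compare timings - the second run reads mostly from the OS file
cache.

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;
import org.apache.jena.tdb2.TDB2Factory;

public class WarmUpSketch {
    static final String COUNT_ALL = "SELECT (count(*) AS ?cnt) { ?s ?p ?o }";

    public static void main(String[] args) {
        Dataset ds = TDB2Factory.connectDataset("/data/tdb2-wikidata"); // hypothetical
        for (int run = 1; run <= 2; run++) {
            long start = System.nanoTime();
            ds.begin(ReadWrite.READ);
            try (QueryExecution qExec = QueryExecutionFactory.create(COUNT_ALL, ds)) {
                ResultSet rs = qExec.execSelect();
                System.out.println("run " + run + ": cnt = " + rs.next().get("cnt"));
            } finally {
                ds.end();
            }
            System.out.println("run " + run + " took " + (System.nanoTime() - start) / 1e9 + " s");
        }
    }
}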

> Observations are rather hard, iotop showed a read speed of 150M/s - I 
> don't know how to interpret this, sounds rather slow for an NVMe SSD. I 
> also don't yet understand which files are touched. If those are just the 
> index files, then I don't get why it takes so much time given that the 
> index files are rather small with ~1G
> 
> 
> Some counting with a join and a filter:
> 
> SELECT (count(*) as ?cnt) WHERE {
>    ?s wdt:P31 wd:Q5 ;
>       rdfs:label ?l
>    filter(lang(?l)='en')
> }
> 
> Runtime: 4817.022 seconds
> 
> 
> I compared those queries with (public) QLever triple store, the latter 
> query takes 2s - indeed as this is on their public server the comparison 
> is not fair, and maybe there init process does more caching in advance.
> 
> I'm also trying to set it up locally on the same server as the TDB2 
> instance and will compare again - just learned that in future we should 
> rent servers with way more disk space ... "lesson learned"

Great.

 From the public server, QLever doesn't support many of the SPARQL functions.

	Andy

> 
> 
> 
> On 05.03.22 11:57, Andy Seaborne wrote:
>> Two comments inline:
>>
>> On 02/03/2022 15:41, LB wrote:
>>> Hm,
>>>
>>> coming back to this query
>>>
>>> SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>
>>> I calculated the triple pattern sizes:
>>>
>>> p:P625: ~9M
>>>
>>> ps:P625: ~9M
>>>
>>> pq:P376: ~1K
>>
>> Also try to rewrite with :P376 first.
>> SELECT * WHERE
>>   { _:x ps:P625 ?o; pq:P376 ?body. ?s p:P625 _:x . }
>> LIMIT 100
>>
>>
>> which is:
>>
>>
>> SELECT  *
>> WHERE
>>   { ?s    p:P625   _:b0 .
>>     _:b0  ps:P625  ?o .
>>     _:b0  pq:P376  ?body
>>   }
>> LIMIT   100
>>
>> ==>
>>
>>
>> SELECT  *
>> WHERE
>>   { _:b0  pq:P376  ?body .
>>     _:b0  ps:P625  ?o .
>>     ?s    p:P625   _:b0 .
>>   }
>> LIMIT   100
>>
>> (
>>
>>>
>>> Even with computing TDB stats it doesn't seem to perform well (not 
>>> sure if those steps have been taken into account, as usual I put 
>>> stats.opt into TDB dir). Took 180s even after I did a full count of 
>>> all 18.8B triples in advance to warm cache. 
>>
>> Counting by itself only warm triple indexes, not the node table, nor 
>> it's indexes.
>>
>> COUNT(*) or COUNT(?x) does not need the details of the RDF term 
>> itself. Term results out of TDB are lazily computed and COUNT, by 
>> design, does not trigger pulling from the node table.
>>
>>     Andy
>>
>>> I guess the files are rather larger
>>>
>>>> 373G    OSP.dat
>>>> 373G    POS.dat
>>>> 373G    SPO.dat
>>>> 186G    nodes-data.obj
>>>> 85G    nodes.dat
>>>> 1,3G    OSP.idn
>>>> 1,3G    POS.idn
>>>> 1,3G    SPO.idn
>>>> 720M    nodes.idn
>>> for computation it would touch which files first?
>>>
>>> By the way, counting all 18.8B triples took ~6000s - HDD read speed 
>>> was ~70M/s and given that we have 1.4TB disk size ...
>>>
>>> Long story short, with that slow HDD setup it takes ages or I'm doing 
>>> something fundamentally wrong. Will copy over the TDB image to 
>>> another server with SSD to see how things will change,.
>>>
>>>
>>> On 02.03.22 14:12, Andy Seaborne wrote:
>>>> > iotops showed ~400M/s while executing the last time. Does this
>>>> > performance drop really come from HDD vs SSD?
>>>>
>>>> Yes - it could well do.
>>>>
>>>> Try running the queries twice in the same server.
>>>>
>>>> TDB does no pre-work whatsoever so file system caching is significant.
>>>>
>>>> > Especially the last two
>>>> > queries just have different limits, so I assume the joins are just 
>>>> too
>>>> > heavy?
>>>>
>>>>     Andy
>>>>
>>>> On 02/03/2022 08:22, LB wrote:
>>>>> Hi all,
>>>>>
>>>>> just as a follow up I loaded Wikidata latest full into TDB2 via 
>>>>> xloader on a different less powerful server:
>>>>>
>>>>> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
>>>>> threads per core, -> 16C/32T)
>>>>> - 128GB RAM
>>>>> - non SSD RAID
>>>>>
>>>>> it took about 93h with  --threads 28; again I lost the logs because 
>>>>> somebody rebootet the server yesterday, will restart it soon to 
>>>>> keep logs on disk this time instead of terminal
>>>>>
>>>>> Afterwards I started querying a bit via Fuseki, and surprisingly 
>>>>> for a very common Wikidata query making use of qualifiers the 
>>>>> performance was rather low:
>>>>>
>>>>>> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET 
>>>>>> on port 3031
>>>>>> 16:24:54 INFO  Fuseki          :: [1] POST 
>>>>>> http://localhost:3031/ds/sparql
>>>>>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>>>>>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s 
>>>>>> wdt:P625 ?o } LIMIT 10
>>>>>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>>>>>> 16:25:57 INFO  Fuseki          :: [2] POST 
>>>>>> http://localhost:3031/ds/sparql
>>>>>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>>>>>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>>>>>> 16:26:51 INFO  Fuseki          :: [3] POST 
>>>>>> http://localhost:3031/ds/sparql
>>>>>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>>>>>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>>>>>> 16:27:21 INFO  Fuseki          :: [4] POST 
>>>>>> http://localhost:3031/ds/sparql
>>>>>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>>>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
>>>>>
>>>>> iotops showed ~400M/s while executing the last time. Does this 
>>>>> performance drop really come from HDD vs SSD? Especially the last 
>>>>> two queries just have different limits, so I assume the joins are 
>>>>> just too heavy?
>>>>>
>>>>>
>>>>> On 23.11.21 13:10, Andy Seaborne wrote:
>>>>>> Try loading truthy:
>>>>>>
>>>>>>
>>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>>>>>
>>>>>>
>>>>>> (it always has "BETA" in the name)
>>>>>>
>>>>>> which the current latest:
>>>>>>
>>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2 
>>>>>>
>>>>>>
>>>>>>     Andy
>>>>>>
>>>>>> On 23/11/2021 11:12, Marco Neumann wrote:
>>>>>>> that's on commodity hardware
>>>>>>>
>>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>>>>>
>>>>>>> load times are just load times. Including indexing I'm down to 
>>>>>>> 137,217 t/s
>>>>>>>
>>>>>>> sure with a billion triples I am down to 87kt/s
>>>>>>>
>>>>>>> but still reasonable for most of my use cases.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>>>>>> footprint
>>>>>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>>>>>
>>>>>>>>> I was surprised that the CLEAR ALL directive doesn't 
>>>>>>>>> remove/release disk
>>>>>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>>>>>
>>>>>>>> Any active read transactions can still see the old data. You can't
>>>>>>>> delete it for real.
>>>>>>>>
>>>>>>>> Run compact.
>>>>>>>>
>>>>>>>>> impressed to see that load times went up to 250k/s
>>>>>>>>
>>>>>>>> What was the hardware?
>>>>>>>>
>>>>>>>>> with 4.2. more than
>>>>>>>>> twice the speed I have seen with 3.15. Not sure if this is OS 
>>>>>>>>> (Ubuntu
>>>>>>>>> 20.04.3 LTS) related.
>>>>>>>>
>>>>>>>> You won't get 250k at scale. Loading rate slows for algorithmic 
>>>>>>>> reasons
>>>>>>>> and system reasons.
>>>>>>>>
>>>>>>>> Now try 500m!
>>>>>>>>
>>>>>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>>>>>> provide us
>>>>>>>>> with a production environment type machine to run some load and 
>>>>>>>>> query
>>>>>>>> tests.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>>>>>
>>>>>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on 
>>>>>>>>>> significant
>>>>>>>>>> sized literals - the node themselves are 50G). Obvious for 
>>>>>>>>>> current WD
>>>>>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>>>>>
>>>>>>>>>> One thing xloader gives us is that it makes it possible to 
>>>>>>>>>> load on a
>>>>>>>>>> spinning disk. (it also has lower peak intermediate file space 
>>>>>>>>>> and
>>>>>>>>>> faster because it does not fall into a slow loading mode for 
>>>>>>>>>> the node
>>>>>>>>>> table that tdbloader2 did sometimes.)
>>>>>>>>>>
>>>>>>>>>>        Andy
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne 
>>>>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>>>>>
>>>>>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>>>>>
>>>>>>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>>>>>>> contract):
>>>>>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>>>>>
>>>>>>>>>>>>> and Apache Jena came up:
>>>>>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>>>>>
>>>>>>>>>>>>> Realistically?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader 
>>>>>>>>>>>>> may help
>>>>>>>>>>>>> though the goal for that was to make loading the truthy 
>>>>>>>>>>>>> subset (5B)
>>>>>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>>>>>
>>>>>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Even if wikidata loads, it would be impractically slow as 
>>>>>>>>>>>>> TDB is
>>>>>>>> today.
>>>>>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The current discussions feel more like they are looking for a
>>>>>>>> "product"
>>>>>>>>>>>>> - a triplestore that they are use - rather than a 
>>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>>
>>>>>>>>>>>>>         Andy
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>

Re: Wikidata evolution

Posted by LB <co...@googlemail.com.INVALID>.
Hi Andy,

yes, I also tried the rewriting, which indeed performed faster. It turned 
out the issue here was that TDB2 didn't use the stats.opt file because it 
was in the wrong location. I'm still not convinced that it should be 
located in $TDB2_LOCATION/Data-XXX instead of $TDB2_LOCATION/ - 
especially since after a compact or some other operation you will have to 
move the stats file to the newer data directory.
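
(For reference, a minimal sketch of regenerating the statistics file after 
a compact - assuming the standard tdb2.tdbstats script from the Jena 
distribution, with illustrative paths:

   tdb2.tdbstats --loc /data/wikidata-tdb2 \
       > /data/wikidata-tdb2/Data-0001/stats.opt

After a compact the active generation becomes Data-0002 and so on, so the 
file would have to be regenerated or copied into the new directory.)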


I have now moved the TDB2 instance to the faster server: Ryzen 5950X, 
16C/32T, 128GB RAM, 3.4GB NVMe RAID 1

I did a few more queries which take a lot of time; either my machine is 
too slow or it is just how it is:

The count query to compute the dataset size

SELECT (count(*) as ?cnt) {
   ?s ?p ?o
}

Runtime: 2489.115 seconds

Observations are rather hard to interpret: iotop showed a read speed of 
150M/s, which sounds rather slow for an NVMe SSD. I also don't yet 
understand which files are touched. If those are just the index files, 
then I don't get why it takes so much time, given that the index files 
are rather small at ~1G.
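
(Back-of-the-envelope, not a measurement: if the full scan reads 
essentially one triple index, then SPO.dat at ~373G divided by ~150M/s is 
roughly 2,500s, which is close to the observed 2,489s - so the count looks 
simply disk-bound on a single index .dat file; the ~1G *.idn files are 
only part of each index, the bulk of the index data is in the *.dat 
files.)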


Some counting with a join and a filter:

SELECT (count(*) as ?cnt) WHERE {
   ?s wdt:P31 wd:Q5 ;
      rdfs:label ?l
   filter(lang(?l)='en')
}

Runtime: 4817.022 seconds


I compared those queries with the (public) QLever triple store; the 
latter query takes 2s there - admittedly, as this is on their public 
server the comparison is not fair, and maybe their init process does more 
caching in advance.

I'm also trying to set it up locally on the same server as the TDB2 
instance and will compare again - just learned that in future we should 
rent servers with way more disk space ... "lesson learned"



On 05.03.22 11:57, Andy Seaborne wrote:
> Two comments inline:
>
> On 02/03/2022 15:41, LB wrote:
>> Hm,
>>
>> coming back to this query
>>
>> SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>
>> I calculated the triple pattern sizes:
>>
>> p:P625: ~9M
>>
>> ps:P625: ~9M
>>
>> pq:P376: ~1K
>
> Also try to rewrite with :P376 first.
> SELECT * WHERE
>   { _:x ps:P625 ?o; pq:P376 ?body. ?s p:P625 _:x . }
> LIMIT 100
>
>
> which is:
>
>
> SELECT  *
> WHERE
>   { ?s    p:P625   _:b0 .
>     _:b0  ps:P625  ?o ;
>     _:b0  pq:P376  ?body
>   }
> LIMIT   100
>
> ==>
>
>
> SELECT  *
> WHERE
>   { _:b0  pq:P376  ?body .
>     _:b0  ps:P625  ?o ;
>     ?s    p:P625   _:b0 .
>   }
> LIMIT   100
>
> (
>
>>
>> Even with computing TDB stats it doesn't seem to perform well (not 
>> sure if those steps have been taken into account, as usual I put  
>> stats.opt into TDB dir). Took 180s even after I did a full count of 
>> all 18.8B triples in advance to warm cache. 
>
> Counting by itself only warm triple indexes, not the node table, nor 
> it's indexes.
>
> COUNT(*) or COUNT(?x) does not need the details of the RDF term 
> itself. Term results out of TDB are lazily computed and COUNT, by 
> design, does not trigger pulling from the node table.
>
>     Andy
>
>> I guess the files are rather larger
>>
>>> 373G    OSP.dat
>>> 373G    POS.dat
>>> 373G    SPO.dat
>>> 186G    nodes-data.obj
>>> 85G    nodes.dat
>>> 1,3G    OSP.idn
>>> 1,3G    POS.idn
>>> 1,3G    SPO.idn
>>> 720M    nodes.idn
>> for computation it would touch which files first?
>>
>> By the way, counting all 18.8B triples took ~6000s - HDD read speed 
>> was ~70M/s and given that we have 1.4TB disk size ...
>>
>> Long story short, with that slow HDD setup it takes ages or I'm doing 
>> something fundamentally wrong. Will copy over the TDB image to 
>> another server with SSD to see how things will change,.
>>
>>
>> On 02.03.22 14:12, Andy Seaborne wrote:
>>> > iotops showed ~400M/s while executing the last time. Does this
>>> > performance drop really come from HDD vs SSD?
>>>
>>> Yes - it could well do.
>>>
>>> Try running the queries twice in the same server.
>>>
>>> TDB does no pre-work whatsoever so file system caching is significant.
>>>
>>> > Especially the last two
>>> > queries just have different limits, so I assume the joins are just 
>>> too
>>> > heavy?
>>>
>>>     Andy
>>>
>>> On 02/03/2022 08:22, LB wrote:
>>>> Hi all,
>>>>
>>>> just as a follow up I loaded Wikidata latest full into TDB2 via 
>>>> xloader on a different less powerful server:
>>>>
>>>> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
>>>> threads per core, -> 16C/32T)
>>>> - 128GB RAM
>>>> - non SSD RAID
>>>>
>>>> it took about 93h with  --threads 28; again I lost the logs because 
>>>> somebody rebootet the server yesterday, will restart it soon to 
>>>> keep logs on disk this time instead of terminal
>>>>
>>>> Afterwards I started querying a bit via Fuseki, and surprisingly 
>>>> for a very common Wikidata query making use of qualifiers the 
>>>> performance was rather low:
>>>>
>>>>> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET 
>>>>> on port 3031
>>>>> 16:24:54 INFO  Fuseki          :: [1] POST 
>>>>> http://localhost:3031/ds/sparql
>>>>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>>>>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s 
>>>>> wdt:P625 ?o } LIMIT 10
>>>>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>>>>> 16:25:57 INFO  Fuseki          :: [2] POST 
>>>>> http://localhost:3031/ds/sparql
>>>>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>> WHERE {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>>>>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>>>>> 16:26:51 INFO  Fuseki          :: [3] POST 
>>>>> http://localhost:3031/ds/sparql
>>>>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>>>>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>>>>> 16:27:21 INFO  Fuseki          :: [4] POST 
>>>>> http://localhost:3031/ds/sparql
>>>>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * 
>>>>> WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
>>>>
>>>> iotops showed ~400M/s while executing the last time. Does this 
>>>> performance drop really come from HDD vs SSD? Especially the last 
>>>> two queries just have different limits, so I assume the joins are 
>>>> just too heavy?
>>>>
>>>>
>>>> On 23.11.21 13:10, Andy Seaborne wrote:
>>>>> Try loading truthy:
>>>>>
>>>>>
>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>>>>
>>>>>
>>>>> (it always has "BETA" in the name)
>>>>>
>>>>> which the current latest:
>>>>>
>>>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2 
>>>>>
>>>>>
>>>>>     Andy
>>>>>
>>>>> On 23/11/2021 11:12, Marco Neumann wrote:
>>>>>> that's on commodity hardware
>>>>>>
>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>>>>
>>>>>> load times are just load times. Including indexing I'm down to 
>>>>>> 137,217 t/s
>>>>>>
>>>>>> sure with a billion triples I am down to 87kt/s
>>>>>>
>>>>>> but still reasonable for most of my use cases.
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>>>>> footprint
>>>>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>>>>
>>>>>>>> I was surprised that the CLEAR ALL directive doesn't 
>>>>>>>> remove/release disk
>>>>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>>>>
>>>>>>> Any active read transactions can still see the old data. You can't
>>>>>>> delete it for real.
>>>>>>>
>>>>>>> Run compact.
>>>>>>>
>>>>>>>> impressed to see that load times went up to 250k/s
>>>>>>>
>>>>>>> What was the hardware?
>>>>>>>
>>>>>>>> with 4.2. more than
>>>>>>>> twice the speed I have seen with 3.15. Not sure if this is OS 
>>>>>>>> (Ubuntu
>>>>>>>> 20.04.3 LTS) related.
>>>>>>>
>>>>>>> You won't get 250k at scale. Loading rate slows for algorithmic 
>>>>>>> reasons
>>>>>>> and system reasons.
>>>>>>>
>>>>>>> Now try 500m!
>>>>>>>
>>>>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>>>>> provide us
>>>>>>>> with a production environment type machine to run some load and 
>>>>>>>> query
>>>>>>> tests.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>>>>
>>>>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on 
>>>>>>>>> significant
>>>>>>>>> sized literals - the node themselves are 50G). Obvious for 
>>>>>>>>> current WD
>>>>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>>>>
>>>>>>>>> One thing xloader gives us is that it makes it possible to 
>>>>>>>>> load on a
>>>>>>>>> spinning disk. (it also has lower peak intermediate file space 
>>>>>>>>> and
>>>>>>>>> faster because it does not fall into a slow loading mode for 
>>>>>>>>> the node
>>>>>>>>> table that tdbloader2 did sometimes.)
>>>>>>>>>
>>>>>>>>>        Andy
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne 
>>>>>>>>>> <an...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>>>>
>>>>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>>>>
>>>>>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>>>>>> contract):
>>>>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>>>>
>>>>>>>>>>>> and Apache Jena came up:
>>>>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>>>>
>>>>>>>>>>>> Realistically?
>>>>>>>>>>>>
>>>>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader 
>>>>>>>>>>>> may help
>>>>>>>>>>>> though the goal for that was to make loading the truthy 
>>>>>>>>>>>> subset (5B)
>>>>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>>>>
>>>>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Even if wikidata loads, it would be impractically slow as 
>>>>>>>>>>>> TDB is
>>>>>>> today.
>>>>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>>>>
>>>>>>>>>>>> The current discussions feel more like they are looking for a
>>>>>>> "product"
>>>>>>>>>>>> - a triplestore that they are use - rather than a 
>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>
>>>>>>>>>>>>         Andy
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.
Two comments inline:

On 02/03/2022 15:41, LB wrote:
> Hm,
> 
> coming back to this query
> 
> SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
> 
> I calculated the triple pattern sizes:
> 
> p:P625: ~9M
> 
> ps:P625: ~9M
> 
> pq:P376: ~1K

Also try rewriting with pq:P376 first.
SELECT * WHERE
   { _:x ps:P625 ?o; pq:P376 ?body. ?s p:P625 _:x . }
LIMIT 100


which is:


SELECT  *
WHERE
   { ?s    p:P625   _:b0 .
     _:b0  ps:P625  ?o .
     _:b0  pq:P376  ?body
   }
LIMIT   100

==>


SELECT  *
WHERE
   { _:b0  pq:P376  ?body .
     _:b0  ps:P625  ?o .
     ?s    p:P625   _:b0 .
   }
LIMIT   100


> 
> Even with computing TDB stats it doesn't seem to perform well (not sure 
> if those steps have been taken into account, as usual I put  stats.opt 
> into TDB dir). Took 180s even after I did a full count of all 18.8B 
> triples in advance to warm cache. 

Counting by itself only warms the triple indexes, not the node table, nor 
its indexes.

COUNT(*) or COUNT(?x) does not need the details of the RDF term itself. 
Term results out of TDB are lazily computed and COUNT, by design, does 
not trigger pulling from the node table.
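
A sketch of the difference, if warming the node table is the goal - any 
expression that needs the lexical form of a term pulls it from the node 
table, so (for example) the second count below would also touch nodes.dat 
and nodes-data.obj, unlike the plain COUNT(*):

   SELECT (COUNT(*) AS ?cnt) WHERE { ?s ?p ?o }

   SELECT (COUNT(*) AS ?cnt)
   WHERE  { ?s ?p ?o FILTER(CONTAINS(STR(?o), "a")) }

The exact filter doesn't matter; it just has to materialize the terms.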

     Andy

> I guess the files are rather larger
> 
>> 373G    OSP.dat
>> 373G    POS.dat
>> 373G    SPO.dat
>> 186G    nodes-data.obj
>> 85G    nodes.dat
>> 1,3G    OSP.idn
>> 1,3G    POS.idn
>> 1,3G    SPO.idn
>> 720M    nodes.idn
> for computation it would touch which files first?
> 
> By the way, counting all 18.8B triples took ~6000s - HDD read speed was 
> ~70M/s and given that we have 1.4TB disk size ...
> 
> Long story short, with that slow HDD setup it takes ages or I'm doing 
> something fundamentally wrong. Will copy over the TDB image to another 
> server with SSD to see how things will change,.
> 
> 
> On 02.03.22 14:12, Andy Seaborne wrote:
>> > iotops showed ~400M/s while executing the last time. Does this
>> > performance drop really come from HDD vs SSD?
>>
>> Yes - it could well do.
>>
>> Try running the queries twice in the same server.
>>
>> TDB does no pre-work whatsoever so file system caching is significant.
>>
>> > Especially the last two
>> > queries just have different limits, so I assume the joins are just too
>> > heavy?
>>
>>     Andy
>>
>> On 02/03/2022 08:22, LB wrote:
>>> Hi all,
>>>
>>> just as a follow up I loaded Wikidata latest full into TDB2 via 
>>> xloader on a different less powerful server:
>>>
>>> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
>>> threads per core, -> 16C/32T)
>>> - 128GB RAM
>>> - non SSD RAID
>>>
>>> it took about 93h with  --threads 28; again I lost the logs because 
>>> somebody rebootet the server yesterday, will restart it soon to keep 
>>> logs on disk this time instead of terminal
>>>
>>> Afterwards I started querying a bit via Fuseki, and surprisingly for 
>>> a very common Wikidata query making use of qualifiers the performance 
>>> was rather low:
>>>
>>>> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET on 
>>>> port 3031
>>>> 16:24:54 INFO  Fuseki          :: [1] POST 
>>>> http://localhost:3031/ds/sparql
>>>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>>>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s wdt:P625 
>>>> ?o } LIMIT 10
>>>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>>>> 16:25:57 INFO  Fuseki          :: [2] POST 
>>>> http://localhost:3031/ds/sparql
>>>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>>>> {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>>>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>>>> 16:26:51 INFO  Fuseki          :: [3] POST 
>>>> http://localhost:3031/ds/sparql
>>>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>>>> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>>>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>>>> 16:27:21 INFO  Fuseki          :: [4] POST 
>>>> http://localhost:3031/ds/sparql
>>>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>>>> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
>>>
>>> iotops showed ~400M/s while executing the last time. Does this 
>>> performance drop really come from HDD vs SSD? Especially the last two 
>>> queries just have different limits, so I assume the joins are just 
>>> too heavy?
>>>
>>>
>>> On 23.11.21 13:10, Andy Seaborne wrote:
>>>> Try loading truthy:
>>>>
>>>>
>>>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>>>
>>>>
>>>> (it always has "BETA" in the name)
>>>>
>>>> which the current latest:
>>>>
>>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
>>>>
>>>>     Andy
>>>>
>>>> On 23/11/2021 11:12, Marco Neumann wrote:
>>>>> that's on commodity hardware
>>>>>
>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>>>
>>>>> load times are just load times. Including indexing I'm down to 
>>>>> 137,217 t/s
>>>>>
>>>>> sure with a billion triples I am down to 87kt/s
>>>>>
>>>>> but still reasonable for most of my use cases.
>>>>>
>>>>>
>>>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>>>> footprint
>>>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>>>
>>>>>>> I was surprised that the CLEAR ALL directive doesn't 
>>>>>>> remove/release disk
>>>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>>>
>>>>>> Any active read transactions can still see the old data. You can't
>>>>>> delete it for real.
>>>>>>
>>>>>> Run compact.
>>>>>>
>>>>>>> impressed to see that load times went up to 250k/s
>>>>>>
>>>>>> What was the hardware?
>>>>>>
>>>>>>> with 4.2. more than
>>>>>>> twice the speed I have seen with 3.15. Not sure if this is OS 
>>>>>>> (Ubuntu
>>>>>>> 20.04.3 LTS) related.
>>>>>>
>>>>>> You won't get 250k at scale. Loading rate slows for algorithmic 
>>>>>> reasons
>>>>>> and system reasons.
>>>>>>
>>>>>> Now try 500m!
>>>>>>
>>>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>>>> provide us
>>>>>>> with a production environment type machine to run some load and 
>>>>>>> query
>>>>>> tests.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>>>
>>>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on 
>>>>>>>> significant
>>>>>>>> sized literals - the node themselves are 50G). Obvious for 
>>>>>>>> current WD
>>>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>>>
>>>>>>>> One thing xloader gives us is that it makes it possible to load 
>>>>>>>> on a
>>>>>>>> spinning disk. (it also has lower peak intermediate file space and
>>>>>>>> faster because it does not fall into a slow loading mode for the 
>>>>>>>> node
>>>>>>>> table that tdbloader2 did sometimes.)
>>>>>>>>
>>>>>>>>        Andy
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>>>
>>>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>>>
>>>>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>>>>> contract):
>>>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>>>
>>>>>>>>>>> and Apache Jena came up:
>>>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>>>
>>>>>>>>>>> Realistically?
>>>>>>>>>>>
>>>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader may 
>>>>>>>>>>> help
>>>>>>>>>>> though the goal for that was to make loading the truthy 
>>>>>>>>>>> subset (5B)
>>>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>>>
>>>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Even if wikidata loads, it would be impractically slow as TDB is
>>>>>> today.
>>>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>>>
>>>>>>>>>>> The current discussions feel more like they are looking for a
>>>>>> "product"
>>>>>>>>>>> - a triplestore that they are use - rather than a collaboration.
>>>>>>>>>>>
>>>>>>>>>>>         Andy
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>

Re: Wikidata evolution

Posted by LB <co...@googlemail.com.INVALID>.
Hm,

coming back to this query

SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100

I calculated the triple pattern sizes:

p:P625: ~9M

ps:P625: ~9M

pq:P376: ~1K

Even with computing TDB stats it doesn't seem to perform well (not sure 
if those steps have been taken into account; as usual I put stats.opt 
into the TDB dir). It took 180s even after I did a full count of all 
18.8B triples in advance to warm the cache. I guess the files are rather 
large:

> 373G    OSP.dat
> 373G    POS.dat
> 373G    SPO.dat
> 186G    nodes-data.obj
> 85G    nodes.dat
> 1,3G    OSP.idn
> 1,3G    POS.idn
> 1,3G    SPO.idn
> 720M    nodes.idn
Which files would the computation touch first?

By the way, counting all 18.8B triples took ~6000s - the HDD read speed 
was ~70M/s, and the database is about 1.4TB on disk ...

Long story short, with that slow HDD setup it takes ages, or I'm doing 
something fundamentally wrong. I will copy the TDB image over to another 
server with an SSD to see how things change.


On 02.03.22 14:12, Andy Seaborne wrote:
> > iotops showed ~400M/s while executing the last time. Does this
> > performance drop really come from HDD vs SSD?
>
> Yes - it could well do.
>
> Try running the queries twice in the same server.
>
> TDB does no pre-work whatsoever so file system caching is significant.
>
> > Especially the last two
> > queries just have different limits, so I assume the joins are just too
> > heavy?
>
>     Andy
>
> On 02/03/2022 08:22, LB wrote:
>> Hi all,
>>
>> just as a follow up I loaded Wikidata latest full into TDB2 via 
>> xloader on a different less powerful server:
>>
>> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
>> threads per core, -> 16C/32T)
>> - 128GB RAM
>> - non SSD RAID
>>
>> it took about 93h with  --threads 28; again I lost the logs because 
>> somebody rebootet the server yesterday, will restart it soon to keep 
>> logs on disk this time instead of terminal
>>
>> Afterwards I started querying a bit via Fuseki, and surprisingly for 
>> a very common Wikidata query making use of qualifiers the performance 
>> was rather low:
>>
>>> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET on 
>>> port 3031
>>> 16:24:54 INFO  Fuseki          :: [1] POST 
>>> http://localhost:3031/ds/sparql
>>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s wdt:P625 
>>> ?o } LIMIT 10
>>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>>> 16:25:57 INFO  Fuseki          :: [2] POST 
>>> http://localhost:3031/ds/sparql
>>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>>> {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>>> 16:26:51 INFO  Fuseki          :: [3] POST 
>>> http://localhost:3031/ds/sparql
>>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>>> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>>> 16:27:21 INFO  Fuseki          :: [4] POST 
>>> http://localhost:3031/ds/sparql
>>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>>> <http://www.wikidata.org/prop/> PREFIX ps: 
>>> <http://www.wikidata.org/prop/statement/>  PREFIX
>>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>>> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
>>
>> iotops showed ~400M/s while executing the last time. Does this 
>> performance drop really come from HDD vs SSD? Especially the last two 
>> queries just have different limits, so I assume the joins are just 
>> too heavy?
>>
>>
>> On 23.11.21 13:10, Andy Seaborne wrote:
>>> Try loading truthy:
>>>
>>>
>>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>>
>>>
>>> (it always has "BETA" in the name)
>>>
>>> which the current latest:
>>>
>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
>>>
>>>     Andy
>>>
>>> On 23/11/2021 11:12, Marco Neumann wrote:
>>>> that's on commodity hardware
>>>>
>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>>
>>>> load times are just load times. Including indexing I'm down to 
>>>> 137,217 t/s
>>>>
>>>> sure with a billion triples I am down to 87kt/s
>>>>
>>>> but still reasonable for most of my use cases.
>>>>
>>>>
>>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> 
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>>> footprint
>>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>>
>>>>>> I was surprised that the CLEAR ALL directive doesn't 
>>>>>> remove/release disk
>>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>>
>>>>> Any active read transactions can still see the old data. You can't
>>>>> delete it for real.
>>>>>
>>>>> Run compact.
>>>>>
>>>>>> impressed to see that load times went up to 250k/s
>>>>>
>>>>> What was the hardware?
>>>>>
>>>>>> with 4.2. more than
>>>>>> twice the speed I have seen with 3.15. Not sure if this is OS 
>>>>>> (Ubuntu
>>>>>> 20.04.3 LTS) related.
>>>>>
>>>>> You won't get 250k at scale. Loading rate slows for algorithmic 
>>>>> reasons
>>>>> and system reasons.
>>>>>
>>>>> Now try 500m!
>>>>>
>>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>>> provide us
>>>>>> with a production environment type machine to run some load and 
>>>>>> query
>>>>> tests.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>>
>>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on 
>>>>>>> significant
>>>>>>> sized literals - the node themselves are 50G). Obvious for 
>>>>>>> current WD
>>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>>
>>>>>>> One thing xloader gives us is that it makes it possible to load 
>>>>>>> on a
>>>>>>> spinning disk. (it also has lower peak intermediate file space and
>>>>>>> faster because it does not fall into a slow loading mode for the 
>>>>>>> node
>>>>>>> table that tdbloader2 did sometimes.)
>>>>>>>
>>>>>>>        Andy
>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>>
>>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>>
>>>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>>>> contract):
>>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>>
>>>>>>>>>> and Apache Jena came up:
>>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>>
>>>>>>>>>> Realistically?
>>>>>>>>>>
>>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader may 
>>>>>>>>>> help
>>>>>>>>>> though the goal for that was to make loading the truthy 
>>>>>>>>>> subset (5B)
>>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>>
>>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Even if wikidata loads, it would be impractically slow as TDB is
>>>>> today.
>>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>>
>>>>>>>>>> The current discussions feel more like they are looking for a
>>>>> "product"
>>>>>>>>>> - a triplestore that they are use - rather than a collaboration.
>>>>>>>>>>
>>>>>>>>>>         Andy
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.
 > iotops showed ~400M/s while executing the last time. Does this
 > performance drop really come from HDD vs SSD?

Yes - it could well do.

Try running the queries twice on the same server.

TDB does no pre-work whatsoever so file system caching is significant.
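
(One crude way to warm the file system cache - a general OS trick, not a 
TDB feature, and paths here are illustrative - is to read the database 
files once before querying:

   cat /data/wikidata-tdb2/Data-0001/*.dat \
       /data/wikidata-tdb2/Data-0001/*.obj > /dev/null

With ~1.4TB of files and 128GB of RAM only a fraction can stay cached, so 
this is of limited use at that scale.)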

 > Especially the last two
 > queries just have different limits, so I assume the joins are just too
 > heavy?

     Andy

On 02/03/2022 08:22, LB wrote:
> Hi all,
> 
> just as a follow up I loaded Wikidata latest full into TDB2 via xloader 
> on a different less powerful server:
> 
> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
> threads per core, -> 16C/32T)
> - 128GB RAM
> - non SSD RAID
> 
> it took about 93h with  --threads 28; again I lost the logs because 
> somebody rebootet the server yesterday, will restart it soon to keep 
> logs on disk this time instead of terminal
> 
> Afterwards I started querying a bit via Fuseki, and surprisingly for a 
> very common Wikidata query making use of qualifiers the performance was 
> rather low:
> 
>> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET on 
>> port 3031
>> 16:24:54 INFO  Fuseki          :: [1] POST 
>> http://localhost:3031/ds/sparql
>> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
>> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s wdt:P625 ?o 
>> } LIMIT 10
>> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
>> 16:25:57 INFO  Fuseki          :: [2] POST 
>> http://localhost:3031/ds/sparql
>> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
>> <http://www.wikidata.org/prop/> PREFIX ps: 
>> <http://www.wikidata.org/prop/statement/>  PREFIX
>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>> {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
>> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
>> 16:26:51 INFO  Fuseki          :: [3] POST 
>> http://localhost:3031/ds/sparql
>> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
>> <http://www.wikidata.org/prop/> PREFIX ps: 
>> <http://www.wikidata.org/prop/statement/>  PREFIX
>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
>> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
>> 16:27:21 INFO  Fuseki          :: [4] POST 
>> http://localhost:3031/ds/sparql
>> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
>> <http://www.wikidata.org/prop/> PREFIX ps: 
>> <http://www.wikidata.org/prop/statement/>  PREFIX
>> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
>> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
>> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
> 
> iotops showed ~400M/s while executing the last time. Does this 
> performance drop really come from HDD vs SSD? Especially the last two 
> queries just have different limits, so I assume the joins are just too 
> heavy?
> 
> 
> On 23.11.21 13:10, Andy Seaborne wrote:
>> Try loading truthy:
>>
>>
>> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>>
>>
>> (it always has "BETA" in the name)
>>
>> which the current latest:
>>
>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
>>
>>     Andy
>>
>> On 23/11/2021 11:12, Marco Neumann wrote:
>>> that's on commodity hardware
>>>
>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>>
>>> load times are just load times. Including indexing I'm down to 
>>> 137,217 t/s
>>>
>>> sure with a billion triples I am down to 87kt/s
>>>
>>> but still reasonable for most of my use cases.
>>>
>>>
>>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> wrote:
>>>
>>>>
>>>>
>>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>>> footprint
>>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>>
>>>>> I was surprised that the CLEAR ALL directive doesn't remove/release 
>>>>> disk
>>>>> memory. Does TDB2 require a commit to release disk space?
>>>>
>>>> Any active read transactions can still see the old data. You can't
>>>> delete it for real.
>>>>
>>>> Run compact.
>>>>
>>>>> impressed to see that load times went up to 250k/s
>>>>
>>>> What was the hardware?
>>>>
>>>>> with 4.2. more than
>>>>> twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
>>>>> 20.04.3 LTS) related.
>>>>
>>>> You won't get 250k at scale. Loading rate slows for algorithmic reasons
>>>> and system reasons.
>>>>
>>>> Now try 500m!
>>>>
>>>>> Maybe we should make a recommendation to the wikidata team to 
>>>>> provide us
>>>>> with a production environment type machine to run some load and query
>>>> tests.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>>
>>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
>>>>>> sized literals - the node themselves are 50G). Obvious for current WD
>>>>>> scale usage a sprinkling of compression would be good!
>>>>>>
>>>>>> One thing xloader gives us is that it makes it possible to load on a
>>>>>> spinning disk. (it also has lower peak intermediate file space and
>>>>>> faster because it does not fall into a slow loading mode for the node
>>>>>> table that tdbloader2 did sometimes.)
>>>>>>
>>>>>>        Andy
>>>>>>
>>>>>>>
>>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>>
>>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>>
>>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>>> contract):
>>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>>
>>>>>>>>> and Apache Jena came up:
>>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>>
>>>>>>>>> Realistically?
>>>>>>>>>
>>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader may help
>>>>>>>>> though the goal for that was to make loading the truthy subset 
>>>>>>>>> (5B)
>>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>>
>>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Even if wikidata loads, it would be impractically slow as TDB is
>>>> today.
>>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>>
>>>>>>>>> The current discussions feel more like they are looking for a
>>>> "product"
>>>>>>>>> - a triplestore that they are use - rather than a collaboration.
>>>>>>>>>
>>>>>>>>>         Andy
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>

Re: Wikidata evolution

Posted by LB <co...@googlemail.com.INVALID>.
Hi all,

just as a follow-up I loaded the latest full Wikidata dump into TDB2 via 
xloader on a different, less powerful server:

- 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 
threads per core, -> 16C/32T)
- 128GB RAM
- non SSD RAID

it took about 93h with --threads 28; again I lost the logs because 
somebody rebooted the server yesterday. I will restart it soon and keep 
the logs on disk this time instead of only in the terminal.
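
(A minimal sketch for keeping the loader log next time - command line and 
paths are illustrative, assuming the tdb2.xloader script from the 
distribution:

   nohup tdb2.xloader --loc /data/wikidata-tdb2 --threads 28 \
       latest-all.nt.gz > xloader.log 2>&1 &

nohup plus the redirect keeps the log on disk and survives a dropped 
terminal; it won't survive another reboot, but the log written so far is 
preserved.)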

Afterwards I started querying a bit via Fuseki, and surprisingly for a 
very common Wikidata query making use of qualifiers the performance was 
rather low:

> 16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET on 
> port 3031
> 16:24:54 INFO  Fuseki          :: [1] POST 
> http://localhost:3031/ds/sparql
> 16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: 
> <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s wdt:P625 ?o 
> } LIMIT 10
> 16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
> 16:25:57 INFO  Fuseki          :: [2] POST 
> http://localhost:3031/ds/sparql
> 16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: 
> <http://www.wikidata.org/prop/> PREFIX ps: 
> <http://www.wikidata.org/prop/statement/>  PREFIX
> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
> {   ?s p:P625 [ps:P625 ?o] } LIMIT 10
> 16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
> 16:26:51 INFO  Fuseki          :: [3] POST 
> http://localhost:3031/ds/sparql
> 16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: 
> <http://www.wikidata.org/prop/> PREFIX ps: 
> <http://www.wikidata.org/prop/statement/>  PREFIX
> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
> 16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
> 16:27:21 INFO  Fuseki          :: [4] POST 
> http://localhost:3031/ds/sparql
> 16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: 
> <http://www.wikidata.org/prop/> PREFIX ps: 
> <http://www.wikidata.org/prop/statement/>  PREFIX
> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE 
> {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
> 16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)

iotop showed ~400M/s while executing the last query. Does this 
performance drop really come from HDD vs SSD? In particular, the last two 
queries just have different limits, so I assume the joins are just too 
heavy?


On 23.11.21 13:10, Andy Seaborne wrote:
> Try loading truthy:
>
>
> https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2 
>
>
> (it always has "BETA" in the name)
>
> which the current latest:
>
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
>
>     Andy
>
> On 23/11/2021 11:12, Marco Neumann wrote:
>> that's on commodity hardware
>>
>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>
>> load times are just load times. Including indexing I'm down to 
>> 137,217 t/s
>>
>> sure with a billion triples I am down to 87kt/s
>>
>> but still reasonable for most of my use cases.
>>
>>
>> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> wrote:
>>
>>>
>>>
>>> On 22/11/2021 21:14, Marco Neumann wrote:
>>>> Yes I just had a look at one of my own datasets with 180mt and a
>>> footprint
>>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>>
>>>> I was surprised that the CLEAR ALL directive doesn't remove/release 
>>>> disk
>>>> memory. Does TDB2 require a commit to release disk space?
>>>
>>> Any active read transactions can still see the old data. You can't
>>> delete it for real.
>>>
>>> Run compact.
>>>
>>>> impressed to see that load times went up to 250k/s
>>>
>>> What was the hardware?
>>>
>>>> with 4.2. more than
>>>> twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
>>>> 20.04.3 LTS) related.
>>>
>>> You won't get 250k at scale. Loading rate slows for algorithmic reasons
>>> and system reasons.
>>>
>>> Now try 500m!
>>>
>>>> Maybe we should make a recommendation to the wikidata team to 
>>>> provide us
>>>> with a production environment type machine to run some load and query
>>> tests.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>>
>>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
>>>>> sized literals - the node themselves are 50G). Obvious for current WD
>>>>> scale usage a sprinkling of compression would be good!
>>>>>
>>>>> One thing xloader gives us is that it makes it possible to load on a
>>>>> spinning disk. (it also has lower peak intermediate file space and
>>>>> faster because it does not fall into a slow loading mode for the node
>>>>> table that tdbloader2 did sometimes.)
>>>>>
>>>>>        Andy
>>>>>
>>>>>>
>>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>>
>>>>>>>> About WDQS, current scale and current challenges
>>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>>
>>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>>> contract):
>>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>>
>>>>>>>> and Apache Jena came up:
>>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>>
>>>>>>>> Realistically?
>>>>>>>>
>>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader may help
>>>>>>>> though the goal for that was to make loading the truthy subset 
>>>>>>>> (5B)
>>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>>
>>>>>>> And it's growing at about 1B per quarter.
>>>>>>>
>>>>>>>
>>>>>
>>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy 
>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Even if wikidata loads, it would be impractically slow as TDB is
>>> today.
>>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>>
>>>>>>>> The current discussions feel more like they are looking for a
>>> "product"
>>>>>>>> - a triplestore that they are use - rather than a collaboration.
>>>>>>>>
>>>>>>>>         Andy
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.
Try loading truthy:

 
https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2

(it always has "BETA" in the name)

which is the current latest:

https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
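
A minimal sketch of such a load, with an illustrative target directory - 
whether xloader reads .bz2 directly is an assumption here; if not, 
decompress or recompress to .gz first:

   wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
   tdb2.xloader --loc /data/truthy-tdb2 latest-truthy.nt.bz2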

     Andy

On 23/11/2021 11:12, Marco Neumann wrote:
> that's on commodity hardware
> 
> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> 
> load times are just load times. Including indexing I'm down to 137,217 t/s
> 
> sure with a billion triples I am down to 87kt/s
> 
> but still reasonable for most of my use cases.
> 
> 
> On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> wrote:
> 
>>
>>
>> On 22/11/2021 21:14, Marco Neumann wrote:
>>> Yes I just had a look at one of my own datasets with 180mt and a
>> footprint
>>> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
>>>
>>> I was surprised that the CLEAR ALL directive doesn't remove/release disk
>>> memory. Does TDB2 require a commit to release disk space?
>>
>> Any active read transactions can still see the old data. You can't
>> delete it for real.
>>
>> Run compact.
>>
>>> impressed to see that load times went up to 250k/s
>>
>> What was the hardware?
>>
>>> with 4.2. more than
>>> twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
>>> 20.04.3 LTS) related.
>>
>> You won't get 250k at scale. Loading rate slows for algorithmic reasons
>> and system reasons.
>>
>> Now try 500m!
>>
>>> Maybe we should make a recommendation to the wikidata team to provide us
>>> with a production environment type machine to run some load and query
>> tests.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
>>>
>>>>
>>>>
>>>> On 21/11/2021 21:03, Marco Neumann wrote:
>>>>> What's the disk footprint these days for 1b on tdb2?
>>>>
>>>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
>>>> sized literals - the node themselves are 50G). Obvious for current WD
>>>> scale usage a sprinkling of compression would be good!
>>>>
>>>> One thing xloader gives us is that it makes it possible to load on a
>>>> spinning disk. (it also has lower peak intermediate file space and
>>>> faster because it does not fall into a slow loading mode for the node
>>>> table that tdbloader2 did sometimes.)
>>>>
>>>>        Andy
>>>>
>>>>>
>>>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>>>
>>>>>>> About WDQS, current scale and current challenges
>>>>>>>       https://youtu.be/wn2BrQomvFU?t=9148
>>>>>>>
>>>>>>> And in the process of appointing a graph consultant: (5 month
>>>> contract):
>>>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>>>
>>>>>>> and Apache Jena came up:
>>>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>>>
>>>>>>> Realistically?
>>>>>>>
>>>>>>> Full wikidata is 16B triples. Very hard to load - xloader may help
>>>>>>> though the goal for that was to make loading the truthy subset (5B)
>>>>>>> easier. 5B -> 16B is not a trivial step.
>>>>>>
>>>>>> And it's growing at about 1B per quarter.
>>>>>>
>>>>>>
>>>>
>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
>>>>>>
>>>>>>>
>>>>>>> Even if wikidata loads, it would be impractically slow as TDB is
>> today.
>>>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>>>
>>>>>>> The current discussions feel more like they are looking for a
>> "product"
>>>>>>> - a triplestore that they are use - rather than a collaboration.
>>>>>>>
>>>>>>>         Andy
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 

Re: Wikidata evolution

Posted by Marco Neumann <ma...@gmail.com>.
that's on commodity hardware

http://www.lotico.com/index.php/JENA_Loader_Benchmarks

load times are just load times. Including indexing I'm down to 137,217 t/s

sure with a billion triples I am down to 87kt/s

but still reasonable for most of my use cases.


On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 22/11/2021 21:14, Marco Neumann wrote:
> > Yes I just had a look at one of my own datasets with 180mt and a
> footprint
> > of 28G. The overhead is not too bad at 10-20%. vs raw nt files
> >
> > I was surprised that the CLEAR ALL directive doesn't remove/release disk
> > memory. Does TDB2 require a commit to release disk space?
>
> Any active read transactions can still see the old data. You can't
> delete it for real.
>
> Run compact.
>
> > impressed to see that load times went up to 250k/s
>
> What was the hardware?
>
> > with 4.2. more than
> > twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
> > 20.04.3 LTS) related.
>
> You won't get 250k at scale. Loading rate slows for algorithmic reasons
> and system reasons.
>
> Now try 500m!
>
> > Maybe we should make a recommendation to the wikidata team to provide us
> > with a production environment type machine to run some load and query
> tests.
> >
> >
> >
> >
> >
> >
> > On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
> >
> >>
> >>
> >> On 21/11/2021 21:03, Marco Neumann wrote:
> >>> What's the disk footprint these days for 1b on tdb2?
> >>
> >> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
> >> sized literals - the node themselves are 50G). Obvious for current WD
> >> scale usage a sprinkling of compression would be good!
> >>
> >> One thing xloader gives us is that it makes it possible to load on a
> >> spinning disk. (it also has lower peak intermediate file space and
> >> faster because it does not fall into a slow loading mode for the node
> >> table that tdbloader2 did sometimes.)
> >>
> >>       Andy
> >>
> >>>
> >>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 20/11/2021 14:21, Andy Seaborne wrote:
> >>>>> Wikidata are looking for a replace for BlazeGraph
> >>>>>
> >>>>> About WDQS, current scale and current challenges
> >>>>>      https://youtu.be/wn2BrQomvFU?t=9148
> >>>>>
> >>>>> And in the process of appointing a graph consultant: (5 month
> >> contract):
> >>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
> >>>>>
> >>>>> and Apache Jena came up:
> >>>>> https://phabricator.wikimedia.org/T206560#7517212
> >>>>>
> >>>>> Realistically?
> >>>>>
> >>>>> Full wikidata is 16B triples. Very hard to load - xloader may help
> >>>>> though the goal for that was to make loading the truthy subset (5B)
> >>>>> easier. 5B -> 16B is not a trivial step.
> >>>>
> >>>> And it's growing at about 1B per quarter.
> >>>>
> >>>>
> >>
> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
> >>>>
> >>>>>
> >>>>> Even if wikidata loads, it would be impractically slow as TDB is
> today.
> >>>>> (yes, that's fixable; not practical in their timescales.)
> >>>>>
> >>>>> The current discussions feel more like they are looking for a
> "product"
> >>>>> - a triplestore that they are use - rather than a collaboration.
> >>>>>
> >>>>>        Andy
> >>>>
> >>>
> >>>
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.

On 22/11/2021 21:14, Marco Neumann wrote:
> Yes I just had a look at one of my own datasets with 180mt and a footprint
> of 28G. The overhead is not too bad at 10-20%. vs raw nt files
> 
> I was surprised that the CLEAR ALL directive doesn't remove/release disk
> memory. Does TDB2 require a commit to release disk space?

Any active read transactions can still see the old data. You can't 
delete it for real.

Run compact.
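
For the record, a sketch - assuming the tdb2.tdbcompact script and an 
illustrative database path:

   tdb2.tdbcompact --loc /data/DB

Compaction writes a new Data-NNNN generation; the old generation's 
directory can then be deleted to actually free the disk space.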

> impressed to see that load times went up to 250k/s

What was the hardware?

> with 4.2. more than
> twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu
> 20.04.3 LTS) related.

You won't get 250k/s at scale. The loading rate slows for algorithmic 
and system reasons.

Now try 500m!

> Maybe we should make a recommendation to the wikidata team to provide us
> with a production environment type machine to run some load and query tests.
> 
> 
> 
> 
> 
> 
> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:
> 
>>
>>
>> On 21/11/2021 21:03, Marco Neumann wrote:
>>> What's the disk footprint these days for 1b on tdb2?
>>
>> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
>> sized literals - the node themselves are 50G). Obvious for current WD
>> scale usage a sprinkling of compression would be good!
>>
>> One thing xloader gives us is that it makes it possible to load on a
>> spinning disk. (it also has lower peak intermediate file space and
>> faster because it does not fall into a slow loading mode for the node
>> table that tdbloader2 did sometimes.)
>>
>>       Andy
>>
>>>
>>> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
>>>
>>>>
>>>>
>>>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>>>> Wikidata are looking for a replace for BlazeGraph
>>>>>
>>>>> About WDQS, current scale and current challenges
>>>>>      https://youtu.be/wn2BrQomvFU?t=9148
>>>>>
>>>>> And in the process of appointing a graph consultant: (5 month
>> contract):
>>>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>>>
>>>>> and Apache Jena came up:
>>>>> https://phabricator.wikimedia.org/T206560#7517212
>>>>>
>>>>> Realistically?
>>>>>
>>>>> Full wikidata is 16B triples. Very hard to load - xloader may help
>>>>> though the goal for that was to make loading the truthy subset (5B)
>>>>> easier. 5B -> 16B is not a trivial step.
>>>>
>>>> And it's growing at about 1B per quarter.
>>>>
>>>>
>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
>>>>
>>>>>
>>>>> Even if wikidata loads, it would be impractically slow as TDB is today.
>>>>> (yes, that's fixable; not practical in their timescales.)
>>>>>
>>>>> The current discussions feel more like they are looking for a "product"
>>>>> - a triplestore that they are use - rather than a collaboration.
>>>>>
>>>>>        Andy
>>>>
>>>
>>>
>>
> 
> 

Re: Wikidata evolution

Posted by Marco Neumann <ma...@gmail.com>.
Yes, I just had a look at one of my own datasets with 180M triples and a
footprint of 28G. The overhead is not too bad, at 10-20% vs. the raw NT files.

I was surprised that the CLEAR ALL directive doesn't remove/release space
on disk. Does TDB2 require a commit to release disk space?

Impressed to see that load rates went up to 250k/s with 4.2, more than
twice the speed I have seen with 3.15. Not sure if this is OS-related
(Ubuntu 20.04.3 LTS).
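
A rough way to put a triples/s number on a small load from the Java API,
as a sketch with made-up paths (the 250k/s figures come from the bulk
loader, tdb2.tdbloader, not from this plain transactional load):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb2.TDB2Factory;

    public class LoadRate {
        public static void main(String[] args) {
            // Hypothetical paths - adjust to your data and database.
            String dbDir  = "/data/tdb2-test";
            String ntFile = "/data/dump-180m.nt.gz";

            Dataset ds = TDB2Factory.connectDataset(dbDir);

            long start = System.nanoTime();
            // Plain transactional load through the API; the bulk loaders
            // (tdb2.tdbloader / xloader) are considerably faster at scale.
            Txn.executeWrite(ds, () -> RDFDataMgr.read(ds, ntFile));
            double seconds = (System.nanoTime() - start) / 1e9;

            long triples = Txn.calculateRead(ds, () -> ds.getDefaultModel().size());
            System.out.printf("%,d triples in %.1fs = %,.0f triples/s%n",
                              triples, seconds, triples / seconds);
        }
    }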

Maybe we should recommend that the Wikidata team provide us with a
production-type machine to run some load and query tests.






On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 21/11/2021 21:03, Marco Neumann wrote:
> > What's the disk footprint these days for 1b on tdb2?
>
> Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant
> sized literals - the node themselves are 50G). Obvious for current WD
> scale usage a sprinkling of compression would be good!
>
> One thing xloader gives us is that it makes it possible to load on a
> spinning disk. (it also has lower peak intermediate file space and
> faster because it does not fall into a slow loading mode for the node
> table that tdbloader2 did sometimes.)
>
>      Andy
>
> >
> > On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
> >
> >>
> >>
> >> On 20/11/2021 14:21, Andy Seaborne wrote:
> >>> Wikidata are looking for a replace for BlazeGraph
> >>>
> >>> About WDQS, current scale and current challenges
> >>>     https://youtu.be/wn2BrQomvFU?t=9148
> >>>
> >>> And in the process of appointing a graph consultant: (5 month
> contract):
> >>> https://boards.greenhouse.io/wikimedia/jobs/3546920
> >>>
> >>> and Apache Jena came up:
> >>> https://phabricator.wikimedia.org/T206560#7517212
> >>>
> >>> Realistically?
> >>>
> >>> Full wikidata is 16B triples. Very hard to load - xloader may help
> >>> though the goal for that was to make loading the truthy subset (5B)
> >>> easier. 5B -> 16B is not a trivial step.
> >>
> >> And it's growing at about 1B per quarter.
> >>
> >>
> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
> >>
> >>>
> >>> Even if wikidata loads, it would be impractically slow as TDB is today.
> >>> (yes, that's fixable; not practical in their timescales.)
> >>>
> >>> The current discussions feel more like they are looking for a "product"
> >>> - a triplestore that they are use - rather than a collaboration.
> >>>
> >>>       Andy
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.

On 21/11/2021 21:03, Marco Neumann wrote:
> What's the disk footprint these days for 1b on tdb2?

Quite a lot. For 1B BSBM, ~125G (BSBM is a bit heavy on significantly 
sized literals - the nodes themselves are 50G). Obviously, for current 
WD-scale usage a sprinkling of compression would be good!
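
A quick way to check a number like that against a local database - a
sketch with a made-up directory and triple count:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DbFootprint {
        public static void main(String[] args) throws IOException {
            // Hypothetical TDB2 database directory and its triple count.
            Path dbDir   = Path.of("/data/tdb2-bsbm-1b");
            long triples = 1_000_000_000L;

            // Sum the sizes of all files under the database directory.
            long bytes;
            try (var files = Files.walk(dbDir)) {
                bytes = files.filter(Files::isRegularFile)
                             .mapToLong(p -> p.toFile().length())
                             .sum();
            }
            // e.g. ~125G over 1B triples is on the order of 130 bytes/triple.
            System.out.printf("%.1f GiB on disk, %.0f bytes/triple%n",
                              bytes / (1024.0 * 1024 * 1024),
                              (double) bytes / triples);
        }
    }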

One thing xloader gives us is that it makes it possible to load on a 
spinning disk. (It also has lower peak intermediate file space and is 
faster because it does not fall into the slow node-table loading mode 
that tdbloader2 sometimes did.)

     Andy

> 
> On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:
> 
>>
>>
>> On 20/11/2021 14:21, Andy Seaborne wrote:
>>> Wikidata are looking for a replace for BlazeGraph
>>>
>>> About WDQS, current scale and current challenges
>>>     https://youtu.be/wn2BrQomvFU?t=9148
>>>
>>> And in the process of appointing a graph consultant: (5 month contract):
>>> https://boards.greenhouse.io/wikimedia/jobs/3546920
>>>
>>> and Apache Jena came up:
>>> https://phabricator.wikimedia.org/T206560#7517212
>>>
>>> Realistically?
>>>
>>> Full wikidata is 16B triples. Very hard to load - xloader may help
>>> though the goal for that was to make loading the truthy subset (5B)
>>> easier. 5B -> 16B is not a trivial step.
>>
>> And it's growing at about 1B per quarter.
>>
>> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
>>
>>>
>>> Even if wikidata loads, it would be impractically slow as TDB is today.
>>> (yes, that's fixable; not practical in their timescales.)
>>>
>>> The current discussions feel more like they are looking for a "product"
>>> - a triplestore that they are use - rather than a collaboration.
>>>
>>>       Andy
>>
> 
> 

Re: Wikidata evolution

Posted by Marco Neumann <ma...@gmail.com>.
What's the disk footprint these days for 1b on tdb2?

On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 20/11/2021 14:21, Andy Seaborne wrote:
> > Wikidata are looking for a replace for BlazeGraph
> >
> > About WDQS, current scale and current challenges
> >    https://youtu.be/wn2BrQomvFU?t=9148
> >
> > And in the process of appointing a graph consultant: (5 month contract):
> > https://boards.greenhouse.io/wikimedia/jobs/3546920
> >
> > and Apache Jena came up:
> > https://phabricator.wikimedia.org/T206560#7517212
> >
> > Realistically?
> >
> > Full wikidata is 16B triples. Very hard to load - xloader may help
> > though the goal for that was to make loading the truthy subset (5B)
> > easier. 5B -> 16B is not a trivial step.
>
> And it's growing at about 1B per quarter.
>
> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
>
> >
> > Even if wikidata loads, it would be impractically slow as TDB is today.
> > (yes, that's fixable; not practical in their timescales.)
> >
> > The current discussions feel more like they are looking for a "product"
> > - a triplestore that they are use - rather than a collaboration.
> >
> >      Andy
>


-- 


---
Marco Neumann
KONA

Re: Wikidata evolution

Posted by Andy Seaborne <an...@apache.org>.

On 20/11/2021 14:21, Andy Seaborne wrote:
> Wikidata are looking for a replace for BlazeGraph
> 
> About WDQS, current scale and current challenges
>    https://youtu.be/wn2BrQomvFU?t=9148
> 
> And in the process of appointing a graph consultant: (5 month contract):
> https://boards.greenhouse.io/wikimedia/jobs/3546920
> 
> and Apache Jena came up:
> https://phabricator.wikimedia.org/T206560#7517212
> 
> Realistically?
> 
> Full wikidata is 16B triples. Very hard to load - xloader may help 
> though the goal for that was to make loading the truthy subset (5B) 
> easier. 5B -> 16B is not a trivial step.

And it's growing at about 1B per quarter.

https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy

> 
> Even if wikidata loads, it would be impractically slow as TDB is today.
> (yes, that's fixable; not practical in their timescales.)
> 
> The current discussions feel more like they are looking for a "product" 
> - a triplestore that they are use - rather than a collaboration.
> 
>      Andy

Re: Wikidata evolution

Posted by Marco Neumann <ma...@gmail.com>.
You should go for it, Andy. This is an opportunity to get Jena into an
active and public project.

The Apache Jena community can grow with the challenge. Is BlazeGraph on par
with Jena, or is it currently stronger on indexing and loading speed?

On Sat, Nov 20, 2021 at 2:21 PM Andy Seaborne <an...@apache.org> wrote:

> Wikidata are looking for a replace for BlazeGraph
>
> About WDQS, current scale and current challenges
>    https://youtu.be/wn2BrQomvFU?t=9148
>
> And in the process of appointing a graph consultant: (5 month contract):
> https://boards.greenhouse.io/wikimedia/jobs/3546920
>
> and Apache Jena came up:
> https://phabricator.wikimedia.org/T206560#7517212
>
> Realistically?
>
> Full wikidata is 16B triples. Very hard to load - xloader may help
> though the goal for that was to make loading the truthy subset (5B)
> easier. 5B -> 16B is not a trivial step.
>
> Even if wikidata loads, it would be impractically slow as TDB is today.
> (yes, that's fixable; not practical in their timescales.)
>
> The current discussions feel more like they are looking for a "product"
> - a triplestore that they are use - rather than a collaboration.
>
>      Andy
>


-- 


---
Marco Neumann
KONA