Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2017/10/03 21:43:31 UTC

TDB2 merged

It's in the build, included via apache-jena-libs.

It is in the Fuseki2 server jar, but not the UI - a user needs to use a 
configuration file. That also works in fuseki-basic.
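
For reference, a service configuration along these lines selects TDB2 
(the service name "data" and the location "DB2" are placeholders; check 
against the TDB2 documentation for your release):

```turtle
@prefix :       <#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix tdb2:   <http://jena.apache.org/2016/tdb#> .

# SPARQL service backed by a TDB2 database.
:service rdf:type fuseki:Service ;
    fuseki:name                       "data" ;      # endpoint: /data
    fuseki:serviceQuery               "sparql" ;
    fuseki:serviceUpdate              "update" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:dataset                    :dataset .

# The dataset type tdb2:DatasetTDB2 is what selects TDB2 over TDB1.
:dataset rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "DB2" .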

Documentation to follow.

    Andy

Re: bad Fuseki Content-Length value Re: TDB2 testing

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 15:22, Osma Suominen wrote:
...

>> Please separate this out into another email - what is the problem and 
>> does it apply to the current codebase?
> 
> Sorry I wasn't clear. This is something you mentioned yourself in 
> another e-mail on 2017-10-06 about how to load large files into Fuseki 
> with TDB2:
> 
>> This seems to work:
>>
>> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 
>> 'Content-type: application/n-triples' http://localhost:3030/data
>>
>> 200M BSBM (49Gbytes) loaded at 42K triples/s.
>>
>> The content length in the fuseki log is reported wrongly (1002691465 
>> ... int/long error) but the triple count is right. 

Now fixed.

> 
> The only connection to TDB2 is that with TDB1 transaction sizes were 
> limited, so I guess that the overflow situation never happened. With 
> TDB2 you can now push very large files into Fuseki (yay!), but this 
> exposes the problem. It's a very minor issue at least to me. I'm more 
> interested in the other questions - especially if it's possible to 
> maintain a Fuseki endpoint with a TDB2 store, occasionally pushing new 
> data but not filling the disk doing so.

It was an int/long bug so it happens at 2G.

(There is an equivalent problem in Apache Commons FileUpload - but, 
fortunately, in a place that Jena does not use when called from the UI. 
It's unfixable in FileUpload.  If you say "getString()" then the string 
is limited to 2G characters because Java strings are char[]s.)
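
The int/long narrowing Andy describes can be seen in a few lines of 
Java (the 49 GB figure mirrors the BSBM load above; the exact numbers 
are illustrative, not the value from the Fuseki log):

```java
// Any Content-Length over Integer.MAX_VALUE (2 GB) is mangled when
// stored in an int: the narrowing cast keeps only the low 32 bits.
public class ContentLengthOverflow {
    public static void main(String[] args) {
        long contentLength = 49L * 1024 * 1024 * 1024;   // ~49 GB body
        int logged = (int) contentLength;                // low 32 bits only

        System.out.println("actual : " + contentLength); // 52613349376
        System.out.println("logged : " + logged);        // 1073741824 (wrong)
    }
}
```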

A 2G file into TDB1 is a few G of RAM maximum, and isn't near the size 
limits for Fuseki. Fuseki uses TDB cautiously and further restricts the 
delayed work queue.

    Andy

> 
> -Osma
> 

bad Fuseki Content-Length value Re: TDB2 testing

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!

Andy Seaborne kirjoitti 16.11.2017 klo 15:54:
> I am weeks behind on email.

No problem with that. I just noticed that other TDB2 issues and comments 
were discussed on JIRA and on users@ while this one wasn't. I figured it 
might have fallen through the cracks.

>  > 4. Should there be a JIRA issue about the bad Content-Length values 
> reported by Fuseki?
> 
> I don't see any connection to TDB2.
> 
> Please separate this out into another email - what is the problem and 
> does it apply to the current codebase?

Sorry I wasn't clear. This is something you mentioned yourself in 
another e-mail on 2017-10-06 about how to load large files into Fuseki 
with TDB2:

> This seems to work:
> 
> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 'Content-type: application/n-triples' http://localhost:3030/data
> 
> 200M BSBM (49Gbytes) loaded at 42K triples/s.
> 
> The content length in the fuseki log is reported wrongly (1002691465 ... int/long error) but the triple count is right. 

The only connection to TDB2 is that with TDB1 transaction sizes were 
limited, so I guess that the overflow situation never happened. With 
TDB2 you can now push very large files into Fuseki (yay!), but this 
exposes the problem. It's a very minor issue at least to me. I'm more 
interested in the other questions - especially if it's possible to 
maintain a Fuseki endpoint with a TDB2 store, occasionally pushing new 
data but not filling the disk doing so.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB2 testing Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.
On 16/11/17 07:53, Osma Suominen wrote:
> Hi all (especially Andy)

I am weeks behind on email.

 > 4. Should there be a JIRA issue about the bad Content-Length values 
reported by Fuseki?

I don't see any connection to TDB2.

Please separate this out into another email - what is the problem and 
does it apply to the current codebase?

(NB I have encountered some tools getting the Content-Length wrong.)

     Andy

> I'd appreciate some kind of response to the questions at the end of the 
> below message. I can open the JIRA issues if necessary. I realize this 
> was posted during the hectic release process.
> 
> I have done some further testing for TDB & TDB2 disk space usage, will 
> write a separate message about that, probably this time on users@
> 
> -Osma
> 
> Osma Suominen kirjoitti 27.10.2017 klo 13:44:
>> Hi,
>>
>> As I've promised earlier I took TDB2 for a little test drive, using 
>> the 3.5.0rc1 builds.
>>
>> I tested two scenarios: A server running Fuseki, and command line 
>> tools operating directly on a database directory.
>>
>> 1. Server running Fuseki
>>
>> First the server (running as a VM). Currently I've been using Fuseki 
>> with HDT support, from the hdt-java repository. I'm serving a dataset 
>> of about 39M triples, which occasionally changes (eventually this will 
>> be updated once per month, or perhaps more frequently, even once per 
>> day). With HDT, I can simply rebuild the HDT file (less than 10 
>> minutes) and then restart Fuseki. Downtime for the endpoint is only a 
>> few seconds. But I'm worried about the state of the hdt-java project, 
>> it is not being actively maintained and it's still based on Fuseki1.
>>
>> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth 
>> thanks to the documentation that Andy provided. I usually create 
>> Fuseki2 datasets via the API (using curl), but I noticed that, like 
>> the UI, the API only supports "mem" and "tdb". So I created a "tdb" 
>> dataset first, then edited the configuration file so it uses tdb2 
>> instead.
>>
>> Loading the data took about 17 minutes. I used wget for this, per 
>> Andy's example. This is a bit slower than regenerating the HDT, but 
>> acceptable since I'm only doing it occasionally. I also tested 
>> executing queries while reloading the data. This seemed to work OK 
>> even though performance obviously did suffer. But at least the 
>> endpoint remained up.
>>
>> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + 
>> index for the same data is 560MB.
>>
>> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost 
>> twice its original size. I understand that the TDB2 needs to be 
>> compacted regularly, otherwise it will keep growing. I'm OK with the 
>> large disk space usage if it's constant, not growing over time like TDB1.
>>
>> 2. Command line tools
>>
>> For this I used an older version of the same dataset with 30M triples, 
>> the same one I used for my HDT vs TDB comparison that I posted on the 
>> users mailing list:
>> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E 
>>
>>
>> This was on my i3-2330M laptop with 8GB RAM and SSD.
>>
>> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k 
>> triples per second). The TDB2 directory is 3.7GB. In contrast, using 
>> tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. 
>> So TDB2 is slower to load and takes more disk space than TDB.
>>
>> I ran the same example query I used before on the TDB2. The first time 
>> was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
>>
>> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. 
>> The query took 13.7-14.0 seconds after the first run (24 seconds).
>>
>> I also reloaded the same data to the TDB2 to see the effect. Reloading 
>> took 11 minutes and the database grew to 5.7GB. Then I compacted it 
>> using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage 
>> just grew further, to 9.7GB. The database directory then contained 
>> both Data-0001 and Data-0002 directories. I removed Data-0001 and disk 
>> usage fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
>>
>> My impressions so far: It works, but it's slower than TDB and needs 
>> more disk space. Compaction seems to work, but initially it will just 
>> increase disk usage. The stale data has to be manually removed to 
>> actually reclaim any space. I didn't test subsequent load/compact 
>> cycles, but I assume there may still be some disk space growth (e.g. 
>> due to blank nodes, of which there are some in my dataset) even if the 
>> data is regularly compacted.
>>
>> For me, not growing over time like TDB is really the crucial feature 
>> that TDB2 seems to promise. Right now it's not clear whether it 
>> entirely fulfills this promise, since compaction needs to be done 
>> manually and doesn't actually reclaim disk space by itself.
>>
>> Questions/suggestions:
>>
>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
>> prefer not taking the endpoint down for compaction.
>>
>> 2. Should the stale data be deleted after compaction, at least as an 
>> option?
>>
>> 3. Should there be a JIRA issue about UI and API support for creating 
>> TDB2 datasets?
>>
>> 4. Should there be a JIRA issue about the bad Content-Length values 
>> reported by Fuseki?
>>
>> -Osma
>>
> 
> 

Re: TDB2 testing Re: TDB2 merged

Posted by Osma Suominen <os...@helsinki.fi>.
Hi all (especially Andy)

I'd appreciate some kind of response to the questions at the end of the 
below message. I can open the JIRA issues if necessary. I realize this 
was posted during the hectic release process.

I have done some further testing for TDB & TDB2 disk space usage, will 
write a separate message about that, probably this time on users@

-Osma

Osma Suominen kirjoitti 27.10.2017 klo 13:44:
> Hi,
> 
> As I've promised earlier I took TDB2 for a little test drive, using the 
> 3.5.0rc1 builds.
> 
> I tested two scenarios: A server running Fuseki, and command line tools 
> operating directly on a database directory.
> 
> 1. Server running Fuseki
> 
> First the server (running as a VM). Currently I've been using Fuseki 
> with HDT support, from the hdt-java repository. I'm serving a dataset of 
> about 39M triples, which occasionally changes (eventually this will be 
> updated once per month, or perhaps more frequently, even once per day). 
> With HDT, I can simply rebuild the HDT file (less than 10 minutes) and 
> then restart Fuseki. Downtime for the endpoint is only a few seconds. 
> But I'm worried about the state of the hdt-java project, it is not being 
> actively maintained and it's still based on Fuseki1.
> 
> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth 
> thanks to the documentation that Andy provided. I usually create Fuseki2 
> datasets via the API (using curl), but I noticed that, like the UI, the 
> API only supports "mem" and "tdb". So I created a "tdb" dataset first, 
> then edited the configuration file so it uses tdb2 instead.
> 
> Loading the data took about 17 minutes. I used wget for this, per Andy's 
> example. This is a bit slower than regenerating the HDT, but acceptable 
> since I'm only doing it occasionally. I also tested executing queries 
> while reloading the data. This seemed to work OK even though performance 
> obviously did suffer. But at least the endpoint remained up.
> 
> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index 
> for the same data is 560MB.
> 
> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost 
> twice its original size. I understand that the TDB2 needs to be 
> compacted regularly, otherwise it will keep growing. I'm OK with the 
> large disk space usage if it's constant, not growing over time like TDB1.
> 
> 2. Command line tools
> 
> For this I used an older version of the same dataset with 30M triples, 
> the same one I used for my HDT vs TDB comparison that I posted on the 
> users mailing list:
> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E 
> 
> 
> This was on my i3-2330M laptop with 8GB RAM and SSD.
> 
> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k 
> triples per second). The TDB2 directory is 3.7GB. In contrast, using 
> tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So 
> TDB2 is slower to load and takes more disk space than TDB.
> 
> I ran the same example query I used before on the TDB2. The first time 
> was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
> 
> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The 
> query took 13.7-14.0 seconds after the first run (24 seconds).
> 
> I also reloaded the same data to the TDB2 to see the effect. Reloading 
> took 11 minutes and the database grew to 5.7GB. Then I compacted it 
> using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage 
> just grew further, to 9.7GB. The database directory then contained both 
> Data-0001 and Data-0002 directories. I removed Data-0001 and disk usage 
> fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
> 
> My impressions so far: It works, but it's slower than TDB and needs more 
> disk space. Compaction seems to work, but initially it will just 
> increase disk usage. The stale data has to be manually removed to 
> actually reclaim any space. I didn't test subsequent load/compact 
> cycles, but I assume there may still be some disk space growth (e.g. due 
> to blank nodes, of which there are some in my dataset) even if the data 
> is regularly compacted.
> 
> For me, not growing over time like TDB is really the crucial feature 
> that TDB2 seems to promise. Right now it's not clear whether it entirely 
> fulfills this promise, since compaction needs to be done manually and 
> doesn't actually reclaim disk space by itself.
> 
> Questions/suggestions:
> 
> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
> prefer not taking the endpoint down for compaction.
> 
> 2. Should the stale data be deleted after compaction, at least as an 
> option?
> 
> 3. Should there be a JIRA issue about UI and API support for creating 
> TDB2 datasets?
> 
> 4. Should there be a JIRA issue about the bad Content-Length values 
> reported by Fuseki?
> 
> -Osma
> 


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB2 testing Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 20:27, Osma Suominen wrote:
> Hi Andy!
> 
> Thanks for your excellent answers.
> 
> Andy Seaborne kirjoitti 16.11.2017 klo 22:02:
> 
>>> But I'm worried about the state of the hdt-java project, it is not 
>>> being actively maintained and it's still based on Fuseki1.
>>
>> You don't need to use their Fuseki integration.
> 
> I need a SPARQL endpoint...AFAICT the hdt-java Fuseki integration is the 
> only available way to set up a SPARQL endpoint on top of HDT files. 
> Well, there's the LDF stack, which can do SPARQL-over-LDF-over-HDT, but 
> it doesn't make sense to use that within a single machine, it would just 
> create huge overhead. Over the network LDF makes sense in some 
> scenarios, if you want to provide data that others can compute on 
> without causing huge load on the server.

I see HDTGraphAssembler.

If you write a Fuseki service configuration, it should work. Well, 
versions ...

(untested by me).
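
Such a service configuration might look roughly like this. The hdt: 
prefix URI and property names are recalled from hdt-java's examples and 
should be checked against the version in use; the assembler class may 
also need to be on the classpath and registered:

```turtle
@prefix :    <#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:  <http://jena.apache.org/2005/11/Assembler#> .
@prefix hdt: <http://www.rdfhdt.org/fuseki#> .

# Dataset whose default graph is backed by an HDT file via
# HDTGraphAssembler (vocabulary per hdt-java's examples; unverified).
:dataset rdf:type ja:RDFDataset ;
    ja:defaultGraph :hdtGraph .

:hdtGraph rdf:type hdt:HDTGraph ;
    hdt:fileName "/path/to/data.hdt" .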

     Andy

Re: TDB2 testing Re: TDB2 merged

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!

Thanks for your excellent answers.

Andy Seaborne kirjoitti 16.11.2017 klo 22:02:

>> But I'm worried about the state of the hdt-java project, 
>> it is not being actively maintained and it's still based on Fuseki1.
> 
> You don't need to use their Fuseki integration.

I need a SPARQL endpoint...AFAICT the hdt-java Fuseki integration is the 
only available way to set up a SPARQL endpoint on top of HDT files. 
Well, there's the LDF stack, which can do SPARQL-over-LDF-over-HDT, but 
it doesn't make sense to use that within a single machine, it would just 
create huge overhead. Over the network LDF makes sense in some 
scenarios, if you want to provide data that others can compute on 
without causing huge load on the server.

> Those are low figures for 40M.  Lack of free RAM? (It's more acute with 
> TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?
> 
> Today: TDB2:
> 
> INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)

My laptop is hardly top of the line, it's a cheap model from 2011. I 
gave the specs above. It has an SSD but it's limited by the SATA bus 
speed to around 300MB/s in ideal conditions. I'm sure a more modern 
machine can do much better, as your figures indicate. But I use this one 
for comparison benchmarks, because it's easy to guarantee that there's 
nothing else running on the system, unlike on a VM with shared resources 
which is generally faster but less predictable.

The syntax is N-Triples. One thing I forgot to mention (I even forgot 
about it myself) is that there is some duplication of triples within 
that file, as it's created by concatenating several files. So it's more 
like 50M triples of which 40M are distinct.

>> My impressions so far: It works, but it's slower than TDB and needs 
>> more disk space. Compaction seems to work, but initially it will just 
>> increase disk usage. The stale data has to be manually removed to 
>> actually reclaim any space. 
> 
> The user can archive it or delete it.

Yes, I understand. But in this case I want to have a SPARQL endpoint 
that runs for months at a time, ideally with little supervision. I 
*don't* want to be there deleting stale files all the time!

Please don't get me wrong, I'm not trying to downplay or criticize your 
excellent work on TDB2! I'm just trying to figure out whether it would 
already be suitable for the use case I have - a public SPARQL endpoint 
with a nontrivial size dataset that updates regularly. I'm kicking the 
tyres so to speak, looking for potential causes of concern and 
suggestions for improvements. So far I've been very impressed with what 
TDB2 can do, even though it's at an early stage of development. 
Especially the way Fuseki with TDB2 can now handle very large 
transactions is great, and the performance seems to be roughly on par 
with TDB1, which is also a good sign.

>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
>> prefer not taking the endpoint down for compaction.
> 
> Not currently; as I've said, there is no Fuseki change except integrating 
> the TDB2 jars.
> 
> Adding a template name to the HTTP API would be good but IMO it's a long 
> way off to provide UI access.  TDB1 works for people.

OK.

>> 2. Should the stale data be deleted after compaction, at least as an 
>> option?
> 
> If you want to make a PR ...

Understood.

>> 3. Should there be a JIRA issue about UI and API support for creating 
>> TDB2 datasets?
> 
> Every JIRA is a request for someone to do work or an offer to contribute.

That's why I asked first! When I notice clear bugs I create JIRA issues. 
Ditto when I have something to contribute. But delving deep into 
Fuseki/TDB2 integration issues is a bit too far outside my comfort 
zone, unfortunately.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB2 testing Re: TDB2 merged

Posted by ajs6f <aj...@apache.org>.
> Adding a template name to the HTTP API would be good but IMO it's a long way off to provide UI access.  TDB1 works for people.

This is true, but if we can give people an easy way to create TDB2 dbs and compare them apples-to-apples in their own systems, we will get more feedback more quickly.

That having been said, I honestly do not know anything about how the Fuseki UI is coded. Is it done with a well-known template library?

ajs6f

> On Nov 16, 2017, at 3:02 PM, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> 
> On 27/10/17 11:44, Osma Suominen wrote:
>> Hi,
>> As I've promised earlier I took TDB2 for a little test drive, using the 3.5.0rc1 builds.
>> I tested two scenarios: A server running Fuseki, and command line tools operating directly on a database directory.
>> 1. Server running Fuseki
>> First the server (running as a VM). Currently I've been using Fuseki with HDT support, from the hdt-java repository. I'm serving a dataset of about 39M triples, which occasionally changes (eventually this will be updated once per month, or perhaps more frequently, even once per day). With HDT, I can simply rebuild the HDT file (less than 10 minutes) and then restart Fuseki. Downtime for the endpoint is only a few seconds. But I'm worried about the state of the hdt-java project, it is not being actively maintained and it's still based on Fuseki1.
> 
> You don't need to use their Fuseki integration.
> 
>> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth thanks to the documentation that Andy provided. I usually create Fuseki2 datasets via the API (using curl), but I noticed that, like the UI, the API only supports "mem" and "tdb". So I created a "tdb" dataset first, then edited the configuration file so it uses tdb2 instead.
>> Loading the data took about 17 minutes. I used wget for this, per Andy's example. This is a bit slower than regenerating the HDT, but acceptable since I'm only doing it occasionally. I also tested executing queries while reloading the data. This seemed to work OK even though performance obviously did suffer. But at least the endpoint remained up.
>> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index for the same data is 560MB.
>> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost twice its original size. I understand that the TDB2 needs to be compacted regularly, otherwise it will keep growing. I'm OK with the large disk space usage if it's constant, not growing over time like TDB1.
>> 2. Command line tools
>> For this I used an older version of the same dataset with 30M triples, the same one I used for my HDT vs TDB comparison that I posted on the users mailing list:
>> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E
>> This was on my i3-2330M laptop with 8GB RAM and SSD.
> 
> Thank you for the figures.
> 
>> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k triples per second). The TDB2 directory is 3.7GB. In contrast, using tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So TDB2 is slower to load and takes more disk space than TDB.
> 
> Those are low figures for 40M.  Lack of free RAM? (It's more acute with TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?
> 
> Today: TDB2:
> 
> INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)
> 
> 
>> I ran the same example query I used before on the TDB2. The first time was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
>> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The query took 13.7-14.0 seconds after the first run (24 seconds).
>> I also reloaded the same data to the TDB2 to see the effect. Reloading took 11 minutes and the database grew to 5.7GB. Then I compacted it using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage just grew further, to 9.7GB. The database directory then contained both Data-0001 and Data-0002 directories. I removed Data-0001 and disk usage fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
>> My impressions so far: It works, but it's slower than TDB and needs more disk space. Compaction seems to work, but initially it will just increase disk usage. The stale data has to be manually removed to actually reclaim any space. 
> 
> The user can archive it or delete it.
> 
>> I didn't test subsequent load/compact cycles, but I assume there may still be some disk space growth (e.g. due to blank nodes, of which there are some in my dataset) even if the data is regularly compacted.
>> For me, not growing over time like TDB is really the crucial feature that TDB2 seems to promise. Right now it's not clear whether it entirely fulfills this promise, since compaction needs to be done manually and doesn't actually reclaim disk space by itself.
>> Questions/suggestions:
>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd prefer not taking the endpoint down for compaction.
> 
> Not currently; as I've said, there is no Fuseki change except integrating the TDB2 jars.
> 
> Adding a template name to the HTTP API would be good but IMO it's a long way off to provide UI access.  TDB1 works for people.
> 
>> 2. Should the stale data be deleted after compaction, at least as an option?
> 
> If you want to make a PR ...
> 
>> 3. Should there be a JIRA issue about UI and API support for creating TDB2 datasets?
> >
> 
> Every JIRA is a request for someone to do work or an offer to contribute.
> 
>    Andy
> 
>> -Osma


Re: TDB2 testing Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.

On 27/10/17 11:44, Osma Suominen wrote:
> Hi,
> 
> As I've promised earlier I took TDB2 for a little test drive, using the 
> 3.5.0rc1 builds.
> 
> I tested two scenarios: A server running Fuseki, and command line tools 
> operating directly on a database directory.
> 
> 1. Server running Fuseki
> 
> First the server (running as a VM). Currently I've been using Fuseki 
> with HDT support, from the hdt-java repository. I'm serving a dataset of 
> about 39M triples, which occasionally changes (eventually this will be 
> updated once per month, or perhaps more frequently, even once per day). 
> With HDT, I can simply rebuild the HDT file (less than 10 minutes) and 
> then restart Fuseki. Downtime for the endpoint is only a few seconds. 
> But I'm worried about the state of the hdt-java project, it is not being 
> actively maintained and it's still based on Fuseki1.

You don't need to use their Fuseki integration.

> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth 
> thanks to the documentation that Andy provided. I usually create Fuseki2 
> datasets via the API (using curl), but I noticed that, like the UI, the 
> API only supports "mem" and "tdb". So I created a "tdb" dataset first, 
> then edited the configuration file so it uses tdb2 instead.
> 
> Loading the data took about 17 minutes. I used wget for this, per Andy's 
> example. This is a bit slower than regenerating the HDT, but acceptable 
> since I'm only doing it occasionally. I also tested executing queries 
> while reloading the data. This seemed to work OK even though performance 
> obviously did suffer. But at least the endpoint remained up.
> 
> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index 
> for the same data is 560MB.
> 
> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost 
> twice its original size. I understand that the TDB2 needs to be 
> compacted regularly, otherwise it will keep growing. I'm OK with the 
> large disk space usage if it's constant, not growing over time like TDB1.
> 
> 2. Command line tools
> 
> For this I used an older version of the same dataset with 30M triples, 
> the same one I used for my HDT vs TDB comparison that I posted on the 
> users mailing list:
> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E 
> 
> 
> This was on my i3-2330M laptop with 8GB RAM and SSD.

Thank you for the figures.

> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k 
> triples per second). The TDB2 directory is 3.7GB. In contrast, using 
> tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So 
> TDB2 is slower to load and takes more disk space than TDB.

Those are low figures for 40M.  Lack of free RAM? (It's more acute with 
TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?

Today: TDB2:

INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)


> I ran the same example query I used before on the TDB2. The first time 
> was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
> 
> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The 
> query took 13.7-14.0 seconds after the first run (24 seconds).
> 
> I also reloaded the same data to the TDB2 to see the effect. Reloading 
> took 11 minutes and the database grew to 5.7GB. Then I compacted it 
> using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage 
> just grew further, to 9.7GB. The database directory then contained both 
> Data-0001 and Data-0002 directories. I removed Data-0001 and disk usage 
> fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
> 
> My impressions so far: It works, but it's slower than TDB and needs more 
> disk space. Compaction seems to work, but initially it will just 
> increase disk usage. The stale data has to be manually removed to 
> actually reclaim any space. 

The user can archive it or delete it.

> I didn't test subsequent load/compact 
> cycles, but I assume there may still be some disk space growth (e.g. due 
> to blank nodes, of which there are some in my dataset) even if the data 
> is regularly compacted.
> 
> For me, not growing over time like TDB is really the crucial feature 
> that TDB2 seems to promise. Right now it's not clear whether it entirely 
> fulfills this promise, since compaction needs to be done manually and 
> doesn't actually reclaim disk space by itself.
> 
> Questions/suggestions:
> 
> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
> prefer not taking the endpoint down for compaction.

Not currently; as I've said, there is no Fuseki change except integrating 
the TDB2 jars.

Adding a template name to the HTTP API would be good but IMO it's a long 
way off to provide UI access.  TDB1 works for people.

> 
> 2. Should the stale data be deleted after compaction, at least as an 
> option?

If you want to make a PR ...

> 
> 3. Should there be a JIRA issue about UI and API support for creating 
> TDB2 datasets?
 >

Every JIRA is a request for someone to do work or an offer to contribute.

     Andy

> 
> -Osma
> 

TDB2 testing Re: TDB2 merged

Posted by Osma Suominen <os...@helsinki.fi>.
Hi,

As I've promised earlier I took TDB2 for a little test drive, using the 
3.5.0rc1 builds.

I tested two scenarios: A server running Fuseki, and command line tools 
operating directly on a database directory.

1. Server running Fuseki

First the server (running as a VM). Currently I've been using Fuseki 
with HDT support, from the hdt-java repository. I'm serving a dataset of 
about 39M triples, which occasionally changes (eventually this will be 
updated once per month, or perhaps more frequently, even once per day). 
With HDT, I can simply rebuild the HDT file (less than 10 minutes) and 
then restart Fuseki. Downtime for the endpoint is only a few seconds. 
But I'm worried about the state of the hdt-java project, it is not being 
actively maintained and it's still based on Fuseki1.

So I switched (for now) to Fuseki2 with TDB2. It was rather smooth 
thanks to the documentation that Andy provided. I usually create Fuseki2 
datasets via the API (using curl), but I noticed that, like the UI, the 
API only supports "mem" and "tdb". So I created a "tdb" dataset first, 
then edited the configuration file so it uses tdb2 instead.

Loading the data took about 17 minutes. I used wget for this, per Andy's 
example. This is a bit slower than regenerating the HDT, but acceptable 
since I'm only doing it occasionally. I also tested executing queries 
while reloading the data. This seemed to work OK even though performance 
obviously did suffer. But at least the endpoint remained up.

The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index 
for the same data is 560MB.

I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost 
twice its original size. I understand that the TDB2 needs to be 
compacted regularly, otherwise it will keep growing. I'm OK with the 
large disk space usage if it's constant, not growing over time like TDB1.

2. Command line tools

For this I used an older version of the same dataset with 30M triples, 
the same one I used for my HDT vs TDB comparison that I posted on the 
users mailing list:
http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E

This was on my i3-2330M laptop with 8GB RAM and SSD.

Loading the data using tdb2.tdbloader took about 18 minutes (about 28k 
triples per second). The TDB2 directory is 3.7GB. In contrast, using 
tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So 
TDB2 is slower to load and takes more disk space than TDB.

I ran the same example query I used before on the TDB2. The first time 
was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.

I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The 
query took 13.7-14.0 seconds after the first run (24 seconds).

I also reloaded the same data to the TDB2 to see the effect. Reloading 
took 11 minutes and the database grew to 5.7GB. Then I compacted it 
using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage 
just grew further, to 9.7GB. The database directory then contained both 
Data-0001 and Data-0002 directories. I removed Data-0001 and disk usage 
fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
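That manual cleanup step can be scripted; a rough sketch (the function name is 
mine, and it assumes, as observed above, that after tdb2.tdbcompact only the 
highest-numbered Data-NNNN directory is the live database):

```shell
# Remove stale Data-NNNN directories left behind by a tdb2.tdbcompact run,
# keeping only the newest one (the live database).
compact_cleanup() {
    db="$1"
    keep=$(ls -d "$db"/Data-* 2>/dev/null | sort | tail -n 1)
    [ -n "$keep" ] || return 0    # no Data-NNNN directories, nothing to do
    for d in "$db"/Data-*; do
        [ "$d" = "$keep" ] || rm -rf "$d"
    done
    echo "kept $keep"
}
```

e.g. "compact_cleanup /path/to/dataset-dir", run only once compaction has 
finished.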

My impressions so far: It works, but it's slower than TDB and needs more 
disk space. Compaction seems to work, but initially it will just 
increase disk usage. The stale data has to be manually removed to 
actually reclaim any space. I didn't test subsequent load/compact 
cycles, but I assume there may still be some disk space growth (e.g. due 
to blank nodes, of which there are some in my dataset) even if the data 
is regularly compacted.

For me, not growing over time like TDB is really the crucial feature 
that TDB2 seems to promise. Right now it's not clear whether it entirely 
fulfills this promise, since compaction needs to be done manually and 
doesn't actually reclaim disk space by itself.

Questions/suggestions:

1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
prefer not taking the endpoint down for compaction.

2. Should the stale data be deleted after compaction, at least as an option?

3. Should there be a JIRA issue about UI and API support for creating 
TDB2 datasets?

4. Should there be a JIRA issue about the bad Content-Length values 
reported by Fuseki?

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.

On 06/10/17 12:36, Andy Seaborne wrote:
> That would be very helpful.
> 
> "documentation" is a task in the next few days. It's the block on 
> sending any messages to users@ etc about it.
> 

http://jena.staging.apache.org/documentation/tdb2/


Re: TDB2 merged

Posted by aj...@apache.org.
Okay, that makes sense. We might even just swap the "namespaces" at some future point when TDB2 becomes the default, 
i.e. make tdbquery the TDB2 command and provide a tdb1.tdbquery, as a stop on the road to deprecation.

ajs6f
Andy Seaborne wrote on 10/7/17 9:42 AM:
>
>
> On 06/10/17 21:17, ajs6f@apache.org wrote:
>>> The commands are in the binary distribution "apache-jena" download but there are no script wrappers (easy to copy and
>>> fix though).
>>
>> Just a thought-- maybe better to add flags to the current scripts? Having all-new loader scripts for TDB2 would make
>> for three different bulk loader scripts...
>
> Maybe, though it's not so simple to do, as the scripts are a general wrapper template that calls the Java code.
>
> For now, the TDB2 commands are of the form "tdb2.tdb*"
>
> tdb2.tdbquery ...
>
> Sometime, detecting the database type would be great but not critical path for the 3.5.0.
>
>     Andy
>
>>
>>
>> ajs6f
>>
>> Andy Seaborne wrote on 10/6/17 7:36 AM:
>>> That would be very helpful.
>>>
>>> "documentation" is a task in the next few days. It's the block on sending any messages to users@ etc about it.
>>>
>>>
>>> The raw material is in git:
>>>
>>> https://github.com/apache/jena/blob/master/jena-db/use-fuseki-tdb2.md
>>> https://github.com/apache/jena/blob/master/jena-db/use-tdb2-cmds.md
>>>
>>> The commands are in the binary distribution "apache-jena" download but there are no script wrappers (easy to copy and
>>> fix though).
>>>
>>> Either run from development or
>>>
>>> java -cp 'DIR/lib/*' tdb2.tdbloader ... args ...
>>>
>>>> some of my data files are too big to
>>>> be loaded via the Graph Store API.
>>>
>>> From TDB2 and Fuseki's point of view, that's no longer true.
>>> You can (should be able to) load any amount.
>>>
>>> The fuseki-basic server also has TDB2 in it so if you are doing everything script-driven, you can run that "--conf
>>> config-tdb2.ttl"
>>>
>>> There is no progress indicator in the server log so you may wish to set some kind of verbose option in the sender.
>>>
>>>     Andy
>>>
>>> Uploading large files:
>>>
>>> The UI does this all quite well.
>>>
>>> What's the magic for a command line/scripted process?
>>>
>>> It needs a tool that does not buffer or inspect the file or otherwise try to be helpful.
>>>
>>> Anyone know of good tools for this?
>>>
>>> I haven't managed to work out which set of "curl" arguments do this without buffering the file (--data* seem to
>>> buffer the file; -F is a form upload, not pure
>>> POST).
>>>
>>> This seems to work:
>>>
>>> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 'Content-type: application/n-triples'
>>> http://localhost:3030/data
>>>
>>> 200M BSBM (49Gbytes) loaded at 42K triples/s.
>>>
>>> The content length in the Fuseki log is reported wrongly (1002691465 ... int/long error) but the triple count is right.
>>>
>>> It does ruin the interactive performance of the machine!
>>>
>>> s-post crashes immediately if given a large file - don't know why.
>>>
>>> On 06/10/17 07:50, Osma Suominen wrote:
>>>> Excellent!
>>>>
>>>> I have a couple of Fuseki installations where I could test drive this. I'd just need to know how to do the
>>>> configuration, and also a tool like tdbloader for
>>>> offline loading since some of my data files are too big to be loaded via the Graph Store API.
>>>>
>>>> No hurry though.
>>>>
>>>> -Osma
>>>>
>>>>
>>>> Andy Seaborne kirjoitti 04.10.2017 klo 00:43:
>>>>> It's in the build joined in at apache-jena-libs.
>>>>>
>>>>> It is in Fuseki2 server jar, but not the UI - a user needs to use a configuration file. That also works in
>>>>> fuseki-basic.
>>>>>
>>>>> Documentation to follow.
>>>>>
>>>>>     Andy
>>>>
>>>>

Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.

On 06/10/17 21:17, ajs6f@apache.org wrote:
>> The commands are in the binary distribution "apache-jena" download but 
>> there are no script wrappers (easy to copy and fix though).
> 
> Just a thought-- maybe better to add flags to the current scripts? 
> Having all-new loader scripts for TDB2 would make for three different 
> bulk loader scripts...

Maybe, though it's not so simple to do, as the scripts are a 
general wrapper template that calls the Java code.

For now, the TDB2 commands are of the form "tdb2.tdb*"

tdb2.tdbquery ...

Sometime, detecting the database type would be great but not critical 
path for the 3.5.0.

     Andy

> 
> 
> ajs6f
> 
> Andy Seaborne wrote on 10/6/17 7:36 AM:
>> That would be very helpful.
>>
>> "documentation" is a task in the next few days. It's the block on 
>> sending any messages to users@ etc about it.
>>
>>
>> The raw material is in git:
>>
>> https://github.com/apache/jena/blob/master/jena-db/use-fuseki-tdb2.md
>> https://github.com/apache/jena/blob/master/jena-db/use-tdb2-cmds.md
>>
>> The commands are in the binary distribution "apache-jena" download but 
>> there are no script wrappers (easy to copy and fix though).
>>
>> Either run from development or
>>
>> java -cp 'DIR/lib/*' tdb2.tdbloader ... args ...
>>
>>> some of my data files are too big to
>>> be loaded via the Graph Store API.
>>
>> From TDB2 and Fuseki's point of view, that's no longer true.
>> You can (should be able to) load any amount.
>>
>> The fuseki-basic server also has TDB2 in it so if you are doing 
>> everything script-driven, you can run that "--conf config-tdb2.ttl"
>>
>> There is no progress indicator in the server log so you may wish to 
>> set some kind of verbose option in the sender.
>>
>>     Andy
>>
>> Uploading large files:
>>
>> The UI does this all quite well.
>>
>> What's the magic for a command line/scripted process?
>>
>> It needs a tool that does not buffer or inspect the file or otherwise 
>> try to be helpful.
>>
>> Anyone know of good tools for this?
>>
>> I haven't managed to work out which set of "curl" arguments do this 
>> without buffering the file (--data* seem to buffer the file; -F is a 
>> form upload, not pure
>> POST).
>>
>> This seems to work:
>>
>> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 
>> 'Content-type: application/n-triples' http://localhost:3030/data
>>
>> 200M BSBM (49Gbytes) loaded at 42K triples/s.
>>
>> The content length in the Fuseki log is reported wrongly (1002691465 
>> ... int/long error) but the triple count is right.
>>
>> It does ruin the interactive performance of the machine!
>>
>> s-post crashes immediately if given a large file - don't know why.
>>
>> On 06/10/17 07:50, Osma Suominen wrote:
>>> Excellent!
>>>
>>> I have a couple of Fuseki installations where I could test drive 
>>> this. I'd just need to know how to do the configuration, and also a 
>>> tool like tdbloader for
>>> offline loading since some of my data files are too big to be loaded 
>>> via the Graph Store API.
>>>
>>> No hurry though.
>>>
>>> -Osma
>>>
>>>
>>> Andy Seaborne kirjoitti 04.10.2017 klo 00:43:
>>>> It's in the build joined in at apache-jena-libs.
>>>>
>>>> It is in Fuseki2 server jar, but not the UI - a user needs to use a 
>>>> configuration file. That also works in fuseki-basic.
>>>>
>>>> Documentation to follow.
>>>>
>>>>     Andy
>>>
>>>

Re: TDB2 merged

Posted by aj...@apache.org.
> The commands are in the binary distribution "apache-jena" download but there are no script wrappers (easy to copy and fix though).

Just a thought-- maybe better to add flags to the current scripts? Having all-new loader scripts for TDB2 would make for three different bulk loader scripts...


ajs6f

Andy Seaborne wrote on 10/6/17 7:36 AM:
> That would be very helpful.
>
> "documentation" is a task in the next few days. It's the block on sending any messages to users@ etc about it.
>
>
> The raw material is in git:
>
> https://github.com/apache/jena/blob/master/jena-db/use-fuseki-tdb2.md
> https://github.com/apache/jena/blob/master/jena-db/use-tdb2-cmds.md
>
> The commands are in the binary distribution "apache-jena" download but there are no script wrappers (easy to copy and fix though).
>
> Either run from development or
>
> java -cp 'DIR/lib/*' tdb2.tdbloader ... args ...
>
>> some of my data files are too big to
>> be loaded via the Graph Store API.
>
> From TDB2 and Fuseki's point of view, that's no longer true.
> You can (should be able to) load any amount.
>
> The fuseki-basic server also has TDB2 in it so if you are doing everything script-driven, you can run that "--conf config-tdb2.ttl"
>
> There is no progress indicator in the server log so you may wish to set some kind of verbose option in the sender.
>
>     Andy
>
> Uploading large files:
>
> The UI does this all quite well.
>
> What's the magic for a command line/scripted process?
>
> It needs a tool that does not buffer or inspect the file or otherwise try to be helpful.
>
> Anyone know of good tools for this?
>
> I haven't managed to work out which set of "curl" arguments do this without buffering the file (--data* seem to buffer the file; -F is a form upload, not pure
> POST).
>
> This seems to work:
>
> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 'Content-type: application/n-triples' http://localhost:3030/data
>
> 200M BSBM (49Gbytes) loaded at 42K triples/s.
>
> The content length in the Fuseki log is reported wrongly (1002691465 ... int/long error) but the triple count is right.
>
> It does ruin the interactive performance of the machine!
>
> s-post crashes immediately if given a large file - don't know why.
>
> On 06/10/17 07:50, Osma Suominen wrote:
>> Excellent!
>>
>> I have a couple of Fuseki installations where I could test drive this. I'd just need to know how to do the configuration, and also a tool like tdbloader for
>> offline loading since some of my data files are too big to be loaded via the Graph Store API.
>>
>> No hurry though.
>>
>> -Osma
>>
>>
>> Andy Seaborne kirjoitti 04.10.2017 klo 00:43:
>>> It's in the build joined in at apache-jena-libs.
>>>
>>> It is in Fuseki2 server jar, but not the UI - a user needs to use a configuration file. That also works in fuseki-basic.
>>>
>>> Documentation to follow.
>>>
>>>     Andy
>>
>>

Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.
That would be very helpful.

"documentation" is a task in the next few days. It's the block on 
sending any messages to users@ etc about it.


The raw material is in git:

https://github.com/apache/jena/blob/master/jena-db/use-fuseki-tdb2.md
https://github.com/apache/jena/blob/master/jena-db/use-tdb2-cmds.md

The commands are in the binary distribution "apache-jena" download but 
there are no script wrappers (easy to copy and fix though).

Either run from development or

java -cp 'DIR/lib/*' tdb2.tdbloader ... args ...

 > some of my data files are too big to
 > be loaded via the Graph Store API.

From TDB2 and Fuseki's point of view, that's no longer true.
You can (should be able to) load any amount.

The fuseki-basic server also has TDB2 in it so if you are doing 
everything script-driven, you can run that "--conf config-tdb2.ttl"

There is no progress indicator in the server log so you may wish to set 
some kind of verbose option in the sender.

     Andy

Uploading large files:

The UI does this all quite well.

What's the magic for a command line/scripted process?

It needs a tool that does not buffer or inspect the file or otherwise 
try to be helpful.

Anyone know of good tools for this?

I haven't managed to work out which set of "curl" arguments do this 
without buffering the file (--data* seem to buffer the file; -F is a 
form upload, not pure POST).

This seems to work:

wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 
'Content-type: application/n-triples' http://localhost:3030/data

200M BSBM (49Gbytes) loaded at 42K triples/s.

The content length in the Fuseki log is reported wrongly (1002691465 ... 
int/long error) but the triple count is right.

It does ruin the interactive performance of the machine!

s-post crashes immediately if given a large file - don't know why.

On 06/10/17 07:50, Osma Suominen wrote:
> Excellent!
> 
> I have a couple of Fuseki installations where I could test drive this. 
> I'd just need to know how to do the configuration, and also a tool like 
> tdbloader for offline loading since some of my data files are too big to 
> be loaded via the Graph Store API.
> 
> No hurry though.
> 
> -Osma
> 
> 
> Andy Seaborne kirjoitti 04.10.2017 klo 00:43:
>> It's in the build joined in at apache-jena-libs.
>>
>> It is in Fuseki2 server jar, but not the UI - a user needs to use a 
>> configuration file. That also works in fuseki-basic.
>>
>> Documentation to follow.
>>
>>     Andy
> 
> 

Re: TDB2 merged

Posted by Osma Suominen <os...@helsinki.fi>.
Excellent!

I have a couple of Fuseki installations where I could test drive this. 
I'd just need to know how to do the configuration, and also a tool like 
tdbloader for offline loading since some of my data files are too big to 
be loaded via the Graph Store API.

No hurry though.

-Osma


Andy Seaborne kirjoitti 04.10.2017 klo 00:43:
> It's in the build joined in at apache-jena-libs.
> 
> It is in Fuseki2 server jar, but not the UI - a user needs to use a 
> configuration file. That also works in fuseki-basic.
> 
> Documentation to follow.
> 
>     Andy


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi