Posted to dev@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2017/11/16 07:53:55 UTC

Re: TDB2 testing Re: TDB2 merged

Hi all (especially Andy)

I'd appreciate some kind of response to the questions at the end of the 
below message. I can open the JIRA issues if necessary. I realize this 
was posted during the hectic release process.

I have done some further testing of TDB & TDB2 disk space usage and will 
write a separate message about that, probably this time on users@

-Osma

Osma Suominen wrote on 27.10.2017 at 13:44:
> Hi,
> 
> As I promised earlier, I took TDB2 for a little test drive, using the 
> 3.5.0rc1 builds.
> 
> I tested two scenarios: A server running Fuseki, and command line tools 
> operating directly on a database directory.
> 
> 1. Server running Fuseki
> 
> First the server (running as a VM). Currently I've been using Fuseki 
> with HDT support, from the hdt-java repository. I'm serving a dataset of 
> about 39M triples, which occasionally changes (eventually this will be 
> updated once per month, or perhaps more frequently, even once per day). 
> With HDT, I can simply rebuild the HDT file (less than 10 minutes) and 
> then restart Fuseki. Downtime for the endpoint is only a few seconds. 
> But I'm worried about the state of the hdt-java project: it is not 
> actively maintained and it's still based on Fuseki1.
> 
> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth 
> thanks to the documentation that Andy provided. I usually create Fuseki2 
> datasets via the API (using curl), but I noticed that, like the UI, the 
> API only supports "mem" and "tdb". So I created a "tdb" dataset first, 
> then edited the configuration file so it uses tdb2 instead.
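> 
> For reference, the calls looked roughly like this (dataset name and 
> file paths are placeholders):
> 
>     # create the dataset as plain TDB via the Fuseki admin API
>     curl --data 'dbName=mydataset&dbType=tdb' 'http://localhost:3030/$/datasets'
>     # then hand-edit run/configuration/mydataset.ttl to the TDB2 assembler:
>     # tdb:DatasetTDB -> tdb2:DatasetTDB2, tdb:location -> tdb2:location,
>     # with the tdb2: prefix bound to <http://jena.apache.org/2016/tdb#>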
> 
> Loading the data took about 17 minutes. I used wget for this, per Andy's 
> example. This is a bit slower than regenerating the HDT, but acceptable 
> since I'm only doing it occasionally. I also tested executing queries 
> while reloading the data. This seemed to work OK even though performance 
> obviously did suffer. But at least the endpoint remained up.
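> 
> The load itself, adapted from Andy's wget example (quoted further down 
> in this thread); file and endpoint names are placeholders:
> 
>     wget --post-file=dataset.nt \
>          --header 'Content-type: application/n-triples' \
>          'http://localhost:3030/mydataset/data'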
> 
> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index 
> for the same data is 560MB.
> 
> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost 
> twice its original size. I understand that TDB2 needs to be 
> compacted regularly, otherwise it will keep growing. I'm OK with the 
> large disk space usage if it's constant, not growing over time like TDB1.
> 
> 2. Command line tools
> 
> For this I used an older version of the same dataset with 30M triples, 
> the same one I used for my HDT vs TDB comparison that I posted on the 
> users mailing list:
> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E 
> 
> 
> This was on my i3-2330M laptop with 8GB RAM and SSD.
> 
> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k 
> triples per second). The TDB2 directory is 3.7GB. In contrast, using 
> tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So 
> TDB2 is slower to load and takes more disk space than TDB.
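> 
> The two invocations, roughly (database directory names are 
> placeholders):
> 
>     tdb2.tdbloader --loc DB2 dataset.nt   # TDB2 loader
>     tdbloader2 --loc DB1 dataset.nt       # TDB1 bulk loader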
> 
> I ran the same example query I used before against the TDB2 store. The first time 
> was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
> 
> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The 
> query took 13.7-14.0 seconds after the first run (24 seconds).
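> 
> Both timings came from the command-line query tools, roughly (query 
> file name is a placeholder):
> 
>     tdb2.tdbquery --loc DB2 --query query.rq   # TDB2
>     tdbquery --loc DB1 --query query.rq        # TDB1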
> 
> I also reloaded the same data into the TDB2 store to see the effect. Reloading 
> took 11 minutes and the database grew to 5.7GB. Then I compacted it 
> using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage 
> just grew further, to 9.7GB. The database directory then contained both 
> Data-0001 and Data-0002 directories. I removed Data-0001 and disk usage 
> fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
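> 
> In shell terms, roughly:
> 
>     tdb2.tdbcompact --loc DB2
>     # compaction writes a new generation (Data-0002) next to the old
>     # one; the stale generation has to be removed by hand:
>     rm -r DB2/Data-0001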
> 
> My impressions so far: It works, but it's slower than TDB and needs more 
> disk space. Compaction seems to work, but initially it will just 
> increase disk usage. The stale data has to be manually removed to 
> actually reclaim any space. I didn't test subsequent load/compact 
> cycles, but I assume there may still be some disk space growth (e.g. due 
> to blank nodes, of which there are some in my dataset) even if the data 
> is regularly compacted.
> 
> For me, not growing over time the way TDB does is really the crucial feature 
> that TDB2 seems to promise. Right now it's not clear whether it entirely 
> fulfills this promise, since compaction needs to be done manually and 
> doesn't actually reclaim disk space by itself.
> 
> Questions/suggestions:
> 
> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
> prefer not to take the endpoint down for compaction.
> 
> 2. Should the stale data be deleted after compaction, at least as an 
> option?
> 
> 3. Should there be a JIRA issue about UI and API support for creating 
> TDB2 datasets?
> 
> 4. Should there be a JIRA issue about the bad Content-Length values 
> reported by Fuseki?
> 
> -Osma
> 


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: bad Fuseki Content-Length value Re: TDB2 testing

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 15:22, Osma Suominen wrote:
...

>> Please separate this out into another email - what is the problem and 
>> does it apply to the current codebase?
> 
> Sorry I wasn't clear. This is something you mentioned yourself in 
> another e-mail on 2017-10-06 about how to load large files into Fuseki 
> with TDB2:
> 
>> This seems to work:
>>
>> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 
>> 'Content-type: application/n-triples' http://localhost:3030/data
>>
>> 200M BSBM (49Gbytes) loaded at 42K triples/s.
>>
>> The content length in the Fuseki log is reported wrongly (1002691465 
>> ... int/long error) but the triple count is right. 

Now fixed.

> 
> The only connection to TDB2 is that with TDB1, transaction sizes were 
> limited, so I guess that the overflow situation never happened. With 
> TDB2 you can now push very large files into Fuseki (yay!), but this 
> exposes the problem. It's a very minor issue at least to me. I'm more 
> interested in the other questions - especially if it's possible to 
> maintain a Fuseki endpoint with a TDB2 store, occasionally pushing new 
> data but not filling the disk doing so.

It was an int/long bug, so it happens at 2G.
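
A minimal illustration with a made-up size: a Java int keeps only the 
low 32 bits of the true length, so a log line built from an int shows 
garbage for any body over 2G. In shell arithmetic:

    SIZE=52613349376               # hypothetical upload of exactly 49 GiB
    echo $(( SIZE & 0xFFFFFFFF ))  # the low 32 bits an int would keep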

(there is an equivalent problem in Apache Commons FileUpload - but 
fortunately in a place that Jena does not use when called from the UI. 
It's unfixable in FileUpload. If you say "getString()" then the string 
is limited to 2G characters because Java strings are char[] arrays.)

A 2G file loaded into TDB1 takes a few G of RAM at maximum, and isn't near the size 
limits for Fuseki. Fuseki uses TDB cautiously and further restricts the 
delayed work queue.

    Andy

> 
> -Osma
> 

bad Fuseki Content-Length value Re: TDB2 testing

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!

Andy Seaborne wrote on 16.11.2017 at 15:54:
> I am weeks behind on email.

No problem with that. I just noticed that other TDB2 issues and comments 
were discussed on JIRA and on users@ while this one wasn't. I figured it 
might have fallen through the cracks.

>  > 4. Should there be a JIRA issue about the bad Content-Length values 
> reported by Fuseki?
> 
> I don't see any connection to TDB2.
> 
> Please separate this out into another email - what is the problem and 
> does it apply to the current codebase?

Sorry I wasn't clear. This is something you mentioned yourself in 
another e-mail on 2017-10-06 about how to load large files into Fuseki 
with TDB2:

> This seems to work:
> 
> wget --post-file=/home/afs/Datasets/BSBM/bsbm-200m.nt --header 'Content-type: application/n-triples' http://localhost:3030/data
> 
> 200M BSBM (49Gbytes) loaded at 42K triples/s.
> 
> The content length in the Fuseki log is reported wrongly (1002691465 ... int/long error) but the triple count is right. 

> The only connection to TDB2 is that with TDB1, transaction sizes were 
limited, so I guess that the overflow situation never happened. With 
TDB2 you can now push very large files into Fuseki (yay!), but this 
exposes the problem. It's a very minor issue at least to me. I'm more 
interested in the other questions - especially if it's possible to 
maintain a Fuseki endpoint with a TDB2 store, occasionally pushing new 
data but not filling the disk doing so.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB2 testing Re: TDB2 merged

Posted by Andy Seaborne <an...@apache.org>.
On 16/11/17 07:53, Osma Suominen wrote:
> Hi all (especially Andy)

I am weeks behind on email.

 > 4. Should there be a JIRA issue about the bad Content-Length values 
reported by Fuseki?

I don't see any connection to TDB2.

Please separate this out into another email - what is the problem and 
does it apply to the current codebase?

(NB I have encountered some tools getting the Content-Length wrong.)

     Andy

...