Posted to users@jena.apache.org by Bartalus Gáspár <ba...@codespring.ro.INVALID> on 2022/07/06 08:36:08 UTC

Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Hi Lorenz,

Thanks for the quick feedback and the clarification on Lucene indexes.

Here are my answers to your questions:
- We are uploading 7 ttl files to our dataset, where one is larger, at around 6 MB, and the others are below 200 KB.
- The overall number of triples after data upload is ~150,000.
- We have around 10 SPARQL UPDATE queries that are executed frequently, i.e. every 5 seconds, and another 5 that are executed every minute. Most of the time they do not produce any change, i.e. the dataset is not altered; when they do, only a couple of triples are added to the dataset.
- These *.dat files start at ~10 MB in size, and after a day or so some of them grow to ~10 GB.

We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.

Best regards,
Gaspar



> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
> 
> Hi and welcome Gaspar.
> 
> 
> Those files do contain the node tables.
> 
> A Lucene index is never computed by default and would be contained in Lucene specific index files.
> 
> 
> Can you give some details about the
> 
> - size of the files
> - the number of triples
> - the number triples added/removed/changed
> - the frequency of updates
> - how much the files grow
> - what kind of data you insert? Lots of blank nodes? Or literals?
> 
> Also, did you try a compact operation during time?
> 
> Lorenz
> 
> On 06.07.22 09:40, Bartalus Gáspár wrote:
>> Hi Jena support team,
>> 
>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>> 
>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>> 
>> Our usage of Jena Fuseki:
>> 
>> * Start the server with `fuseki-server --port 3030`
>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>> * Upload ttl files with HTTP POST to /db_name/data
>> 
>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>> 
>> Best regards,
>> Gaspar Bartalus
>> 
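
For reference, the setup steps quoted above look roughly like this with curl (a sketch only; the server address and the file name are assumptions):

# Create an active TDB2 dataset named db_name via the admin protocol.
curl -X POST 'http://localhost:3030/$/datasets?state=active&dbType=tdb2&dbName=db_name'

# Upload a Turtle file into the dataset's default graph.
curl -X POST -H 'Content-Type: text/turtle' --data-binary @data.ttl 'http://localhost:3030/db_name/data'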


Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Andy Seaborne <an...@apache.org>.
Hi Gaspar,

(this is Jena 4.6.1?)

Not something I recall seeing before.

Are new Data-000_ directories being created?
What's in the log file about backups?

Backups are serialized to one at a time per dataset.

     Andy


On 22/11/2022 14:18, Bartalus Gáspár wrote:
> Hi Andy,
> 
> We’ve just started to run the compaction on our database, but we are encountering that compaction doesn’t always complete.
> Any ideas what could cause this behaviour?
> 
> We’re executing the http POST request /$/compact/database_name?deleteOld=true
> 
> and subsequently checking the tasks with /$/tasks.
> 
> Here is a sample output for the latter, showing that the first 3 attempts succeeded in ~6 seconds, but the attempts 4, 5 and 6 are not completing.
> 
> [ {
>      "task" : "Compact" ,
>      "taskId" : "4" ,
>      "started" : "2022-11-22T14:02:55.216+00:00"
>    } ,
>    {
>      "task" : "Compact" ,
>      "taskId" : "5" ,
>      "started" : "2022-11-22T14:05:21.628+00:00"
>    } ,
>    {
>      "task" : "Compact" ,
>      "taskId" : "6" ,
>      "started" : "2022-11-22T14:08:17.003+00:00"
>    } ,
>    {
>      "task" : "Compact" ,
>      "taskId" : "1" ,
>      "started" : "2022-11-22T13:51:53.060+00:00" ,
>      "finished" : "2022-11-22T13:51:59.105+00:00" ,
>      "success" : true
>    } ,
>    {
>      "task" : "Compact" ,
>      "taskId" : "2" ,
>      "started" : "2022-11-22T13:54:52.477+00:00" ,
>      "finished" : "2022-11-22T13:54:58.241+00:00" ,
>      "success" : true
>    } ,
>    {
>      "task" : "Compact" ,
>      "taskId" : "3" ,
>      "started" : "2022-11-22T14:01:33.245+00:00" ,
>      "finished" : "2022-11-22T14:01:40.070+00:00" ,
>      "success" : true
>    }
> ]
> 
> Thanks in advance,
> Gaspar
> 
>> On 14 Jul 2022, at 19:36, Andy Seaborne <an...@apache.org> wrote:
>>
>>
>>
>> On 07/07/2022 16:19, Lorenz Buehmann wrote:
>>> I think we should wait for Andy here with further input as he's the persons who basically designed and implemented all the fancy stuff and knows better advice for sure.
>>> @Andy Did you read the whole discussion and can you verify that it's expected behavior that lot's of daily updates lead to such a big growth of the node table files?
>>
>> Sorry for the delay.
>>
>> There is no Lucene index by default.
>>
>> SPO.dat is not nodes table related - it is the base level of the SPO B+Tree. SPO.idn is the tree above the base level and SPO.bpt keeps the pointers to the root block and some size information.
>>
>> The issue looks to be the large numbers of small updates. TDB2 used a copy-by-write MVCC scheme which means transactions can proceed without needing latches (database locks) but has the consequence of needing compaction.
>>
>> TDB1 with Fuseki is worth a try. It does not use the scheme.  It does grow but much more slowly. It is limited in the size of updates it can handle but the limit is no where near what you describe.
>>
>> Also worth trying is compaction and deletion
>>
>> /$/compact/db_name?deleteOld=true
>>
>> which will delete the old database after compaction (only the one just compacted. Old ones can be manually deleted).
>>
>>     Andy
>>
>>> On 07.07.22 10:53, Bartalus Gáspár wrote:
>>>> Hi Lorenz,
>>>>
>>>> Would you recommend using tdb1 instead of tdb2 for our use case? What would be the differences?
>>>> We are using fuseki 4.5.0 btw.
>>>>
>>>> Gaspar
>>>>
>>>>> On 6 Jul 2022, at 14:39, Bartalus Gáspár <ba...@codespring.ro.INVALID> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Most of the updates are DELETE/INSERT queries, i.e
>>>>>
>>>>> DELETE {?s ?p ?oldValue}
>>>>> INSERT {?s ?p ?newValue}
>>>>> WHERE {
>>>>>    OPTIONAL {?s ?p ?oldValue}
>>>>>    #derive ?newValue from somewhere
>>>>> }
>>>>>
>>>>> We also have some separate DELETE queries and INSERT queries.
>>>>>
>>>>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
>>>>> - databases/db_name/Data-0001 - with the old large files
>>>>> - databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.
>>>>>
>>>>> Is there also some operation (http or cli) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?
>>>>>
>>>>> Gaspar
>>>>>
>>>>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>>>
>>>>>> Ok, interesting
>>>>>>
>>>>>> so
>>>>>>
>>>>>> we have
>>>>>>
>>>>>> - 150k triples, rather small dataset
>>>>>>
>>>>>> - loaded into 10MB node table files
>>>>>>
>>>>>> - 10 updates every 5s
>>>>>>
>>>>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>>>>
>>>>>> - and leads to 10GB node table files
>>>>>>
>>>>>>
>>>>>> Can you share the shape of those update queries?
>>>>>>
>>>>>>
>>>>>> After doing a "compact" operation, the files are getting back to "normal" size?
>>>>>>
>>>>>>
>>>>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>>>>> Hi Lorenz,
>>>>>>>
>>>>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>>>>>
>>>>>>> Here are my answers to your questions:
>>>>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
>>>>>>> - The overall number of triples after data upload is  ~150000.
>>>>>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
>>>>>>> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>>>>>>>
>>>>>>> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Gaspar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>>>>>
>>>>>>>> Hi and welcome Gaspar.
>>>>>>>>
>>>>>>>>
>>>>>>>> Those files do contain the node tables.
>>>>>>>>
>>>>>>>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>>>>>>>
>>>>>>>>
>>>>>>>> Can you give some details about the
>>>>>>>>
>>>>>>>> - size of the files
>>>>>>>> - the number of triples
>>>>>>>> - the number triples added/removed/changed
>>>>>>>> - the frequency of updates
>>>>>>>> - how much the files grow
>>>>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>>>>
>>>>>>>> Also, did you try a compact operation during time?
>>>>>>>>
>>>>>>>> Lorenz
>>>>>>>>
>>>>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>>>>> Hi Jena support team,
>>>>>>>>>
>>>>>>>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>>>>>>>
>>>>>>>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>>>>>>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>>>>>>>
>>>>>>>>> Our usage of Jena Fuseki:
>>>>>>>>>
>>>>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>>>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>>>>
>>>>>>>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Gaspar Bartalus
>>>>>>>>>
> 

Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Bartalus Gáspár <ba...@codespring.ro.INVALID>.
Hi Andy,

We’ve just started to run compaction on our database, but we are finding that it doesn’t always complete.
Any ideas what could cause this behaviour?

We’re executing the http POST request /$/compact/database_name?deleteOld=true

and subsequently checking the tasks with /$/tasks.
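
For reference, a minimal curl sketch of these two requests (assuming the server is at http://localhost:3030):

# Start an asynchronous compaction, deleting the old generation when it finishes.
curl -X POST 'http://localhost:3030/$/compact/database_name?deleteOld=true'

# List all asynchronous tasks and their status.
curl 'http://localhost:3030/$/tasks'

# Or look up a single task by id, e.g. task 4.
curl 'http://localhost:3030/$/tasks/4'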

Here is a sample output for the latter, showing that the first 3 attempts each succeeded in ~6 seconds, but attempts 4, 5 and 6 are not completing.

[ { 
    "task" : "Compact" ,
    "taskId" : "4" ,
    "started" : "2022-11-22T14:02:55.216+00:00"
  } ,
  { 
    "task" : "Compact" ,
    "taskId" : "5" ,
    "started" : "2022-11-22T14:05:21.628+00:00"
  } ,
  { 
    "task" : "Compact" ,
    "taskId" : "6" ,
    "started" : "2022-11-22T14:08:17.003+00:00"
  } ,
  { 
    "task" : "Compact" ,
    "taskId" : "1" ,
    "started" : "2022-11-22T13:51:53.060+00:00" ,
    "finished" : "2022-11-22T13:51:59.105+00:00" ,
    "success" : true
  } ,
  { 
    "task" : "Compact" ,
    "taskId" : "2" ,
    "started" : "2022-11-22T13:54:52.477+00:00" ,
    "finished" : "2022-11-22T13:54:58.241+00:00" ,
    "success" : true
  } ,
  { 
    "task" : "Compact" ,
    "taskId" : "3" ,
    "started" : "2022-11-22T14:01:33.245+00:00" ,
    "finished" : "2022-11-22T14:01:40.070+00:00" ,
    "success" : true
  }
]

Thanks in advance,
Gaspar

> On 14 Jul 2022, at 19:36, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> 
> On 07/07/2022 16:19, Lorenz Buehmann wrote:
>> I think we should wait for Andy here with further input as he's the persons who basically designed and implemented all the fancy stuff and knows better advice for sure.
>> @Andy Did you read the whole discussion and can you verify that it's expected behavior that lot's of daily updates lead to such a big growth of the node table files?
> 
> Sorry for the delay.
> 
> There is no Lucene index by default.
> 
> SPO.dat is not nodes table related - it is the base level of the SPO B+Tree. SPO.idn is the tree above the base level and SPO.bpt keeps the pointers to the root block and some size information.
> 
> The issue looks to be the large numbers of small updates. TDB2 used a copy-by-write MVCC scheme which means transactions can proceed without needing latches (database locks) but has the consequence of needing compaction.
> 
> TDB1 with Fuseki is worth a try. It does not use the scheme.  It does grow but much more slowly. It is limited in the size of updates it can handle but the limit is no where near what you describe.
> 
> Also worth trying is compaction and deletion
> 
> /$/compact/db_name?deleteOld=true
> 
> which will delete the old database after compaction (only the one just compacted. Old ones can be manually deleted).
> 
>    Andy
> 
>> On 07.07.22 10:53, Bartalus Gáspár wrote:
>>> Hi Lorenz,
>>> 
>>> Would you recommend using tdb1 instead of tdb2 for our use case? What would be the differences?
>>> We are using fuseki 4.5.0 btw.
>>> 
>>> Gaspar
>>> 
>>>> On 6 Jul 2022, at 14:39, Bartalus Gáspár <ba...@codespring.ro.INVALID> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Most of the updates are DELETE/INSERT queries, i.e
>>>> 
>>>> DELETE {?s ?p ?oldValue}
>>>> INSERT {?s ?p ?newValue}
>>>> WHERE {
>>>>   OPTIONAL {?s ?p ?oldValue}
>>>>   #derive ?newValue from somewhere
>>>> }
>>>> 
>>>> We also have some separate DELETE queries and INSERT queries.
>>>> 
>>>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
>>>> - databases/db_name/Data-0001 - with the old large files
>>>> - databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.
>>>> 
>>>> Is there also some operation (http or cli) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?
>>>> 
>>>> Gaspar
>>>> 
>>>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>> 
>>>>> Ok, interesting
>>>>> 
>>>>> so
>>>>> 
>>>>> we have
>>>>> 
>>>>> - 150k triples, rather small dataset
>>>>> 
>>>>> - loaded into 10MB node table files
>>>>> 
>>>>> - 10 updates every 5s
>>>>> 
>>>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>>> 
>>>>> - and leads to 10GB node table files
>>>>> 
>>>>> 
>>>>> Can you share the shape of those update queries?
>>>>> 
>>>>> 
>>>>> After doing a "compact" operation, the files are getting back to "normal" size?
>>>>> 
>>>>> 
>>>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>>>> Hi Lorenz,
>>>>>> 
>>>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>>>> 
>>>>>> Here are my answers to your questions:
>>>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
>>>>>> - The overall number of triples after data upload is  ~150000.
>>>>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
>>>>>> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>>>>>> 
>>>>>> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>>>>>> 
>>>>>> Best regards,
>>>>>> Gaspar
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>>>> 
>>>>>>> Hi and welcome Gaspar.
>>>>>>> 
>>>>>>> 
>>>>>>> Those files do contain the node tables.
>>>>>>> 
>>>>>>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>>>>>> 
>>>>>>> 
>>>>>>> Can you give some details about the
>>>>>>> 
>>>>>>> - size of the files
>>>>>>> - the number of triples
>>>>>>> - the number triples added/removed/changed
>>>>>>> - the frequency of updates
>>>>>>> - how much the files grow
>>>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>>> 
>>>>>>> Also, did you try a compact operation during time?
>>>>>>> 
>>>>>>> Lorenz
>>>>>>> 
>>>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>>>> Hi Jena support team,
>>>>>>>> 
>>>>>>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>>>>>> 
>>>>>>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>>>>>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>>>>>> 
>>>>>>>> Our usage of Jena Fuseki:
>>>>>>>> 
>>>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>>> 
>>>>>>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Gaspar Bartalus
>>>>>>>> 


Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Bartalus Gáspár <ba...@codespring.ro.INVALID>.
Hi Andy & Lorenz,

Thanks for the clarification and support.

Best regards,
Gaspar

> On 14 Jul 2022, at 19:36, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> 
> On 07/07/2022 16:19, Lorenz Buehmann wrote:
>> I think we should wait for Andy here with further input as he's the persons who basically designed and implemented all the fancy stuff and knows better advice for sure.
>> @Andy Did you read the whole discussion and can you verify that it's expected behavior that lot's of daily updates lead to such a big growth of the node table files?
> 
> Sorry for the delay.
> 
> There is no Lucene index by default.
> 
> SPO.dat is not nodes table related - it is the base level of the SPO B+Tree. SPO.idn is the tree above the base level and SPO.bpt keeps the pointers to the root block and some size information.
> 
> The issue looks to be the large numbers of small updates. TDB2 used a copy-by-write MVCC scheme which means transactions can proceed without needing latches (database locks) but has the consequence of needing compaction.
> 
> TDB1 with Fuseki is worth a try. It does not use the scheme.  It does grow but much more slowly. It is limited in the size of updates it can handle but the limit is no where near what you describe.
> 
> Also worth trying is compaction and deletion
> 
> /$/compact/db_name?deleteOld=true
> 
> which will delete the old database after compaction (only the one just compacted. Old ones can be manually deleted).
> 
>    Andy
> 
>> On 07.07.22 10:53, Bartalus Gáspár wrote:
>>> Hi Lorenz,
>>> 
>>> Would you recommend using tdb1 instead of tdb2 for our use case? What would be the differences?
>>> We are using fuseki 4.5.0 btw.
>>> 
>>> Gaspar
>>> 
>>>> On 6 Jul 2022, at 14:39, Bartalus Gáspár <ba...@codespring.ro.INVALID> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Most of the updates are DELETE/INSERT queries, i.e
>>>> 
>>>> DELETE {?s ?p ?oldValue}
>>>> INSERT {?s ?p ?newValue}
>>>> WHERE {
>>>>   OPTIONAL {?s ?p ?oldValue}
>>>>   #derive ?newValue from somewhere
>>>> }
>>>> 
>>>> We also have some separate DELETE queries and INSERT queries.
>>>> 
>>>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
>>>> - databases/db_name/Data-0001 - with the old large files
>>>> - databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.
>>>> 
>>>> Is there also some operation (http or cli) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?
>>>> 
>>>> Gaspar
>>>> 
>>>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>> 
>>>>> Ok, interesting
>>>>> 
>>>>> so
>>>>> 
>>>>> we have
>>>>> 
>>>>> - 150k triples, rather small dataset
>>>>> 
>>>>> - loaded into 10MB node table files
>>>>> 
>>>>> - 10 updates every 5s
>>>>> 
>>>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>>> 
>>>>> - and leads to 10GB node table files
>>>>> 
>>>>> 
>>>>> Can you share the shape of those update queries?
>>>>> 
>>>>> 
>>>>> After doing a "compact" operation, the files are getting back to "normal" size?
>>>>> 
>>>>> 
>>>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>>>> Hi Lorenz,
>>>>>> 
>>>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>>>> 
>>>>>> Here are my answers to your questions:
>>>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
>>>>>> - The overall number of triples after data upload is  ~150000.
>>>>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
>>>>>> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>>>>>> 
>>>>>> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>>>>>> 
>>>>>> Best regards,
>>>>>> Gaspar
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>>>> 
>>>>>>> Hi and welcome Gaspar.
>>>>>>> 
>>>>>>> 
>>>>>>> Those files do contain the node tables.
>>>>>>> 
>>>>>>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>>>>>> 
>>>>>>> 
>>>>>>> Can you give some details about the
>>>>>>> 
>>>>>>> - size of the files
>>>>>>> - the number of triples
>>>>>>> - the number triples added/removed/changed
>>>>>>> - the frequency of updates
>>>>>>> - how much the files grow
>>>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>>> 
>>>>>>> Also, did you try a compact operation during time?
>>>>>>> 
>>>>>>> Lorenz
>>>>>>> 
>>>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>>>> Hi Jena support team,
>>>>>>>> 
>>>>>>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>>>>>> 
>>>>>>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>>>>>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>>>>>> 
>>>>>>>> Our usage of Jena Fuseki:
>>>>>>>> 
>>>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>>> 
>>>>>>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Gaspar Bartalus
>>>>>>>> 


Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Andy Seaborne <an...@apache.org>.

On 07/07/2022 16:19, Lorenz Buehmann wrote:
> I think we should wait for Andy here with further input as he's the 
> persons who basically designed and implemented all the fancy stuff and 
> knows better advice for sure.
> 
> @Andy Did you read the whole discussion and can you verify that it's 
> expected behavior that lot's of daily updates lead to such a big growth 
> of the node table files?

Sorry for the delay.

There is no Lucene index by default.

SPO.dat is not node table related - it is the base level of the SPO 
B+Tree. SPO.idn is the tree above the base level, and SPO.bpt keeps the 
pointers to the root block and some size information.

The issue looks to be the large number of small updates. TDB2 uses a 
copy-on-write MVCC scheme, which means transactions can proceed without 
needing latches (database locks) but has the consequence of needing 
compaction.

TDB1 with Fuseki is worth a try. It does not use that scheme. It does 
grow, but much more slowly. It is limited in the size of updates it can 
handle, but the limit is nowhere near what you describe.

Also worth trying is compaction with deletion:

/$/compact/db_name?deleteOld=true

which will delete the old database after compaction (only the one just 
compacted; older ones can be deleted manually).
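
As a rough sketch of that manual clean-up (assuming the default 
databases/ layout, that Data-0002 is the current generation, and that 
the server is stopped or nothing still reads the old directory):

# Keep the highest-numbered Data-NNNN directory; remove a superseded one.
rm -rf databases/db_name/Data-0001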

     Andy

> 
> On 07.07.22 10:53, Bartalus Gáspár wrote:
>> Hi Lorenz,
>>
>> Would you recommend using tdb1 instead of tdb2 for our use case? What 
>> would be the differences?
>> We are using fuseki 4.5.0 btw.
>>
>> Gaspar
>>
>>> On 6 Jul 2022, at 14:39, Bartalus Gáspár 
>>> <ba...@codespring.ro.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> Most of the updates are DELETE/INSERT queries, i.e
>>>
>>> DELETE {?s ?p ?oldValue}
>>> INSERT {?s ?p ?newValue}
>>> WHERE {
>>>   OPTIONAL {?s ?p ?oldValue}
>>>   #derive ?newValue from somewhere
>>> }
>>>
>>> We also have some separate DELETE queries and INSERT queries.
>>>
>>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are 
>>> getting back to normal size. However, as far as I can tell the old 
>>> files are also kept. This is the folder structure I see:
>>> - databases/db_name/Data-0001 - with the old large files
>>> - databases/db_name/Data-0002 - presumably the result of the compact 
>>> operation with normal file sizes.
>>>
>>> Is there also some operation (http or cli) that would keep only one 
>>> (the latest) data folder, i.e. delete the old files from Data-0001?
>>>
>>> Gaspar
>>>
>>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann 
>>>> <bu...@informatik.uni-leipzig.de> wrote:
>>>>
>>>> Ok, interesting
>>>>
>>>> so
>>>>
>>>> we have
>>>>
>>>> - 150k triples, rather small dataset
>>>>
>>>> - loaded into 10MB node table files
>>>>
>>>> - 10 updates every 5s
>>>>
>>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>>
>>>> - and leads to 10GB node table files
>>>>
>>>>
>>>> Can you share the shape of those update queries?
>>>>
>>>>
>>>> After doing a "compact" operation, the files are getting back to 
>>>> "normal" size?
>>>>
>>>>
>>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>>> Hi Lorenz,
>>>>>
>>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>>>
>>>>> Here are my answers to your questions:
>>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 
>>>>> 6Mb, the others are below 200Kb.
>>>>> - The overall number of triples after data upload is  ~150000.
>>>>> - We have around 10 SPARQL UPDATE queries that are executed on a 
>>>>> regular and frequent basis, i.e. every 5 seconds. We also have 5 
>>>>> such queries that are executed each minute. But most of the time 
>>>>> they do not produce any outcome, i.e. the dataset is not altered, 
>>>>> and when they do, there are just a couple of triples that are added 
>>>>> to the dataset.
>>>>> - These *.dat files start from ~10Mb in size, and after a day or so 
>>>>> some of them grow to ~10Gb.
>>>>>
>>>>> We have ~300 blank nodes, and ~half of the triples have a literal 
>>>>> in the object position, so ~75000.
>>>>>
>>>>> Best regards,
>>>>> Gaspar
>>>>>
>>>>>
>>>>>
>>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann 
>>>>>> <bu...@informatik.uni-leipzig.de> wrote:
>>>>>>
>>>>>> Hi and welcome Gaspar.
>>>>>>
>>>>>>
>>>>>> Those files do contain the node tables.
>>>>>>
>>>>>> A Lucene index is never computed by default and would be contained 
>>>>>> in Lucene specific index files.
>>>>>>
>>>>>>
>>>>>> Can you give some details about the
>>>>>>
>>>>>> - size of the files
>>>>>> - the number of triples
>>>>>> - the number triples added/removed/changed
>>>>>> - the frequency of updates
>>>>>> - how much the files grow
>>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>>
>>>>>> Also, did you try a compact operation during time?
>>>>>>
>>>>>> Lorenz
>>>>>>
>>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>>> Hi Jena support team,
>>>>>>>
>>>>>>> We are experiencing an issue with Jena Fuseki databases. In the 
>>>>>>> databases folder we see some files called SPO.dat, OSP.dat, etc., 
>>>>>>> and the size of these files are growing quickly. From our 
>>>>>>> understanding these files are containing the Lucene indexes. We 
>>>>>>> would have two questions:
>>>>>>>
>>>>>>> 1. Why are these files growing rapidly, although the underlying 
>>>>>>> data (triples) are not being changed, or only slightly changed?
>>>>>>> 2. Can we disable indexing easily, since we are not using full 
>>>>>>> text searches in our SPARQL queries?
>>>>>>>
>>>>>>> Our usage of Jena Fuseki:
>>>>>>>
>>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>>> * Create databases with HTTP POST to 
>>>>>>> `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>>
>>>>>>> Thanks in advance for your feedback, and if you’d require more 
>>>>>>> input from our side, please let me know.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Gaspar Bartalus
>>>>>>>

Re: Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
I think we should wait for Andy here for further input, as he's the 
person who basically designed and implemented all the fancy stuff and 
certainly has better advice.

@Andy Did you read the whole discussion, and can you verify that it's 
expected behavior that lots of daily updates lead to such big growth 
of the node table files?

On 07.07.22 10:53, Bartalus Gáspár wrote:
> Hi Lorenz,
>
> Would you recommend using tdb1 instead of tdb2 for our use case? What would be the differences?
> We are using fuseki 4.5.0 btw.
>
> Gaspar
>
>> On 6 Jul 2022, at 14:39, Bartalus Gáspár <ba...@codespring.ro.INVALID> wrote:
>>
>> Hi,
>>
>> Most of the updates are DELETE/INSERT queries, i.e
>>
>> DELETE {?s ?p ?oldValue}
>> INSERT {?s ?p ?newValue}
>> WHERE {
>>   OPTIONAL {?s ?p ?oldValue}
>>   #derive ?newValue from somewhere
>> }
>>
>> We also have some separate DELETE queries and INSERT queries.
>>
>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
>> - databases/db_name/Data-0001 - with the old large files
>> - databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.
>>
>> Is there also some operation (http or cli) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?
>>
>> Gaspar
>>
>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>
>>> Ok, interesting
>>>
>>> so
>>>
>>> we have
>>>
>>> - 150k triples, rather small dataset
>>>
>>> - loaded into 10MB node table files
>>>
>>> - 10 updates every 5s
>>>
>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>
>>> - and leads to 10GB node table files
>>>
>>>
>>> Can you share the shape of those update queries?
>>>
>>>
>>> After doing a "compact" operation, the files are getting back to "normal" size?
>>>
>>>
>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>> Hi Lorenz,
>>>>
>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>>
>>>> Here are my answers to your questions:
>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
>>>> - The overall number of triples after data upload is  ~150000.
>>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
>>>> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>>>>
>>>> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>>>>
>>>> Best regards,
>>>> Gaspar
>>>>
>>>>
>>>>
>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>>
>>>>> Hi and welcome Gaspar.
>>>>>
>>>>>
>>>>> Those files do contain the node tables.
>>>>>
>>>>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>>>>
>>>>>
>>>>> Can you give some details about the
>>>>>
>>>>> - size of the files
>>>>> - the number of triples
>>>>> - the number triples added/removed/changed
>>>>> - the frequency of updates
>>>>> - how much the files grow
>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>
>>>>> Also, did you try a compact operation during time?
>>>>>
>>>>> Lorenz
>>>>>
>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>> Hi Jena support team,
>>>>>>
>>>>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>>>>
>>>>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>>>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>>>>
>>>>>> Our usage of Jena Fuseki:
>>>>>>
>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>
>>>>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>>>>
>>>>>> Best regards,
>>>>>> Gaspar Bartalus
>>>>>>

Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Bartalus Gáspár <ba...@codespring.ro.INVALID>.
Hi Lorenz,

Would you recommend using TDB1 instead of TDB2 for our use case? What would be the differences?
We are using Fuseki 4.5.0, btw.

Gaspar

> On 6 Jul 2022, at 14:39, Bartalus Gáspár <ba...@codespring.ro.INVALID> wrote:
> 
> Hi,
> 
> Most of the updates are DELETE/INSERT queries, i.e
> 
> DELETE {?s ?p ?oldValue}
> INSERT {?s ?p ?newValue}
> WHERE {
>  OPTIONAL {?s ?p ?oldValue}
>  #derive ?newValue from somewhere
> }
> 
> We also have some separate DELETE queries and INSERT queries.
> 
> I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
> - databases/db_name/Data-0001 - with the old large files
> - databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.
> 
> Is there also some operation (http or cli) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?
> 
> Gaspar
> 
>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>> 
>> Ok, interesting
>> 
>> so
>> 
>> we have
>> 
>> - 150k triples, rather small dataset
>> 
>> - loaded into 10MB node table files
>> 
>> - 10 updates every 5s
>> 
>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>> 
>> - and leads to 10GB node table files
>> 
>> 
>> Can you share the shape of those update queries?
>> 
>> 
>> After doing a "compact" operation, the files are getting back to "normal" size?
>> 
>> 
>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>> Hi Lorenz,
>>> 
>>> Thanks for quick feedback and clarification on lucene indexes.
>>> 
>>> Here are my answers to your questions:
>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
>>> - The overall number of triples after data upload is  ~150000.
>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
>>> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>>> 
>>> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>>> 
>>> Best regards,
>>> Gaspar
>>> 
>>> 
>>> 
>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>> 
>>>> Hi and welcome Gaspar.
>>>> 
>>>> 
>>>> Those files do contain the node tables.
>>>> 
>>>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>>> 
>>>> 
>>>> Can you give some details about the
>>>> 
>>>> - size of the files
>>>> - the number of triples
>>>> - the number triples added/removed/changed
>>>> - the frequency of updates
>>>> - how much the files grow
>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>> 
>>>> Also, did you try a compact operation during time?
>>>> 
>>>> Lorenz
>>>> 
>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>> Hi Jena support team,
>>>>> 
>>>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>>> 
>>>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>>> 
>>>>> Our usage of Jena Fuseki:
>>>>> 
>>>>> * Start the server with `fuseki-server --port 3030`
>>>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>> 
>>>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>>> 
>>>>> Best regards,
>>>>> Gaspar Bartalus
>>>>> 
> 


Re: Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Hi,


You should open another thread where we can discuss your question; 
please don't mix up threads - it gets confusing.

Also, did you check the SPARQL 1.1 Update W3C documents? They are online 
and have lots of examples.
On 06.07.22 13:50, Dương Hồ wrote:
> DELETE {?s ?p ?oldValue}
> INSERT {?s ?p ?newValue}
> WHERE {
>    OPTIONAL {?s ?p ?oldValue}
>    #derive ?newValue from somewhere
> }
> If i want update 3 triples how to use this formats?
> Can you help me?
>
>
> On Wed, 6 Jul 2022 at 18:39, Bartalus Gáspár
> <ba...@codespring.ro.invalid> wrote:
>
>> Hi,
>>
>> Most of the updates are DELETE/INSERT queries, i.e
>>
>> DELETE {?s ?p ?oldValue}
>> INSERT {?s ?p ?newValue}
>> WHERE {
>>    OPTIONAL {?s ?p ?oldValue}
>>    #derive ?newValue from somewhere
>> }
>>
>> We also have some separate DELETE queries and INSERT queries.
>>
>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are
>> getting back to normal size. However, as far as I can tell the old files
>> are also kept. This is the folder structure I see:
>> - databases/db_name/Data-0001 - with the old large files
>> - databases/db_name/Data-0002 - presumably the result of the compact
>> operation with normal file sizes.
>>
>> Is there also some operation (http or cli) that would keep only one (the
>> latest) data folder, i.e. delete the old files from Data-0001?
>>
>> Gaspar
>>
>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <
>> buehmann@informatik.uni-leipzig.de> wrote:
>>> Ok, interesting
>>>
>>> so
>>>
>>> we have
>>>
>>> - 150k triples, rather small dataset
>>>
>>> - loaded into 10MB node table files
>>>
>>> - 10 updates every 5s
>>>
>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>
>>> - and leads to 10GB node table files
>>>
>>>
>>> Can you share the shape of those update queries?
>>>
>>>
>>> After doing a "compact" operation, the files are getting back to
>> "normal" size?
>>>
>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>> Hi Lorenz,
>>>>
>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>>
>>>> Here are my answers to your questions:
>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb,
>> the others are below 200Kb.
>>>> - The overall number of triples after data upload is  ~150000.
>>>> - We have around 10 SPARQL UPDATE queries that are executed on a
>> regular and frequent basis, i.e. every 5 seconds. We also have 5 such
>> queries that are executed each minute. But most of the time they do not
>> produce any outcome, i.e. the dataset is not altered, and when they do,
>> there are just a couple of triples that are added to the dataset.
>>>> - These *.dat files start from ~10Mb in size, and after a day or so
>> some of them grow to ~10Gb.
>>>> We have ~300 blank nodes, and ~half of the triples have a literal in
>> the object position, so ~75000.
>>>> Best regards,
>>>> Gaspar
>>>>
>>>>
>>>>
>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <
>> buehmann@informatik.uni-leipzig.de> wrote:
>>>>> Hi and welcome Gaspar.
>>>>>
>>>>>
>>>>> Those files do contain the node tables.
>>>>>
>>>>> A Lucene index is never computed by default and would be contained in
>> Lucene specific index files.
>>>>>
>>>>> Can you give some details about the
>>>>>
>>>>> - size of the files
>>>>> - the number of triples
>>>>> - the number triples added/removed/changed
>>>>> - the frequency of updates
>>>>> - how much the files grow
>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>
>>>>> Also, did you try a compact operation during time?
>>>>>
>>>>> Lorenz
>>>>>
>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>> Hi Jena support team,
>>>>>>
>>>>>> We are experiencing an issue with Jena Fuseki databases. In the
>> databases folder we see some files called SPO.dat, OSP.dat, etc., and the
>> size of these files are growing quickly. From our understanding these files
>> are containing the Lucene indexes. We would have two questions:
>>>>>> 1. Why are these files growing rapidly, although the underlying data
>> (triples) are not being changed, or only slightly changed?
>>>>>> 2. Can we disable indexing easily, since we are not using full text
>> searches in our SPARQL queries?
>>>>>> Our usage of Jena Fuseki:
>>>>>>
>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>> * Create databases with HTTP POST to
>> `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>
>>>>>> Thanks in advance for your feedback, and if you’d require more input
>> from our side, please let me know.
>>>>>> Best regards,
>>>>>> Gaspar Bartalus
>>>>>>
>>

Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Bartalus Gáspár <ba...@codespring.ro.INVALID>.
This is a generic shape. A real-world example would be:

DELETE {?subject rdfs:label ?oldLabel}
INSERT {?subject rdfs:label ?newLabel}
WHERE {
  ?subject rdf:type :SomeType .
  ?subject rdfs:label ?oldLabel .
  FILTER(?oldLabel IN ("oldLabel1", "oldLabel2", "oldLabel3"))
  BIND(CONCAT(?oldLabel, "_updated") AS ?newLabel)
}
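
For the question below about updating exactly 3 triples, one possible shape (a sketch only; ex: is a hypothetical prefix and the values are made up) is to enumerate the subjects, predicates and new values with VALUES, so the same DELETE/INSERT pattern touches just those triples:

PREFIX ex: <http://example.org/>

DELETE {?s ?p ?oldValue}
INSERT {?s ?p ?newValue}
WHERE {
  # The three triples to rewrite: subject, predicate, and the new object value.
  VALUES (?s ?p ?newValue) {
    (ex:item1 ex:status "done")
    (ex:item2 ex:status "done")
    (ex:item3 ex:status "in progress")
  }
  # Bind the current value, if any, so it gets deleted.
  OPTIONAL {?s ?p ?oldValue}
}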

> On 6 Jul 2022, at 14:50, Dương Hồ <ko...@gmail.com> wrote:
> 
> DELETE {?s ?p ?oldValue}
> INSERT {?s ?p ?newValue}
> WHERE {
>  OPTIONAL {?s ?p ?oldValue}
>  #derive ?newValue from somewhere
> }
> If i want update 3 triples how to use this formats?
> Can you help me?
> 
> 
> On Wed, 6 Jul 2022 at 18:39, Bartalus Gáspár
> <ba...@codespring.ro.invalid> wrote:
> 
>> Hi,
>> 
>> Most of the updates are DELETE/INSERT queries, i.e
>> 
>> DELETE {?s ?p ?oldValue}
>> INSERT {?s ?p ?newValue}
>> WHERE {
>>  OPTIONAL {?s ?p ?oldValue}
>>  #derive ?newValue from somewhere
>> }
>> 
>> We also have some separate DELETE queries and INSERT queries.
>> 
>> I’ve tried HTTP POST /$/compact/db_name and as a result the files are
>> getting back to normal size. However, as far as I can tell the old files
>> are also kept. This is the folder structure I see:
>> - databases/db_name/Data-0001 - with the old large files
>> - databases/db_name/Data-0002 - presumably the result of the compact
>> operation with normal file sizes.
>> 
>> Is there also some operation (http or cli) that would keep only one (the
>> latest) data folder, i.e. delete the old files from Data-0001?
>> 
>> Gaspar
>> 
>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann <
>> buehmann@informatik.uni-leipzig.de> wrote:
>>> 
>>> Ok, interesting
>>> 
>>> so
>>> 
>>> we have
>>> 
>>> - 150k triples, rather small dataset
>>> 
>>> - loaded into 10MB node table files
>>> 
>>> - 10 updates every 5s
>>> 
>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>> 
>>> - and leads to 10GB node table files
>>> 
>>> 
>>> Can you share the shape of those update queries?
>>> 
>>> 
>>> After doing a "compact" operation, the files are getting back to
>> "normal" size?
>>> 
>>> 
>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>> Hi Lorenz,
>>>> 
>>>> Thanks for quick feedback and clarification on lucene indexes.
>>>> 
>>>> Here are my answers to your questions:
>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb,
>> the others are below 200Kb.
>>>> - The overall number of triples after data upload is  ~150000.
>>>> - We have around 10 SPARQL UPDATE queries that are executed on a
>> regular and frequent basis, i.e. every 5 seconds. We also have 5 such
>> queries that are executed each minute. But most of the time they do not
>> produce any outcome, i.e. the dataset is not altered, and when they do,
>> there are just a couple of triples that are added to the dataset.
>>>> - These *.dat files start from ~10Mb in size, and after a day or so
>> some of them grow to ~10Gb.
>>>> 
>>>> We have ~300 blank nodes, and ~half of the triples have a literal in
>> the object position, so ~75000.
>>>> 
>>>> Best regards,
>>>> Gaspar
>>>> 
>>>> 
>>>> 
>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <
>> buehmann@informatik.uni-leipzig.de> wrote:
>>>>> 
>>>>> Hi and welcome Gaspar.
>>>>> 
>>>>> 
>>>>> Those files do contain the node tables.
>>>>> 
>>>>> A Lucene index is never computed by default and would be contained in
>> Lucene specific index files.
>>>>> 
>>>>> 
>>>>> Can you give some details about the
>>>>> 
>>>>> - size of the files
>>>>> - the number of triples
>>>>> - the number triples added/removed/changed
>>>>> - the frequency of updates
>>>>> - how much the files grow
>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>> 
>>>>> Also, did you try a compact operation during time?
>>>>> 
>>>>> Lorenz
>>>>> 
>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>> Hi Jena support team,
>>>>>> 
>>>>>> We are experiencing an issue with Jena Fuseki databases. In the
>> databases folder we see some files called SPO.dat, OSP.dat, etc., and the
>> size of these files are growing quickly. From our understanding these files
>> are containing the Lucene indexes. We would have two questions:
>>>>>> 
>>>>>> 1. Why are these files growing rapidly, although the underlying data
>> (triples) are not being changed, or only slightly changed?
>>>>>> 2. Can we disable indexing easily, since we are not using full text
>> searches in our SPARQL queries?
>>>>>> 
>>>>>> Our usage of Jena Fuseki:
>>>>>> 
>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>> * Create databases with HTTP POST to
>> `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>> 
>>>>>> Thanks in advance for your feedback, and if you’d require more input
>> from our side, please let me know.
>>>>>> 
>>>>>> Best regards,
>>>>>> Gaspar Bartalus
>>>>>> 
>> 
>> 


Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Dương Hồ <ko...@gmail.com>.
DELETE {?s ?p ?oldValue}
INSERT {?s ?p ?newValue}
WHERE {
  OPTIONAL {?s ?p ?oldValue}
  #derive ?newValue from somewhere
}
If I want to update 3 triples, how do I use this format?
Can you help me?


On Wed, 6 Jul 2022 at 18:39, Bartalus Gáspár
<ba...@codespring.ro.invalid> wrote:

> Hi,
>
> Most of the updates are DELETE/INSERT queries, i.e
>
> DELETE {?s ?p ?oldValue}
> INSERT {?s ?p ?newValue}
> WHERE {
>   OPTIONAL {?s ?p ?oldValue}
>   #derive ?newValue from somewhere
> }
>
> We also have some separate DELETE queries and INSERT queries.
>
> I’ve tried HTTP POST /$/compact/db_name and as a result the files are
> getting back to normal size. However, as far as I can tell the old files
> are also kept. This is the folder structure I see:
> - databases/db_name/Data-0001 - with the old large files
> - databases/db_name/Data-0002 - presumably the result of the compact
> operation with normal file sizes.
>
> Is there also some operation (http or cli) that would keep only one (the
> latest) data folder, i.e. delete the old files from Data-0001?
>
> Gaspar
>
> > On 6 Jul 2022, at 12:52, Lorenz Buehmann <
> buehmann@informatik.uni-leipzig.de> wrote:
> >
> > Ok, interesting
> >
> > so
> >
> > we have
> >
> > - 150k triples, rather small dataset
> >
> > - loaded into 10MB node table files
> >
> > - 10 updates every 5s
> >
> > - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
> >
> > - and leads to 10GB node table files
> >
> >
> > Can you share the shape of those update queries?
> >
> >
> > After doing a "compact" operation, the files are getting back to
> "normal" size?
> >
> >
> > On 06.07.22 10:36, Bartalus Gáspár wrote:
> >> Hi Lorenz,
> >>
> >> Thanks for quick feedback and clarification on lucene indexes.
> >>
> >> Here are my answers to your questions:
> >> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb,
> the others are below 200Kb.
> >> - The overall number of triples after data upload is  ~150000.
> >> - We have around 10 SPARQL UPDATE queries that are executed on a
> regular and frequent basis, i.e. every 5 seconds. We also have 5 such
> queries that are executed each minute. But most of the time they do not
> produce any outcome, i.e. the dataset is not altered, and when they do,
> there are just a couple of triples that are added to the dataset.
> >> - These *.dat files start from ~10Mb in size, and after a day or so
> some of them grow to ~10Gb.
> >>
> >> We have ~300 blank nodes, and ~half of the triples have a literal in
> the object position, so ~75000.
> >>
> >> Best regards,
> >> Gaspar
> >>
> >>
> >>
> >>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <
> buehmann@informatik.uni-leipzig.de> wrote:
> >>>
> >>> Hi and welcome Gaspar.
> >>>
> >>>
> >>> Those files do contain the node tables.
> >>>
> >>> A Lucene index is never computed by default and would be contained in
> Lucene specific index files.
> >>>
> >>>
> >>> Can you give some details about the
> >>>
> >>> - size of the files
> >>> - the number of triples
> >>> - the number triples added/removed/changed
> >>> - the frequency of updates
> >>> - how much the files grow
> >>> - what kind of data you insert? Lots of blank nodes? Or literals?
> >>>
> >>> Also, did you try a compact operation during time?
> >>>
> >>> Lorenz
> >>>
> >>> On 06.07.22 09:40, Bartalus Gáspár wrote:
> >>>> Hi Jena support team,
> >>>>
> >>>> We are experiencing an issue with Jena Fuseki databases. In the
> databases folder we see some files called SPO.dat, OSP.dat, etc., and the
> size of these files are growing quickly. From our understanding these files
> are containing the Lucene indexes. We would have two questions:
> >>>>
> >>>> 1. Why are these files growing rapidly, although the underlying data
> (triples) are not being changed, or only slightly changed?
> >>>> 2. Can we disable indexing easily, since we are not using full text
> searches in our SPARQL queries?
> >>>>
> >>>> Our usage of Jena Fuseki:
> >>>>
> >>>> * Start the server with `fuseki-server --port 3030`
> >>>> * Create databases with HTTP POST to
> `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
> >>>> * Upload ttl files with HTTP POST to /db_name/data
> >>>>
> >>>> Thanks in advance for your feedback, and if you’d require more input
> from our side, please let me know.
> >>>>
> >>>> Best regards,
> >>>> Gaspar Bartalus
> >>>>
>
>

Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Bartalus Gáspár <ba...@codespring.ro.INVALID>.
Hi,

Most of the updates are DELETE/INSERT queries, i.e.:

DELETE {?s ?p ?oldValue}
INSERT {?s ?p ?newValue}
WHERE {
  OPTIONAL {?s ?p ?oldValue}
  #derive ?newValue from somewhere
}

We also have some separate DELETE queries and INSERT queries.

I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
- databases/db_name/Data-0001 - with the old large files
- databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.

Is there also some operation (http or cli) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?

Gaspar

> On 6 Jul 2022, at 12:52, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
> 
> Ok, interesting
> 
> so
> 
> we have
> 
> - 150k triples, rather small dataset
> 
> - loaded into 10MB node table files
> 
> - 10 updates every 5s
> 
> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
> 
> - and leads to 10GB node table files
> 
> 
> Can you share the shape of those update queries?
> 
> 
> After doing a "compact" operation, the files are getting back to "normal" size?
> 
> 
> On 06.07.22 10:36, Bartalus Gáspár wrote:
>> Hi Lorenz,
>> 
>> Thanks for quick feedback and clarification on lucene indexes.
>> 
>> Here are my answers to your questions:
>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
>> - The overall number of triples after data upload is  ~150000.
>> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
>> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>> 
>> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>> 
>> Best regards,
>> Gaspar
>> 
>> 
>> 
>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>> 
>>> Hi and welcome Gaspar.
>>> 
>>> 
>>> Those files do contain the node tables.
>>> 
>>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>> 
>>> 
>>> Can you give some details about the
>>> 
>>> - size of the files
>>> - the number of triples
>>> - the number triples added/removed/changed
>>> - the frequency of updates
>>> - how much the files grow
>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>> 
>>> Also, did you try a compact operation during time?
>>> 
>>> Lorenz
>>> 
>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>> Hi Jena support team,
>>>> 
>>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>> 
>>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>> 
>>>> Our usage of Jena Fuseki:
>>>> 
>>>> * Start the server with `fuseki-server —port 3030`
>>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>> 
>>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>> 
>>>> Best regards,
>>>> Gaspar Bartalus
>>>> 


Re: Re: Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
You can trigger compaction from the CLI via tdb2.tdbcompact (Fuseki needs to be down for that, I think) or, with Fuseki running, via a POST request:

https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html#datasets-and-services
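Concretely, that would look roughly like this (a sketch only — the database location, dataset name and port are assumptions, and the exact CLI flag may differ by Jena version):

  # Offline: compact a TDB2 database directory directly, with Fuseki stopped.
  tdb2.tdbcompact --loc=databases/db_name

  # Online: ask a running Fuseki to compact the dataset via the admin protocol.
  curl -X POST 'http://localhost:3030/$/compact/db_name'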

On 06.07.22 11:52, Lorenz Buehmann wrote:
> Ok, interesting
>
> so
>
> we have
>
> - 150k triples, rather small dataset
>
> - loaded into 10MB node table files
>
> - 10 updates every 5s
>
> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>
> - and leads to 10GB node table files
>
>
> Can you share the shape of those update queries?
>
>
> After doing a "compact" operation, the files are getting back to 
> "normal" size?
>
>
> On 06.07.22 10:36, Bartalus Gáspár wrote:
>> Hi Lorenz,
>>
>> Thanks for quick feedback and clarification on lucene indexes.
>>
>> Here are my answers to your questions:
>> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, 
>> the others are below 200Kb.
>> - The overall number of triples after data upload is  ~150000.
>> - We have around 10 SPARQL UPDATE queries that are executed on a 
>> regular and frequent basis, i.e. every 5 seconds. We also have 5 such 
>> queries that are executed each minute. But most of the time they do 
>> not produce any outcome, i.e. the dataset is not altered, and when 
>> they do, there are just a couple of triples that are added to the 
>> dataset.
>> - These *.dat files start from ~10Mb in size, and after a day or so 
>> some of them grow to ~10Gb.
>>
>> We have ~300 blank nodes, and ~half of the triples have a literal in 
>> the object position, so ~75000.
>>
>> Best regards,
>> Gaspar
>>
>>
>>
>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann 
>>> <bu...@informatik.uni-leipzig.de> wrote:
>>>
>>> Hi and welcome Gaspar.
>>>
>>>
>>> Those files do contain the node tables.
>>>
>>> A Lucene index is never computed by default and would be contained 
>>> in Lucene specific index files.
>>>
>>>
>>> Can you give some details about the
>>>
>>> - size of the files
>>> - the number of triples
>>> - the number triples added/removed/changed
>>> - the frequency of updates
>>> - how much the files grow
>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>
>>> Also, did you try a compact operation during time?
>>>
>>> Lorenz
>>>
>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>> Hi Jena support team,
>>>>
>>>> We are experiencing an issue with Jena Fuseki databases. In the 
>>>> databases folder we see some files called SPO.dat, OSP.dat, etc., 
>>>> and the size of these files are growing quickly. From our 
>>>> understanding these files are containing the Lucene indexes. We 
>>>> would have two questions:
>>>>
>>>> 1. Why are these files growing rapidly, although the underlying 
>>>> data (triples) are not being changed, or only slightly changed?
>>>> 2. Can we disable indexing easily, since we are not using full text 
>>>> searches in our SPARQL queries?
>>>>
>>>> Our usage of Jena Fuseki:
>>>>
>>>> * Start the server with `fuseki-server —port 3030`
>>>> * Create databases with HTTP POST to 
>>>> `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>
>>>> Thanks in advance for your feedback, and if you’d require more 
>>>> input from our side, please let me know.
>>>>
>>>> Best regards,
>>>> Gaspar Bartalus
>>>>

Re: Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Ok, interesting

so

we have

- 150k triples, rather small dataset

- loaded into 10MB node table files

- 10 updates every 5s

- which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day

- and leads to 10GB node table files


Can you share the shape of those update queries?


After doing a "compact" operation, the files are getting back to 
"normal" size?


On 06.07.22 10:36, Bartalus Gáspár wrote:
> Hi Lorenz,
>
> Thanks for quick feedback and clarification on lucene indexes.
>
> Here are my answers to your questions:
> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
> - The overall number of triples after data upload is  ~150000.
> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
>
> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
>
> Best regards,
> Gaspar
>
>
>
>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>
>> Hi and welcome Gaspar.
>>
>>
>> Those files do contain the node tables.
>>
>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>>
>>
>> Can you give some details about the
>>
>> - size of the files
>> - the number of triples
>> - the number triples added/removed/changed
>> - the frequency of updates
>> - how much the files grow
>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>
>> Also, did you try a compact operation during time?
>>
>> Lorenz
>>
>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>> Hi Jena support team,
>>>
>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>>
>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>>
>>> Our usage of Jena Fuseki:
>>>
>>> * Start the server with `fuseki-server —port 3030`
>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>> * Upload ttl files with HTTP POST to /db_name/data
>>>
>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>>
>>> Best regards,
>>> Gaspar Bartalus
>>>

Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Posted by Bartalus Gáspár <ba...@codespring.ro.INVALID>.
The 3 *.dat files that are growing significantly are SPO.dat, OSP.dat and POS.dat, ordered by size.

> On 6 Jul 2022, at 11:36, Bartalus Gáspár <ba...@codespring.ro.INVALID> wrote:
> 
> Hi Lorenz,
> 
> Thanks for quick feedback and clarification on lucene indexes.
> 
> Here are my answers to your questions:
> - We are uploading 7 ttl files to our dataset, where 1 is larger 6Mb, the others are below 200Kb.
> - The overall number of triples after data upload is  ~150000.
> - We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
> - These *.dat files start from ~10Mb in size, and after a day or so some of them grow to ~10Gb.
> 
> We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
> 
> Best regards,
> Gaspar
> 
> 
> 
>> On 6 Jul 2022, at 10:55, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>> 
>> Hi and welcome Gaspar.
>> 
>> 
>> Those files do contain the node tables.
>> 
>> A Lucene index is never computed by default and would be contained in Lucene specific index files.
>> 
>> 
>> Can you give some details about the
>> 
>> - size of the files
>> - the number of triples
>> - the number triples added/removed/changed
>> - the frequency of updates
>> - how much the files grow
>> - what kind of data you insert? Lots of blank nodes? Or literals?
>> 
>> Also, did you try a compact operation during time?
>> 
>> Lorenz
>> 
>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>> Hi Jena support team,
>>> 
>>> We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files are growing quickly. From our understanding these files are containing the Lucene indexes. We would have two questions:
>>> 
>>> 1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
>>> 2. Can we disable indexing easily, since we are not using full text searches in our SPARQL queries?
>>> 
>>> Our usage of Jena Fuseki:
>>> 
>>> * Start the server with `fuseki-server —port 3030`
>>> * Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>> * Upload ttl files with HTTP POST to /db_name/data
>>> 
>>> Thanks in advance for your feedback, and if you’d require more input from our side, please let me know.
>>> 
>>> Best regards,
>>> Gaspar Bartalus
>>> 
>