Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2019/06/13 04:27:12 UTC

tdb2.tdbsync

This is only a potential suggestion, not an issue.

I think it could be handy to have a tdb2.tdbsync tool for synchronizing a TDB dataset with one or more RDF files. Something to use like this: tdb2.tdbsync --loc dataset data.nt, which would automatically delete/insert triples to keep the dataset up to date with the changes in the file. It would be handy when batch processing a large number of triples.
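One way such a tool could work (this is purely an illustrative sketch; no tdb2.tdbsync exists, and the workflow shown here is an assumption, not Jena's) is to keep a snapshot of the file as it was last loaded, diff the new file against that snapshot, and emit a SPARQL Update with only the changes. Note this only behaves correctly for ground triples in canonical N-Triples form; the example data is made up:

```python
# Hypothetical sketch of how a sync tool could turn a file-level diff
# into a SPARQL Update. Assumes one ground triple per line in canonical
# N-Triples (no blank nodes), so a line-level set difference is a valid
# graph difference.

def diff_ntriples(old_lines, new_lines):
    """Return (to_delete, to_insert) as sets of N-Triples lines."""
    old = {line.strip() for line in old_lines if line.strip()}
    new = {line.strip() for line in new_lines if line.strip()}
    return old - new, new - old

def to_sparql_update(to_delete, to_insert):
    """Render the diff as a SPARQL Update request string."""
    parts = []
    if to_delete:
        parts.append("DELETE DATA {\n  %s\n}" % "\n  ".join(sorted(to_delete)))
    if to_insert:
        parts.append("INSERT DATA {\n  %s\n}" % "\n  ".join(sorted(to_insert)))
    return " ;\n".join(parts)

# Illustrative data: the snapshot from the last load vs. the current file.
old = ['<urn:a> <urn:p> "1" .', '<urn:a> <urn:p> "2" .']
new = ['<urn:a> <urn:p> "2" .', '<urn:a> <urn:p> "3" .']
dels, ins = diff_ntriples(old, new)
print(to_sparql_update(dels, ins))
```

The resulting update string could then be applied to the dataset with Jena's standard update tooling, so only the changed triples are touched.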

Re: tdb2.tdbsync

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Well, it's primarily deletes that are problematic, for two reasons.

First is blank node equivalence: the internal IDs that TDB2 assigns to blank nodes are completely unrelated to the blank node IDs in the source data serialization, especially if that source data changes over time (because your data serializer may use different IDs each time). Figuring out which blank nodes are new versus which are equivalent to existing ones is the sub-graph isomorphism problem, which is NP-complete.
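To see why blank nodes defeat a naive diff, consider the same one-triple graph serialized on two different runs; the labels here are made up for illustration. A line-level comparison reports changes even though the two graphs are isomorphic:

```python
# The same graph serialized twice; only the generated blank node labels
# differ, so in RDF terms nothing has changed.
run1 = {'_:b0 <urn:p> "v" .'}
run2 = {'_:x7 <urn:p> "v" .'}

# A naive line-level diff nevertheless reports one delete and one insert.
print(sorted(run1 - run2))  # spurious "deleted" triple
print(sorted(run2 - run1))  # spurious "inserted" triple
```

Resolving this correctly requires matching blank nodes between the two graphs by structure, which is where the isomorphism problem comes in.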

Secondly, in order to detect deletes you would need to build a completely new dataset from the data file and then compare the old and new datasets by looping over the old one and doing lookups against the new one. This would be extremely expensive in terms of both time and resources, even for datasets that use no blank nodes.
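The comparison described above can be sketched as two passes over triple sets. In a real implementation each membership test would be an index probe against a freshly built TDB2 dataset rather than an in-memory set lookup, which is where the cost comes from; the triples below are illustrative stand-ins:

```python
# Sketch of delete/insert detection by full dataset comparison.
# old_ds stands in for the existing dataset, new_ds for a dataset
# freshly built from the data file.
old_ds = {("s1", "p", "o1"), ("s1", "p", "o2")}
new_ds = {("s1", "p", "o2"), ("s2", "p", "o3")}

# Pass 1: every old triple missing from the new dataset is a delete.
deletes = [t for t in old_ds if t not in new_ds]
# Pass 2: every new triple missing from the old dataset is an insert.
inserts = [t for t in new_ds if t not in old_ds]

print(sorted(deletes))  # [('s1', 'p', 'o1')]
print(sorted(inserts))  # [('s2', 'p', 'o3')]
```

Both passes touch every triple in both datasets, so the work is proportional to the total size of the data regardless of how small the actual change is.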

Creating a fresh dataset will always be much faster.

Rob

On 13/06/2019, 09:34, "Laura Morales" <la...@mail.com> wrote:

    Yes, of course I can reload everything; that's what I do already. I simply thought it might be quite handy if, for instance, I had a folder containing an arbitrary number of RDF files, and as these files changed I could call a tdb2.tdbsync tool that automatically updates a TDB dataset with only the changes (instead of reloading everything).
    
    
    > Sent: Thursday, June 13, 2019 at 10:26 AM
    > From: "Rob Vesse" <rv...@dotnetrdf.org>
    > To: users@jena.apache.org
    > Subject: Re: tdb2.tdbsync
    >
    > Can you not just do a fresh TDB load into a new dataset from the data file?
    >
    > This would be much faster than what you are proposing (in particular, the delete handling would be very expensive)
    >
    > Rob
    
Re: tdb2.tdbsync

Posted by Laura Morales <la...@mail.com>.
Yes, of course I can reload everything; that's what I do already. I simply thought it might be quite handy if, for instance, I had a folder containing an arbitrary number of RDF files, and as these files changed I could call a tdb2.tdbsync tool that automatically updates a TDB dataset with only the changes (instead of reloading everything).


> Sent: Thursday, June 13, 2019 at 10:26 AM
> From: "Rob Vesse" <rv...@dotnetrdf.org>
> To: users@jena.apache.org
> Subject: Re: tdb2.tdbsync
>
> Can you not just do a fresh TDB load into a new dataset from the data file?
>
> This would be much faster than what you are proposing (in particular, the delete handling would be very expensive)
>
> Rob


Re: tdb2.tdbsync

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Can you not just do a fresh TDB load into a new dataset from the data file?

This would be much faster than what you are proposing (in particular, the delete handling would be very expensive)

Rob

On 13/06/2019, 05:27, "Laura Morales" <la...@mail.com> wrote:

    This is only a potential suggestion, not an issue.
    
    I think it could be handy to have a tdb2.tdbsync tool for synchronizing a TDB dataset with one or more RDF files. Something to use like this: tdb2.tdbsync --loc dataset data.nt, which would automatically delete/insert triples to keep the dataset up to date with the changes in the file. It would be handy when batch processing a large number of triples.