Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/11/24 09:03:54 UTC

Jena/Fuseki graph sync

Does Fuseki have any tool to "synchronize" a graph in the dataset with a .nt file? In other words, some tool that, given a dataset/graph and a .nt file as input, will parse the triples in the .nt file and automatically add/delete triples in the dataset/graph such that at the end of the process the graph in the dataset has exactly the same triples as the .nt file?

If no such tool exists, how could I achieve something like this with the existing tools?

Thank you.

Re: Jena/Fuseki graph sync

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
But you would have to do an expensive computation anyway: the diff itself
still has to be computed, which means comparing two big datasets somehow.

Input: existing graph G and new set of triples T

For each triple t in T:

    if t is not in G:

        G := G union {t}

For each triple t in G:

    if t is not in T:

        G := G \ {t}

And it becomes more complex once blank nodes occur.

A better way would be for the source to provide incremental changesets.
DBpedia did this some time ago, for example.


On 24.11.2017 13:19, Laura Morales wrote:
>> What about simply deleting the old graph and loading the triples of the
>> .nt file into the graph afterwards? I don't see any benefit of such a
>> "tool" - you could just write your own bash script for this if you need
>> this quite often.
> The advantage is with large graphs, such as wikidata. If I download their dumps once a week, it's much more efficient to only change a few triples instead of deleting the entire graph and recreating the whole TDB store.


Re: Jena/Fuseki graph sync

Posted by Dan Davis <da...@gmail.com>.
What I do is load the new .nt file into an "updates" graph and then use the
capabilities of the triple store to compare. For me, that's Virtuoso, so
SQL queries are also available.
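
As a rough, untested sketch of the same idea against Fuseki instead of
Virtuoso (the dataset URL and the graph names <urn:x:current> and
<urn:x:updates> are made up; load the new dump into the "updates" graph
first):

  # Triples present in the updates graph but missing from the current graph
  curl --silent \
       --data-urlencode 'query=
         SELECT ?s ?p ?o WHERE {
           GRAPH <urn:x:updates> { ?s ?p ?o }
           FILTER NOT EXISTS { GRAPH <urn:x:current> { ?s ?p ?o } }
         }' \
       -H 'Accept: text/csv' \
       http://localhost:3030/ds/query

Swapping the two graph names in the query gives the triples to delete.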

On Nov 24, 2017 9:53 AM, "Dan Davis" <da...@gmail.com> wrote:

> Rdflib has a graph_diff method that returns common triples, triples only in
> left, and triples only in right. It is in the IsomorphicGraph class, so it
> should handle blank nodes.
>
> On Nov 24, 2017 7:19 AM, "Laura Morales" <la...@mail.com> wrote:
>
>> > What about simply deleting the old graph and loading the triples of the
>> > .nt file into the graph afterwards? I don't see any benefit of such a
>> > "tool" - you could just write your own bash script for this if you need
>> > this quite often.
>>
>> The advantage is with large graphs, such as wikidata. If I download their
>> dumps once a week, it's much more efficient to only change a few triples
>> instead of deleting the entire graph and recreating the whole TDB store.
>>
>

Re: Jena/Fuseki graph sync

Posted by Dan Davis <da...@gmail.com>.
In terms of UNIX utilities, there's a command called "comm" which outputs
three columns:
* lines only in the first file (column 1)
* lines only in the second file (column 2)
* lines in common (column 3)

Then arguments can suppress columns:
* comm -23 a b  - will show lines only in a
* comm -13 a b - will show lines only in b
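
For example, a rough sketch on two N-Triples dumps (assuming no blank nodes
and a consistent serialization; file names are made up):

  # comm needs sorted input; sort both dumps the same way
  LC_ALL=C sort -u old.nt > old.sorted.nt
  LC_ALL=C sort -u new.nt > new.sorted.nt

  # lines only in the old dump = triples to delete
  comm -23 old.sorted.nt new.sorted.nt > to-delete.nt
  # lines only in the new dump = triples to add
  comm -13 old.sorted.nt new.sorted.nt > to-add.nt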

Of course checksums would not work on the whole graph, but on a sub-graph
defined by a DESCRIBE query, e.g. for one subject (an owl:Thing), they could
be perfectly feasible, especially because you are essentially comparing
graph digests and do not need to load the data.
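
A sketch of that idea (untested; the endpoint, the subject URI and the choice
of md5 are only illustrative, and it assumes the description contains no
blank nodes):

  # Fetch the description of one subject as N-Triples, canonicalise, hash
  curl --silent \
       --data-urlencode 'query=DESCRIBE <http://example.org/resource/X>' \
       -H 'Accept: application/n-triples' \
       http://localhost:3030/ds/query \
    | LC_ALL=C sort -u \
    | md5sum

Running the same command against the old and the new data and comparing the
digests tells you whether that subject changed.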



On Fri, Nov 24, 2017 at 10:02 AM, Osma Suominen <os...@helsinki.fi>
wrote:

> Dan Davis kirjoitti 24.11.2017 klo 16:53:
>
>> Rdflib has a graph_diff method that returns common triples, triples only in
>> left, and triples only in right. It is in the IsomorphicGraph class, so it
>> should handle blank nodes.
>>
>
> Good luck running that on something like Wikidata though. It's far too big
> to fit in memory.
>
> I'd use N-Triple files (old and new) sorted using the unix command sort,
> then use diff to determine added and removed triples, and finally turn
> those into INSERT DATA and DELETE DATA update operations. Assuming there
> are no blank nodes.
>
> -Osma
>
> (speaking as the author of the current rdflib in-memory store, IOMemory)
>
>
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finland
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> osma.suominen@helsinki.fi
> http://www.nationallibrary.fi
>

Re: Jena/Fuseki graph sync

Posted by Andy Seaborne <an...@apache.org>.

On 24/11/17 15:02, Osma Suominen wrote:
> Dan Davis kirjoitti 24.11.2017 klo 16:53:
>> Rdflib has a graph_diff method that returns common triples, triples only
>> in left, and triples only in right. It is in the IsomorphicGraph class,
>> so it should handle blank nodes.
> 
> Good luck running that on something like Wikidata though. It's far too 
> big to fit in memory.
> 
> I'd use N-Triple files (old and new) sorted using the unix command sort, 
> then use diff to determine added and removed triples, and finally turn 
> those into INSERT DATA and DELETE DATA update operations. Assuming there 
> are no blank nodes.
> 
> -Osma
> 
> (speaking as the author of the current rdflib in-memory store, IOMemory)
> 

This is where RDF Patch comes in:

https://afs.github.io/rdf-delta/rdf-patch.html

Send the adds and removes.

If it is just additions, you can POST the new RDF to the dataset and it
gets added.
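
For example, roughly (untested; assumes a local Fuseki dataset at
http://localhost:3030/ds with its Graph Store endpoint at /ds/data, and that
to-add.nt holds only the new triples):

  # POST appends to the target graph; PUT would replace it
  curl -X POST \
       -H 'Content-Type: application/n-triples' \
       --data-binary @to-add.nt \
       'http://localhost:3030/ds/data?default'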

If there are deletes, then you need something else: either SPARQL Update
or RDF Patch.

Reloading the whole thing (offline) may be slow but it is reliable. 
Load it (batch job) then swap the datasets over (brief outage or get 
clever with a load balancer).
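
Roughly, for TDB2 (untested sketch; the paths, the service name and the
symlink arrangement are made up, and it assumes Fuseki is configured to read
its database from /srv/tdb2/wikidata):

  # Build a fresh database offline
  tdb2.tdbloader --loc=/srv/tdb2/wikidata-new latest-all.ttl.gz

  # Brief outage: stop Fuseki, swap the database directory, restart
  systemctl stop fuseki
  ln -sfn /srv/tdb2/wikidata-new /srv/tdb2/wikidata
  systemctl start fuseki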

     Andy

Re: Jena/Fuseki graph sync

Posted by Osma Suominen <os...@helsinki.fi>.
Dan Davis kirjoitti 24.11.2017 klo 16:53:
> Rdflib has a graph_diff method that returns common triples, triples only in
> left, and triples only in right. It is in the IsomorphicGraph class, so it
> should handle blank nodes.

Good luck running that on something like Wikidata though. It's far too 
big to fit in memory.

I'd use N-Triple files (old and new) sorted using the unix command sort, 
then use diff to determine added and removed triples, and finally turn 
those into INSERT DATA and DELETE DATA update operations. Assuming there 
are no blank nodes.
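
As a rough sketch of that last step (untested; assumes added.nt and
removed.nt came out of the diff, contain no blank nodes, and are small
enough to send in one request; the endpoint URL is illustrative):

  # Wrap the removed and added triples into a single SPARQL Update
  {
    echo 'DELETE DATA {'
    cat removed.nt
    echo '} ;'
    echo 'INSERT DATA {'
    cat added.nt
    echo '}'
  } > changes.ru

  # Send it to the Fuseki update endpoint
  curl -X POST \
       -H 'Content-Type: application/sparql-update' \
       --data-binary @changes.ru \
       http://localhost:3030/ds/update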

-Osma

(speaking as the author of the current rdflib in-memory store, IOMemory)

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Jena/Fuseki graph sync

Posted by Dan Davis <da...@gmail.com>.
Rdflib has a graph_diff method that returns common triples, triples only in
left, and triples only in right. It is in the IsomorphicGraph class, so it
should handle blank nodes.

On Nov 24, 2017 7:19 AM, "Laura Morales" <la...@mail.com> wrote:

> > What about simply deleting the old graph and loading the triples of the
> > .nt file into the graph afterwards? I don't see any benefit of such a
> > "tool" - you could just write your own bash script for this if you need
> > this quite often.
>
> The advantage is with large graphs, such as wikidata. If I download their
> dumps once a week, it's much more efficient to only change a few triples
> instead of deleting the entire graph and recreating the whole TDB store.
>

Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> What about simply deleting the old graph and loading the triples of the
> .nt file into the graph afterwards? I don't see any benefit of such a
> "tool" - you could just write your own bash script for this if you need
> this quite often.

The advantage is with large graphs, such as wikidata. If I download their dumps once a week, it's much more efficient to only change a few triples instead of deleting the entire graph and recreating the whole TDB store.

Re: Jena/Fuseki graph sync

Posted by Osma Suominen <os...@helsinki.fi>.
Lorenz Buehmann kirjoitti 24.11.2017 klo 11:53:
> Ok, but there is no magic behind the tool I guess. I mean, it's not a
> tool like incrementally updating a dataset by doing some diffs, etc.

No, there's no magic. It's all in the SPARQL 1.1 Graph Store HTTP Protocol spec. The
update is not incremental, it's a replacement in a single atomic 
operation, so perhaps somewhat simpler than "deleting the old graph and 
loading the triples of the .nt file into the graph afterwards" that you 
suggested.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Jena/Fuseki graph sync

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Ok, but there is no magic behind the tool, I guess. I mean, it's not a
tool that incrementally updates a dataset by computing diffs, etc.

Or am I wrong?


On 24.11.2017 10:51, Osma Suominen wrote:
> Lorenz Buehmann kirjoitti 24.11.2017 klo 11:46:
>
>> What about simply deleting the old graph and loading the triples of the
>> .nt file into the graph afterwards? I don't see any benefit of such a
>> "tool" - you could just write your own bash script for this if you need
>> this quite often.
>
> The s-put tool that comes with Fuseki (or just doing a HTTP PUT to the
> SPARQL Graph Store endpoint using e.g. curl) does exactly this -
> replaces a graph with a new one in a single operation.
>
> In the original scenario, blank nodes can be a problem if you have
> them in your data. There is no way (at least not efficiently) to
> compare blank nodes in two graphs.
>
> -Osma
>
>


Re: Jena/Fuseki graph sync

Posted by Andy Seaborne <an...@apache.org>.

On 26/11/17 08:05, Laura Morales wrote:
>> Experiment needed - on a normal commodity server or a portable, the
>> limiting factor may not be the CPU, or just the CPU. The system bus
>> (moving data around) and the persistent storage may be limitations.
> 
> On my computer, creating a TDB2 store with tdb2.tdbloader from a .nt file 1.1G in size:
> 
> - read from disk, write to disk: 35K triples per second (as reported by tdb2.tdbloader AVG value)

TDB2 is, at the moment, much better with an SSD.

As I've said, the TDB2 loader is simplistic and crude.

It is not even at the level of the TDB1 tdbloader.

> - read from disk, write to tmpfs: 43K AVG triples per second
> - read from tmpfs, write to tmpfs: 45K AVG triples per second
> 
> Adding -Xmx3G to the "java" command in tdb2.tdbloader didn't seem to have any effect.

It won't. File caching is outside the heap.

> On the other hand, I have a thread constantly at 100%
> Creating the store with the "--graph" argument seems significantly slower than without such argument. I've only tested this for the first case (read disk, write disk) and tdb2.tdbloader reports about 50K AVG triples per second

Yes - more indexes for named graphs.  The default setup is skewed for 
general query use, not for load.

> It's not a deep statistic but... if the reported AVG numbers are correct then I guess a slow CPU is the bottleneck for tdb2?

Or memory bandwidth.
And for a rotating disk, doing better write order would help.

> I guess the largest improvement would probably be adding multi-threading to tdb2.tdbloader, considering that servers can have more than 10 or 20 cores. Either this or map-reduce.

You need to factor in the NodeIds.  One RDF term, the same NodeId always.

At the moment, they are incrementally allocated and give the location in 
a file. This is not parallelizable.

They could be hashes (see TDB2's 96-bit ids, or maybe longer), which is 
parallelizable, but then you need a hash-to-file-location index.

tdbloader (loader1) has, or had, a parallel mode. When I last used it, 
the gain was small, suggesting that the system bus (or memory channel 
bandwidth) is a factor.  Redoing that for modern server CPUs would be 
interesting.

It was parallel in building the secondary indexes: a single sequential 
pass builds the primary SPO index (and handles NodeId allocation), then 
the secondary indexes are created in parallel.

map-reduce has high overhead and needs a cluster.

TDB3 ...

     Andy

Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> Experiment needed - on a normal commodity server or a portable, the
> limiting factor may not be the CPU, or just the CPU. The system bus
> (moving data around) and the persistent storage may be limitations.

On my computer, creating a TDB2 store with tdb2.tdbloader from a .nt file 1.1G in size:

- read from disk, write to disk: 35K triples per second (as reported by tdb2.tdbloader AVG value)
- read from disk, write to tmpfs: 43K AVG triples per second
- read from tmpfs, write to tmpfs: 45K AVG triples per second

Adding -Xmx3G to the "java" command in tdb2.tdbloader didn't seem to have any effect.
On the other hand, I have a thread constantly at 100%
Creating the store with the "--graph" argument seems significantly slower than without such argument. I've only tested this for the first case (read disk, write disk) and tdb2.tdbloader reports about 50K AVG triples per second

It's not a deep statistic but... if the reported AVG numbers are correct then I guess a slow CPU is the bottleneck for tdb2? I guess the largest improvement would probably be adding multi-threading to tdb2.tdbloader, considering that servers can have more than 10 or 20 cores. Either this or map-reduce.

Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> That was pure *parsing* speed with no generation of triples. Data
> validation task.

A question related to this. Is tdb2.tdbloader mostly limited by hard disk speed, or does it require a lot of CPU/RAM computation as well?

Re: Jena/Fuseki graph sync

Posted by Andy Seaborne <an...@apache.org>.

On 25/11/17 18:46, Laura Morales wrote:
>> Parsing the data: I got:
>>
>> latest-truthy.nt.gz
>> 4,736.87 sec : 2,199,382,887 Triples : 464,311.63 per second
>>
>> latest-all.ttl.gz
>> 8,864.36 sec : 4,787,194,669 Triples : 540,049.73 per second
>> and 3,284,314 warnings.
> 
> Did you get these numbers using tdb2.tdbloader?

That was pure *parsing* speed with no generation of triples; it was a data 
validation task.
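
(A parse/validate run like that can be done with Jena's riot command, e.g.:

  riot --validate latest-truthy.nt.gz
  riot --validate latest-all.ttl.gz

riot parses the input, reports warnings and errors, and discards the triples.)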

> My AVG number of triples per second when using tdb2.tdbloader is ~60K and it also seems to slow down over time (<30K)
> Which kind of computer did you get these numbers on? cpu, type of ram, disk (hdd, ssd)...
> 

Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> Parsing the data: I got:
> 
> latest-truthy.nt.gz
> 4,736.87 sec : 2,199,382,887 Triples : 464,311.63 per second
> 
> latest-all.ttl.gz
> 8,864.36 sec : 4,787,194,669 Triples : 540,049.73 per second
> and 3,284,314 warnings.

Did you get these numbers using tdb2.tdbloader? My AVG number of triples per second when using tdb2.tdbloader is ~60K and it also seems to slow down over time (<30K)
Which kind of computer did you get these numbers on? cpu, type of ram, disk (hdd, ssd)...

Re: Jena/Fuseki graph sync

Posted by Andy Seaborne <an...@apache.org>.

On 25/11/17 12:08, Laura Morales wrote:
>> How long does it take to load, using what hardware?
>> This is "all", and not "truthy"?
> 
> Yes this file https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

Parsing the data: I got:

latest-truthy.nt.gz
4,736.87 sec : 2,199,382,887 Triples : 464,311.63 per second

latest-all.ttl.gz
8,864.36 sec : 4,787,194,669 Triples : 540,049.73 per second
    and 3,284,314 warnings.

> Intel Core 2 Duo 8GB 1TB
> I have never timed it but it takes hours. Can Jena use more cores to create the TDB2 store faster?

It can, as in "it could be enhanced to do that", but it doesn't.

Experiment needed - on a normal commodity server or a portable, the 
limiting factor may not be the CPU, or just the CPU. The system bus 
(moving data around) and the persistent storage may be limitations.

     Andy

Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> How long does it take to load, using what hardware?
> This is "all", and not "truthy"?

Yes this file https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Intel Core 2 Duo 8GB 1TB
I have never timed it but it takes hours. Can Jena use more cores to create the TDB2 store faster?

Re: Jena/Fuseki graph sync

Posted by zPlus <zp...@peers.community>.
> I downloaded them both yesterday - I was getting a miserable
> 2Mbyte/s download rate, capped per download, not that we can blame
> them for rate controlling downloads, if it is the download site at
> all.

I downloaded the same file a while back, and I was able to get around
5-6 MB/s using multiple connections. However, if I remember correctly,
the server would only accept 3 connections max.


Re: Jena/Fuseki graph sync

Posted by Andy Seaborne <an...@apache.org>.
Laura,

Interesting.
How long does it take to load, using what hardware?
This is "all", and not "truthy"?

(I downloaded them both yesterday - I was getting a miserable 2Mbyte/s 
download rate, capped per download, not that we can blame them for rate 
controlling downloads, if it is the download site at all).

     Andy

On 24/11/17 17:24, Laura Morales wrote:
>> Laura, can you tell us a little more about why you are trying to avoid transmitting the whole graph? Is it because of an unreliable network between your client and Fuseki or because of something else?
> 
> Wikidata is about 4 billion triples, and it takes a lot of time to create the TDB store from the nt file. They release a new dump about once a week, and I would like to update my local copy when they release a new dump. Reloading the entire graph from scratch every time seems very inefficient (as well as an intensive process) considering that only a tiny % of the wikidata graph changes in a week.
> 

Re: Jena/Fuseki graph sync

Posted by Andy Seaborne <an...@apache.org>.
If the changes are in the right shape, and in RDF, the change can be 
applied with (outline):

DELETE WHERE {
   <some_resource> ?p ?o
   ... any metadata about <some_resource> ...
} ;
INSERT DATA {
  ... new data ...
}


but the danger of working with this is that not all changes fit this 
pattern and the copy will drift apart.  The compromise might be to do 
incremental updates a few times and then rebuild every 1 or 3 months.
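
As a made-up concrete instance (the resource, the triple and the endpoint 
are only illustrative), applied with the s-update script that ships with 
Fuseki:

  s-update --service=http://localhost:3030/ds/update '
    DELETE WHERE { <http://www.wikidata.org/entity/Q42> ?p ?o } ;
    INSERT DATA {
      <http://www.wikidata.org/entity/Q42>
          <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .
    }'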

     Andy


On 25/11/17 09:14, Lorenz Buehmann wrote:
> Ah, interesting. That's what I meant by incremental changesets provided
> by the source maintainer. I knew that there was something like this for
> DBpedia, simply providing changesets of triples. Weird that there is no
> RDF format available for Wikidata. Maybe Laura could open a feature request.
> 
> By the way, thanks for the RDF Patch pointer.
> 
> 
> Cheers,
> 
> Lorenz
> 
> 
> On 24.11.2017 18:43, ajs6f wrote:
>> Wikimedia does offer a sort of general procedure for this: you can check to see the updates since the last dump and do per-resource changes.
>>
>> https://www.wikidata.org/wiki/Wikidata:Data_access#Incremental_updates
>>
>> But perhaps more efficiently for yourself, you could use their incremental dumps:
>>
>> https://dumps.wikimedia.org/other/incr/wikidatawiki/
>>
>> which are for some reason only provided in XML.
>>
>> ajs6f
>>
>>> On Nov 24, 2017, at 12:24 PM, Laura Morales <la...@mail.com> wrote:
>>>
>>>> Laura, can you tell us a little more about why you are trying to avoid transmitting the whole graph? Is it because of an unreliable network between your client and Fuseki or because of something else?
>>> Wikidata is about 4 billion triples, and it takes a lot of time to create the TDB store from the nt file. They release a new dump about once a week, and I would like to update my local copy when they release a new dump. Reloading the entire graph from scratch every time seems very inefficient (as well as an intensive process) considering that only a tiny % of the wikidata graph changes in a week.
>>
> 

Re: Jena/Fuseki graph sync

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Ah, interesting. That's what I meant by incremental changesets provided
by the source maintainer. I knew that there was something like this for
DBpedia, simply providing changesets of triples. Weird that there is no
RDF format available for Wikidata. Maybe Laura could open a feature request.

By the way, thanks for the RDF Patch pointer.


Cheers,

Lorenz


On 24.11.2017 18:43, ajs6f wrote:
> Wikimedia does offer a sort of general procedure for this: you can check to see the updates since the last dump and do per-resource changes.
>
> https://www.wikidata.org/wiki/Wikidata:Data_access#Incremental_updates
>
> But perhaps more efficiently for yourself, you could use their incremental dumps:
>
> https://dumps.wikimedia.org/other/incr/wikidatawiki/
>
> which are for some reason only provided in XML. 
>
> ajs6f
>
>> On Nov 24, 2017, at 12:24 PM, Laura Morales <la...@mail.com> wrote:
>>
>>> Laura, can you tell us a little more about why you are trying to avoid transmitting the whole graph? Is it because of an unreliable network between your client and Fuseki or because of something else?
>> Wikidata is about 4 billion triples, and it takes a lot of time to create the TDB store from the nt file. They release a new dump about once a week, and I would like to update my local copy when they release a new dump. Reloading the entire graph from scratch every time seems very inefficient (as well as an intensive process) considering that only a tiny % of the wikidata graph changes in a week.
>


Re: Jena/Fuseki graph sync

Posted by ajs6f <aj...@apache.org>.
Wikimedia does offer a sort of general procedure for this: you can check to see the updates since the last dump and do per-resource changes.

https://www.wikidata.org/wiki/Wikidata:Data_access#Incremental_updates

But perhaps more efficiently for yourself, you could use their incremental dumps:

https://dumps.wikimedia.org/other/incr/wikidatawiki/

which are for some reason only provided in XML. 

ajs6f

> On Nov 24, 2017, at 12:24 PM, Laura Morales <la...@mail.com> wrote:
> 
>> Laura, can you tell us a little more about why you are trying to avoid transmitting the whole graph? Is it because of an unreliable network between your client and Fuseki or because of something else?
> 
> Wikidata is about 4 billion triples, and it takes a lot of time to create the TDB store from the nt file. They release a new dump about once a week, and I would like to update my local copy when they release a new dump. Reloading the entire graph from scratch every time seems very inefficient (as well as an intensive process) considering that only a tiny % of the wikidata graph changes in a week.


Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> Laura, can you tell us a little more about why you are trying to avoid transmitting the whole graph? Is it because of an unreliable network between your client and Fuseki or because of something else?

Wikidata is about 4 billion triples, and it takes a lot of time to create the TDB store from the nt file. They release a new dump about once a week, and I would like to update my local copy when they release a new dump. Reloading the entire graph from scratch every time seems very inefficient (as well as an intensive process) considering that only a tiny % of the wikidata graph changes in a week.

Re: Jena/Fuseki graph sync

Posted by ajs6f <aj...@apache.org>.
Hey, Lorenz--

Laura specifically asked for ways to avoid transmitting the whole graph; Osma's solution (sort NTriples) is better than Model::difference in the absence of bnodes (actually, difference() seems to work based on triple equality anyway), and Andy's (RDF Patch) would be good if you want to automate the process. Andy wrote a whole system for that. [1]

Laura, can you tell us a little more about why you are trying to avoid transmitting the whole graph? Is it because of an unreliable network between your client and Fuseki or because of something else?

ajs6f

[1] https://github.com/afs/rdf-delta

> On Nov 24, 2017, at 10:21 AM, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
> 
> Which means loading the whole new Wikidata dump first, or not?
> 
> In that case, the newly loaded dataset could simply be used directly.
> 
> 
> On 24.11.2017 15:57, ajs6f wrote:
>> You can use Model.difference(Model m) to do these calculations. 
>> 
>> ajs6f
>> 
>>> On Nov 24, 2017, at 7:21 AM, Laura Morales <la...@mail.com> wrote:
>>> 
>>>> The s-put tool that comes with Fuseki (or just doing a HTTP PUT to the
>>>> SPARQL Graph Store endpoint using e.g. curl) does exactly this -
>>>> replaces a graph with a new one in a single operation.
>>> Deleting a whole graph, and pushing all the new triples over HTTP doesn't look like a good fit for large graphs (even for graphs of a few GBs).
>> 
> 


Re: Jena/Fuseki graph sync

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Which means loading the whole new Wikidata dump first, or not?

In that case, the newly loaded dataset could simply be used directly.


On 24.11.2017 15:57, ajs6f wrote:
> You can use Model.difference(Model m) to do these calculations. 
>
> ajs6f
>
>> On Nov 24, 2017, at 7:21 AM, Laura Morales <la...@mail.com> wrote:
>>
>>> The s-put tool that comes with Fuseki (or just doing a HTTP PUT to the
>>> SPARQL Graph Store endpoint using e.g. curl) does exactly this -
>>> replaces a graph with a new one in a single operation.
>> Deleting a whole graph, and pushing all the new triples over HTTP doesn't look like a good fit for large graphs (even for graphs of a few GBs).
>


Re: Jena/Fuseki graph sync

Posted by ajs6f <aj...@apache.org>.
You can use Model.difference(Model m) to do these calculations. 

ajs6f

> On Nov 24, 2017, at 7:21 AM, Laura Morales <la...@mail.com> wrote:
> 
>> The s-put tool that comes with Fuseki (or just doing a HTTP PUT to the
>> SPARQL Graph Store endpoint using e.g. curl) does exactly this -
>> replaces a graph with a new one in a single operation.
> 
> Deleting a whole graph, and pushing all the new triples over HTTP doesn't look like a good fit for large graphs (even for graphs of a few GBs).


Re: Jena/Fuseki graph sync

Posted by Laura Morales <la...@mail.com>.
> The s-put tool that comes with Fuseki (or just doing a HTTP PUT to the
> SPARQL Graph Store endpoint using e.g. curl) does exactly this -
> replaces a graph with a new one in a single operation.

Deleting a whole graph, and pushing all the new triples over HTTP doesn't look like a good fit for large graphs (even for graphs of a few GBs).

Re: Jena/Fuseki graph sync

Posted by Osma Suominen <os...@helsinki.fi>.
Lorenz Buehmann kirjoitti 24.11.2017 klo 11:46:

> What about simply deleting the old graph and loading the triples of the
> .nt file into the graph afterwards? I don't see any benefit of such a
> "tool" - you could just write your own bash script for this if you need
> this quite often.

The s-put tool that comes with Fuseki (or just doing a HTTP PUT to the 
SPARQL Graph Store endpoint using e.g. curl) does exactly this - 
replaces a graph with a new one in a single operation.
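
For example (the dataset URL, graph choice and file name are illustrative):

  # Replace the default graph with the contents of data.nt
  s-put http://localhost:3030/ds/data default data.nt

  # The same with plain curl; PUT replaces, POST would append
  curl -X PUT \
       -H 'Content-Type: application/n-triples' \
       --data-binary @data.nt \
       'http://localhost:3030/ds/data?default'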

In the original scenario, blank nodes can be a problem if you have them 
in your data. There is no way (at least not efficiently) to compare 
blank nodes in two graphs.

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Jena/Fuseki graph sync

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
> process the graph in the dataset has exactly the same triples as the .nt file?

What about simply deleting the old graph and loading the triples of the
.nt file into the graph afterwards? I don't see any benefit of such a
"tool" - you could just write your own bash script for this if you need
this quite often.


On 24.11.2017 10:03, Laura Morales wrote:
> Does Fuseki have any tool to "synchronize" a graph in the dataset with a .nt file? In other words, some tool that, given a dataset/graph and a .nt file as input, will parse the triples in the .nt file and automatically add/delete triples in the dataset/graph such that at the end of the process the graph in the dataset has exactly the same triples as the .nt file?
>
> If no such tool exists, how could I achieve something like this with the existing tools?
>
> Thank you.