Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/11/26 15:55:55 UTC
Fuseki all graphs into dataset vs separate graphs
What is the difference between these two scenarios?
1. a dataset with all "s p o g" quads loaded into the dataset, with this configuration

    <#dataset> rdf:type tdb2:DatasetTDB2 ;
        tdb2:location "TDB2" ;
        tdb2:unionDefaultGraph true .

2. a dataset where each graph is in its own folder, with this configuration

    <#dataset> rdf:type ja:RDFDataset ;
        ja:namedGraph
            [ ja:graphName <http://example.org/g1> ;
              ja:graph <#g1> ] ;
        ja:namedGraph
            [ ja:graphName <http://example.org/g2> ;
              ja:graph <#g2> ] ;
        .

    <#g1> rdf:type tdb:GraphTDB2 ;
        tdb2:location "DB-1" .

    <#g2> rdf:type tdb:GraphTDB2 ;
        tdb2:location "DB-2" .

- Is there any performance hit with the second method?
- With the second method, is it still possible to have the default union graph of all single graphs?
- I tried to query Fuseki with the second method:
    - This query works: select ?g where { graph ?g {} }
    - This query returns "Error 500: Not in a transaction":
      select ?s from <http://example.org/g1> where { ?s ?p ?o } limit 1
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
Oh yeah, sorry. I've always used TDB with Fuseki; it was late at night and for a moment I completely forgot about all the other tdb2.* CLI tools.
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Osma Suominen <os...@helsinki.fi>.
Andrew U. Frank wrote on 26.11.2017 at 23:54:
> when i use the sparql update protocol to store data, am i not using
> fuseki?
SPARQL the protocol (defined over HTTP) requires a server such as
Fuseki. SPARQL Update the language can also be used without a server,
for example with the command line tool tdbupdate.
> I define my TDB in fuseki (in the run/configuration directory)
> and i start a fuseki server which is the endpoint to receive the update
> queries. i had the impression, that s-put is essentially doing wget to
> the sparql endpoint.
Yes, s-put is roughly equivalent to wget or curl. It's a special purpose
HTTP client that knows about the SPARQL protocol so you can perform
slightly higher level operations. But you can do pretty much the same
things with wget or curl too.
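
As a rough illustration of what such an HTTP client does, here is a minimal Python sketch that builds (but does not send) a SPARQL Graph Store Protocol PUT replacing one named graph. The endpoint URL, the service name "data" and the graph name are assumptions made for the example, not values taken from this thread.

```python
import urllib.parse
import urllib.request


def build_graph_put(data_endpoint: str, graph_iri: str, turtle: bytes) -> urllib.request.Request:
    """Build (but do not send) a Graph Store Protocol PUT that replaces one named graph."""
    # The target graph is named through the ?graph= query parameter.
    url = data_endpoint + "?" + urllib.parse.urlencode({"graph": graph_iri})
    return urllib.request.Request(
        url,
        data=turtle,
        method="PUT",
        headers={"Content-Type": "text/turtle"},
    )


# Hypothetical endpoint and graph name, for illustration only.
request = build_graph_put(
    "http://localhost:3030/ds/data",
    "http://example.org/g1",
    b"<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n",
)
print(request.get_method(), request.full_url)
# Sending it to a running server would be: urllib.request.urlopen(request)
```

PUT replaces the whole content of the named graph, which is why a single request can stand in for a delete followed by an upload.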
> is this (more or less) a correct understanding? how would one start the
> TDB or TDB2 server without fuseki?
If you use the TDB/TDB2 command line tools you are not starting a
server. Instead they start a Java process that reads and possibly
manipulates the TDB database (directory full of files) while it is
running, then exits when it is done. Typically they perform a single
operation, for example tdbquery answers a single SPARQL query and
tdbupdate performs a single SPARQL Update operation. Compared to Fuseki
there is some startup overhead each time. However, sometimes the
operations are more efficient when performed via command line tools
instead of going through Fuseki, especially loading to TDB via tdbloader
and tdbloader2.
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Fuseki all graphs into dataset vs separate graphs
Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
can i ask for clarification (my application is perhaps somewhat similar
to Laura's):
when i use the sparql update protocol to store data, am i not using
fuseki? I define my TDB in fuseki (in the run/configuration directory)
and i start a fuseki server which is the endpoint to receive the update
queries. i had the impression, that s-put is essentially doing wget to
the sparql endpoint.
is this (more or less) a correct understanding? how would one start the
TDB or TDB2 server without fuseki?
andrew
--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
+43 1 58801 12710 direct
Geoinformation, TU Wien +43 1 58801 12700 office
Gusshausstr. 27-29 +43 1 55801 12799 fax
1040 Wien Austria +43 676 419 25 72 mobil
Re: Fuseki all graphs into dataset vs separate graphs
Posted by ajs6f <aj...@apache.org>.
"s-put" has nothing to do with TDB2-- it is entirely about SPARQL Graph Store protocol. It would work perfectly well with any implementation thereof, including non-Jena ones.
You will find in the bin/ directory of a Jena distribution a series of CLI tools for working with TDB2 databases, called tdb2.tdbquery, tdb2.tdbupdate etc. They work very much like their TDB1 counterparts. They will let you work against your TDB2 database without Fuseki.
If all you want to do is load data and query it yourself, you don't need Fuseki. You can just use the CLI tools. tdb2.tdbupdate will let you handle your graph replacement chore easily.
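
One way the graph replacement could look, sketched in Python: the first helper writes a SPARQL Update that drops a named graph and reloads it from a local dump, and the second builds the argument list for running that update offline with tdb2.tdbupdate. The graph name, file paths and database location are hypothetical, and the command is only assembled here, not executed.

```python
from pathlib import PurePosixPath


def replace_graph_update(graph_iri: str, dump_path: str) -> str:
    """Return a SPARQL Update that drops a named graph and reloads it from a local file."""
    # LOAD needs an IRI, so the absolute POSIX path is turned into file:///...
    source = PurePosixPath(dump_path).as_uri()
    return (
        f"DROP SILENT GRAPH <{graph_iri}> ;\n"
        f"LOAD <{source}> INTO GRAPH <{graph_iri}>"
    )


def tdbupdate_command(db_location: str, update_file: str) -> list:
    """Argument list for running the update with tdb2.tdbupdate while Fuseki is stopped."""
    return ["tdb2.tdbupdate", f"--loc={db_location}", f"--update={update_file}"]


# Hypothetical names, for illustration only.
update_text = replace_graph_update("http://example.org/wikidata", "/data/wikidata.nt")
command = tdbupdate_command("TDB2", "replace.ru")
print(update_text)
print(" ".join(command))
```

For a dump the size of Wikidata, a bulk loader is likely to be faster than LOAD; the sketch only shows the shape of the operation.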
ajs6f
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
> Is this just for your own exploration
Yes
> in which case you might want to avoid Fuseki entirely and just work with TDB
I can issue SPARQL queries directly at TDB2 without using Fuseki?
My original problem still stands tho :) Is "s-put" the only way to replace a graph in a TDB2 dataset? Is there any CLI tool (non http) that can manipulate a TDB2 dataset to replace a graph with another (eg. replace wikidata with a new dump)?
Re: Fuseki all graphs into dataset vs separate graphs
Posted by ajs6f <aj...@apache.org>.
As of the last time I looked, Wikidata and DBPedia do _not_ share subject URIs. The fact that they come from the same semi-structured data doesn't imply anything about how they are built up as RDF. AFAIK, Wikidata itself isn't RDF at all internally, but is mapped into RDF for publication.
https://meta.wikimedia.org/wiki/Wikidata/Essays/URI_scheme
Otherwise, it looks like you want to do provenance work, and named graphs are a pretty good way to do that. Although, if you are limiting yourself to Wikidata and DBPedia, you might be able to distinguish the source of a triple just based on the namespace of the subject. Is this just for your own exploration (in which case you might want to avoid Fuseki entirely and just work with TDB)?
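
The namespace idea can be sketched in a few lines of Python. The two namespaces below are the usual entity namespaces for English DBpedia and Wikidata; treating them as the only ones in the data is an assumption (localized DBpedia editions, for example, use other prefixes).

```python
from typing import Optional

# Assumed entity namespaces for the two sources.
SOURCE_NAMESPACES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "wikidata": "http://www.wikidata.org/entity/",
}


def source_of(subject_iri: str) -> Optional[str]:
    """Guess which source a subject IRI comes from, based only on its namespace."""
    for source, namespace in SOURCE_NAMESPACES.items():
        if subject_iri.startswith(namespace):
            return source
    return None  # unknown namespace: the source cannot be inferred this way


print(source_of("http://www.wikidata.org/entity/Q42"))
print(source_of("http://dbpedia.org/resource/Douglas_Adams"))
```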
ajs6f
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
> Do they share triple subjects?
Yes dbpedia/wikidata are two graphs of the same source
> I am trying to understand why you are intent on putting them in separate graphs.
1- to query only one source if I want data from a single graph (either dbpedia or wikipedia and not both)
2- to extract the origin of a subject, for example select (subject, graph)
3- because I can update only one of the two graphs instead of having to update both of them
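
The first two points map onto the SPARQL GRAPH keyword when all graphs live in one dataset. A small Python sketch that builds both query shapes follows; the graph name in the usage lines is hypothetical.

```python
def subjects_with_graph(limit: int = 10) -> str:
    """Query returning (subject, graph) pairs across all named graphs."""
    return (
        "SELECT DISTINCT ?s ?g\n"
        "WHERE { GRAPH ?g { ?s ?p ?o } }\n"
        f"LIMIT {limit}"
    )


def subjects_in_graph(graph_iri: str, limit: int = 10) -> str:
    """Query restricted to a single named graph."""
    return (
        "SELECT DISTINCT ?s\n"
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}\n"
        f"LIMIT {limit}"
    )


print(subjects_with_graph(5))
print(subjects_in_graph("http://example.org/g1", 5))
```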
Re: Fuseki all graphs into dataset vs separate graphs
Posted by ajs6f <aj...@apache.org>.
What are the actual queries you are expecting to run against these datasets? Do they share triple subjects? I am trying to understand why you are intent on putting them in separate graphs.
ajs6f
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
> The cost is you then can't make a single query across both at the same
> time (without using SERVICE) but as the sources are independent, I can't
> see how you can anyway. Instead ask one graph, get results, do whatever
> mapping is needed, ask other graph.
OK this is what I don't understand. Using again HDT as a reference... I was able to create an assembler file, with one rdfdataset and multiple graphs, each graph pointing to a .hdt file. Then from SPARQL I could use "select from <graph-1> <graph-2> ..." and this worked.
Basically what I'm trying to do is the same thing but using TDB(2) instead of HDT. I've set up the assembler, I can get a list of graphs from SPARQL, but then when I query one graph I get that "Error 500: Not in a transaction" error.
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Andy Seaborne <an...@apache.org>.
No idea. Depends what the setup is.
TDB runs either without transactions (as in ja:RDFDataset) or with transactions
(ja:DatasetTDB), and once it goes from no-transaction to transactions, it does
not go back again.
On 27/11/17 13:38, Laura Morales wrote:
>> Actually, no, that won't work. The transactions on TDB will stop it as
>> one access path is non-transactional on TDB and the other is transactional.
>
> Is this the reason then why I'm getting "Error 500: Not in a transaction"? For comparison, I was able to run a RDFDataset with several distinct HDT files as "ja:graph". On the other hand the same setup did not work for TDB. Does this mean that only a big RDFDataset (with all graphs together) is supported? It's not possible to create a dataset with graph-1 and graph-2 living in two separate locations?
>
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
> Actually, no, that won't work. The transactions on TDB will stop it as
> one access path is non-transactional on TDB and the other is transactional.
Is this the reason then why I'm getting "Error 500: Not in a transaction"? For comparison, I was able to run a RDFDataset with several distinct HDT files as "ja:graph". On the other hand the same setup did not work for TDB. Does this mean that only a big RDFDataset (with all graphs together) is supported? It's not possible to create a dataset with graph-1 and graph-2 living in two separate locations?
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Andy Seaborne <an...@apache.org>.
On 27/11/17 09:20, Andy Seaborne wrote:
> PS In fact you can ALSO have a RDFDataset across both (different named
> graphs, tdb:dataset pointing to the "<#tdbDataset> rdf:type
> tdb:DatasetTDB" which is also directly referenced in fuseki:dataset).
Actually, no, that won't work. The transactions on TDB will stop it as
one access path is non-transactional on TDB and the other is transactional.
Andy
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Andy Seaborne <an...@apache.org>.
Do you want to query both? How much are you willing to pay for that in
other factors?
Put each graph in a separate dataset (one graph per dataset), /dbpedia
and /wikidata.
The cost is you then can't make a single query across both at the same
time (without using SERVICE) but as the sources are independent, I can't
see how you can anyway. Instead ask one graph, get results, do whatever
mapping is needed, ask other graph.
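
For the SERVICE route, a federated query might be assembled as in the Python sketch below. The two endpoint URLs are hypothetical, and using owl:sameAs links stored on the DBpedia side as the mapping between the sources is an assumption, not something stated in this thread.

```python
def federated_query(dbpedia_endpoint: str, wikidata_endpoint: str, limit: int = 10) -> str:
    """Build a query that joins two separate datasets through SERVICE clauses."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?dbp ?wd ?p ?o\n"
        "WHERE {\n"
        # Assumed mapping: DBpedia resources linked to Wikidata entities.
        f"  SERVICE <{dbpedia_endpoint}> {{ ?dbp owl:sameAs ?wd }}\n"
        f"  SERVICE <{wikidata_endpoint}> {{ ?wd ?p ?o }}\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Hypothetical local endpoints, one Fuseki service per dataset.
query = federated_query(
    "http://localhost:3030/dbpedia/query",
    "http://localhost:3030/wikidata/query",
    limit=5,
)
print(query)
```

On data of this size a federated join can be slow, so asking one dataset first and then the other, as described above, may be the more practical choice.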
Andy
PS In fact you can ALSO have a RDFDataset across both (different named
graphs, tdb:dataset pointing to the "<#tdbDataset> rdf:type
tdb:DatasetTDB" which is also directly referenced in fuseki:dataset).
Try on small scale data.
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
Let's say you want a local (in your LAN) fuseki server with two graphs: dbpedia and wikidata. Then, when a new wikidata dump is available you want to update your local graph. The easier approach is to delete the old graph and add the new one. So, how do you do this? Do you s-delete and then s-put?
What I was trying to do is, have dbpedia and wikidata in 2 separate folders, so that I can easily delete one folder and replace it with the new updated graph, instead of transmitting the whole thing over HTTP. Makes sense?
Re: Fuseki all graphs into dataset vs separate graphs
Posted by ajs6f <aj...@apache.org>.
Can you say a little more about your aims? Why do you want to put these graphs together? Do you want to write queries over their union? Where do the graphs come from? Do they partition the subjects of triples? (If they do, you don't need to use graphs to separate them.)
Using s-put (or any other HTTP technique) will have costs that depend on your local situation. If your client (say, s-put) is on the same machine as Fuseki, those costs might not be very much at all. If Fuseki is far away and across an unreliable network, those costs might be a lot.
ajs6f
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
> Instead use one TDB dataset and update one graph using s-put
If I'm correct this works over HTTP; would this be a good fit for large graphs as well (eg wikidata)?
The use case that I have in mind is "replace graph G from dataset D every once in a while (week or month)", so I thought using a TDB2 store for each graph in their own location was a good idea. Not different stores; all TDB2 stores. Like this

    <#dataset> rdf:type ja:RDFDataset ;
        ja:namedGraph
            [ ja:graphName <http://example.org/name1> ;
              ja:graph <#graph1> ] ;

So I could just delete the graph folder and replace it with another directory containing the new version of the graph created with tdb2.tdbloader.
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Andy Seaborne <an...@apache.org>.
It's the dataset structure that's in-memory, not necessarily the
graphs (orthogonal issue).
The general dataset has only a weak notion of transaction. There are no
transactions across different storage. All storage is datasets: each of
your graphs is in a separate TDB dataset, and there are no free-standing
TDB graphs.
On 26/11/17 16:12, Laura Morales wrote:
>> Building a ja:RDFDataset is going to build you an in-memory dataset that does not fully support transactions, which I think is why you are getting that error. In that second config you are building a bunch of TDB databases and then taking views from them to compose an in-memory dataset, which I think is probably not what you intend to do.
>
> I think you're right... I don't want to build an in-memory dataset, I just want to have a single dataset, but instead of having all graphs loaded together, I'd like to split them into their own folders such as "graph-1/", "graph-2/", "graph-3/" (each directory containing a TDB2 graph)... such that if I want to update a graph I can simply remove a directory and replace it with a new one. Is this possible?
That will need a server restart.
Instead use one TDB dataset and update one graph using s-put (live) or
offline.
Andy
Re: Fuseki all graphs into dataset vs separate graphs
Posted by Laura Morales <la...@mail.com>.
> Building a ja:RDFDataset is going to build you an in-memory dataset that does not fully support transactions, which I think is why you are getting that error. In that second config you are building a bunch of TDB databases and then taking views from them to compose an in-memory dataset, which I think is probably not what you intend to do.
I think you're right... I don't want to build an in-memory dataset, I just want to have a single dataset, but instead of having all graphs loaded together, I'd like to split them into their own folders such as "graph-1/", "graph-2/", "graph-3/" (each directory containing a TDB2 graph)... such that if I want to update a graph I can simply remove a directory and replace it with a new one. Is this possible?
Re: Fuseki all graphs into dataset vs separate graphs
Posted by ajs6f <aj...@apache.org>.
Building a ja:RDFDataset is going to build you an in-memory dataset that does not fully support transactions, which I think is why you are getting that error. In that second config you are building a bunch of TDB databases and then taking views from them to compose an in-memory dataset, which I think is probably not what you intend to do.
ajs6f