Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/11/26 15:55:55 UTC

Fuseki all graphs into dataset vs separate graphs

What is the difference between these two scenarios?

1. a dataset with all "s p o g" quads loaded into the dataset, with this configuration

<#dataset> rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "TDB2" ;
    tdb2:unionDefaultGraph true .


2. a dataset where each graph is in its own folder, with this configuration

<#dataset> rdf:type ja:RDFDataset ;
     ja:namedGraph
        [ ja:graphName      <http://example.org/g1> ;
          ja:graph          <#g1> ] ;
     ja:namedGraph
        [ ja:graphName      <http://example.org/g2> ;
          ja:graph          <#g2> ] ;
     .

<#g1> rdf:type tdb:GraphTDB2 ;
    tdb2:location "DB-1" .

<#g2> rdf:type tdb:GraphTDB2 ;
    tdb2:location "DB-2" .


- Is there any performance hit with the second method?
- With the second method, is it still possible to have the default union graph of all single graphs?
- I tried to query Fuseki with the second method:
    - This query works: select ?g where { graph ?g {} }
    - This query returns "Error 500: Not in a transaction":
      select ?s from <http://example.org/g1> where { ?s ?p ?o } limit 1

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

Oh yeah, sorry. I've always used TDB with Fuseki; it was late at night and for a moment I completely forgot about all the other tdb2.* CLI tools.

Sent: Sunday, November 26, 2017 at 8:54 PM
From: ajs6f <aj...@apache.org>
To: users@jena.apache.org
Subject: Re: Fuseki all graphs into dataset vs separate graphs
"s-put" has nothing to do with TDB2-- it is entirely about SPARQL Graph Store protocol. It would work perfectly well with any implementation thereof, including non-Jena ones.

You will find in the bin/ directory of a Jena distribution a series of CLI tools for working with TDB2 databases, called tdb2.tdbquery, tdb2.tdbupdate etc. They work very much like their TDB1 counterparts. They will let you work against your TDB2 database without Fuseki.

If all you want to do is load data and query it yourself, you don't need Fuseki. You can just use the CLI tools. tdb2.tdbupdate will let you handle your graph replacement chore easily.

ajs6f

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Osma Suominen <os...@helsinki.fi>.

Andrew U. Frank wrote on 26.11.2017 at 23:54:
> When I use the SPARQL Update protocol to store data, am I not using
> Fuseki?

SPARQL the protocol (defined over HTTP) requires a server such as
Fuseki. SPARQL Update the language can also be used without a server,
for example with the command line tool tdbupdate.

> I define my TDB in Fuseki (in the run/configuration directory)
> and I start a Fuseki server which is the endpoint to receive the update
> queries. I had the impression that s-put is essentially doing wget to
> the SPARQL endpoint.

Yes, s-put is roughly equivalent to wget or curl. It's a special-purpose
HTTP client that knows about the SPARQL protocol so you can perform
slightly higher level operations. But you can do pretty much the same
things with wget or curl too.

> Is this (more or less) a correct understanding? How would one start the
> TDB or TDB2 server without Fuseki?

If you use the TDB/TDB2 command line tools you are not starting a 
server. Instead they start a Java process that reads and possibly 
manipulates the TDB database (directory full of files) while it is 
running, then exits when it is done. Typically they perform a single 
operation, for example tdbquery answers a single SPARQL query and 
tdbupdate performs a single SPARQL Update operation. Compared to Fuseki 
there is some startup overhead each time. However, sometimes the 
operations are more efficient when performed via command line tools 
instead of going through Fuseki, especially loading to TDB via tdbloader 
and tdbloader2.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Fuseki all graphs into dataset vs separate graphs

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.

Can I ask for clarification (my application is perhaps somewhat similar
to Laura's):

When I use the SPARQL Update protocol to store data, am I not using
Fuseki? I define my TDB in Fuseki (in the run/configuration directory)
and I start a Fuseki server which is the endpoint to receive the update
queries. I had the impression that s-put is essentially doing wget to
the SPARQL endpoint.

Is this (more or less) a correct understanding? How would one start the
TDB or TDB2 server without Fuseki?

andrew




-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil

Re: Fuseki all graphs into dataset vs separate graphs

Posted by ajs6f <aj...@apache.org>.

"s-put" has nothing to do with TDB2-- it is entirely about SPARQL Graph Store protocol. It would work perfectly well with any implementation thereof, including non-Jena ones.

You will find in the bin/ directory of a Jena distribution a series of CLI tools for working with TDB2 databases, called tdb2.tdbquery, tdb2.tdbupdate etc. They work very much like their TDB1 counterparts. They will let you work against your TDB2 database without Fuseki.

If all you want to do is load data and query it yourself, you don't need Fuseki. You can just use the CLI tools. tdb2.tdbupdate will let you handle your graph replacement chore easily. 
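
A minimal sketch of how the graph replacement could be written as a SPARQL Update request for tdb2.tdbupdate. The graph name and the file path are placeholders and do not come from this thread. Assuming the request is saved as replace.ru and the database directory is called TDB2 (both names are assumptions), it could be run with something like "tdb2.tdbupdate --loc=TDB2 --update=replace.ru"; check the tool's --help output for the exact options in your Jena version.

```sparql
# Replace one named graph in place; other graphs in the dataset are untouched.
# <http://example.org/wikidata> and the file path are placeholders (assumptions).

# Remove the old copy of the graph. SILENT avoids an error if it does not exist yet.
DROP SILENT GRAPH <http://example.org/wikidata> ;

# Load the new dump into a graph with the same name.
LOAD <file:///path/to/wikidata-new.nt> INTO GRAPH <http://example.org/wikidata>
```

A TDB database should only be opened by one process at a time, so this is meant for when Fuseki is not running on the same directory. For a dump as large as Wikidata, a bulk loader such as tdb2.tdbloader may be considerably faster than LOAD.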

ajs6f

> On Nov 26, 2017, at 2:46 PM, Laura Morales <la...@mail.com> wrote:
> 
>> Is this just for your own exploration
> 
> Yes
> 
>> in which case you might want to avoid Fuseki entirely and just work with TDB
> 
> I can issue SPARQL queries directly at TDB2 without using Fuseki?
> 
> My original problem still stands tho :) Is "s-put" the only way to replace a graph in a TDB2 dataset? Is there any CLI tool (non http) that can manipulate a TDB2 dataset to replace a graph with another (eg. replace wikidata with a new dump)?

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

> Is this just for your own exploration

Yes

> in which case you might want to avoid Fuseki entirely and just work with TDB

I can issue SPARQL queries directly at TDB2 without using Fuseki?

My original problem still stands tho :) Is "s-put" the only way to replace a graph in a TDB2 dataset? Is there any CLI tool (non http) that can manipulate a TDB2 dataset to replace a graph with another (eg. replace wikidata with a new dump)?

Re: Fuseki all graphs into dataset vs separate graphs

Posted by ajs6f <aj...@apache.org>.

As of the last time I looked, Wikidata and DBpedia do _not_ share subject URIs. The fact that they come from the same semi-structured data doesn't imply anything about how they are built up as RDF. AFAIK, Wikidata itself isn't RDF at all internally, but is mapped into RDF for publication.

https://meta.wikimedia.org/wiki/Wikidata/Essays/URI_scheme

Otherwise, it looks like you want to do provenance work, and named graphs are a pretty good way to do that. Although, if you are limiting yourself to Wikidata and DBpedia, you might be able to distinguish the source of a triple just based on the namespace of the subject. Is this just for your own exploration (in which case you might want to avoid Fuseki entirely and just work with TDB)?
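
A rough sketch of the namespace idea, assuming the data uses the commonly published entity namespaces http://www.wikidata.org/entity/ and http://dbpedia.org/resource/ (check these against your own dumps; the labels "wikidata", "dbpedia" and "other" are made up for illustration):

```sparql
# Tag each subject with its apparent source, judged only by the IRI prefix.
SELECT DISTINCT ?s ?source
WHERE {
  ?s ?p ?o .
  # Blank-node subjects have no IRI to inspect, so skip them.
  FILTER (isIRI(?s))
  BIND (
    IF(STRSTARTS(STR(?s), "http://www.wikidata.org/entity/"), "wikidata",
      IF(STRSTARTS(STR(?s), "http://dbpedia.org/resource/"), "dbpedia", "other"))
    AS ?source
  )
}
LIMIT 20
```

Subjects that fall outside both prefixes end up as "other", so this is only a heuristic and not a substitute for real provenance tracking.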

ajs6f

> On Nov 26, 2017, at 2:30 PM, Laura Morales <la...@mail.com> wrote:
> 
>> Do they share triple subjects?
> 
> Yes dbpedia/wikidata are two graphs of the same source
> 
>> I am trying to understand why you are intent on putting them in separate graphs.
> 
> 1- to query only one source if I want data from a single graph (either dbpedia or wikidata and not both)
> 2- to extract the origin of a subject, for example select (subject, graph)
> 3- because I can update only one of the two graphs instead of having to update both of them

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

> Do they share triple subjects?

Yes dbpedia/wikidata are two graphs of the same source

> I am trying to understand why you are intent on putting them in separate graphs.

1- to query only one source if I want data from a single graph (either dbpedia or wikidata and not both)
2- to extract the origin of a subject, for example select (subject, graph)
3- because I can update only one of the two graphs instead of having to update both of them
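
For the first two points, named graphs inside a single TDB2 dataset (the first configuration in the original question) may already be enough. A sketch, assuming the sources were loaded as named graphs; the graph name <http://example.org/dbpedia> is a placeholder:

```sparql
# List subjects together with the graph they come from (point 2),
# restricted here to a single source (point 1).
SELECT DISTINCT ?s ?g
WHERE {
  VALUES ?g { <http://example.org/dbpedia> }
  GRAPH ?g { ?s ?p ?o }
}
LIMIT 10
```

Dropping the VALUES line would report subjects from every named graph in the dataset.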

Re: Fuseki all graphs into dataset vs separate graphs

Posted by ajs6f <aj...@apache.org>.

What are the actual queries you are expecting to run against these datasets? Do they share triple subjects?  I am trying to understand why you are intent on putting them in separate graphs.

ajs6f

> On Nov 26, 2017, at 1:30 PM, Laura Morales <la...@mail.com> wrote:
> 
> Let's say you want a local (in your LAN) fuseki server with two graphs: dbpedia and wikidata. Then, when a new wikidata dump is available you want to update your local graph. The easier approach is to delete the old graph and add the new one. So, how do you do this? Do you s-delete and then s-put?
> What I was trying to do is, have dbpedia and wikidata in 2 separate folders, so that I can easily delete one folder and replace it with the new updated graph, instead of transmitting the whole thing over HTTP. Makes sense?

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

> The cost is you then can't make a single query across both at the same
> time (without using SERVICE) but as the sources are independent, I can't
> see how you can anyway. Instead ask one graph, get results, do whatever
> mapping is needed, ask other graph.

OK this is what I don't understand. Using HDT as a reference again... I was able to create an assembler file with one RDFDataset and multiple graphs, each graph pointing to a .hdt file. Then from SPARQL I could use "select from <graph-1> <graph-2> ..." and this worked.
Basically what I'm trying to do is the same thing but using TDB(2) instead of HDT. I've set up the assembler and I can get a list of graphs from SPARQL, but when I query one graph I get that "Error 500: Not in a transaction" error.

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Andy Seaborne <an...@apache.org>.

No idea.  Depends what the setup is.

TDB runs either without transactions (as in ja:RDFDataset) or with
transactions (tdb:DatasetTDB), and once it goes from non-transactional
to transactional, it does not go back again.

On 27/11/17 13:38, Laura Morales wrote:
>> Actually, no, that won't work. The transactions on TDB will stop it as
>> one access path is non-transactional on TDB and the other is transactional.
> 
> Is this the reason then why I'm getting "Error 500: Not in a transaction"? For comparison, I was able to run a RDFDataset with several distinct HDT files as "ja:graph". On the other hand the same setup did not work for TDB. Does this mean that only a big RDFDataset (with all graphs together) is supported? It's not possible to create a dataset with graph-1 and graph-2 living in two separate locations?
>

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

> Actually, no, that won't work. The transactions on TDB will stop it as
> one access path is non-transactional on TDB and the other is transactional.

Is this the reason then why I'm getting "Error 500: Not in a transaction"? For comparison, I was able to run a RDFDataset with several distinct HDT files as "ja:graph". On the other hand the same setup did not work for TDB. Does this mean that only a big RDFDataset (with all graphs together) is supported? It's not possible to create a dataset with graph-1 and graph-2 living in two separate locations?

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Andy Seaborne <an...@apache.org>.

On 27/11/17 09:20, Andy Seaborne wrote:
> PS: In fact you can ALSO have an RDFDataset across both (different named
> graphs, tdb:dataset pointing to the "<#tdbDataset> rdf:type
> tdb:DatasetTDB" which is also directly referenced in fuseki:dataset).

Actually, no, that won't work. The transactions on TDB will stop it as 
one access path is non-transactional on TDB and the other is transactional.

     Andy

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Andy Seaborne <an...@apache.org>.

Do you want to query both? How much are you willing to pay for that in 
other factors?

Put each graph in a separate dataset (one graph per dataset), /dbpedia 
and /wikidata.

The cost is you then can't make a single query across both at the same 
time (without using SERVICE) but as the sources are independent, I can't 
see how you can anyway.  Instead ask one graph, get results, do whatever 
mapping is needed, ask other graph.
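
A sketch of what the SERVICE route could look like on small-scale data. It assumes both datasets are served by one local Fuseki with query endpoints at http://localhost:3030/dbpedia/sparql and http://localhost:3030/wikidata/sparql (host, port and endpoint names are assumptions), and that the DBpedia data links to Wikidata entities with owl:sameAs (an assumption about the data, standing in for "whatever mapping is needed"):

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Join two independent datasets in one query using federated SERVICE calls.
SELECT ?dbpediaResource ?wikidataEntity ?label
WHERE {
  # Step 1: ask the first dataset for a handful of links to the other source.
  SERVICE <http://localhost:3030/dbpedia/sparql> {
    SELECT ?dbpediaResource ?wikidataEntity
    WHERE { ?dbpediaResource owl:sameAs ?wikidataEntity }
    LIMIT 10
  }
  # Step 2: ask the second dataset about the linked entities.
  SERVICE <http://localhost:3030/wikidata/sparql> {
    ?wikidataEntity rdfs:label ?label .
  }
}
```

Depending on how the query engine evaluates the second SERVICE block, this can transfer a lot of data, so asking one dataset first and the other afterwards may still be the more practical option for large dumps.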

     Andy

PS: In fact you can ALSO have an RDFDataset across both (different named
graphs, tdb:dataset pointing to the "<#tdbDataset> rdf:type
tdb:DatasetTDB" which is also directly referenced in fuseki:dataset).

Try on small scale data.

On 26/11/17 18:30, Laura Morales wrote:
> Let's say you want a local (in your LAN) fuseki server with two graphs: dbpedia and wikidata. Then, when a new wikidata dump is available you want to update your local graph. The easier approach is to delete the old graph and add the new one. So, how do you do this? Do you s-delete and then s-put?
> What I was trying to do is, have dbpedia and wikidata in 2 separate folders, so that I can easily delete one folder and replace it with the new updated graph, instead of transmitting the whole thing over HTTP. Makes sense?
>

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

Let's say you want a local (in your LAN) fuseki server with two graphs: dbpedia and wikidata. Then, when a new wikidata dump is available you want to update your local graph. The easier approach is to delete the old graph and add the new one. So, how do you do this? Do you s-delete and then s-put?
What I was trying to do is, have dbpedia and wikidata in 2 separate folders, so that I can easily delete one folder and replace it with the new updated graph, instead of transmitting the whole thing over HTTP. Makes sense?

Re: Fuseki all graphs into dataset vs separate graphs

Posted by ajs6f <aj...@apache.org>.

Can you say a little more about your aims? Why do you want to put these graphs together? Do you want to write queries over their union? Where do the graphs come from? Do they partition the subjects of triples? (If they do, you don't need to use graphs to separate them.)

Using s-put (or any other HTTP technique) will have costs that depend on your local situation. If your client (say, s-put) is on the same machine as Fuseki, those costs might not be very much at all. If Fuseki is far away and across an unreliable network, those costs might be a lot.

ajs6f

> On Nov 26, 2017, at 1:12 PM, Laura Morales <la...@mail.com> wrote:
> 
>> Instead use one TDB dataset and update one graph using s-put
> 
> If I'm correct this works over HTTP; would this be a good fit for large graphs as well (eg wikidata)?
> The use case that I have in mind is "replace graph G from dataset D every once in a while (week or month)", so I thought using a TDB2 store for each graph in their own location was a good idea. Not different stores; all TDB2 stores. Like this
> 
> <#dataset> rdf:type      ja:RDFDataset ;
>     ja:namedGraph
>        [ ja:graphName      <http://example.org/name1> ;
>          ja:graph          <#graph1> ] ;
> 
> So I could just delete the graph folder and replace it with another directory containing the new version of the graph created with tdb2.tdbloader.

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

> Instead use one TDB dataset and update one graph using s-put

If I'm correct this works over HTTP; would this be a good fit for large graphs as well (eg wikidata)?
The use case that I have in mind is "replace graph G from dataset D every once in a while (week or month)", so I thought using a TDB2 store for each graph in their own location was a good idea. Not different stores; all TDB2 stores. Like this

<#dataset> rdf:type      ja:RDFDataset ;
     ja:namedGraph
        [ ja:graphName      <http://example.org/name1> ;
          ja:graph          <#graph1> ] ;

So I could just delete the graph folder and replace it with another directory containing the new version of the graph created with tdb2.tdbloader.

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Andy Seaborne <an...@apache.org>.

It's the dataset structure that's in-memory, not necessarily the
graphs (orthogonal issue).

The general dataset has only a weak notion of transaction. There are no
transactions across different storage.  All storage is datasets - each of
your graphs is in a separate TDB dataset - there are no free-standing
TDB graphs.

On 26/11/17 16:12, Laura Morales wrote:
>> Building a ja:RDFDataset is going to build you an in-memory dataset that does not fully support transactions, which I think is why you are getting that error. In that second config you are building a bunch of TDB databases and then taking views from them to compose an in-memory dataset, which I think is probably not what you intend to do.
> 
> I think you're right... I don't want to build an in-memory dataset, I just want to have a single dataset, but instead of having all graphs loaded together, I'd like to split them into their own folders such as "graph-1/", "graph-2/", "graph-3/" (each directory containing a TDB2 graph)... such that if I want to update a graph I can simply remove a directory and replace it with a new one. Is this possible?

That will need a server restart.

Instead use one TDB dataset and update one graph using s-put (live) or 
offline.

     Andy

Re: Fuseki all graphs into dataset vs separate graphs

Posted by Laura Morales <la...@mail.com>.

> Building a ja:RDFDataset is going to build you an in-memory dataset that does not fully support transactions, which I think is why you are getting that error. In that second config you are building a bunch of TDB databases and then taking views from them to compose an in-memory dataset, which I think is probably not what you intend to do.

I think you're right... I don't want to build an in-memory dataset, I just want to have a single dataset, but instead of having all graphs loaded together, I'd like to split them into their own folders such as "graph-1/", "graph-2/", "graph-3/" (each directory containing a TDB2 graph)... such that if I want to update a graph I can simply remove a directory and replace it with a new one. Is this possible?

Re: Fuseki all graphs into dataset vs separate graphs

Posted by ajs6f <aj...@apache.org>.

Building a ja:RDFDataset is going to build you an in-memory dataset that does not fully support transactions, which I think is why you are getting that error. In that second config you are building a bunch of TDB databases and then taking views from them to compose an in-memory dataset, which I think is probably not what you intend to do. 

ajs6f