Posted to users@jena.apache.org by thomas lörtsch <tl...@rat.io> on 2018/09/18 19:37:11 UTC

statement identifiers

Hi,

a question (and my apologies upfront that I’m not taking the time to dive into the code myself, but it would take me a lot of time):

Does Jena happen to add an internal ID to each quad (statement+graphName)?

Some databases do so for internal administrative purposes (I believe), so I thought it might be worth asking.
If Jena does provide such IDs, I would like to use them as reification IDs, and my next question would be how hard it is to access them.

Thanks,
Thomas

Re: statement identifiers

Posted by thomas lörtsch <tl...@rat.io>.

> On 19. Sep 2018, at 12:22, Andy Seaborne <an...@apache.org> wrote:
> 
> You could look at the RDF* work https://github.com/RDFstar/RDFstarTools from Olaf Hartig. (Actually, RDF* isn't a triple ID, though it can be implemented that way: the triple is a new kind of RDF term in the concrete data model of RDF.)

I’m not fond of that literal style of reification, for several reasons:
- it doesn’t scale well to annotations on annotations on annotations…
- it "feels" wrong: the triple and its reified representation are too similar
- I could be wrong, but aren’t its semantics the same as those of RDF standard reification (and therefore a problem, not the solution)?

> Do note that any triple ID, or triple as an element in the data model, is not reification. RDF* and things of a similar design can be encoded using reification, but they are not reification.
> 
> For example: the stating (that is, a claim, not a fact):
> 
> :A :says ":moon :color :blue" .
> 
> can't be done in RDF* without
> 
> :moon :color :blue .
> 
> also being in the data, but if it is in the data, it is a fact. The triple spoken about must be in the data, whereas reification does not require that.
> 
> Reification can express differing points of view:
> 
> :A :says ":moon :color :blue" .
> :B :says ":moon :color :red" .
> 
> without
> 
> SELECT * { :moon :color  ?C }
> 
> returning anything (the data does not have a fact about the moon color, only claims).
> 
> With careful modelling using the triple id, you can build interesting cases, especially by using event-based modelling to attach groups of triples about claims.

I have to nitpick on your wording here: reification is reification no matter what exactly it reifies. RDF standard reification has that strange thing of being able to reify statements that haven’t actually been asserted. At least it is interpreted that way but that might be a common misconception - see below. What you want to say is that RDF standard reification can reify more things ((maybe) unasserted statements, to be exact) than what RDF* can reify.

> OWL-ism: note that :color might be a functional property, so if two color triples are in the data, it would be inferred that the two values are owl:sameAs.
> 
> :blue owl:sameAs :red .
> 
> Oops!

Of course, but … does it really matter? There will always be cases where things blow up because data from different sources gets mixed that wasn’t meant to be (a case for contexts…).

> I'd be very interested in hearing your idea about adding named graphs into this mix.

Okay, you asked for it… :-) I’ve been bitching for decades about the mediocre meta modelling facilities in RDF (used to prefer Topic Maps back in the day) and finally decided to try to do something about it, so I started to work on "Context on the Semantic Web". Turns out the topic is more complicated than I thought and reaches into all kinds of unpleasant, strange and unexpected areas (RDF standard reification, n-ary relations, attributed graphs etc.), not to mention the vague and intangible nature of the concept of context itself.

Regarding reification I asked on semantic-web@w3.org [0] a few weeks ago and Pat Hayes was kind enough to enlighten me. If I got it right there are 3 kinds of triples:
- an abstract triple (TA): a specific triple that could be stated but hasn’t been (at least not to our knowledge) - the thing you think RDF standard reification implements
- a triple type (TY): a specific triple and all its occurrences in any document, database, scribbled on a piece of paper etc.
- a triple token (TO): a specific occurrence of some triple type, a triple in some named graph in some database on some server under some desk
To my great surprise I learned that RDF reification is meant to deal with the third case - the concrete triple token occurrence - though it needs means external to RDF to refer to such tokens. Back then, in the late 90s, this was seen as necessary to be able to add e.g. provenance information to triples. RDF, however, has no way to distinguish one occurrence TO of a triple TY from another occurrence TO of the same triple TY in e.g. another file (as RDF has no concept of a document that holds the triple, nor any other concept of "context", because its model theory is defined in terms of abstract set theory). So RDF here could only refer to means outside of RDF, and those are not standardized.
With respect to the common conception that RDF standard reification can represent unasserted statements I’d say: no, it can’t. That would be the abstract triple TA. It does (want to) speak about triple tokens TO, it just can’t say which one exactly it means. OTOH if you prefer "If it walks like a duck…" type semantics then you’re probably right - but one thing I learned with this Semantic Web stuff is that in the end the pedants win every argument ;-)
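To make the type/token distinction concrete, here is a minimal Python sketch. The class names `Triple` and `TripleToken` and the `source` field are illustrative assumptions, not anything defined by RDF or Jena: a triple type is compared purely by value, while a token is a type plus the place where it occurs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A triple type (TY): identified purely by its three terms."""
    s: str
    p: str
    o: str


@dataclass(frozen=True)
class TripleToken:
    """A triple token (TO): one occurrence of a triple type in some source."""
    triple: Triple
    source: str  # hypothetical out-of-band locator, e.g. a file or graph name


# Two occurrences of the same triple type in two different documents.
token_a = TripleToken(Triple(":moon", ":color", ":blue"), "file:a.ttl")
token_b = TripleToken(Triple(":moon", ":color", ":blue"), "file:b.ttl")

same_type = token_a.triple == token_b.triple  # one triple type
same_token = token_a == token_b               # two distinct tokens

# A plain RDF graph is a set of triple types, so merging loses the tokens.
merged = {token_a.triple, token_b.triple}
```

In this sketch the `source` field plays the role of the "means external to RDF": without it, the two tokens collapse into a single triple type as soon as the graphs are merged.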

Now, as you know all too well, Named Graphs in Datasets also offer no semantically sound way to refer to them without some out-of-band mechanism, as it is unspecified whether a graph name actually denotes the graph it names or merely labels it while referring to something else.

So RDF doesn’t provide _any_ semantically sound means for meta modelling. That is of course no news, but it seems that it is finally starting to be considered a problem, as meta modelling is used in graphs outside the Semantic Web, like in Property Graphs and in Wikidata [1]. That should be good news for me, but unfortunately I still don’t have a good plan for how everything could be made to fit, as there are more problems.

Contexts, as vague a concept as they are, could be reduced to a very abstract concept of secondary attributes to (primary) relations. Everything more specific/expressive probably belongs in the realm of vocabularies. RDF, however, would have to provide a basic mechanism to add attributes to relations, and that would require it to be able to definitely denote triple occurrences TO - but it just can’t… Granted, syntactically it can in Datasets, and to a less pleasant degree with standard reification - but neither is backed by model theoretic semantics. So no reasoning, no well defined semantics, and therefore no good.

My idea now is to proceed stepwise: first add a statement identifier to every statement. To make sure it is not (again, like with graph names) used as a label (e.g. to indicate a "context"), the statement ID is a hash (maybe MD5, but I’m no expert there). One could imagine it as having been there all the time, it just hadn’t been made explicit - so no need to extend the RDF semantics, just a syntactic tweak that fits snugly into that little spot left undefined by the RDF standard reification mechanism. And boom, we get rock solid triple denotation with well defined formal model theoretic semantics and reasoning galore, and are moved to tears! Ahem…

Actually that hash provides only the way to denote the triple type TY, not the triple token TO. So a second step is required: for the token TO - which is what we are really aiming for most of the time - we need a concept of context or surface or g-box or similar. But we can also get there via a second statement, like:
    aSub    aPred      aObj    id_1
    id_1    inContext  aCon    id_2
These two triples+ID can be written in one row:
    aSub    aPred      aObj    id_1    id_1    inContext  aCon    id_2
We see that columns 4 and 5 are redundant and column 6 will always be "inContext", so we can shorten the two triples+ID to one quad+ID:
    aSub    aPred      aObj                               aCon    id_2
The hashing function now includes the field in column 4 and ensures that a triple in the default context would still get id_1:
    aSub    aPred      aObj                               ----    id_1
So we can differentiate between a triple in the default context and some contextualized triple. What that is actually worth, depends. In a local dataset it can make a difference. As soon as datasets are shared on the web the default context will likely change to some source identifier.
It is also possible to compute a statement hash and use it as a subject of other triples without actually adding the statement to the store (still it can and probably should be described using a standard reification style quadlet). That covers the TA use case. It would be good though to define a type other than rdf:Statement for this.
No doubt there are a lot of technical details that I overlooked. Maybe the MD5 hash can be omitted in internal use and replaced by something more performant and/or legible: "#someCruelHash rdf:label ex:myfirstTriple". Maybe, if reification is only used sparingly, the fifth field can be optimized away, etc.

What I like about this approach is that it solidly binds the quad+ID to the triples+ID. As the triple ID is firmly bound to the triple itself that should guarantee some pretty solid semantics for the quad/quint as well - at least I haven’t found any gaping holes so far.

We can now, finally, talk about specific concrete statement occurrences TO in semantically sound ways. The context field can for example carry information about the database that holds this triple, but just as well anything else. It will probably be subject of some other statements that define its attributes. 
There is total freedom in the use of the context field and actually that bothers me a little. But that’s another topic. What’s important is that the basic mechanism is now in place to have the luxury of such a problem :) This in short is the background of why I find quad+ID interesting.

Another justification would be that people often want to group statements. There are a lot of practical use cases where it seems abhorrent to add a second statement to every other statement, e.g. just to record the date of ingestion. A context field provides the means to do this quite efficiently.
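As a small illustration of the grouping argument, here is a Python sketch with invented data: the context name `:batch1` and the predicate `:ingestedOn` are hypothetical. One statement about the context covers every quad that carries it.

```python
# Quads: (subject, predicate, object, context).
quads = [
    (":moon", ":color", ":blue", ":batch1"),
    (":moon", ":orbits", ":earth", ":batch1"),
    (":sun", ":color", ":yellow", ":batch1"),
]

# A single statement about the context itself.
context_facts = {(":batch1", ":ingestedOn", "2018-09-18")}


def ingestion_date(quad):
    """Look up the ingestion date of a quad through its context field."""
    ctx = quad[3]
    for s, p, o in context_facts:
        if s == ctx and p == ":ingestedOn":
            return o
    return None


dates = [ingestion_date(q) for q in quads]

# Statement counts: one annotation per context vs. one per statement.
per_context_cost = len(quads) + len(context_facts)
per_statement_cost = 2 * len(quads)
```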

My main problem right now is that I find it very hard to distinguish between an attribute of a relation and a context of that relation. The mechanism described above allows annotating a statement with additional statements, either through its ID or through its context field, which can refer to an arbitrarily complex context object. A pragmatic approach could be to say that contexts are for grouping purposes whereas statement specific attributions should handle aspects specific to that single statement. That probably makes it easier to model stuff but it sure makes it harder to query (having to search contexts as well as reifications). This requires more thinking...

Just one more idea for useful applications of statement IDs: in another mail on this(?) mailing list [2] you argue that relation reification is an anti-pattern. I’m not sure I agree with you there. I’m thinking more in the direction of complex, fine-grained objects from which simplistic "A emails B" triples are derived - and linked back to the originating complex object :myFirstEmail that has all the details like date, BCCs etc. That back link could be provided through a statement annotation "ID_1 derivedFrom :myFirstEmail". The simplistic triple "A emails B" would convey a basic fact and facilitate retrieval and integration.
The triple is indeed both the weak and the strong point of RDF: anything reasonably complex is at best tedious to model with triples. But in a vast, heterogeneous, distributed graph the basic, simplistic triple is by far the best bet to successfully navigate from A to B to C etc. So: statement IDs and back links to the rescue (hopefully, maybe…). The complex mothership then might even be an RDBMS table, a tree structure, who knows - but that’s already RDF 3.0 I fear.
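The back-link idea could look roughly like this in Python. Everything here is a sketch under assumptions: the email record and its field names are invented, and the statement ID is simply an MD5 hash over the triple's terms.

```python
import hashlib


def statement_id(s, p, o):
    """Hypothetical ID scheme: MD5 over the length-prefixed terms."""
    payload = "".join(f"{len(t)}:{t}" for t in (s, p, o)).encode("utf-8")
    return "id_" + hashlib.md5(payload).hexdigest()


# The complex object with all the details (field names are illustrative).
email = {
    "id": ":myFirstEmail",
    "from": ":A",
    "to": [":B", ":C"],
    "date": "2018-09-18",
}


def derive(mail):
    """Derive simple 'X emails Y' triples plus back links to the email."""
    simple, backlinks = [], []
    for recipient in mail["to"]:
        triple = (mail["from"], ":emails", recipient)
        simple.append(triple)
        backlinks.append((statement_id(*triple), ":derivedFrom", mail["id"]))
    return simple, backlinks


simple, backlinks = derive(email)


def origin_of(triple):
    """Follow the back link from a simple triple to its complex source."""
    sid = statement_id(*triple)
    return [o for s, p, o in backlinks if s == sid and p == ":derivedFrom"]
```

Navigation then works on the simple triples alone, and `origin_of` is only consulted when the details of the originating object are needed.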

Thanks for your interest! I hope this wasn’t too much detail.
Thomas

> One of the problems with reification is that it applies to statements, not a group of statements.
> 
> In extremis:
> 
> GRAPH <someId1> { :moon :color :blue }
> GRAPH <someId2> { :moon :color :red }
> 
> at least means the default graph makes no assertion about
> { :moon :color  ?C }
> 
>    Andy

[0] https://lists.w3.org/Archives/Public/semantic-web/2018Jul/0024.html
[1] Hernández, Daniel, Aidan Hogan, and Markus Krötzsch. "Reifying RDF: What works well with Wikidata?" SSWS@ISWC 1457 (2015): 32-47.
[2] https://apache.markmail.org/message/js6s6ry5st73soay

Re: statement identifiers

Posted by Andy Seaborne <an...@apache.org>.

You could look at the RDF* work https://github.com/RDFstar/RDFstarTools
from Olaf Hartig. (Actually, RDF* isn't a triple ID, though it can be
implemented that way: the triple is a new kind of RDF term in the
concrete data model of RDF.)

Do note that any triple ID, or triple as an element in the data model, is
not reification. RDF* and things of a similar design can be encoded using
reification, but they are not reification.

For example: the stating (that is, a claim, not a fact):

:A :says ":moon :color :blue" .

can't be done in RDF* without

:moon :color :blue .

also being in the data, but if it is in the data, it is a fact. The
triple spoken about must be in the data, whereas reification does not
require that.

Reification can express differing points of view:

:A :says ":moon :color :blue" .
:B :says ":moon :color :red" .

without

SELECT * { :moon :color  ?C }

returning anything (the data does not have a fact about the moon color, 
only claims).
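A minimal Python sketch of the point above, using a toy in-memory triple set rather than Jena. It assumes the claims are encoded with the standard reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) instead of the quoted-string shorthand, and the helper names `reify` and `match` are made up for illustration.

```python
def reify(node, s, p, o):
    """Return the four triples describing (s, p, o) without asserting it."""
    return {
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    }


def match(graph, s=None, p=None, o=None):
    """Basic triple pattern match; None acts as a variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


graph = set()
graph |= reify("_:claim1", ":moon", ":color", ":blue")
graph |= reify("_:claim2", ":moon", ":color", ":red")
graph.add((":A", ":says", "_:claim1"))
graph.add((":B", ":says", "_:claim2"))

# Equivalent of SELECT * { :moon :color ?C }: no such fact is in the data.
facts = match(graph, s=":moon", p=":color")

# The claims are still reachable through the reification nodes.
claims = {
    who: match(graph, s=node, p="rdf:object")[0][2]
    for who, _, node in match(graph, p=":says")
}
```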

With careful modelling using the triple id, you can build interesting 
cases, especially by using event-based modelling to attach groups of 
triples about claims.

OWL-ism: note that :color might be a functional property, so if two
color triples are in the data, it would be inferred that the two values
are owl:sameAs.

:blue owl:sameAs :red .

Oops!

I'd be very interested in hearing your idea about adding named graphs
into this mix.

One of the problems with reification is that it applies to statements,
not a group of statements.

In extremis:

GRAPH <someId1> { :moon :color :blue }
GRAPH <someId2> { :moon :color :red }

at least means the default graph makes no assertion about
{ :moon :color  ?C }
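The same idea as a toy Python sketch, with a dataset modelled as a plain dict from graph name to triples (`None` standing in for the default graph). This is an illustration only, not how Jena stores quads, and it assumes the default graph is kept separate from the named graphs; a store configured to expose the union of the named graphs as its default graph would behave differently.

```python
# Dataset: graph name -> set of triples; None stands for the default graph.
dataset = {
    None: set(),
    "someId1": {(":moon", ":color", ":blue")},
    "someId2": {(":moon", ":color", ":red")},
}


def moon_colors(graph_name):
    """Evaluate { :moon :color ?C } against a single graph."""
    return sorted(
        o for s, p, o in dataset[graph_name]
        if s == ":moon" and p == ":color"
    )


# The default graph asserts nothing about the moon's color.
default_colors = moon_colors(None)

# Each named graph carries its own, possibly conflicting, statement.
per_graph = {name: moon_colors(name) for name in dataset if name is not None}
```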

     Andy

On 18/09/18 22:16, Rob Vesse wrote:
> None of the Jena-provided implementations use statement IDs. That includes both TDB1 and TDB2, which both just store quads directly.
> 
> Rob

Re: statement identifiers

Posted by Rob Vesse <rv...@dotnetrdf.org>.

None of the Jena-provided implementations use statement IDs. That includes both TDB1 and TDB2, which both just store quads directly.

Rob

On 18/09/2018, 13:15, "ajs6f" <aj...@apache.org> wrote:

    >> 
    >> Not in general, no, although some specific DatasetGraph implementations may.
    > 
    > Any idea where I should look?
    
    Nope. None of the in-memory implementations do this to my knowledge, because they needn't. I don't know whether either TDB1 or TDB2 does, but I can't think of a reason they would.
    
    It's possible that someone out there in the community has written one, or you could try implementing DatasetGraph yourself, perhaps reusing some other implementation for part of the work.
    
    ajs6f
    
    > On Sep 18, 2018, at 4:00 PM, thomas lörtsch <tl...@rat.io> wrote:
    > 
    > 
    >> On 18. Sep 2018, at 21:40, ajs6f <aj...@apache.org> wrote:
    > 
    > That was quick!
    > 
    >> Not in general, no, although some specific DatasetGraph implementations may.
    > 
    > Any idea where I should look?
    > 
    >> There is some API support for reification:
    >> 
    >> https://jena.apache.org/documentation/notes/reification.html
    >> 
    >> Does that meet your use case?
    > 
    > No, unfortunately not. I need the graph name too.
    > 
    > Thomas
    > 
    > 
    >> ajs6f
    >> 
    >>> On Sep 18, 2018, at 3:37 PM, thomas lörtsch <tl...@rat.io> wrote:
    >>> 
    >>> Hi,
    >>> 
    >>> a questions (and my apologies upfront that I don’t take the time to dive into the code myself, but it would take me a lot of time):
    >>> 
    >>> Does Jena happen to add an internal ID to each quad (statement+graphName)?
    >>> 
    >>> Some databases do so for internal administrative purposes (I believe) and so I thought it might be worth to ask.
    >>> If Jena does provide such IDs I would like to use them as reification IDs and my next question would be about how hard it is to access them.
    >>> 
    >>> Thanks,
    >>> Thomas
    >> 
    >

Re: statement identifiers

Posted by ajs6f <aj...@apache.org>.

>> 
>> Not in general, no, although some specific DatasetGraph implementations may.
> 
> Any idea where I should look?

Nope. None of the in-memory implementations do this to my knowledge, because they needn't. I don't know whether either TDB1 or TDB2 does, but I can't think of a reason they would.

It's possible that someone out there in the community has written one, or you could try implementing DatasetGraph yourself, perhaps reusing some other implementation for part of the work.

ajs6f

> On Sep 18, 2018, at 4:00 PM, thomas lörtsch <tl...@rat.io> wrote:
> 
> 
>> On 18. Sep 2018, at 21:40, ajs6f <aj...@apache.org> wrote:
> 
> That was quick!
> 
>> Not in general, no, although some specific DatasetGraph implementations may.
> 
> Any idea where I should look?
> 
>> There is some API support for reification:
>> 
>> https://jena.apache.org/documentation/notes/reification.html
>> 
>> Does that meet your use case?
> 
> No, unfortunately not. I need the graph name too.
> 
> Thomas

Re: statement identifiers

Posted by thomas lörtsch <tl...@rat.io>.

> On 18. Sep 2018, at 21:40, ajs6f <aj...@apache.org> wrote:

That was quick!

> Not in general, no, although some specific DatasetGraph implementations may.

Any idea where I should look?

> There is some API support for reification:
> 
> https://jena.apache.org/documentation/notes/reification.html
> 
> Does that meet your use case?

No, unfortunately not. I need the graph name too.

Thomas



Re: statement identifiers

Posted by ajs6f <aj...@apache.org>.

Not in general, no, although some specific DatasetGraph implementations may. There is some API support for reification:

https://jena.apache.org/documentation/notes/reification.html

Does that meet your use case?

ajs6f
