Posted to users@jena.apache.org by Patrick Hoeffel <pa...@issinc.com> on 2015/10/08 17:11:40 UTC

Performance Cost of Reification

I was at the Cassandra Summit conference last week and talked to the guys from Titan (ThinkAurelius) about their roadmap for the TitanDB property graph that they developed prior to being acquired by DataStax. While their opinion was clearly biased by their own work, they made a good point: if your ontology is complex enough that it requires reification in order to add properties to a triple, that can significantly affect the performance of the triple store (which seems somewhat obvious). I was wondering, in your experience, can you estimate how significant that degradation might be (if it really exists), or does it depend on too many factors to answer?

Thanks,

Patrick Hoeffel
Senior Software Engineer
Intelligent Software Solutions (www.issinc.com)
(719) 452-7371 (direct)
(719) 210-3706 (mobile)

"Bringing Knowledge to Light"

Re: Performance Cost of Reification

Posted by Andy Seaborne <an...@apache.org>.

(bother - wrong list - resent to users@)

On 09/10/15 14:44, Andy Seaborne wrote:
> It is nice that the Titan guys see RDF as something to compare to.
> Coincidentally, I was giving a talk about Property Graph / Linked Data
> just recently at the European ApacheCon BigData conference.
>
>
> The Property Graph (PG) market is maybe x2 the size of the RDF market,
> and both are small.  The challenge is growing the graph market, not one
> form taking market share away from the other.
>
> And the key difference between graph databases and other data systems is
> modelling.  The differences between graph systems are not the key here.
>
> About reification, they are somewhat off track.  Reification is a quite
> specialised feature for limited use. It is not RDF's equivalent to
> attributes on links in PG.
>
> Let me make that concrete with an example simplified from Graph
> databases / chapter 3 (page 52 in my copy).  The book is written by the
> Neo4J folks.
>
> Email provenance.
>
>      A sends_email_to B
>
> Now, you could reify that statement (the act by A of sending the email
> to B).
>
> Reification is way more powerful than just being able to add data to
> the triple.  It says "claim: A sends_email_to B"  - several different and
> competing claims can be made. But let's continue assuming reification
> and assertion of the triple ... [*]
>
> <<A sends email to B>>
>      cc C
>      cc D
>      sentOn Tuesday
>
> In the same modelling way you could add attributes to a PG graph edge
> for sends_email_to.
>
> Both are anti-patterns (as chapter 3 notes).
>
> The email sent is an important concept so model it explicitly:
>
> A   sends       MSG
> MSG receivedBy  B
> MSG cc_to       C
> MSG cc_to       D
> MSG sentOn      "Tuesday"
>
> By modelling the email message as a first class concept, not implicit in
> the activity via reification/link attributes, you can better add
> information e.g. which servers it was transferred by and stored on, when
> was it received (this is email - that might be twice) and better query
> it (who else accessed it on receipt).  Modelling those on the act of
> sending is making life hard (how do you talk about a draft email?)
>
> MSG contents        URL_to_content
> MSG hasChecksum     0xABCDEF
> MSG receivedHeader  "from nm15-vm2.bullet.mail.ne1.yahoo.com ...."
>
> That last one is tricky - one sending of a message can result in
> different receivedHeaders depending on the receiver.
>
> This is event-based modelling, not reification.
>
>
> If you wanted a highly efficient reification-supporting RDF store, then
> build one.  No need to blindly store as multiple triples (it's called
> compression!).  You don't see such stores because reification is a minor
> feature of RDF.  Event-based modelling and named graphs are often better.
>
>      Andy
>
> [*]
> << >> is syntax that I proposed in early SPARQL drafts pre 1.0 for
> reification support but didn't gain much support. It is still in the ARQ
> parser source but not active.
>

Re: Performance Cost of Reification

Posted by Paul Houle <on...@gmail.com>.

These days I am a big fan of RDF* and SPARQL*, which unify RDF with the
property graph model.  On the other hand I used to hate blank nodes but I
learned to stop worrying and love them.  I am hoping anyway that Neo4J and
its ilk become a gateway drug to the RDF world.






-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/

Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275

Re: Performance Cost of Reification

Posted by Andy Seaborne <an...@apache.org>.

It is nice that the Titan guys see RDF as something to compare to. 
Coincidentally, I was giving a talk about Property Graph / Linked Data
just recently at the European ApacheCon BigData conference.


The Property Graph (PG) market is maybe x2 the size of the RDF market, 
and both are small.  The challenge is growing the graph market, not one 
form taking market share away from the other.

And the key difference between graph databases and other data systems is 
modelling.  The differences between graph systems are not the key here.

About reification, they are somewhat off track.  Reification is a quite 
specialised feature for limited use. It is not RDF's equivalent to 
attributes on links in PG.

Let me make that concrete with an example simplified from Graph 
databases / chapter 3 (page 52 in my copy).  The book is written by the
Neo4J folks.

Email provenance.

     A sends_email_to B

Now, you could reify that statement (the act by A of sending the email 
to B).

Reification is way more powerful than just being able to add data to
the triple.  It says "claim: A sends_email_to B"  - several different and
competing claims can be made. But let's continue assuming reification 
and assertion of the triple ... [*]

<<A sends email to B>>
     cc C
     cc D
     sentOn Tuesday

In the same modelling way you could add attributes to a PG graph edge 
for sends_email_to.

Both are anti-patterns (as chapter 3 notes).

The email sent is an important concept so model it explicitly:

A   sends       MSG
MSG receivedBy  B
MSG cc_to       C
MSG cc_to       D
MSG sentOn      "Tuesday"

By modelling the email message as a first class concept, not implicit in 
the activity via reification/link attributes, you can better add 
information e.g. which servers it was transferred by and stored on, when 
was it received (this is email - that might be twice) and better query 
it (who else accessed it on receipt).  Modelling those on the act of 
sending is making life hard (how do you talk about a draft email?)

MSG contents        URL_to_content
MSG hasChecksum     0xABCDEF
MSG receivedHeader  "from nm15-vm2.bullet.mail.ne1.yahoo.com ...."

That last one is tricky - one sending of a message can result in 
different receivedHeaders depending on the receiver.

This is event-based modelling, not reification.


If you wanted a highly efficient reification-supporting RDF store, then 
build one.  No need to blindly store as multiple triples (it's called
compression!).  You don't see such stores because reification is a minor 
feature of RDF.  Event-based modelling and named graphs are often better.

	Andy

[*]
<< >> is syntax that I proposed in early SPARQL drafts pre 1.0 for 
reification support but didn't gain much support. It is still in the ARQ 
parser source but not active.

Re: Performance Cost of Reification

Posted by Olivier Rossel <ol...@gmail.com>.

One tricky use case I usually encounter is conditional validity of
relationships.
In that case, the relationship must carry its own set of validity rules.
And reification is ok for that.

On Sun, Oct 11, 2015 at 7:05 PM, Stian Soiland-Reyes <st...@apache.org> wrote:
> Well described. If something requires reification, it usually means either
> your model is off track (not expanding the right concepts), or your graph
> scope is wrong (because you want to say who said that, add confidence
> values, etc.).
>
> In designing PROV-O we ran into this same issue, and added typed
> "qualifications" as the alternative to the reification pattern. E.g.
>
> <a> prov:wasDerivedFrom <b> ;
>     prov:qualifiedDerivation :aDerivation .
>
> :aDerivation a prov:Derivation ;
>   prov:entity <b> ;
>   rdfs:comment "by copy and paste" .
>
> See http://www.w3.org/TR/prov-o/#description-qualified-terms
>
> By making these proper concepts we can relate them to the rest, not feel
> constrained by the "reified" triple, and thus add typing and subclasses.
>
> For instance a prov:Derivation can use prov:hadActivity to link to the
> prov:Activity which made <a>, and even relate to the Activity's qualified
> relation Usage and Generation.
>
> See example at http://www.w3.org/TR/prov-o/#Derivation
>
> Of course, when you have multiple alternative detail levels to choose
> from, users can get confused as to which one to use. In PROV the
> qualifications are related with prov:qualifiedThingie relations, which hints
> at them being secondary in nature rather than first class citizens to
> normally be used alone. In other vocabularies you might find it is the
> shortcuts you want to make secondary.

Re: Performance Cost of Reification

Posted by Stian Soiland-Reyes <st...@apache.org>.

Well described. If something requires reification, it usually means either
your model is off track (not expanding the right concepts), or your graph
scope is wrong (because you want to say who said that, add confidence
values, etc.).

In designing PROV-O we ran into this same issue, and added typed
"qualifications" as the alternative to the reification pattern. E.g.

<a> prov:wasDerivedFrom <b> ;
    prov:qualifiedDerivation :aDerivation .

:aDerivation a prov:Derivation ;
  prov:entity <b> ;
  rdfs:comment "by copy and paste" .

See http://www.w3.org/TR/prov-o/#description-qualified-terms

By making these proper concepts we can relate them to the rest, not feel
constrained by the "reified" triple, and thus add typing and subclasses.

For instance a prov:Derivation can use prov:hadActivity to link to the
prov:Activity which made <a>, and even relate to the Activity's qualified
relation Usage and Generation.

See example at http://www.w3.org/TR/prov-o/#Derivation
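
To show how the qualified pattern can be traversed, here is a minimal Python sketch. It is not Jena or PROV tooling: plain tuples stand in for triples, the identifiers "a" and "b" stand in for <a> and <b>, and the helper derivation_details is a hypothetical name introduced only for illustration.

```python
# Plain tuples stand in for RDF triples (an assumption for illustration).
graph = {
    ("a", "prov:wasDerivedFrom", "b"),
    ("a", "prov:qualifiedDerivation", ":aDerivation"),
    (":aDerivation", "rdf:type", "prov:Derivation"),
    (":aDerivation", "prov:entity", "b"),
    (":aDerivation", "rdfs:comment", "by copy and paste"),
}


def derivation_details(graph, derived, source):
    """Return (predicate, object) pairs of the prov:Derivation node(s)
    that qualify the derivation of `derived` from `source`."""
    # Find qualification nodes attached to `derived` that point at `source`.
    nodes = {
        o
        for s, p, o in graph
        if s == derived
        and p == "prov:qualifiedDerivation"
        and (o, "prov:entity", source) in graph
    }
    # Collect everything said about those nodes.
    return sorted((p, o) for s, p, o in graph if s in nodes)


for predicate, obj in derivation_details(graph, "a", "b"):
    print(predicate, obj)
```

Because :aDerivation is an ordinary node, further statements (for example a prov:hadActivity link) are picked up by the same lookup without changing it.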

Of course, when you have multiple alternative detail levels to choose
from, users can get confused as to which one to use. In PROV the
qualifications are related with prov:qualifiedThingie relations, which hints
at them being secondary in nature rather than first class citizens to
normally be used alone. In other vocabularies you might find it is the
shortcuts you want to make secondary.
On 9 Oct 2015 15:20, "Andy Seaborne" <an...@apache.org> wrote:


RE: Performance Cost of Reification

Posted by Patrick Hoeffel <pa...@issinc.com>.

Very well said, Andy. Thank you for taking the time to re-emphasize the importance of getting the data model right. I really appreciate it.

Patrick


Re: Performance Cost of Reification

Posted by Andy Seaborne <an...@apache.org>.

It is nice that the Titan guys see RDF as something to compare to. 
Coincidentally, I was giving a talk about Property Graph / Linked Data
just recently at the European ApacheCon BigData conference.


The Property Graph (PG) market is maybe x2 the size of the RDF market, 
and both are small.  The challenge is growing the graph market, not one 
form taking market share away from the other.

And the key difference between graph databases (either kind) and other 
data systems is the approach to data modelling.  The differences between 
graph systems are not the key here.

About reification, they are somewhat off-track.  Reification is a quite 
specialised feature for limited use. It is not RDF's equivalent to 
attributes on links in PG.

Let me make that concrete with an example simplified from Graph 
databases / chapter 3 (page 52 in my copy).  The book is written by the
Neo4J folks.

Email provenance.

     A sends_email_to B

Now, you could reify that statement (the act by A of sending the email 
to B).

Reification is way more powerful than just being able to add data to
the triple.  It says "claim: A sends_email_to B"  - several different and
competing claims can be made. But let's continue assuming reification 
and assertion of the triple ... [*]

<<A sends email to B>>
     cc C
     cc D
     sentOn Tuesday

In the same modelling way you could add attributes to a PG graph edge 
for sends_email_to.

Both PG and RDF modelling here are anti-patterns (as chapter 3 notes for 
PG).
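
For reference, here is a minimal Python sketch of what the reified form above amounts to when written out with the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). Plain tuples stand in for triples; the blank node label _:stmt1 and the helper reify are assumptions made for illustration, not anything from Jena.

```python
def reify(statement_id, triple):
    """Return the four triples that describe `triple` as an rdf:Statement."""
    s, p, o = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]


# The asserted triple from the example.
asserted = ("A", "sends_email_to", "B")

# The extra data, attached to the statement node (hypothetical label).
annotations = [
    ("_:stmt1", "cc", "C"),
    ("_:stmt1", "cc", "D"),
    ("_:stmt1", "sentOn", "Tuesday"),
]

# Asserted triple + four reification triples + three annotations.
reified_graph = [asserted] + reify("_:stmt1", asserted) + annotations

print(len(reified_graph))  # 1 + 4 + 3 = 8
```

Under these assumptions the one email costs eight triples, and everything else said about it has to hang off the statement node.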

The email sent is an important concept so model it explicitly:

A   sends       MSG
MSG receivedBy  B
MSG cc_to       C
MSG cc_to       D
MSG sentOn      "Tuesday"

By modelling the email message as a first class concept, not implicit in 
the activity via reification/link attributes, you can better add 
information e.g. which servers it was transferred by and stored on, when 
was it received (this is email - that might be twice) and better query 
it (who else accessed it on receipt).  Modelling those on the act of 
sending is making life hard (how do you talk about a draft email?)

MSG contents        URL_to_content
MSG hasChecksum     0xABCDEF
MSG status          :sent

This is event-based modelling.
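
A minimal Python sketch of querying the explicit-message model follows. It only illustrates the shape of the data: tuples stand in for triples, and the helpers match and cc_recipients are hypothetical names, not part of any RDF library.

```python
# The triples from the example; "MSG" is the explicit message node.
graph = {
    ("A", "sends", "MSG"),
    ("MSG", "receivedBy", "B"),
    ("MSG", "cc_to", "C"),
    ("MSG", "cc_to", "D"),
    ("MSG", "sentOn", "Tuesday"),
    ("MSG", "contents", "URL_to_content"),
    ("MSG", "hasChecksum", "0xABCDEF"),
    ("MSG", "status", ":sent"),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def cc_recipients(graph, sender):
    """Everyone cc'd on any message sent by `sender`, sorted by name."""
    found = set()
    for _, _, msg in match(graph, s=sender, p="sends"):
        for _, _, person in match(graph, s=msg, p="cc_to"):
            found.add(person)
    return sorted(found)


print(cc_recipients(graph, "A"))  # ['C', 'D']
```

The query is a plain two-step walk through the message node, and new facts about the message (servers, access records) are just more triples on MSG.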


If you wanted a highly efficient reification-supporting RDF store, then 
build one.  No need to blindly store as multiple triples (it's called
compression!).  You don't see such stores because reification is a minor 
feature of RDF.  Event-based modelling and named graphs are often better.
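
One way to read "no need to blindly store as multiple triples" is sketched below in Python, purely as an assumption about how such a store might work (it does not describe Jena or any existing store): keep one record per reified statement and expand it into the four reification triples only when asked. The class name ReifiedStatement and its expand method are hypothetical.

```python
from typing import Iterator, NamedTuple, Tuple

Triple = Tuple[str, str, str]


class ReifiedStatement(NamedTuple):
    """One stored record standing in for the four reification triples."""

    node: str  # the statement node, e.g. a blank node label
    subj: str
    pred: str
    obj: str

    def expand(self) -> Iterator[Triple]:
        """Materialise the standard reification triples on demand."""
        yield (self.node, "rdf:type", "rdf:Statement")
        yield (self.node, "rdf:subject", self.subj)
        yield (self.node, "rdf:predicate", self.pred)
        yield (self.node, "rdf:object", self.obj)


# One record is stored; the triple view is produced only when needed.
record = ReifiedStatement("_:stmt1", "A", "sends_email_to", "B")
triples = list(record.expand())

print(len(triples))  # 4
```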

     Andy

[*]
<< >> is syntax that I proposed in early SPARQL drafts pre 1.0 for 
reification support but didn't gain much support. It is still in the ARQ 
parser source but not active.

Re: Performance Cost of Reification

Posted by Milorad Tosic <mb...@yahoo.com.INVALID>.

Recently, there was a short discussion about different ways to approach reification in practice. I found a reference [1] mentioned in that discussion quite useful.
Regards, Milorad
[1] Loris Bozzato and Luciano Serafini, Knowledge Propagation in Contextualized Knowledge Repositories: an Experimental Evaluation
