Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2015/10/09 15:44:43 UTC

Re: Performance Cost of Reification

It is nice that the Titan guys see RDF as something to compare to. 
Coincidentally, I was giving a talk about Property Graph / Linked Data 
just recently at the European ApacheCon BigData conference.


The Property Graph (PG) market is maybe x2 the size of the RDF market, 
and both are small.  The challenge is growing the graph market, not one 
form taking market share away from the other.

And the key difference between graph databases and other data systems is 
modelling.  The differences between graph systems are not the key here.

About reification, they are somewhat off track.  Reification is a quite 
specialised feature for limited use. It is not RDF's equivalent to 
attributes on links in PG.

Let me make that concrete with an example simplified from "Graph 
Databases", chapter 3 (page 52 in my copy).  The book is written by the 
Neo4j folks.

Email provenance.

     A sends_email_to B

Now, you could reify that statement (the act by A of sending the email 
to B).

Reification is way more powerful than just being able to add data to 
the triple.  It says "claim: A sends_email_to B"  - several different and 
competing claims can be made. But let's continue assuming reification 
and assertion of the triple ... [*]

<<A sends_email_to B>>
     cc C
     cc D
     sentOn Tuesday
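
For reference, a rough sketch of what that shorthand stands for when written out with the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). Python is used only for illustration: terms are plain strings rather than real IRIs, and the `reify` helper and the `_:stmt1` label are made-up names, not anything from Jena.

```python
# Illustrative only: terms are plain strings, not real IRIs.

def reify(stmt_id, s, p, o):
    """Return the four triples that describe one reified statement."""
    return [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
    ]

# The asserted triple itself.
triples = [("A", "sends_email_to", "B")]

# The reification of that triple (hypothetical blank node label).
triples += reify("_:stmt1", "A", "sends_email_to", "B")

# The extra data hung off the reified statement.
triples += [
    ("_:stmt1", "cc", "C"),
    ("_:stmt1", "cc", "D"),
    ("_:stmt1", "sentOn", "Tuesday"),
]

for t in triples:
    print(t)
```

So one annotated link becomes the asserted triple, four reification triples and one triple per annotation when stored naively.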

In the same modelling way you could add attributes to a PG graph edge 
for sends_email_to.

Both are anti-patterns (as chapter 3 notes).

The email sent is an important concept so model it explicitly:

A   sends       MSG
MSG receivedBy  B
MSG cc_to       C
MSG cc_to       D
MSG sentOn      "Tuesday"

By modelling the email message as a first class concept, not implicit in 
the activity via reification/link attributes, you can better add 
information e.g. which servers it was transferred by and stored on, when 
it was received (this is email - that might be twice) and better query 
it (who else accessed it on receipt).  Modelling those on the act of 
sending is making life hard (how do you talk about a draft email?).

MSG contents        URL_to_content
MSG hasChecksum     0xABCDEF
MSG receivedHeader  "from nm15-vm2.bullet.mail.ne1.yahoo.com ...."

That last one is tricky - one sending of a message can result in 
different receivedHeaders depending on the receiver.
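
A minimal sketch of the explicit message model, again with plain Python tuples standing in for a graph. The `match` and `recipients` helpers and the `DRAFT` node are assumptions made for illustration only; a real system would use a triple store and SPARQL.

```python
# Illustrative only: a list of (subject, predicate, object) tuples as a graph.
graph = [
    ("A",   "sends",      "MSG"),
    ("MSG", "receivedBy", "B"),
    ("MSG", "cc_to",      "C"),
    ("MSG", "cc_to",      "D"),
    ("MSG", "sentOn",     "Tuesday"),
]

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def recipients(graph, msg):
    """Everyone the message went to, directly or by cc."""
    return sorted(
        o for _, p, o in match(graph, s=msg)
        if p in ("receivedBy", "cc_to")
    )

# A draft is simply a message node that nobody has sent yet.
graph.append(("DRAFT", "cc_to", "C"))
sent = {o for _, _, o in match(graph, p="sends")}
unsent = sorted({s for s, _, _ in match(graph, p="cc_to")} - sent)

print(recipients(graph, "MSG"))
print(unsent)
```

Because the message is a node in its own right, the draft can be described without any act of sending existing at all.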

This is event-based modelling, not reification.


If you wanted a highly efficient reification-supporting RDF store, then 
build one.  No need to blindly store as multiple triples (it's called 
compression!).  You don't see such stores because reification is a minor 
feature of RDF.  Event-based modelling and named graphs are often better.
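
To illustrate the compression point, a toy sketch in Python: keep one compact record per reified statement and generate the reification triples only when something asks for them. `ReifiedStore` is a hypothetical class for illustration, not an existing Jena or ARQ API.

```python
class ReifiedStore:
    """Hypothetical store: one record per reified statement, expanded lazily."""

    def __init__(self):
        # statement id -> (subject, predicate, object)
        self._statements = {}

    def add(self, stmt_id, s, p, o):
        """Store the reified statement as a single compact record."""
        self._statements[stmt_id] = (s, p, o)

    def triples(self):
        """Yield the four reification triples for each stored record."""
        for stmt_id, (s, p, o) in self._statements.items():
            yield (stmt_id, "rdf:type", "rdf:Statement")
            yield (stmt_id, "rdf:subject", s)
            yield (stmt_id, "rdf:predicate", p)
            yield (stmt_id, "rdf:object", o)

store = ReifiedStore()
store.add("_:stmt1", "A", "sends_email_to", "B")
expanded = list(store.triples())
print(len(expanded))
```

The triple view is unchanged for anyone querying it, but the storage cost is one record rather than four triples.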

	Andy

[*]
<< >> is syntax that I proposed in early SPARQL drafts pre 1.0 for 
reification support but it didn't gain much support. It is still in the ARQ 
parser source but not active.


Re: Performance Cost of Reification

Posted by Andy Seaborne <an...@apache.org>.
(bother - wrong list - resent to users@)



Re: Performance Cost of Reification

Posted by Paul Houle <on...@gmail.com>.
These days I am a big fan of RDF* and SPARQL*, which unify RDF with the
property graph model.  On the other hand I used to hate blank nodes but I
learned to stop worrying and love them.  I am hoping anyway that Neo4j and
its ilk become a gateway drug to the RDF world.






-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/

Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275