Posted to general@gump.apache.org by "Adam R. B. Jack" <aj...@apache.org> on 2004/08/27 19:51:54 UTC

RDF 102 s.v.p...

Ok, so I used RDFLIB (at least on M$, see http://neukadye.chalko.com/archive/000015.html) to allow Gump to generate some RDF. The RDF is generated into files (serialized to XML) in the following fashion:


   /gump.rdf --- all projects
   /module1/project1/gump.rdf --- project1 RDF.
   ...

Basically this is like the RSS and Atom feeds that Gump puts out, except these also have data at the module level (covering all projects within a module). I figured that folks might sometimes want specific information, and sometimes want it all (to feed into some store).

I started out pretty simply: Gump defines some classes (Project, Repository) and some properties (e.g. name) and then makes some statements (Project:X depends upon Project:Y, Project:X resides within Repo:Z). Nothing complicated, but a start.
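
In case it helps picture it, the generation is roughly along these lines (a minimal sketch against the current rdflib API; the gump namespace URI and the exact class/property names are illustrative, not necessarily what ends up in the generated files):

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    # Illustrative namespace; see 2.1 below -- nothing is settled yet.
    GUMP = Namespace("http://gump.apache.org/schemas/main/1.0/")

    g = Graph()
    g.bind("gump", GUMP)

    cocoon = URIRef("http://gump.apache.org/data/project/cocoon")
    avalon = URIRef("http://gump.apache.org/data/project/avalon")
    repo   = URIRef("http://gump.apache.org/data/repository/asf-cvs")

    # Classes and a simple property (name)
    g.add((cocoon, RDF.type, GUMP.Project))
    g.add((avalon, RDF.type, GUMP.Project))
    g.add((repo,   RDF.type, GUMP.Repository))
    g.add((cocoon, GUMP.name, Literal("cocoon")))

    # Statements: X dependsOn Y, X residesIn Z
    g.add((cocoon, GUMP.dependsOn, avalon))
    g.add((cocoon, GUMP.residesIn, repo))

    # Serialize to XML, one file per project (plus one with everything)
    g.serialize(destination="gump.rdf", format="xml")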

Even this small foray raised some questions for me, and I'd like more input:

Some areas to look into:

1) Design Decisions/Questions:

1.1) Ought we define the URI for a project (or other entity) to point to the standalone RDF for that entity? I'm sure there is no need to, but it might allow tools to discover upon demand.

1.2) What if there are two sources of RDF triples about an entity? Say we have facts in a standalone document, and in a shared one (or in a triple store)? Are triples merged? What if they clash with each other? [e.g. one source says X dependsOn Y, but another says Y dependsOn X or something contradictory?]

1.3) How do we define a URI to represent a long lived (yet varying) entity? Ought we (say) include the version of Cocoon in the URI, so we know facts about that release/state, or do we just say Cocoon? If Cocoon dependsOn Avalon today, but not tomorrow, what happens to the Cocoon dependsOn Avalon triple? Is it wrong? Expired?

2) Ongoing investigations:

2.1) I think we wish to define a Gump Ontology at 'http://gump.apache.org/schemas/main/1.0/'? I am still a little confused by OWL and/or RDFS, and I know there is no immediate need to hurry. I guess I feel that without an Ontology we are speaking a language foreign to everybody, but that is OK while we learn to speak. That said, how do we go about refining this? Just set it out there and tinker?
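
For the sake of argument, the skeleton of such an ontology can be stated in a handful of RDFS triples (a sketch only, reusing the illustrative class/property names from above; whether we want RDFS, OWL, or both is exactly the open question):

    from rdflib import Graph, Namespace, RDF, RDFS

    GUMP = Namespace("http://gump.apache.org/schemas/main/1.0/")

    schema = Graph()
    schema.bind("gump", GUMP)

    # Classes
    schema.add((GUMP.Project,    RDF.type, RDFS.Class))
    schema.add((GUMP.Repository, RDF.type, RDFS.Class))

    # Properties, with domain/range hints
    for prop, rng in [(GUMP.dependsOn, GUMP.Project),
                      (GUMP.residesIn, GUMP.Repository)]:
        schema.add((prop, RDF.type, RDF.Property))
        schema.add((prop, RDFS.domain, GUMP.Project))
        schema.add((prop, RDFS.range,  rng))

    schema.serialize(destination="gump-schema.rdf", format="xml")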

2.2) I think we wish to map the Gump Ontology to DOAP and others (even parts of FOAF). How would we do that, and how would we test/exercise it?

2.3) Ought we consider (over time) an ASF-wide Ontology, perhaps defining TLPs/other communities, and having Gump state triples for this project memberOf this community. [We tend to figure out communities from the repository, e.g. cvs.sf.net or ...]

3) Usages:

3.1) I was hoping to work on PSP to do queries into the RDBMS. This is primarily for historical information, but I was thinking about using it for dependency information also. The more I think about the RDF information, and triple queries, it seems an RDF store might be a better place to hold/maintain and query. This information seems RDF-ish, not RDBMS-ish.

3.2) What other 'users' of this descriptor information seem viable? Ought tools (e.g. Depot) want to figure things out from it? Others?


  -----------------------------------------------------------------------

BTW: Feedback/thoughts welcomed on:

    http://brutus.apache.org/gump/test/gump.rdf
    http://brutus.apache.org/gump/test/ant/ant/gump.rdf

regards,

Adam
--
Have you Gump'ed your code today?
http://gump.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
For additional commands, e-mail: general-help@gump.apache.org


Re: RDF 102 s.v.p...

Posted by Stefano Mazzocchi <st...@apache.org>.
Adam R. B. Jack wrote:

>>>1.1) Ought we define the URI for a project (or other entity) to point to
>>>the standalone RDF for that entity? I'm sure there is no need to, but it
>>>might allow tools to discover upon demand.
>>
>>This would be a URL and my suggestion would be something like
>>
>>http://gump.apache.org/data/path/project/20040827
> 
> 
> Hmm. I wonder if we ought to have something like a 'timeless' URI of:
> 
>     http://apache.org/project/${project}
> 
> ... relying upon the organization to manage its project names, and on them
> (most likely) not being re-used over time.

yes, we could do that, but it's not that you gain much. Those dates need 
not be precise; just the year of the project creation would suffice.

keep in mind that that is a URI, not a URL referring to a model. This is 
the identifier of the project; it could well be "urn:apache.org:23" for 
all we know, and it does not contain anything by design.

Several people in the semweb community (Dirk included), in fact, 
promote the use of URNs instead of http-URIs because they allow more 
transparent persistence... but it's a long debate and it's not that 
useful here.

> and then:
> 
>   http://gump.apache.org/data/path/project/${project}/20040827
> 
> to refer to the 'make-up' of that project on that day? We'd have a triple to
> assert that this URI related to the top (fixed) one, and carries information
> for it.

Well, that's how I would have done it anyway: gump information is 
transitory and should not be in the same model as the project 
information, which is much less so.

I see three layers:

  1) the project's own metadata (changes very slowly)
  2) the project's dependency data (changes now and then)
  3) gump-originated project metadata (changes potentially at every gump run)

the three things should be grouped in 3 different models, then 
aggregated when needed. All of them, IMO, should have URIs that are 
either numeric or date-based.

> I don't think there can be a magic bullet for solving changes over time, but
> this seems like one approach that might (at least) hint at time sensitivity.
> 
> I would really like to see version information introduced (what version of
> the project is it [i.e. what is HEAD to become when released], and perhaps
> what version of metadata is there). Change detection is something I think is
> of interest here (i.e. when was dependency X added), so somehow I'd like to be
> able to determine that from this information. Hmm, I wonder if changes are
> really part of the information we wish to be publishing, e.g. versionX
> addedDependency Y.
> 
> BTW: what is the purpose/value of data/path in the URI above?

path was supposed to be the TLP in case you have subprojects (like the 
jakarta stuff), even if it's very unlikely that the ASF will allow 
projects to have the same name and be hosted in different TLPs, so we 
could get rid of that.

data was supposed to make it easier to use mod_rewrite for that URL 
subspace; it could well be "ns", but this is not really a namespace.

>>>If Cocoon
>>>dependsOn Avalon today, but not tomorrow, what happens to the Cocoon
>>>dependsOn Avalon triple? Is it wrong? Expired?
>>
>>This is where it starts to get very tricky.
> 
> Yup, I hear that. I want something stable and simple, some way for a store
> to extract Gump-produced project information (once a day, whenever) and make
> some good current and historical determinations from it. I don't think we
> can expect masses of data to be stored semi-indefinitely, so perhaps triples
> about deltas are a way to compress the redundancy.

Don't! Premature optimization. Just publish all the data you have in a 
way that is consistent and persistent over time; the users making use of 
that data will do the rest (we can even host an "RDQL" web service 
on top of that data in the future).

>>One way of doing it is by encoding "provenance". One way of doing that is
>>to add further statements about the statements using "reification".
>>Reification is the act of using a statement as the subject of another
>>statement. Basically, when you have a statement like
>>
>>  "Cocoon dependsOn Avalon"
>>
>>you can also say
>>
>>  ["Cocoon dependsOn Avalon"] wasAsserted 20040827
>>  ["Cocoon dependsOn Avalon"] wasAssertedBy <uri>
> 
> Does this assert two things at once, or can one reference an assertion by an
> ID or something?
 >
> I just don't feel comfortable with this approach, although maybe it is nice
> and simple. It just seems so incredibly verbose.

yep, that's why everybody thinks it's really elegant but nobody uses it ;-)

>>Dirk's group uses another method, basically encoding provenance directly
>>inside the statement (these are called 'quads' instead of 'triples'); this
>>is a non-recommended method and it's not as flexible as reification, but
>>it's a *lot* more efficient. Their quad-based RDFStore is open source
>>(and very fast, I hear) but there are no bindings in python (as of now).
> 
> Interesting. I do suspect some form of versioning/timestamping of facts to
> be in order. That said, maybe also 'who told me this' (so you can judge how
> well you trust it). Hmm, I wonder if triples just need attributes...

eheh, the "provenance" thing will be huge when the W3C attempts to tacke 
the 'trust' issue, which they don't want to just yet, so I suggest we 
don't even go there for Gump ;-)

>>How to solve this?
>>
>>Well, I would just create a new model every time, just loading the last
>>statements. For example, you can have a URL such as:
>>
>>http://gump.apache.org/data/path/project/20040827
>>
>>that gives you the /path/project of today or
>>
>>http://gump.apache.org/data/path/project
>>
>>that gives you the "latest" one.
> 
> 
> So similar to what I suggested, where the non-dated URI was the project
> entity, and the dated one was a view of it. Is 'latest' -- a moving concept -- a
> risky proposition? Yesterday's latest is today's history, so a triple might
> fail to be true as time passes.

I really don't know what to say here. If the web architectural group 
can't agree on what a URI means, it's going to be hard for us to do it.

Also, the RDF data access WG is working on web services that allow you 
to access the RDF data that you want (rather than just harvesting 
everything and doing it yourself) [take a look at "joseki" 
http://www.joseki.org/ for an example of what I mean]

So, "latest" might well be just 'you know what day it is, so just ask 
for that one"

>>>2.2) I think we wish to map the Gump Ontology to DOAP and others (even
>>>parts of FOAF). How would we do that
>>
>>with some OWL ontologies.
>>
> 
> I want to try to play nice with DOAP. I want us to be flexible (a
> prototypical approach so we can flesh out time issues, etc.) so I don't want
> to be bound to DOAP, but I'd like to benefit from their endeavours. Can
> anybody help with such a mapping?

Just don't worry about it, focus on your stuff first, the mappings will 
come later.

>>>3) Usages:
>>>
>>>3.1) I was hoping to work on PSP to do queries into the RDBMS. This is
>>>primarily for historical information, but I was thinking about using it
>>>for dependency information also. The more I think about the RDF
>>>information, and triple queries, it seems an RDF store might be a better
>>>place to hold/maintain and query. This information seems RDF-ish, not
>>>RDBMS-ish.
>>
>>Agreed. I would use a triple store with an RDQL query engine (Redland
>>has such a thing and has Python hooks)
> 
> 
> I might try the Jena (Java) version that Sam referenced. I think it is good
> to use Python inside Gump, but to let RDF (serialized to XML) cleanly
> decouple the monitoring/consuming tools.

Our group uses Jena and it's very well written.

> Would we want to host a triple store on brutus and allow applications to
> access it? Or, would we want to publish RDF in XML and allow remote clients
> to download?

We could do both: first we publish, then we can aggregate the thing 
ourselves and serve an RDQL web service for people to run queries against... 
but again, this is a subsequent step, so don't worry about it for now.

>>>3.2) What other 'users' of this descriptor information seem viable?
>>>Ought tools (e.g. Depot) want to figure things out from it? Others?
>>
>>Once the RDF infrastructure is in place, one of my goals is to add
>>"legal" metadata to the project and create an inferencing layer that
>>indicates whether or not a project is *legal* depending on the
>>combination of the licenses.
> 
> 
> Awesome, I love that idea. Ought we add a type attribute, <license
> type="ASF2.0"/> (or whatever), to Gump's XML-based metadata?

yep, that's the plan, but it should have a URI identifying the license, 
like the RDF version of creative commons.

> Me, I'm primarily interested in version compatibility (what led me to Depot
> [http://incubator.apache.org/depot/version/] in the first place). I'd like
> us to be able to query this knowledge base to determine what products can
> co-exist, at what levels, and so forth.
> That, and recursive downloads from a repository.
> 
> Other thoughts?

oh, ok. that's an interesting requirement.

my suggestion is that we try to make gump work and publish that data 
first, then we find out what to do with it.

-- 
Stefano.


Re: RDF 102 s.v.p...

Posted by "Adam R. B. Jack" <aj...@apache.org>.
> > 1.1) Ought we define the URI for a project (or other entity) to point to
> > the standalone RDF for that entity? I'm sure there is no need to, but it
> > might allow tools to discover upon demand.
>
> This would be a URL and my suggestion would be something like
>
> http://gump.apache.org/data/path/project/20040827

Hmm. I wonder if we ought to have something like a 'timeless' URI of:

    http://apache.org/project/${project}

... relying upon the organization to manage its project names, and on them
(most likely) not being re-used over time.

and then:

  http://gump.apache.org/data/path/project/${project}/20040827

to refer to the 'make-up' of that project on that day? We'd have a triple to
assert that this URI related to the top (fixed) one, and carries information
for it.

I don't think there can be a magic bullet for solving changes over time, but
this seems like one approach that might (at least) hint at time sensitivity.

I would really like to see version information introduced (what version of
the project is it [i.e. what is HEAD to become when released], and perhaps
what version of metadata is there). Change detection is something I think is
of interest here (i.e. when was dependency X added), so somehow I'd like to be
able to determine that from this information. Hmm, I wonder if changes are
really part of the information we wish to be publishing, e.g. versionX
addedDependency Y.

BTW: what is the purpose/value of data/path in the URI above?

>
> > 1.3) How do we define a URI to represent a long lived (yet varying)
> > entity?
>
> eheh, great question ;-)
>
> > Ought we (say) include the version of Cocoon in the URI, so we
> > know facts about that release/state, or do we just say Cocoon?
>
> I'm a big fan of numerical URIs for long-term persisting things. The
> less implicit semantics in the URI, the higher the chance of surviving
> changes without requiring the URI to change.

I see .

> > If Cocoon
> > dependsOn Avalon today, but not tomorrow, what happens to the Cocoon
> > dependsOn Avalon triple? Is it wrong? Expired?
>
> This is where it starts to get very tricky.

Yup, I hear that. I want something stable and simple, some way for a store
to extract Gump-produced project information (once a day, whenever) and make
some good current and historical determinations from it. I don't think we
can expect masses of data to be stored semi-indefinitely, so perhaps triples
about deltas are a way to compress the redundancy.

> One way of doing it is by encoding "provenance". One way of doing that is
> to add further statements about the statements using "reification".
> Reification is the act of using a statement as the subject of another
> statement. Basically, when you have a statement like
>
>   "Cocoon dependsOn Avalon"
>
> you can also say
>
>   ["Cocoon dependsOn Avalon"] wasAsserted 20040827
>   ["Cocoon dependsOn Avalon"] wasAssertedBy <uri>

Does this assert two things at once, or can one reference an assertion by an
ID or something?

I just don't feel comfortable with this approach, although maybe it is nice
and simple. It just seems so incredibly verbose.

> Dirk's group uses another method, basically encoding provenance directly
> inside the statement (these are called 'quads' instead of 'triples'); this
> is a non-recommended method and it's not as flexible as reification, but
> it's a *lot* more efficient. Their quad-based RDFStore is open source
> (and very fast, I hear) but there are no bindings in python (as of now).

Interesting. I do suspect some form of versioning/timestamping of facts to
be in order. That said, maybe also 'who told me this' (so you can judge how
well you trust it). Hmm, I wonder if triples just need attributes...

> How to solve this?
>
> Well, I would just create a new model every time, just loading the last
> statements. For example, you can have a URL such as:
>
> http://gump.apache.org/data/path/project/20040827
>
> that gives you the /path/project of today or
>
> http://gump.apache.org/data/path/project
>
> that gives you the "latest" one.

So similar to what I suggested, where the non-dated URI was the project
entity, and the dated one was a view of it. Is 'latest' -- a moving concept -- a
risky proposition? Yesterday's latest is today's history, so a triple might
fail to be true as time passes.


> > 2.2) I think we wish to map the Gump Ontology to DOAP and others (even
> > parts of FOAF). How would we do that
>
> with some OWL ontologies.
>

I want to try to play nice with DOAP. I want us to be flexible (a
prototypical approach so we can flesh out time issues, etc.) so I don't want
to be bound to DOAP, but I'd like to benefit from their endeavours. Can
anybody help with such a mapping?

> >
> > 3) Usages:
> >
> > 3.1) I was hoping to work on PSP to do queries into the RDBMS. This is
> > primarily for historical information, but I was thinking about using it
> > for dependency information also. The more I think about the RDF
> > information, and triple queries, it seems an RDF store might be a better
> > place to hold/maintain and query. This information seems RDF-ish, not
> > RDBMS-ish.
>
> Agreed. I would use a triple store with an RDQL query engine (Redland
> has such a thing and has Python hooks)

I might try the Jena (Java) version that Sam referenced. I think it is good
to use Python inside Gump, but to let RDF (serialized to XML) cleanly
decouple the monitoring/consuming tools.

Would we want to host a triple store on brutus and allow applications to
access it? Or, would we want to publish RDF in XML and allow remote clients
to download?

> > 3.2) What other 'users' of this descriptor information seem viable?
> > Ought tools (e.g. Depot) want to figure things out from it? Others?
>
> Once the RDF infrastructure is in place, one of my goals is to add
> "legal" metadata to the project and create an inferencing layer that
> indicates whether or not a project is *legal* depending on the
> combination of the licenses.

Awesome, I love that idea. Ought we add a type attribute, <license
type="ASF2.0"/> (or whatever), to Gump's XML-based metadata?

Me, I'm primarily interested in version compatibility (what led me to Depot
[http://incubator.apache.org/depot/version/] in the first place). I'd like
us to be able to query this knowledge base to determine what products can
co-exist, at what levels, and so forth.
That, and recursive downloads from a repository.

Other thoughts?

regards,

Adam


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
For additional commands, e-mail: general-help@gump.apache.org


Re: RDF 102 s.v.p...

Posted by Stefano Mazzocchi <st...@apache.org>.
Adam R. B. Jack wrote:

> Ok, so I used RDFLIB (at least on M$, see 
> http://neukadye.chalko.com/archive/000015.html) to allow Gump to 
> generate some RDF. The RDF is generated into files (serialized to XML) 
> in the following fashion:
> 
> 
>   /gump.rdf --- all projects
>   /module1/project1/gump.rdf --- project1 RDF.
>   ...
> 
> Basically this is like the RSS and Atom feeds that Gump puts out, 
> except these also have data at the module level (covering all projects 
> within a module). I figured that folks might sometimes want specific 
> information, and sometimes want it all (to feed into some store).

Kewl.

> I started out pretty simply, Gump defines some classes (Project, 
> Repository) and some properties (e.g. name) and then makes some 
> statements (Project:X depends upon Project:Y, Project:X resides within 
> Repo:Z). Nothing complicated, but a start.

nice.

> Even this small foray raised some questions for me, and I'd like 
> more input:

<semantic-web-hat mode="on">

Copying Dirk since he's a semweb fan as much as I am.

> Some areas to look into:
> 
> 1) Design Decisions/Questions:
> 
> 1.1) Ought we define the URI for a project (or other entity) to point to 
> the standalone RDF for that entity? I'm sure there is no need to, but it 
> might allow tools to discover upon demand.

This would be a URL and my suggestion would be something like

http://gump.apache.org/data/path/project/20040827

> 1.2) What if there are two sources of RDF triples about an entity? Say 
> we have facts in a standalone document, and in a shared one (or in a 
> triple store)? Are triples merged? 

Yes, that's the beauty of the RDF model: you can have statements coming 
from different sources, and they get aggregated.
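
(With rdflib that aggregation is literally just parsing several sources into the same graph; a sketch, using the two test URLs from Adam's mail:)

    from rdflib import Graph

    g = Graph()
    # Each source contributes statements; duplicates collapse, the rest merge.
    g.parse("http://brutus.apache.org/gump/test/gump.rdf", format="xml")
    g.parse("http://brutus.apache.org/gump/test/ant/ant/gump.rdf", format="xml")

    print(len(g), "triples after merging")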

> What if they clash with each other? 
> [e.g. one source says X dependsOn Y, but another says Y dependsOn X or 
> something contradictory?]

the concept of "contradiction" does not exist at the RDF level, but in the 
semantic interpretation of it.

for example, it is totally possible to have two statements "Paul 
isFatherOf Tom" and "Tom isFatherOf Paul", but the fact that these 
create a contradiction is given by the fact that "being father of" is 
*not* a symmetric property.

So, you need three statements (the two above + "isFatherOf isNot 
symmetric") to come up with a contradiction.

now, how the reasoning engines deal with contradictions is again 
debatable. Another example:

  1) Stefano's Car hasColor Red
  2) hasColor hasCardinality one
  3) Stefano's Car hasColor Blue

reasoning engines will deduce that

  4) Blue isTheSameAs Red

Which is why distributed OWL doesn't really make me scream for elegance.

Anyway, the point is: RDF is a data model; the reasoner is a program 
that works on the data model. We can write our own reasoning rules that 
trigger errors in case statements are contradictory, and OWL might not 
really help us there because of the "open world" assumption.
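
Such a home-grown rule can be a plain walk over the merged graph; a sketch (the gump:dependsOn property is the illustrative one from earlier in the thread, and this only flags mutual dependencies, nothing OWL-ish):

    from rdflib import Graph, Namespace

    GUMP = Namespace("http://gump.apache.org/schemas/main/1.0/")

    def mutual_dependencies(g: Graph):
        """Flag X dependsOn Y asserted together with Y dependsOn X."""
        seen = set()
        for x, y in g.subject_objects(GUMP.dependsOn):
            if (y, GUMP.dependsOn, x) in g and (y, x) not in seen:
                seen.add((x, y))
        return sorted(seen)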

> 1.3) How do we define a URI to represent a long lived (yet varying) 
> entity? 

eheh, great question ;-)

> Ought we (say) include the version of Cocoon in the URI, so we 
> know facts about that release/state, or do we just say Cocoon? 

I'm a big fan of numerical URIs for long-term persisting things. The 
less implicit semantics in the URI, the higher the chance of surviving 
changes without requiring the URI to change.

> If Cocoon 
> dependsOn Avalon today, but not tomorrow, what happens to the Cocoon 
> dependsOn Avalon triple? Is it wrong? Expired?

This is where it starts to get very tricky.

One way of doing it is by encoding "provenance". One way of doing that is 
to add further statements about the statements using "reification". 
Reification is the act of using a statement as the subject of another 
statement. Basically, when you have a statement like

  "Cocoon dependsOn Avalon"

you can also say

  ["Cocoon dependsOn Avalon"] wasAsserted 20040827
  ["Cocoon dependsOn Avalon"] wasAssertedBy <uri>

which allows you to "infer" the current state of the model by 
considering only the statements that were asserted last.
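
Spelled out (e.g. with rdflib), that single dependency becomes an rdf:Statement resource plus the extra assertions about it; a sketch, using the wasAsserted/wasAssertedBy names from just above (still made-up properties, not a settled vocabulary):

    from rdflib import BNode, Graph, Literal, Namespace, RDF, URIRef

    GUMP = Namespace("http://gump.apache.org/schemas/main/1.0/")
    cocoon = URIRef("http://gump.apache.org/data/project/cocoon")
    avalon = URIRef("http://gump.apache.org/data/project/avalon")

    g = Graph()
    stmt = BNode()                       # the statement itself, as a resource
    g.add((stmt, RDF.type,      RDF.Statement))
    g.add((stmt, RDF.subject,   cocoon))
    g.add((stmt, RDF.predicate, GUMP.dependsOn))
    g.add((stmt, RDF.object,    avalon))
    # ... and then statements about that statement:
    g.add((stmt, GUMP.wasAsserted,   Literal("20040827")))
    g.add((stmt, GUMP.wasAssertedBy, URIRef("http://gump.apache.org/")))

That is four triples of bookkeeping before the provenance itself, which is where the verbosity comes from.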

Dirk's group uses another method, basically encoding provenance directly 
inside the statement (these are called 'quads' instead of 'triples'); this 
is a non-recommended method and it's not as flexible as reification, but 
it's a *lot* more efficient. Their quad-based RDFStore is open source 
(and very fast, I hear) but there are no bindings in python (as of now).

How to solve this?

Well, I would just create a new model every time, just loading the last 
statements. For example, you can have a URL such as:

http://gump.apache.org/data/path/project/20040827

that gives you the /path/project of today or

http://gump.apache.org/data/path/project

that gives you the "latest" one.

> 2) Ongoing investigations:
> 
> 2.1) I think we wish to define a Gump Ontology at 
> 'http://gump.apache.org/schemas/main/1.0/'? I am still a little confused 
> by OWL and/or RDFS, and I know there is no immediate  need to hurry. I 
> guess I feel without an Ontology we are speaking a language foreign to 
> everybody, but that is ok as we learn to speak. That said, how do we go 
> about refining this? Just set it out there and tinker?

I would not worry about this for now, just like you don't need an 
XMLSchema to write some well-formed XML.

> 2.2) I think we wish to map the Gump Ontology to DOAP and others (even 
> parts of FOAF). How would we do that

with some OWL ontologies.

> and how would we test/exercise it?

you don't, you just publish your data in the best way possible and see 
what happens ;-)

> 2.3) Ought we consider (over time) an ASF-wide Ontology, perhaps 
> defining TLPs/other communities, and having Gump state triples for this 
> project memberOf this community. [We tend to figure out communities from 
> the repository, e.g. cvs.sf.net or ...]

Adam, keep focus: one thing at a time ;-)

> 
> 3) Usages:
> 
> 3.1) I was hoping to work on PSP to do queries into the RDBMS. This is 
> primarily for historical information, but I was thinking about using it 
> for dependency information also. The more I think about the RDF 
> information, and triple queries, it seems an RDF store might be a better 
> place to hold/maintain and query. This information seems RDF-ish, not 
> RDBMS-ish.

Agreed. I would use a triple store with an RDQL query engine (Redland 
has such a thing and has Python hooks)
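
For flavour, the same sort of question asked against a plain rdflib graph (a sketch; rdflib's query language is SPARQL rather than RDQL, and the gump: names are the illustrative ones from earlier in the thread):

    from rdflib import Graph

    g = Graph()
    g.parse("http://brutus.apache.org/gump/test/gump.rdf", format="xml")

    q = """
    PREFIX gump: <http://gump.apache.org/schemas/main/1.0/>
    SELECT ?project ?dep
    WHERE { ?project gump:dependsOn ?dep . }
    """
    for row in g.query(q):
        print(row.project, "depends on", row.dep)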

> 3.2) What other 'users' of this descriptor information seem viable? 
> Ought tools (e.g. Depot) want to figure things out from it? Others?

Once the RDF infrastructure is in place, one of my goals is to add 
"legal" metadata to the project and create an inferencing layer that 
indicates whether or not a project is *legal* depending on the 
combination of the licenses.
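
A deliberately naive version of that layer, just to show the shape of it (gump:hasLicense and the compatibility table below are hypothetical; real licence reasoning would be far subtler):

    from rdflib import Graph, Namespace, URIRef

    GUMP = Namespace("http://gump.apache.org/schemas/main/1.0/")

    # Hand-maintained, illustrative-only (project licence, dependency licence) pairs.
    COMPATIBLE = {
        ("ASF2.0", "ASF2.0"),
        ("ASF2.0", "BSD"),
        # ("ASF2.0", "GPL") deliberately absent
    }

    def licence_of(g: Graph, thing: URIRef):
        return g.value(thing, GUMP.hasLicense)

    def illegal_dependencies(g: Graph):
        """Yield (project, dependency) pairs whose licence combination is not whitelisted."""
        for project, dep in g.subject_objects(GUMP.dependsOn):
            pair = (str(licence_of(g, project)), str(licence_of(g, dep)))
            if pair not in COMPATIBLE:
                yield project, dep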

-- 
Stefano.