Posted to general@gump.apache.org by Leo Simons <ma...@leosimons.com> on 2005/04/16 20:09:05 UTC

RDF (was: [RT] module, project, target = repository, module, project...)

On 16-04-2005 18:30, "Stefano Mazzocchi" <st...@apache.org> wrote:
> The more I think about it, the more I think that having our data in RDF
> would be a tremendous win, also in terms of programming.

Show me!



---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
For additional commands, e-mail: general-help@gump.apache.org


Re: RDF

Posted by Stefano Mazzocchi <st...@apache.org>.
Leo Simons wrote:
> On 17-04-2005 00:53, "Stefano Mazzocchi" <st...@apache.org> wrote:
> 
>>For example, if you have the Module object and the Project object, you have
>>to decide which way the link goes, and the notion of "Module.projects"
>>means "this is the list of projects this module contains".
>>
>>Problem is that this implicit modeling forces you to decide the
>>direction of the link, and, in case you want both, you have to model
>>this explicitly and, at update time, you need to know where to change.
>>
>>In RDF, you don't have to do all that.
> 
> 
> Exactly! If you want a bi-directional link you have to model it explicitly
> and it is always very evident when using it, ie
> 
>   project.module.repository.workspace.name
> 
> Just yells "You're handling a project and accessing something related to the
> workspace. Why is that????" right at ya.

Yep.

> One thing that got Gump2 into trouble was that things were relatively
> tightly coupled to one another. Having "manual" modelling means that it's easy to
> spot that coupling (just delete all links from repository->workspace, run
> your project-related code, boom, it blows up).
> 
> As with databases, I (model designer) have to work real hard so the plugin
> programmer has an easier time. Interestingly...
> 
>>I find it somewhat ironic that you now code in a dynamically typed
>>language (and, AFAIK, with good feelings about it) and you advocate that
>>static typing of your data (object or SQL doesn't really matter) is
>>better for you.
> 
> 
> I hadn't realised this that clearly just yet. I've been consciously making a
> lot of things statically typed to keep it understandable. Now...
> 
> <snip/>
> 
>>  failed_builds = model.get("?x is_a Build where ?x status 'failed'")
> 
> 
> Is indeed quite understandable. At least I had no problem understanding that
> when I first saw it.

Glad to hear that. I find it quite understandable myself, but only when 
you remove all the complexity that is introduced by the fact that all 
those things need to be globally unique URIs. Luckily some APIs came to 
the rescue.

>>Sure, the argument that objects are better than dealing with JDBC
>>resultsets by hand stands, but making this a general rule could turn
>>out to be a mistake.
> 
> Do you know of an open-source reasonably sized RDF-model-based application
> that follows the approach you're describing? I'd like to see how it turns
> out! I was looking at Haystack the other day but uhm, it suffers from all of
> those "research project" flaws.

eheh, well, we are building one as we speak, but can't tell you more :-)

Let me just say that we have been dealing with as many as 30 million 
statements and as long as your queries are reasonable (say you don't 
iterate over all of the nodes!), the performance is reasonable as well.

Haystack tried to do too much (they are modelling their entire system, 
including the UI, with RDF statements... which means that it's pretty 
painful to do anything).

> Same comment again....
> 
>>I find it somewhat ironic that you now code in a dynamically typed
>>language (and, AFAIK, with good feelings about it) and you advocate that
>>static typing of your data (object or SQL doesn't really matter) is
>>better for you.
> 
> 
> You know, I still have mixed feelings about a lot of that. I have read so
> much python code recently that is hard to understand because it's really
> dynamic, often for no good reason. And I've also seen a lot of python code
> look really bad because developers want to add security in there that can't
> truly be enforced (ie Zope). And a whole lot of python code that is horribly
> structured simply because you can do a lot of "glueing" so easily.
> 
> On a code level scale, working with python can be real fun once you get the
> hang of it, but every time I write something like
> 
>   for command in [command for command in commands \
>       if isinstance(command,Script)]:
>     handle_script(command)
> 
> (which is kinda "pythonic")
> I do wonder whether
> 
>   it = commands.iterator();
>   while(it.hasNext()) {
>     command = it.next();
>     if(command instanceof Script)
>       handleScript(command);
>   }
> 
> Doesn't make more sense if there are other developers that have to understand
> the code.

True enough.

All I can say is that semi-structures deal with the entropy of things a 
lot better than forcing structure on top of them: "refactoring" data in 
a triple store could be as easy as writing a few more owl:sameAs 
statements between node types and running an inferencing engine on it 
(maybe for a few hours or a day... *while* the system is still running).
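
To make that "refactoring" idea a bit more concrete, here is a rough Python sketch. It is an assumption-laden toy, not how a real inferencing engine works: the function names (`canonical_finder`, `rewrite`) and the "Module is now called Repository" statement are made up for illustration, and a real reasoner would typically keep both names queryable rather than rewrite anything.

```python
SAME_AS = "owl:sameAs"

def canonical_finder(triples):
    """Build a lookup that maps every node to one representative name.

    Hypothetical helper: uses a small union-find over the owl:sameAs statements.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # The object of the sameAs statement becomes the representative.
                parent[root_s] = root_o
    return find

def rewrite(triples):
    """Return the statements with every node replaced by its representative."""
    find = canonical_finder(triples)
    return {(find(s), p, find(o)) for s, p, o in triples if p != SAME_AS}

# Old data talks about "Module"; one extra statement declares the new name.
triples = [
    ("ModuleA", "is_a", "Module"),
    ("ModuleA", "contains", "ProjectA"),
    ("Module", SAME_AS, "Repository"),  # assumed renaming, for illustration only
]
refactored = rewrite(triples)
```

The existing statements are never edited by hand; the one added owl:sameAs statement is enough for the old data to show up under the new type name.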

-- 
Stefano.




Re: RDF

Posted by Leo Simons <ma...@leosimons.com>.
On 17-04-2005 00:53, "Stefano Mazzocchi" <st...@apache.org> wrote:
> For example, if you have the Module object and the Project object, you have
> to decide which way the link goes, and the notion of "Module.projects"
> means "this is the list of projects this module contains".
> 
> Problem is that this implicit modeling forces you to decide the
> direction of the link, and, in case you want both, you have to model
> this explicitly and, at update time, you need to know where to change.
> 
> In RDF, you don't have to do all that.

Exactly! If you want a bi-directional link you have to model it explicitly
and it is always very evident when using it, ie

  project.module.repository.workspace.name

Just yells "You're handling a project and accessing something related to the
workspace. Why is that????" right at ya.

One thing that got Gump2 into trouble was that things were relatively
tightly coupled to one another. Having "manual" modelling means that it's easy to
spot that coupling (just delete all links from repository->workspace, run
your project-related code, boom, it blows up).

As with databases, I (model designer) have to work real hard so the plugin
programmer has an easier time. Interestingly...
> I find it somewhat ironic that you now code in a dynamically typed
> language (and, AFAIK, with good feelings about it) and you advocate that
> static typing of your data (object or SQL doesn't really matter) is
> better for you.

I hadn't realised this that clearly just yet. I've been consciously making a
lot of things statically typed to keep it understandable. Now...

<snip/>
>   failed_builds = model.get("?x is_a Build where ?x status 'failed'")

Is indeed quite understandable. At least I had no problem understanding that
when I first saw it.

> Sure, the argument that objects are better than dealing with JDBC
> resultsets by hand stands, but making this a general rule could turn
> out to be a mistake.

Do you know of an open-source reasonably sized RDF-model-based application
that follows the approach you're describing? I'd like to see how it turns
out! I was looking at Haystack the other day but uhm, it suffers from all of
those "research project" flaws.

Same comment again....
> I find it somewhat ironic that you now code in a dynamically typed
> language (and, AFAIK, with good feelings about it) and you advocate that
> static typing of your data (object or SQL doesn't really matter) is
> better for you.

You know, I still have mixed feelings about a lot of that. I have read so
much python code recently that is hard to understand because it's really
dynamic, often for no good reason. And I've also seen a lot of python code
look really bad because developers want to add security in there that can't
truly be enforced (ie Zope). And a whole lot of python code that is horribly
structured simply because you can do a lot of "glueing" so easily.

On a code level scale, working with python can be real fun once you get the
hang of it, but every time I write something like

  for command in [command for command in commands \
      if isinstance(command,Script)]:
    handle_script(command)

(which is kinda "pythonic")
I do wonder whether

  it = commands.iterator();
  while(it.hasNext()) {
    command = it.next();
    if(command instanceof Script)
      handleScript(command);
  }

Doesn't make more sense if there are other developers that have to understand
the code.

G'day!

LSD





Re: RDF

Posted by Stefano Mazzocchi <st...@apache.org>.
Leo Simons wrote:

[snip]

> So, ehm, no, I don't actually think it'll be a tremendous win. It'll bring
> some huge benefits, but it'll incur a big cost as well. Simplicity loss.
> 
> Or maybe not. I'm not exactly an expert here. We do have one of those around
> I think. Hence: "Show me!"

The way you deal with statements is a little different than the way you 
deal with objects. Objects have explicit semantics, as much as 
statements, but their relationships are not typed.

For example, if you have the Module object and the Project object, you have 
to decide which way the link goes, and the notion of "Module.projects" 
means "this is the list of projects this module contains".

Problem is that this implicit modeling forces you to decide the 
direction of the link, and, in case you want both, you have to model 
this explicitly and, at update time, you need to know where to change.

In RDF, you don't have to do all that. If you have a bunch of statements

  ModuleA -(is_a)-> Module
  ProjectA -(is_a)-> Project
  ModuleA -(contains)-> ProjectA
  ProjectA -(has_name)-> "Cocoon"@en^string
  Build-20050415-343 -(is_a)-> Build
  Build-20050415-343 -(built)-> ProjectA
  Build-20050415-343 -(status)-> "failed"@en^string
  Build-20050415-343 -(depends)-> Build-20050415-234
  ...

and so on. It's basically a log of the things you come to know about 
stuff and this becomes your knowledge base. No structure, you don't need 
it, you just need to be careful about how you model things and this 
becomes natural and grows with you. No need to define the objects nor 
the schema before you know how complex your data is.
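
As a minimal sketch of that "log of statements" idea, assuming nothing more than a Python set of (subject, predicate, object) tuples, the following shows that one `contains` statement can be followed in either direction without modelling a back-link. The helper names `add`, `objects` and `subjects` are made up for illustration and do not belong to any particular triple store.

```python
# Hypothetical in-memory knowledge base: just a set of statements.
triples = set()

def add(subject, predicate, obj):
    """Record one statement; no schema has to exist beforehand."""
    triples.add((subject, predicate, obj))

add("ModuleA", "is_a", "Module")
add("ProjectA", "is_a", "Project")
add("ModuleA", "contains", "ProjectA")
add("ProjectA", "has_name", "Cocoon")
add("Build-20050415-343", "is_a", "Build")
add("Build-20050415-343", "built", "ProjectA")
add("Build-20050415-343", "status", "failed")

def objects(subject, predicate):
    """Follow a link forwards: what `subject` points at via `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

def subjects(predicate, obj):
    """Follow the same link backwards: what points at `obj` via `predicate`."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)

projects_in_module = objects("ModuleA", "contains")    # module -> projects
modules_of_project = subjects("contains", "ProjectA")  # project -> modules
```

Both lookups read the same single statement, so there is only one place to change at update time.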

Very incremental, very XP, fits nicely both in the laziness mode and in 
the separation between data production and data consumption that we want 
to enforce in Gump3.

Now, what about the data consumption side?

Well, the data is in the triple store, so you need to query it. There 
are many different ways to do this, but two main categories:

  1) via an API
  2) via a query language

Depending on the triple store you use, you get a different API and/or 
query language. The API feels more natural, but gives the triple store 
less room to optimize.

For example (pseudocode):

Get all modules:
  modules = model.getSubjects("is_a", "Module")

Get all builds that failed:
  failed_builds = []
  builds = model.getSubjects("is_a", "Build")
  foreach (build in builds):
    status = model.getObjects(build, "status")
    if (status == "failed"):
      failed_builds.add(build)

you get the idea.
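
For what it's worth, the API style above could look roughly like this in runnable Python. This is only a sketch under assumptions: `Model`, `get_subjects` and `get_objects` are a hypothetical stand-in written for this example, not the API of any real triple store, and the "success" status of the second build is invented so the filter has something to exclude.

```python
class Model:
    """Hypothetical stand-in for a triple-store API (not a real library)."""

    def __init__(self, triples):
        self.triples = set(triples)

    def get_subjects(self, predicate, obj):
        """All subjects that have the given predicate and object."""
        return sorted(s for s, p, o in self.triples
                      if p == predicate and o == obj)

    def get_objects(self, subject, predicate):
        """All objects stated for the given subject and predicate."""
        return sorted(o for s, p, o in self.triples
                      if s == subject and p == predicate)

model = Model([
    ("ModuleA", "is_a", "Module"),
    ("ProjectA", "is_a", "Project"),
    ("ModuleA", "contains", "ProjectA"),
    ("Build-20050415-343", "is_a", "Build"),
    ("Build-20050415-343", "status", "failed"),
    ("Build-20050415-234", "is_a", "Build"),
    ("Build-20050415-234", "status", "success"),  # assumed value, for illustration
])

# Get all modules.
modules = model.get_subjects("is_a", "Module")

# Get all builds that failed: one lookup per build, filtered on the client side.
failed_builds = [
    build
    for build in model.get_subjects("is_a", "Build")
    if "failed" in model.get_objects(build, "status")
]
```

Note that the filtering happens in the calling code, one lookup per build, which is the part a store cannot optimize for you.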

But you could also do something like

  failed_builds = model.get("?x is_a Build where ?x status 'failed'")
	
which is not that hard to get.
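
To give a feel for what sits behind such a query, here is a small Python sketch of triple-pattern matching. It is a toy under stated assumptions: the query is passed as a list of patterns rather than parsed from the string syntax above, terms starting with "?" are treated as variables, and `match` and `query` are names invented for this example. Real query languages do a great deal more, including optimizing the join order.

```python
def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it agrees with an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant that does not match this triple.
            return None
    return result

def query(triples, patterns):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions

triples = [
    ("ModuleA", "is_a", "Module"),
    ("Build-20050415-343", "is_a", "Build"),
    ("Build-20050415-343", "status", "failed"),
    ("Build-20050415-234", "is_a", "Build"),
]

# Rough equivalent of: ?x is_a Build where ?x status 'failed'
failed_builds = sorted(
    solution["?x"]
    for solution in query(triples, [("?x", "is_a", "Build"),
                                    ("?x", "status", "failed")])
)
```

Because the whole question is handed over in one go, the store (here, the `query` function) is free to decide how to evaluate it.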

Objects are just syntax sugar around SQL statements: you have to model 
your data first, then add it in. In RDF it is the other way around: you 
pile up your data and the database follows you.

Sure, the argument that objects are better than dealing with JDBC 
resultsets by hand stands, but making this a general rule could turn 
out to be a mistake.

The vision of RDF is data first, metadata later. The vision of 
relational databases is metadata first, data later.

And the funny thing is that nothing in the relational model itself 
suggests that (in fact, RDF is nothing but an explicit 
relational model with globally unique identifiers), but the idea of 
building a database by creating a schema was driven by the vision that 
static typing is good for you even if it locks you in (it certainly is 
good for the query indexers, and performance is clearly not the best 
feature of a triple store nowadays).

I find it somewhat ironic that you now code in a dynamically typed 
language (and, AFAIK, with good feelings about it) and you advocate that 
static typing of your data (object or SQL doesn't really matter) is 
better for you.

I think RDF offers a better model, especially for something integrating 
data and metadata from different independent domains like Gump.

But of course, I'm biased.

-- 
Stefano.




Re: RDF

Posted by Leo Simons <ma...@leosimons.com>.
On 16-04-2005 21:59, "Stefano Mazzocchi" <st...@apache.org> wrote:
> Leo Simons wrote:
>> On 16-04-2005 18:30, "Stefano Mazzocchi" <st...@apache.org> wrote:
>> 
>>> The more I think about it, the more I think that having our data in RDF
>>> would be a tremendous win, also in terms of programming.
>>  
>> Show me!
> 
> Nice try ;-)

Yeah I thought so :-D

I just spent some time trying to envision what gump would look like codewise with
an RDF triplestore at its core. It would be a lot more like an application
that uses a database for all its storage, except that the database stores
triples instead of rows. You'd then have lots of RDQL (or similar) queries
sprinkled throughout the codebase.

That wouldn't look very nice or easy to understand at all. Adam and now I in
his footsteps have worked pretty hard to make the distance between the
conceptual model (in the form of clean python objects) and its XML
representation huge, simply because that makes the majority of the code a
lot easier to understand.

Using RDF at the core instead of an object model would mean you would need
to understand RDF and how we map our conceptual model onto RDF in order to
be productive in development. That would not be nice. We have enough
concepts in there already.

Unless, of course, you could build a "magic" autogenerated model where
property setting and getting actually triggers interaction with the RDF
datastore. Not magical object-relational but magical object-triple mapping.
And, once you go there, it turns out that it doesn't matter that much right
now whether we move to RDF or not; we can just develop our plugins against
the "manual model" and do something "magic" later.
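
Something like the following Python sketch is what I picture for that object-triple mapping. It is only a guess at the shape of it: `TripleStore` and `Node` are hypothetical classes written for this example, each property is assumed to hold a single value, and nothing here is existing Gump code.

```python
class TripleStore:
    """Hypothetical in-memory store of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def set_value(self, subject, predicate, value):
        # Assumes single-valued properties: drop the old statement, add the new.
        self.triples = {t for t in self.triples
                        if t[:2] != (subject, predicate)}
        self.triples.add((subject, predicate, value))

    def get_value(self, subject, predicate):
        for s, p, o in self.triples:
            if (s, p) == (subject, predicate):
                return o
        raise AttributeError(predicate)


class Node:
    """Object facade: attribute access is turned into store operations."""

    def __init__(self, store, node_id):
        # Bypass our own __setattr__ for the two real attributes.
        object.__setattr__(self, "_store", store)
        object.__setattr__(self, "_id", node_id)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        return self._store.get_value(self._id, name)

    def __setattr__(self, name, value):
        self._store.set_value(self._id, name, value)


store = TripleStore()
project = Node(store, "ProjectA")
project.has_name = "Cocoon"   # writes a statement to the store
name = project.has_name       # reads it back from the store
```

Plugin code written against `project.has_name` would not need to change if the storage behind it did, which is the reason the decision can wait.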

You may know I'm a little shy about "magic" (where's my little essay on that
again :-D); experience showed that a very smart sax-based xml-querying
automodelling is very possible (sam wrote one, remember) and very hard to
understand.

So, ehm, no, I don't actually think it'll be a tremendous win. It'll bring
some huge benefits, but it'll incur a big cost as well. Simplicity loss.

Or maybe not. I'm not exactly an expert here. We do have one of those around
I think. Hence: "Show me!"

:-D

Cheers,

LSD





Re: RDF

Posted by Stefano Mazzocchi <st...@apache.org>.
Leo Simons wrote:
> On 16-04-2005 18:30, "Stefano Mazzocchi" <st...@apache.org> wrote:
> 
>>The more I think about it, the more I think that having our data in RDF
>>would be a tremendous win, also in terms of programming.
>  
> Show me!

Nice try ;-)

-- 
Stefano.

