Posted to dev@stanbol.apache.org by Reto Bachmann-Gmür <re...@apache.org> on 2012/11/12 20:45:58 UTC

Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

May I suggest the following toy-usecase for comparing different API
proposals (we know all APIs can be used for triple stores, so it seems
interesting how they can be used to expose arbitrary data as RDF, and what
the space complexity of such an adapter is):

Given

interface Person {
    String getGivenName();
    String getLastName();
    /**
     * @return true if other is an instance of Person with the same
     *         given name and last name, false otherwise
     */
    boolean equals(Object other);
}

Provide a method

Graph getAsGraph(Set<Person> persons);

where `Graph` is the API of an RDF Graph that can change over time. The
returned `Graph` shall (if possible) be backed by the Set passed as argument
and thus reflect future changes to that set. The Graph shall support all
read operations but no addition or removal of triples. It's ok if some
iterations over the graph result in a ConcurrentModificationException when the
set changes during iteration (as one would get when iterating over the set
during such a modification).

- What does the code look like?
- Is it backed by the Set, and does the resulting Graph reflect changes to the
set?
- What's the space complexity?

Challenge accepted?

Reto
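
For reference, a minimal sketch of one possible adapter. The Triple, BNode and
Literal types and the RDF/FOAF constants used below are hypothetical
placeholders for whatever the concrete RDF API provides (the challenge leaves
the Graph type open), so the sketch returns a plain Collection<Triple> view:

import java.util.AbstractCollection;
import java.util.Collection;
import java.util.Iterator;
import java.util.Set;

Collection<Triple> getAsGraph(final Set<Person> persons) {
    return new AbstractCollection<Triple>() {

        @Override
        public Iterator<Triple> iterator() {
            final Iterator<Person> people = persons.iterator();
            return new Iterator<Triple>() {
                private Person current;  // person whose triples are being emitted
                private int emitted = 3; // 3 means: advance to the next person

                @Override
                public boolean hasNext() {
                    return emitted < 3 || people.hasNext();
                }

                @Override
                public Triple next() {
                    if (emitted == 3) {
                        // may throw ConcurrentModificationException, as allowed
                        current = people.next();
                        emitted = 0;
                    }
                    // hypothetical BNode delegating equals/hashCode to the
                    // wrapped Person, so equal persons yield the same node
                    BNode subject = new BNode(current);
                    switch (emitted++) {
                        case 0:  return new Triple(subject, RDF_TYPE, FOAF_PERSON);
                        case 1:  return new Triple(subject, FOAF_GIVEN_NAME,
                                                   new Literal(current.getGivenName()));
                        default: return new Triple(subject, FOAF_FAMILY_NAME,
                                                   new Literal(current.getLastName()));
                    }
                }

                @Override
                public void remove() {
                    throw new UnsupportedOperationException("read-only view");
                }
            };
        }

        @Override
        public int size() {
            // three triples per person; a robust version would clamp at
            // Integer.MAX_VALUE, as the Collection contract requires
            return 3 * persons.size();
        }

        @Override
        public boolean add(Triple triple) {
            throw new UnsupportedOperationException("read-only view");
        }
    };
}

With this shape the view is backed by the Set (every iteration and size() call
goes straight to it), and the space overhead beyond the Set is constant, since
triples are created on demand during iteration and discarded afterwards.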

On Mon, Nov 12, 2012 at 6:11 PM, Andy Seaborne <an...@apache.org> wrote:

> On 11/11/12 23:22, Rupert Westenthaler wrote:
>
>> Hi all ,
>>
>> On Sun, Nov 11, 2012 at 4:47 PM, Reto Bachmann-Gmür <re...@apache.org>
>> wrote:
>>
>>> - clerezza.rdf graduates as commons.rdf: a modular java/scala
>>> implementation of rdf related APIs, usable with and without OSGi
>>>
>>
>> For me this immediately raises the question: Why should the Clerezza
>> API become commons.rdf if 90+% (just a guess) of the Java RDF stuff is
>> based on Jena and Sesame? Creating an Apache commons project based on
>> an RDF API that is only used by a very low percentage of all Java RDF
>> applications is not feasible. Generally I see not much room for a
>> commons RDF project as long as there is not a commonly agreed RDF API
>> for Java.
>>
>
> Very good point.
>
> There is a finite and bounded supply of energy of people to work on such a
> thing and to make it work for the communities that use it.   For all of us,
> work on A means less work on B.
>
>
> An "RDF API" for applications needs to be more than RDF. A SPARQL engine
> is not simply abstracted from the storage by some "list(s,p,o)" API call.
>  It will die at scale, where scale here includes in-memory usage.
>
> My personal opinion is that wrapper APIs are not the way to go - they end
> up as a new API in themselves and the fact they are backed by different
> systems is really an implementation detail.  They end up having design
> opinions and gradually require more and more maintenance as they add more
> and more.
>
> API bridges are better (mapping one API to another - we are really talking
> about a small number of APIs, not 10s) as they expose the advantages of
> each system.
>
> The ideal is a set of interfaces systems can agree on.  I'm going to be
> contributing to the interfacization of the Graph API in Jena - if you have
> thoughts, send email to a list.
>
>         Andy
>
> PS See the work being done by Stephen Allen on coarse grained APIs:
>
> http://mail-archives.apache.org/mod_mbox/jena-dev/201206.mbox/%3CCAPTxtVOMMWxfk2%2B4ciCExUBZyxsDKvuO0QshXF8uKhaD8txXjA%40mail.gmail.com%3E
>
>
>

Re: (Back to the) Future of Clerezza and Stanbol

Posted by Reto Bachmann-Gmür <re...@apache.org>.
On Mon, Nov 19, 2012 at 9:00 PM, Andy Seaborne <an...@apache.org> wrote:

> On 19/11/12 14:13, Reto Bachmann-Gmür wrote:
>
>> >  - Linked Data Platform: Reto I guess you have missed this
>>> >presentation [1] at ApacheCon. IMO a Linked Data Platform is something
>>> >that deserves an own project and as soon as there is such a Platform
>>> >available we should use it in Stanbol. This would allow us to remove a
>>> >lot of code in Stanbol (especially in the Entityhub) - a good thing as
>>> >it allows to focus more on core features of Stanbol.
>>> >
>>>
>> I don't think this can really be compared. Clerezza is already quite close
>> to conforming with the Linked Data Platform specification. It has a
>> lightweight architecture very similar to Stanbol's, based on OSGi. By contrast,
>> the Salzburg Research Marmotta/Linda proposal is a Java Enterprise application
>> that certainly could use some Stanbol services but which has a quite
>> different architecture. The W3C LDP isn't describing a heavyweight
>> architecture but a set of recommendations on how to use the REST principles
>> in the context of linked data. If Stanbol wants to provide the promised
>> RESTful services it should strive for compliance with the LDP specifications.
>>
>
> That would be good.  This is separate from the RDF API?
>

It uses the RDF API (so that it can be used on multiple backends). But as a
project it would imho ideally be separate (I think Stanbol would be a good
place).

>
> At the HTTP protocol level W3C/LDP is quite light - but even for plain
> LDP/resources, there is a need to interpret the vocabulary and maintain
> certain triples in the data.  LDP/containers are somewhat more complicated
> (and evolving) with POST generating URIs, paging, and again needing
> vocabulary interpretation.
>
> There is a need to make some real design decisions - LDP is not a spec for
> a generic server you can implement.  It might be IF the WG decides on how
> clients can create containers and containers-in-containers - it is just as
> likely that the container structure will be fixed by "the application", so
> configuration of the server-side will be a major part of any project.  But
> it's a very big IF.
>
> The access control is important.
>
> How much of this does Clerezza show?  The WG would be interested to hear
> about it.
>

I've not yet looked at the details of section 5. For the other sections I
described compliance here:

http://wiki.apache.org/clerezza/LinkedDataPlatform

Reto

>
>         Andy
>
>
>

Re: (Back to the) Future of Clerezza and Stanbol

Posted by Andy Seaborne <an...@apache.org>.
On 19/11/12 14:13, Reto Bachmann-Gmür wrote:
>> >  - Linked Data Platform: Reto I guess you have missed this
>> >presentation [1] at ApacheCon. IMO a Linked Data Platform is something
>> >that deserves an own project and as soon as there is such a Platform
>> >available we should use it in Stanbol. This would allow us to remove a
>> >lot of code in Stanbol (especially in the Entityhub) - a good thing as
>> >it allows to focus more on core features of Stanbol.
>> >
> I don't think this can really be compared. Clerezza is already quite close
> to conforming with the Linked Data Platform specification. It has a
> lightweight architecture very similar to Stanbol's, based on OSGi. By contrast,
> the Salzburg Research Marmotta/Linda proposal is a Java Enterprise application
> that certainly could use some Stanbol services but which has a quite
> different architecture. The W3C LDP isn't describing a heavyweight
> architecture but a set of recommendations on how to use the REST principles
> in the context of linked data. If Stanbol wants to provide the promised
> RESTful services it should strive for compliance with the LDP specifications.

That would be good.  This is separate from the RDF API?

At the HTTP protocol level W3C/LDP is quite light - but even for plain 
LDP/resources, there is a need to interpret the vocabulary and 
maintain certain triples in the data.  LDP/containers are somewhat more 
complicated (and evolving) with POST generating URIs, paging, and again 
needing vocabulary interpretation.

There is a need to make some real design decisions - LDP is not a spec 
for a generic server you can implement.  It might be IF the WG decides 
on how clients can create containers and containers-in-containers - it 
is just as likely that the container structure will be fixed by "the 
application", so configuration of the server-side will be a major part 
of any project.  But it's a very big IF.

The access control is important.

How much of this does Clerezza show?  The WG would be interested to hear 
about it.

	Andy



Re: (Back to the) Future of Clerezza and Stanbol

Posted by Reto Bachmann-Gmür <re...@apache.org>.
On Wed, Nov 14, 2012 at 6:35 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi
>
> I am more with Fabian. The fact is that Clerezza does not have much
> activity. I am a Clerezza committer myself, and the reason why I am
> rather inactive is that I have enough things to do for Stanbol.
> This will not change much in the future, either. Moving the Clerezza
> modules to Stanbol does not solve this problem; it only moves it
> from Clerezza over to Stanbol.
>

I'm also involved in the fusepool.eu project, which will need a platform
providing more than the current Stanbol HTTP endpoints, namely real REST
endpoints for humans and machines. Security and a pluggable, optional UI are
also needed. If Stanbol wants to go in this direction then it would be good
to integrate the parts of Clerezza providing this.

From adding security to Stanbol I've seen that things get messy with two
such similar projects.


>
>  - Linked Data Platform: Reto I guess you have missed this
> presentation [1] at ApacheCon. IMO a Linked Data Platform is something
> that deserves an own project and as soon as there is such a Platform
> available we should use it in Stanbol. This would allow us to remove a
> lot of code in Stanbol (especially in the Entityhub) - a good thing as
> it allows to focus more on core features of Stanbol.
>

I don't think this can really be compared. Clerezza is already quite close
to conforming with the Linked Data Platform specification. It has a
lightweight architecture very similar to Stanbol's, based on OSGi. By contrast,
the Salzburg Research Marmotta/Linda proposal is a Java Enterprise application
that certainly could use some Stanbol services but which has a quite
different architecture. The W3C LDP isn't describing a heavyweight
architecture but a set of recommendations on how to use the REST principles
in the context of linked data. If Stanbol wants to provide the promised
RESTful services it should strive for compliance with the LDP specifications.

Cheers,
Reto

Re: (Back to the) Future of Clerezza and Stanbol

Posted by Reto Bachmann-Gmür <re...@wymiwyg.com>.
On 14 Nov 2012 18:36, "Rupert Westenthaler" <ru...@gmail.com>
wrote:

>
>  - Linked Data Platform: Reto I guess you have missed this
> presentation [1] at ApacheCon. IMO a Linked Data Platform is something
> that deserves an own project and as soon as there is such a Platform
> available we should use it in Stanbol. This would allow us to remove a
> lot of code in Stanbol (especially in the Entityhub) - a good thing as
> it allows to focus more on core features of Stanbol.
>
> best
> Rupert
>
> [1] http://www.slideshare.net/Wikier/incubating-apache-linda
>
> On Wed, Nov 14, 2012 at 4:56 PM, Reto Bachmann-Gmür <re...@apache.org>
wrote:
> > Thanks for bringing the discussion back to the main issue.
> >
> > Clerezza could graduate as it is. But imho it would make sense to split
> > clerezza into:
> >
> > - RDF libs
> > - Linked Data Platform
> >
> > Imho the Semantic Platform that should strive for compliance with LDPWG
> > standards could merge with Apache Stanbol as in fact for many modules
it's
> > hard to say were they best belong to. For this the clerezza stuff should
> > not become a branch but a subproject of stanbol that can be released
> > individually if needed. This subproject should become thinner and
thinner
> > as more stuff is being moved to the stanbol platform as technologies are
> > being aligned. Discussing if this would be possible should be
independent
> > of the RDF API stuff.
> >
> > Cheers,
> > Reto
> >
> > On Wed, Nov 14, 2012 at 4:18 PM, Fabian Christ <
christ.fabian@googlemail.com
> >> wrote:
> >
> >> Hi Andy,
> >>
> >> thanks for bringing the discussion back to the point where it started.
> >>
> >> Here is my view:
> >>
> >> If Clerezza can not graduate then the sources should be moved into the
> >> archive. The Stanbol community can then freely fork from there and take
> >> what it is needed. Other communities who also use Clerezza may do the
same
> >> to keep their projects working (it is not only a matter for Stanbol).
> >> Clerezza committers are more than welcome to join Stanbol and help to
> >> migrate the parts of Clerezza that are useful for Stanbol.
> >>
> >> I agree with Rupert that the best way to do it, is to set up branches
to
> >> explore different development paths.
> >>
> >> Maybe Clerezza will be able to graduate if they focus on a smaller set
of
> >> components. But this is a discussion for the Clerezza dev list.
> >>
> >> Best,
> >>  - Fabian
> >>
> >>
> >> 2012/11/14 Andy Seaborne <an...@apache.org>
> >>
> >> > The original issue was about whether migrating (part of) Clerezza
into
> >> > Stanbol made sense.  The concern raised was resourcing.
> >> >
> >> > Coupling this to new API design is making the resourcing more of a
> >> > problem, not less.
> >> >
> >> > If I understand the discussion ....
> >> >
> >> > Short term::
> >> >
> >> > Can Clerezza achieve graduation?
> >> >
> >> > Or not, does splitting out the part of Clerezza that Stanbol depends
on
> >> > work? (I sense "yes" with little work needed).  Maintaining such
> >> > transferred code was raised as a concern - e.g. SPARQL 1.1 access.
> >> >
> >> > Long term::
> >> >
> >> > Where does this leave Stanbol?  Does the maintenance cost concern
remain?
> >> > or even get worse?
> >> >
> >> > I don't have sufficient knowledge of the codebase to know what the
> >> balance
> >> > is between fine-grained API work and query-based access (and update).
> >> >
> >> > How important is switching between (e.g.) storage providers?
> >> >
> >> > (local storage - remote would be SPARQL so stanbol-client-code and
> >> > other-server can be chosen separately - that's why we do standards!)
> >> >
> >> >         Andy
> >> >
> >> >
> >>
> >>
> >> --
> >> Fabian
> >> http://twitter.com/fctwitt
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

Re: (Back to the) Future of Clerezza and Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

I am more with Fabian. The fact is that Clerezza does not have much
activity. I am a Clerezza committer myself, and the reason why I am
rather inactive is that I have enough things to do for Stanbol.
This will not change much in the future, either. Moving the Clerezza
modules to Stanbol does not solve this problem; it only moves it
from Clerezza over to Stanbol.

 - RDF libs: If Clerezza is no longer actively developed, then Stanbol
should - in the long term - switch to another RDF framework. RDF is
not a core feature of Stanbol, so we should rather use existing stuff
than manage our own. So "if" Clerezza cannot graduate, then the
scenario mentioned by Fabian seems likely to me as well.

 - Linked Data Platform: Reto I guess you have missed this
presentation [1] at ApacheCon. IMO a Linked Data Platform is something
that deserves an own project and as soon as there is such a Platform
available we should use it in Stanbol. This would allow us to remove a
lot of code in Stanbol (especially in the Entityhub) - a good thing as
it allows to focus more on core features of Stanbol.

best
Rupert

[1] http://www.slideshare.net/Wikier/incubating-apache-linda

On Wed, Nov 14, 2012 at 4:56 PM, Reto Bachmann-Gmür <re...@apache.org> wrote:
> Thanks for bringing the discussion back to the main issue.
>
> Clerezza could graduate as it is. But imho it would make sense to split
> clerezza into:
>
> - RDF libs
> - Linked Data Platform
>
> Imho the Semantic Platform that should strive for compliance with LDPWG
> standards could merge with Apache Stanbol as in fact for many modules it's
> hard to say were they best belong to. For this the clerezza stuff should
> not become a branch but a subproject of stanbol that can be released
> individually if needed. This subproject should become thinner and thinner
> as more stuff is being moved to the stanbol platform as technologies are
> being aligned. Discussing if this would be possible should be independent
> of the RDF API stuff.
>
> Cheers,
> Reto
>
> On Wed, Nov 14, 2012 at 4:18 PM, Fabian Christ <christ.fabian@googlemail.com
>> wrote:
>
>> Hi Andy,
>>
>> thanks for bringing the discussion back to the point where it started.
>>
>> Here is my view:
>>
>> If Clerezza can not graduate then the sources should be moved into the
>> archive. The Stanbol community can then freely fork from there and take
>> what it is needed. Other communities who also use Clerezza may do the same
>> to keep their projects working (it is not only a matter for Stanbol).
>> Clerezza committers are more than welcome to join Stanbol and help to
>> migrate the parts of Clerezza that are useful for Stanbol.
>>
>> I agree with Rupert that the best way to do it, is to set up branches to
>> explore different development paths.
>>
>> Maybe Clerezza will be able to graduate if they focus on a smaller set of
>> components. But this is a discussion for the Clerezza dev list.
>>
>> Best,
>>  - Fabian
>>
>>
>> 2012/11/14 Andy Seaborne <an...@apache.org>
>>
>> > The original issue was about whether migrating (part of) Clerezza into
>> > Stanbol made sense.  The concern raised was resourcing.
>> >
>> > Coupling this to new API design is making the resourcing more of a
>> > problem, not less.
>> >
>> > If I understand the discussion ....
>> >
>> > Short term::
>> >
>> > Can Clerezza achieve graduation?
>> >
>> > Or not, does splitting out the part of Clerezza that Stanbol depends on
>> > work? (I sense "yes" with little work needed).  Maintaining such
>> > transferred code was raised as a concern - e.g. SPARQL 1.1 access.
>> >
>> > Long term::
>> >
>> > Where does this leave Stanbol?  Does the maintenance cost concern remain?
>> > or even get worse?
>> >
>> > I don't have sufficient knowledge of the codebase to know what the
>> balance
>> > is between fine-grained API work and query-based access (and update).
>> >
>> > How important is switching between (e.g.) storage providers?
>> >
>> > (local storage - remote would be SPARQL so stanbol-client-code and
>> > other-server can be chosen separately - that's why we do standards!)
>> >
>> >         Andy
>> >
>> >
>>
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: (Back to the) Future of Clerezza and Stanbol

Posted by Reto Bachmann-Gmür <re...@apache.org>.
Thanks for bringing the discussion back to the main issue.

Clerezza could graduate as it is. But imho it would make sense to split
clerezza into:

- RDF libs
- Linked Data Platform

Imho the Semantic Platform that should strive for compliance with the LDP WG
standards could merge with Apache Stanbol, as in fact for many modules it's
hard to say where they best belong. For this, the Clerezza stuff should
not become a branch but a subproject of Stanbol that can be released
individually if needed. This subproject should become thinner and thinner
as more stuff is moved to the Stanbol platform and technologies are
aligned. Discussing whether this would be possible should be independent
of the RDF API stuff.

Cheers,
Reto

On Wed, Nov 14, 2012 at 4:18 PM, Fabian Christ <christ.fabian@googlemail.com
> wrote:

> Hi Andy,
>
> thanks for bringing the discussion back to the point where it started.
>
> Here is my view:
>
> If Clerezza can not graduate then the sources should be moved into the
> archive. The Stanbol community can then freely fork from there and take
> what it is needed. Other communities who also use Clerezza may do the same
> to keep their projects working (it is not only a matter for Stanbol).
> Clerezza committers are more than welcome to join Stanbol and help to
> migrate the parts of Clerezza that are useful for Stanbol.
>
> I agree with Rupert that the best way to do it, is to set up branches to
> explore different development paths.
>
> Maybe Clerezza will be able to graduate if they focus on a smaller set of
> components. But this is a discussion for the Clerezza dev list.
>
> Best,
>  - Fabian
>
>
> 2012/11/14 Andy Seaborne <an...@apache.org>
>
> > The original issue was about whether migrating (part of) Clerezza into
> > Stanbol made sense.  The concern raised was resourcing.
> >
> > Coupling this to new API design is making the resourcing more of a
> > problem, not less.
> >
> > If I understand the discussion ....
> >
> > Short term::
> >
> > Can Clerezza achieve graduation?
> >
> > Or not, does splitting out the part of Clerezza that Stanbol depends on
> > work? (I sense "yes" with little work needed).  Maintaining such
> > transferred code was raised as a concern - e.g. SPARQL 1.1 access.
> >
> > Long term::
> >
> > Where does this leave Stanbol?  Does the maintenance cost concern remain?
> > or even get worse?
> >
> > I don't have sufficient knowledge of the codebase to know what the
> balance
> > is between fine-grained API work and query-based access (and update).
> >
> > How important is switching between (e.g.) storage providers?
> >
> > (local storage - remote would be SPARQL so stanbol-client-code and
> > other-server can be chosen separately - that's why we do standards!)
> >
> >         Andy
> >
> >
>
>
> --
> Fabian
> http://twitter.com/fctwitt
>

Re: (Back to the) Future of Clerezza and Stanbol

Posted by Fabian Christ <ch...@googlemail.com>.
Hi Andy,

thanks for bringing the discussion back to the point where it started.

Here is my view:

If Clerezza can not graduate then the sources should be moved into the
archive. The Stanbol community can then freely fork from there and take
what is needed. Other communities who also use Clerezza may do the same
to keep their projects working (it is not only a matter for Stanbol).
Clerezza committers are more than welcome to join Stanbol and help to
migrate the parts of Clerezza that are useful for Stanbol.

I agree with Rupert that the best way to do it is to set up branches to
explore different development paths.

Maybe Clerezza will be able to graduate if they focus on a smaller set of
components. But this is a discussion for the Clerezza dev list.

Best,
 - Fabian


2012/11/14 Andy Seaborne <an...@apache.org>

> The original issue was about whether migrating (part of) Clerezza into
> Stanbol made sense.  The concern raised was resourcing.
>
> Coupling this to new API design is making the resourcing more of a
> problem, not less.
>
> If I understand the discussion ....
>
> Short term::
>
> Can Clerezza achieve graduation?
>
> Or not, does splitting out the part of Clerezza that Stanbol depends on
> work? (I sense "yes" with little work needed).  Maintaining such
> transferred code was raised as a concern - e.g. SPARQL 1.1 access.
>
> Long term::
>
> Where does this leave Stanbol?  Does the maintenance cost concern remain?
> or even get worse?
>
> I don't have sufficient knowledge of the codebase to know what the balance
> is between fine-grained API work and query-based access (and update).
>
> How important is switching between (e.g.) storage providers?
>
> (local storage - remote would be SPARQL so stanbol-client-code and
> other-server can be chosen separately - that's why we do standards!)
>
>         Andy
>
>


-- 
Fabian
http://twitter.com/fctwitt

(Back to the) Future of Clerezza and Stanbol

Posted by Andy Seaborne <an...@apache.org>.
The original issue was about whether migrating (part of) Clerezza into 
Stanbol made sense.  The concern raised was resourcing.

Coupling this to new API design is making the resourcing more of a 
problem, not less.

If I understand the discussion ....

Short term::

Can Clerezza achieve graduation?

Or not, does splitting out the part of Clerezza that Stanbol depends on 
work? (I sense "yes" with little work needed).  Maintaining such 
transferred code was raised as a concern - e.g. SPARQL 1.1 access.

Long term::

Where does this leave Stanbol?  Does the maintenance cost concern 
remain? or even get worse?

I don't have sufficient knowledge of the codebase to know what the 
balance is between fine-grained API work and query-based access (and 
update).

How important is switching between (e.g.) storage providers?

(local storage - remote would be SPARQL so stanbol-client-code and 
other-server can be chosen separately - that's why we do standards!)

	Andy


Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by adasal <ad...@gmail.com>.
Hello,
I think that Sebastian Schaffert is looking at things from a large data set
point of view while Reto wants to evaluate clear and efficient design.

Reto says:

> Besides I would like to compare possible APIs here, ideally the best API
> would be largely adopted
> making wrapper superfluous. (I could also mention that the jena Model class
> also wraps a Graph instance)

So some sort of wrapper will be implemented.
I think Reto is concerned with well-suitedness, that is, that where

> ... having a wrapper on these objects that makes them RDF graphs is the
> first step to then allow processing with
> the generic RDF tools and e.g. merging with other RDF data

the object types remain available for evaluation before (and after?)
insertion into the triple store (I suppose).
Maybe that point touches on Sebastian's concerns? I think that Sebastian is
concerned that such a design challenge does not lead to a memory swamp, and
there are several reasons for this.
It is not just about large data sets using large amounts of memory if the
design is wrong. It is also that other use cases require that objects be
serialized early and efficiently.
In RDBMS ORM and caching, this is because another part of the system, the
cache - perhaps caches, e.g. mem or ESI - is watching for changes.
This is something else, of course: there must be a UUID associated
with the object (or triple) to facilitate this mechanism.

Best,

Adam


On 13 November 2012 13:50, Reto Bachmann-Gmür <re...@apache.org> wrote:

> On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
> sebastian.schaffert@salzburgresearch.at> wrote:
> [...]
>
> >
> > Despite the solution I described, I still do not think the scenario is
> > well suited for evaluating RDF APIs. You also do not use Hibernate to
> > evaluate whether an RDBMS is good or not.
> >
> The usecase I propose is, I think, not the only one; I just think
> that API comparison should be based on evaluating suitability for
> different concretely defined usecases. It has nothing to do with
> Hibernate, nor with annotation-based object-to-RDF property mapping
> (of which there have been several proposals). It's the same principle as
> any23 or aperture, but on the Java object level rather than the binary data
> level. I have my infrastructure that deals with graphs, I have a Set of
> contacts: what does the missing bit look like that lets me process this set
> with my RDF infrastructure? It's a reality that people don't (yet) have all
> their data as graphs; they might have some contacts in LDAP and some mails
> on an IMAP server.
>
>
> > >>
> > >> If this is really an issue, I would suggest coming up with a bigger
> > >> collection of RDF API usage scenarios that are also relevant in
> practice
> > >> (as proven by a software project using it). Including scenarios how to
> > deal
> > >> with bigger amounts of data (i.e. beyond toy examples). My scenarios
> > >> typically include >= 100 million triples. ;-)
> > >>
> > >> In addition to what Andy said about wrapper APIs, I would also like to
> > >> emphasise the incurred memory and computation overhead of wrapper
> APIs.
> > Not
> > >> an issue if you have only a handful of triples, but a big issue when
> you
> > >> have 100 million.
> >
> A wrapper doesn't mean you have in-memory objects for all the triples
> of your store, that's absurd. But if your code deals with some resources at
> runtime, these resources are represented by object instances which contain at
> least a pointer to the resource's location in RAM. So the overhead of a
> wrapper is linear in the amount of RAM the application would need anyway
> and independent of the size of the triple store. Besides, I would like to
> compare possible APIs here; ideally the best API would be largely adopted,
> making wrappers superfluous. (I could also mention that the Jena Model class
> also wraps a Graph instance.)
>
>
> >
> > > It's a common misconception to think that java sets are limited to
> 2^31-1
> > > elements, but even that would be more than 100 millions. In the
> > challenge I
> > > didn't ask for time complexity, it would be fair to ask for that too if
> > you
> > > want to analyze scenarios with such big number of triples.
> >
> > It is a common misconception that just because you have a 64bit
> > architecture you also have 2^64 bits of memory available. And it is a
> > common misconception that in-memory data representation means you do not
> > need to take into account storage structures like indexes. Even if you
> > represent this amount of data in memory, you will run into the same
> problem.
> >
> > 95% of all RDF scenarios will require persistent storage. Selecting a
> > scenario that does not take this into account is useless.
> >
>
> I don't know where your RAM fixation comes from. My usecase doesn't
> mandate in-memory storage in any way. The 2^31-1 misconception comes not
> from 32-bit architectures but from the fact that Set.size() is defined to
> return an int value (i.e. a maximum of 2^31-1), while the API is clear that a
> Set can be bigger than that. And again, other usecases are welcome; let's
> look at how they can be implemented with different APIs, how elegant the
> solutions are, what their runtime properties are and, of course, how relevant
> the usecases are, to find the most suitable API.
>
> Cheers,
> Reto
>
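
As a footnote to the Set.size() point above: java.util.Collection specifies
that size() returns Integer.MAX_VALUE when a collection contains more than
Integer.MAX_VALUE elements, so the int return type does not cap how many
elements a Set may hold. A minimal sketch (the long-valued count() of the
backing store is hypothetical):

import java.util.AbstractSet;

abstract class HugeTripleSet<T> extends AbstractSet<T> {

    // hypothetical backing store able to report counts beyond 2^31-1
    protected abstract long count();

    @Override
    public int size() {
        long n = count();
        // per the java.util.Collection contract: clamp rather than overflow
        return n > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) n;
    }
}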

Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Reto Bachmann-Gmür <re...@apache.org>.
On Wed, Nov 14, 2012 at 8:32 PM, Sebastian Schaffert <
sebastian.schaffert@salzburgresearch.at> wrote:

>
> Am 13.11.2012 um 14:50 schrieb Reto Bachmann-Gmür:
>
> > On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
> > sebastian.schaffert@salzburgresearch.at> wrote:
> > [...]
> >
> >>
> >> Despite the solution I described, I still do not think the scenario is
> >> well suited for evaluating RDF APIs. You also do not use Hibernate to
> >> evaluate whether an RDBMS is good or not.
> >>
> > The usecase I propose and I don't think this is the only one, I just
> think
> > that API comparison should be based on evaluating their suitability for
> > different concretely defined usecases. It has nothing to do with
> > hibernation neither with annotation based object to rdf property mapping
> > (as there have been several proposals). Its the same principle of any23
> or
> > aperture but not on the binary data level but on the java object level.
>
> The Java domain object level is one level of abstraction above the data
> representation/storage level. I was mentioning Hibernate as an example of a
> generic mapping between the java object level and the data representation
> level (even though in this case it is relational database, the same can be
> done for RDF). The Java object level does not really allow to draw good
> conclusions about the data representation level.
>

We are talking about an API for modelling the entities introduced by RDF
and the related specs. How implementations store the data, whether in RAM, in
quantum storage or engraved in stone, is completely irrelevant for this
discussion.


>
> > I have my instrastructure that deals with graphs I have the a Set of
> contacts
> > how does the missing bit look like to process this set with my rdf
> > infrastructure. Its a reality that people don't (yet) have all their data
> > as graphs, they might have some contacts in LDAP and some mails on an
> Imap
> > server.
>
>
> I showed you an example of annotation based object to RDF mapping to fill
> exactly that missing bit. This implementation works on any RDF API (we had
> it in Sesame, in KiWi, and now in the LMF) and has been done by several
> other people as well. It does not really help much in deciding how the RDF
> API itself should look like, though.
>
Exactly. That's why that discussion is by no means required in order to show
how the Toy-Usecase can be implemented with Jena, Sesame, Clerezza, Banana or
any other API.


>
> >
> >
> >>>>
> >>>> If this is really an issue, I would suggest coming up with a bigger
> >>>> collection of RDF API usage scenarios that are also relevant in
> practice
> >>>> (as proven by a software project using it). Including scenarios how to
> >> deal
> >>>> with bigger amounts of data (i.e. beyond toy examples). My scenarios
> >>>> typically include >= 100 million triples. ;-)
> >>>>
> >>>> In addition to what Andy said about wrapper APIs, I would also like to
> >>>> emphasise the incurred memory and computation overhead of wrapper
> APIs.
> >> Not
> >>>> an issue if you have only a handful of triples, but a big issue when
> you
> >>>> have 100 million.
> >>
> > A wrapper doesn't means you have an in memory objects for all your
> triples
> > of your store, that's absurd. But if your code deals with some resources
> at
> > runtime these resource are represented by object instances which contain
> at
> > least a pointer to the resource located of the RAM. So the overhead of a
> > wrapper is linear to the amount of RAM the application would need anyway
> > and independent of the size of the triple store.
>
> So in other words: instead of a server with 8GB I might need one with 10GB
> RAM, just because I decided using a wrapper instead of the native API. Or
> to put it differently: with the same server I can hold less objects in my
> in-memory cache, possibly sacrificing a lot of processing time. From my
> experience, it makes a big difference.
>
Well, then you probably shouldn't be using any higher-level language or
abstraction either. And 25%, which would be almost 8 months of waiting by
Moore's law, is in my view a huge exaggeration of the overhead.

But again, I'm not arguing in favour of wrappers; I want to discuss what the
best API should look like. Whether this API is then adopted by implementors,
and if not, whether you use a wrapper, wait a couple of months until the RAM
required for the overhead costs the same, invest a bit more now, or decide not
to use the best API in order to save RAM, is out of scope.



>
> > Besides I would like to
> > compare possible APIs here, ideally the best API would be largely adopted
> > making wrapper superfluous. (I could also mention that the jena Model
> class
> > also wraps a Graph instance)
>
> Agreed.
>
> >
> >
> >>
> >>> It's a common misconception to think that java sets are limited to
> 2^31-1
> >>> elements, but even that would be more than 100 millions. In the
> >> challenge I
> >>> didn't ask for time complexity, it would be fair to ask for that too if
> >> you
> >>> want to analyze scenarios with such big number of triples.
> >>
> >> It is a common misconception that just because you have a 64bit
> >> architecture you also have 2^64 bits of memory available. And it is a
> >> common misconception that in-memory data representation means you do not
> >> need to take into account storage structures like indexes. Even if you
> >> represent this amount of data in memory, you will run into the same
> problem.
> >>
> >> 95% of all RDF scenarios will require persistent storage. Selecting a
> >> scenario that does not take this into account is useless.
> >>
> >
> > I don't know where your RAM fixation comes from.
>
> I started programming with 64kbyte and grew up into Computer Science when
> "640kbyte ought to be enough for anyone" ;-)
>
> Joke aside, it comes from the real world use cases we are working on, e.g.
> a Linked Data and Semantic Search server at http://search.salzburg.com,
> representing about 1,2 million news articles as RDF, resulting in about 140
> million triples. It also comes from my experience with IkeWiki, which was a
> Semantic Wiki system completely built on RDF (using Jena at that time).
>
> The server the partner has provided us with for the Semantic Search has
> 3GB of RAM and is a virtual VMWare instance with not the best I/O
> performance. Importing all news articles on this server and processing them
> takes 2 weeks (after spending many days doing performance profiling with
> YourKit and identifying bottlenecks and unnecessary overheads like wrappers
> or proxy classes). If I have a wrapper implementation inbetween, even
> lightweight, maybe just takes 10% more, i.e. 1,5 days! The performance
> overhead clearly matters.
>
> In virtually all my RDF projects of the last 10-12 years, the CENTRAL
> issues were always efficient/effective/reliable/convenient storage and
> efficient/effective/reliable/convenient querying (in parallel
> environments). These are the criteria an RDF API should IMHO be evaluated
> against.

If an API is designed in a way that its implementations are necessarily less
performant than implementations of other APIs that can be used to solve the
same usecase, then that's a strong argument against that API.



> In my personal experience, the data model and repository API of Sesame was
> the best choice to cover these scenarios in all different kinds of use
> cases I had so far (small data and big data). It was also the most flexible
> option, because of its consistent use of interfaces and modular choice of
> backends. Jena comes close, but did not yet go through the architectural
> changes (i.e. interface based data model) that Sesame already did with the
> 2.x series. Clerezza so far is not a real option to achieve my goals. It is
> good and convenient when working with small in-memory representations of
> graphs, but (as we discussed before) lacks for me important persistence and
> querying features. If I am purely interested in Sets of triples, guess
> what: I create a Java Set and put triples in it. For example, we even have
> an extended set with a (limited) query index support [1], which I created
> out of realizing that we spent a considerable time just iterating
> unnecessarily over sets. No need for a new API.
>
java.util.Set by itself is a poor API for triples. Besides being incomplete,
as it doesn't define what triples and resources look like, it doesn't support
a way to filter triples by a triple pattern. Furthermore, the identity of
graphs is defined differently from that of sets. The Clerezza API
extends the Collection API (a Graph is not a Set) so that the API can be
used for 120 as well as for 120 billion triples.
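
To make the triple-pattern point concrete, here is a sketch of what such an
extension of the Collection API could look like; the Triple, Resource and IRI
stubs and the null-as-wildcard convention are illustrative assumptions, not a
quote of the actual Clerezza interface:

import java.util.Collection;
import java.util.Iterator;

// minimal stubs so the sketch is self-contained
interface Resource { }
interface IRI extends Resource { }
interface Triple {
    Resource getSubject();
    IRI getPredicate();
    Resource getObject();
}

interface TripleCollection extends Collection<Triple> {
    /**
     * Returns the triples matching the pattern; a null argument acts as a
     * wildcard. A store-backed implementation can answer this from an index
     * instead of iterating over the whole collection.
     */
    Iterator<Triple> filter(Resource subject, IRI predicate, Resource object);
}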



>
> [1]
> http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/model/table/TripleTable.java
>
> > My usecases doesn't mandate in memory storage in any way. The 2^31-1
> misconception comes not
> > from 32bit architecture but from the fact that Set.size() is defined to
> > return an int value (i.e. a maximum of 2^31-1) but the API is clear that
> a
> > Set can be bigger than that.
>
> I did not come up with any 2^31 misconception. And *of course* the 2^31-1
> topic is originally caused by 32 bit architectures, because this is why
> integer (in Java) is defined as 32bit (the size you can store in a
> processor register so simple computations only require a single instruction
> of the processor). And the fact that Java is using 32bit ints for many
> things DOES cause problems, as Rupert can tell you from experience: it
> might e.g. happen that two completely different objects share the same hash
> code, because the hash code is an integer while the memory address is a
> long.
>
> What I was referring to is that regardless the amount of memory you have,
> persistence and querying is the core functionality of any RDF API. The use
> cases where you are working with RDF data and don't need persistence are
> rare (serializing and deserializing domain objects via RDF comes to my
> mind) and for consistency reasons I prefer treating them in the same way as
> the persistent cases,

I agree so far. But what does this have to do with the usecase? The usecase
never says that the data has to be in memory.


> even if it means that I have to deal with persistence concepts (e.g.
> repository connections or transactions) without direct need. On the other
> hand, persistence comes with some important requirements, which are known
> for long and summarized in the ACID principles, and which need to be
> satisfied by an RDF API.
>
No, full ACID support is a requirement in some situations but definitely not
in every situation where you have large amounts of data. It's a typical
enterprise requirement, in which case you probably also want your
transactions to span different systems rather than be confined to the RDF
repository, and are happy to use technologies like JTA.
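
As a rough illustration of that separation, demarcating an RDF update with the
standard JTA UserTransaction interface might look as follows; the repository
and graph parameters are hypothetical placeholders, only the transaction
boundary is the point:

import javax.transaction.UserTransaction;

void importPost(UserTransaction tx, Object repository, Object enrichedGraph)
        throws Exception {
    tx.begin();
    try {
        // hypothetical call: add the crawled and enriched triples to the store
        // repository.addAll(enrichedGraph);
        tx.commit();
    } catch (Exception e) {
        tx.rollback(); // undo partial changes if processing failed mid-way
        throw e;
    }
}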


>
> > And again other usecase are welcome, lets
> > look at how they can be implemented with different APIs, how elegant the
> > solutions are, what they runtime properties are and of course how
> relevant
> > the usecases are to find the most suitable API.
>
>
> Ok, my challenges (from a real project):
> - I want to be able to run a crawler over skiing forums, extract the
> topics, posts, and user information from them, perform a POS tagging and
> sentiment analysis and store the results together with the post content in
> my RDF repository;
>
Ok. What exactly would you like to see? You get some graph or graphs from
the crawler, have these graphs enriched by the POS tagger and analyzer, and
do myRepo.addAll(enrichedGraph) at the end. Maybe you could strip down the
usecase to the relevant parts and show me the solution in your favourite API;
then I translate it to Clerezza and we can see what is missing.


> - in case one of the processes inbetween fails (e.g. due to a network
> error), I want to properly roll back all changes made to the repository
> while processing this particular post or topic
>

Ok, probably the crawler should roll back as well. So this sounds like a
usecase for JTA, which is orthogonal to the RDF API.


> - I want to expose this dataset (with 10 million posts and 1 billion
> triples) as Linked Data, possibly taking into account a big number of
> parallel requests on that data (e.g. while Linked Data researchers are
> preparing their articles for ISWC)
>
- I want to run complex aggregate queries over big datasets (while the
> crawling process is still running!), e.g. "give me all forum posts out of a
> set of 10 million on skiing that are concerned with 'carving skiing' with
> an average sentiment of >0.5 for mentionings of the noun phrase 'Atomic
> Racer SL6' and display for each the number of replies in the forum topic"
>
And you don't just want to pass a SPARQL query but would like to have
defined combined indexes via the API beforehand, is that the challenge?
(Clerezza has this as an extension on top, but wouldn't it be better to focus
on the core API first?)



> - I want to store a SKOS thesaurus on skiing in a separate named graph and
> run queries over the combination of the big data set of posts and the small
> thesaurus (e.g. to get the labels of concepts instead of the URI)
>
Isn't this just SPARQL?
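
For example, a query along the following lines would combine the two named
graphs; the graph names and the forum vocabulary are made-up placeholders:

// made-up graph names and forum vocabulary, purely for illustration
String query =
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n" +
    "PREFIX ex:   <http://example.org/forum#>\n" +
    "SELECT ?post ?label\n" +
    "WHERE {\n" +
    "  GRAPH <http://example.org/graphs/posts> { ?post ex:topic ?concept }\n" +
    "  GRAPH <http://example.org/graphs/ski-thesaurus> { ?concept skos:prefLabel ?label }\n" +
    "}";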


> - I want to have a configurable rule-based reasoner where I can add simple
> rules like a "broaderTransitive" rule for the SKOS broader relationship; it
> has to run on 1 billion triples
>
Ok, a useful feature that goes beyond modelling the RDF specs in Java. In the
interest of modularity of the API I would suggest first focusing on
usecases at the level of the spec family around RDF. Or do any of the APIs
you mentioned (Jena, Sesame or Clerezza) support such a feature?


> - I want to repeat the crawling process every X days, possibly updating
> post data in case something has changed, even while another crawling
> process is running and another user is running a complex query
>
Again, I don't see the API requirement here. Could you maybe describe it from
the client perspective: "the client has to be able to tell when a data update
transaction involving multiple operations starts and when it ends; before it
ends, other clients shall see the data without any modification..." If that's
the requirement, we would be back to the transaction support requirements, and
thus back to what could probably be solved with JTA.


>
> With the same API model (i.e. without learning a different API), I also
> want to:
> - with a few lines import a small RDF document into memory to run some
> small tests
> - take a bunch of triples and serialize them as RDF/XML or N3
>

Sure.

It would be handy if you could boil down the hard part you want to address so
that the example code in your favourite API fits on a page and we can
compare it with other design alternatives.

Cheers,
Reto

Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Reto Bachmann-Gmür <re...@apache.org>.
On Wed, Nov 14, 2012 at 8:32 PM, Sebastian Schaffert <
sebastian.schaffert@salzburgresearch.at> wrote:

>
> Am 13.11.2012 um 14:50 schrieb Reto Bachmann-Gmür:
>
> > On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
> > sebastian.schaffert@salzburgresearch.at> wrote:
> > [...]
> >
> >>
> >> Despite the solution I described, I still do not think the scenario is
> >> well suited for evaluating RDF APIs. You also do not use Hibernate to
> >> evaluate whether an RDBMS is good or not.
> >>
> > The usecase I propose and I don't think this is the only one, I just
> think
> > that API comparison should be based on evaluating their suitability for
> > different concretely defined usecases. It has nothing to do with
> > hibernation neither with annotation based object to rdf property mapping
> > (as there have been several proposals). Its the same principle of any23
> or
> > aperture but not on the binary data level but on the java object level.
>
> The Java domain object level is one level of abstraction above the data
> representation/storage level. I was mentioning Hibernate as an example of a
> generic mapping between the java object level and the data representation
> level (even though in this case it is relational database, the same can be
> done for RDF). The Java object level does not really allow to draw good
> conclusions about the data representation level.
>

We are talking about an API for modelling the entities introduced by RDF
and related specs. How implementation store the data if in ram, quantum
storages or engraved in stone is just completely irrevant for this
discussion.


>
> > I have my instrastructure that deals with graphs I have the a Set of
> contacts
> > how does the missing bit look like to process this set with my rdf
> > infrastructure. Its a reality that people don't (yet) have all their data
> > as graphs, they might have some contacts in LDAP and some mails on an
> Imap
> > server.
>
>
> I showed you an example of annotation based object to RDF mapping to fill
> exactly that missing bit. This implementation works on any RDF API (we had
> it in Sesame, in KiWi, and now in the LMF) and has been done by several
> other people as well. It does not really help much in deciding how the RDF
> API itself should look like, though.
>
Exactly. That's why the discussion is by no means required to show how the
Toy-Usecase can be implemented with Jena, Sesame, Clerezza, Banana or XY
API.


>
> >
> >
> >>>>
> >>>> If this is really an issue, I would suggest coming up with a bigger
> >>>> collection of RDF API usage scenarios that are also relevant in
> practice
> >>>> (as proven by a software project using it). Including scenarios how to
> >> deal
> >>>> with bigger amounts of data (i.e. beyond toy examples). My scenarios
> >>>> typically include >= 100 million triples. ;-)
> >>>>
> >>>> In addition to what Andy said about wrapper APIs, I would also like to
> >>>> emphasise the incurred memory and computation overhead of wrapper
> APIs.
> >> Not
> >>>> an issue if you have only a handful of triples, but a big issue when
> you
> >>>> have 100 million.
> >>
> > A wrapper doesn't means you have an in memory objects for all your
> triples
> > of your store, that's absurd. But if your code deals with some resources
> at
> > runtime these resource are represented by object instances which contain
> at
> > least a pointer to the resource located of the RAM. So the overhead of a
> > wrapper is linear to the amount of RAM the application would need anyway
> > and independent of the size of the triple store.
>
> So in other words: instead of a server with 8GB I might need one with 10GB
> RAM, just because I decided using a wrapper instead of the native API. Or
> to put it differently: with the same server I can hold less objects in my
> in-memory cache, possibly sacrificing a lot of processing time. From my
> experience, it makes a big difference.
>
Well then you probably shouldn't be using any higher level language or
abstraction. 25% which would be almost 8 months waiting by Moore's law I
think is a huge exaggeration of the overhead.

But again I'm not arguing in favour of wrappers, I want to discuss how the
best API should look like. If this API is then adopted by implementor and
if not if you use a wrapper, wait a couple of months to have the ram
required for the overhead at the same price, invest a bit more now or
decide not to use the best API in for saving RAM is out of scope.



>
> > Besides I would like to
> > compare possible APIs here, ideally the best API would be largely adopted
> > making wrapper superfluous. (I could also mention that the jena Model
> class
> > also wraps a Graph instance)
>
> Agreed.
>
> >
> >
> >>
> >>> It's a common misconception to think that java sets are limited to
> 231-1
> >>> elements, but even that would be more than 100 millions. In the
> >> challenge I
> >>> didn't ask for time complexity, it would be fair to ask for that too if
> >> you
> >>> want to analyze scenarios with such big number of triples.
> >>
> >> It is a common misconception that just because you have a 64bit
> >> architecture you also have 2^64 bits of memory available. And it is a
> >> common misconception that in-memory data representation means you do not
> >> need to take into account storage structures like indexes. Even if you
> >> represent this amount of data in memory, you will run into the same
> problem.
> >>
> >> 95% of all RDF scenarios will require persistent storage. Selecting a
> >> scenario that does not take this into account is useless.
> >>
> >
> > I don't know where your RAM fixation comes from.
>
> I started programming with 64kbyte and grew up into Computer Science when
> "640kbyte ought to be enough for anyone" ;-)
>
> Joke aside, it comes from the real world use cases we are working on, e.g.
> a Linked Data and Semantic Search server at http://search.salzburg.com,
> representing about 1,2 million news articles as RDF, resulting in about 140
> million triples. It also comes from my experience with IkeWiki, which was a
> Semantic Wiki system completely built on RDF (using Jena at that time).
>
> The server the partner has provided us with for the Semantic Search has
> 3GB of RAM and is a virtual VMWare instance with not the best I/O
> performance. Importing all news articles on this server and processing them
> takes 2 weeks (after spending many days doing performance profiling with
> YourKit and identifying bottlenecks and unnecessary overheads like wrappers
> or proxy classes). If I have a wrapper implementation inbetween, even
> lightweight, maybe just takes 10% more, i.e. 1,5 days! The performance
> overhead clearly matters.
>
> In virtually all my RDF projects of the last 10-12 years, the CENTRAL
> issues were always efficient/effective/reliable/convenient storage and
> efficient/effective/reliable/convenient querying (in parallel
> environments). These are the criteria an RDF API should IMHO be evaluated
> against.

It an API is designed in a way that implementations are necessarily less
perfomant than implementation of other API than can used to solve the same
usecase than that's a strong argument against an API-



> In my personal experience, the data model and repository API of Sesame was
> the best choice to cover these scenarios in all different kinds of use
> cases I had so far (small data and big data). It was also the most flexible
> option, because of its consistent use of interfaces and modular choice of
> backends. Jena comes close, but did not yet go through the architectural
> changes (i.e. interface based data model) that Sesame already did with the
> 2.x series. Clerezza so far is not a real option to achieve my goals. It is
> good and convenient when working with small in-memory representations of
> graphs, but (as we discussed before) lacks for me important persistence and
> querying features. If I am purely interested in Sets of triples, guess
> what: I create a Java Set and put triples in it. For example, we even have
> an extended set with a (limited) query index support [1], which I created
> out of realizing that we spent a considerable time just iterating
> unnecessarily over sets. No need for a new API.
>
java.util.Set by itself is a poor API for triples. Besides being incomplete
as it doesn't define how triples and resources look like it doesn't support
a way to filter triples with a triple pattern. Furthermore the identity of
graphs is defined differently than the one of sets. The clerezza API
extends the Collection API (a Graph is not a set) so that the API can be
used for for 120 as well as for 120 billions triples.



>
> [1]
> http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/model/table/TripleTable.java
>
> > My usecases doesn't mandate in memory storage in any way. The 2^31-1
> misconception comes not
> > from 32bit architecture but from the fact that Set.size() is defined to
> > return an int value (i.e. a maximum of 2^31-1) but the API is clear that
> a
> > Set can be bigger than that.
>
> I did not come up with any 2^31 misconception. And *of course* the 2^31-1
> topic is originally caused by 32 bit architectures, because this is why
> integer (in Java) is defined as 32bit (the size you can store in a
> processor register so simple computations only require a single instruction
> of the processor). And the fact that Java is using 32bit ints for many
> things DOES cause problems, as Rupert can tell you from experience: it
> might e.g. happen that two completely different objects share the same hash
> code, because the hash code is an integer while the memory address is a
> long.
>
> What I was referring to is that regardless the amount of memory you have,
> persistence and querying is the core functionality of any RDF API. The use
> cases where you are working with RDF data and don't need persistence are
> rare (serializing and deserializing domain objects via RDF comes to my
> mind) and for consistency reasons I prefer treating them in the same way as
> the persistent cases,

I agree so far. But what does this have to do with the usecase? the usecase
never says that the data should be in memory.


> even if it means that I have to deal with persistence concepts (e.g.
> repository connections or transactions) without direct need. On the other
> hand, persistence comes with some important requirements, which are known
> for long and summarized in the ACID principles, and which need to be
> satisfied by an RDF API.
>
No full ACID support is requirement in some situations but definitively not
in all situation where you have large amount of data. It's a typical
enterprise requirement in which case you probably also want your
transaction to span different systems and not be confined to the RDF
repository and are happy to technologies like JTA.


>
> > And again other usecase are welcome, lets
> > look at how they can be implemented with different APIs, how elegant the
> > solutions are, what they runtime properties are and of course how
> relevant
> > the usecases are to find the most suitable API.
>
>
> Ok, my challenges (from a real project):
> - I want to be able to run a crawler over skiing forums, extract the
> topics, posts, and user information from them, perform a POS tagging and
> sentiment analysis and store the results together with the post content in
> my RDF repository;
>
Ok. What exactly would you like to see? You get some graph or graphs from
the crawler, have these graphs enriched by the POS tagger and the sentiment
analyzer, and do myRepo.addAll(enrichedGraph) at the end (a minimal sketch of
that last step is below). Maybe you could strip the use case down to the
relevant parts and show me the solution in your favourite API; I then
translate it to Clerezza and we can see what is missing?
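
For illustration, the shape of that last step (the MGraph and TripleCollection
types follow the Clerezza core API as I recall it; the crawler and the
enrichers are assumed to exist upstream and are not shown):

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.TripleCollection;

public class IngestSketch {
  /**
   * The crawler and the enrichers have already produced an enriched graph;
   * storing it needs nothing beyond the collection-style addAll.
   */
  public static void store(MGraph repo, TripleCollection enrichedGraph) {
    repo.addAll(enrichedGraph);
  }
}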


> - in case one of the processes inbetween fails (e.g. due to a network
> error), I want to properly roll back all changes made to the repository
> while processing this particular post or topic
>

Ok, probably the crawler should roll back as well. So this sounds like a use
case for JTA, which is orthogonal to the RDF API.
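
A sketch of how such a rollback could be driven by JTA rather than by the RDF
API itself (this assumes a container exposing javax.transaction.UserTransaction
under the standard JNDI name; the actual repository update is only hinted at):

import javax.naming.InitialContext;
import javax.transaction.UserTransaction;

public class CrawlStepSketch {
  public void processPost(Object post) throws Exception {
    UserTransaction tx = (UserTransaction) new InitialContext()
        .lookup("java:comp/UserTransaction");
    tx.begin();
    try {
      // enrich the post and write the resulting triples to the repository
      // (omitted; this is where the RDF API is actually used)
      tx.commit();
    } catch (Exception e) {
      // a network error or any other failure undoes the whole step
      tx.rollback();
      throw e;
    }
  }
}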


> - I want to expose this dataset (with 10 million posts and 1 billion
> triples) as Linked Data, possibly taking into account a big number of
> parallel requests on that data (e.g. while Linked Data researchers are
> preparing their articles for ISWC)
>
> - I want to run complex aggregate queries over big datasets (while the
> crawling process is still running!), e.g. "give me all forum posts out of a
> set of 10 million on skiing that are concerned with 'carving skiing' with
> an average sentiment of >0.5 for mentionings of the noun phrase 'Atomic
> Racer SL6' and display for each the number of replies in the forum topic"
>
And you don't just want to pass a SPARQL query but would like to have defined
combined indexes via the API beforehand, is that the challenge? (Clerezza has
this as an extension on top, but wouldn't it be better to focus on the core
API first?)



> - I want to store a SKOS thesaurus on skiing in a separate named graph and
> run queries over the combination of the big data set of posts and the small
> thesaurus (e.g. to get the labels of concepts instead of the URI)
>
Isn't this just SPARQL?
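
For illustration, the kind of query this point suggests, phrased here against
the Sesame 2 repository API (one of the APIs under comparison; package and
method names are from memory, and the graph names are made up):

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class LabelQuerySketch {
  public static void printLabels(Repository repo) throws Exception {
    String query =
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
      + "SELECT ?post ?label WHERE { "
      + "  GRAPH <http://example.org/posts> "
      + "    { ?post <http://example.org/topic> ?concept } "
      + "  GRAPH <http://example.org/thesaurus> "
      + "    { ?concept skos:prefLabel ?label } "
      + "}";
    RepositoryConnection con = repo.getConnection();
    try {
      TupleQuery tupleQuery = con.prepareTupleQuery(QueryLanguage.SPARQL, query);
      TupleQueryResult result = tupleQuery.evaluate();
      while (result.hasNext()) {
        BindingSet row = result.next();
        System.out.println(row.getValue("post") + " " + row.getValue("label"));
      }
    } finally {
      con.close();
    }
  }
}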


> - I want to have a configurable rule-based reasoner where I can add simple
> rules like a "broaderTransitive" rule for the SKOS broader relationship; it
> has to run on 1 billion triples
>
Ok, a useful feature that goes beyond modelling the RDF specs in Java. In the
interest of modularity of the API I would suggest first focusing on use cases
at the level of the spec family around RDF. Or does any of the APIs you
mentioned (Jena, Sesame or Clerezza) support such a feature?


> - I want to repeat the crawling process every X days, possibly updating
> post data in case something has changed, even while another crawling
> process is running and another user is running a complex query
>
Again, I don't see the API requirement here. Could you maybe describe it from
the client perspective: "the client has to be able to tell when a data update
transaction involving multiple operations starts and when it ends; before it
ends, other clients shall see the data without any modification..."? If that's
the requirement, we would be back to the transaction-support requirements and
so back to what could probably be solved with JTA.


>
> With the same API model (i.e. without learning a different API), I also
> want to:
> - with a few lines import a small RDF document into memory to run some
> small tests
> - take a bunch of triples and serialize them as RDF/XML or N3
>

Sure.
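
For the in-memory import and serialisation part, a sketch in Jena (again one
of the APIs under comparison; method names are from memory and the file name
is made up):

import java.io.FileInputStream;
import java.io.InputStream;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class RoundTripSketch {
  public static void main(String[] args) throws Exception {
    Model model = ModelFactory.createDefaultModel();
    InputStream in = new FileInputStream("small-test.rdf");
    model.read(in, null);           // parse RDF/XML into memory
    model.write(System.out, "N3");  // serialize the same triples as N3
  }
}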

It would be handy if you could boil the hard part you want to address down so
that the example code in your favourite API fits on a page and we can then
compare it with other design alternatives.

Cheers,
Reto

Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Sebastian Schaffert <se...@salzburgresearch.at>.
On 13.11.2012 at 14:50, Reto Bachmann-Gmür wrote:

> On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
> sebastian.schaffert@salzburgresearch.at> wrote:
> [...]
> 
>> 
>> Despite the solution I described, I still do not think the scenario is
>> well suited for evaluating RDF APIs. You also do not use Hibernate to
>> evaluate whether an RDBMS is good or not.
>> 
> The usecase I propose and I don't think this is the only one, I just think
> that API comparison should be based on evaluating their suitability for
> different concretely defined usecases. It has nothing to do with
> hibernation neither with annotation based object to rdf property mapping
> (as there have been several proposals). Its the same principle of any23 or
> aperture but not on the binary data level but on the java object level.

The Java domain object level is one level of abstraction above the data representation/storage level. I was mentioning Hibernate as an example of a generic mapping between the Java object level and the data representation level (even though in that case it is a relational database, the same can be done for RDF). The Java object level does not really allow one to draw good conclusions about the data representation level.

> I have my instrastructure that deals with graphs I have the a Set of contacts
> how does the missing bit look like to process this set with my rdf
> infrastructure. Its a reality that people don't (yet) have all their data
> as graphs, they might have some contacts in LDAP and some mails on an Imap
> server.


I showed you an example of annotation-based object-to-RDF mapping to fill exactly that missing bit. This implementation works on any RDF API (we had it in Sesame, in KiWi, and now in the LMF) and has been done by several other people as well. It does not really help much in deciding how the RDF API itself should look, though.
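
For readers who have not seen such a mapping, a sketch of the idea (the @RDF annotation and the Person interface below are made up for illustration; the real KiWi/LMF facading classes are linked further down in this thread):

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class FacadeSketch {

  // Hypothetical annotation: binds a getter to an RDF property URI.
  @Retention(RetentionPolicy.RUNTIME)
  public @interface RDF {
    String value();
  }

  // A domain-level view on a resource; at runtime an invocation handler
  // would turn getGivenName() into a lookup of the annotated property.
  public interface Person {
    @RDF("http://xmlns.com/foaf/0.1/givenName")
    String getGivenName();

    @RDF("http://xmlns.com/foaf/0.1/familyName")
    String getLastName();
  }
}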

> 
> 
>>>> 
>>>> If this is really an issue, I would suggest coming up with a bigger
>>>> collection of RDF API usage scenarios that are also relevant in practice
>>>> (as proven by a software project using it). Including scenarios how to
>> deal
>>>> with bigger amounts of data (i.e. beyond toy examples). My scenarios
>>>> typically include >= 100 million triples. ;-)
>>>> 
>>>> In addition to what Andy said about wrapper APIs, I would also like to
>>>> emphasise the incurred memory and computation overhead of wrapper APIs.
>> Not
>>>> an issue if you have only a handful of triples, but a big issue when you
>>>> have 100 million.
>> 
> A wrapper doesn't means you have an in memory objects for all your triples
> of your store, that's absurd. But if your code deals with some resources at
> runtime these resource are represented by object instances which contain at
> least a pointer to the resource located of the RAM. So the overhead of a
> wrapper is linear to the amount of RAM the application would need anyway
> and independent of the size of the triple store.

So in other words: instead of a server with 8GB I might need one with 10GB of RAM, just because I decided to use a wrapper instead of the native API. Or to put it differently: with the same server I can hold fewer objects in my in-memory cache, possibly sacrificing a lot of processing time. From my experience, it makes a big difference.

> Besides I would like to
> compare possible APIs here, ideally the best API would be largely adopted
> making wrapper superfluous. (I could also mention that the jena Model class
> also wraps a Graph instance)

Agreed.

> 
> 
>> 
>>> It's a common misconception to think that java sets are limited to 231-1
>>> elements, but even that would be more than 100 millions. In the
>> challenge I
>>> didn't ask for time complexity, it would be fair to ask for that too if
>> you
>>> want to analyze scenarios with such big number of triples.
>> 
>> It is a common misconception that just because you have a 64bit
>> architecture you also have 2^64 bits of memory available. And it is a
>> common misconception that in-memory data representation means you do not
>> need to take into account storage structures like indexes. Even if you
>> represent this amount of data in memory, you will run into the same problem.
>> 
>> 95% of all RDF scenarios will require persistent storage. Selecting a
>> scenario that does not take this into account is useless.
>> 
> 
> I don't know where your RAM fixation comes from.

I started programming with 64kbyte and grew up into Computer Science when "640kbyte ought to be enough for anyone" ;-)

Joke aside, it comes from the real-world use cases we are working on, e.g. a Linked Data and Semantic Search server at http://search.salzburg.com, representing about 1.2 million news articles as RDF, resulting in about 140 million triples. It also comes from my experience with IkeWiki, which was a Semantic Wiki system completely built on RDF (using Jena at that time).

The server the partner has provided us with for the Semantic Search has 3GB of RAM and is a virtual VMWare instance with not the best I/O performance. Importing all news articles on this server and processing them takes 2 weeks (after spending many days doing performance profiling with YourKit and identifying bottlenecks and unnecessary overheads like wrappers or proxy classes). If I have a wrapper implementation in between, even a lightweight one that takes maybe just 10% more, that is 1.5 extra days! The performance overhead clearly matters.

In virtually all my RDF projects of the last 10-12 years, the CENTRAL issues were always efficient/effective/reliable/convenient storage and efficient/effective/reliable/convenient querying (in parallel environments). These are the criteria an RDF API should IMHO be evaluated against. In my personal experience, the data model and repository API of Sesame was the best choice to cover these scenarios in all different kinds of use cases I had so far (small data and big data). It was also the most flexible option, because of its consistent use of interfaces and modular choice of backends. Jena comes close, but did not yet go through the architectural changes (i.e. interface based data model) that Sesame already did with the 2.x series. Clerezza so far is not a real option to achieve my goals. It is good and convenient when working with small in-memory representations of graphs, but (as we discussed before) lacks for me important persistence and querying features. If I am purely interested in Sets of triples, guess what: I create a Java Set and put triples in it. For example, we even have an extended set with a (limited) query index support [1], which I created out of realizing that we spent a considerable time just iterating unnecessarily over sets. No need for a new API.

[1] http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/model/table/TripleTable.java 

> My usecases doesn't mandate in memory storage in any way. The 2^31-1 misconception comes not
> from 32bit architecture but from the fact that Set.size() is defined to
> return an int value (i.e. a maximum of 2^31-1) but the API is clear that a
> Set can be bigger than that.  

I did not come up with any 2^31 misconception. And *of course* the 2^31-1 topic is originally caused by 32 bit architectures, because this is why integer (in Java) is defined as 32bit (the size you can store in a processor register so simple computations only require a single instruction of the processor). And the fact that Java is using 32bit ints for many things DOES cause problems, as Rupert can tell you from experience: it might e.g. happen that two completely different objects share the same hash code, because the hash code is an integer while the memory address is a long.

What I was referring to is that regardless the amount of memory you have, persistence and querying is the core functionality of any RDF API. The use cases where you are working with RDF data and don't need persistence are rare (serializing and deserializing domain objects via RDF comes to my mind) and for consistency reasons I prefer treating them in the same way as the persistent cases, even if it means that I have to deal with persistence concepts (e.g. repository connections or transactions) without direct need. On the other hand, persistence comes with some important requirements, which are known for long and summarized in the ACID principles, and which need to be satisfied by an RDF API.

> And again other usecase are welcome, lets
> look at how they can be implemented with different APIs, how elegant the
> solutions are, what they runtime properties are and of course how relevant
> the usecases are to find the most suitable API.


Ok, my challenges (from a real project):
- I want to be able to run a crawler over skiing forums, extract the topics, posts, and user information from them, perform a POS tagging and sentiment analysis and store the results together with the post content in my RDF repository;
- in case one of the processes in between fails (e.g. due to a network error), I want to properly roll back all changes made to the repository while processing this particular post or topic 
- I want to expose this dataset (with 10 million posts and 1 billion triples) as Linked Data, possibly taking into account a big number of parallel requests on that data (e.g. while Linked Data researchers are preparing their articles for ISWC) 
- I want to run complex aggregate queries over big datasets (while the crawling process is still running!), e.g. "give me all forum posts out of a set of 10 million on skiing that are concerned with 'carving skiing' with an average sentiment of >0.5 for mentions of the noun phrase 'Atomic Racer SL6' and display for each the number of replies in the forum topic"
- I want to store a SKOS thesaurus on skiing in a separate named graph and run queries over the combination of the big data set of posts and the small thesaurus (e.g. to get the labels of concepts instead of the URI)
- I want to have a configurable rule-based reasoner where I can add simple rules like a "broaderTransitive" rule for the SKOS broader relationship; it has to run on 1 billion triples
- I want to repeat the crawling process every X days, possibly updating post data in case something has changed, even while another crawling process is running and another user is running a complex query

With the same API model (i.e. without learning a different API), I also want to:
- with a few lines import a small RDF document into memory to run some small tests
- take a bunch of triples and serialize them as RDF/XML or N3


Cheers, ;-)


Sebastian
-- 
| Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg


Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Reto Bachmann-Gmür <re...@apache.org>.
On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
sebastian.schaffert@salzburgresearch.at> wrote:
[...]

>
> Despite the solution I described, I still do not think the scenario is
> well suited for evaluating RDF APIs. You also do not use Hibernate to
> evaluate whether an RDBMS is good or not.
>
The use case is one I propose, and I don't think it is the only one; I just
think that API comparison should be based on evaluating the APIs' suitability
for different, concretely defined use cases. It has nothing to do with
Hibernate, nor with annotation-based object-to-RDF property mapping (of which
there have been several proposals). It's the same principle as any23 or
Aperture, but on the Java object level rather than on the binary data level. I
have my infrastructure that deals with graphs and I have a Set of contacts;
what does the missing bit look like that lets me process this set with my RDF
infrastructure? It's a reality that people don't (yet) have all their data as
graphs; they might have some contacts in LDAP and some mails on an IMAP
server.


> >>
> >> If this is really an issue, I would suggest coming up with a bigger
> >> collection of RDF API usage scenarios that are also relevant in practice
> >> (as proven by a software project using it). Including scenarios how to
> deal
> >> with bigger amounts of data (i.e. beyond toy examples). My scenarios
> >> typically include >= 100 million triples. ;-)
> >>
> >> In addition to what Andy said about wrapper APIs, I would also like to
> >> emphasise the incurred memory and computation overhead of wrapper APIs.
> Not
> >> an issue if you have only a handful of triples, but a big issue when you
> >> have 100 million.
>
A wrapper doesn't mean you have in-memory objects for all the triples of your
store; that would be absurd. But if your code deals with some resources at
runtime, these resources are represented by object instances which contain at
least a pointer to where the resource is located in RAM. So the overhead of a
wrapper is linear in the amount of RAM the application would need anyway and
independent of the size of the triple store. Besides, I would like to compare
possible APIs here; ideally the best API would be largely adopted, making
wrappers superfluous. (I could also mention that the Jena Model class also
wraps a Graph instance.)


>
> > It's a common misconception to think that java sets are limited to 231-1
> > elements, but even that would be more than 100 millions. In the
> challenge I
> > didn't ask for time complexity, it would be fair to ask for that too if
> you
> > want to analyze scenarios with such big number of triples.
>
> It is a common misconception that just because you have a 64bit
> architecture you also have 2^64 bits of memory available. And it is a
> common misconception that in-memory data representation means you do not
> need to take into account storage structures like indexes. Even if you
> represent this amount of data in memory, you will run into the same problem.
>
> 95% of all RDF scenarios will require persistent storage. Selecting a
> scenario that does not take this into account is useless.
>

I don't know where your RAM fixation comes from. My use case doesn't mandate
in-memory storage in any way. The 2^31-1 misconception comes not from 32-bit
architectures but from the fact that Set.size() is defined to return an int
value (i.e. a maximum of 2^31-1), while the API is clear that a Set can be
bigger than that. And again, other use cases are welcome; let's look at how
they can be implemented with different APIs, how elegant the solutions are,
what their runtime properties are and, of course, how relevant the use cases
are, in order to find the most suitable API.

Cheers,
Reto

Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Sebastian Schaffert <se...@salzburgresearch.at>.
On 13.11.2012 at 12:39, Reto Bachmann-Gmür wrote:

> Hi Sebastian,
> 
> On Tue, Nov 13, 2012 at 11:52 AM, Sebastian Schaffert <
> sebastian.schaffert@salzburgresearch.at> wrote:
> 
>> Hi Reto,
>> 
>> I don't understand the use case, and I don't think it is well suited for
>> comparing different RDF APIs.
>> 
> 
> Isn't that a slight contradiction? ;)

Only a slight contradiction. I don't understand why this really is a use case ;-)

> 
> Understanding: you have a set of contact objects, we don't care were they
> come fro or how many they are, we just have some contacts. Now we would
> like to deal with them as an RDF datasource.
> 
> Well Suitedness: An RDF application typically doesn't have the priviledge
> to have only graphs as inputs. It will have to deal with `Contact`S,
> `StockQuote`S and `WeatherForecasts`S having a wrapper on these objects
> that makes them RDF graphs is the first step to then allow processing with
> the generic RDF tools and e.g. merging with other RDF data .


I think we have two completely different concepts about RDF here. For me, it is purely a graph database, in a similar way as an RDBMS is a relational database, and it therefore should provide means to query in a graph way (i.e. listing edges and performing structural queries). So yes, an RDF application ONLY and EXCLUSIVELY deals with graphs.

You seem to want to treat it as an object repository, i.e. in a similar way Hibernate does on top of RDBMS. For me, this would mean adding an additional layer on top of graphs, and does not lend itself very well to evaluating the RDF API. 

Unfortunately, the way Java and RDF interpret objects are very different. Where Java assumes a fixed and pre-defined schema (i.e. a class or interface), RDF is a semi-structured format with no a-priori schema requirement. Where Java has (ordered) lists, RDF (without the reification concepts) only has unordered sets. Where an object of type A in Java will always be an A, in RDF the same object (resource) can be many things at the same time (e.g. a Concert, a Calendar Entry and a Location, simply different views on the same resource).

The way we solved this in KiWi and also in the LMF is through "facading", i.e. Java interfaces (e.g. [2]) that map getters/setters to RDF properties and are handled at runtime using a Java reflection invocation handler [1]. Note that this is a layer that is totally agnostic of the underlying RDF API, and complicated in any case since the Java and RDF concepts of "objects" do not go along very well with each other. Note that ELMO (from the Sesame people) implemented a very similar approach.

[1] http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/services/facading/LMFInvocationHandler.java
[2] http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/model/user/KiWiUser.java
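
To give a feel for the mechanism, a reduced sketch of such an invocation handler (this is not the actual LMFInvocationHandler linked above; the PropertyLookup backend interface and the @RDF annotation are made up for the example):

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class FacadingSketch {

  // Stand-in for the RDF backend: resolves one property of one resource.
  public interface PropertyLookup {
    String getProperty(String resourceUri, String propertyUri);
  }

  // Hypothetical annotation carrying the property URI, as in the facade idea.
  @Retention(RetentionPolicy.RUNTIME)
  public @interface RDF {
    String value();
  }

  @SuppressWarnings("unchecked")
  public static <T> T facade(final String resourceUri, Class<T> type,
      final PropertyLookup backend) {
    return (T) Proxy.newProxyInstance(type.getClassLoader(),
        new Class<?>[] { type },
        new InvocationHandler() {
          public Object invoke(Object proxy, Method method, Object[] args) {
            RDF rdf = method.getAnnotation(RDF.class);
            // Each annotated getter becomes a property lookup on the resource.
            return rdf == null ? null : backend.getProperty(resourceUri, rdf.value());
          }
        });
  }
}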

Despite the solution I described, I still do not think the scenario is well suited for evaluating RDF APIs. You also do not use Hibernate to evaluate whether an RDBMS is good or not.


> 
>> 
>> If this is really an issue, I would suggest coming up with a bigger
>> collection of RDF API usage scenarios that are also relevant in practice
>> (as proven by a software project using it). Including scenarios how to deal
>> with bigger amounts of data (i.e. beyond toy examples). My scenarios
>> typically include >= 100 million triples. ;-)
>> 
>> In addition to what Andy said about wrapper APIs, I would also like to
>> emphasise the incurred memory and computation overhead of wrapper APIs. Not
>> an issue if you have only a handful of triples, but a big issue when you
>> have 100 million.
>> 
> 
> It's a common misconception to think that java sets are limited to 231-1
> elements, but even that would be more than 100 millions. In the challenge I
> didn't ask for time complexity, it would be fair to ask for that too if you
> want to analyze scenarios with such big number of triples.

It is a common misconception that just because you have a 64bit architecture you also have 2^64 bits of memory available. And it is a common misconception that in-memory data representation means you do not need to take into account storage structures like indexes. Even if you represent this amount of data in memory, you will run into the same problem.

95% of all RDF scenarios will require persistent storage. Selecting a scenario that does not take this into account is useless.

> 
> 
>> A possible way to bypass the wrapper issue is the approach followed by
>> JDOM for XML, which we tried to use also in LDPath: abstract away the whole
>> data model and API using Java Generics. This is typically very efficient
>> (at runtime you are working with the native types), but it is also complex
>> and ugly (you end up with a big list of methods implementing delegation as
>> in
>> http://code.google.com/p/ldpath/source/browse/ldpath-api/src/main/java/at/newmedialab/ldpath/api/backend/RDFBackend.java
>> ).
>> 
> I think this only supported accessing graphs an not creation of grah
> objects, so I'm afraid you can't take the challenge with that one.
> 

In the implementation we have done, yes (to reduce the burden on the people implementing backends). It is, however, easy to apply the concept also to creating graphs.

> 
> 
>> 
>> My favorite way would ba a common interface-based model for RDF in Java,
>> implemented by different backends. This would require the involvement of at
>> least the Jena and the Sesame people. The Sesame model already comes close
>> to it, but of course also adds some concepts that are specific to Sesame
>> (e.g. the repository concept and the way contexts/named graphs are
>> handled), as we discussed some months ago.
>> 
> 
> Yes, that was the thread:
> http://mail-archives.apache.org/mod_mbox/incubator-stanbol-dev/201208.mbox/%3CCAMmeZRmQcQP1syT=ccDG=fSXHOQA4OcAvcrBkHTXritiwT353A@mail.gmail.com%3E
> 
> I think such an interface based common API is the goal, Let's compare the
> approaches we have. Le's create different usecase to see how the existing
> APIs compared, the challenge I posed is just a start.


I agree mostly, I just don't consider your use case very relevant, especially not as the "first challenge" for an RDF API. 

Greetings,

Sebastian
-- 
| Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg


Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Reto Bachmann-Gmür <re...@wymiwyg.com>.
Hi Sebastian,

On Tue, Nov 13, 2012 at 11:52 AM, Sebastian Schaffert <
sebastian.schaffert@salzburgresearch.at> wrote:

> Hi Reto,
>
> I don't understand the use case, and I don't think it is well suited for
> comparing different RDF APIs.
>

Isn't that a slight contradiction? ;)

Understanding: you have a set of contact objects; we don't care where they
come from or how many there are, we just have some contacts. Now we would like
to deal with them as an RDF data source.

Well-suitedness: an RDF application typically doesn't have the privilege of
having only graphs as inputs. It will have to deal with `Contact`s,
`StockQuote`s and `WeatherForecast`s; having a wrapper on these objects that
makes them RDF graphs is the first step towards processing them with the
generic RDF tools and e.g. merging them with other RDF data.
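
To make the toy use case tangible, a sketch of such an adapter with a
deliberately minimal, made-up Triple type (no real API is implied; the point
is that the view is backed by the Set and needs only constant extra space):

import java.util.AbstractCollection;
import java.util.Iterator;
import java.util.Set;

public class PersonGraphSketch {

  // Minimal stand-ins, not taken from any of the APIs under discussion.
  public interface Person {
    String getGivenName();
    String getLastName();
  }

  public static final class Triple {
    public final String subject, predicate, object;
    Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    public String toString() { return subject + " " + predicate + " \"" + object + "\""; }
  }

  /** A read-only triple view backed by the Set; later changes to the Set show up here. */
  public static AbstractCollection<Triple> getAsGraph(final Set<Person> persons) {
    return new AbstractCollection<Triple>() {
      public int size() {
        return persons.size() * 2; // two triples per person
      }
      public Iterator<Triple> iterator() {
        final Iterator<Person> it = persons.iterator();
        return new Iterator<Triple>() {
          private Person current;
          private int emitted; // triples already emitted for `current`
          public boolean hasNext() { return emitted == 1 || it.hasNext(); }
          public Triple next() {
            if (emitted != 1) { current = it.next(); emitted = 0; }
            // Simplified subject naming; a real adapter would mint a bnode.
            String subject = "_:" + current.getGivenName() + current.getLastName();
            return (emitted++ == 0)
                ? new Triple(subject, "foaf:givenName", current.getGivenName())
                : new Triple(subject, "foaf:familyName", current.getLastName());
          }
          public void remove() { throw new UnsupportedOperationException(); }
        };
      }
    };
  }
}

AbstractCollection already makes the view read-only (add throws
UnsupportedOperationException), and a concurrent change to the Set surfaces as
the usual ConcurrentModificationException from the underlying iterator.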


>
> If this is really an issue, I would suggest coming up with a bigger
> collection of RDF API usage scenarios that are also relevant in practice
> (as proven by a software project using it). Including scenarios how to deal
> with bigger amounts of data (i.e. beyond toy examples). My scenarios
> typically include >= 100 million triples. ;-)
>
> In addition to what Andy said about wrapper APIs, I would also like to
> emphasise the incurred memory and computation overhead of wrapper APIs. Not
> an issue if you have only a handful of triples, but a big issue when you
> have 100 million.
>

It's a common misconception to think that Java sets are limited to 2^31-1
elements, but even that would be more than 100 million. In the challenge I
didn't ask for time complexity; it would be fair to ask for that too if you
want to analyze scenarios with such big numbers of triples.


> A possible way to bypass the wrapper issue is the approach followed by
> JDOM for XML, which we tried to use also in LDPath: abstract away the whole
> data model and API using Java Generics. This is typically very efficient
> (at runtime you are working with the native types), but it is also complex
> and ugly (you end up with a big list of methods implementing delegation as
> in
> http://code.google.com/p/ldpath/source/browse/ldpath-api/src/main/java/at/newmedialab/ldpath/api/backend/RDFBackend.java
> ).
>
I think this only supports accessing graphs and not the creation of graph
objects, so I'm afraid you can't take the challenge with that one.



>
> My favorite way would ba a common interface-based model for RDF in Java,
> implemented by different backends. This would require the involvement of at
> least the Jena and the Sesame people. The Sesame model already comes close
> to it, but of course also adds some concepts that are specific to Sesame
> (e.g. the repository concept and the way contexts/named graphs are
> handled), as we discussed some months ago.
>

Yes, that was the thread:
http://mail-archives.apache.org/mod_mbox/incubator-stanbol-dev/201208.mbox/%3CCAMmeZRmQcQP1syT=ccDG=fSXHOQA4OcAvcrBkHTXritiwT353A@mail.gmail.com%3E

I think such an interface-based common API is the goal. Let's compare the
approaches we have and create different use cases to see how the existing APIs
compare; the challenge I posed is just a start.

Cheers,
Reto

>
> Greetings,
>
> Sebastian
>
> Am 12.11.2012 um 20:45 schrieb Reto Bachmann-Gmür:
>
> > May I suggest the following toy-usecase for comparing different API
> > proposals (we know all API can be used for triple stores, so it seems
> > interesting how the can be used to expose any data as RDF and the Space
> > complexity of such an adapter):
> >
> > Given
> >
> > interface Person() {
> > String getGivenName();
> > String getLastName();
> > /**
> > * @return true if other is an instance of Person with the same GivenName
> > and LastName, false otherwise
> > */
> > boolean equals(Object other);
> > }
> >
> > Provide a method
> >
> > Graph getAsGraph(Set<Person> pesons);
> >
> > where `Graph` is the API of an RDF Graph that can change over time. The
> > returned `Graph`shall (if possible) be backed by the Set passed as
> argument
> > and thus reflect future changes to that set. The Graph shall support all
> > read operation but no addition or removal of triples. It's ok is some
> > iteration over the graph result in a ConcurrentModficationException if
> the
> > set changes during iteration (as one would get when iterating over the
> set
> > during such a modification).
> >
> > - How does the code look like?
> > - Is it backed by the Set and does the result Graph reflects changes to
> the
> > set?
> > - What's the space complexity?
> >
> > Challenge accepted?
> >
> > Reto
> >
> > On Mon, Nov 12, 2012 at 6:11 PM, Andy Seaborne <an...@apache.org> wrote:
> >
> >> On 11/11/12 23:22, Rupert Westenthaler wrote:
> >>
> >>> Hi all ,
> >>>
> >>> On Sun, Nov 11, 2012 at 4:47 PM, Reto Bachmann-Gmür <re...@apache.org>
> >>> wrote:
> >>>
> >>>> - clerezza.rdf graudates as commons.rdf: a modular java/scala
> >>>> implementation of rdf related APIs, usable with and without OSGi
> >>>>
> >>>
> >>> For me this immediately raises the question: Why should the Clerezza
> >>> API become commons.rdf if 90+% (just a guess) of the Java RDF stuff is
> >>> based on Jena and Sesame? Creating an Apache commons project based on
> >>> an RDF API that is only used by a very low percentage of all Java RDF
> >>> applications is not feasible. Generally I see not much room for a
> >>> commons RDF project as long as there is not a commonly agreed RDF API
> >>> for Java.
> >>>
> >>
> >> Very good point.
> >>
> >> There is a finite and bounded supply of energy of people to work on
> such a
> >> thing and to make it work for the communities that use it.   For all of
> us,
> >> work on A means less work on B.
> >>
> >>
> >> An "RDF API" for applications needs to be more than RDF. A SPARQL engine
> >> is not simply abstracted from the storage by some "list(s,p,o)" API
> call.
> >> It will die at scale, where scale here includes in-memory usage.
> >>
> >> My personal opinion is that wrapper APIs are not the way to go - they
> end
> >> up as a new API in themselves and the fact they are backed by different
> >> systems is really an implementation detail.  They end up having design
> >> opinions and gradually require more and more maintenace as the add more
> and
> >> more.
> >>
> >> API bridges are better (mapping one API to another - we are really
> talking
> >> about a small number of APIs, not 10s) as they expose the advantages of
> >> each system.
> >>
> >> The ideal is a set of interfaces systems can agree on.  I'm going to be
> >> contributing to the interfacization of the Graph API in Jena - if you
> have
> >> thoughts, send email to a list.
> >>
> >>        Andy
> >>
> >> PS See the work being done by Stephen Allen on coarse grained APIs:
> >>
> >> http://mail-archives.apache.**org/mod_mbox/jena-dev/201206.**
> >> mbox/%3CCAPTxtVOMMWxfk2%**2B4ciCExUBZyxsDKvuO0QshXF8uKha**
> >> D8txXjA%40mail.gmail.com%3E<
> http://mail-archives.apache.org/mod_mbox/jena-dev/201206.mbox/%3CCAPTxtVOMMWxfk2%2B4ciCExUBZyxsDKvuO0QshXF8uKhaD8txXjA%40mail.gmail.com%3E
> >
> >>
> >>
> >>
>
> Sebastian
> --
> | Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
> | Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
> | Head of Knowledge and Media Technologies Group          +43 662 2288 423
> | Jakob-Haringer Strasse 5/II
> | A-5020 Salzburg
>
>

Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)

Posted by Sebastian Schaffert <se...@salzburgresearch.at>.
Hi Reto,

I don't understand the use case, and I don't think it is well suited for comparing different RDF APIs. 

If this is really an issue, I would suggest coming up with a bigger collection of RDF API usage scenarios that are also relevant in practice (as proven by a software project using it). Including scenarios how to deal with bigger amounts of data (i.e. beyond toy examples). My scenarios typically include >= 100 million triples. ;-)

In addition to what Andy said about wrapper APIs, I would also like to emphasise the incurred memory and computation overhead of wrapper APIs. Not an issue if you have only a handful of triples, but a big issue when you have 100 million.

A possible way to bypass the wrapper issue is the approach followed by JDOM for XML, which we tried to use also in LDPath: abstract away the whole data model and API using Java Generics. This is typically very efficient (at runtime you are working with the native types), but it is also complex and ugly (you end up with a big list of methods implementing delegation as in http://code.google.com/p/ldpath/source/browse/ldpath-api/src/main/java/at/newmedialab/ldpath/api/backend/RDFBackend.java).
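
As an illustration of that generics approach, a reduced, made-up interface in the spirit of the linked RDFBackend (not its actual signature):

import java.util.Collection;

/**
 * Node is the backend's native node type (e.g. a Jena Node or a Sesame Value),
 * so algorithms written against this interface run on the native objects
 * without any wrapping.
 */
public interface GenericBackend<Node> {

  Node createURI(String uri);

  Node createLiteral(String lexicalValue);

  /** All objects of triples matching (subject, property, ?o). */
  Collection<Node> listObjects(Node subject, Node property);

  String stringValue(Node node);
}

class LabelFetcher {
  /** Works unchanged on top of any backend, Jena- or Sesame-based. */
  static <Node> String firstLabel(GenericBackend<Node> backend, Node resource) {
    Node rdfsLabel = backend.createURI("http://www.w3.org/2000/01/rdf-schema#label");
    for (Node o : backend.listObjects(resource, rdfsLabel)) {
      return backend.stringValue(o);
    }
    return null;
  }
}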

My favorite way would be a common interface-based model for RDF in Java, implemented by different backends. This would require the involvement of at least the Jena and the Sesame people. The Sesame model already comes close to it, but of course also adds some concepts that are specific to Sesame (e.g. the repository concept and the way contexts/named graphs are handled), as we discussed some months ago.

Greetings,

Sebastian

On 12.11.2012 at 20:45, Reto Bachmann-Gmür wrote:

> May I suggest the following toy-usecase for comparing different API
> proposals (we know all API can be used for triple stores, so it seems
> interesting how the can be used to expose any data as RDF and the Space
> complexity of such an adapter):
> 
> Given
> 
> interface Person() {
> String getGivenName();
> String getLastName();
> /**
> * @return true if other is an instance of Person with the same GivenName
> and LastName, false otherwise
> */
> boolean equals(Object other);
> }
> 
> Provide a method
> 
> Graph getAsGraph(Set<Person> pesons);
> 
> where `Graph` is the API of an RDF Graph that can change over time. The
> returned `Graph`shall (if possible) be backed by the Set passed as argument
> and thus reflect future changes to that set. The Graph shall support all
> read operation but no addition or removal of triples. It's ok is some
> iteration over the graph result in a ConcurrentModficationException if the
> set changes during iteration (as one would get when iterating over the set
> during such a modification).
> 
> - How does the code look like?
> - Is it backed by the Set and does the result Graph reflects changes to the
> set?
> - What's the space complexity?
> 
> Challenge accepted?
> 
> Reto
> 
> On Mon, Nov 12, 2012 at 6:11 PM, Andy Seaborne <an...@apache.org> wrote:
> 
>> On 11/11/12 23:22, Rupert Westenthaler wrote:
>> 
>>> Hi all ,
>>> 
>>> On Sun, Nov 11, 2012 at 4:47 PM, Reto Bachmann-Gmür <re...@apache.org>
>>> wrote:
>>> 
>>>> - clerezza.rdf graudates as commons.rdf: a modular java/scala
>>>> implementation of rdf related APIs, usable with and without OSGi
>>>> 
>>> 
>>> For me this immediately raises the question: Why should the Clerezza
>>> API become commons.rdf if 90+% (just a guess) of the Java RDF stuff is
>>> based on Jena and Sesame? Creating an Apache commons project based on
>>> an RDF API that is only used by a very low percentage of all Java RDF
>>> applications is not feasible. Generally I see not much room for a
>>> commons RDF project as long as there is not a commonly agreed RDF API
>>> for Java.
>>> 
>> 
>> Very good point.
>> 
>> There is a finite and bounded supply of energy of people to work on such a
>> thing and to make it work for the communities that use it.   For all of us,
>> work on A means less work on B.
>> 
>> 
>> An "RDF API" for applications needs to be more than RDF. A SPARQL engine
>> is not simply abstracted from the storage by some "list(s,p,o)" API call.
>> It will die at scale, where scale here includes in-memory usage.
>> 
>> My personal opinion is that wrapper APIs are not the way to go - they end
>> up as a new API in themselves and the fact they are backed by different
>> systems is really an implementation detail.  They end up having design
>> opinions and gradually require more and more maintenace as the add more and
>> more.
>> 
>> API bridges are better (mapping one API to another - we are really talking
>> about a small number of APIs, not 10s) as they expose the advantages of
>> each system.
>> 
>> The ideal is a set of interfaces systems can agree on.  I'm going to be
>> contributing to the interfacization of the Graph API in Jena - if you have
>> thoughts, send email to a list.
>> 
>>        Andy
>> 
>> PS See the work being done by Stephen Allen on coarse grained APIs:
>> 
>> http://mail-archives.apache.**org/mod_mbox/jena-dev/201206.**
>> mbox/%3CCAPTxtVOMMWxfk2%**2B4ciCExUBZyxsDKvuO0QshXF8uKha**
>> D8txXjA%40mail.gmail.com%3E<http://mail-archives.apache.org/mod_mbox/jena-dev/201206.mbox/%3CCAPTxtVOMMWxfk2%2B4ciCExUBZyxsDKvuO0QshXF8uKhaD8txXjA%40mail.gmail.com%3E>
>> 
>> 
>> 

Sebastian
-- 
| Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg