Posted to dev@commons.apache.org by Reto Gmür <re...@apache.org> on 2015/01/13 08:30:16 UTC

Updated Commons-RDF

Hi all,

I've just committed a new version to

http://svn.apache.org/viewvc/commons/sandbox/rdf/trunk/

The most obvious change is that instead of having TripleCollection, MGraph
and Graph there is now Graph and ImmutableGraph. With this change the usage
of the term "graph" is a bit closer to the colloquial usage (at the price
of being a bit more distant to the usage in the specs).

I've also added some questions and answers to the Readme highlighting the
points where the API offers advantages compared with other APIs and API
proposals.

The API goes beyond the most minimalistic API by allowing for graph
listeners to be notified when a graph is changed. This is a popular feature
that cannot easily be added on top of the core API so it is included.
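
A minimal sketch of why listener support belongs in the graph itself: the notification has to be triggered by the same code that performs the change. All names below (GraphListener, ObservableGraphSketch, addGraphListener) are hypothetical and not taken from the committed API, and triples are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical listener type, not the one from the proposal.
interface GraphListener {
    void tripleAdded(String triple);
}

// Hypothetical graph that notifies its listeners from inside add().
class ObservableGraphSketch {
    private final Set<String> triples = new LinkedHashSet<>();
    private final List<GraphListener> listeners = new ArrayList<>();

    void addGraphListener(GraphListener listener) {
        listeners.add(listener);
    }

    // Returns true if the graph changed; listeners only hear about real changes.
    boolean add(String triple) {
        boolean changed = triples.add(triple);
        if (changed) {
            for (GraphListener listener : listeners) {
                listener.tripleAdded(triple);
            }
        }
        return changed;
    }

    int size() {
        return triples.size();
    }
}

class ListenerDemo {
    public static void main(String[] args) {
        ObservableGraphSketch graph = new ObservableGraphSketch();
        List<String> seen = new ArrayList<>();
        graph.addGraphListener(seen::add);
        graph.add("<s> <p> <o> .");
        graph.add("<s> <p> <o> ."); // duplicate, so no second notification
        System.out.println(seen.size() + " notification(s), " + graph.size() + " triple(s)");
    }
}
```

A wrapper added on top of a listener-free graph would miss every change made through a direct reference to the wrapped graph, which is one way to read "cannot easily be added on top".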

I've added the getLock method to the main Graph interface. API notes
describe how implementations can easily provide such a lock and what to do
in situations where no such lock is needed. In Clerezza we have been using a
subinterface, LockableMGraph, to provide this feature. Experience has shown,
however, that this approach makes it unnecessarily difficult to write generic
code: for example, generic methods processing a Graph often had to check the
type and downcast to do the locking on graphs that can be locked.
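
The difference for generic code can be sketched as follows. SketchGraph is a hypothetical stand-in for the Graph interface, and the ReadWriteLock return type of getLock() is an assumption made for this illustration; triples are reduced to strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the Graph interface; only getLock() matters here.
// The ReadWriteLock return type is an assumption.
interface SketchGraph {
    ReadWriteLock getLock();

    List<String> triples();
}

class InMemorySketchGraph implements SketchGraph {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> triples = new ArrayList<>();

    public ReadWriteLock getLock() {
        return lock;
    }

    public List<String> triples() {
        return triples;
    }
}

class LockDemo {
    // Generic code: no instanceof check and no downcast to a lockable subtype.
    static int countTriples(SketchGraph graph) {
        graph.getLock().readLock().lock();
        try {
            return graph.triples().size();
        } finally {
            graph.getLock().readLock().unlock();
        }
    }

    public static void main(String[] args) {
        InMemorySketchGraph graph = new InMemorySketchGraph();
        graph.getLock().writeLock().lock();
        try {
            graph.triples().add("<s> <p> <o> .");
        } finally {
            graph.getLock().writeLock().unlock();
        }
        System.out.println(countTriples(graph));
    }
}
```

With a separate lockable subinterface, countTriples would need a type check and a cast before it could take the read lock.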


Please let me know what you think about this proposal.

Cheers,
Reto

Re: [RDF] Updated Commons-RDF

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br>.
> I would very much like to see discussion of use cases so we understand
> each other's expectations and technical requirements. Minto's message
> touches on this, about distribution and design for in-memory.

At the moment we have a Hadoop cluster processing data scraped from the Web. We use OpenNLP on the data, and a corporate ontology to create triples in RDF (IIRC we're using the jena-arq dependency for that). These triples are later loaded into a graph in Jena TDB via Fuseki. After that, a data quality process handles data deduplication and record linkage (via SPARQL). Finally the data is ready to be consumed by internal products and services.

My use case for commons-rdf would be in the Hadoop jobs, using its API and a simple and efficient implementation to create the triples. If another implementation claimed to be more efficient, I could simply replace the impl dependency in my pom.xml and run some jobs to test it.

Cheers,
Bruno

ps: ATM we have some custom writables, but plan to soon use Jena Elephas for that too :)


----- Original Message -----
> From: Andy Seaborne <an...@apache.org>
> To: dev@commons.apache.org
> Cc: 
> Sent: Saturday, January 17, 2015 8:26 AM
> Subject: Re: [RDF] Updated Commons-RDF
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [RDF] Updated Commons-RDF

Posted by Andy Seaborne <an...@apache.org>.
On 14/01/15 18:34, Reto Gmür wrote:
> There has been an indirect reply here:
> https://github.com/commons-rdf/commons-rdf/issues/43. As the issue points to
> this thread I thought I would add a back-link, but I would prefer to have the
> discussion here and to discuss concrete code proposals.

I would very much like to see discussion of use cases so we understand
each other's expectations and technical requirements. Minto's message
touches on this, about distribution and design for in-memory.

https://mail-archives.apache.org/mod_mbox/clerezza-dev/201412.mbox/%3C54946D8B.1060005@apache.org%3E

> According to Sergio the proposal is "a wrapper implementation instead of
> commons interface". As the proposal doesn't contain any wrapper, this might
> refer to the question on when to define classes and when to define
> interfaces.
>
> The API proposal has the following interfaces and classes (without .events)
>
> Interfaces:
...

> Classes:
...

> The reason why Language and Iri are classes rather than interfaces is
> that the additional work for service providers exposing the API to
> implement the interfaces themselves seems to outweigh the benefit of being
> able to provide their own implementation without inheriting the
> overhead of an additional String per instance (the classes are not final,
> so implementations can still provide an Iri implementation that stores all
> the lengthy IRIs on disk; in this case there is just an empty and unused
> string field for the JIT to optimize away).

Not having interfaces everywhere is painful for adding this new system 
to existing code.  Java does not support multiple inheritance.  Existing 
code may already have a super class.  A copy would be needed.
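
A minimal sketch of the single-inheritance problem. Every name here (ExistingNode, ExistingUriNode, IriLike, IriClass, getUnicodeString) is hypothetical and chosen only for illustration.

```java
// Hypothetical base class that an existing RDF system already uses.
class ExistingNode {
    final long id;

    ExistingNode(long id) {
        this.id = id;
    }
}

// Variant A: the API term type is an interface.
interface IriLike {
    String getUnicodeString();
}

// The existing class keeps its superclass and implements the API type directly.
class ExistingUriNode extends ExistingNode implements IriLike {
    private final String uri;

    ExistingUriNode(long id, String uri) {
        super(id);
        this.uri = uri;
    }

    public String getUnicodeString() {
        return uri;
    }
}

// Variant B: the API term type is a class. ExistingUriNode cannot extend it
// as well, so every term has to be copied at the API boundary.
class IriClass {
    private final String unicodeString;

    IriClass(String unicodeString) {
        this.unicodeString = unicodeString;
    }

    String getUnicodeString() {
        return unicodeString;
    }
}

class InheritanceDemo {
    static IriClass copy(ExistingUriNode node) {
        return new IriClass(node.getUnicodeString());
    }

    public static void main(String[] args) {
        ExistingUriNode node = new ExistingUriNode(1L, "http://example.org/a");
        IriLike viaInterface = node;    // same object, no allocation
        IriClass viaClass = copy(node); // one extra object per term
        System.out.println(viaInterface.getUnicodeString().equals(viaClass.getUnicodeString()));
    }
}
```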

What had you in mind for different literal implementations?

I don't see why Literal is different to an IRI here - a literal is, by
definition, lexical form + language + datatype, in the same way a URI is a
URI string.

> The reason why BlankNode is a class and not an interface is to discourage
> polymorphism. If an instance is more than just a BNode, users will be more
> likely to expect to get the very same instance back,

I hope they, for any RDFterms, make that assumption under any 
circumstances!  For persistence, same object (i.e. Java's ==) is 
somewhere between "very hard" (= expensive to implement for no benefit) 
and impossible (persist data > RAM size).  Interning is not practical 
because of reference counting to keep the intern table size managed. 
Weak references add cost (app write and execution) at a point where 
simple costs can mount up quickly (parsing speed ... once the I/O path 
is straightened out ... java :-( ).
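
The distinction between object identity and value equality can be sketched like this. IriTerm is a hypothetical value-style class, not a type from the proposal; the second constructor call stands in for a store rebuilding a term when it is read back.

```java
// Hypothetical value-style IRI term: equality is defined by the string.
final class IriTerm {
    private final String value;

    IriTerm(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IriTerm && ((IriTerm) o).value.equals(value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }
}

class EqualityDemo {
    public static void main(String[] args) {
        IriTerm added = new IriTerm("http://example.org/a");
        // Stand-in for a persistent store that rebuilds the term on read.
        IriTerm readBack = new IriTerm("http://example.org/a");
        System.out.println(added == readBack);      // false: two distinct objects
        System.out.println(added.equals(readBack)); // true: same IRI string
    }
}
```

Guaranteeing that == holds would require the store to keep, or intern, every instance it has ever handed out, which is the cost described above.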

> but as described in
> the Readme there is no such guarantee. Typically implementations will
> replace BlankNode objects with instances of their own subclass of BlankNode
> as soon as they can (i.e. as soon as the originally added instance becomes
> eligible for garbage collection).

	Andy





Re: Updated Commons-RDF

Posted by Reto Gmür <re...@apache.org>.
There has been an indirect reply here:
https://github.com/commons-rdf/commons-rdf/issues/43. As the issue points to
this thread I thought I would add a back-link, but I would prefer to have the
discussion here and to discuss concrete code proposals.

According to Sergio the proposal is "a wrapper implementation instead of
commons interface". As the proposal doesn't contain any wrapper, this might
refer to the question on when to define classes and when to define
interfaces.

The API proposal has the following interfaces and classes (without .events)

Interfaces:

   - BlankNodeOrIri.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/BlankNodeOrIri.java>
   - Graph.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/Graph.java>
   - ImmutableGraph.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/ImmutableGraph.java>
   - RdfTerm.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/RdfTerm.java>
   - Triple.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/Triple.java>
   - Literal.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/Literal.java>

Classes:

   - Language.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/Language.java>
   - Iri.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/Iri.java>
   - BlankNode.java
   <http://svn.apache.org/repos/asf/commons/sandbox/rdf/trunk/src/main/java/org/apache/commons/rdf/BlankNode.java>


The reason why Language and Iri are classes rather than interfaces is
that the additional work for service providers exposing the API to
implement the interfaces themselves seems to outweigh the benefit of being
able to provide their own implementation without inheriting the
overhead of an additional String per instance (the classes are not final,
so implementations can still provide an Iri implementation that stores all
the lengthy IRIs on disk; in this case there is just an empty and unused
string field for the JIT to optimize away).
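
A rough sketch of the subclassing option described above. IriSketch is a hypothetical stand-in for the Iri class, getUnicodeString is an assumed accessor name, and a Map stands in for the on-disk dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed Iri class: non-final, one String field.
class IriSketch {
    private final String unicodeString;

    IriSketch(String unicodeString) {
        this.unicodeString = unicodeString;
    }

    String getUnicodeString() {
        return unicodeString;
    }
}

// Subclass that leaves the inherited field unused and resolves the string on demand.
class StoredIri extends IriSketch {
    private final Map<Long, String> store; // stands in for an on-disk dictionary
    private final long key;

    StoredIri(Map<Long, String> store, long key) {
        super(null); // the inherited String field stays empty
        this.store = store;
        this.key = key;
    }

    @Override
    String getUnicodeString() {
        return store.get(key);
    }
}

class StoredIriDemo {
    public static void main(String[] args) {
        Map<Long, String> dictionary = new HashMap<>();
        dictionary.put(1L, "http://example.org/a");
        IriSketch iri = new StoredIri(dictionary, 1L);
        System.out.println(iri.getUnicodeString());
    }
}
```

Each StoredIri still carries the unused reference field inherited from the base class, which is the per-instance overhead under discussion.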

The reason why BlankNode is a class and not an interface is to discourage
polymorphism. If an instance is more than just a BNode, users will be more
likely to expect to get the very same instance back, but as described in
the Readme there is no such guarantee. Typically implementations will
replace BlankNode objects with instances of their own subclass of BlankNode
as soon as they can (i.e. as soon as the originally added instance becomes
eligible for garbage collection).


Cheers,

Reto
