Posted to dev@commons.apache.org by Andy Seaborne <an...@apache.org> on 2015/01/17 11:26:49 UTC

Re: [RDF] Updated Commons-RDF

On 14/01/15 18:34, Reto Gmür wrote:
> There has been an indirect reply here:
> https://github.com/commons-rdf/commons-rdf/issues/43. As the issue points to
> this thread I thought to add a back-link, but I would prefer to have the
> discussion here and to discuss concrete code proposals.

I would very much like to see discussion of use cases so we understand
each other's expectations and technical requirements. Minto's message
touches on this, about distribution and design for in-memory.

https://mail-archives.apache.org/mod_mbox/clerezza-dev/201412.mbox/%3C54946D8B.1060005@apache.org%3E

> According to Sergio the proposal is "a wrapper implementation instead of
> commons interface". As the proposal doesn't contain any wrapper, this might
> refer to the question on when to define classes and when to define
> interfaces.
>
> The API proposal has the following interfaces and classes (without .events)
>
> Interfaces:
...

> Classes:
...

> The reason why Language and Iri are classes rather than interfaces is
> because the additional work for service providers exposing the API to
> implement the interfaces themselves seems to outweigh the benefits of the
> possibility to provide an own implementation without inheriting the
> overhead of an additional String per instance (the classes are not final,
> so implementation can still provide an Iri implementation that stores all
> the lengthy IRIs on disk, in this case there is just an empty and unused
> string field for the JIT to optimize away).

Not having interfaces everywhere is painful for adding this new system 
to existing code.  Java does not support multiple inheritance.  Existing 
code may already have a super class.  A copy would be needed.
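
A minimal sketch of the single-inheritance problem, using hypothetical names (PersistentRecord, IriTerm, StoredIri, IriClass and IriAdapters are illustrations only, not types from the proposal). With an interface, an existing class keeps its superclass; with a class, the data has to be copied into a second object:

```java
// All names here are hypothetical; none are taken from the API proposal.

// An existing application class that already occupies the single superclass slot.
class PersistentRecord {
    protected final long recordId;

    PersistentRecord(long recordId) {
        this.recordId = recordId;
    }
}

// Case 1: the IRI type is an interface, so existing code can simply implement it.
interface IriTerm {
    String getIriString();
}

class StoredIri extends PersistentRecord implements IriTerm {
    private final String iri;

    StoredIri(long recordId, String iri) {
        super(recordId);
        this.iri = iri;
    }

    @Override
    public String getIriString() {
        return iri;
    }
}

// Case 2: the IRI type is a class. StoredIri cannot extend both PersistentRecord
// and IriClass, so handing it to the API means allocating a copy.
class IriClass {
    private final String iri;

    IriClass(String iri) {
        this.iri = iri;
    }

    String getIriString() {
        return iri;
    }
}

class IriAdapters {
    // The extra allocation per term that the interface-based design avoids.
    static IriClass copyOf(StoredIri stored) {
        return new IriClass(stored.getIriString());
    }
}
```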

What had you in mind for different literal implementations?

I don't see why Literal is different to an IRI here - a literal is, by
definition, lexical form + language + datatype, in the same way a URI is a
uristring.
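
As an illustration of that definition, a literal interface could expose exactly those three parts. This is only a sketch: LiteralTerm and SimpleLiteral are hypothetical names, and representing the optional language tag with Optional is an assumption, not something taken from the proposal.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical interface: a literal described purely by its three parts.
interface LiteralTerm {
    String getLexicalForm();

    String getDatatypeIri();

    // Empty when the literal carries no language tag.
    Optional<String> getLanguageTag();
}

// One possible in-memory implementation; a store-backed one could implement
// the same interface without extending this class.
final class SimpleLiteral implements LiteralTerm {
    private final String lexicalForm;
    private final String datatypeIri;
    private final String languageTag; // null when absent

    SimpleLiteral(String lexicalForm, String datatypeIri, String languageTag) {
        this.lexicalForm = Objects.requireNonNull(lexicalForm);
        this.datatypeIri = Objects.requireNonNull(datatypeIri);
        this.languageTag = languageTag;
    }

    @Override
    public String getLexicalForm() {
        return lexicalForm;
    }

    @Override
    public String getDatatypeIri() {
        return datatypeIri;
    }

    @Override
    public Optional<String> getLanguageTag() {
        return Optional.ofNullable(languageTag);
    }
}
```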

> The reason why BlankNode is a class and not an interface is to discourage
> polymorphism. If an instance is more than just a BNode user will be more
> likely to expect to get the very same instance back,

I hope they don't, for any RDF terms, make that assumption under any
circumstances!  For persistence, same object (i.e. Java's ==) is
somewhere between "very hard" (= expensive to implement for no benefit)
and impossible (persist data > RAM size).  Interning is not practical
because of reference counting to keep the intern table size managed.
Weak references add cost (app write and execution) at a point where
simple costs can mount up quickly (parsing speed ... once the I/O path
is straightened out ... Java :-( ).
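
To make the identity point concrete, here is a small sketch in which a stand-in for a persistent store builds a fresh object on every read. IriValue and TinyStore are hypothetical names for illustration. Two reads of the same stored term are equal by value but are not the same object, which is why callers should compare with equals() and never with ==:

```java
import java.util.Objects;

// Hypothetical value-style term: equality is defined by the IRI string alone.
final class IriValue {
    private final String iri;

    IriValue(String iri) {
        this.iri = Objects.requireNonNull(iri);
    }

    String getIriString() {
        return iri;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IriValue && iri.equals(((IriValue) o).iri);
    }

    @Override
    public int hashCode() {
        return iri.hashCode();
    }
}

// Stands in for a persistent store: nothing is cached or interned, so each
// read materialises a new object from the stored data.
final class TinyStore {
    private final String[] rows = {
        "http://example.org/a",
        "http://example.org/b"
    };

    IriValue read(int row) {
        return new IriValue(rows[row]);
    }
}
```

Guaranteeing that both reads return the same instance would need an intern table with its own lifetime management, which is the cost described above.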

> but as described in
> the Readme there is no such guarantee. Typically implementations will
> replace BlankNode objects with instances of their own subclass of BlankNode
> as soon as they can (i.e. as soon as the originally added instance becomes
> eligible for garbage collection).

	Andy



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [RDF] Updated Commons-RDF

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br>.
> I would very much like to see discussion of use cases so we understand 
> each others expectations and technical requirements. Minto's message 
> touches on this, about distribution and design for in-memory.

At the moment we have a Hadoop cluster processing data scraped from the Web. We use OpenNLP on the data, and a corporate ontology to create triples in RDF (IIRC we're using the jena-arq dependency for that). These triples are later loaded into a graph in Jena TDB via Fuseki. Later a data quality process is used for data deduplication and record linkage (via SPARQL). Finally the data is ready to be consumed by internal products and services.

My use case for commons-rdf would be in the Hadoop jobs, using its API and a simple, efficient implementation to create the triples. If another implementation claimed to be more efficient, I could simply replace the impl dependency in my pom.xml and run some jobs to test it.
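
Roughly what I have in mind, as a sketch only: TripleSink, ListSink, CountingSink and ExtractionJob are made-up names, not the commons-rdf API. The job code depends only on the interface, so the implementation behind it can be swapped without touching the job:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical API-side interface that the job is written against.
interface TripleSink {
    void add(String subject, String predicate, String object);

    int size();
}

// Implementation A: keeps every triple in memory.
final class ListSink implements TripleSink {
    private final List<String[]> triples = new ArrayList<>();

    @Override
    public void add(String subject, String predicate, String object) {
        triples.add(new String[] { subject, predicate, object });
    }

    @Override
    public int size() {
        return triples.size();
    }
}

// Implementation B: a cheaper stand-in that only counts, e.g. for benchmarking.
final class CountingSink implements TripleSink {
    private int count;

    @Override
    public void add(String subject, String predicate, String object) {
        count++;
    }

    @Override
    public int size() {
        return count;
    }
}

// Job code: no reference to any concrete implementation.
final class ExtractionJob {
    private final TripleSink sink;

    ExtractionJob(TripleSink sink) {
        this.sink = sink;
    }

    void recordName(String personIri, String name) {
        sink.add(personIri, "http://xmlns.com/foaf/0.1/name", name);
    }
}
```

In a real build the concrete class would come from whichever implementation artifact is on the classpath rather than being constructed by hand.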

Cheers,
Bruno

ps: ATM we have some custom writables, but plan to soon use Jena Elephas for that too :)

