Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2019/11/17 17:35:08 UTC

Jena next (AFS)

This is a bit of a brain dump ...

== DatasetGraph

Graph, Triple, Quad, DatasetGraph in a single API place.

== Graph - SPI

Graph - add a few navigation operations to make writing system code
directly on Graph easier - though still not as rich as the Model API -
and avoid much of the object churn.

The operations are (not final names)

   Graph.fwd(subject, predicate)
        -- return a single Node or null.
   Graph.fwdList(subject, predicate)
        -- return a list of Nodes
   Graph.fwdUnique(subject, predicate)
        -- return a single Node, exception if 0 or more than one.

Same for "bwk"

https://github.com/apache/jena/blob/master/jena-shacl/src/main/java/org/apache/jena/shacl/lib/G.java
is a library version of this that was helpful, but the proposal here is
to add a few operations directly to Graph.

If the data is known to be good (SHACL), the application code can use 
fwd()/bwk() without worrying about testing for zero or multiple predicates.
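
A minimal sketch of the fwd/fwdList/fwdUnique semantics described above.
It does not use Jena: NavSketch and its String-based triples are
hypothetical stand-ins for Graph and Node, and returning the first match
from fwd() when there are several is an assumption, since the proposal
only says "a single Node or null".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a graph: each triple is {subject, predicate, object}.
// This is not Jena's Graph/Node API; it only mirrors the proposed semantics.
final class NavSketch {
    private final List<String[]> triples = new ArrayList<>();

    void add(String s, String p, String o) {
        triples.add(new String[] { s, p, o });
    }

    // fwdList: all objects for (subject, predicate); empty list if none.
    List<String> fwdList(String s, String p) {
        List<String> out = new ArrayList<>();
        for (String[] t : triples) {
            if (t[0].equals(s) && t[1].equals(p)) {
                out.add(t[2]);
            }
        }
        return out;
    }

    // fwd: a single object, or null when there is no match.
    // Assumption: with several matches the first one found is returned.
    String fwd(String s, String p) {
        List<String> objects = fwdList(s, p);
        return objects.isEmpty() ? null : objects.get(0);
    }

    // fwdUnique: exactly one object, otherwise an exception.
    String fwdUnique(String s, String p) {
        List<String> objects = fwdList(s, p);
        if (objects.size() != 1) {
            throw new IllegalStateException(
                "Expected exactly one object, found " + objects.size());
        }
        return objects.get(0);
    }

    public static void main(String[] args) {
        NavSketch g = new NavSketch();
        g.add(":shape", ":targetClass", ":Person");
        g.add(":shape", ":property", ":p1");
        g.add(":shape", ":property", ":p2");
        System.out.println(g.fwdUnique(":shape", ":targetClass"));
        System.out.println(g.fwdList(":shape", ":property"));
        System.out.println(g.fwd(":shape", ":missing"));
    }
}
```

The "bwk" variants would be the same with subject and object swapped.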

The reason for putting the basic operations in the Graph interface, and
not everything in a library, is potential efficiency. An implementation
may be able to do a good job of fwd(), and if that is the basis of graph
analytics then efficiency matters long term; at the least, we should not
design it out.

== Assembler

The graph SPI additions are also motivated by assemblers.  Assemblers are
currently Model/Resource based but the important usage is in Fuseki - an
ideal goal is that Fuseki works on Graph/Node.

Converting assemblers to Graph/Node does not look too burdensome and 
with a wrapper layer we can hopefully include all the old tests to check 
evolution.

== Graph - indexing

Currently, Graphs are term-indexed only or value-indexed, not both.

Graph should be plain term-indexed. Value-indexing, which can be calculated
on the fly, would be a separate higher-level concept.

This is motivated by scale and having the same behaviour on all graphs.
At scale, canonicalizing the inputs is better than value-indexing.
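
To illustrate the difference: the integer literals "1", "01" and "+1" are
three different RDF terms with one value. The sketch below is not Jena
code; TermVsValue, canonicalInteger and termIndex are hypothetical names,
and only integers are handled. It shows that a plain term index sees three
entries, while canonicalizing on input leaves one, so a value index is not
needed to find them together.

```java
import java.math.BigInteger;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; literals are represented by their lexical form only.
final class TermVsValue {
    // Canonical lexical form of an integer: "01" and "+1" both become "1".
    static String canonicalInteger(String lexical) {
        return new BigInteger(lexical).toString();
    }

    // A term index keyed on lexical form, optionally canonicalizing on input.
    static Set<String> termIndex(boolean canonicalize, String... lexicalForms) {
        Set<String> index = new HashSet<>();
        for (String lex : lexicalForms) {
            index.add(canonicalize ? canonicalInteger(lex) : lex);
        }
        return index;
    }

    public static void main(String[] args) {
        // Without canonicalization the three spellings are three distinct terms.
        System.out.println(termIndex(false, "1", "01", "+1").size());
        // With canonicalization on input they collapse to a single term.
        System.out.println(termIndex(true, "1", "01", "+1").size());
    }
}
```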

"values" would only be in the Model API.

== Transactions

Unify the transaction approach (also changes Model) so complex 
assemblages of graphs, and other things,  are transactional.

Remove graph transactions - replace by
org.apache.jena.sparql.core.Transactional.

Then graphs as views of datasets, and also combinations of Transactionals
in a single transaction (two DatasetGraphs, or a collection of Graphs (the
assembler case)), can be done.

== Events

Make events an intercepting wrapper, not built-in to Graph itself.
Add transaction lifecycle events.
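
One way to picture "an intercepting wrapper" is the sketch below. It is
not Jena's event machinery: MiniGraph, MemGraph, EventGraph and Listener
are hypothetical names, and triples are plain strings. The point is only
that the base graph knows nothing about events, and that the wrapper can
also report transaction lifecycle steps.

```java
import java.util.ArrayList;
import java.util.List;

final class EventWrapperSketch {
    // Hypothetical minimal graph interface; Jena's Graph is much larger.
    interface MiniGraph {
        void add(String triple);
        boolean contains(String triple);
    }

    // Plain storage with no event support built in.
    static final class MemGraph implements MiniGraph {
        private final List<String> triples = new ArrayList<>();
        public void add(String triple) { triples.add(triple); }
        public boolean contains(String triple) { return triples.contains(triple); }
    }

    // Hypothetical listener: receives an event kind and a detail string.
    interface Listener {
        void event(String kind, String detail);
    }

    // Intercepting wrapper: delegates to the base graph, then notifies.
    static final class EventGraph implements MiniGraph {
        private final MiniGraph base;
        private final Listener listener;

        EventGraph(MiniGraph base, Listener listener) {
            this.base = base;
            this.listener = listener;
        }

        public void add(String triple) {
            base.add(triple);
            listener.event("add", triple);
        }

        public boolean contains(String triple) { return base.contains(triple); }

        // Transaction lifecycle events are raised by the wrapper too.
        void begin()  { listener.event("begin", ""); }
        void commit() { listener.event("commit", ""); }
    }

    public static void main(String[] args) {
        EventGraph g = new EventGraph(new MemGraph(),
            (kind, detail) -> System.out.println(kind + " " + detail));
        g.begin();
        g.add(":s :p :o");
        g.commit();
    }
}
```

Code that updates the base graph directly bypasses the listener, which is
the trade-off of keeping events out of Graph itself.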

== Streams - yes and no.

A Stream is several Java objects, so the potential cost for simple
operations like Graph.contains(), or a find() of a few things, is not
small.

Keep iterators, provide stream(s,p,o).
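
A sketch of how a stream can be layered on an existing iterator with the
JDK's StreamSupport, so the iterator stays the primary API. IterToStream
and asStream are hypothetical names, and the strings stand in for triples;
a real stream(s,p,o) would wrap whatever find(s,p,o) returns.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class IterToStream {
    // Wrap an iterator as a sequential Stream; elements are pulled on demand.
    static <T> Stream<T> asStream(Iterator<T> it) {
        Spliterator<T> split =
            Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED);
        return StreamSupport.stream(split, false);
    }

    public static void main(String[] args) {
        // Strings stand in for the triples an iterator-based find() would return.
        Iterator<String> it =
            Arrays.asList("s1 p o1", "s1 q o2", "s2 p o3").iterator();
        List<String> hits = asStream(it)
            .filter(t -> t.startsWith("s1 "))
            .collect(Collectors.toList());
        System.out.println(hits);
    }
}
```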

== Nodes

Lang tags - force to lower case.

Simplify - remove a layer of indirection. This relates to indexing.

Node_Literal - no LiteralLabels
Node_Blank - two longs or a string label, not using BlankNodeId

Investigate integrating nodes with ARQ's NodeValue.

== IRIs

jena-iri is general, powerful and hard to maintain.
Jena does not use all of it.
Jena needs a simpler, direct parser/checker.

https://github.com/afs/iri4ld

which is a parser in Java with little copying. It parses URIs, and then
has a few scheme-specific rules for http(s), file and URN.

The various open source libraries and JDK classes do not track the 
current standards very well (RFC 2396 vs RFC 3986). I have found that 
compliance is mixed due to legacy compatibility needs.

Re: Jena next (AFS)

Posted by "A. Soroka" <so...@gmail.com>.
I think there may be some confusion here about Streams and Iterators.
Streams are not and were never intended to be a replacement for or
equivalent to Iterators. Iterators are a source of elements. There is no
further complexity. Streams, on the other hand, are pipelines of
computation which do not begin flowing until a terminal operation is called
and then the data flows through the pipe without further action from the
caller.

We should not even consider replacing Iterators with Streams and I did not
hear anyone suggest it. The suggestion I heard (and which I thoroughly
support) is to offer Streams alongside Iterators for their intended
purpose: pipelining computations.

For example, consider a dataset backed by a remote SPARQL connection.
Suppose stream() is called on a graph from this dataset, and suppose
limit() is called on that Stream. Then in the right conditions a smart
implementation could push that limit upstream to become a LIMIT in the
SPARQL being sent over the wire. That's the real value of all those methods
on Stream. They aren't merely for developer convenience. A library like our
Iter would do for that. They are there to provide opportunities to classify
and manage the computations being pipelined.
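
The laziness described above, and the effect of limit(), can be seen
locally with the JDK alone. This sketch does not push anything to a SPARQL
endpoint; LazyPipeline and its counter are hypothetical, and the counter
only records how many elements the source hands to the pipeline.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class LazyPipeline {
    // Counts how many elements the source actually hands to the pipeline.
    static final AtomicInteger pulled = new AtomicInteger();

    // Build a pipeline over 1000 elements; building it runs nothing.
    static Stream<Integer> pipeline() {
        return IntStream.range(0, 1000)
            .boxed()
            .peek(i -> pulled.incrementAndGet())
            .limit(2);
    }

    public static void main(String[] args) {
        Stream<Integer> s = pipeline();
        // No terminal operation yet, so nothing has flowed.
        System.out.println("pulled before terminal op: " + pulled.get());
        // The terminal operation starts the flow; limit() stops it early.
        List<Integer> first = s.collect(Collectors.toList());
        System.out.println("result: " + first + ", pulled: " + pulled.get());
    }
}
```

Because the limit is part of the pipeline description rather than a loop
written by the caller, an implementation has the chance to act on it.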

So Stream is very useful for its purpose, but not for Iterator's purpose.
As for crossing module boundaries, I haven't seen any problems with that
and I'm not sure what the objection actually is. In fact, refusing to expose
Streams at APIs seems to make the whole thing rather pointless.

ajs6f


On Tue, Nov 19, 2019, 10:06 AM Merlin Bögershausen <
merlin.boegershausen@rwth-aachen.de> wrote:

> Hi,
>
> To the Stream discussion: Streams should not be passed beyond module
> borders. In my personal view not even beyond scopes, because the
> exceptions thrown by an already terminated stream can be misleading.
> Every user can feel free to use the StreamSupport class whenever the
> application code uses streams.
>
> Another thing, I would like to see a graph view inside Fuseki UI, where
> the triples are visualized like in
> <https://www.w3.org/2018/09/rdf-data-viz/>
> I use such visualization whenever I explain the usage of our graph data
> model to my colleagues, whenever they have problems developing SPARQL
> queries. Also, I feel that it would help our support team to see why
> the system performs unexpected steps.
>
> Merlin
>
>

Re: Jena next (AFS)

Posted by Merlin Bögershausen <me...@rwth-aachen.de>.
Hi,

To the Stream discussion: Streams should not be passed beyond module 
borders. In my personal view not even beyond scopes, because the 
exceptions thrown by an already terminated stream can be misleading.
Every user can feel free to use the StreamSupport class whenever the 
application code uses streams.
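
For reference, the failure mode in question looks like this with a plain
JDK stream. StreamReuse is a hypothetical name; the sketch only shows that
a second terminal operation on the same Stream throws
IllegalStateException, wherever in the code that second use happens to be.

```java
import java.util.stream.Stream;

final class StreamReuse {
    // Returns the message of the exception raised by reusing a consumed
    // stream, or null if no exception was thrown.
    static String reuse() {
        Stream<String> s = Stream.of("a", "b");
        s.count();               // terminal operation: the stream is now consumed
        try {
            s.count();           // second terminal operation on the same stream
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(reuse());
    }
}
```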

Another thing, I would like to see a graph view inside Fuseki UI, where 
the triples are visualized like in 
<https://www.w3.org/2018/09/rdf-data-viz/>
I use such visualization whenever I explain the usage of our graph data
model to my colleagues, whenever they have problems developing SPARQL
queries. Also, I feel that it would help our support team to see why
the system performs unexpected steps.

Merlin


Re: Jena next (AFS)

Posted by Andy Seaborne <an...@apache.org>.

On 17/11/2019 23:07, Claude Warren wrote:
> I am a bit concerned about Streams.
> 
> I am working with some large scale streams from stored objects in another
> project and keep coming up against stack overflow issues when attempting to
> merge them or convert from iterators.  Perhaps I have not done it
> correctly but the iterator approach seems cleaner when you don't have or
> can't have all the data in memory at once.

Interesting - have you got some stackOverflow links?

https://www.beyondjava.net/performance-java-8-lambdas
discusses stream costs a bit (and, interestingly, lambdas are not the issue).

Some of the speed reports of stream() are comparing streams with the
loops they replace - and that clearly is going to have an impact if the
loop body is small, let alone the fact the JIT probably knows how to do
magic with plain loops.

Maybe, sometime, streams() will get JIT attention.

> 
> We might consider switching from the Jena specific iterators to
> commons-collections4 (perhaps contributing some additions there).

Iter has nearly all the stream functions whereas commons-collections4 is
a package of iterator classes that have to be nested.

So Iter can be used like streams for nice code (and is more complete
than ExtendedIterator) and makes switching to streams easy: not zero work,
but the syntax work needed is lessened.  In fact - that's a good goal for Iter.

(The missing bits are "collect", because Iter has Iter.toList and
Iter.toSet - I do find Stream.collect([Collectors.]toList) and the
absence of the direct form a bit odd. Adding Iter.collect is no big
deal, though, for completeness.)

     Andy

> 
> Claude
> 

Re: Jena next (AFS)

Posted by Claude Warren <cl...@xenei.com>.
I am a bit concerned about Streams.

I am working with some large scale streams from stored objects in another
project and keep coming up against stack overflow issues when attempting to
merge them or convert from iterators.  Perhaps I have not done it
correctly but the iterator approach seems cleaner when you don't have or
can't have all the data in memory at once.

We might consider switching from the Jena specific iterators to
commons-collections4 (perhaps contributing some additions there).

Claude


