Posted to users@jena.apache.org by "A. Soroka" <aj...@virginia.edu> on 2016/03/04 02:36:06 UTC

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

I’m confused about two of your points here. Let me separate them out so we can discuss them easily.

1) "writes are not supported":

Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph and DatasetGraph are the basic abstractions implemented by Jena’s own out-of-the-box implementations of RDF storage. Can you explain what you mean by this?
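The write surface described here can be sketched in miniature. The following is a Jena-free stand-in — plain strings and a record rather than Jena's Node/Quad types — illustrating the add/delete/deleteAny shape, not the real API:

```java
import java.util.HashSet;
import java.util.Set;

// Jena-free sketch of the DatasetGraph write surface: add, delete, and a
// pattern-based deleteAny. Quad here is a stand-in record, not Jena's Quad.
public class QuadStoreDemo {
    record Quad(String g, String s, String p, String o) {}

    static class QuadStore {
        private final Set<Quad> quads = new HashSet<>();

        void add(Quad q)    { quads.add(q); }
        void delete(Quad q) { quads.remove(q); }

        // deleteAny: remove every quad matching a pattern; null plays ANY.
        void deleteAny(String g, String s, String p, String o) {
            quads.removeIf(q -> slot(g, q.g()) && slot(s, q.s())
                             && slot(p, q.p()) && slot(o, q.o()));
        }

        int size() { return quads.size(); }

        private static boolean slot(String pattern, String value) {
            return pattern == null || pattern.equals(value);
        }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        QuadStore store = new QuadStore();
        store.add(new Quad("g1", "s1", "p1", "o1"));
        store.add(new Quad("g1", "s1", "p2", "o2"));
        store.add(new Quad("g2", "s2", "p1", "o1"));
        store.deleteAny("g1", null, null, null);  // clear graph g1
        check(store.size() == 1);                 // only the g2 quad remains
    }
}
```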

2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand triple caching algorithm":

The subtypes of TupleTable with which you are working have exactly the same kinds of find() methods. Why are they not problematic in that context?

---
A. Soroka
The University of Virginia Library

> On Mar 3, 2016, at 5:47 AM, Joint <da...@gmail.com> wrote:
> 
> 
> 
> Hi Andy.
> I implemented the entire SPI at the DatasetGraph and Graph level. It got to the point where I had overridden more methods than not. In addition writes are not supported, and contains methods which call find(ANY, ANY, ANY) play havoc with an on-demand triple caching algorithm! ;-) I'm using the TriTable because it fits, and quads are spoofed via a triple-to-quad iterator.
> I have a set of filters and handles against which the find triple is compared; it is either passed straight to the TriTable, if the triple has been handled before, or it's passed to the appropriate handle, which adds the triples to the TriTable and then calls the find. As the underlying data is a tree, a cache depth can be set which allows related triples to be cached. Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
> Would you consider a generic version for the Jena code base?
> 
> 
> Dick
> 
> -------- Original message --------
> From: Andy Seaborne <an...@apache.org> 
> Date: 18/02/2016  6:31 pm  (GMT+00:00) 
> To: users@jena.apache.org 
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>  DatasetGraphInMemory 
> 
> Hi,
> 
> I'm not seeing how tapping into the implementation of 
> DatasetGraphInMemory is going to help (though the details might change that).
> 
> As well as the DatasetGraphMap approach, one other thought that occurred 
> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph 
> implementation.
> 
> It loads, and clears, the mapped graph on-demand, and passes the find() 
> call through to the now-setup data.
> 
> 	Andy
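Andy's wrapper idea quoted above can be sketched as follows — a Jena-free stand-in (the DatasetGraph interface here is a simplified substitute for Jena's DatasetGraphWrapper/DatasetGraph, and the loader function is hypothetical), showing a graph loaded on first find() with the call then passing through to the now-populated store:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of the on-demand wrapper: load a graph's tuples on first find(),
// then delegate to the wrapped dataset. Simplified stand-in types, not Jena's.
public class OnDemandWrapperDemo {
    record Quad(String g, String s, String p, String o) {}

    interface DatasetGraph {
        void add(Quad q);
        List<Quad> find(String g);   // find by graph name, simplified
    }

    static class MapDatasetGraph implements DatasetGraph {
        final Map<String, List<Quad>> graphs = new HashMap<>();
        public void add(Quad q) {
            graphs.computeIfAbsent(q.g(), k -> new ArrayList<>()).add(q);
        }
        public List<Quad> find(String g) {
            return graphs.getOrDefault(g, List.of());
        }
    }

    static class OnDemandWrapper implements DatasetGraph {
        final DatasetGraph wrapped;
        final Set<String> loaded = new HashSet<>();
        final Function<String, List<Quad>> loader;  // hypothetical source loader

        OnDemandWrapper(DatasetGraph wrapped, Function<String, List<Quad>> loader) {
            this.wrapped = wrapped;
            this.loader = loader;
        }
        public void add(Quad q) { wrapped.add(q); }
        public List<Quad> find(String g) {
            if (loaded.add(g))                     // first touch: load the graph
                loader.apply(g).forEach(wrapped::add);
            return wrapped.find(g);                // pass through to set-up data
        }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        int[] loads = {0};
        OnDemandWrapper dsg = new OnDemandWrapper(new MapDatasetGraph(), g -> {
            loads[0]++;                            // count backing-store loads
            return List.of(new Quad(g, "s", "p", "o"));
        });
        check(dsg.find("urn:example:g").size() == 1);
        check(dsg.find("urn:example:g").size() == 1);  // served from cache
        check(loads[0] == 1);                          // loader ran only once
    }
}
```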
> 
> On 16/02/16 17:42, A. Soroka wrote:
>>> Based on your description the DatasetGraphInMemory would seem to match the dynamic load requirement. How did you foresee it being loaded? Is there a large over head to using the add methods?
>> 
>> No, I certainly did not mean to give that impression, and I don’t think it is entirely accurate. DSGInMemory was definitely not at all meant for dynamic loading. That doesn’t mean it can’t be used that way, but that was not in the design, which assumed that all tuples take about the same amount of time to access and that all of the same type are coming from the same implementation (in a QuadTable and a TripleTable).
>> 
>> The overhead of mutating a dataset is mostly inside the implementations of TupleTable that are actually used to store tuples. You should be aware that TupleTable extends TransactionalComponent, so if you want to use it to create some kind of connection to your storage, you will need to make that connection fully transactional. That doesn’t sound at all trivial in your case.
>> 
>> At this point it seems to me that extending DatasetGraphMap (and implementing GraphMaker and Graph instead of TupleTable) might be a more appropriate design for your work. You can put dynamic loading behavior in Graph (or a GraphView subtype) just as easily as in TupleTable subtypes. Are there reasons around the use of transactionality in your work that demand the particular semantics supported by DSGInMemory?
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>>> On Feb 13, 2016, at 5:18 AM, Joint <da...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> Hi.
>>> The quick full scenario is a distributed DaaS which supports queries, updates, transforms and bulkloads. Andy Seaborne knows some of the detail because I spoke to him previously. We achieve multiple writes by having parallel Datasets, both traditional TDB and on demand in memory. Writes are sent to a free dataset, free being not in a write transaction. That's a simplistic overview...
>>> Queries are handled by a dataset proxy which builds a dynamic dataset based on the graph URIs. For example the graph URI urn:Iungo:all causes the proxy find method to issue the query to all known Datasets and return the union of results. Various dataset proxies exist, some load TDBs, others load TTL files into graphs, others dynamically create tuples. The common thing being they are all presented as Datasets backed by DatasetGraph. Thus a SPARQL query can result in multiple Datasets being loaded to satisfy the query.
>>> Nodes can be preloaded which then load Datasets to satisfy finds. This way the system can be scaled to handle increased work loads. Also specific nodes can be targeted to specific hardware.
>>> When a graph URI is encountered the proxy can interpret its structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository foo to be dynamically loaded into memory along with the quads which are required to satisfy the find.
>>> Typically a group of people will be working on a set of data so the first to query will load the dataset then it will be accessed multiple times. There will be an initial dynamic load of data which will tail off with some additional loading over time.
>>> Based on your description the DatasetGraphInMemory would seem to match the dynamic load requirement. How did you foresee it being loaded? Is there a large over head to using the add methods?
>>> A typical scenario would be to search all SDAI repositories for some key information then load detailed information in some, continuing to drill down.
>>> Hope this helps.
>>> I'm going to extend the hex and tri tables and run some tests. I've already shimmed the DGTriplesQuads so the actual caching code already exists and should be easy to hook on.
>>> Dick
>>> 
>>> -------- Original message --------
>>> From: "A. Soroka" <aj...@virginia.edu>
>>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
>>> To: users@jena.apache.org
>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory
>>> 
>>> Okay, I’m more confident at this point that you’re not well served by DatasetGraphInMemory, which has very strong assumptions about the speedy reachability of data. DSGInMemory was built for situations when all of the data is in core memory and multithreaded access is important. If you have a lot of core memory and can load the data fully, you might want to use it, but that doesn’t sound at all like your case. Otherwise, as far as what the right extension point is, I will need to defer to committers or more experienced devs, but I think you may need to look at DatasetGraph from a more close-to-the-metal point of view. TDB extends DatasetGraphTriplesQuads directly, for example.
>>> 
>>> Can you tell us a bit more about your full scenario? I don’t know much about STEP (sorry if others do)— is there a canonical RDF formulation? What kinds of queries are you going to be using with this data? How quickly are users going to need to switch contexts between datasets?
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>>> On Feb 12, 2016, at 2:44 PM, Joint <da...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Thanks for the fast response!
>>>>     I have a set of disk-based binary SDAI repositories which are based on ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can be +1Gb. However after processing into a SDAI binary I typically see a size reduction, e.g. a 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by 1000's of similar sized STEP files...
>>>> Typically only a small subset of the STEP file needs to be queried but sometimes other parts need to be queried. Hence the on-demand caching and DatasetGraphInMemory. The aim is that in the find methods I check a cache and, on a cache miss, call the native SDAI find methods based on the node URIs, calling the add methods for the minted tuples, then passing on the call to the super find. The underlying SDAI repositories are static so once a subject is cached no other work is required.
>>>> As the DatasetGraphInMemory is commented as very fast quad and triple access it seemed a logical place to extend. The shim cache would be set to expire entries and limit the total number of tuples per repository. This is currently deployed on a 256GB RAM device.
>>>> In the bigger picture I have a service very similar to Fuseki which allows SPARQL requests to be made against Datasets which are either TDB or SDAI cache backed.
>>>> What was DatasetGraphInMemory created for..? ;-)
>>>> Dick
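The cache-and-mint find described in this message can be sketched as below — simplified stand-in types, with a hypothetical nativeFind in place of the real SDAI calls:

```java
import java.util.*;

// Sketch of the find-shim: remember which find patterns have already been
// materialized; on a miss, a hypothetical native lookup mints triples into
// the table before the real find runs. Names are illustrative, not Jena's.
public class FindShimDemo {
    record Triple(String s, String p, String o) {}

    static class CachingGraph {
        final Set<Triple> table = new HashSet<>();          // stands in for TriTable
        final Set<String> materialized = new HashSet<>();   // find patterns seen
        int nativeCalls = 0;

        List<Triple> find(String s, String p, String o) {
            String pattern = s + "|" + p + "|" + o;
            if (materialized.add(pattern)) {                // cache miss
                nativeCalls++;
                table.addAll(nativeFind(s, p, o));          // mint and store
            }
            return table.stream()
                        .filter(t -> matches(s, t.s()) && matches(p, t.p())
                                  && matches(o, t.o()))
                        .toList();
        }

        // Hypothetical native source; a real one would call into SDAI storage.
        List<Triple> nativeFind(String s, String p, String o) {
            return List.of(new Triple("urn:i:100", "rdf:type", "urn:e:slab"));
        }

        static boolean matches(String pattern, String slot) {
            return "ANY".equals(pattern) || pattern.equals(slot);
        }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        CachingGraph g = new CachingGraph();
        g.find("ANY", "rdf:type", "urn:e:slab");
        g.find("ANY", "rdf:type", "urn:e:slab");   // second call hits the cache
        check(g.nativeCalls == 1);                 // native source hit only once
        check(g.table.size() == 1);
    }
}
```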
>>>> 
>>>> -------- Original message --------
>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
>>>> To: users@jena.apache.org
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory
>>>> 
>>>> I wrote the DatasetGraphInMemory  code, but I suspect your question may be better answered by other folks who are more familiar with Jena's DatasetGraph implementations, or may actually not have anything to do with DatasetGraph (see below for why). I will try to give some background information, though.
>>>> 
>>>> There are several paths by which DatasetGraphInMemory can perform finds, but they come down to two places in the code, QuadTable::find and TripleTable::find, and in default operation, the concrete forms:
>>>> 
>>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>> 
>>>> for Quads and
>>>> 
>>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>> 
>>>> for Triples. Those methods are reused by all the differently-ordered indexes within Hex- or TriTable, each of which will answer a find by selecting an appropriately-ordered index based on the fixed and variable slots in the find pattern and using the concrete methods above to stream tuples back.
>>>> 
>>>> As to why you are seeing your methods called in some places and not in others: DatasetGraphBaseFind features methods like findInDftGraph(), findInSpecificNamedGraph(), findInAnyNamedGraphs(), etc., and these are the methods that DatasetGraphInMemory implements. DSGInMemory does not make a selection between those methods— that is done by DatasetGraphBaseFind. So that is where you will find the logic that should answer your question.
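A simplified sketch of that dispatch — the method names follow DatasetGraphBaseFind and the two urn:x-arq: graph names are Jena's, but the routing logic here is a stand-in:

```java
// Stand-in sketch of DatasetGraphBaseFind's routing: the base class inspects
// the graph slot of find(g,s,p,o) and dispatches to a specialized method.
public class FindDispatchDemo {
    static final String ANY     = "ANY";
    static final String DEFAULT = "urn:x-arq:DefaultGraph";
    static final String UNION   = "urn:x-arq:UnionGraph";

    // Returns the name of the specialized find that would handle the call.
    static String find(String g) {
        if (g == null || g.equals(DEFAULT)) return findInDftGraph();
        if (g.equals(ANY))                  return findInAnyNamedGraphs();
        if (g.equals(UNION))                return findInUnionGraph();
        return findInSpecificNamedGraph(g);
    }

    static String findInDftGraph()                   { return "dft"; }
    static String findInAnyNamedGraphs()             { return "anyNamed"; }
    static String findInUnionGraph()                 { return "union"; }
    static String findInSpecificNamedGraph(String g) { return "named:" + g; }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        check(find(null).equals("dft"));              // default graph query
        check(find("ANY").equals("anyNamed"));        // graph ?g { ... }
        check(find("urn:g").equals("named:urn:g"));   // graph <urn:g> { ... }
    }
}
```

This is why overriding only one find(...) overload catches some SPARQL shapes and not others: the selection happens a level above the method you overrode.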
>>>> 
>>>> Can you say a little more about your use case? You seem to have some efficient representation in memory of your data (I hope it is in-memory— otherwise it is a very bad choice to subclass DSGInMemory) and you want to create tuples on the fly as queries are received. That is really not at all what DSGInMemory is for (DSGInMemory is using map structures for indexing and in default mode, uses persistent data structures to support transactionality). I am wondering whether you might not be much better served by tapping into Jena at a different place, perhaps implementing the Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just implementing Quad- and TripleTable and using the constructor DatasetGraphInMemory(final QuadTable i, final TripleTable t).
>>>> 
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>> 
>>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <da...@gmail.com> wrote:
>>>>> 
>>>>> Hi.
>>>>> 
>>>>> Does anyone know the "find" paths through DatasetGraphInMemory please?
>>>>> 
>>>>> For example if I extend DatasetGraphInMemory and override
>>>>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on "select
>>>>> * where {?s ?p ?o}" however if I override the other
>>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
>>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
>>>>> calling (but as I type I'm guessing it's optimised to return the HexTable
>>>>> nodes...).
>>>>> 
>>>>> Would I be better off overriding HexTable and TriTable classes find methods
>>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
>>>>> one of these methods?
>>>>> 
>>>>> I need to know the root find methods so that I can shim them to create
>>>>> triples/quads before they perform the find.
>>>>> 
>>>>> I need to create Triples/Quads on demand (because a bulk load would create
>>>>> ~100M triples but only ~1000 are ever queried) and the source binary form
>>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
>>>>> than quads.
>>>>> 
>>>>> Regards Dick Murray.
>>>> 
>>> 
>> 
> 


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by "A. Soroka" <aj...@virginia.edu>.
On this particular point, there has been such discussion recently:

http://markmail.org/message/wo5r3edi7xzt7zmx
http://markmail.org/message/hxao4izpiv7quumv

but no action that I know of. (Claude Warren would know more than me.)

---
A. Soroka
The University of Virginia Library

> On Mar 10, 2016, at 3:10 PM, Dick Murray <da...@gmail.com> wrote:
> 
> On the subject of storage is there any thought to providing granular locking, DSG, per graph, dirty..?


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by Andy Seaborne <an...@apache.org>.
On 10/03/16 20:10, Dick Murray wrote:
> Hi. Yes re TriTable and TripleTable. I too like the storage interface which
> would work for my needs and make life simpler. A few points from me.
> Currently I wrap an existing dsg and cache the additional tuples into what
> I call the deferred DSG or DDSG. The finds return a DSG iterator and a DDSG
> iterator.
>
> The DDSG is in memory and I have a number of concrete classes which achieve
> the same end.
>
> Firstly I use a Jena core mem DSG and the find handles just add tuples as
> required into the HexTable. I don't have a default graph, i.e. it's
> never referenced, because I need a graph URI to find the deferred data.
>
> Secondly, in common I have a concurrent map which records
> what graphs have been deferred; then I either use TriTable or a concurrent
> set of tuples to store the graph contents. When I'm using the TriTable I
> acquire the write lock and add tuples, so writes can occur in parallel to
> different graphs. I've experimented with the concurrent set by spoofing the
> write and just adding the tuples, i.e. no write lock contention per graph. I
> notice the DatasetGraphStorage

????

> does not support txn abort? This gives an
> in memory DSG which doesn't have lock contention because it never locks...
> This is applicable in some circumstances and I think that the right
> deferred tuples is one of them?
>
> I also coded a DSG which supports a reentrant RW lock with upgrade, which
> allowed me to combine the two DSGs because I could promote the read lock.
>
> Andy, I notice your code has a txn interface with a read-to-write promotion
> indicator? Is an upgrade method being considered for the txn interface?
> That was an issue I hit and why I have two DSGs. Code further up
> the stack calls a txn read but a cache miss needs a write to persist the
> new tuples.
>
> A dynamic adapter would support a defined set of handles, and the find would
> be shimmed to check if any tuples need to be added. We could define a
> set of interfaces to achieve this, which shouldn't be too difficult.
>
> On the subject of storage is there any thought to providing granular
> locking, DSG, per graph, dirty..?
>
> Dick

Per graph indexing only makes sense if the graphs are held separately. 
A quad table isn't going to work very well because some quads are in one 
graph and some in another yet all in the same index structure.

So a ConcurrentHashMap holding separate graphs (cf. what is now called 
DatasetGraphMapLink) would seem to make sense. 
Contributions welcome.
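A minimal sketch of that map-of-graphs arrangement, with stand-in types — each graph guards itself, so writers to different graphs don't contend on a shared index:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the map-of-graphs idea: each named graph is held separately in a
// ConcurrentHashMap, so writes to different graphs contend only on their own
// graph's lock. Simplified stand-in types, not Jena's DatasetGraphMapLink.
public class MapOfGraphsDemo {
    record Triple(String s, String p, String o) {}

    static class Graph {
        private final Set<Triple> triples = new HashSet<>();
        synchronized void add(Triple t) { triples.add(t); }   // per-graph lock
        synchronized int size() { return triples.size(); }
    }

    static class MapDataset {
        private final ConcurrentMap<String, Graph> graphs = new ConcurrentHashMap<>();
        // computeIfAbsent is atomic: two threads racing on the same name
        // still get the same Graph instance.
        Graph graph(String name) { return graphs.computeIfAbsent(name, n -> new Graph()); }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) throws Exception {
        MapDataset ds = new MapDataset();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Parallel writers, each targeting its own graph.
        for (int g = 0; g < 4; g++) {
            final String name = "urn:g" + g;
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++)
                    ds.graph(name).add(new Triple("s" + i, "p", "o"));
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        for (int g = 0; g < 4; g++)
            check(ds.graph("urn:g" + g).size() == 1000);
    }
}
```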

Transaction promotion is an interestingly tricky thing - it can mean a 
system has to cause aborts or lower the isolation guarantees. (e.g. Txn1 
starts Read, Txn2 starts write-updates-commits, Txn1 continues, can't 
see Txn2's changes (note it may be before or after Txn2 ran), Txn1 attempts 
to promote to a W transaction.)  Read-committed leads to non-repeatable 
reads (things like count() go wrong, for example).
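That count() anomaly is easy to simulate with a toy versioned store — a stand-in illustration, not Jena code:

```java
import java.util.*;

// Toy illustration of the isolation point: under read-committed, a reader
// that re-runs count() after a concurrent commit sees a different answer
// (a non-repeatable read); a snapshot taken at transaction start does not.
public class NonRepeatableReadDemo {
    static class Store {
        final List<String> committed = new ArrayList<>();
        int count() { return committed.size(); }            // read-committed view
        List<String> snapshot() { return List.copyOf(committed); }
        void writeAndCommit(String x) { committed.add(x); }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        Store store = new Store();
        store.writeAndCommit("t1");

        List<String> txn1Snapshot = store.snapshot(); // Txn1 starts a read
        int first = store.count();                    // Txn1: count() == 1

        store.writeAndCommit("t2");                   // Txn2: write and commit

        int second = store.count();                   // Txn1 re-reads: now 2
        check(first == 1 && second == 2);             // non-repeatable read
        check(txn1Snapshot.size() == 1);              // snapshot stays stable
    }
}
```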

When you say "your code has a txn interface" I take it you mean non-Jena code?

That all said, this sounds like a simpler case - just because a read 
transaction needs to update internal caches does not mean it's the fully 
general case of transaction promotion.  A lock and weaker isolation may do.

	Andy





Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by Dick Murray <da...@gmail.com>.
Hi. Yes re TriTable and TripleTable. I too like the storage interface which
would work for my needs and make life simpler. A few points from me.
Currently I wrap an existing dsg and cache the additional tuples into what
I call the deferred DSG or DDSG. The finds return a DSG iterator and a DDSG
iterator.

The DDSG is in memory and I have a number of concrete classes which achieve
the same end.

Firstly I use a Jena core mem DSG and the find handles just add tuples as
required into the HexTable. I don't have a default graph, i.e. it's
never referenced, because I need a graph URI to find the deferred data.

Secondly, in common I have a concurrent map which records
what graphs have been deferred; then I either use TriTable or a concurrent
set of tuples to store the graph contents. When I'm using the TriTable I
acquire the write lock and add tuples, so writes can occur in parallel to
different graphs. I've experimented with the concurrent set by spoofing the
write and just adding the tuples, i.e. no write lock contention per graph. I
notice the DatasetGraphStorage does not support txn abort? This gives an
in-memory DSG which doesn't have lock contention because it never locks...
This is applicable in some circumstances and I think that the right
deferred tuples is one of them?

I also coded a DSG which supports a reentrant RW lock with upgrade, which
allowed me to combine the two DSGs because I could promote the read lock.

Andy, I notice your code has a txn interface with a read-to-write promotion
indicator? Is an upgrade method being considered for the txn interface?
That was an issue I hit and why I have two DSGs. Code further up
the stack calls a txn read but a cache miss needs a write to persist the
new tuples.

A dynamic adapter would support a defined set of handles, and the find would
be shimmed to check if any tuples need to be added. We could define a
set of interfaces to achieve this, which shouldn't be too difficult.

On the subject of storage is there any thought to providing granular
locking, DSG, per graph, dirty..?

Dick


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by "A. Soroka" <aj...@virginia.edu>.
Just a point of info: I'm _pretty_ sure that we're talking about
TripleTable, not TriTable. TriTable is an impl class (implementing
TripleTable) that uses three TripleTables to index, well, triples.
TripleTable (and its sibling, QuadTable) are the interfaces that, I think,
we are interested in possibly generalizing and making more public.

As Andy knows, I tried hard to unify Triple- and QuadTable under a
supertype TupleTable, but the fact is that Java doesn't really do variable
arity very well and we didn't want to mess with very core types like Quad
or Triple, so the method dealing with tuples by elements (::find) stayed in
the specialization, but methods dealing with the tuple as a whole (e.g.
::add) got pushed up. I think Andy has done a nice job below bringing
everything together in a simple, straightforward way.
org.apache.jena.sparql.core.mem could be rewritten very quickly to use this
instead of the current types, if that's any evidence.
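The split described here — whole-tuple operations pushed up to a supertype, by-slot find kept in the arity-specific subtypes — can be sketched with stand-in types like so:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the Triple-/QuadTable split: operations on a whole tuple (add,
// all) sit on a generic TupleTable supertype, while the by-slot find, whose
// signature depends on arity, stays on the specializations. Stand-in types.
public class TupleTableDemo {
    record Triple(String s, String p, String o) {}
    record Quad(String g, String s, String p, String o) {}

    // Whole-tuple operations: arity-independent, so they can be shared.
    interface TupleTable<T> {
        void add(T tuple);
        Stream<T> all();
    }

    // By-slot find: the parameter list depends on arity, so it stays here.
    interface TripleTable extends TupleTable<Triple> {
        Stream<Triple> find(String s, String p, String o);
    }
    interface QuadTable extends TupleTable<Quad> {
        Stream<Quad> find(String g, String s, String p, String o);
    }

    static class SimpleTripleTable implements TripleTable {
        private final Set<Triple> triples = new HashSet<>();
        public void add(Triple t) { triples.add(t); }
        public Stream<Triple> all() { return triples.stream(); }
        public Stream<Triple> find(String s, String p, String o) {
            return all().filter(t -> slot(s, t.s()) && slot(p, t.p())
                                  && slot(o, t.o()));
        }
        private static boolean slot(String pattern, String value) {
            return pattern == null || pattern.equals(value);  // null = ANY
        }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        SimpleTripleTable table = new SimpleTripleTable();
        table.add(new Triple("s1", "p", "o"));
        table.add(new Triple("s2", "p", "o"));
        check(table.find(null, "p", null).count() == 2);
        check(table.find("s1", null, null).count() == 1);
    }
}
```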


---
A. Soroka
The University of Virginia Library

> On Mar 10, 2016, at 7:08 AM, Andy Seaborne <an...@apache.org> wrote:
>
> Hi Dick,
>
> Thanks for the details.
>
> So TriTable is used as the internal implementation of a caching
> read-only graph and you're using the loop form for GRAPH (and often the
> loop is one URI - i.e. directed to one part of the data).  Using
> TriTable is because it's a convenient triple storage for the use case.
>
> The two interesting pieces to Jena:
>
> 1/ support for writing dynamic adapters
>
> 2/ a graph (DatasetGraph) implementation that more clearly has an
> interface for storage.
>
>
> On the latter: I've come across this before and sketched this interface.
>
> It's nothing more than a first pass sketch.  Is this the sort of thing
> that might work for your use case? (a graph storage version with quads
> over the top as a subcase):
>
> interface StorageRDF {
>    default void add(Triple triple) { .... }
>    default void add(Quad quad)     { .... }
>
>    default void delete(Triple triple)  { .... }
>    default void delete(Quad quad)      { .... }
>
>    void add(Node s, Node p, Node o) ;
>    void add(Node g, Node s, Node p, Node o) ;
>
>    void delete(Node s, Node p, Node o) ;
>    void delete(Node g, Node s, Node p, Node o) ;
>
>    /** Delete all triples matching a {@code find}-like pattern */
>    void removeAll(Node s, Node p, Node o) ;
>    /** Delete all quads matching a {@code find}-like pattern */
>    void removeAll(Node g, Node s, Node p, Node o) ;
>
>    // NB Quads
>    Stream<Quad>   findDftGraph(Node s, Node p, Node o) ;
>    Stream<Quad>   findUnionGraph(Node s, Node p, Node o) ;
>    Stream<Quad>   find(Node g, Node s, Node p, Node o) ;
>    // For findUnion.
>    Stream<Quad>   findDistinct(Node g, Node s, Node p, Node o) ;
>
>    // triples
>    Stream<Triple> find(Node s, Node p, Node o) ;
>
> //    default Stream<Triple> find(Node s, Node p, Node o) {
> //        return findDftGraph(s,p,o).map(Quad::asTriple) ;
> //    }
>
> //    Iterator<Quad>   findUnionGraph(Node s, Node p, Node o) ;
> //    Iterator<Quad>   find(Node g, Node s, Node p, Node o) ;
>
>
>    // contains
>
>    default boolean contains(Node s, Node p, Node o)
>    { return find(s,p,o).findAny().isPresent() ; }
>    default boolean contains(Node g, Node s, Node p, Node o)
>    { return find(g,s,p,o).findAny().isPresent() ; }
>
>    // Prefixes ??
> }
>
>
> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2
> also has the companion DatasetGraphStorage.
>
>    Andy
>
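A minimal in-memory implementation of the StorageRDF shape sketched above, using stand-in string slots instead of Jena's Node, mainly to show how the default contains() methods fall out of find():

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory implementation of the StorageRDF shape from the sketch
// above, with stand-in types (string slots, a Quad record) instead of Jena's.
public class StorageRdfDemo {
    record Quad(String g, String s, String p, String o) {}
    static final String ANY = null;  // null slot = wildcard

    interface StorageRDF {
        void add(String g, String s, String p, String o);
        Stream<Quad> find(String g, String s, String p, String o);

        // The default contains() comes straight from find(), as in the sketch:
        // find(...).findAny().isPresent().
        default boolean contains(String g, String s, String p, String o) {
            return find(g, s, p, o).findAny().isPresent();
        }
    }

    static class MemStorage implements StorageRDF {
        private final Set<Quad> quads = new HashSet<>();
        public void add(String g, String s, String p, String o) {
            quads.add(new Quad(g, s, p, o));
        }
        public Stream<Quad> find(String g, String s, String p, String o) {
            return quads.stream().filter(q -> slot(g, q.g()) && slot(s, q.s())
                                           && slot(p, q.p()) && slot(o, q.o()));
        }
        private static boolean slot(String pattern, String value) {
            return pattern == ANY || pattern.equals(value);
        }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) {
        MemStorage store = new MemStorage();
        store.add("urn:g", "urn:s", "urn:p", "urn:o");
        check(store.contains("urn:g", ANY, ANY, ANY));
        check(!store.contains("urn:other", ANY, ANY, ANY));
        check(store.find(ANY, ANY, ANY, ANY).count() == 1);
    }
}
```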
>
>
> On 04/03/16 12:03, Dick Murray wrote:
>> LOL. The perils of a succinct update with no detail!
>>
>> I understand the Jena SPI supports read/writes via transactions, and I
>> also know that the wrapper classes provide a best effort for some of the
>> overridden methods, which does not always sit well when materializing
>> triples. For example DatasetGraphBase provides public boolean
>> containsGraph(Node graphNode) {return contains(graphNode, Node.ANY,
>> Node.ANY, Node.ANY);} which results in a call to DatasetGraphBaseFind
>> public Iterator<Quad> find(Node g, Node s, Node p, Node o) which might
>> end up with something being called in DatasetGraphInMemory depending on
>> what has been extended and overridden. This causes a problem for me
>> because I shim the finds to decide whether the triples have been
>> materialized before calling the overridden find. After extending
>> DatasetGraphTriplesQuads and DatasetGraphInMemory I realised that I had
>> overridden most of the methods, so I stopped and implemented DatasetGraph
>> and Transactional.
>>
>> In my scenario the underlying data (a vendor agnostic format to get
>> AutoCAD, Bentley, etc to work together) is never changed, so the
>> DatasetGraph need not support writes. Whilst we need to provide semantic
>> access to these files, they result in ~100M triples each if transformed,
>> there are 1000's of files, they can change multiple times per day, and
>> the various disciplines typically only require a subset of triples.
>>
>> That said, in my DatasetGraph implementation if you call
>> begin(ReadWrite.WRITE) it throws a UOE. The same is true for the Graph
>> implementation in that it does not support external writes (throws UOE)
>> but does implement writes internally (via TriTable) because it needs to
>> write the materialized triples to answer the find.
>>
>> So if we take
>>
>> select ?s
>> where {graph <urn:iungo:iso/10303/22/repository/r/model/m> {?s a
>> <urn:iungo:iso/10303/11/schema/s/entity/e>}}
>>
>> Jena via the SPARQL query engine will perform the following abridged
>> process.
>>
>>    - Jena begins a DG read transaction.
>>    - Jena calls DG find(<urn:iungo:iso/10303/22/repository/r/model/m>,
>>      ANY, a, <urn:iungo:iso/10303/11/schema/s/entity/e>).
>>    - DG will;
>>       - check if the repository r has been loaded, i.e. matching the
>>         repository name URI spec fragment to a repository file on disk
>>         and loading it into the SDAI session.
>>       - check if the model m has been loaded, i.e. matching the model
>>         name URI spec fragment to a repository model and loading it into
>>         the SDAI session.
>>          - If we have just loaded the SDAI model, check if there is any
>>            pre-caching to be done, which is just a set of find triples
>>            which are handled as per the normal find detailed following.
>>       - We now have a G which wraps the SDAI model and uses TriTable to
>>         hold materialized triples.
>>    - DG will now call G.find(ANY, a,
>>      <urn:iungo:iso/10303/11/schema/s/entity/e>).
>>    - G will check the find triple against a set of already materialized
>>      find triples and if it misses;
>>       - G will search a set of triple handles which know how to
>>         materialize triples for a given find triple and if found;
>>          - G begins a TriTable write transaction for {ANY, a,
>>            <urn:iungo:iso/10303/11/schema/s/entity/e>} (i.e. the DG & G
>>            are READ but the G TriTable is WRITE);
>>             - Check the find triples again; we might have been in a race
>>               for the find triple and lost...
>>             - Load the correct Java class for entity e, which involves
>>               minting the FQCN using the schema s and entity e, e.g.
>>               ifc2x3 and ifcslab become org.jsdai.ifc2x3.ifcslab.
>>             - Use this to call the SDAI method findInstances(Class<?
>>               extends Entity> entityClass) which returns zero or more
>>               SDAI entities, from which we;
>>                - Query the ifc2x3 schema to list the explicit Entity
>>                  attributes, and for each we add a triple to TriTable,
>>                  e.g. ifcslab:ifcorganization =
>>                  {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>>                  <urn:iungo:iso/10303/11/schema/ifc2x3/entity/ifcslab/attribute/ifcorganization>
>>                  <urn:iungo:iso/10303/21/repository/r/model/m/instance/1>}
>>                - In addition we add the triple
>>                  {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100> a
>>                  <urn:iungo:iso/10303/11/schema/s/entity/e>}.
>>                - If we are creating linked triples (i.e. max depth > 1)
>>                  then for each attribute which has a SDAI entity
>>                  instance value, call the appropriate handle to create
>>                  the triples.
>>             - G commits the TriTable write transaction (make the triples
>>               visible before we update the find triples!).
>>          - G updates the find triples to include;
>>             - {ANY, a <urn:iungo:iso/10303/11/schema/s/entity/e>}
>>             - {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>>               ANY ANY}
>>             - Repeat the above for any linked triples created.
>>          - The TriTable now contains the triples required to answer the
>>            find triple.
>>       - G will return TriTable.find(ANY, a,
>>         <urn:iungo:iso/10303/11/schema/s/entity/e>)
>>    - Jena ends the DG read transaction.
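The "check the find triples again" step above is classic double-checked locking: test the cache outside the write lock, take the lock, test again, and only then materialize. A stand-in sketch of just that step (not the actual TriTable code):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;
import java.util.concurrent.locks.*;

// Double-checked materialization: many threads may race to a find pattern,
// but the re-check under the write lock ensures the expensive minting work
// happens exactly once. Stand-in sketch, not the actual TriTable code.
public class DoubleCheckedMaterializeDemo {
    static class Materializer {
        final Set<String> materializedPatterns = ConcurrentHashMap.newKeySet();
        final ReadWriteLock lock = new ReentrantReadWriteLock();
        final AtomicInteger materializations = new AtomicInteger();

        void ensureMaterialized(String pattern) {
            if (materializedPatterns.contains(pattern)) return;  // fast path
            lock.writeLock().lock();
            try {
                // Re-check: another thread may have won the race meanwhile.
                if (materializedPatterns.contains(pattern)) return;
                materializations.incrementAndGet();  // expensive minting here
                materializedPatterns.add(pattern);   // publish after the work
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    static void check(boolean cond) { if (!cond) throw new AssertionError(); }

    public static void main(String[] args) throws Exception {
        Materializer m = new Materializer();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++)
            pool.submit(() -> m.ensureMaterialized("ANY|rdf:type|urn:e"));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        check(m.materializations.get() == 1);  // minted once despite the race
    }
}
```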
>>
>>
>> Some find triples will result in the appropriate handle being called
>> (handle hit) which will create triples. Others will handle-miss and be
>> passed on to the TriTable find (e.g. no triples created and TriTable
>> will return nothing). A few will result in a UOE, {ANY, ANY, ANY} being
>> an example, because does this mean create all of the triples (+100M) or
>> all of the currently created triples (which relies on having queried
>> what you need to ANY!). Currently we only UOE on {ANY ANY ANY}, and is
>> it really useful to ask this find?
>>
>> Hope that clears up the "writes are not supported" (the underlying data
>> is read only) and why the TupleTable subtypes are not problematic. I
>> could have held the created triples per find triple, but that wouldn't
>> scale with duplication, plus why recreate the wheel when, if I'm not
>> mistaken, TriTable uses the dexx collection, giving subsequent HAMT
>> advantages, which is what a high performance in memory implementation
>> requires. The solution is working and, compared to a fully transformed
>> TDB, is giving the correct results. To-do might include timing out the
>> G's when they have not been accessed for a period of time...
>>
>> Finally, having written the wrapper I thought it wouldn't be used
>> anywhere else, but subsequently it was used to abstract an existing
>> system where ad hoc semantic access was required, and it's lined up to
>> do a similar task on two other data silos. Hence the question to Andy
>> regarding a Jena cached SPI package.
>>
>> Thanks again for your help Adam/Andy.
>>
>> Dick.
>>
>>
>>
>> On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:
>>
>>> I’m confused about two of your points here. Let me separate them out so
we
>>> can discuss them easily.
>>>
>>> 1) "writes are not supported”:
>>>
>>> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
>>> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
>>> and DatasetGraph are the basic abstractions implemented by Jena’s own
>>> out-of-the-box implementations of RDF storage. Can you explain what you
>>> mean by this?
>>>
>>> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
>>> triple caching algorithm”:
>>>
>>> The subtypes of TupleTable with which you are working have exactly the
>>> same kinds of find() methods. Why are they not problematic in that
context?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>>> On Mar 3, 2016, at 5:47 AM, Joint <da...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Hi Andy.
>>>> I implemented the entire SPI at the DatasetGraph and Graph level. It
got
>>> to the point where I had overridden more methods than not. In addition
>>> writes are not supported and contains methods which call find(ANY, ANY,
>>> ANY) play havoc with an on demand triple caching algorithm! ;-) I'm
using
>>> the TriTable because it fits and quads are spoofed via triple to quad
>>> iterator.
>>>> I have a set of filters and handles which the find triple is compared
>>> against and either passed straight to the TriTable if the triple has
been
>>> handled before or its passed to the appropriate handle which adds the
>>> triples to the TriTable then calls the find. As the underlying data is a
>>> tree a cache depth can be set which allows related triples to be cached.
>>> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
>>>> Would you consider a generic version for the Jena code base?
>>>>
>>>>
>>>> Dick
>>>>
>>>> -------- Original message --------
>>>> From: Andy Seaborne <an...@apache.org>
>>>> Date: 18/02/2016  6:31 pm  (GMT+00:00)
>>>> To: users@jena.apache.org
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>>>  DatasetGraphInMemory
>>>>
>>>> Hi,
>>>>
>>>> I'm not seeing how tapping into the implementation of
>>>> DatasetGraphInMemory is going to help (through the details
>>>>
>>>> As well as the DatasetGraphMap approach, one other thought that occurred
>>>> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
>>>> implementation.
>>>>
>>>> It loads, and clears, the mapped graph on-demand, and passes the find()
>>>> call through to the now-setup data.
>>>>
>>>>       Andy
>>>>
>>>> On 16/02/16 17:42, A. Soroka wrote:
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large overhead to using the add methods?
>>>>>
>>>>> No, I certainly did not mean to give that impression, and I don’t think
>>> it is entirely accurate. DSGInMemory was definitely not at all meant for
>>> dynamic loading. That doesn’t mean it can’t be used that way, but that was
>>> not in the design, which assumed that all tuples take about the same amount
>>> of time to access and that all of the same type are coming from the same
>>> implementation (in a QuadTable and a TripleTable).
>>>>>
>>>>> The overhead of mutating a dataset is mostly inside the implementations
>>> of TupleTable that are actually used to store tuples. You should be aware
>>> that TupleTable extends TransactionalComponent, so if you want to use it to
>>> create some kind of connection to your storage, you will need to make that
>>> connection fully transactional. That doesn’t sound at all trivial in your
>>> case.
>>>>>
>>>>> At this point it seems to me that extending DatasetGraphMap (and
>>> implementing GraphMaker and Graph instead of TupleTable) might be a more
>>> appropriate design for your work. You can put dynamic loading behavior in
>>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
>>> Are there reasons around the use of transactionality in your work that
>>> demand the particular semantics supported by DSGInMemory?
>>>>>
>>>>> ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>>
>>>>>> On Feb 13, 2016, at 5:18 AM, Joint <da...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi.
>>>>>> The quick full scenario is a distributed DaaS which supports queries,
>>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail
>>> because I spoke to him previously. We achieve multiple writes by having
>>> parallel Datasets, both traditional TDB and on demand in memory. Writes are
>>> sent to a free dataset, free being not in a write transaction. That's a
>>> simplistic overview...
>>>>>> Queries are handled by a dataset proxy which builds a dynamic dataset
>>> based on the graph URIs. For example the graph URI urn:Iungo:all causes the
>>> proxy find method to issue the query to all known Datasets and return the
>>> union of results. Various dataset proxies exist, some load TDBs, others
>>> load TTL files into graphs, others dynamically create tuples. The common
>>> thing being they are all presented as Datasets backed by DatasetGraph. Thus
>>> a SPARQL query can result in multiple Datasets being loaded to satisfy the
>>> query.
>>>>>> Nodes can be preloaded which then load Datasets to satisfy finds. This
>>> way the system can be scaled to handle increased work loads. Also specific
>>> nodes can be targeted to specific hardware.
>>>>>> When a graph URI is encountered the proxy can interpret its
>>> structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
>>> SDAI repository foo to be dynamically loaded into memory along with the
>>> quads which are required to satisfy the find.
>>>>>> Typically a group of people will be working on a set of data so the
>>> first to query will load the dataset then it will be accessed multiple
>>> times. There will be an initial dynamic load of data which will tail off
>>> with some additional loading over time.
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large overhead to using the add methods?
>>>>>> A typical scenario would be to search all SDAI repositories for some
>>> key information then load detailed information in some, continuing to drill
>>> down.
>>>>>> Hope this helps.
>>>>>> I'm going to extend the hex and tri tables and run some tests. I've
>>> already shimmed the DGTriplesQuads so the actual caching code already exists
>>> and should be easy to hook on.
>>>>>> Dick
>>>>>>
>>>>>> -------- Original message --------
>>>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>>>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
>>>>>> To: users@jena.apache.org
>>>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>>>>
>>>>>> Okay, I’m more confident at this point that you’re not well served by
>>> DatasetGraphInMemory, which has very strong assumptions about the speedy
>>> reachability of data. DSGInMemory was built for situations when all of the
>>> data is in core memory and multithreaded access is important. If you have a
>>> lot of core memory and can load the data fully, you might want to use it,
>>> but that doesn’t sound at all like your case. Otherwise, as far as what the
>>> right extension point is, I will need to defer to committers or more
>>> experienced devs, but I think you may need to look at DatasetGraph from a
>>> more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads
>>> directly, for example.
>>>>>>
>>>>>> Can you tell us a bit more about your full scenario? I don’t know much
>>> about STEP (sorry if others do)— is there a canonical RDF formulation? What
>>> kinds of queries are you going to be using with this data? How quickly are
>>> users going to need to switch contexts between datasets?
>>>>>>
>>>>>> ---
>>>>>> A. Soroka
>>>>>> The University of Virginia Library
>>>>>>
>>>>>>> On Feb 12, 2016, at 2:44 PM, Joint <da...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the fast response!
>>>>>>>     I have a set of disk based binary SDAI repositories which are
>>> based on ISO10303 parts 11/21/25/27 otherwise known as the
>>> EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
>>> be +1Gb. However after processing into a SDAI binary I typically see a size
>>> reduction e.g. a 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
>>> the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
>>> 1000's of similar sized STEP files...
>>>>>>> Typically only a small subset of the STEP file needs to be queried
>>> but sometimes other parts need to be queried. Hence the on demand caching
>>> and DatasetGraphInMemory. The aim is that in the find methods I check a
>>> cache and call the native SDAI find methods based on the node URIs in the
>>> case of a cache miss, calling the add methods for the minted tuples, then
>>> passing on the call to the super find. The underlying SDAI repositories are
>>> static so once a subject is cached no other work is required.
>>>>>>> As the DatasetGraphInMemory is commented as very fast quad and triple
>>> access it seemed a logical place to extend. The shim cache would be set to
>>> expire entries and limit the total number of tuples per repository. This
>>> is currently deployed on a 256Gb RAM device.
>>>>>>> In the bigger picture I have a service very similar to Fuseki which
>>> allows SPARQL requests to be made against Datasets which are either TDB or
>>> SDAI cache backed.
>>>>>>> What was DatasetGraphInMemory created for..? ;-)
>>>>>>> Dick
>>>>>>>
>>>>>>> -------- Original message --------
>>>>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>>>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
>>>>>>> To: users@jena.apache.org
>>>>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>>>>>
>>>>>>> I wrote the DatasetGraphInMemory code, but I suspect your question
>>> may be better answered by other folks who are more familiar with Jena's
>>> DatasetGraph implementations, or may actually not have anything to do with
>>> DatasetGraph (see below for why). I will try to give some background
>>> information, though.
>>>>>>>
>>>>>>> There are several paths by which DatasetGraphInMemory can be
>>> performing finds, but they come down to two places in the code,
QuadTable::
>>> and TripleTable::find and in default operation, the concrete forms:
>>>>>>>
>>>>>>>
>>>
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>>>>>
>>>>>>> for Quads and
>>>>>>>
>>>>>>>
>>>
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>>>>>
>>>>>>> for Triples. Those methods are reused by all the differently-ordered
>>> indexes within Hex- or TriTable, each of which will answer a find by
>>> selecting an appropriately-ordered index based on the fixed and variable
>>> slots in the find pattern and using the concrete methods above to stream
>>> tuples back.
>>>>>>>
>>>>>>> As to why you are seeing your methods called in some places and not
>>> in others, DatasetGraphBaseFind features methods like findInDftGraph(),
>>> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are
>>> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
>>> make a selection between those methods— that is done by
>>> DatasetGraphBaseFind. So that is where you will find the logic that should
>>> answer your question.
>>>>>>>
>>>>>>> Can you say a little more about your use case? You seem to have some
>>> efficient representation in memory of your data (I hope it is in-memory—
>>> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
>>> create tuples on the fly as queries are received. That is really not at all
>>> what DSGInMemory is for (DSGInMemory is using map structures for indexing
>>> and in default mode, uses persistent data structures to support
>>> transactionality). I am wondering whether you might not be much better
>>> served by tapping into Jena at a different place, perhaps implementing the
>>> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
>>> implementing Quad- and TripleTable and using the constructor
>>> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
>>>>>>>
>>>>>>> ---
>>>>>>> A. Soroka
>>>>>>> The University of Virginia Library
>>>>>>>
>>>>>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <da...@gmail.com>
>>> wrote:
>>>>>>>>
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> Does anyone know the "find" paths through DatasetGraphInMemory
>>> please?
>>>>>>>>
>>>>>>>> For example if I extend DatasetGraphInMemory and override
>>>>>>>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on
>>>>>>>> "select * where {?s ?p ?o}" however if I override the other
>>>>>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g
>>>>>>>> {?s ?p ?o}}" does not trigger a breakpoint i.e. I don't know what method
>>>>>>>> it's calling (but as I type I'm guessing it's optimised to return the
>>>>>>>> HexTable nodes...).
>>>>>>>>
>>>>>>>> Would I be better off overriding HexTable and TriTable classes' find
>>>>>>>> methods when I create the DatasetGraphInMemory? Are all finds guaranteed
>>>>>>>> to end in one of these methods?
>>>>>>>>
>>>>>>>> I need to know the root find methods so that I can shim them to create
>>>>>>>> triples/quads before they perform the find.
>>>>>>>>
>>>>>>>> I need to create Triples/Quads on demand (because a bulk load would
>>>>>>>> create ~100M triples but only ~1000 are ever queried) and the source
>>>>>>>> binary form is more efficient (binary ~1GB native tree versus TDB ~50GB
>>>>>>>> ~100M quads) than quads.
>>>>>>>>
>>>>>>>> Regards Dick Murray.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by "A. Soroka" <aj...@virginia.edu>.
Just a point of info: I'm _pretty_ sure that we're talking about TripleTable, not TriTable. TriTable is an impl class (implementing TripleTable) that uses three TripleTables to index, well, triples. TripleTable (and its sibling, QuadTable) are the interfaces that, I think, we are interested in possibly generalizing and making more public.

As Andy knows, I tried hard to unify Triple- and QuadTable under a supertype TupleTable, but the fact is that Java doesn't really do variable arity very well and we didn't want to mess with very core types like Quad or Triple, so the method dealing with tuples by elements (::find) stayed in the specialization, but methods dealing with the tuple as a whole (e.g. ::add) got pushed up. I think Andy has done a nice job below bringing everything together in a simple, straightforward way. org.apache.jena.sparql.core.mem could be rewritten very quickly to use this instead of the current types, if that's any evidence.
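The arity problem described above can be sketched with hypothetical interfaces (illustrative names only, not the actual Jena types): whole-tuple operations such as add generalize over the tuple type, but the by-element find() cannot, because three Node slots and four Node slots are different method signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Whole-tuple operations can be pushed up into a shared supertype...
interface TupleTable<TupleType> {
    void add(TupleType tuple);
    void delete(TupleType tuple);
}

// ...but by-element find() stays in the arity-specific subtype: Java offers
// no clean abstraction over "three Node slots vs four Node slots".
interface TripleTableSketch extends TupleTable<String[]> {
    Stream<String[]> find(String s, String p, String o); // null = wildcard
}

// Minimal list-backed implementation, just to show the shape.
class ListTripleTable implements TripleTableSketch {
    private final List<String[]> rows = new ArrayList<>();
    public void add(String[] t)    { rows.add(t); }
    public void delete(String[] t) { rows.removeIf(r -> Arrays.equals(r, t)); }
    public Stream<String[]> find(String s, String p, String o) {
        return rows.stream().filter(r ->
            (s == null || r[0].equals(s)) &&
            (p == null || r[1].equals(p)) &&
            (o == null || r[2].equals(o)));
    }
}
```

A QuadTable counterpart would repeat the same find() shape with an extra graph slot, which is exactly the duplication the unification attempt ran into.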


---
A. Soroka
The University of Virginia Library

> On Mar 10, 2016, at 7:08 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> Hi Dick,
> 
> Thanks for the details.
> 
> So TriTable is used as the internal implementation of a caching read-only graph, and you're using the loop form for GRAPH (often the loop is one URI - i.e. directed to one part of the data). You're using TriTable because it's convenient triple storage for the use case.
> 
> The two pieces of interest to Jena:
> 
> 1/ support for writing dynamic adapters
> 
> 2/ a graph (DatasetGraph) implementation that more clearly has an interface for storage.
> 
> 
> On the latter: I've come across this before and sketched this interface.
> 
> It's nothing more than a first pass sketch.  Is this the sort of thing that might work for your use case? (a graph storage version with quads over the top as a subcase):
> 
> interface StorageRDF {
>    default void add(Triple triple) { .... }
>    default void add(Quad quad)     { .... }
> 
>    default void delete(Triple triple)  { .... }
>    default void delete(Quad quad)      { .... }
> 
>    void add(Node s, Node p, Node o) ;
>    void add(Node g, Node s, Node p, Node o) ;
> 
>    void delete(Node s, Node p, Node o) ;
>    void delete(Node g, Node s, Node p, Node o) ;
> 
>    /** Delete all triples matching a {@code find}-like pattern */
>    void removeAll(Node s, Node p, Node o) ;
>    /** Delete all quads matching a {@code find}-like pattern */
>    void removeAll(Node g, Node s, Node p, Node o) ;
> 
>    // NB Quads
>    Stream<Quad>   findDftGraph(Node s, Node p, Node o) ;
>    Stream<Quad>   findUnionGraph(Node s, Node p, Node o) ;
>    Stream<Quad>   find(Node g, Node s, Node p, Node o) ;
>    // For findUnion.
>    Stream<Quad>   findDistinct(Node g, Node s, Node p, Node o) ;
> 
>    // triples
>    Stream<Triple> find(Node s, Node p, Node o) ;
> 
> //    default Stream<Triple> find(Node s, Node p, Node o) {
> //        return findDftGraph(s,p,o).map(Quad::asTriple) ;
> //    }
> 
> //    Iterator<Quad>   findUnionGraph(Node s, Node p, Node o) ;
> //    Iterator<Quad>   find(Node g, Node s, Node p, Node o) ;
> 
> 
>    // contains
> 
>    default boolean contains(Node s, Node p, Node o)
>    { return find(s,p,o).findAny().isPresent() ; }
>    default boolean contains(Node g, Node s, Node p, Node o)
>    { return find(g,s,p,o).findAny().isPresent() ; }
> 
>    // Prefixes ??
> }
> 
> 
> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2
> also has the companion DatasetGraphStorage.
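To make the sketch concrete, here is a toy, set-backed implementation of the triple half of the StorageRDF idea, with plain String standing in for Node and null playing the role of Node.ANY. This is illustrative only, not part of Jena or of Andy's repository.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Toy stand-in for org.apache.jena.graph.Triple, with String for Node.
record T3(String s, String p, String o) {}

// Set-backed sketch of the triple half of StorageRDF; null = Node.ANY.
class MemStorageRDF {
    private final Set<T3> triples = new HashSet<>();

    public void add(String s, String p, String o)    { triples.add(new T3(s, p, o)); }
    public void delete(String s, String p, String o) { triples.remove(new T3(s, p, o)); }

    private static boolean matches(T3 t, String s, String p, String o) {
        return (s == null || t.s().equals(s))
            && (p == null || t.p().equals(p))
            && (o == null || t.o().equals(o));
    }

    public Stream<T3> find(String s, String p, String o) {
        return triples.stream().filter(t -> matches(t, s, p, o));
    }

    /** Delete all triples matching a find-like pattern, as in removeAll. */
    public void removeAll(String s, String p, String o) {
        triples.removeIf(t -> matches(t, s, p, o));
    }

    // Mirrors the default contains() in the sketch: defined via find().
    public boolean contains(String s, String p, String o) {
        return find(s, p, o).findAny().isPresent();
    }
}
```

The point of the interface is that contains() and removeAll() fall out of find() for free, so a storage implementation only has to supply the pattern-matching stream.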
> 
>    Andy
> 
> 
> 
> On 04/03/16 12:03, Dick Murray wrote:
>> LOL. The perils of a succinct update with no detail!
>> 
>> I understand the Jena SPI supports read/writes via transactions and I also
>> know that the wrapper classes provide a best effort for some of the
>> overridden methods which do not always sit well when materializing triples.
>> For example DatasetGraphBase provides public boolean containsGraph(Node
>> graphNode) {return contains(graphNode, Node.ANY, Node.ANY, Node.ANY);}
>> which results in a call to DatasetGraphBaseFind public Iterator<Quad>
>> find(Node g, Node s, Node p, Node o) which might end up with something
>> being called in DatasetGraphInMemory depending on what has been extended
>> and overridden. This causes a problem for me because I shim the finds to
>> decide whether the triples have been materialized before calling the
>> overridden find. After extending DatasetGraphTriples and
>> DatasetGraphInMemory I realised that I had overridden most of the methods
>> so I stopped and implemented DatasetGraph and Transactional.
>> 
>> In my scenario the underlying data (a vendor agnostic format to get
>> AutoCAD, Bentley, etc to work together) is never changed so the
>> DatasetGraph need not support writes. Whilst we need to provide semantic
>> access to the these files they result in ~100M triples each if transformed,
>> there are 1000's of files, they can change multiple times per day and the
>> various disciplines typically only require a subset of triples.
>> 
>> That said, in my DatasetGraph implementation if you call
>> begin(ReadWrite.WRITE) it throws a UOE. The same is true for the Graph
>> implementation in that it does not support external writes (throws UOE) but
>> does implement writes internally (via TriTable) because it needs to write
>> the materialized triples to answer the find.
>> 
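That write policy (external writes refused, internal writes still possible against the backing TriTable) can be sketched as a tiny stand-alone class; this is a hypothetical illustration, not Jena's Transactional interface.

```java
// Sketch of a read-only transaction policy: begin(WRITE) is refused,
// begin(READ) proceeds. Hypothetical class, for illustration only.
enum ReadWrite { READ, WRITE }

class ReadOnlyTransactional {
    private final ThreadLocal<Boolean> inTxn = ThreadLocal.withInitial(() -> false);

    public void begin(ReadWrite mode) {
        if (mode == ReadWrite.WRITE)
            throw new UnsupportedOperationException("This dataset is read-only");
        inTxn.set(true);
    }

    public boolean isInTransaction() { return inTxn.get(); }

    public void end() { inTxn.set(false); }
}
```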
>> So if we take
>> 
>> select ?s
>> where {graph <urn:iungo:iso/10303/22/repository/r/model/m> {?s a
>> <urn:iungo:iso/10303/11/schema/s/entity/e>}
>> 
>> Jena via the SPARQL query engine will perform the following abridged
>> process.
>> 
>>    - Jena begins a DG read transaction.
>>    - Jena calls DG find(<urn:iungo:iso/10303/22/repository/r/model/m>, ANY,
>>    a <urn:iungo:iso/10303/11/schema/s/entity/e>).
>>    - DG will;
>>       - check if the repository r has been loaded, i.e. matching the
>>       repository name URI spec fragment to a repository file on disk
>> and loading
>>       it into the SDAI session.
>>       - check if the model m has been loaded, i.e. matching the model name
>>       URI spec fragment to a repository model and loading it into the SDAI
>>       session.
>>          - If we have just loaded the SDAI model, check if there is any pre
>>          caching to be done, which is just a set of find triples handled as
>>          per the normal find detailed below.
>>       - We now have a G which wraps the SDAI model and uses TriTable to
>>    hold materialized triples.
>>    - DG will now call G.find(ANY, a
>>    <urn:iungo:iso/10303/11/schema/s/entity/e>).
>>    - G will check the find triple against a set of already materialized
>>    find triples and if it misses;
>>       - G will search a set of triple handles which know how to materialize
>>       triples for a given find triple and if found;
>>          - G begins a TriTable write transaction and for {ANY, a
>>          <urn:iungo:iso/10303/11/schema/s/entity/e>} (i.e the DG & G
>> are READ but
>>          the G TriTable is WRITE);
>>             - Check the find triples again we might have been in a race for
>>             the find triple and lost...
>>             - Load the correct Java class for entity e which involves
>>             minting the FQCN using the schema s and entity e e.g.
>> ifc2x3 and ifcslab
>>             become org.jsdai.ifc2x3.ifcslab.
>>             - Use this to call the SDAI method findInstances(Class<?
>>             extends Entity> entityClass) which returns zero or more
>> SDAI entities from
>>             which we;
>>                - Query the ifc2x3 schema to list the explicit Entity
>>                attributes and for each we add a triple to TriTable e.g.
>>                ifcslab:ifcorganization =
>>                {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>>                <urn:iungo:iso/10303/11/schema/ifc2x3/entity/ifcslab/attribute/ifcorganization>
>> <urn:iungo:iso/10303/21/repository/r/model/m/instance/1>}
>>                - In addition we add the triple
>>                {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100> a
>>                <urn:iungo:iso/10303/11/schema/s/entity/e>}.
>>                - If we are creating linked triples (i.e. max depth > 1)
>>                then for each attribute which has a SDAI entity
>> instance value call the
>>                appropriate handle to create the triples.
>>             - G commits the TriTable write transaction (make the triples
>>          visible before we update the find triples!).
>>          - G updates the find triples to include;
>>          - {ANY, a <urn:iungo:iso/10303/11/schema/s/entity/e>}
>>             - {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>>             ANY ANY}
>>             - Repeat the above for any linked triples created.
>>             - The TriTable now contains the triples required to answer the
>>          find triple.
>>          - G will return TriTable.find(ANY, a
>>    <urn:iungo:iso/10303/11/schema/s/entity/e>)
>>    - Jena ends the DG read transaction.
>> 
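The walkthrough above can be condensed into a small sketch, with Strings standing in for Nodes, a plain Set for the TriTable, and a "handle" as a function from a find pattern to freshly minted triples. Each pattern is materialized at most once and the all-wildcard find is refused; every name here is illustrative, not the actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Stream;

record Pattern(String s, String p, String o) {}   // null slot = ANY
record Trpl(String s, String p, String o) {}

class MaterializingGraph {
    private final Set<Trpl> table = new HashSet<>();            // stands in for TriTable
    private final Set<Pattern> materialized = new HashSet<>();  // find triples already seen
    private final Map<Pattern, Function<Pattern, List<Trpl>>> handles = new HashMap<>();

    void register(Pattern pat, Function<Pattern, List<Trpl>> handle) {
        handles.put(pat, handle);
    }

    Stream<Trpl> find(String s, String p, String o) {
        if (s == null && p == null && o == null)
            throw new UnsupportedOperationException("find(ANY, ANY, ANY) not supported");
        Pattern key = new Pattern(s, p, o);
        if (materialized.add(key)) {                 // first time this pattern is seen
            Function<Pattern, List<Trpl>> h = handles.get(key);
            if (h != null)                           // handle hit: mint the triples
                table.addAll(h.apply(key));          // handle miss: nothing added
        }
        return table.stream().filter(t ->
            (s == null || t.s().equals(s)) &&
            (p == null || t.p().equals(p)) &&
            (o == null || t.o().equals(o)));
    }
}
```

The real implementation would additionally wrap the table.addAll in a write transaction and re-check the materialized set inside it, to cover the race mentioned in the walkthrough.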
>> 
>> Some find triples will result in the appropriate handle being called
>> (a handle hit), which will create triples. Others will miss the handles and be
>> passed on to the TriTable find (i.e. no triples are created and TriTable will
>> return nothing). A few will result in a UOE, {ANY, ANY, ANY} being an
>> example: does this mean create all of the triples (+100M), or return all of
>> the currently created triples (which relies on having already queried what
>> you need)? Currently we only throw a UOE on {ANY ANY ANY}, and is it really
>> useful to ask this find anyway?
>> 
>> Hope that clears up the "writes are not supported" point (the underlying data
>> is read only) and why the TupleTable subtypes are not problematic. I could
>> have held the created triples per find triple, but that wouldn't scale because
>> of duplication. Besides, why reinvent the wheel when, if I'm not mistaken,
>> TriTable uses the dexx collection, giving HAMT advantages, which is what a
>> high performance in-memory implementation requires? The solution is working
>> and, compared to a fully transformed TDB, is giving the correct results. To-dos
>> might include timing out the Gs when they have not been accessed for a
>> period of time...
>> 
>> Finally, having written the wrapper I thought it wouldn't be used anywhere
>> else, but subsequently it was used to abstract an existing system where
>> ad hoc semantic access was required, and it's lined up to do a similar task on
>> two other data silos. Hence the question to Andy regarding a Jena cached
>> SPI package.
>> 
>> Thanks again for your help Adam/Andy.
>> 
>> Dick.
>> 
>> 
>> 
>> On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:
>> 
>>> I’m confused about two of your points here. Let me separate them out so we
>>> can discuss them easily.
>>> 
>>> 1) "writes are not supported”:
>>> 
>>> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
>>> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
>>> and DatasetGraph are the basic abstractions implemented by Jena’s own
>>> out-of-the-box implementations of RDF storage. Can you explain what you
>>> mean by this?
>>> 
>>> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
>>> triple caching algorithm”:
>>> 
>>> The subtypes of TupleTable with which you are working have exactly the
>>> same kinds of find() methods. Why are they not problematic in that context?
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>>> On Mar 3, 2016, at 5:47 AM, Joint <da...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Andy.
>>>> I implemented the entire SPI at the DatasetGraph and Graph level. It got
>>> to the point where I had overridden more methods than not. In addition
>>> writes are not supported and contains methods which call find(ANY, ANY,
>>> ANY) play havoc with an on demand triple caching algorithm! ;-) I'm using
>>> the TriTable because it fits and quads are spoofed via triple to quad
>>> iterator.
>>>> I have a set of filters and handles which the find triple is compared
>>> against and either passed straight to the TriTable if the triple has been
>>> handled before or its passed to the appropriate handle which adds the
>>> triples to the TriTable then calls the find. As the underlying data is a
>>> tree a cache depth can be set which allows related triples to be cached.
>>> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
>>>> Would you consider a generic version for the Jena code base?
>>>> 
>>>> 
>>>> Dick
>>>> 
>>>> -------- Original message --------
>>>> From: Andy Seaborne <an...@apache.org>
>>>> Date: 18/02/2016  6:31 pm  (GMT+00:00)
>>>> To: users@jena.apache.org
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>>>  DatasetGraphInMemory
>>>> 
>>>> Hi,
>>>> 
>>>> I'm not seeing how tapping into the implementation of
>>>> DatasetGraphInMemory is going to help (through the details
>>>> 
>>>> As well as the DatasetGraphMap approach, one other thought that occurred
>>>> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
>>>> implementation.
>>>> 
>>>> It loads, and clears, the mapped graph on-demand, and passes the find()
>>>> call through to the now-setup data.
>>>> 
>>>>       Andy
>>>> 
>>>> On 16/02/16 17:42, A. Soroka wrote:
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large over head to using the add methods?
>>>>> 
>>>>> No, I certainly did not mean to give that impression, and I don’t think
>>> it is entirely accurate. DSGInMemory was definitely not at all meant for
>>> dynamic loading. That doesn’t mean it can’t be used that way, but that was
>>> not in the design, which assumed that all tuples take about the same amount
>>> of time to access and that all of the same type are coming from the same
>>> implementation (in a QuadTable and a TripleTable).
>>>>> 
>>>>> The overhead of mutating a dataset is mostly inside the implementations
>>> of TupleTable that are actually used to store tuples. You should be aware
>>> that TupleTable extends TransactionalComponent, so if you want to use it to
>>> create some kind of connection to your storage, you will need to make that
>>> connection fully transactional. That doesn’t sound at all trivial in your
>>> case.
>>>>> 
>>>>> At this point it seems to me that extending DatasetGraphMap (and
>>> implementing GraphMaker and Graph instead of TupleTable) might be a more
>>> appropriate design for your work. You can put dynamic loading behavior in
>>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
>>> Are there reasons around the use of transactionality in your work that
>>> demand the particular semantics supported by DSGInMemory?
>>>>> 
>>>>> ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>> 
>>>>>> On Feb 13, 2016, at 5:18 AM, Joint <da...@gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi.
>>>>>> The quick full scenario is a distributed DaaS which supports queries,
>>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail
>>> because I spoke to him previously. We achieve multiple writes by having
>>> parallel Datasets, both traditional TDB and on demand in memory. Writes are
>>> sent to a free dataset, free being not in a write transaction. That's a
>>> simplistic overview...
>>>>>> Queries are handled by a dataset proxy which builds a dynamic dataset
>>> based on the graph URIs. For example the graph URI urn:Iungo:all causes the
>>> proxy find method to issue the query to all known Datasets and return the
>>> union of results. Various dataset proxies exist, some load TDBs, others
>>> load TTL files into graphs, others dynamically create tuples. The common
>>> thing being they are all presented as Datasets backed by DatasetGraph. Thus
>>> a SPARQL query can result in multiple Datasets being loaded to satisfy the
>>> query.
>>>>>> Nodes can be preloaded which then load Datasets to satisfy finds. This
>>> way the system can be scaled to handle increased work loads. Also specific
>>> nodes can be targeted to specific hardware.
>>>>>> When a graph URI is encountered the proxy can interpret it's
>>> structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
>>> SDAI repository foo to be dynamically loaded into memory along with the
>>> quads which are required to satisfy the find.
>>>>>> Typically a group of people will be working on a set of data so the
>>> first to query will load the dataset then it will be accessed multiple
>>> times. There will be an initial dynamic load of data which will tail off
>>> with some additional loading over time.
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large over head to using the add methods?
>>>>>> A typical scenario would be to search all SDAI repository's for some
>>> key information then load detailed information in some, continuing to drill
>>> down.
>>>>>> Hope this helps.
>>>>>> I'm going to extend the hex and tri tables and run some tests. I've
>>> already shimed the DGTriplesQuads so the actual caching code already exists
>>> and should bed easy to hook on.
>>>>>> Dick
>>>>>> 
>>>>>> -------- Original message --------
>>>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>>>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
>>>>>> To: users@jena.apache.org
>>>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>>>> 
>>>>>> Okay, I’m more confident at this point that you’re not well served by
>>> DatasetGraphInMemory, which has very strong assumptions about the speedy
>>> reachability of data. DSGInMemory was built for situations when all of the
>>> data is in core memory and multithreaded access is important. If you have a
>>> lot of core memory and can load the data fully, you might want to use it,
>>> but that doesn’t sound at all like your case. Otherwise, as far as what the
>>> right extension point is, I will need to defer to committers or more
>>> experienced devs, but I think you may need to look at DatasetGraph from a
>>> more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads
>>> directly, for example.
>>>>>> 
>>>>>> Can you tell us a bit more about your full scenario? I don’t know much
>>> about STEP (sorry if others do)— is there a canonical RDF formulation? What
>>> kinds of queries are you going to be using with this data? How quickly are
>>> users going to need to switch contexts between datasets?
>>>>>> 
>>>>>> ---
>>>>>> A. Soroka
>>>>>> The University of Virginia Library
>>>>>> 
>>>>>>> On Feb 12, 2016, at 2:44 PM, Joint <da...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks for the fast response!
>>>>>>>     I have a set of disk based binary SDAI repository's which are
>>> based on ISO10303 parts 11/21/25/27 otherwise known as the
>>> EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
>>> be +1Gb. However after processing into a SDAI binary I typically see a size
>>> reduction e.g. 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
>>> the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
>>> 1000's of similar sized STEP files...
>>>>>>> Typically only a small subset of the STEP file needs to be queried
>>> but sometimes other parts need to be queried. Hence the on demand caching
>>> and DatasetGraphInMemory. The aim is that in the find methods I check a
>>> cache and call the native SDAI find methods based on the node URI's in the
>>> case of a cache miss, calling the add methods for the minted tuples, then
>>> passing on the call to the super find. The underlying SDAI repository's are
>>> static so once a subject is cached no other work is required.
>>>>>>> As the DatasetGraphInMemory is commented as very fast quad and triple
>>> access it seemed a logical place to extend. The shim cache would be set to
>>> expire entries and limit the total number of tuples power repository. This
>>> is currently deployed on a 256Gb ram device.
>>>>>>> In the bigger picture l have a service very similar to Fuseki which
>>> allows SPARQL requests to be made against Datasets which are either TDB or
>>> SDAI cache backed.
>>>>>>> What was DatasetGraphInMemory created for..? ;-)
>>>>>>> Dick
>>>>>>> 
>>>>>>> -------- Original message --------
>>>>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>>>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
>>>>>>> To: users@jena.apache.org
>>>>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>>>>> 
>>>>>>> I wrote the DatasetGraphInMemory  code, but I suspect your question
>>> may be better answered by other folks who are more familiar with Jena's
>>> DatasetGraph implementations, or may actually not have anything to do with
>>> DatasetGraph (see below for why). I will try to give some background
>>> information, though.
>>>>>>> 
>>>>>>> There are several paths by which DatasetGraphInMemory can perform
>>> finds, but they come down to two places in the code, QuadTable::find
>>> and TripleTable::find and, in default operation, the concrete forms:
>>>>>>> 
>>>>>>> 
>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>>>>> 
>>>>>>> for Quads and
>>>>>>> 
>>>>>>> 
>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>>>>> 
>>>>>>> for Triples. Those methods are reused by all the differently-ordered
>>> indexes within Hex- or TriTable, each of which will answer a find by
>>> selecting an appropriately-ordered index based on the fixed and variable
>>> slots in the find pattern and using the concrete methods above to stream
>>> tuples back.
>>>>>>> 
>>>>>>> As to why you are seeing your methods called in some places and not
>>> in others, DatasetGraphBaseFind features methods like findInDftGraph(),
>>> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are
>>> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
>>> make a selection between those methods— that is done by
>>> DatasetGraphBaseFind. So that is where you will find the logic that should
>>> answer your question.
>>>>>>> 
>>>>>>> Can you say a little more about your use case? You seem to have some
>>> efficient representation in memory of your data (I hope it is in-memory—
>>> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
>>> create tuples on the fly as queries are received. That is really not at all
>>> what DSGInMemory is for (DSGInMemory is using map structures for indexing
>>> and in default mode, uses persistent data structures to support
>>> transactionality). I am wondering whether you might not be much better
>>> served by tapping into Jena at a different place, perhaps implementing the
>>> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
>>> implementing Quad- and TripleTable and using the constructor
>>> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
>>>>>>> 
>>>>>>> ---
>>>>>>> A. Soroka
>>>>>>> The University of Virginia Library
>>>>>>> 
>>>>>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <da...@gmail.com>
>>> wrote:
>>>>>>>> 
>>>>>>>> Hi.
>>>>>>>> 
>>>>>>>> Does anyone know the "find" paths through DatasetGraphInMemory
>>> please?
>>>>>>>> 
>>>>>>>> For example if I extend DatasetGraphInMemory and override
>>>>>>>> DatasetGraphBaseFind.find(node, Node, Node, Node) it breakpoints on
>>> "select
>>>>>>>> * where {?s ?p ?o}" however if I override the other
>>>>>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g
>>> {?s ?p
>>>>>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method
>>> it's
>>>>>>>> calling (but as I type I'm guessing it's optimised to return the
>>> HexTable
>>>>>>>> nodes...).
>>>>>>>> 
>>>>>>>> Would I be better off overriding HexTable and TriTable classes find
>>> methods
>>>>>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to
>>> end in
>>>>>>>> one of these methods?
>>>>>>>> 
>>>>>>>> I need to know the root find methods so that I can shim them to
>>> create
>>>>>>>> triples/quads before they perform the find.
>>>>>>>> 
>>>>>>>> I need to create Triples/Quads on demand (because a bulk load would
>>> create
>>>>>>>> ~100M triples but only ~1000 are ever queried) and the source binary
>>> form
>>>>>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M
>>> quads)
>>>>>>>> than quads.
>>>>>>>> 
>>>>>>>> Regards Dick Murray.
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by Andy Seaborne <an...@apache.org>.
Hi Dick,

Thanks for the details.

So TriTable is used as the internal implementation of a caching 
read-only graph, and you're using the loop form for GRAPH (often the 
loop is over a single URI - i.e. directed to one part of the data).  You 
use TriTable because it's convenient triple storage for the use case.

There are two interesting pieces for Jena:

1/ support for writing dynamic adapters

2/ a graph (DatasetGraph) implementation that more clearly has an 
interface for storage.


On the latter: I've come across this before and sketched this interface.

It's nothing more than a first-pass sketch.  Is this the sort of thing 
that might work for your use case? (a graph-storage version, with quads 
layered over the top as a subcase):

interface StorageRDF {
     default void add(Triple triple) { .... }
     default void add(Quad quad)     { .... }

     default void delete(Triple triple)  { .... }
     default void delete(Quad quad)      { .... }

     void add(Node s, Node p, Node o) ;
     void add(Node g, Node s, Node p, Node o) ;

     void delete(Node s, Node p, Node o) ;
     void delete(Node g, Node s, Node p, Node o) ;

     /** Delete all triples matching a {@code find}-like pattern */
     void removeAll(Node s, Node p, Node o) ;
     /** Delete all quads matching a {@code find}-like pattern */
     void removeAll(Node g, Node s, Node p, Node o) ;

     // NB Quads
     Stream<Quad>   findDftGraph(Node s, Node p, Node o) ;
     Stream<Quad>   findUnionGraph(Node s, Node p, Node o) ;
     Stream<Quad>   find(Node g, Node s, Node p, Node o) ;
     // For findUnion.
     Stream<Quad>   findDistinct(Node g, Node s, Node p, Node o) ;

     // triples
     Stream<Triple> find(Node s, Node p, Node o) ;

//    default Stream<Triple> find(Node s, Node p, Node o) {
//        return findDftGraph(s,p,o).map(Quad::asTriple) ;
//    }

//    Iterator<Quad>   findUnionGraph(Node s, Node p, Node o) ;
//    Iterator<Quad>   find(Node g, Node s, Node p, Node o) ;


     // contains

     default boolean contains(Node s, Node p, Node o)
     { return find(s,p,o).findAny().isPresent() ; }
     default boolean contains(Node g, Node s, Node p, Node o)
     { return find(g,s,p,o).findAny().isPresent() ; }

     // Prefixes ??
}


https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2
also has the companion DatasetGraphStorage.
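For a static, read-only source like yours, most of that interface
collapses to the find methods. A rough, self-contained illustration of
the shape (hypothetical names; String[] {g, s, p, o} stands in for
Quad and null for Node.ANY, so it runs without the Jena classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/**
 * Rough sketch of a read-only StorageRDF-style adapter.
 * String[] {g, s, p, o} stands in for Jena's Quad and null plays the
 * role of Node.ANY, to keep the sketch self-contained.
 */
class ReadOnlyStorage {
    private final List<String[]> quads = new ArrayList<>();

    ReadOnlyStorage(List<String[]> data) {
        quads.addAll(data);
    }

    // The underlying data is static, so the write half of the
    // interface simply refuses.
    void add(String g, String s, String p, String o) {
        throw new UnsupportedOperationException("read-only storage");
    }

    void delete(String g, String s, String p, String o) {
        throw new UnsupportedOperationException("read-only storage");
    }

    // find-like pattern match: null in a slot matches anything.
    Stream<String[]> find(String g, String s, String p, String o) {
        return quads.stream().filter(q ->
                (g == null || g.equals(q[0])) &&
                (s == null || s.equals(q[1])) &&
                (p == null || p.equals(q[2])) &&
                (o == null || o.equals(q[3])));
    }

    // contains written exactly like the default method in the sketch:
    // answer a find and see whether anything comes back.
    boolean contains(String g, String s, String p, String o) {
        return find(g, s, p, o).findAny().isPresent();
    }
}
```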

     Andy



On 04/03/16 12:03, Dick Murray wrote:
> LOL. The perils of a succinct update with no detail!
>
> I understand the Jena SPI supports read/writes via transactions and I also
> know that the wrapper classes provide a best effort for some of the
> overridden methods which do not always sit well when materializing triples.
> For example DatasetGraphBase provides public boolean containsGraph(Node
> graphNode) {return contains(graphNode, Node.ANY, Node.ANY, Node.ANY);}
> which results in a call to DatasetGraphBaseFind public Iterator<Quad>
> find(Node g, Node s, Node p, Node o) which might end up with something
> being called in DatasetGraphInMemory depending on what has been extended
> and overridden. This causes a problem for me because I shim the finds to
> decide whether the triples have been materialized before calling the
> overridden find. After extending DatasetGraphTriples and
> DatasetGraphInMemory I realised that I had overridden most of the methods
> so I stopped and implemented DatasetGraph and Transactional.
>
> In my scenario the underlying data (a vendor agnostic format to get
> AutoCAD, Bentley, etc to work together) is never changed so the
> DatasetGraph need not support writes. Whilst we need to provide semantic
> access to the these files they result in ~100M triples each if transformed,
> there are 1000's of files, they can change multiple times per day and the
> various disciplines typically only require a subset of triples.
>
> That said in my DatasetGraph implementation if you call
> begin(ReadWrite.WRITE) it throw a UOE. The same is true for the Graph
> implementation in that it does not support external writes (throws UOE) but
> does implement writes internally (via TriTable) because it needs to write
> the materialized triples to answer the find.
>
> So if we take
>
> select ?s
> where {graph <urn:iungo:iso/10303/22/repository/r/model/m> {?s a
> <urn:iungo:iso/10303/11/schema/s/entity/e>}
>
> Jena via the SPARQL query engine will perform the following abridged
> process.
>
>     - Jena begins a DG read transaction.
>     - Jena calls DG find(<urn:iungo:iso/10303/22/repository/r/model/m>, ANY,
>     a <urn:iungo:iso/10303/11/schema/s/entity/e>).
>     - DG will;
>        - check if the repository r has been loaded, i.e. matching the
>        repository name URI spec fragment to a repository file on disk
> and loading
>        it into the SDAI session.
>        - check if the model m has been loaded, i.e. matching the model name
>        URI spec fragment to a repository model and loading it into the SDAI
>        session.
>           - If we have just loaded the SDAI model check if there is any pre
>           caching to be done which is just a set of find triples which
> are handled as
>           per the normal find detailed following.
>        - We now have a G which wraps the SDAI model and uses TriTable to
>     hold materialized triples.
>     - DG will now call G.find(ANY, a
>     <urn:iungo:iso/10303/11/schema/s/entity/e>).
>     - G will check the find triple against a set of already materialized
>     find triples and if it misses;
>        - G will search a set of triple handles which know how to materialize
>        triples for a given find triple and if found;
>           - G begins a TriTable write transaction and for {ANY, a
>           <urn:iungo:iso/10303/11/schema/s/entity/e>} (i.e the DG & G
> are READ but
>           the G TriTable is WRITE);
>              - Check the find triples again we might have been in a race for
>              the find triple and lost...
>              - Load the correct Java class for entity e which involves
>              minting the FQCN using the schema s and entity e e.g.
> ifc2x3 and ifcslab
>              become org.jsdai.ifc2x3.ifcslab.
>              - Use this to call the SDAI method findInstances(Class<?
>              extends Entity> entityClass) which returns zero or more
> SDAI entities from
>              which we;
>                 - Query the ifc2x3 schema to list the explicit Entity
>                 attributes and for each we add a triple to TriTable e.g.
>                 ifcslab:ifcorganization =
>                 {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>                 <urn:iungo:iso/10303/11/schema/ifc2x3/entity/ifcslab/attribute/ifcorganization>
> <urn:iungo:iso/10303/21/repository/r/model/m/instance/1>}
>                 - In addition we add the triple
>                 {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100> a
>                 <urn:iungo:iso/10303/11/schema/s/entity/e>}.
>                 - If we are creating linked triples (i.e. max depth > 1)
>                 then for each attribute which has a SDAI entity
> instance value call the
>                 appropriate handle to create the triples.
>              - G commits the TriTable write transaction (make the triples
>           visible before we update the find triples!).
>           - G updates the find triples to include;
>           - {ANY, a <urn:iungo:iso/10303/11/schema/s/entity/e>}
>              - {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>              ANY ANY}
>              - Repeat the above for any linked triples created.
>              - The TriTable now contains the triples required to answer the
>           find triple.
>           - G will return TriTable.find(ANY, a
>     <urn:iungo:iso/10303/11/schema/s/entity/e>)
>     - Jena ends the DG read transaction.
>
>
> Some find triples will result in the appropriate handle being called
> (handle hit) which will create triples. Others will handle miss and be
> passed on to the TriTable find (e.g. no triples created and TriTable will
> return nothing). A few will result in a UOE {ANY, ANY, ANY} being an
> example because does this mean create all of the triples (+100M) or all of
> the currently created triples (which relies on having queried what you need
> to ANY!). Currently we only UOE on {ANY ANY ANY} and is it really useful to
> ask this find?
>
> Hope that clear up the "writes are not supported" (the underlying data is
> read only) and why the TupleTable subtypes are not problematic. I could
> have held the created triples per find triple but that wouldn't scale with
> duplication plus why recreate the wheel when if I'm not mistaken TriTable
> uses the dexx collection giving subsequent HAMT advantages which is what a
> high performance in memory implementation requires. The solution is working
> and compared to a fully transformed TDB is giving the correct results. To
> do might include timing out the G when they have not been accessed for a
> period of time...
>
> Finally having wrote the wrapper I thought it wouldn't be used anywhere
> else but subsequently it was used to abstract an existing system where
> adhoc semantic access was required and it's lined to do a similar task on
> two other data silos. Hence the question to Andy regarding a Jena cached
> SPI package.
>
> Thanks again for your help Adam/Andy.
>
> Dick.
>
>
>
> On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:
>
>> I’m confused about two of your points here. Let me separate them out so we
>> can discuss them easily.
>>
>> 1) "writes are not supported”:
>>
>> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
>> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
>> and DatasetGraph are the basic abstractions implemented by Jena’s own
>> out-of-the-box implementations of RDF storage. Can you explain what you
>> mean by this?
>>
>> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
>> triple caching algorithm”:
>>
>> The subtypes of TupleTable with which you are working have exactly the
>> same kinds of find() methods. Why are they not problematic in that context?
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>>> On Mar 3, 2016, at 5:47 AM, Joint <da...@gmail.com> wrote:
>>>
>>>
>>>
>>> Hi Andy.
>>> I implemented the entire SPI at the DatasetGraph and Graph level. It got
>> to the point where I had overridden more methods than not. In addition
>> writes are not supported and contains methods which call find(ANY, ANY,
>> ANY) play havoc with an on demand triple caching algorithm! ;-) I'm using
>> the TriTable because it fits and quads are spoofed via triple to quad
>> iterator.
>>> I have a set of filters and handles which the find triple is compared
>> against and either passed straight to the TriTable if the triple has been
>> handled before or its passed to the appropriate handle which adds the
>> triples to the TriTable then calls the find. As the underlying data is a
>> tree a cache depth can be set which allows related triples to be cached.
>> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
>>> Would you consider a generic version for the Jena code base?
>>>
>>>
>>> Dick
>>>
>>> -------- Original message --------
>>> From: Andy Seaborne <an...@apache.org>
>>> Date: 18/02/2016  6:31 pm  (GMT+00:00)
>>> To: users@jena.apache.org
>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>>   DatasetGraphInMemory
>>>
>>> Hi,
>>>
>>> I'm not seeing how tapping into the implementation of
>>> DatasetGraphInMemory is going to help (through the details
>>>
>>> As well as the DatasetGraphMap approach, one other thought that occurred
>>> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
>>> implementation.
>>>
>>> It loads, and clears, the mapped graph on-demand, and passes the find()
>>> call through to the now-setup data.
>>>
>>>        Andy
>>>
>>> On 16/02/16 17:42, A. Soroka wrote:
>>>>> Based on your description the DatasetGraphInMemory would seem to match
>> the dynamic load requirement. How did you foresee it being loaded? Is there
>> a large over head to using the add methods?
>>>>
>>>> No, I certainly did not mean to give that impression, and I don’t think
>> it is entirely accurate. DSGInMemory was definitely not at all meant for
>> dynamic loading. That doesn’t mean it can’t be used that way, but that was
>> not in the design, which assumed that all tuples take about the same amount
>> of time to access and that all of the same type are coming from the same
>> implementation (in a QuadTable and a TripleTable).
>>>>
>>>> The overhead of mutating a dataset is mostly inside the implementations
>> of TupleTable that are actually used to store tuples. You should be aware
>> that TupleTable extends TransactionalComponent, so if you want to use it to
>> create some kind of connection to your storage, you will need to make that
>> connection fully transactional. That doesn’t sound at all trivial in your
>> case.
>>>>
>>>> At this point it seems to me that extending DatasetGraphMap (and
>> implementing GraphMaker and Graph instead of TupleTable) might be a more
>> appropriate design for your work. You can put dynamic loading behavior in
>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
>> Are there reasons around the use of transactionality in your work that
>> demand the particular semantics supported by DSGInMemory?
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>>
>>>>> On Feb 13, 2016, at 5:18 AM, Joint <da...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi.
>>>>> The quick full scenario is a distributed DaaS which supports queries,
>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail
>> because I spoke to him previously. We achieve multiple writes by having
>> parallel Datasets, both traditional TDB and on demand in memory. Writes are
>> sent to a free dataset, free being not in a write transaction. That's a
>> simplistic overview...
>>>>> Queries are handled by a dataset proxy which builds a dynamic dataset
>> based on the graph URIs. For example the graph URI urn:Iungo:all causes the
>> proxy find method to issue the query to all known Datasets and return the
>> union of results. Various dataset proxies exist, some load TDBs, others
>> load TTL files into graphs, others dynamically create tuples. The common
>> thing being they are all presented as Datasets backed by DatasetGraph. Thus
>> a SPARQL query can result in multiple Datasets being loaded to satisfy the
>> query.
>>>>> Nodes can be preloaded which then load Datasets to satisfy finds. This
>> way the system can be scaled to handle increased work loads. Also specific
>> nodes can be targeted to specific hardware.
>>>>> When a graph URI is encountered the proxy can interpret it's
>> structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
>> SDAI repository foo to be dynamically loaded into memory along with the
>> quads which are required to satisfy the find.
>>>>> Typically a group of people will be working on a set of data so the
>> first to query will load the dataset then it will be accessed multiple
>> times. There will be an initial dynamic load of data which will tail off
>> with some additional loading over time.
>>>>> Based on your description the DatasetGraphInMemory would seem to match
>> the dynamic load requirement. How did you foresee it being loaded? Is there
>> a large over head to using the add methods?
>>>>> A typical scenario would be to search all SDAI repository's for some
>> key information then load detailed information in some, continuing to drill
>> down.
>>>>> Hope this helps.
>>>>> I'm going to extend the hex and tri tables and run some tests. I've
>> already shimed the DGTriplesQuads so the actual caching code already exists
>> and should bed easy to hook on.
>>>>> Dick
>>>>>
>>
>


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Posted by Dick Murray <da...@gmail.com>.
LOL. The perils of a succinct update with no detail!

I understand the Jena SPI supports reads/writes via transactions, and I
also know that the wrapper classes provide a best effort for some of the
overridden methods, which does not always sit well when materializing
triples. For example, DatasetGraphBase provides public boolean
containsGraph(Node graphNode) {return contains(graphNode, Node.ANY,
Node.ANY, Node.ANY);}, which results in a call to DatasetGraphBaseFind's
public Iterator<Quad> find(Node g, Node s, Node p, Node o), which might
end up with something being called in DatasetGraphInMemory depending on
what has been extended and overridden. This causes a problem for me
because I shim the finds to decide whether the triples have been
materialized before calling the overridden find. After extending
DatasetGraphTriplesQuads and DatasetGraphInMemory I realised that I had
overridden most of the methods, so I stopped and implemented
DatasetGraph and Transactional directly.

In my scenario the underlying data (a vendor-agnostic format used to get
AutoCAD, Bentley, etc. to work together) is never changed, so the
DatasetGraph need not support writes. Whilst we need to provide semantic
access to these files, they result in ~100M triples each if transformed,
there are 1000's of files, they can change multiple times per day, and the
various disciplines typically only require a subset of the triples.

That said, in my DatasetGraph implementation, if you call
begin(ReadWrite.WRITE) it throws a UnsupportedOperationException (UOE).
The same is true for the Graph implementation, in that it does not
support external writes (throws a UOE) but does implement writes
internally (via TriTable), because it needs to write the materialized
triples to answer the find.
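That read-only guard is only a few lines; a hypothetical stand-alone
sketch (with a local ReadWrite enum standing in for Jena's
org.apache.jena.query.ReadWrite):

```java
/**
 * Hypothetical sketch of a Transactional shim for a read-only dataset:
 * READ transactions proceed, WRITE is refused up front. The nested
 * ReadWrite enum stands in for Jena's, to keep the sketch runnable.
 */
class ReadOnlyTransactional {
    enum ReadWrite { READ, WRITE }

    // Per-thread transaction state, as transactions are thread-scoped.
    private final ThreadLocal<ReadWrite> mode = new ThreadLocal<>();

    void begin(ReadWrite rw) {
        if (rw == ReadWrite.WRITE)
            throw new UnsupportedOperationException("dataset is read-only");
        mode.set(rw);
    }

    boolean isInTransaction() {
        return mode.get() != null;
    }

    void end() {
        mode.remove();
    }
}
```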

So if we take

select ?s
where {graph <urn:iungo:iso/10303/22/repository/r/model/m> {?s a
<urn:iungo:iso/10303/11/schema/s/entity/e>}}

Jena via the SPARQL query engine will perform the following abridged
process.

   - Jena begins a DG read transaction.
   - Jena calls DG.find(<urn:iungo:iso/10303/22/repository/r/model/m>, ANY,
   a, <urn:iungo:iso/10303/11/schema/s/entity/e>).
   - DG will;
      - check whether the repository r has been loaded, i.e. match the
      repository name URI spec fragment to a repository file on disk and
      load it into the SDAI session.
      - check whether the model m has been loaded, i.e. match the model name
      URI spec fragment to a repository model and load it into the SDAI
      session.
         - If we have just loaded the SDAI model, check whether there is any
         pre-caching to be done; this is just a set of find triples handled
         as per the normal find detailed below.
      - We now have a G which wraps the SDAI model and uses a TriTable to
      hold materialized triples.
   - DG will now call G.find(ANY, a,
   <urn:iungo:iso/10303/11/schema/s/entity/e>).
   - G will check the find triple against a set of already materialized
   find triples and, on a miss;
      - G will search a set of triple handles which know how to materialize
      triples for a given find triple and, if one is found;
         - G begins a TriTable write transaction for {ANY, a,
         <urn:iungo:iso/10303/11/schema/s/entity/e>} (i.e. the DG and G
         are READ but the G TriTable is WRITE);
            - Check the find triples again: we might have been in a race for
            the find triple and lost...
            - Load the correct Java class for entity e, which involves
            minting the FQCN from the schema s and entity e, e.g. ifc2x3
            and ifcslab become org.jsdai.ifc2x3.ifcslab.
            - Use this to call the SDAI method findInstances(Class<?
            extends Entity> entityClass), which returns zero or more SDAI
            entities, from which we;
               - Query the ifc2x3 schema to list the explicit Entity
               attributes and, for each, add a triple to the TriTable, e.g.
               ifcslab:ifcorganization =
               {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
               <urn:iungo:iso/10303/11/schema/ifc2x3/entity/ifcslab/attribute/ifcorganization>
               <urn:iungo:iso/10303/21/repository/r/model/m/instance/1>}
               - In addition we add the triple
               {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100> a
               <urn:iungo:iso/10303/11/schema/s/entity/e>}.
               - If we are creating linked triples (i.e. max depth > 1),
               then for each attribute which has an SDAI entity instance
               value, call the appropriate handle to create the triples.
            - G commits the TriTable write transaction (making the triples
            visible before we update the find triples!).
         - G updates the find triples to include;
            - {ANY, a, <urn:iungo:iso/10303/11/schema/s/entity/e>}
            - {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
            ANY ANY}
            - Repeat the above for any linked triples created.
         - The TriTable now contains the triples required to answer the
         find triple.
      - G will return TriTable.find(ANY, a,
      <urn:iungo:iso/10303/11/schema/s/entity/e>).
   - Jena ends the DG read transaction.
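
The FQCN-minting step and the materialized-pattern bookkeeping above can be
sketched in plain Java. Everything here is illustrative (class, method and
key format are my assumptions, not the actual implementation):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the find-triple cache bookkeeping described above. */
public class FindCache {
    // Find patterns already materialized into the TriTable.
    private final Set<String> materialized = ConcurrentHashMap.newKeySet();

    /** Mint the JSDAI FQCN from schema and entity,
     *  e.g. ifc2x3 + ifcslab -> org.jsdai.ifc2x3.ifcslab */
    public static String mintFqcn(String schema, String entity) {
        return "org.jsdai." + schema.toLowerCase() + "." + entity.toLowerCase();
    }

    /** Canonical key for a find pattern; null stands in for ANY. */
    public static String key(String s, String p, String o) {
        return (s == null ? "ANY" : s) + " "
             + (p == null ? "ANY" : p) + " "
             + (o == null ? "ANY" : o);
    }

    public boolean isMaterialized(String s, String p, String o) {
        return materialized.contains(key(s, p, o));
    }

    public void markMaterialized(String s, String p, String o) {
        materialized.add(key(s, p, o));
    }
}
```

On a hit, find goes straight to the TriTable; on a miss, the handle
materializes the triples inside a TriTable write transaction and then marks
the pattern (plus the per-instance {s ANY ANY} patterns) as materialized.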


Some find triples will result in the appropriate handle being called (a
handle hit), which will create triples. Others will miss every handle and be
passed straight to the TriTable find (no triples are created and the
TriTable will return nothing). A few will result in a UOE, {ANY, ANY, ANY}
being an example: does it mean create all of the triples (+100M), or just
all of the currently created triples (which relies on having already queried
whatever you need with ANY!)? Currently we only throw a UOE on {ANY ANY
ANY}, and is it really useful to ask this find?
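
That policy amounts to a guard at the top of find(); a minimal
self-contained sketch (null standing in for Node.ANY, names assumed):

```java
/** Sketch of the {ANY ANY ANY} policy: reject the fully unbound pattern
 *  rather than materialize the whole (+100M triple) model. */
public class FindGuard {
    // null stands in for Node.ANY in this self-contained sketch.
    public static void checkPattern(String s, String p, String o) {
        if (s == null && p == null && o == null) {
            throw new UnsupportedOperationException(
                "find(ANY, ANY, ANY) would force materializing every triple");
        }
        // Any partially bound pattern falls through to handle dispatch.
    }
}
```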

Hope that clears up "writes are not supported" (the underlying data is
read only) and why the TupleTable subtypes are not problematic. I could
have held the created triples per find triple, but that wouldn't scale given
the duplication, and why reinvent the wheel when, if I'm not mistaken,
TriTable uses the dexx collections, giving the subsequent HAMT advantages
that a high-performance in-memory implementation requires. The solution is
working and, compared to a fully transformed TDB, is giving the correct
results. To-dos might include timing out the Gs when they have not been
accessed for a period of time...
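
That time-out to-do could be as simple as tracking a last-access timestamp
per wrapper graph and sweeping idle ones. A sketch under my own assumptions
(generic over the graph type, simple full-scan sweep):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of timing out idle wrapper graphs: evict any G not touched
 *  for longer than maxIdleMillis. Illustrative, not the real code. */
public class GraphEvictor<G> {
    private final Map<String, G> graphs = new ConcurrentHashMap<>();
    private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();
    private final long maxIdleMillis;

    public GraphEvictor(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Record (or refresh) a graph on every find against it. */
    public void touch(String graphName, G graph) {
        graphs.put(graphName, graph);
        lastAccess.put(graphName, System.currentTimeMillis());
    }

    /** Drop graphs idle longer than the threshold; returns the count evicted. */
    public int evictIdle() {
        long now = System.currentTimeMillis();
        int evicted = 0;
        for (Map.Entry<String, Long> e : lastAccess.entrySet()) {
            if (now - e.getValue() > maxIdleMillis) {
                graphs.remove(e.getKey());
                lastAccess.remove(e.getKey());
                evicted++;
            }
        }
        return evicted;
    }
}
```

A background thread (or a check on each DG find) would call evictIdle()
periodically; eviction here just drops the G and its TriTable, since the
underlying SDAI repository can re-materialize on the next find.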

Finally, having written the wrapper I thought it wouldn't be used anywhere
else, but it was subsequently used to abstract an existing system where
ad hoc semantic access was required, and it's lined up to do a similar task
on two other data silos. Hence the question to Andy regarding a Jena cached
SPI package.

Thanks again for your help Adam/Andy.

Dick.



On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:

> I’m confused about two of your points here. Let me separate them out so we
> can discuss them easily.
>
> 1) "writes are not supported”:
>
> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
> and DatasetGraph are the basic abstractions implemented by Jena’s own
> out-of-the-box implementations of RDF storage. Can you explain what you
> mean by this?
>
> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
> triple caching algorithm”:
>
> The subtypes of TupleTable with which you are working have exactly the
> same kinds of find() methods. Why are they not problematic in that context?
>
> ---
> A. Soroka
> The University of Virginia Library
>
> > On Mar 3, 2016, at 5:47 AM, Joint <da...@gmail.com> wrote:
> >
> >
> >
> > Hi Andy.
> > I implemented the entire SPI at the DatasetGraph and Graph level. It got
> to the point where I had overridden more methods than not. In addition
> writes are not supported and contains methods which call find(ANY, ANY,
> ANY) play havoc with an on demand triple caching algorithm! ;-) I'm using
> the TriTable because it fits and quads are spoofed via triple to quad
> iterator.
> > I have a set of filters and handles which the find triple is compared
> against and either passed straight to the TriTable if the triple has been
> handled before or its passed to the appropriate handle which adds the
> triples to the TriTable then calls the find. As the underlying data is a
> tree a cache depth can be set which allows related triples to be cached.
> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
> > Would you consider a generic version for the Jena code base?
> >
> >
> > Dick
> >
> > -------- Original message --------
> > From: Andy Seaborne <an...@apache.org>
> > Date: 18/02/2016  6:31 pm  (GMT+00:00)
> > To: users@jena.apache.org
> > Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> >  DatasetGraphInMemory
> >
> > Hi,
> >
> > I'm not seeing how tapping into the implementation of
> > DatasetGraphInMemory is going to help (through the details
> >
> > As well as the DatasetGraphMap approach, one other thought that occurred
> > to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
> > implementation.
> >
> > It loads, and clears, the mapped graph on-demand, and passes the find()
> > call through to the now-setup data.
> >
> >       Andy
> >
> > On 16/02/16 17:42, A. Soroka wrote:
> >>> Based on your description the DatasetGraphInMemory would seem to match
> the dynamic load requirement. How did you foresee it being loaded? Is there
> a large over head to using the add methods?
> >>
> >> No, I certainly did not mean to give that impression, and I don’t think
> it is entirely accurate. DSGInMemory was definitely not at all meant for
> dynamic loading. That doesn’t mean it can’t be used that way, but that was
> not in the design, which assumed that all tuples take about the same amount
> of time to access and that all of the same type are coming from the same
> implementation (in a QuadTable and a TripleTable).
> >>
> >> The overhead of mutating a dataset is mostly inside the implementations
> of TupleTable that are actually used to store tuples. You should be aware
> that TupleTable extends TransactionalComponent, so if you want to use it to
> create some kind of connection to your storage, you will need to make that
> connection fully transactional. That doesn’t sound at all trivial in your
> case.
> >>
> >> At this point it seems to me that extending DatasetGraphMap (and
> implementing GraphMaker and Graph instead of TupleTable) might be a more
> appropriate design for your work. You can put dynamic loading behavior in
> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
> Are there reasons around the use of transactionality in your work that
> demand the particular semantics supported by DSGInMemory?
> >>
> >> ---
> >> A. Soroka
> >> The University of Virginia Library
> >>
> >>> On Feb 13, 2016, at 5:18 AM, Joint <da...@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> Hi.
> >>> The quick full scenario is a distributed DaaS which supports queries,
> updates, transforms and bulkloads. Andy Seaborne knows some of the detail
> because I spoke to him previously. We achieve multiple writes by having
> parallel Datasets, both traditional TDB and on demand in memory. Writes are
> sent to a free dataset, free being not in a write transaction. That's a
> simplistic overview...
> >>> Queries are handled by a dataset proxy which builds a dynamic dataset
> based on the graph URIs. For example the graph URI urn:Iungo:all causes the
> proxy find method to issue the query to all known Datasets and return the
> union of results. Various dataset proxies exist, some load TDBs, others
> load TTL files into graphs, others dynamically create tuples. The common
> thing being they are all presented as Datasets backed by DatasetGraph. Thus
> a SPARQL query can result in multiple Datasets being loaded to satisfy the
> query.
> >>> Nodes can be preloaded which then load Datasets to satisfy finds. This
> way the system can be scaled to handle increased work loads. Also specific
> nodes can be targeted to specific hardware.
> >>> When a graph URI is encountered the proxy can interpret it's
> structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
> SDAI repository foo to be dynamically loaded into memory along with the
> quads which are required to satisfy the find.
> >>> Typically a group of people will be working on a set of data so the
> first to query will load the dataset then it will be accessed multiple
> times. There will be an initial dynamic load of data which will tail off
> with some additional loading over time.
> >>> Based on your description the DatasetGraphInMemory would seem to match
> the dynamic load requirement. How did you foresee it being loaded? Is there
> a large over head to using the add methods?
> >>> A typical scenario would be to search all SDAI repository's for some
> key information then load detailed information in some, continuing to drill
> down.
> >>> Hope this helps.
> >>> I'm going to extend the hex and tri tables and run some tests. I've
> already shimed the DGTriplesQuads so the actual caching code already exists
> and should bed easy to hook on.
> >>> Dick
> >>>
> >>> -------- Original message --------
> >>> From: "A. Soroka" <aj...@virginia.edu>
> >>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
> >>> To: users@jena.apache.org
> >>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> DatasetGraphInMemory
> >>>
> >>> Okay, I’m more confident at this point that you’re not well served by
> DatasetGraphInMemory, which has very strong assumptions about the speedy
> reachability of data. DSGInMemory was built for situations when all of the
> data is in core memory and multithreaded access is important. If you have a
> lot of core memory and can load the data fully, you might want to use it,
> but that doesn’t sound at all like your case. Otherwise, as far as what the
> right extension point is, I will need to defer to committers or more
> experienced devs, but I think you may need to look at DatasetGraph from a
> more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads
> directly, for example.
> >>>
> >>> Can you tell us a bit more about your full scenario? I don’t know much
> about STEP (sorry if others do)— is there a canonical RDF formulation? What
> kinds of queries are you going to be using with this data? How quickly are
> users going to need to switch contexts between datasets?
> >>>
> >>> ---
> >>> A. Soroka
> >>> The University of Virginia Library
> >>>
> >>>> On Feb 12, 2016, at 2:44 PM, Joint <da...@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> Thanks for the fast response!
> >>>>     I have a set of disk based binary SDAI repository's which are
> based on ISO10303 parts 11/21/25/27 otherwise known as the
> EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
> be +1Gb. However after processing into a SDAI binary I typically see a size
> reduction e.g. 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
> the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
> 1000's of similar sized STEP files...
> >>>> Typically only a small subset of the STEP file needs to be queried
> but sometimes other parts need to be queried. Hence the on demand caching
> and DatasetGraphInMemory. The aim is that in the find methods I check a
> cache and call the native SDAI find methods based on the node URI's in the
> case of a cache miss, calling the add methods for the minted tuples, then
> passing on the call to the super find. The underlying SDAI repository's are
> static so once a subject is cached no other work is required.
> >>>> As the DatasetGraphInMemory is commented as very fast quad and triple
> access it seemed a logical place to extend. The shim cache would be set to
> expire entries and limit the total number of tuples power repository. This
> is currently deployed on a 256Gb ram device.
> >>>> In the bigger picture l have a service very similar to Fuseki which
> allows SPARQL requests to be made against Datasets which are either TDB or
> SDAI cache backed.
> >>>> What was DatasetGraphInMemory created for..? ;-)
> >>>> Dick
> >>>>
> >>>> -------- Original message --------
> >>>> From: "A. Soroka" <aj...@virginia.edu>
> >>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
> >>>> To: users@jena.apache.org
> >>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> DatasetGraphInMemory
> >>>>
> >>>> I wrote the DatasetGraphInMemory  code, but I suspect your question
> may be better answered by other folks who are more familiar with Jena's
> DatasetGraph implementations, or may actually not have anything to do with
> DatasetGraph (see below for why). I will try to give some background
> information, though.
> >>>>
> >>>> There are several paths by which where DatasetGraphInMemory can be
> performing finds, but they come down to two places in the code, QuadTable::
> and TripleTable::find and in default operation, the concrete forms:
> >>>>
> >>>>
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
> >>>>
> >>>> for Quads and
> >>>>
> >>>>
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
> >>>>
> >>>> for Triples. Those methods are reused by all the differently-ordered
> indexes within Hex- or TriTable, each of which will answer a find by
> selecting an appropriately-ordered index based on the fixed and variable
> slots in the find pattern and using the concrete methods above to stream
> tuples back.
> >>>>
> >>>> As to why you are seeing your methods called in some places and not
> in others, DatasetGraphBaseFind features methods like findInDftGraph(),
> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc. and that these are
> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
> make a selection between those methods— that is done by
> DatasetGraphBaseFind. So that is where you will find the logic that should
> answer your question.
> >>>>
> >>>> Can you say a little more about your use case? You seem to have some
> efficient representation in memory of your data (I hope it is in-memory—
> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
> create tuples on the fly as queries are received. That is really not at all
> what DSGInMemory is for (DSGInMemory is using map structures for indexing
> and in default mode, uses persistent data structures to support
> transactionality). I am wondering whether you might not be much better
> served by tapping into Jena at a different place, perhaps implementing the
> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
> implementing Quad- and TripleTable and using the constructor
> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
> >>>>
> >>>> ---
> >>>> A. Soroka
> >>>> The University of Virginia Library
> >>>>
> >>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <da...@gmail.com>
> wrote:
> >>>>>
> >>>>> Hi.
> >>>>>
> >>>>> Does anyone know the "find" paths through DatasetGraphInMemory
> please?
> >>>>>
> >>>>> For example if I extend DatasetGraphInMemory and override
> >>>>> DatasetGraphBaseFind.find(node, Node, Node, Node) it breakpoints on
> "select
> >>>>> * where {?s ?p ?o}" however if I override the other
> >>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g
> {?s ?p
> >>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method
> it's
> >>>>> calling (but as I type I'm guessing it's optimised to return the
> HexTable
> >>>>> nodes...).
> >>>>>
> >>>>> Would I be better off overriding HexTable and TriTable classes find
> methods
> >>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to
> end in
> >>>>> one of these methods?
> >>>>>
> >>>>> I need to know the root find methods so that I can shim them to
> create
> >>>>> triples/quads before they perform the find.
> >>>>>
> >>>>> I need to create Triples/Quads on demand (because a bulk load would
> create
> >>>>> ~100M triples but only ~1000 are ever queried) and the source binary
> form
> >>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M
> quads)
> >>>>> than quads.
> >>>>>
> >>>>> Regards Dick Murray.
> >>>>
> >>>
> >>
> >
>
>