Posted to dev@jena.apache.org by "Claus Stadler (Jira)" <ji...@apache.org> on 2020/05/05 12:19:00 UTC

[jira] [Created] (JENA-1894) Insert-order preserving dataset

Claus Stadler created JENA-1894:
-----------------------------------

             Summary: Insert-order preserving dataset
                 Key: JENA-1894
                 URL: https://issues.apache.org/jira/browse/JENA-1894
             Project: Apache Jena
          Issue Type: Improvement
          Components: ARQ
    Affects Versions: Jena 3.14.0
            Reporter: Claus Stadler


To the best of my knowledge, there is no backend for datasets that retains insert order.
This feature is particularly useful when changing RDF files in a git repository, as it makes for nice commits. An insert-order preserving Triple/QuadTable implementation enables:
* Writing (subject-grouped) RDF files or events from an RDF stream out in nearly the same way they were read in - this makes it easier to compare outputs of data transformations
* Combining ORDER BY with CONSTRUCT queries:

{code:java}
// Proposed factory method; it does not exist in Jena yet
Dataset ds = DatasetFactory.createOrderPreservingDataset();
QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
{code}

I created an implementation of this some time ago; the main classes of the machinery are:

* [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
* In addition, I created a lazy (but adequate?) wrapper for reusing a quad table as a triple table:
[TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
* The DatasetGraph wrapper:
[DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
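
To illustrate the idea behind such a quad table, here is a minimal sketch that is not taken from the linked classes. It assumes plain strings in place of Jena's Node type, and the class name InsertOrderQuadTable and its methods are hypothetical. Nested LinkedHashMap/LinkedHashSet instances keep graphs, subjects, predicates and objects in the order they were first inserted, and a triple view of one graph can reuse the same storage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: quads stored as graph -> subject -> predicate -> objects. */
class InsertOrderQuadTable {
    // LinkedHashMap and LinkedHashSet iterate in insertion order;
    // that property alone is what preserves the insert order here.
    private final Map<String, Map<String, Map<String, Set<String>>>> gspo = new LinkedHashMap<>();

    void add(String g, String s, String p, String o) {
        gspo.computeIfAbsent(g, k -> new LinkedHashMap<>())
            .computeIfAbsent(s, k -> new LinkedHashMap<>())
            .computeIfAbsent(p, k -> new LinkedHashSet<>())
            .add(o);
    }

    void delete(String g, String s, String p, String o) {
        Map<String, Map<String, Set<String>>> spo = gspo.get(g);
        if (spo == null) return;
        Map<String, Set<String>> po = spo.get(s);
        if (po == null) return;
        Set<String> os = po.get(p);
        if (os == null) return;
        os.remove(o);
        // Prune empty levels so that a later re-insert is appended at the end.
        if (os.isEmpty()) po.remove(p);
        if (po.isEmpty()) spo.remove(s);
        if (spo.isEmpty()) gspo.remove(g);
    }

    /** All quads as "g s p o" strings, in insert order, grouped by graph and subject. */
    List<String> quads() {
        List<String> out = new ArrayList<>();
        gspo.forEach((g, spo) -> spo.forEach((s, po) -> po.forEach((p, os) ->
            os.forEach(o -> out.add(g + " " + s + " " + p + " " + o)))));
        return out;
    }

    /** Triple view of a single graph, backed by the same quad storage. */
    List<String> triples(String g) {
        List<String> out = new ArrayList<>();
        Map<String, Map<String, Set<String>>> spo = gspo.get(g);
        if (spo == null) return Collections.emptyList();
        spo.forEach((s, po) -> po.forEach((p, os) ->
            os.forEach(o -> out.add(s + " " + p + " " + o))));
        return out;
    }

    public static void main(String[] args) {
        InsertOrderQuadTable table = new InsertOrderQuadTable();
        table.add("g1", "s2", "p", "o1");
        table.add("g1", "s1", "p", "o2");
        table.add("g1", "s2", "q", "o3"); // grouped with the earlier s2 entry
        table.quads().forEach(System.out::println);
    }
}
```

Note that the output is subject-grouped rather than strictly in insert order: a triple added later for an already known subject is emitted together with that subject's earlier triples, which matches the "nearly the same way they were read in" behaviour described above.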

Note that DatasetGraphQuadsImpl at present falsely claims to be transaction-aware, because otherwise any SPARQL insert caused an exception (I have not yet tried with the latest fixes for 3.15.0-SNAPSHOT). In any case, for the use case of writing out RDF, transactions may not even be necessary, but if there is an easy way to add them, it should be done.

An example of the above code in action is here: [Git Diff based on ordered turtle-blocks output |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]

The downside of this kind of order-preserving dataset is that it essentially features only a gspo index. Hence, its performance characteristics - it is intended mostly for serialization or presentation - vary greatly from those of the query-optimized implementations.
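
A rough sketch of why a single gspo index behaves this way, again assuming strings instead of Jena nodes and a hypothetical class GspoOnlyIndex that is not part of Jena or of the linked code: a pattern with graph and subject bound is answered by direct lookups, whereas a pattern with only the object bound has to walk every entry. The lastScanned counter exists purely to make that difference visible.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pattern matching over a single gspo index. */
class GspoOnlyIndex {
    private final Map<String, Map<String, Map<String, Set<String>>>> gspo = new LinkedHashMap<>();

    /** Number of (graph, subject, predicate) entries visited by the most recent find(). */
    int lastScanned;

    void add(String g, String s, String p, String o) {
        gspo.computeIfAbsent(g, k -> new LinkedHashMap<>())
            .computeIfAbsent(s, k -> new LinkedHashMap<>())
            .computeIfAbsent(p, k -> new LinkedHashSet<>())
            .add(o);
    }

    /** Matches quads against a pattern; null acts as a wildcard. Results are in insert order. */
    List<String> find(String g, String s, String p, String o) {
        lastScanned = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<String, Set<String>>>> ge : level(gspo, g)) {
            for (Map.Entry<String, Map<String, Set<String>>> se : level(ge.getValue(), s)) {
                for (Map.Entry<String, Set<String>> pe : level(se.getValue(), p)) {
                    lastScanned++;
                    for (String obj : pe.getValue()) {
                        if (o == null || o.equals(obj)) {
                            out.add(ge.getKey() + " " + se.getKey() + " " + pe.getKey() + " " + obj);
                        }
                    }
                }
            }
        }
        return out;
    }

    // A bound key costs one hash lookup; a wildcard (null) iterates the whole level.
    private static <V> Collection<Map.Entry<String, V>> level(Map<String, V> m, String key) {
        if (key == null) return m.entrySet();
        V v = m.get(key);
        if (v == null) return Collections.emptyList();
        Map.Entry<String, V> e = new AbstractMap.SimpleImmutableEntry<>(key, v);
        return Collections.singletonList(e);
    }

    public static void main(String[] args) {
        GspoOnlyIndex index = new GspoOnlyIndex();
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 4; j++) {
                index.add("g" + i, "s" + j, "p", "o" + j);
            }
        }
        index.find("g1", "s2", null, null);
        System.out.println("graph+subject bound, entries visited: " + index.lastScanned);
        index.find(null, null, null, "o2");
        System.out.println("object only bound, entries visited: " + index.lastScanned);
    }
}
```

Query-optimized stores avoid the full walk by maintaining additional index orderings, at the cost of losing a single well-defined insert order.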

In any case, order preserving datasets are a highly useful feature for Jena and I'd gladly contribute a PR for that. My main questions are:
* What should the factory methods in DatasetFactory, DatasetGraphFactory etc. be called - createOrderPreservingDataset?
* Is the approach using QuadTableFromNestedMaps needed - or can a different implementation of QuadTable be repurposed?
* It seems that the abstract class DatasetGraphQuads does not have any implementation, at least in ARQ and the Jena modules I use (according to Eclipse) - so my custom implementation DatasetGraphQuadsImpl seems to be needed, or is there a similar class lying around in another Jena package?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)