Posted to dev@jena.apache.org by "Claus Stadler (Jira)" <ji...@apache.org> on 2020/09/03 22:35:00 UTC
[jira] [Commented] (JENA-1894) Insert-order preserving dataset

    [ https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190417#comment-17190417 ] 

Claus Stadler commented on JENA-1894:
-------------------------------------

Hi Andy,

I took a look at PMapQuadTable; however, it uses the [dexx|https://github.com/andrewoma/dexx] collections, and I did not see an easy way to reuse plain Java LinkedHashMaps for this purpose.

I made an attempt at implementing the missing transaction support in my QuadTable implementation that uses nested (LinkedHash)Maps. Essentially, I ended up creating a thread-local that is initialized with an empty diff on txn begin and whose recorded changes are applied on commit:

[QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/ba2d1f21388879a1dba5194f9f1538bcf83de510/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L22]
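
To illustrate the idea independently of Jena, here is a minimal sketch of the diff-per-thread pattern. It is not the linked implementation: the class DiffTxnTable and all of its methods are hypothetical, quads are represented as plain strings, and reads inside a transaction are assumed to see only committed data.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a quad table; a quad is just a String here. */
class DiffTxnTable {
    /** Pending changes of the transaction running on the current thread. */
    private static final class Diff {
        final Set<String> added = new LinkedHashSet<>();
        final Set<String> removed = new LinkedHashSet<>();
    }

    // Committed state; LinkedHashSet keeps first-insertion order.
    private final Set<String> committed = new LinkedHashSet<>();
    // One diff per thread, present only while a transaction is open.
    private final ThreadLocal<Diff> txn = new ThreadLocal<>();

    void begin() {
        if (txn.get() != null) {
            throw new IllegalStateException("already in a transaction");
        }
        txn.set(new Diff()); // start with an empty diff
    }

    void add(String quad) {
        Diff d = current();
        d.removed.remove(quad); // a later add cancels an earlier delete
        d.added.add(quad);
    }

    void delete(String quad) {
        Diff d = current();
        d.added.remove(quad);   // a later delete cancels an earlier add
        d.removed.add(quad);
    }

    void commit() {
        Diff d = current();
        synchronized (committed) { // apply the whole diff in one step
            committed.removeAll(d.removed);
            committed.addAll(d.added);
        }
        txn.remove();
    }

    void abort() {
        txn.remove(); // dropping the diff discards all pending changes
    }

    /** Copy of the committed quads in insertion order. */
    List<String> snapshot() {
        synchronized (committed) {
            return new ArrayList<>(committed);
        }
    }

    private Diff current() {
        Diff d = txn.get();
        if (d == null) {
            throw new IllegalStateException("not in a transaction");
        }
        return d;
    }
}
```

In this sketch an abort is cheap because the committed set is never touched before commit. It does not address isolation between a reader and a concurrently committing writer beyond the lock around the committed set.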

Is my approach valid? The goal is to have insert-order sensitivity w.r.t. the components in gspo order - so that graphs and subjects are grouped in the order they were first encountered.
I realized that, to preserve the exact insert order, the quads would additionally have to be stored in e.g. a LinkedHashSet; the diff-based approach should be easily adaptable, though, provided that it is not fundamentally flawed in the first place.
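
To make the difference between grouping order and exact insert order concrete, here is a small sketch of a gspo index built from nested LinkedHashMaps. The class NestedQuadIndex is hypothetical and uses plain strings instead of Jena Nodes; it is only meant to show the iteration order such a structure yields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical gspo index: graph -> subject -> predicate -> objects. */
class NestedQuadIndex {
    private final Map<String, Map<String, Map<String, Set<String>>>> gspo =
            new LinkedHashMap<>();

    void add(String g, String s, String p, String o) {
        // Each level remembers the order in which its keys were first seen.
        gspo.computeIfAbsent(g, k -> new LinkedHashMap<>())
            .computeIfAbsent(s, k -> new LinkedHashMap<>())
            .computeIfAbsent(p, k -> new LinkedHashSet<>())
            .add(o);
    }

    /** Lists all quads by walking the nested maps level by level. */
    List<String> list() {
        List<String> out = new ArrayList<>();
        gspo.forEach((g, spo) ->
            spo.forEach((s, po) ->
                po.forEach((p, os) ->
                    os.forEach(o -> out.add(g + " " + s + " " + p + " " + o)))));
        return out;
    }
}
```

Adding (g1 s1 p a), (g1 s2 p b), (g1 s1 p c) in that order and calling list() yields "g1 s1 p a", "g1 s1 p c", "g1 s2 p b": the third quad is grouped with its subject instead of appearing last. Recovering the exact insert order would need the extra LinkedHashSet of quads mentioned above.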

I can't say I have totally understood the separation of the txn code between DatasetGraphInMemory and QuadTable. My impression, based on PMapQuadTable, is that I don't have to deal with concurrent write transactions in QuadTable because this is handled at the dataset level - I might be wrong though.



> Insert-order preserving dataset
> -------------------------------
>
>                 Key: JENA-1894
>                 URL: https://issues.apache.org/jira/browse/JENA-1894
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains insert order.
>  This feature is particularly useful when changing RDF files in a git repository, as it makes for nice commits. An insert-order preserving Triple/QuadTable implementation enables:
>  * Writing (subject-grouped) RDF files or events from an RDF stream out in nearly the same way they were read in - this makes it easier to compare outputs of data transformations
>  * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes of the machinery being:
>  * [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
>  * In addition, I created a lazy (but adequate?) wrapper for re-using a quad table as a triple table:
>  [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
>  * The DatasetGraph wrapper:
>  [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
>     public static DatasetGraph createOrderPreservingDatasetGraph() {
>         QuadTable quadTable = new QuadTableFromNestedMaps();
>         TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>         DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>         return result;
>     }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims to be transaction-aware, because otherwise any SPARQL insert caused an exception (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any case, for the use case of writing out RDF, transactions may not even be necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered turtle-blocks output |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order preserving dataset is that it essentially only features a gspo index. Hence, the performance characteristics of this kind of order preserving dataset - which is intended mostly for serialization or presentation - vary greatly from those of the query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena and I'd gladly contribute a PR for that. My main questions are:
>  * What to name the factory methods in DatasetFactory, DatasetGraphFactory etc - createOrderPreservingDataset?
>  * Is the approach using QuadTableFromNestedMaps needed - or can a different implementation of QuadTable be repurposed?
>  * It seems that the abstract class DatasetGraphQuads does not have any implementation at least in ARQ and the jena modules I use (according to eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be needed, or is there a similar class lying around in another jena package?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)