Posted to dev@jena.apache.org by brain <br...@analyticservice.net> on 2022/01/12 09:10:11 UTC

how to implement your own triple-storage engine with inference features

Hello,

     I need some help.

     With this guide, https://jena.apache.org/documentation/fuseki2/fuseki-embedded.html,
     I created an embedded Fuseki server to provide a SPARQL service.

     I made an implementation of the `org.apache.jena.dboe.storage.StorageRDF` interface in Java, so I can store RDF triples with my own storage engine (a distributed database). And it works: I can query the RDF data with SPARQL.

 
     However, I have some problems.

     When I try to change my model to an `InfGraph`, the reasoner doesn't work. There must be a bug in my code, but I can't find it.

     Is there any guide or anything else to help me fix the bug?

     Our data is stored in a distributed database. We want to do SPARQL Query and Inference.

 
Thanks

Re: how to implement your own triple-storage engine with inference features

Posted by brain <br...@analyticservice.net>.
Thank you Andy. It’s so nice to talk with you.

I’ll look into DatasetGraphSimpleDB later.

Much appreciated for your detailed and professional answer; I will spend more time digesting it.
I will email again if I have any further questions.

Thank you very much.

Best Regards,
Brain


> On Jan 15, 2022, at 8:01 PM, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> On 13/01/2022 04:06, brain wrote:
>> Hello Andy and Jena,
>>       Thanks for your kind reply.
>>       Ok, I will try it first.
>>       Another question, are there any small examples that show me how to implement StorageRDF,
>>       or any interface for external storage?
>>       I also want to try to store and query the data in an RDBMS-backed or a KV-based storage.
>>       If there are any examples I can follow, I can take baby steps and try to make it.
>>       TDB/TDB2 may be a good example, but it looks a little hard for me.
> 
> Look at the class hierarchy for DatasetGraphStorage. There is a really simple implementation in DatasetGraphSimpleDB. It is for testing and verification. It does not even have any indexing. It scans for all "find" operations.
> 
> Then the question is how fast and how much effort.
> 
> The StorageRDF (the abstraction of triples and quads) gives a basic level of access but maybe the storage engine can do joins natively.
> 
> The general purpose OpExecutor (SPARQL algebra execution) will work but does not pass joins to the storage layer.  It takes a storage-specific extension of OpExecutor to do that.
> 
> OpExecutorTDB2 extends OpExecutor to execute basic graph patterns, the block of multiple triple patterns.
> 
> Optimization of SPARQL is an open-ended area but a lot of the improvements come from 2 optimizations: joins in BGPs, and filter placement.
> 
> In TDB2, joins are performed with "node ids", not the RDF terms, Nodes, themselves. NodeId is a fixed length 64 bit number; Nodes are variable length strings. In fact, until it needs them, TDB2 does not retrieve the full node details of an RDF term. So if the variables are linking patterns together and do not appear in filters or the final results, they never get retrieved.
> 
> Doing joins better includes reordering to execute in a better order, and maybe extending to leftjoins (OPTIONAL).
> 
> Filter placement, especially noticeable for the BSBM benchmark, is also significant: it prunes work as soon as possible.
> 
> RDBMS:
> 
> The only general purpose SQL-related storage I know of that still exists works by having support for SPARQL execution inside the SQL engine itself, not layered on top. Jena had SDB, which was layered, but performance for both loading and query just wasn't good and scaling was poor. Too much overhead crossing the RDF-SQL boundaries.
> 
> If however the data has an SQL schema, then R2RML is practical. Now the SQL engine can employ native indexing and optimizations because it "knows" the data shapes.
> 
> KV:
> 
> There are two cases, depending on whether the keys are sorted (e.g. RocksDB, LMDB and several others).
> 
> TDB2 uses sorted key indexes, with no values, only keys, for triples and quads.
> 
> If the keys are sorted, then storing the triples 3 times in SPO, POS and OSP means there is always an index to match a triple pattern of concrete terms and some wildcards. Two indexes are enough IF you assume that a pattern always has a predicate.
> 
> There are several read-centric systems that use 6 indexes for a graph of triples. It means any sort order is available and they can always do a merge join.
> 
> If the KV store does not provide a way to use it directly to solve a pattern with wildcards, there is going to have to be some structure on top of it to do so.
> 
>    Andy
> 




Re: how to implement your own triple-storage engine with inference features

Posted by Andy Seaborne <an...@apache.org>.

On 13/01/2022 04:06, brain wrote:
> Hello Andy and Jena,
>        Thanks for your kind reply.
>        Ok, I will try it first.
> 
>        Another question, are there any small examples that show me how to implement StorageRDF,
>        or any interface for external storage?
>        I also want to try to store and query the data in an RDBMS-backed or a KV-based storage.
>        If there are any examples I can follow, I can take baby steps and try to make it.
>        TDB/TDB2 may be a good example, but it looks a little hard for me.

Look at the class hierarchy for DatasetGraphStorage. There is a really
simple implementation in DatasetGraphSimpleDB. It is for testing and
verification. It does not even have any indexing. It scans for all
"find" operations.
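As an illustration, the scan-only approach can be sketched in plain Java. The `Triple` record and the null-as-wildcard convention below are invented for this example; this is not Jena's actual StorageRDF API, just the shape of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative sketch only: a minimal triple store with no indexing,
// where every find() is a full scan, like DatasetGraphSimpleDB.
public class ScanStore {
    public record Triple(String s, String p, String o) {}

    private final List<Triple> triples = new ArrayList<>();

    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // null means "match anything" (a wildcard slot in the pattern).
    public List<Triple> find(String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : triples) {           // full scan, no index
            if ((s == null || Objects.equals(s, t.s()))
                    && (p == null || Objects.equals(p, t.p()))
                    && (o == null || Objects.equals(o, t.o())))
                out.add(t);
        }
        return out;
    }
}
```

Correct, but every pattern costs a pass over the whole data, which is why it is only for testing and verification.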

Then the question is how fast and how much effort.

The StorageRDF (the abstraction of triples and quads) gives a basic 
level of access but maybe the storage engine can do joins natively.

The general purpose OpExecutor (SPARQL algebra execution) will work but 
does not pass joins to the storage layer.  It takes a storage-specific 
extension of OpExecutor to do that.

OpExecutorTDB2 extends OpExecutor to execute basic graph patterns, the 
block of multiple triple patterns.

Optimization of SPARQL is an open-ended area but a lot of the 
improvements come from 2 optimizations: joins in BGPs, and filter placement.

In TDB2, joins are performed with "node ids", not the RDF terms, Nodes,
themselves. NodeId is a fixed length 64 bit number; Nodes are variable
length strings. In fact, until it needs them, TDB2 does not retrieve the
full node details of an RDF term. So if the variables are linking
patterns together and do not appear in filters or the final results, they
never get retrieved.
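The dictionary-encoding idea can be pictured with a small sketch: terms map to fixed-width ids, joins compare longs, and terms are only decoded when a result is emitted. The class and method names below are invented for the example; TDB2's real NodeTable is more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of dictionary encoding, as TDB2 does with NodeId:
// variable-length RDF terms get fixed 64-bit ids; joins compare ids,
// and the full term is only looked up ("decoded") when needed.
public class NodeTable {
    private final Map<String, Long> toId = new HashMap<>();
    private final List<String> toTerm = new ArrayList<>();

    // Returns a stable id for the term, allocating one on first sight.
    public long encode(String term) {
        return toId.computeIfAbsent(term, t -> {
            toTerm.add(t);
            return (long) (toTerm.size() - 1);
        });
    }

    // Decode is deferred until the term itself is actually required.
    public String decode(long id) {
        return toTerm.get((int) id);
    }
}
```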

Doing joins better includes reordering to execute in a better order, and 
maybe extending to leftjoins (OPTIONAL).

Filter placement, especially noticeable for the BSBM benchmark, is also
significant: it prunes work as soon as possible.
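A toy sketch of why placement matters: filtering below a join shrinks the intermediate result before it is built. The data model here (integer rows, a cross join, counting pairs materialised) is an invented stand-in for SPARQL solution mappings.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch of filter placement: applying the filter before
// the join, rather than after, prunes work as soon as possible.
public class FilterPlacement {
    // Filter applied after the "join": every pair is materialised first.
    public static long late(List<Integer> xs, List<Integer> ys, Predicate<Integer> f) {
        long pairsBuilt = 0;
        for (int x : xs)
            for (int y : ys)
                pairsBuilt++;          // f would only run on each built pair
        return pairsBuilt;             // work done before any filtering
    }

    // Filter pushed below the join: non-matching xs never join at all.
    public static long early(List<Integer> xs, List<Integer> ys, Predicate<Integer> f) {
        long pairsBuilt = 0;
        for (int x : xs)
            if (f.test(x))             // prune as early as possible
                for (int y : ys)
                    pairsBuilt++;
        return pairsBuilt;
    }
}
```

With 10 left rows, 5 right rows, and a filter keeping half the left rows, the late plan builds twice as many pairs as the early one.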

RDBMS:

The only general purpose SQL-related storage I know of that still exists
works by having support for SPARQL execution inside the SQL engine
itself, not layered on top. Jena had SDB, which was layered, but
performance for both loading and query just wasn't good and scaling was
poor. Too much overhead crossing the RDF-SQL boundaries.

If however the data has an SQL schema, then R2RML is practical. Now the 
SQL engine can employ native indexing and optimizations because it 
"knows" the data shapes.

KV:

There are two cases, depending on whether the keys are sorted (e.g. 
RocksDB, LMDB and several others).

TDB2 uses sorted key indexes, with no values, only keys, for triples
and quads.

If the keys are sorted, then storing the triples 3 times in SPO, POS and 
OSP means there is always an index to match a triple pattern of concrete 
terms and some wildcards. Two indexes are enough IF you assume that a 
pattern always has a predicate.
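The index-selection rule can be sketched as a small function: pick the index whose leading key columns are the concrete slots of the pattern, so matching becomes a key-prefix lookup. The class name and boolean-flag representation are invented for this example.

```java
// Illustrative sketch of choosing among SPO/POS/OSP indexes for a
// triple pattern. Each boolean means "this slot is a concrete term,
// not a wildcard"; the chosen index serves the pattern as a prefix scan.
public class IndexChooser {
    public static String choose(boolean s, boolean p, boolean o) {
        if (s) return "SPO";   // concrete subject: prefix (S), (S,P) or (S,P,O)
        if (p) return "POS";   // concrete predicate (and maybe object)
        if (o) return "OSP";   // only the object is concrete
        return "SPO";          // nothing concrete: full scan of any index
    }
}
```

Note the (S,?,O) case still lands on SPO: it scans the subject's range and filters on the object, which is the gap a fourth index would close.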

There are several read-centric systems that use 6 indexes for a graph of
triples. It means any sort order is available and they can always do a
merge join.

If the KV store does not provide a way to use it directly to solve a
pattern with wildcards, there is going to have to be some structure on
top of it to do so.

     Andy

Re: how to implement your own triple-storage engine with inference features

Posted by brain <br...@analyticservice.net>.
Hello Andy and Jena,
      Thanks for your kind reply.
      Ok, I will try it first.

      Another question, are there any small examples that show me how to implement StorageRDF,
      or any interface for external storage?
      I also want to try to store and query the data in an RDBMS-backed or a KV-based storage.
      If there are any examples I can follow, I can take baby steps and try to make it.
      TDB/TDB2 may be a good example, but it looks a little hard for me.

      
> On Jan 13, 2022, at 2:58 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> It's hard to understand cold, but this looks odd to me:
> 
> 
> In App.java:
> 
>        DgraphDB db = new DgraphDB(dsg, new DatasetPrefixesDgraphDB());
>        Dataset dataset = DatasetFactory.wrap(db);
>        Model myMod = dataset.getDefaultModel();
> ...
>        Dataset ds = DatasetFactory.create(myMod);
>        ds.asDatasetGraph().setDefaultGraph(infgraph);
> ---------
> 
> If using StorageRDF, then I'd expect the dataset to be built with DatasetGraphStorage
> 
> (see TDB2)
> 
> 
> setDefaultGraph, slightly contrary to its name, is a copy. It is a bit of a legacy hangover.
> 
> I'd expect:
>     StorageDgraphDB Dg_dsg =
>            new StorageDgraphDB(txnSystem, tripleTable, quadTable);
>     DatasetGraph dsg1 =
>           new DatasetGraphStorage(Dg_dsg,
>                                   new DatasetPrefixesDgraphDB(),
>                                   txnSystem);
> 
> And inference:
>    Graph g = dsg1.getDefaultGraph();
>    InfGraph infgraph = reasoner.bind(g);
>    infgraph.setDerivationLogging(true);
> 
> create a layer:
> 
>    DatasetGraph dsg2 = DatasetGraphFactory.wrap(infgraph);
> 
>    FusekiServer server = FusekiServer.create()
>                .add("/ds", dsg2)
> 
> (untested)
> 
>        Andy
> 
> 
> On 12/01/2022 14:39, brain wrote:
>> Hi Andy and Jena,
>>     So glad to see you here.
>>     I just uploaded my code to GitHub: https://github.com/analyticservicedev/dgraph-jena
>>     It’s short and a little  dirty .
>>     I have  `class DgraphTripleTable implements TripleStore`.
>>     And do CRUD with add() delete() find() or findXxxMethods()
>>     Then I have DgraphTripleTable in  `class StorageDgraphDB implements StorageRDF`
>>     And then wrap the StorageDgraphDB with ds= DatasetFactory.wrap(db);
>>     Finally, I have FusekiServer.create().add("/ds", ds).port(6384).build().start()
>>     I have a test file in src/test/java/com/jena/app/Inf.java
>>     There are three methods named main, tdbmain, memoryMain
>>     for different storage backends: Dgraph, TDB, Memory
>>    Could you check my code and give me some advice to help me make it run?
>>   Thank you very much.
>>> On Jan 12, 2022, at 9:45 PM, Andy Seaborne <an...@apache.org> wrote:
>>> 
>>> 
>>> On 12/01/2022 09:10, brain wrote:
>>>> Hello,
>>>>      I need some help.
>>>>      With this guide, https://jena.apache.org/documentation/fuseki2/fuseki-embedded.html,
>>>>      I created an embedded Fuseki server to provide a SPARQL service.
>>>>      I made an implementation of the `org.apache.jena.dboe.storage.StorageRDF` interface in Java, so I can store RDF triples with my own storage engine (a distributed database). And it works: I can query the RDF data with SPARQL.
>>>>        However, I have some problems.
>>>>      When I try to change my model to an `InfGraph`, the reasoner doesn't work. There must be a bug in my code, but I can't find it.
>>>>      Is there any guide or anything else to help me fix the bug?
>>>>      Our data is stored in a distributed database. We want to do SPARQL Query and Inference.
>>>>  Thanks
>>> 
>>> Hi there,
>>> 
>>> Could you give some details of your setup?
>>> 
>>> + How do you query with RDFS?
>>> 
>>> + What level of inferencing are you setting for the InfGraph?
>>> 
>>> + Are you using an assembler or setting up the InfGraph with code?
>>> 
>>> If it is RDFS you are wanting, there's a different approach that might work better for you:
>>> 
>>> https://jena.apache.org/documentation/rdfs/
>>> 
>>> This is fixed schema, data-centric (so it is not full RDFS reasoning - there are no axiomatic triples, and it assumes that vocabulary like subproperty or subclass isn't being subproperty'ed.)
>>> 
>>> But it keeps no in-memory state from the data itself, so it scales, and you can directly update the data and see new inferred triples.
>>> 
>>>    Andy
>>> 
> 




Re: how to implement your own triple-storage engine with inference features

Posted by Andy Seaborne <an...@apache.org>.
It's hard to understand cold, but this looks odd to me:


In App.java:

         DgraphDB db = new DgraphDB(dsg, new DatasetPrefixesDgraphDB());
         Dataset dataset = DatasetFactory.wrap(db);
         Model myMod = dataset.getDefaultModel();
...
         Dataset ds = DatasetFactory.create(myMod);
         ds.asDatasetGraph().setDefaultGraph(infgraph);
---------

If using StorageRDF, then I'd expect the dataset to be built with
DatasetGraphStorage

(see TDB2)


setDefaultGraph, slightly contrary to its name, is a copy. It is a bit
of a legacy hangover.

I'd expect:
      StorageDgraphDB Dg_dsg =
             new StorageDgraphDB(txnSystem, tripleTable, quadTable);
      DatasetGraph dsg1 =
            new DatasetGraphStorage(Dg_dsg,
                                    new DatasetPrefixesDgraphDB(),
                                    txnSystem);

And inference:
     Graph g = dsg1.getDefaultGraph();
     InfGraph infgraph = reasoner.bind(g);
     infgraph.setDerivationLogging(true);

create a layer:

      DatasetGraph dsg2 = DatasetGraphFactory.wrap(infgraph);

     FusekiServer server = FusekiServer.create()
                 .add("/ds", dsg2)

(untested)

         Andy


On 12/01/2022 14:39, brain wrote:
> Hi Andy and Jena,
>      So glad to see you here.
>      I just uploaded my code to GitHub: https://github.com/analyticservicedev/dgraph-jena
> 
>      It’s short and a little  dirty .
> 
>      I have  `class DgraphTripleTable implements TripleStore`.
>      And do CRUD with add() delete() find() or findXxxMethods()
> 
>      Then I have DgraphTripleTable in  `class StorageDgraphDB implements StorageRDF`
>      And then wrap the StorageDgraphDB with ds= DatasetFactory.wrap(db);
> 
>      Finally, I have FusekiServer.create().add("/ds", ds).port(6384).build().start()
> 
>      I have a test file in src/test/java/com/jena/app/Inf.java
>      There are three methods named main, tdbmain, memoryMain
>      for different storage backends: Dgraph, TDB, Memory
> 
>     Could you check my code and give me some advice to help me make it run?
>    Thank you very much.
> 
> 
>> On Jan 12, 2022, at 9:45 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>>
>> On 12/01/2022 09:10, brain wrote:
>>> Hello,
>>>       I need some help.
>>>       With this guide, https://jena.apache.org/documentation/fuseki2/fuseki-embedded.html,
>>>       I created an embedded Fuseki server to provide a SPARQL service.
>>>       I made an implementation of the `org.apache.jena.dboe.storage.StorageRDF` interface in Java, so I can store RDF triples with my own storage engine (a distributed database). And it works: I can query the RDF data with SPARQL.
>>>         However, I have some problems.
>>>       When I try to change my model to an `InfGraph`, the reasoner doesn't work. There must be a bug in my code, but I can't find it.
>>>       Is there any guide or anything else to help me fix the bug?
>>>       Our data is stored in a distributed database. We want to do SPARQL Query and Inference.
>>>   Thanks
>>
>> Hi there,
>>
>> Could you give some details of your setup?
>>
>> + How do you query with RDFS?
>>
>> + What level of inferencing are you setting for the InfGraph?
>>
>> + Are you using an assembler or setting up the InfGraph with code?
>>
>> If it is RDFS you are wanting, there's a different approach that might work better for you:
>>
>> https://jena.apache.org/documentation/rdfs/
>>
>> This is fixed schema, data-centric (so it is not full RDFS reasoning - there are no axiomatic triples, and it assumes that vocabulary like subproperty or subclass isn't being subproperty'ed.)
>>
>> But it keeps no in-memory state from the data itself, so it scales, and you can directly update the data and see new inferred triples.
>>
>>     Andy
>>
> 
> 

Re: how to implement your own triple-storage engine with inference features

Posted by brain <br...@analyticservice.net>.
Hi Andy and Jena,
    So glad to see you here.
    I just uploaded my code to GitHub: https://github.com/analyticservicedev/dgraph-jena

    It's short and a little dirty.

    I have `class DgraphTripleTable implements TripleStore`,
    and do CRUD with add(), delete(), find() or the findXxx() methods.

    Then I use DgraphTripleTable in `class StorageDgraphDB implements StorageRDF`,
    and then wrap the StorageDgraphDB with ds = DatasetFactory.wrap(db);

    Finally, I have FusekiServer.create().add("/ds", ds).port(6384).build().start()

    I have a test file in src/test/java/com/jena/app/Inf.java.
    There are three methods named main, tdbmain, memoryMain
    for different storage backends: Dgraph, TDB, Memory.

   Could you check my code and give me some advice to help me make it run?
  Thank you very much.


> On Jan 12, 2022, at 9:45 PM, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> On 12/01/2022 09:10, brain wrote:
>> Hello,
>>      I need some help.
>>      With this guide, https://jena.apache.org/documentation/fuseki2/fuseki-embedded.html,
>>      I created an embedded Fuseki server to provide a SPARQL service.
>>      I made an implementation of the `org.apache.jena.dboe.storage.StorageRDF` interface in Java, so I can store RDF triples with my own storage engine (a distributed database). And it works: I can query the RDF data with SPARQL.
>>        However, I have some problems.
>>      When I try to change my model to an `InfGraph`, the reasoner doesn't work. There must be a bug in my code, but I can't find it.
>>      Is there any guide or anything else to help me fix the bug?
>>      Our data is stored in a distributed database. We want to do SPARQL Query and Inference.
>>  Thanks
> 
> Hi there,
> 
> Could you give some details of your setup?
> 
> + How do you query with RDFS?
> 
> + What level of inferencing are you setting for the InfGraph?
> 
> + Are you using an assembler or setting up the InfGraph with code?
> 
> If it is RDFS you are wanting, there's a different approach that might work better for you:
> 
> https://jena.apache.org/documentation/rdfs/
> 
> This is fixed schema, data-centric (so it is not full RDFS reasoning - there are no axiomatic triples, and it assumes that vocabulary like subproperty or subclass isn't being subproperty'ed.)
> 
> But it keeps no in-memory state from the data itself, so it scales, and you can directly update the data and see new inferred triples.
> 
>    Andy
> 


Re: how to implement your own triple-storage engine with inference features

Posted by Andy Seaborne <an...@apache.org>.

On 12/01/2022 09:10, brain wrote:
> Hello,
> 
>       I need some help.
> 
>       With this guide, https://jena.apache.org/documentation/fuseki2/fuseki-embedded.html,
>       I created an embedded Fuseki server to provide a SPARQL service.
> 
>       I made an implementation of the `org.apache.jena.dboe.storage.StorageRDF` interface in Java, so I can store RDF triples with my own storage engine (a distributed database). And it works: I can query the RDF data with SPARQL.
> 
>       However, I have some problems.
> 
>       When I try to change my model to an `InfGraph`, the reasoner doesn't work. There must be a bug in my code, but I can't find it.
> 
>       Is there any guide or anything else to help me fix the bug?
> 
>       Our data is stored in a distributed database. We want to do SPARQL Query and Inference.
> 
>   
> Thanks
> 

Hi there,

Could you give some details of your setup?

+ How do you query with RDFS?

+ What level of inferencing are you setting for the InfGraph?

+ Are you using an assembler or setting up the InfGraph with code?

If it is RDFS you are wanting, there's a different approach that might 
work better for you:

https://jena.apache.org/documentation/rdfs/

This is fixed schema, data-centric (so it is not full RDFS reasoning -
there are no axiomatic triples, and it assumes that vocabulary like
subproperty or subclass isn't being subproperty'ed.)

But it keeps no in-memory state from the data itself, so it scales, and
you can directly update the data and see new inferred triples.
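The data-centric idea can be sketched in plain Java: pre-compute only over the fixed schema (here, the subClassOf hierarchy), and derive inferred types per lookup with no state kept from the data. This is a toy illustration with invented names, not the org.apache.jena.rdfs implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of data-centric RDFS: the schema's subClassOf
// edges are held in memory, and the classes of an instance are derived
// on the fly, so updates to the data are visible immediately.
public class MiniRDFS {
    private final Map<String, Set<String>> superClasses = new HashMap<>();

    // Schema: a direct rdfs:subClassOf edge.
    public void addSubClassOf(String sub, String sup) {
        superClasses.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    // All classes an instance typed as cls belongs to (cls plus the
    // transitive closure of its superclasses), computed per query.
    public Set<String> typesOf(String cls) {
        Set<String> out = new HashSet<>();
        collect(cls, out);
        return out;
    }

    private void collect(String cls, Set<String> out) {
        if (!out.add(cls)) return;               // already visited
        for (String sup : superClasses.getOrDefault(cls, Set.of()))
            collect(sup, out);
    }
}
```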

     Andy