Posted to users@jena.apache.org by graham <gr...@orangedogsoftware.com> on 2021/08/21 23:24:21 UTC

Low level details

Hi

I was wondering if there is any documentation, other than the code and 
comments, about how Jena stores data on disk (e.g. the RDF equivalents 
of things like record layouts, record linkages, etc)?

Also is there any similar documentation for how Jena executes queries 
(e.g. things like stream structure, buffering, operator parallelism, etc)?

thanks

graham

-- 
                     Veni, Vidi, VISA: I came, I saw, I did a little shopping.


Re: Low level details

Posted by Andy Seaborne <an...@apache.org>.
 > On 24/08/2021 01:13, graham wrote:
 >> Certainly gives me a great place to start looking at this low level
 >> stuff.
 >>
 >> When you say that Jena uses the Volcano design, is this a custom
 >> implementation of Volcano, or is there some external library that Jena
 >> (and hence me!) can use?

Volcano is more of a pattern than specific code. It means results are 
pulled as needed from sub-expressions. It was originally called the 
"Iterator Model". "Volcano" just sounds better!

A library would be quite hard - at the bottommost level, you really, 
really don't want to copy any bytes. Query execution, however smart, 
comes down to reading the database, shifting many triples per second. 
Copying is bad - it messes up the CPU caches as well.

All ARQ operators return a QueryIterator.
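
As an illustration of the pattern only (these are not ARQ's actual 
classes - ARQ pulls Bindings through QueryIterator), here is a minimal 
pull-based operator in plain Java:

    import java.util.Iterator;
    import java.util.List;
    import java.util.NoSuchElementException;
    import java.util.function.Predicate;

    // Volcano/iterator model: each operator pulls from its child only
    // when asked for the next result - nothing is materialised up front.
    final class FilterOp<T> implements Iterator<T> {
        private final Iterator<T> child;
        private final Predicate<T> test;
        private T pending;

        FilterOp(Iterator<T> child, Predicate<T> test) {
            this.child = child;
            this.test = test;
        }

        @Override public boolean hasNext() {
            while (pending == null && child.hasNext()) {
                T t = child.next();              // pull one from the child
                if (test.test(t)) pending = t;   // keep it if it passes
            }
            return pending != null;
        }

        @Override public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            T t = pending;
            pending = null;
            return t;
        }

        public static void main(String[] args) {
            // The "scan" is just the list's iterator; operators compose
            // into a plan, and results flow only as they are pulled.
            Iterator<Integer> plan =
                new FilterOp<>(List.of(1, 2, 3, 4).iterator(), n -> n % 2 == 0);
            plan.forEachRemaining(System.out::println); // prints 2, 4
        }
    }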

On 24/08/2021 12:50, jerven Bolleman wrote:
> Hi Graham,
> 
> It is actually not too difficult to use ARQ as a query engine with your 
> own data provider.
> 
> Pierre Lindenbaum did an example at 
> http://plindenbaum.blogspot.com/2012/11/creating-virtual-rdf-graph-describing.html 
> It is old, but it shows the general idea.
> Most of the work is in implementing a GraphBase and the rest of the 
> SPARQL implementation then comes for free.
> 
> https://jena.apache.org/documentation/javadoc/jena/org/apache/jena/graph/impl/GraphBase.html 

Yes - that's the minimum - put that graph in a dataset and SPARQL will 
be functionally complete. That's the benefit of the algebra optimizer - 
it works on anything.

https://jena.apache.org/documentation/query/arq-query-eval.html

A new storage or execution layer can be added incrementally. 
Implementing Graph.find(s,p,o) gets you going; then take over more. The 
next step is Basic Graph Patterns, so that more of the basic processing 
happens next to the data: less copying, less translating between data 
formats.
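
As a minimal sketch of that starting point - assuming the current Jena 
API where GraphBase has a single abstract method graphBaseFind(Triple), 
and using a fixed in-memory list as a stand-in for your own data 
provider:

    import java.util.List;

    import org.apache.jena.graph.NodeFactory;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.graph.impl.GraphBase;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.util.iterator.ExtendedIterator;
    import org.apache.jena.util.iterator.WrappedIterator;

    public class MyGraph extends GraphBase {
        // Stand-in for a real data provider: one fixed triple.
        private final List<Triple> data = List.of(
                Triple.create(NodeFactory.createURI("http://example/s"),
                              NodeFactory.createURI("http://example/p"),
                              NodeFactory.createURI("http://example/o")));

        // The one required method: answer an (S,P,O) pattern, where any
        // slot of the pattern may be the wildcard Node.ANY.
        @Override
        protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
            return WrappedIterator.create(
                    data.stream()
                        .filter(t -> pattern.matches(t.getSubject(),
                                                     t.getPredicate(),
                                                     t.getObject()))
                        .iterator());
        }

        public static void main(String[] args) {
            // Wrap the graph as a model: SPARQL is now functionally
            // complete over it.
            Model model = ModelFactory.createModelForGraph(new MyGraph());
            try (QueryExecution qe = QueryExecutionFactory.create(
                    "SELECT * { ?s ?p ?o }", model)) {
                qe.execSelect().forEachRemaining(System.out::println);
            }
        }
    }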

     Andy


Re: Low level details

Posted by jerven Bolleman <je...@sib.swiss>.
Hi Graham,

It is actually not too difficult to use ARQ as a query engine with your 
own data provider.

Pierre Lindenbaum did an example at 
http://plindenbaum.blogspot.com/2012/11/creating-virtual-rdf-graph-describing.html
It is old, but it shows the general idea.
Most of the work is in implementing a GraphBase and the rest of the 
SPARQL implementation then comes for free.

https://jena.apache.org/documentation/javadoc/jena/org/apache/jena/graph/impl/GraphBase.html


Regards,
Jerven


-- 

	*Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss


Re: Low level details

Posted by graham <gr...@orangedogsoftware.com>.
Hi

Just wanted to say thanks to both Andy and Lorenz for amazing answers!

Certainly gives me a great place to start looking at this low level stuff.

When you say that Jena uses the Volcano design, is this a custom 
implementation of Volcano, or is there some external library that Jena 
(and hence me!) can use?

thanks again

graham

-- 
                     Veni, Vidi, VISA: I came, I saw, I did a little shopping.


Re: Low level details

Posted by Andy Seaborne <an...@apache.org>.

On 22/08/2021 09:02, Lorenz Buehmann wrote:
> Some implementation details are in the docs; for TDB, see here: 
> https://jena.apache.org/documentation/tdb/architecture.html
> 
> Execution via ARQ is covered here: 
> https://jena.apache.org/documentation/query/arq-query-eval.html
> 
> In general, I think Jena also follows the Volcano design for query 
> evaluation, which is quite common for databases.
> 
> 
> More pointers and information will surely be provided by Andy soon ...

Mainly this mailing list :-)

On the storage side (for TDB2 unless noted otherwise):
TIM = Transactions In Memory = the thing behind 
DatasetFactory.createTxnMem().

TDB2:

* All RDF terms are referred to by a 64-bit id (NodeId).
* There is a dictionary of RDF terms (URIs, literals, bnodes, quoted 
triples for RDF-star) called the Node table.
* Some literals are inlined - integer values up to 56 bits are stored 
directly in the NodeId, not in the NodeTable. Dates, datetimes, doubles, 
booleans, etc. are handled similarly. If a value doesn't fit inline, the 
long form goes into the node table.

* Triples are stored as 3 NodeIds. Fixed length, 24 bytes, packed into 
blocks with no additional per-triple space.
Quads for named graphs are 4 NodeIds.

* Indexes are B+trees with MVCC blocks (MultiVersion Concurrency 
Control) for transactions.

* The B+trees use 8k blocks, so the tree has a fan-out of about 100 to 
200 and is sorted.

* Triples are fully indexed - normally SPO, POS and OSP. So to look up 
":s :p ?var", use the SPO index to find the start of the (S,P) range and 
scan until P changes (a toy sketch of this follows the list).

* There is no formal triple table. The indexes have all the information 
needed - 3 NodeIds.

* Buffering - the node table has a large RAM cache in front of it.

* Buffering - indexes use memory mapped files so the OS does the 
buffering. This reduces the admin burden on users.
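
A toy illustration of that storage shape in plain Java. The inline tag 
and the sorted-set "index" are invented for illustration; this is not 
TDB2's actual bit layout or code:

    import java.util.Comparator;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    // Toy model of the storage shape. A NodeId is a long; small values
    // are inlined behind a tag bit, everything else would be an offset
    // into the node table (not shown). The tag layout is invented.
    final class ToyTripleStore {
        static final long INLINE_TAG = 1L << 63;

        static long inlineInt(long value) {
            return INLINE_TAG | (value & ((1L << 56) - 1)); // 56-bit payload
        }

        // A triple is 3 NodeIds = 24 bytes. The SPO "index" is a sorted
        // set standing in for a B+tree; there is no separate triple table.
        static final Comparator<long[]> SPO =
                Comparator.<long[]>comparingLong(t -> t[0])
                          .thenComparingLong(t -> t[1])
                          .thenComparingLong(t -> t[2]);

        final NavigableSet<long[]> spoIndex = new TreeSet<>(SPO);

        // Lookup ":s :p ?var": seek to the start of the (s,p,*) range in
        // the sorted index and scan until S or P changes.
        void findSP(long s, long p) {
            for (long[] t : spoIndex.tailSet(new long[] { s, p, Long.MIN_VALUE })) {
                if (t[0] != s || t[1] != p) break;  // left the (s,p) range
                System.out.println("object NodeId = " + t[2]);
            }
        }

        public static void main(String[] args) {
            ToyTripleStore store = new ToyTripleStore();
            store.spoIndex.add(new long[] { 1, 2, inlineInt(42) });
            store.spoIndex.add(new long[] { 1, 3, 7 });
            store.findSP(1, 2);   // prints the one matching object NodeId
        }
    }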

Execution: see OpExecutor.

* Yes - it's a volcano-style evaluator. It pulls triples from iterators.

* Optimization is a high-level rewrite of the SPARQL algebra, including 
adding specialised algebra operators. The two big optimizations are 
trying to use index joins (small left-hand side) and filter placement 
(as soon as all required variables are available). There are others, but 
those two have the biggest impact (see the sketch after this list).

* Low-level optimization includes reordering basic graph patterns. 
Rather boringly, the fixed, data-independent algorithm does nearly as 
well as the statistics-driven one, and requires no admin.

* The execution is general purpose and not specialised to analytics 
workloads. It tends to execute with joins that do not use a lot of 
memory, for example.

* Each query execution is not parallel (generally).

For Fuseki, parallelism happens across multiple requests running 
concurrently rather than parallelism within one query execution.

It's not that long ago that typical low-end machines gained enough 
(real) cores to make parallelism worth it. It's an interesting direction 
to go in, and some of it has been done experimentally. (Lesson 1 from 
concurrency and optimization: they have their own costs! Sometimes it's 
better just to execute a query than to think about it for too long!)
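
The algebra rewrites mentioned above can be inspected directly via 
ARQ's Algebra.compile and Algebra.optimize; the query text here is just 
an example:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.sparql.algebra.Algebra;
    import org.apache.jena.sparql.algebra.Op;

    public class ShowAlgebra {
        public static void main(String[] args) {
            Query query = QueryFactory.create(
                "SELECT * { ?s ?p ?o . ?o ?p2 ?v . FILTER(?v > 5) }");
            Op op = Algebra.compile(query);           // raw algebra
            System.out.println(op);
            System.out.println(Algebra.optimize(op)); // after rewrites,
                                                      // e.g. filter placement
        }
    }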

TIM:
* It stores Node objects, so no dictionary is needed. Java object 
references and value-based .equals are the dictionary. Also, the parsers 
all add a high degree of object sharing.

* It uses MVCC trees, so again transactions are MR+SW: multiple readers 
and a single writer at the same time.


All storage datasets are transactional - compositions of data from 
different places are only weakly transactional and MRSW (multiple 
readers OR a single writer).
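
For example, with the TIM dataset (the data added here is illustrative):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.rdf.model.Model;

    public class TxnExample {
        public static void main(String[] args) {
            Dataset ds = DatasetFactory.createTxnMem();   // the TIM dataset

            ds.begin(ReadWrite.WRITE);                    // one writer at a time
            try {
                Model m = ds.getDefaultModel();
                m.createResource("http://example/s")
                 .addProperty(m.createProperty("http://example/p"), "o");
                ds.commit();
            } finally {
                ds.end();
            }

            ds.begin(ReadWrite.READ);                     // readers see a
            try {                                         // consistent snapshot
                System.out.println(ds.getDefaultModel().size());
            } finally {
                ds.end();
            }
        }
    }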

Those are the main low-level details.

     Andy


Re: Low level details

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Some implementation details are in the docs; for TDB, see here: 
https://jena.apache.org/documentation/tdb/architecture.html

Execution via ARQ is covered here: 
https://jena.apache.org/documentation/query/arq-query-eval.html

In general, I think Jena also follows the Volcano design for query 
evaluation, which is quite common for databases.


More pointers and information will surely be provided by Andy soon ...
