You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by Yi Liao <Yi...@sas.com> on 2013/11/26 20:47:14 UTC

how does the node to nodeId mapping work?

Hi,

Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html


"The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor).

The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping."

My understanding is that Jena hashes the node into a long integer, and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table.

Is my understanding correct? How does Jena converts the hashed value into an address offset? How is B+ tree used in this process?

Thanks!
Yi Liao

Re: how does the node to nodeId mapping work?

Posted by Marco Neumann <ma...@gmail.com>.

Andy, are Literals in Statements also hashed with the same MD5 in TDB?

On Tue, Nov 26, 2013 at 3:38 PM, Andy Seaborne <an...@apache.org> wrote:
> On 26/11/13 19:47, Yi Liao wrote:
>>
>> Hi,
>
>
> Hi there,
>
>
>>
>> Can anybody explain to me how does Jena map node to nodeId? The
>> following is stated in
>> http://jena.apache.org/documentation/tdb/architecture.html
>>
>>
>> "The Node to NodeId mapping is based on hash of the Node (a 128 bit
>> MD5 hash - the length was found not to major performance factor).
>>
>> The default storage of the node table is a sequential access file for
>> the NodeId to Node mapping and a B+Tree for the Node to NodeId
>> mapping."
>>
>> My understanding is that Jena hashes the node into a long integer,
>
>
> Node ->(by calculation) 128 bit value ->(by index) file offset
>
>
>> and somehow converts the hashed value into an address offset to the
>> node table, and the node information is stored at the address offset
>> in the node table.
>
>
> There is a hash to offset index.
>
> The NodeTable itself is heavily cached.
>
>>
>> Is my understanding correct?
>
>
> Yes!
>
>
>> How does Jena converts the hashed value
>> into an address offset? How is B+ tree used in this process?
>
>
> TDB uses a B+tree for the hash to address offset.  While it only needed to
> be a pure key->value mapping, the B+Tree code is used as it's heavily
> tested.
>
> There is in the codebase an external hash table which is pure key->value.
> Using it did not make an observable difference (see teh cache) so using the
> B+Tree code was easy and it doesn't have the reallocate burstiness of the
> external hash table.
>
>>
>> Thanks! Yi Liao
>>
>
>         Andy
>
>



-- 


---
Marco Neumann
KONA

Re: how does the node to nodeId mapping work?

Posted by Andy Seaborne <an...@apache.org>.

On 16/05/14 18:41, Marco Neumann wrote:
> sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?

At the moment ...

The hashing is done for the mapping of Node -> NodeId;

Triples (quads) are (NodeId, NodeId, NodeId) and NodeIds are not hashes.

NodeIds are 8 bytes: 1 byte for the kind of id and 7 bytes value.

For integers, decimals, dates, datetimes, booleans, the system attempts 
to encode them directly inline into the NodeId (so its 56 bits of signed 
integer, for example; timezones to the nearest 15 mins)

This is why the datatypes get lost : xsd:byte comes out as xsd:integer.

For all other nodes, they are stored in the node file and the 7 bytes is 
the address into that file.

To get from Node to NodeId: happens rarely for query and a lot for 
loading.  You only look up constants in the query.

1/ If the Node can be inlined, calculate it.

2/ If it can't, hash (type marker+lexicalform+datatype+lang or similar 
for the other nodes), and look in the node hase to NodeId table ('node2id')



I have been considering using hashes for the NodeId itself in a related 
but different system.  About 96 bits looks to be safe, or shorter with 
some clash management.  Then loading should be faster (and 
parallelisable).  You need NodeId->real Node for getting query results.

	Andy

>
> On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <an...@apache.org> wrote:
>> On 14/05/14 19:10, Yi Liao wrote:
>>>
>>> Hi Andy,
>>>
>>> Thanks for your answer.
>>>
>>> In you answer "Node ->(by calculation) 128 bit value", did you use
>>> MD5 hash for this step? Is MD5 collision resistant? I guess it is
>>> highly unlikely for two nodes to be hashed into the same value, so we
>>> might just take the risk?
>>>
>>> Thanks, Yi Liao
>>
>>
>> Hi there,
>>
>> Yes - it's MD5.
>>
>> 128 bits is a simply huge number.
>>
>> http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5
>>
>> 2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
>> aren't id'ed by hash.
>>
>> If you are worried by using hashing and assuming uniqueness in 128 bits,
>> then there are a few others things to worry about: RAM corruption, even
>> in error-detecting RAM; ethernet, and indeed any data storage or transfer.
>>
>> And coding bugs :-)
>>
>>          Andy
>>
>>
>>
>>>
>>>
>>> -----Original Message----- From: Andy Seaborne
>>> [mailto:andy@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
>>> dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
>>> work?
>>>
>>> On 26/11/13 19:47, Yi Liao wrote:
>>>>
>>>> Hi,
>>>
>>>
>>> Hi there,
>>>
>>>>
>>>> Can anybody explain to me how does Jena map node to nodeId? The
>>>> following is stated in
>>>> http://jena.apache.org/documentation/tdb/architecture.html
>>>>
>>>>
>>>> "The Node to NodeId mapping is based on hash of the Node (a 128
>>>> bit MD5 hash - the length was found not to major performance
>>>> factor).
>>>>
>>>> The default storage of the node table is a sequential access file
>>>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>>>> mapping."
>>>>
>>>> My understanding is that Jena hashes the node into a long integer,
>>>
>>>
>>> Node ->(by calculation) 128 bit value ->(by index) file offset
>>>
>>>> and somehow converts the hashed value into an address offset to
>>>> the node table, and the node information is stored at the address
>>>> offset in the node table.
>>>
>>>
>>> There is a hash to offset index.
>>>
>>> The NodeTable itself is heavily cached.
>>>
>>>>
>>>> Is my understanding correct?
>>>
>>>
>>> Yes!
>>>
>>>> How does Jena converts the hashed value into an address offset? How
>>>> is B+ tree used in this process?
>>>
>>>
>>> TDB uses a B+tree for the hash to address offset.  While it only
>>> needed to be a pure key->value mapping, the B+Tree code is used as
>>> it's heavily tested.
>>>
>>> There is in the codebase an external hash table which is pure
>>> key->value.  Using it did not make an observable difference (see teh
>>> cache) so using the B+Tree code was easy and it doesn't have the
>>> reallocate burstiness of the external hash table.
>>>
>>>>
>>>> Thanks! Yi Liao
>>>>
>>>   Andy
>>>
>>>
>>>
>>>
>>
>
>
>

Re: how does the node to nodeId mapping work?

Posted by Marco Neumann <ma...@gmail.com>.

sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?

On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <an...@apache.org> wrote:
> On 14/05/14 19:10, Yi Liao wrote:
>>
>> Hi Andy,
>>
>> Thanks for your answer.
>>
>> In you answer "Node ->(by calculation) 128 bit value", did you use
>> MD5 hash for this step? Is MD5 collision resistant? I guess it is
>> highly unlikely for two nodes to be hashed into the same value, so we
>> might just take the risk?
>>
>> Thanks, Yi Liao
>
>
> Hi there,
>
> Yes - it's MD5.
>
> 128 bits is a simply huge number.
>
> http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5
>
> 2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
> aren't id'ed by hash.
>
> If you are worried by using hashing and assuming uniqueness in 128 bits,
> then there are a few others things to worry about: RAM corruption, even
> in error-detecting RAM; ethernet, and indeed any data storage or transfer.
>
> And coding bugs :-)
>
>         Andy
>
>
>
>>
>>
>> -----Original Message----- From: Andy Seaborne
>> [mailto:andy@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
>> dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
>> work?
>>
>> On 26/11/13 19:47, Yi Liao wrote:
>>>
>>> Hi,
>>
>>
>> Hi there,
>>
>>>
>>> Can anybody explain to me how does Jena map node to nodeId? The
>>> following is stated in
>>> http://jena.apache.org/documentation/tdb/architecture.html
>>>
>>>
>>> "The Node to NodeId mapping is based on hash of the Node (a 128
>>> bit MD5 hash - the length was found not to major performance
>>> factor).
>>>
>>> The default storage of the node table is a sequential access file
>>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>>> mapping."
>>>
>>> My understanding is that Jena hashes the node into a long integer,
>>
>>
>> Node ->(by calculation) 128 bit value ->(by index) file offset
>>
>>> and somehow converts the hashed value into an address offset to
>>> the node table, and the node information is stored at the address
>>> offset in the node table.
>>
>>
>> There is a hash to offset index.
>>
>> The NodeTable itself is heavily cached.
>>
>>>
>>> Is my understanding correct?
>>
>>
>> Yes!
>>
>>> How does Jena converts the hashed value into an address offset? How
>>> is B+ tree used in this process?
>>
>>
>> TDB uses a B+tree for the hash to address offset.  While it only
>> needed to be a pure key->value mapping, the B+Tree code is used as
>> it's heavily tested.
>>
>> There is in the codebase an external hash table which is pure
>> key->value.  Using it did not make an observable difference (see teh
>> cache) so using the B+Tree code was easy and it doesn't have the
>> reallocate burstiness of the external hash table.
>>
>>>
>>> Thanks! Yi Liao
>>>
>>  Andy
>>
>>
>>
>>
>



-- 


---
Marco Neumann
KONA

Re: how does the node to nodeId mapping work?

Posted by Andy Seaborne <an...@apache.org>.

On 14/05/14 19:10, Yi Liao wrote:
> Hi Andy,
>
> Thanks for your answer.
>
> In you answer "Node ->(by calculation) 128 bit value", did you use
> MD5 hash for this step? Is MD5 collision resistant? I guess it is
> highly unlikely for two nodes to be hashed into the same value, so we
> might just take the risk?
>
> Thanks, Yi Liao

Hi there,

Yes - it's MD5.

128 bits is a simply huge number.

http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. 
Triples aren't id'ed by hash.

If you are worried by using hashing and assuming uniqueness in 128 bits,
then there are a few others things to worry about: RAM corruption, even
in error-detecting RAM; ethernet, and indeed any data storage or transfer.

And coding bugs :-)

	Andy


>
>
> -----Original Message----- From: Andy Seaborne
> [mailto:andy@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
> dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
> work?
>
> On 26/11/13 19:47, Yi Liao wrote:
>> Hi,
>
> Hi there,
>
>>
>> Can anybody explain to me how does Jena map node to nodeId? The
>> following is stated in
>> http://jena.apache.org/documentation/tdb/architecture.html
>>
>>
>> "The Node to NodeId mapping is based on hash of the Node (a 128
>> bit MD5 hash - the length was found not to major performance
>> factor).
>>
>> The default storage of the node table is a sequential access file
>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>> mapping."
>>
>> My understanding is that Jena hashes the node into a long integer,
>
> Node ->(by calculation) 128 bit value ->(by index) file offset
>
>> and somehow converts the hashed value into an address offset to
>> the node table, and the node information is stored at the address
>> offset in the node table.
>
> There is a hash to offset index.
>
> The NodeTable itself is heavily cached.
>
>>
>> Is my understanding correct?
>
> Yes!
>
>> How does Jena converts the hashed value into an address offset? How
>> is B+ tree used in this process?
>
> TDB uses a B+tree for the hash to address offset.  While it only
> needed to be a pure key->value mapping, the B+Tree code is used as
> it's heavily tested.
>
> There is in the codebase an external hash table which is pure
> key->value.  Using it did not make an observable difference (see teh
> cache) so using the B+Tree code was easy and it doesn't have the
> reallocate burstiness of the external hash table.
>
>>
>> Thanks! Yi Liao
>>
>  Andy
>
>
>
>

RE: how does the node to nodeId mapping work?

Posted by Yi Liao <Yi...@sas.com>.

Hi Andy,

Thanks for your answer. 

In you answer "Node ->(by calculation) 128 bit value", did you use MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed into the same value, so we might just take the risk?

Thanks,
Yi Liao

-----Original Message-----
From: Andy Seaborne [mailto:andy@apache.org] 
Sent: Tuesday, November 26, 2013 3:39 PM
To: dev@jena.apache.org
Subject: Re: how does the node to nodeId mapping work?

On 26/11/13 19:47, Yi Liao wrote:
> Hi,

Hi there,

>
> Can anybody explain to me how does Jena map node to nodeId? The 
> following is stated in 
> http://jena.apache.org/documentation/tdb/architecture.html
>
>
> "The Node to NodeId mapping is based on hash of the Node (a 128 bit
> MD5 hash - the length was found not to major performance factor).
>
> The default storage of the node table is a sequential access file for 
> the NodeId to Node mapping and a B+Tree for the Node to NodeId 
> mapping."
>
> My understanding is that Jena hashes the node into a long integer,

Node ->(by calculation) 128 bit value ->(by index) file offset

> and somehow converts the hashed value into an address offset to the 
> node table, and the node information is stored at the address offset 
> in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.

>
> Is my understanding correct?

Yes!

> How does Jena converts the hashed value into an address offset? How is 
> B+ tree used in this process?

TDB uses a B+tree for the hash to address offset.  While it only needed to be a pure key->value mapping, the B+Tree code is used as it's heavily tested.

There is in the codebase an external hash table which is pure 
key->value.  Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table.

>
> Thanks! Yi Liao
>

	Andy

Re: how does the node to nodeId mapping work?

Posted by Andy Seaborne <an...@apache.org>.

On 26/11/13 19:47, Yi Liao wrote:
> Hi,

Hi there,

>
> Can anybody explain to me how does Jena map node to nodeId? The
> following is stated in
> http://jena.apache.org/documentation/tdb/architecture.html
>
>
> "The Node to NodeId mapping is based on hash of the Node (a 128 bit
> MD5 hash - the length was found not to major performance factor).
>
> The default storage of the node table is a sequential access file for
> the NodeId to Node mapping and a B+Tree for the Node to NodeId
> mapping."
>
> My understanding is that Jena hashes the node into a long integer,

Node ->(by calculation) 128 bit value ->(by index) file offset

> and somehow converts the hashed value into an address offset to the
> node table, and the node information is stored at the address offset
> in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.

>
> Is my understanding correct?

Yes!

> How does Jena converts the hashed value
> into an address offset? How is B+ tree used in this process?

TDB uses a B+tree for the hash to address offset.  While it only needed 
to be a pure key->value mapping, the B+Tree code is used as it's heavily 
tested.

There is in the codebase an external hash table which is pure 
key->value.  Using it did not make an observable difference (see teh 
cache) so using the B+Tree code was easy and it doesn't have the 
reallocate burstiness of the external hash table.

>
> Thanks! Yi Liao
>

	Andy