You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jena.apache.org by Yi Liao <Yi...@sas.com> on 2014/05/14 20:10:46 UTC

RE: how does the node to nodeId mapping work?

Hi Andy,

Thanks for your answer. 

In you answer "Node ->(by calculation) 128 bit value", did you use MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed into the same value, so we might just take the risk?

Thanks,
Yi Liao


-----Original Message-----
From: Andy Seaborne [mailto:andy@apache.org] 
Sent: Tuesday, November 26, 2013 3:39 PM
To: dev@jena.apache.org
Subject: Re: how does the node to nodeId mapping work?

On 26/11/13 19:47, Yi Liao wrote:
> Hi,

Hi there,

>
> Can anybody explain to me how does Jena map node to nodeId? The 
> following is stated in 
> http://jena.apache.org/documentation/tdb/architecture.html
>
>
> "The Node to NodeId mapping is based on hash of the Node (a 128 bit
> MD5 hash - the length was found not to major performance factor).
>
> The default storage of the node table is a sequential access file for 
> the NodeId to Node mapping and a B+Tree for the Node to NodeId 
> mapping."
>
> My understanding is that Jena hashes the node into a long integer,

Node ->(by calculation) 128 bit value ->(by index) file offset

> and somehow converts the hashed value into an address offset to the 
> node table, and the node information is stored at the address offset 
> in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.

>
> Is my understanding correct?

Yes!

> How does Jena converts the hashed value into an address offset? How is 
> B+ tree used in this process?

TDB uses a B+tree for the hash to address offset.  While it only needed to be a pure key->value mapping, the B+Tree code is used as it's heavily tested.

There is in the codebase an external hash table which is pure 
key->value.  Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table.

>
> Thanks! Yi Liao
>
	
	Andy





Re: how does the node to nodeId mapping work?

Posted by Andy Seaborne <an...@apache.org>.
On 16/05/14 18:41, Marco Neumann wrote:
> sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?

At the moment ...

The hashing is done for the mapping of Node -> NodeId;

Triples (quads) are (NodeId, NodeId, NodeId) and NodeIds are not hashes.

NodeIds are 8 bytes: 1 byte for the kind of id and 7 bytes value.

For integers, decimals, dates, datetimes, booleans, the system attempts 
to encode them directly inline into the NodeId (so its 56 bits of signed 
integer, for example; timezones to the nearest 15 mins)

This is why the datatypes get lost : xsd:byte comes out as xsd:integer.

For all other nodes, they are stored in the node file and the 7 bytes is 
the address into that file.

To get from Node to NodeId: happens rarely for query and a lot for 
loading.  You only look up constants in the query.

1/ If the Node can be inlined, calculate it.

2/ If it can't, hash (type marker+lexicalform+datatype+lang or similar 
for the other nodes), and look in the node hase to NodeId table ('node2id')



I have been considering using hashes for the NodeId itself in a related 
but different system.  About 96 bits looks to be safe, or shorter with 
some clash management.  Then loading should be faster (and 
parallelisable).  You need NodeId->real Node for getting query results.

	Andy

>
> On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <an...@apache.org> wrote:
>> On 14/05/14 19:10, Yi Liao wrote:
>>>
>>> Hi Andy,
>>>
>>> Thanks for your answer.
>>>
>>> In you answer "Node ->(by calculation) 128 bit value", did you use
>>> MD5 hash for this step? Is MD5 collision resistant? I guess it is
>>> highly unlikely for two nodes to be hashed into the same value, so we
>>> might just take the risk?
>>>
>>> Thanks, Yi Liao
>>
>>
>> Hi there,
>>
>> Yes - it's MD5.
>>
>> 128 bits is a simply huge number.
>>
>> http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5
>>
>> 2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
>> aren't id'ed by hash.
>>
>> If you are worried by using hashing and assuming uniqueness in 128 bits,
>> then there are a few others things to worry about: RAM corruption, even
>> in error-detecting RAM; ethernet, and indeed any data storage or transfer.
>>
>> And coding bugs :-)
>>
>>          Andy
>>
>>
>>
>>>
>>>
>>> -----Original Message----- From: Andy Seaborne
>>> [mailto:andy@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
>>> dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
>>> work?
>>>
>>> On 26/11/13 19:47, Yi Liao wrote:
>>>>
>>>> Hi,
>>>
>>>
>>> Hi there,
>>>
>>>>
>>>> Can anybody explain to me how does Jena map node to nodeId? The
>>>> following is stated in
>>>> http://jena.apache.org/documentation/tdb/architecture.html
>>>>
>>>>
>>>> "The Node to NodeId mapping is based on hash of the Node (a 128
>>>> bit MD5 hash - the length was found not to major performance
>>>> factor).
>>>>
>>>> The default storage of the node table is a sequential access file
>>>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>>>> mapping."
>>>>
>>>> My understanding is that Jena hashes the node into a long integer,
>>>
>>>
>>> Node ->(by calculation) 128 bit value ->(by index) file offset
>>>
>>>> and somehow converts the hashed value into an address offset to
>>>> the node table, and the node information is stored at the address
>>>> offset in the node table.
>>>
>>>
>>> There is a hash to offset index.
>>>
>>> The NodeTable itself is heavily cached.
>>>
>>>>
>>>> Is my understanding correct?
>>>
>>>
>>> Yes!
>>>
>>>> How does Jena converts the hashed value into an address offset? How
>>>> is B+ tree used in this process?
>>>
>>>
>>> TDB uses a B+tree for the hash to address offset.  While it only
>>> needed to be a pure key->value mapping, the B+Tree code is used as
>>> it's heavily tested.
>>>
>>> There is in the codebase an external hash table which is pure
>>> key->value.  Using it did not make an observable difference (see teh
>>> cache) so using the B+Tree code was easy and it doesn't have the
>>> reallocate burstiness of the external hash table.
>>>
>>>>
>>>> Thanks! Yi Liao
>>>>
>>>   Andy
>>>
>>>
>>>
>>>
>>
>
>
>


Re: how does the node to nodeId mapping work?

Posted by Marco Neumann <ma...@gmail.com>.
sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?

On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <an...@apache.org> wrote:
> On 14/05/14 19:10, Yi Liao wrote:
>>
>> Hi Andy,
>>
>> Thanks for your answer.
>>
>> In you answer "Node ->(by calculation) 128 bit value", did you use
>> MD5 hash for this step? Is MD5 collision resistant? I guess it is
>> highly unlikely for two nodes to be hashed into the same value, so we
>> might just take the risk?
>>
>> Thanks, Yi Liao
>
>
> Hi there,
>
> Yes - it's MD5.
>
> 128 bits is a simply huge number.
>
> http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5
>
> 2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
> aren't id'ed by hash.
>
> If you are worried by using hashing and assuming uniqueness in 128 bits,
> then there are a few others things to worry about: RAM corruption, even
> in error-detecting RAM; ethernet, and indeed any data storage or transfer.
>
> And coding bugs :-)
>
>         Andy
>
>
>
>>
>>
>> -----Original Message----- From: Andy Seaborne
>> [mailto:andy@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
>> dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
>> work?
>>
>> On 26/11/13 19:47, Yi Liao wrote:
>>>
>>> Hi,
>>
>>
>> Hi there,
>>
>>>
>>> Can anybody explain to me how does Jena map node to nodeId? The
>>> following is stated in
>>> http://jena.apache.org/documentation/tdb/architecture.html
>>>
>>>
>>> "The Node to NodeId mapping is based on hash of the Node (a 128
>>> bit MD5 hash - the length was found not to major performance
>>> factor).
>>>
>>> The default storage of the node table is a sequential access file
>>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>>> mapping."
>>>
>>> My understanding is that Jena hashes the node into a long integer,
>>
>>
>> Node ->(by calculation) 128 bit value ->(by index) file offset
>>
>>> and somehow converts the hashed value into an address offset to
>>> the node table, and the node information is stored at the address
>>> offset in the node table.
>>
>>
>> There is a hash to offset index.
>>
>> The NodeTable itself is heavily cached.
>>
>>>
>>> Is my understanding correct?
>>
>>
>> Yes!
>>
>>> How does Jena converts the hashed value into an address offset? How
>>> is B+ tree used in this process?
>>
>>
>> TDB uses a B+tree for the hash to address offset.  While it only
>> needed to be a pure key->value mapping, the B+Tree code is used as
>> it's heavily tested.
>>
>> There is in the codebase an external hash table which is pure
>> key->value.  Using it did not make an observable difference (see teh
>> cache) so using the B+Tree code was easy and it doesn't have the
>> reallocate burstiness of the external hash table.
>>
>>>
>>> Thanks! Yi Liao
>>>
>>  Andy
>>
>>
>>
>>
>



-- 


---
Marco Neumann
KONA

Re: how does the node to nodeId mapping work?

Posted by Andy Seaborne <an...@apache.org>.
On 14/05/14 19:10, Yi Liao wrote:
> Hi Andy,
>
> Thanks for your answer.
>
> In you answer "Node ->(by calculation) 128 bit value", did you use
> MD5 hash for this step? Is MD5 collision resistant? I guess it is
> highly unlikely for two nodes to be hashed into the same value, so we
> might just take the risk?
>
> Thanks, Yi Liao

Hi there,

Yes - it's MD5.

128 bits is a simply huge number.

http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. 
Triples aren't id'ed by hash.

If you are worried by using hashing and assuming uniqueness in 128 bits,
then there are a few others things to worry about: RAM corruption, even
in error-detecting RAM; ethernet, and indeed any data storage or transfer.

And coding bugs :-)

	Andy


>
>
> -----Original Message----- From: Andy Seaborne
> [mailto:andy@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
> dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
> work?
>
> On 26/11/13 19:47, Yi Liao wrote:
>> Hi,
>
> Hi there,
>
>>
>> Can anybody explain to me how does Jena map node to nodeId? The
>> following is stated in
>> http://jena.apache.org/documentation/tdb/architecture.html
>>
>>
>> "The Node to NodeId mapping is based on hash of the Node (a 128
>> bit MD5 hash - the length was found not to major performance
>> factor).
>>
>> The default storage of the node table is a sequential access file
>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>> mapping."
>>
>> My understanding is that Jena hashes the node into a long integer,
>
> Node ->(by calculation) 128 bit value ->(by index) file offset
>
>> and somehow converts the hashed value into an address offset to
>> the node table, and the node information is stored at the address
>> offset in the node table.
>
> There is a hash to offset index.
>
> The NodeTable itself is heavily cached.
>
>>
>> Is my understanding correct?
>
> Yes!
>
>> How does Jena converts the hashed value into an address offset? How
>> is B+ tree used in this process?
>
> TDB uses a B+tree for the hash to address offset.  While it only
> needed to be a pure key->value mapping, the B+Tree code is used as
> it's heavily tested.
>
> There is in the codebase an external hash table which is pure
> key->value.  Using it did not make an observable difference (see teh
> cache) so using the B+Tree code was easy and it doesn't have the
> reallocate burstiness of the external hash table.
>
>>
>> Thanks! Yi Liao
>>
>  Andy
>
>
>
>