You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@marmotta.apache.org by "Sebastian Schaffert (JIRA)" <ji...@apache.org> on 2014/03/17 14:03:50 UTC

[jira] [Updated] (MARMOTTA-469) KiWi: Hashing-Based ID Generation

     [ https://issues.apache.org/jira/browse/MARMOTTA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schaffert updated MARMOTTA-469:
-----------------------------------------

    Description: 
The KiWi triple store currently generates unique IDs for nodes and triples using a kind of sequence generator. Snowflake is generally very fast, but to ensure that the same object always gets the same ID a lot of synchronization is necessary (immediate commit for nodes, triple registry for triples), which has a considerable performance impact, particularly in clustered environments.

A much faster approach would be to compute the ID from the objects themselves, e.g. using an efficient and good hashing function. With a 64bit hash, the probability for conflicts starts getting serious at around 2 billion objects (probability 10%), so it might make sense switching to 128bit keys as well.

A good overview over clash probabilities is given in:

http://preshing.com/20110504/hash-collision-probabilities/

Changes would affect the API for ID generation (IDGenerator) as well as the value factory. In addition, we would need to ignore duplicate IDs for database inserts, e.g. using triggers or merge. Finally, we need to rethink the behaviour of deleted/non-deleted triples.


  was:
The KiWi triple store currently generates unique IDs for nodes and triples using a kind of sequence generator. Snowflake is generally very fast, but to ensure that the same object always gets the same ID a lot of synchronization is necessary (immediate commit for nodes, triple registry for triples), which has a considerable performance impact, particularly in clustered environments.

A much faster approach would be to compute the ID from the objects themselves, e.g. using an efficient and good hashing function. With a 64bit hash, the probability for conflicts starts getting serious at around 2 billion objects (probability 10%), so it might make sense switching to 128bit keys as well.

A good overview over clash probabilities is given in:



> KiWi: Hashing-Based ID Generation
> ---------------------------------
>
>                 Key: MARMOTTA-469
>                 URL: https://issues.apache.org/jira/browse/MARMOTTA-469
>             Project: Marmotta
>          Issue Type: Improvement
>          Components: KiWi Triple Store
>            Reporter: Sebastian Schaffert
>            Assignee: Sebastian Schaffert
>             Fix For: 3.3
>
>
> The KiWi triple store currently generates unique IDs for nodes and triples using a kind of sequence generator. Snowflake is generally very fast, but to ensure that the same object always gets the same ID a lot of synchronization is necessary (immediate commit for nodes, triple registry for triples), which has a considerable performance impact, particularly in clustered environments.
> A much faster approach would be to compute the ID from the objects themselves, e.g. using an efficient and good hashing function. With a 64bit hash, the probability for conflicts starts getting serious at around 2 billion objects (probability 10%), so it might make sense switching to 128bit keys as well.
> A good overview over clash probabilities is given in:
> http://preshing.com/20110504/hash-collision-probabilities/
> Changes would affect the API for ID generation (IDGenerator) as well as the value factory. In addition, we would need to ignore duplicate IDs for database inserts, e.g. using triggers or merge. Finally, we need to rethink the behaviour of deleted/non-deleted triples.



--
This message was sent by Atlassian JIRA
(v6.2#6252)