Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2015/05/19 12:24:00 UTC

[jira] [Comment Edited] (OAK-2882) Support migration without access to DataStore

    [ https://issues.apache.org/jira/browse/OAK-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549960#comment-14549960 ] 

Thomas Mueller edited comment on OAK-2882 at 5/19/15 10:23 AM:
---------------------------------------------------------------

As for the "//TODO For now using an in memory map. For very large repositories this might consume lots of memory.", I have an idea how to solve that: we could use a minimum perfect hash table, and only store the length and a few bytes of the keys. This would require only about 8 bytes per entry, so about 8 MB of heap memory per 1 million entries in the data store. The minimum perfect hash table needs 2 bits per key, then let's say 4 bytes per key for a fingerprint of the identifier (to detect, with very high probability, if there is a bug or missing entry), and 4 bytes for the length. If the length is larger than 4 bytes (very very rare), we store -1, which means a file lookup is needed for those. I have [an implementation|https://github.com/h2database/h2database/blob/master/h2/src/tools/org/h2/dev/hash/MinimalPerfectHash.java] of the minimum perfect hash table. 

Just keeping a simple array, sorted by fingerprint, might not work, as there could be multiple entries with the same fingerprint.


was (Author: tmueller):
As for the "//TODO For now using an in memory map. For very large repositories this might consume lots of memory.", I have an idea how to solve that: we could use a minimum perfect hash table, and only store the length and a few bytes of the keys. This would require only about 8 bytes per entry, so about 8 MB of heap memory per 1 million entries in the data store. The minimum perfect hash table needs 2 bits per key, then let's say 4 bytes per key for a fingerprint of the identifier (to detect, with very high probability, if there is a bug or missing entry), and 4 bytes for the length. If the length is larger than 4 bytes (very very rare), we store -1, which means a file lookup is needed for those. I have [an implementation|https://github.com/h2database/h2database/blob/master/h2/src/tools/org/h2/dev/hash/MinimalPerfectHash.java] of the minimum perfect hash table. Instead of using a hash table, we could just keep a simple array, sorted by fingerprint.

> Support migration without access to DataStore
> ---------------------------------------------
>
>                 Key: OAK-2882
>                 URL: https://issues.apache.org/jira/browse/OAK-2882
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: upgrade
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: docs-impacting
>             Fix For: 1.3.0, 1.0.15
>
>         Attachments: OAK-2882-v2.patch, OAK-2882.patch, build_datastore_list.sh
>
>
> Migration currently involves access to the DataStore, as it is configured as part of repository.xml. However, in a complete migration the actual binary content in the DataStore is not accessed; the migration logic only makes use of
> * DataIdentifier = the id of the file
> * Length = as it gets encoded as part of the blobId (OAK-1667)
> It would be faster and beneficial to allow migration without actual access to the DataStore. This would serve two benefits:
> # Allows one to test out migration on a local setup by just copying the TarPM files. For example, one can zip just the following files to get going with repository startup, if we can somehow avoid needing direct access to the DataStore
> {noformat}
> >crx-quickstart# tar -zcvf repo-2.tar.gz repository --exclude=repository/repository/datastore --exclude=repository/repository/index --exclude=repository/workspaces/crx.default/index --exclude=repository/tarJournal
> {noformat}
> # Provides faster (repeatable) migration, as access to the DataStore can be avoided, which in cases like S3 might be slow, provided we solve how to obtain the length.
> *Proposal*
> Have a DataStore implementation which can be provided a mapping file having entries for blobId and length. This file would be used to answer queries regarding the length and existence of a blob, and thus would avoid actual access to the DataStore.
> Going further, this DataStore can be configured with a delegate, which can be used as a fallback in case the required details are not present in the precomputed data set (for example, due to a change in content after that data was computed).
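
A rough sketch of how such a mapping-file-backed lookup with a delegate fallback could work; the class name, the one-"blobId|length"-entry-per-line file format, and the fallback interface below are illustrative assumptions, not the attached patch:

{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class MappingFileLengthResolver {

    // blobId -> length, loaded once from the precomputed mapping file
    private final Map<String, Long> lengths = new HashMap<>();

    MappingFileLengthResolver(Path mappingFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(mappingFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int sep = line.lastIndexOf('|');
                if (sep < 0) {
                    continue; // skip malformed lines
                }
                lengths.put(line.substring(0, sep), Long.parseLong(line.substring(sep + 1)));
            }
        }
    }

    boolean exists(String blobId) {
        return lengths.containsKey(blobId);
    }

    // answers length queries from the precomputed set and falls back to a
    // delegate (e.g. the real DataStore) when the entry is missing
    long getLength(String blobId, LengthFallback delegate) throws IOException {
        Long length = lengths.get(blobId);
        return length != null ? length : delegate.getLength(blobId);
    }

    interface LengthFallback {
        long getLength(String blobId) throws IOException;
    }
}
{noformat}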



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)