Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2016/06/15 15:30:10 UTC
[jira] [Comment Edited] (OAK-4471) More compact storage format for Documents

    [ https://issues.apache.org/jira/browse/OAK-4471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331518#comment-15331518 ] 

Chetan Mehrotra edited comment on OAK-4471 at 6/15/16 3:30 PM:
---------------------------------------------------------------

*Use dictionary for Property Names*

Under this approach we can use a dictionary for commonly occurring property names. Below are some stats from a repository with:

* 14.5 M documents in the nodes collection; ~8 M of these are for index data, while only ~2 M are actual nodes!
* 26 M property names
* 8475 unique property names
* 290M - Total size of property names (assuming 8 bits per char)
* 8755M - Total repo size

Top property name stats

{noformat}
+-----------------------------------------------------+
|Count  |Name                    |% by count|% by size|
+-----------------------------------------------------+
|3033972|jcr:lastModified        |11.65     |15.94    |
|2573208|jcr:data                |9.88      |6.76     |
|2505308|uniqueKey               |9.62      |7.40     |
|2286350|blobSize                |8.78      |6.00     |
|1706460|match                   |6.55      |2.80     |
|1484283|jcr:primaryType         |5.70      |7.31     |
|969596 |jcr:created             |3.72      |3.50     |
|933960 |jcr:createdBy           |3.59      |3.99     |
|921199 |sling:resourceType      |3.54      |5.44     |
|702208 |:childOrder             |2.70      |2.54     |
|601959 |entry                   |2.31      |0.99     |
|600299 |jcr:uuid                |2.31      |1.58     |
|481036 |jcr:lastModifiedBy      |1.85      |2.84     |
|477625 |jcr:frozenPrimaryType   |1.83      |3.29     |
|477625 |jcr:frozenUuid          |1.83      |2.20     |
|357201 |text                    |1.37      |0.47     |
|351712 |textIsRich              |1.35      |1.15     |
|228623 |event\djob\dqueued\dtime|0.88      |1.80     |
+-----------------------------------------------------+
{noformat}

Based on the above, property names make up only 290M of the 8755M total repo size, so for now using a dictionary for property names would not provide much benefit!
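
As a rough cross-check of the numbers above, here is a minimal back-of-the-envelope sketch. It assumes "M" means MiB, that every occurrence of a listed name would be replaced by a hypothetical fixed 2-byte dictionary id, and it only uses the first few rows of the table. The class and method names are made up for illustration and are not part of Oak.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PropertyNameDictionaryEstimate {

    // Totals quoted in the comment; "M" is assumed here to mean MiB.
    static final double TOTAL_NAME_SIZE_M = 290;
    static final double TOTAL_REPO_SIZE_M = 8755;
    static final double BYTES_PER_M = 1024 * 1024;

    /** Share of the repository taken by all property names, in percent. */
    static double nameShareOfRepoPercent() {
        return 100.0 * TOTAL_NAME_SIZE_M / TOTAL_REPO_SIZE_M;
    }

    /** Bytes used by one name across all its occurrences (8 bits per char). */
    static long nameBytes(String name, long count) {
        return (long) name.length() * count;
    }

    /**
     * Bytes saved if each occurrence of a listed name is replaced by an id
     * of idBytes bytes. Names no longer than the id save nothing.
     */
    static long bytesSaved(Map<String, Long> counts, int idBytes) {
        long saved = 0;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            int perOccurrence = e.getKey().length() - idBytes;
            if (perOccurrence > 0) {
                saved += perOccurrence * e.getValue();
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        // First rows of the "Top property name stats" table.
        Map<String, Long> top = new LinkedHashMap<>();
        top.put("jcr:lastModified", 3033972L);
        top.put("jcr:data", 2573208L);
        top.put("uniqueKey", 2505308L);
        top.put("blobSize", 2286350L);
        top.put("match", 1706460L);
        top.put("jcr:primaryType", 1484283L);

        System.out.printf("names / repo: %.2f%%%n", nameShareOfRepoPercent());

        long saved = bytesSaved(top, 2); // 2-byte id is an assumption
        System.out.printf("saved by dictionary: %.1f M (%.2f%% of repo)%n",
                saved / BYTES_PER_M,
                100.0 * saved / (TOTAL_REPO_SIZE_M * BYTES_PER_M));
    }
}
```

Even if a dictionary removed every property name entirely, the saving could not exceed the share reported by `nameShareOfRepoPercent()`, which is roughly 3.3% of the repo size.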


> More compact storage format for Documents
> -----------------------------------------
>
>                 Key: OAK-4471
>                 URL: https://issues.apache.org/jira/browse/OAK-4471
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: documentmk
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.6
>
>         Attachments: node-doc-size2.png
>
>
> The aim of this task is to evaluate the storage cost of the current approach for various Documents in DocumentNodeStore, and then evaluate possible alternatives to see if we can get a significant reduction in storage size.
> Possible areas of improvement
> # NodeDocument
> ## Use binary encoding for property values - Currently property values are stored in JSON encoding, i.e. arrays and single values are encoded in JSON along with their type
> ## Use binary encoding for Revision values - In a given document Revision instances are a major part of storage size. A binary encoding might provide more compact storage
> # Journal - The journal entries can be stored in compressed form
> Any new approach should support working with existing setups, i.e. provide a gradual change in storage format.
> *Possible Benefits*
> More compact storage would help in following ways
> # Low memory footprint of Document in Mongo and RDB
> # Low memory footprint for in-memory NodeDocument instances - e.g. property values stored in binary format would consume less memory
> # Reduction in IO over the wire - That should reduce latency in, say, distributed deployments where Oak has to talk to a remote primary
> Note that before making any such change we must analyze the gains. Any change in encoding would make interpreting stored data harder, and it also represents a significant change in stored data, where we need to be careful not to introduce any bugs!
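
To illustrate the "journal entries can be stored in compressed form" idea from the description above, here is a minimal sketch using GZIP from the JDK. The JSON layout produced by `sampleEntry` is hypothetical and is not the actual Oak journal format; the class and method names are likewise made up, and the choice of GZIP is an assumption rather than something the issue specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class JournalEntryCompression {

    /** Compresses a journal entry's JSON text with GZIP. */
    static byte[] compress(String json) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Restores the original JSON text from its GZIP form. */
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a made-up journal-like JSON entry listing n changed paths. */
    static String sampleEntry(int n) {
        StringBuilder sb = new StringBuilder("{\"changes\":[");
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("\"/content/site/page").append(i).append("/jcr:content\"");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        String json = sampleEntry(200);
        byte[] compressed = compress(json);
        System.out.println("raw bytes:        "
                + json.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("round trip ok:    "
                + decompress(compressed).equals(json));
    }
}
```

Repetitive path lists like the made-up one here tend to compress well, but the actual gain would have to be measured on real journal data, in line with the note above about analyzing gains first.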



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)