Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2017/12/14 15:02:00 UTC

[jira] [Commented] (OAK-7066) Active deletion blob list files can grow too large due to inlined blobs

    [ https://issues.apache.org/jira/browse/OAK-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290941#comment-16290941 ] 

Vikas Saurabh commented on OAK-7066:
------------------------------------

[~amjain],

quoting [~tmueller] from OAK-7052:
{quote}
Isn't there a method to check if a blob id is inlined? I think SharedDataStore.resolveChunks could be used (if it returns an empty set, it's inline). Amit Jain is this the right way? If there is no such method yet, it would be great to add one.
{quote}

The active deletion purge logic already relies on a non-empty result from {{resolveChunks}} before passing the chunk ids to {{countDeleteChunks}}. So maybe we could simply do this test right away, while recording the blob id, as well. One thing I'm not completely sure of: would {{resolveChunks}} try to read non-inlined blobs too? (I'm guessing not.)
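
To make the idea concrete, here is a minimal sketch of such a check at recording time. It is not the actual Oak code: {{InlinedBlobFilter}}, {{hasChunks}} and {{idsToRecord}} are hypothetical names, and the {{Function}} passed in is only a stand-in for a {{resolveChunks}}-style lookup. It also assumes, as suggested in the quote above, that an inlined blob id resolves to no chunks.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Sketch only: skip inlined blob ids when recording blobs for active deletion. */
class InlinedBlobFilter {

    // Hypothetical stand-in for a resolveChunks-style lookup that maps a blob id
    // to its chunk ids. Assumption: inlined blobs resolve to no chunks at all.
    private final Function<String, Iterator<String>> resolveChunks;

    InlinedBlobFilter(Function<String, Iterator<String>> resolveChunks) {
        this.resolveChunks = resolveChunks;
    }

    /** True when the blob id resolves to at least one chunk, i.e. it is not inlined. */
    boolean hasChunks(String blobId) {
        return resolveChunks.apply(blobId).hasNext();
    }

    /** Returns only the blob ids that would be written to the deleted-blobs list. */
    List<String> idsToRecord(List<String> deletedBlobIds) {
        List<String> recorded = new ArrayList<>();
        for (String id : deletedBlobIds) {
            if (hasChunks(id)) {
                recorded.add(id);
            }
        }
        return recorded;
    }
}
```

Whether a lookup like this is cheap enough to run for every recorded blob depends on the open question above, namely whether resolving chunks ever reads the blob itself.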

Anyway, irrespective of the stats I posted in the description, not recording inlined blob ids would reduce the file size significantly (the result would likely be ~1% of what it is currently) without any side effect on the efficacy of the feature.
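
The ~1% estimate can be checked against the numbers in the description below: each ratio column in the table is the "small" value divided by the sum of the "small" and "large" values. The following is a small sketch of that arithmetic for the first file, assuming (as the description implies) that the ids longer than 500 bytes are the inlined ones. {{BlobListRatios}} and {{smallFraction}} are illustrative names, not part of Oak.

```java
/** Sketch only: derive the ratio columns of the table from the awk output. */
class BlobListRatios {

    /** Fraction that the "small" part contributes to small + large. */
    static double smallFraction(long large, long small) {
        return (double) small / (large + small);
    }

    public static void main(String[] args) {
        // Values reported for blobs-1512664032264.txt in the description.
        long largeLines = 245301L;
        long largeSize = 3310224358L;
        long smallLines = 173096L;
        long smallSize = 35473656L;

        System.out.printf("small_lines/total_lines = %.6f%n",
                smallFraction(largeLines, smallLines));
        System.out.printf("small_size/total_size   = %.6f%n",
                smallFraction(largeSize, smallSize));
    }
}
```

So for that file the short ids make up roughly 41% of the entries but only about 1% of the id bytes, which is where the size estimate comes from.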

/cc [~chetanm], [~tmueller]

> Active deletion blob list files can grow too large due to inlined blobs
> -----------------------------------------------------------------------
>
>                 Key: OAK-7066
>                 URL: https://issues.apache.org/jira/browse/OAK-7066
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>
> This is a follow-up to OAK-7052, where we noticed that the deleted blob list files collected by the active deletion logic can grow very large due to inlined blobs.
> One potential fix (not sure how yet, though) is to not actively delete inlined blobs.
> Here are some stats that might help us decide (based on raw numbers collected at \[0]):
> ||file-name||large_lines||large_size||small_lines||small_size||small_lines/total_lines||small_size/total_size||
> |blobs-1512664032264.txt|245301|3310224358|173096|35473656|0.413712335413495|0.010602766852107|
> |blobs-1512698405656.txt|370373|4443957885|256775|52997864|0.409432861142824|0.011785275852845|
> |blobs-1512987450004.txt|660669|6214740439|461168|92017554|0.411082893504137|0.014590309966251|
> |blobs-1513130410963.txt|569083|5490965583|406756|80124598|0.416826956085994|0.014382211631264|
> |blobs-1513216819447.txt|69876|1413561892|46238|9221956|0.398212101899857|0.006481628262061|
> \[0]:
> file sizes
> {noformat}
> repository/index/deleted-blobs$ ls -l blobs-151*
> -rw-r--r-- 1 root root 3369065620 Dec  8 01:59 blobs-1512664032264.txt
> -rw-r--r-- 1 root root 4532250073 Dec  9 01:59 blobs-1512698405656.txt
> -rw-r--r-- 1 root root 6370201955 Dec 13 01:59 blobs-1512987450004.txt
> -rw-r--r-- 1 root root 1916223582 Dec 13 11:52 blobs-1513130410963.txt
> {noformat}
> number of entries
> {noformat}
> repository/index/deleted-blobs$ wc -l blobs-151*
>      418397 blobs-1512664032264.txt
>      627148 blobs-1512698405656.txt
>     1121837 blobs-1512987450004.txt
>      308292 blobs-1513130410963.txt
>     2475674 total
> {noformat}
> number of entries and sizes split on threshold of 500 bytes of blob ids
> {noformat}
> repository/index/deleted-blobs$ for i in blobs-151*;do echo $i;awk 'BEGIN {FS="|"} {len = length($1); if (len > 500) {large++; largeSize+=len} else {small++; smallSize+=len}} END {print large, largeSize, small, smallSize}' $i;done
> blobs-1512664032264.txt
> 245301 3310224358 173096 35473656
> blobs-1512698405656.txt
> 370373 4443957885 256775 52997864
> blobs-1512987450004.txt
> 660669 6214740439 461168 92017554
> blobs-1513130410963.txt
> 569083 5490965583 406756 80124598
> blobs-1513216819447.txt
> 69876 1413561892 46238 9221956
> {noformat}


