Posted to oak-issues@jackrabbit.apache.org by "Michael Dürig (JIRA)" <ji...@apache.org> on 2016/11/30 11:40:58 UTC
[jira] [Comment Edited] (OAK-5192) Reduce Lucene related growth of repository size

    [ https://issues.apache.org/jira/browse/OAK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708334#comment-15708334 ] 

Michael Dürig edited comment on OAK-5192 at 11/30/16 11:40 AM:
---------------------------------------------------------------

The following plots show bytes added over time to content (upper plot) and bytes added over time to the index (lower plot). In terms of the number of bytes added, the index is three orders of magnitude above regular content.

!added-bytes-zoom.png|width=500!

The spike that recurs every 40s in the writes to the index is caused by Lucene's merging. Switching from {{SerialMergeScheduler}} to {{NoMergeScheduler}} flattens the curve and also reduces the total amount of data written by a factor of 13.

{code}
--- oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java	(date 1480408502000)
+++ oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java	(revision )
@@ -61,7 +60,8 @@
             Analyzer analyzer = new PerFieldAnalyzerWrapper(definitionAnalyzer, analyzers);
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
-                config.setMergeScheduler(new SerialMergeScheduler());
+                config.setMergeScheduler(NoMergeScheduler.INSTANCE);
+                config.setMergePolicy(NoMergePolicy.COMPOUND_FILES);
             }
             if (definition.getCodec() != null) {
                 config.setCodec(definition.getCodec());
{code}
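
To illustrate why merging inflates the number of bytes written, here is a toy model in plain Java. It is only a sketch under simplifying assumptions and does not use Lucene: every flush creates one segment of a fixed size, and whenever {{mergeFactor}} segments of the same size exist they are rewritten into one larger segment. The class name, method names and parameters are hypothetical, and the model does not reflect Lucene's actual merge policy or the measurements above.

```java
public class MergeWriteAmplification {

    /** Bytes written when every flush creates one segment and nothing is ever merged. */
    static long bytesWithoutMerging(int flushes, long segmentBytes) {
        return (long) flushes * segmentBytes;
    }

    /**
     * Bytes written when mergeFactor equally sized segments are always
     * rewritten into a single larger segment (a simplified, hypothetical policy).
     */
    static long bytesWithMerging(int flushes, long segmentBytes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        // countAtLevel[k] = number of live segments of size segmentBytes * mergeFactor^k
        int[] countAtLevel = new int[32];
        long written = 0;
        for (int i = 0; i < flushes; i++) {
            // The flush itself writes one new level-0 segment.
            countAtLevel[0]++;
            written += segmentBytes;

            // Cascade: a full level is rewritten into one segment of the next level.
            long levelSize = segmentBytes;
            for (int level = 0; countAtLevel[level] == mergeFactor; level++) {
                long mergedSize = levelSize * mergeFactor;
                written += mergedSize;        // the merge writes all bytes again
                countAtLevel[level] = 0;      // the source segments become garbage
                countAtLevel[level + 1]++;
                levelSize = mergedSize;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        int flushes = 1000;
        long segmentBytes = 1;
        int mergeFactor = 10;
        long plain = bytesWithoutMerging(flushes, segmentBytes);
        long merged = bytesWithMerging(flushes, segmentBytes, mergeFactor);
        System.out.println("bytes written without merging: " + plain);
        System.out.println("bytes written with merging:    " + merged);
    }
}
```

With 1000 flushes of one unit each and a merge factor of 10, the model writes 4000 units with merging versus 1000 without. Each additional merge level rewrites all the data once more, and the rewritten source segments are exactly the kind of data that is written and then removed again.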




> Reduce Lucene related growth of repository size
> -----------------------------------------------
>
>                 Key: OAK-5192
>                 URL: https://issues.apache.org/jira/browse/OAK-5192
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, segment-tar
>            Reporter: Michael Dürig
>              Labels: perfomance
>         Attachments: added-bytes-zoom.png
>
>
> I observed Lucene indexing contributing up to 99% of repository growth. While the size of the index itself is well within reasonable bounds, the overall turnover of data being written and removed again can be as much as 99%.
> In the case of the TarMK this negatively impacts overall system performance due to a fast-growing number of tar files / segments, poor locality of reference, cache misses/thrashing when looking up segments, and vastly prolonged garbage collection cycles.


