Posted to dev@jackrabbit.apache.org by "Marcel Reutegger (JIRA)" <ji...@apache.org> on 2010/02/22 12:28:27 UTC
[jira] Created: (JCR-2505) High memory usage on node with multi-valued string properties
High memory usage on node with multi-valued string properties
-------------------------------------------------------------
Key: JCR-2505
URL: https://issues.apache.org/jira/browse/JCR-2505
Project: Jackrabbit Content Repository
Issue Type: Improvement
Components: jackrabbit-core
Reporter: Marcel Reutegger
Priority: Minor
Multi-valued string properties are tokenized per value, which can consume a lot of memory when a property holds many small values. The memory footprint is 2 KB per value, because each value is tokenized with a separate tokenizer instance, and that tokenizer uses a 2 KB stream buffer.
Instead, the values should be concatenated (separated by whitespace) and then tokenized in one go.
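To illustrate the proposal, here is a minimal stand-alone sketch in plain Java. It does not use Lucene or Jackrabbit classes: BufferedWordTokenizer, ConcatTokenizeSketch and the instance counter are hypothetical names made up for this example, and the 2048-character buffer is only assumed to mirror the "2k" buffer described above. It also assumes a tokenizer that splits on whitespace.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer; NOT a Lucene or Jackrabbit class.
class BufferedWordTokenizer {
    // Assumed buffer size, chosen to mirror the "2k" buffer from the report.
    static final int BUFFER_SIZE = 2048;
    // Counts how many tokenizers (and therefore buffers) were allocated.
    static int instancesCreated = 0;

    private final char[] buffer = new char[BUFFER_SIZE];
    private final Reader input;

    BufferedWordTokenizer(Reader input) {
        this.input = input;
        instancesCreated++;
    }

    // Splits the input on whitespace and returns the resulting words.
    List<String> tokens() {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        try {
            int n;
            while ((n = input.read(buffer, 0, buffer.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buffer[i];
                    if (!Character.isWhitespace(c)) {
                        word.append(c);
                    } else if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }
}

class ConcatTokenizeSketch {
    // Behaviour described in the report: one tokenizer (one buffer) per value.
    static List<String> tokenizePerValue(String[] values) {
        List<String> all = new ArrayList<>();
        for (String value : values) {
            all.addAll(new BufferedWordTokenizer(new StringReader(value)).tokens());
        }
        return all;
    }

    // Proposed behaviour: join the values with whitespace, tokenize once.
    static List<String> tokenizeConcatenated(String[] values) {
        String joined = String.join(" ", values);
        return new BufferedWordTokenizer(new StringReader(joined)).tokens();
    }

    public static void main(String[] args) {
        String[] values = {"red", "green apple", "blue"};
        System.out.println(tokenizePerValue(values));
        System.out.println("tokenizers so far: " + BufferedWordTokenizer.instancesCreated);
        System.out.println(tokenizeConcatenated(values));
        System.out.println("tokenizers so far: " + BufferedWordTokenizer.instancesCreated);
    }
}
```

In this sketch both strategies yield the same tokens, but the per-value variant allocates one buffer per value while the concatenated variant allocates a single one. Whether the token streams are equivalent with a real analyzer depends on the tokenizer in use.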
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (JCR-2505) High memory usage on node with multi-valued string properties
Posted by "Cédric Damioli (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836601#action_12836601 ]
Cédric Damioli commented on JCR-2505:
-------------------------------------
Do you mean the Lucene tokenizer?
What about tokenizers that are not whitespace-aware? (Are there any? CJK, maybe?)
[jira] Updated: (JCR-2505) High memory usage on node with multi-valued string properties
Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcel Reutegger updated JCR-2505:
----------------------------------
Attachment: JCR-2505.patch
[jira] Commented: (JCR-2505) High memory usage on node with multi-valued string properties
Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836699#action_12836699 ]
Marcel Reutegger commented on JCR-2505:
---------------------------------------
> do you mean lucene tokenizer ?
Yes.
Actually, it's not that bad: the problem only affects Jackrabbit 1.4 and earlier, which is also where I discovered the issue. More recent versions use lucene-core 2.3.2 or 2.4.1, which use only one TokenStream at a time and even try to reuse the tokenizer.
See LUCENE-969.
There is still room for improvement, though. The JackrabbitAnalyzer does not implement reusableTokenStream(), so each value of a multi-valued property still instantiates a new Tokenizer.
The proposed patch fixes this. With the patch applied, the JCRAPITest suite takes about 10% less time to execute on my machine.
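The reuse idea can be sketched without Lucene on the classpath. The following stand-alone Java example is not the attached patch and not the JackrabbitAnalyzer code: ResettableTokenizer and ReusingAnalyzerSketch are hypothetical names, and the method names tokenStream and reusableTokenStream are only chosen to echo the Lucene analyzer methods mentioned above. It assumes a tokenizer that can be pointed at a new Reader through a reset method and keeps one instance per thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer that can be re-pointed at a new input.
class ResettableTokenizer {
    // Counts allocations so the effect of reuse is visible.
    static int instancesCreated = 0;

    private final char[] buffer = new char[2048]; // assumed buffer size
    private Reader input;

    ResettableTokenizer(Reader input) {
        this.input = input;
        instancesCreated++;
    }

    // Reuse this instance (and its buffer) for a different input.
    void reset(Reader newInput) {
        this.input = newInput;
    }

    // Splits the current input on whitespace.
    List<String> tokens() {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        try {
            int n;
            while ((n = input.read(buffer, 0, buffer.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buffer[i];
                    if (!Character.isWhitespace(c)) {
                        word.append(c);
                    } else if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }
}

// Hypothetical analyzer showing the two strategies side by side.
class ReusingAnalyzerSketch {
    // One cached tokenizer per thread, so instances are never shared across threads.
    private final ThreadLocal<ResettableTokenizer> previous = new ThreadLocal<>();

    // Non-reusing variant: a fresh tokenizer for every value.
    ResettableTokenizer tokenStream(Reader reader) {
        return new ResettableTokenizer(reader);
    }

    // Reusing variant: create once per thread, then only reset the input.
    ResettableTokenizer reusableTokenStream(Reader reader) {
        ResettableTokenizer tokenizer = previous.get();
        if (tokenizer == null) {
            tokenizer = new ResettableTokenizer(reader);
            previous.set(tokenizer);
        } else {
            tokenizer.reset(reader);
        }
        return tokenizer;
    }

    // Tokenizes every value of a multi-valued property with the reusing variant.
    List<String> tokenizeAll(String[] values) {
        List<String> all = new ArrayList<>();
        for (String value : values) {
            all.addAll(reusableTokenStream(new StringReader(value)).tokens());
        }
        return all;
    }
}
```

Calling tokenizeAll on a property with many values allocates a single tokenizer on the calling thread, whereas looping over tokenStream would allocate one per value. The caller must finish consuming one value's tokens before requesting the stream for the next value, since the same instance is handed out again.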
[jira] Commented: (JCR-2505) High memory usage on node with multi-valued string properties
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836705#action_12836705 ]
Jukka Zitting commented on JCR-2505:
------------------------------------
+1 Sounds great!
[jira] Updated: (JCR-2505) High memory usage on node with multi-valued string properties
Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcel Reutegger updated JCR-2505:
----------------------------------
Status: Patch Available (was: Open)
[jira] Updated: (JCR-2505) High memory usage on node with multi-valued string properties
Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcel Reutegger updated JCR-2505:
----------------------------------
Resolution: Fixed
Fix Version/s: 2.1.0
Status: Resolved (was: Patch Available)
Applied patch in revision: 915718