Posted to dev@jackrabbit.apache.org by "Marcel Reutegger (JIRA)" <ji...@apache.org> on 2010/02/22 12:28:27 UTC

[jira] Created: (JCR-2505) High memory usage on node with multi-valued string properties

High memory usage on node with multi-valued string properties
-------------------------------------------------------------

                 Key: JCR-2505
                 URL: https://issues.apache.org/jira/browse/JCR-2505
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: jackrabbit-core
            Reporter: Marcel Reutegger
            Priority: Minor


Multi-valued string properties are tokenized per value, which can consume a lot of memory when a property has many small values. The memory footprint is 2k per value, because each value is tokenized with a separate tokenizer instance, and that tokenizer uses a stream buffer of 2k bytes.

Instead the values should be concatenated (whitespace separated) and then tokenized in one go.
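
To illustrate the proposal, here is a minimal sketch in plain Java. It does not use Lucene or Jackrabbit classes: BufferedTokenizer is a hypothetical stand-in that allocates one fixed buffer per instance (the 2048-element array only stands in for the 2k stream buffer described above) and splits on whitespace. The counter shows how many buffers each strategy allocates. The sketch assumes a tokenizer that treats whitespace as a token boundary; with other tokenizers the two strategies may not produce the same tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatTokenizeSketch {

    /** Hypothetical stand-in for a tokenizer that allocates its own buffer. */
    public static final class BufferedTokenizer {
        /** Counts how many buffers (that is, instances) have been allocated. */
        public static int buffersAllocated = 0;

        // Stand-in for the 2k stream buffer; real size and type are assumptions.
        private final char[] buffer = new char[2048];

        public BufferedTokenizer() {
            buffersAllocated++;
        }

        /** Splits text on whitespace, reading it through the buffer in chunks. */
        public List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (int start = 0; start < text.length(); start += buffer.length) {
                int n = Math.min(buffer.length, text.length() - start);
                text.getChars(start, start + n, buffer, 0);
                for (int i = 0; i < n; i++) {
                    if (Character.isWhitespace(buffer[i])) {
                        flush(current, tokens);
                    } else {
                        current.append(buffer[i]);
                    }
                }
            }
            flush(current, tokens);
            return tokens;
        }

        private static void flush(StringBuilder current, List<String> tokens) {
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
    }

    /** Behaviour described in the issue: one tokenizer, and one buffer, per value. */
    public static List<String> tokenizePerValue(String[] values) {
        List<String> all = new ArrayList<>();
        for (String value : values) {
            all.addAll(new BufferedTokenizer().tokenize(value));
        }
        return all;
    }

    /** Proposed behaviour: join the values with whitespace and tokenize once. */
    public static List<String> tokenizeConcatenated(String[] values) {
        return new BufferedTokenizer().tokenize(String.join(" ", values));
    }

    public static void main(String[] args) {
        String[] values = {"red", "green apple", "blue"};

        BufferedTokenizer.buffersAllocated = 0;
        System.out.println(tokenizePerValue(values)
                + " buffers=" + BufferedTokenizer.buffersAllocated);

        BufferedTokenizer.buffersAllocated = 0;
        System.out.println(tokenizeConcatenated(values)
                + " buffers=" + BufferedTokenizer.buffersAllocated);
    }
}
```

In this sketch the per-value strategy allocates one buffer for every value, while the concatenated strategy allocates a single buffer regardless of how many values the property has.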

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (JCR-2505) High memory usage on node with multi-valued string properties

Posted by "Cédric Damioli (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836601#action_12836601 ] 

Cédric Damioli commented on JCR-2505:
-------------------------------------

Do you mean the Lucene tokenizer?
What about non-whitespace-aware tokenizers? (Are there any? CJK maybe?)


[jira] Updated: (JCR-2505) High memory usage on node with multi-valued string properties

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger updated JCR-2505:
----------------------------------

    Attachment: JCR-2505.patch

[jira] Commented: (JCR-2505) High memory usage on node with multi-valued string properties

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836699#action_12836699 ] 

Marcel Reutegger commented on JCR-2505:
---------------------------------------

> do you mean lucene tokenizer ? 

Yes.

Actually, it's not that bad: it only affects Jackrabbit versions 1.4 and lower. That's also the version where I discovered the issue. More recent versions use lucene-core 2.3.2 or 2.4.1, which only use one TokenStream at a time and even try to re-use the tokenizer.

See LUCENE-969.

There's still room for improvement. The JackrabbitAnalyzer does not implement reusableTokenStream(). That is, each value of a multi-valued property will instantiate a new Tokenizer.

The proposed patch fixes this. With the patch applied, the JCRAPITest suite takes about 10% less time to execute on my machine.
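
As a rough illustration of the re-use idea (this is not the content of JCR-2505.patch, and the classes are hypothetical stand-ins written in plain Java rather than Lucene or Jackrabbit classes), an analyzer can cache one tokenizer per thread and point it at each new value instead of creating a new instance every time. Only the method name reusableTokenStream is taken from the comment above; everything else is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusableTokenizerSketch {

    /** Hypothetical tokenizer that can be pointed at new input after creation. */
    public static final class SimpleTokenizer {
        /** Counts how many tokenizer instances have been created. */
        public static int instancesCreated = 0;

        private String input = "";

        public SimpleTokenizer() {
            instancesCreated++;
        }

        /** Re-targets this tokenizer at a new value without allocating a new one. */
        public void reset(String newInput) {
            this.input = newInput;
        }

        /** Returns the whitespace-separated tokens of the current input. */
        public List<String> tokens() {
            List<String> out = new ArrayList<>();
            for (String token : input.split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(token);
                }
            }
            return out;
        }
    }

    /** Analyzer without re-use: every value gets a fresh tokenizer. */
    public static final class NonReusingAnalyzer {
        public SimpleTokenizer tokenStream(String value) {
            SimpleTokenizer tokenizer = new SimpleTokenizer();
            tokenizer.reset(value);
            return tokenizer;
        }
    }

    /** Analyzer with re-use: one cached tokenizer per thread, reset per value. */
    public static final class ReusingAnalyzer {
        private final ThreadLocal<SimpleTokenizer> previous = new ThreadLocal<>();

        public SimpleTokenizer reusableTokenStream(String value) {
            SimpleTokenizer tokenizer = previous.get();
            if (tokenizer == null) {
                tokenizer = new SimpleTokenizer();
                previous.set(tokenizer);
            }
            tokenizer.reset(value);
            return tokenizer;
        }
    }

    public static void main(String[] args) {
        String[] values = {"a b", "c", "d e f"};

        SimpleTokenizer.instancesCreated = 0;
        NonReusingAnalyzer plain = new NonReusingAnalyzer();
        for (String value : values) {
            plain.tokenStream(value).tokens();
        }
        System.out.println("without re-use: " + SimpleTokenizer.instancesCreated);

        SimpleTokenizer.instancesCreated = 0;
        ReusingAnalyzer reusing = new ReusingAnalyzer();
        for (String value : values) {
            reusing.reusableTokenStream(value).tokens();
        }
        System.out.println("with re-use: " + SimpleTokenizer.instancesCreated);
    }
}
```

The per-thread cache means the returned tokenizer must be fully consumed before the same thread asks for the next one, which matches the "one TokenStream at a time" usage mentioned above.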

[jira] Commented: (JCR-2505) High memory usage on node with multi-valued string properties

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836705#action_12836705 ] 

Jukka Zitting commented on JCR-2505:
------------------------------------

+1 Sounds great!

[jira] Updated: (JCR-2505) High memory usage on node with multi-valued string properties

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger updated JCR-2505:
----------------------------------

    Status: Patch Available  (was: Open)

[jira] Updated: (JCR-2505) High memory usage on node with multi-valued string properties

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger updated JCR-2505:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.1.0
           Status: Resolved  (was: Patch Available)

Applied patch in revision: 915718
