Posted to dev@jackrabbit.apache.org by "Alex Parvulescu (Created) (JIRA)" <ji...@apache.org> on 2012/03/28 16:39:27 UTC

[jira] [Created] (JCR-3282) Optimize usage of norms

Optimize usage of norms
-----------------------

                 Key: JCR-3282
                 URL: https://issues.apache.org/jira/browse/JCR-3282
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: indexing, jackrabbit-core
            Reporter: Alex Parvulescu
            Assignee: Alex Parvulescu


There is significant potential for reducing the size of the search index.

We have seen a case where there were multiple segments with about the same number of nodes (roughly 10 million), but the size on disk was very different.
One segment was 19 GB while all others were around 3 GB. The major difference was the number of fields indexed. The large segment had significantly more fields, which resulted in a large norms file.
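
To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes the Lucene 3.x behaviour of storing one norm byte per document for every field that has norms enabled, and it assumes the whole 16 GB gap comes from norms (the report only says the norms file was large). The class and method names are made up for illustration.

```java
/** Rough norms-size arithmetic; names are hypothetical, not Jackrabbit code. */
final class NormsEstimate {

    // Assumption: one norm byte per document per field with norms enabled
    // (Lucene 3.x), whether or not the document actually has that field.
    static long normsBytes(long docCount, long fieldsWithNorms) {
        return Math.multiplyExact(docCount, fieldsWithNorms);
    }

    // How many norm-bearing fields would account for a given size gap.
    static long fieldsExplainingGap(long gapBytes, long docCount) {
        return gapBytes / docCount;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;                 // "roughly 10 million" nodes
        long gap = (19L - 3L) * 1_000_000_000L;  // 19 GB vs 3 GB, decimal GB
        long fields = fieldsExplainingGap(gap, docs);
        System.out.println("fields with norms: " + fields);
        System.out.println("norms bytes: " + normsBytes(docs, fields));
    }
}
```

Under those assumptions each additional field with norms costs about 10 MB in a segment of this size, so a gap of this magnitude would need on the order of a thousand or more distinct fields.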

We should go through our implementation and see where norms are really necessary and disable tracking of norms wherever possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (JCR-3282) Optimize usage of norms

Posted by "Alex Parvulescu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-3282:
---------------------------------

    Attachment: JCR-3282.patch

Attaching the proposed patch.

Based on the indexing configuration we know whether a field has had its boost changed, so if a field has no boost setting we can safely disable norms.
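
A minimal sketch of that rule, not the attached patch: the NormsPolicy class and its map of configured boosts are hypothetical stand-ins for whatever the indexing configuration exposes. In Lucene 3.x the result would typically be passed to Fieldable.setOmitNorms(boolean) when the field is created.

```java
import java.util.HashMap;
import java.util.Map;

/** Decides per property whether norms can be omitted; hypothetical helper. */
final class NormsPolicy {

    static final float DEFAULT_BOOST = 1.0f;

    // Hypothetical view of the indexing configuration:
    // property name -> boost explicitly configured for it.
    private final Map<String, Float> configuredBoosts;

    NormsPolicy(Map<String, Float> configuredBoosts) {
        this.configuredBoosts = new HashMap<>(configuredBoosts);
    }

    // Properties without an explicit setting fall back to the default boost.
    float boostFor(String property) {
        return configuredBoosts.getOrDefault(property, DEFAULT_BOOST);
    }

    // Norms are only kept when the boost differs from the default.
    boolean omitNorms(String property) {
        return boostFor(property) == DEFAULT_BOOST;
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("title", 2.0f);
        NormsPolicy policy = new NormsPolicy(boosts);
        System.out.println("title omits norms: " + policy.omitNorms("title"));
        System.out.println("text omits norms: " + policy.omitNorms("text"));
    }
}
```

Note that in Lucene norms also carry length normalization, so omitting them changes scoring for that field beyond just dropping the index-time boost.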
                


[jira] [Resolved] (JCR-3282) Optimize usage of norms

Posted by "Alex Parvulescu (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu resolved JCR-3282.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.6

I had some issues with a test (IndexingRuleTest#testBoost): it failed a few times.
The sort did not seem to be stable. The field that had no initial boost (and therefore no norms) kept moving from the first position to the last, which broke the test result.
It seems OK now.

Fixed in revision 1308833.
                


[jira] [Resolved] (JCR-3282) Optimize usage of norms

Posted by "Alex Parvulescu (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu resolved JCR-3282.
----------------------------------

    Resolution: Fixed

Tweaked LazyTextExtractorField so that it doesn't add norms unless they are needed.

Fixed in revisions 1308833 and 1325820.
                


[jira] [Updated] (JCR-3282) Optimize usage of norms

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-3282:
-------------------------------

    Fix Version/s:     (was: 2.6)
                   2.5
    


[jira] [Reopened] (JCR-3282) Optimize usage of norms

Posted by "Alex Parvulescu (Reopened) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu reopened JCR-3282:
----------------------------------


The patch was insufficient; I missed some norm creation code.
                
