Posted to dev@jackrabbit.apache.org by "Michael Dürig (JIRA)" <ji...@apache.org> on 2010/02/16 14:17:27 UTC

[jira] Created: (JCR-2498) Implement caching mechanism for ItemInfo batches

Implement caching mechanism for ItemInfo batches
------------------------------------------------

                 Key: JCR-2498
                 URL: https://issues.apache.org/jira/browse/JCR-2498
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: jackrabbit-jcr2spi, jackrabbit-spi
            Reporter: Michael Dürig
            Assignee: Michael Dürig


Currently all ItemInfos returned by RepositoryService#getItemInfos are placed into the hierarchy right away. For big batch sizes this is prohibitively expensive. The overhead is so great (*) that it quickly outweighs the overhead of network round trips. Moreover, SPI implementations usually choose the batch in a way determined by the backing persistence store and not by the requirements of the consuming application on the JCR side. That is, many of the items in the batch might never actually be needed.

I suggest implementing a cache for ItemInfo batches. Conceptually, such a cache would live inside jcr2spi right above the SPI API. The actual implementation would be provided by SPI implementations. This approach allows fine-tuning cache and batch sizes for a given persistence store and network environment. It would also better separate concerns: the existing item cache optimizes for the requirements of the consumer of the JCR API ('the application'), while the new ItemInfo cache optimizes for the specific network environment and backing persistence store.
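
To illustrate the idea, here is a minimal sketch of such a cache: a bounded LRU map that holds the entries of a batch until they are actually requested, instead of placing them into the hierarchy right away. The class name ItemInfoCacheSketch, the methods putBatch and get, and the use of plain Strings for ids and infos are assumptions made for illustration; this is neither the jcr2spi/SPI API nor the code of the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a bounded LRU cache for the entries of an ItemInfo batch.
 * Ids and infos are plain Strings to keep the example self-contained.
 */
final class ItemInfoCacheSketch {
    private final Map<String, String> entries;

    ItemInfoCacheSketch(final int maxSize) {
        // accessOrder = true: the eldest entry is the least recently used one
        this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Stores a whole batch; nothing else is built or updated at this point. */
    void putBatch(Map<String, String> batch) {
        entries.putAll(batch);
    }

    /** Returns the cached info for the given id, or null on a cache miss. */
    String get(String id) {
        return entries.get(id);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ItemInfoCacheSketch cache = new ItemInfoCacheSketch(2);
        Map<String, String> batch = new LinkedHashMap<String, String>();
        batch.put("/a", "info for /a");
        batch.put("/b", "info for /b");
        batch.put("/c", "info for /c");
        cache.putBatch(batch);

        // The cache holds at most two entries, so the oldest one was evicted.
        System.out.println("/a cached: " + (cache.get("/a") != null));
        System.out.println("/c cached: " + (cache.get("/c") != null));
    }
}
```

The maximum size is the knob an SPI implementation could tune to its persistence store and network environment; a miss would fall through to a regular call to the SPI.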

(*) Numbers follow 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (JCR-2498) Implement caching mechanism for ItemInfo batches

Posted by "Michael Dürig (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Dürig updated JCR-2498:
-------------------------------

    Attachment: JCR-2498-poc.patch

POC of the cache implementation as described above. 

The patch is functionally complete. However, the implementation is hard-coded and not yet exposed through the respective APIs. See the fixme tags in the code.



[jira] Commented: (JCR-2498) Implement caching mechanism for ItemInfo batches

Posted by "Michael Dürig (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834251#action_12834251 ] 

Michael Dürig commented on JCR-2498:
------------------------------------

Here's the patch mentioned in [2] above. 

Index: src/test/java/org/apache/jackrabbit/jcr2spi/benchmark/ReadPerformanceTest.java
===================================================================
--- src/test/java/org/apache/jackrabbit/jcr2spi/benchmark/ReadPerformanceTest.java
+++ src/test/java/org/apache/jackrabbit/jcr2spi/benchmark/ReadPerformanceTest.java
@@ -136,7 +136,7 @@
         final List<Item> items = new ArrayList<Item>();
 
         for (int k = 0; k < count; k ++) {
-            switch (rnd.nextInt(4)) {
+            switch (rnd.nextInt(3)) {
                 case 0: { // getItem
                     callables.add(new Callable<Long>() {
                         public Long call() throws Exception {
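
For context: Random.nextInt(n) returns a value from 0 (inclusive) to n (exclusive), so lowering the bound from 4 to 3 means that case 3 of the switch can no longer be selected. The sketch below counts how often each case index is drawn for a given bound. That case 3 is the refresh operation is an assumption, since that part of the switch is not visible in the diff; the class and method names are made up for illustration.

```java
import java.util.Random;

final class OperationMixSketch {
    /**
     * Counts how often each case index in [0, 4) is drawn.
     * The bound must be between 1 and 4.
     */
    static int[] countDraws(int bound, int draws, long seed) {
        Random rnd = new Random(seed);
        int[] counts = new int[4];
        for (int k = 0; k < draws; k++) {
            counts[rnd.nextInt(bound)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] boundFour = countDraws(4, 10000, 42L);
        int[] boundThree = countDraws(3, 10000, 42L);
        System.out.println("bound 4, draws of case 3: " + boundFour[3]);
        System.out.println("bound 3, draws of case 3: " + boundThree[3]);
    }
}
```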




[jira] Commented: (JCR-2498) Implement caching mechanism for ItemInfo batches

Posted by "angela (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834324#action_12834324 ] 

angela commented on JCR-2498:
-----------------------------

Although I didn't look at the POC patch in detail, based on our face-to-face discussion it looks reasonable to me :)





[jira] Commented: (JCR-2498) Implement caching mechanism for ItemInfo batches

Posted by "Michael Dürig (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834272#action_12834272 ] 

Michael Dürig commented on JCR-2498:
------------------------------------

Some more numbers demonstrating the effect of JCR-2498-poc.patch. The 'new/old time' row gives the ratio of the request times with the patch applied to the request times without it. The 'new/old rts' row gives the same ratio for the number of network round trips. Values below 1 mean the patched version needs less time or fewer round trips.

The first measurement includes all operations (getItem, getNode, getProperty and refresh) as above. 

Batch size:    24340  12170   6085   3043   1521    761    380    190     95     48     24     12      6      3      1
new/old time:    0.1    0.1    0.1    0.1    0.2    0.3    0.4    0.5    0.5    0.7    0.6      1      1    1.1    0.8
new/old rts:     2.1    2.8    1.8    2.4    1.8    1.4    1.3    1.2      1    1.1      1      1    0.9      1    0.9

Most obvious is the vast performance increase (up to a factor of 10) for reading items. However, this comes with an increase in the number of network round trips. Three things should be noted here:

1. For realistic batch sizes the increase in the number of network round trips is not very significant.
2. The increase in the number of network round trips is caused by the refresh operations. In the test scenario the number of refresh operations is unrealistically high (every fourth operation is a refresh).
3. The items in the batches of the test case are not realistically distributed across the items of the repository: they are chosen at random. In practice, however, the items in a batch would be related to each other by some locality criterion. I assume that this would further mitigate the observed effect.
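
The factor of 10 follows directly from the quotients: a 'new/old time' value of 0.1 means the patched run took one tenth of the time, which is a speedup of 1 / 0.1 = 10. A small sketch of that arithmetic, using the 'new/old time' values of the first measurement (the class and method names are made up for illustration, and since the quotients are rounded to one decimal the result is only approximate):

```java
final class QuotientSketch {
    /** Converts a new/old time quotient into a speedup factor (old/new). */
    static double speedup(double newOverOld) {
        return 1.0 / newOverOld;
    }

    /** Returns the largest speedup factor over all quotients. */
    static double maxSpeedup(double[] quotients) {
        double best = 0.0;
        for (double q : quotients) {
            best = Math.max(best, speedup(q));
        }
        return best;
    }

    public static void main(String[] args) {
        // 'new/old time' values of the first measurement, largest batch size first
        double[] newOldTime = {
            0.1, 0.1, 0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.7, 0.6, 1, 1, 1.1, 0.8
        };
        System.out.println("max speedup: " + maxSpeedup(newOldTime));
    }
}
```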

For completeness' sake, here is the same measurement as above but without refresh operations:

Batch size:    24340  12170   6085   3043   1521    761    380    190     95     48     24     12      6      3      1
new/old time:    0.2      0      0    0.1    0.1    0.2    0.4    0.4    0.6    0.6    0.7      1      1      1    1.1
new/old rts:       1      1    0.9    0.9    0.8    0.9    0.9    0.9    0.9      1      1      1      1      1      1




[jira] Resolved: (JCR-2498) Implement caching mechanism for ItemInfo batches

Posted by "Michael Dürig (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Dürig resolved JCR-2498.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 2.1.0

Applied a cleaned-up, improved version of the patch in revision 915810.




[jira] Commented: (JCR-2498) Implement caching mechanism for ItemInfo batches

Posted by "Michael Dürig (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834250#action_12834250 ] 

Michael Dürig commented on JCR-2498:
------------------------------------

As promised, some numbers. All measurements were done using ReadPerformanceTest.java [1].

Batch size:      24340  12170   6085   3043   1521    761    380    190     95     48     24     12      6      3      1
ms per request:   20.2   24.2   17.4   16.3    7.3    3.0    2.5    2.1    2.0    1.3    1.3    1.1    1.0    1.0    1.1

The performance impact of large batches is clearly visible here. Without refresh operations [2], the picture is similar but less pronounced:

Batch size:      24340  12170   6085   3043   1521    761    380    190     95     48     24     12      6      3      1
ms per request:    5.1   17.1   16.3   12.0    6.0    2.6    2.7    2.0    2.0    1.4    1.4    1.2    1.0    1.1    1.3
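
As a rough illustration of how a 'ms per request' figure can be obtained, the sketch below times a list of operations and averages the result. It is not the code of ReadPerformanceTest: the class name, the convention that each Callable<Long> returns its own elapsed time in nanoseconds, and the dummy workload are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

final class RequestTimingSketch {
    /** Wraps an operation so that calling it returns the elapsed time in nanoseconds. */
    static Callable<Long> timed(final Runnable operation) {
        return new Callable<Long>() {
            public Long call() {
                long start = System.nanoTime();
                operation.run();
                return System.nanoTime() - start;
            }
        };
    }

    /** Runs all callables sequentially and returns the mean time per request in milliseconds. */
    static double msPerRequest(List<Callable<Long>> callables) throws Exception {
        long totalNanos = 0;
        for (Callable<Long> callable : callables) {
            totalNanos += callable.call();
        }
        return callables.isEmpty() ? 0.0 : totalNanos / 1e6 / callables.size();
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Long>> callables = new ArrayList<Callable<Long>>();
        for (int k = 0; k < 100; k++) {
            callables.add(timed(new Runnable() {
                public void run() {
                    Math.sqrt(12345.678); // stand-in for a repository read
                }
            }));
        }
        System.out.println("ms per request: " + msPerRequest(callables));
    }
}
```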


[1] http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit-jcr2spi/src/test/java/org/apache/jackrabbit/jcr2spi/benchmark/ReadPerformanceTest.java?revision=910523&view=markup&pathrev=910523

[2] See upcoming patch


