You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-issues@jackrabbit.apache.org by "Marcel Reutegger (JIRA)" <ji...@apache.org> on 2015/12/17 13:37:46 UTC

[jira] [Commented] (OAK-3653) Incorrect last revision of cached node state

    [ https://issues.apache.org/jira/browse/OAK-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061988#comment-15061988 ] 

Marcel Reutegger commented on OAK-3653:
---------------------------------------

I saw a similar issue while running tests for OAK-3646 and I think this is caused by OAK-3634. The attached diagram shows a possible sequence of operations in a two-node cluster with background reads and updates on a node 'foo'.

!diagram.png|width=100%!

First cluster node 1 updates the _lastRev of foo to r4-0-1 with a modCount of 5. Next, cluster node 2 updates its _lastRev to r7-0-2 bumping the modCount to 6 and then runs a background update. Cluster node 1 picks up the changes of the root document with a background read and starts a query to find out which documents to invalidate. The query result will contain an entry for 'foo' with a modCount of 6. At the same time a background update happens on cluster node 1 and the _lastRev is set to r9-0-1. On the MongoDB side the modCount is set to 7, but because of OAK-3634 the new cached document in Oak is calculated based on the currently cache document, which has a modCount of 5. The new document in the cache will therefore have a modCount of 6. The background read now proceeds to the invalidate phase and compares the modCount from the query with the modCount in the cache. Both are 6 and the document is not invalidated. 

> Incorrect last revision of cached node state
> --------------------------------------------
>
>                 Key: OAK-3653
>                 URL: https://issues.apache.org/jira/browse/OAK-3653
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: documentmk
>            Reporter: Vikas Saurabh
>            Assignee: Marcel Reutegger
>             Fix For: 1.4
>
>         Attachments: Capture.JPG, Capture1.JPG, Capture2.JPG, diagram.png, oplog.json
>
>
> While installing a package in one of the systems (Oak-1.0.16), we observed broken workflow model. Upon further investigation, it was found that {{baseVersion}} of the model was pointing to a node in version storage which wasn't visible on the instance. This node was available in mongo and could also be seen on a different instance of the cluster. 
> The node was created on the same instance where it wasn't visible (cluster id - 3)
> Further investigation showed that for some (yet unknown) reason, {{lastRevision}} for cached node state of node's parent at revision where the node was created was stale (an older revision) which must have led to invalid children list and hence unavailable node.
> Attaching a few files which were captured during the investigation:
> * [^oplog.json] - oplog for +-10s for r151191c7601-0-3
> * [^Capture.JPG]- snapshot of groovy console output using script at \[0] to list node children cache entry for parent
> * [^Capture1.JPG]- snapshot of groovy console output using script at \[1] to traverse to invisible node
> * [^Capture2.JPG]- Almost same as Capture.jpg but this time for node cache instead (using script at \[2])
> The node which wasn't visible in mongo:
> {noformat}
> db.nodes.findOne({_id: "7:/jcr:system/jcr:versionStorage/d9/9f/22/d99f2279-dcc9-40a0-9aa4-c91fe310287e/1.7"})
> {
>         "_id" : "7:/jcr:system/jcr:versionStorage/d9/9f/22/d99f2279-dcc9-40a0-9aa4-c91fe310287e/1.7",
>         "_deleted" : {
>                 "r151191c7601-0-3" : "false"
>         },
>         "_commitRoot" : {
>                 "r151191c7601-0-3" : "0"
>         },
>         "_lastRev" : {
>                 "r0-0-3" : "r151191c7601-0-3"
>         },
>         "jcr:primaryType" : {
>                 "r151191c7601-0-3" : "\"nam:nt:version\""
>         },
>         "jcr:uuid" : {
>                 "r151191c7601-0-3" : "\"40234f1d-2435-4c6b-9962-20af22b1c948\""
>         },
>         "_modified" : NumberLong(1447825270),
>         "jcr:created" : {
>                 "r151191c7601-0-3" : "\"dat:2015-11-17T21:41:14.058-08:00\""
>         },
>         "_children" : true,
>         "jcr:predecessors" : {
>                 "r151191c7601-0-3" : "[\"ref:6cecc77b-3020-4a47-b9cc-084a618aa957\"]"
>         },
>         "jcr:successors" : {
>                 "r151191c7601-0-3" : "\"[0]:Reference\""
>         },
>         "_modCount" : NumberLong(1)
> }
> {noformat}
> Last revision entry of cached node state of {{/jcr:system/jcr:versionStorage/d9/9f/22/d99f2279-dcc9-40a0-9aa4-c91fe310287e}} (parent) at revision {{r151191c7601-0-3}} was {{r14f035ebdf8-0-3}} (the last revs were propagated correctly up till root for r151191c7601-0-3).
> As a work-around, doing {{cache.invalidate(k)}} for node cache (inside the loop of script at \[2]) revived the visibility of the node.
> *Disclaimer*: _The issue was reported and investigated on a cluster backed by mongo running on Oak-1.0.16. But, the problem itself doesn't seem like any which have been fixed in later versions_
> \[0]: {noformat}
> def printNodeChildrenCache(def path) {
>   def session = osgi.getService(org.apache.sling.jcr.api.SlingRepository.class).loginAdministrative(null)
>   try {
>     def rootNode = session.getRootNode()
>     def cache = rootNode.sessionDelegate.root.store.nodeChildrenCache
>     def cacheMap = cache.asMap()
>     cacheMap.each{k,v -> 
>       if (k.toString().startsWith(path + "@")) {
>         println "${k}:${v}"
>       }
>     }
>   } finally {
>     session.logout()
>   }
> }
>  
> printNodeChildrenCache("/jcr:system/jcr:versionStorage/d9/9f/22/d99f2279-dcc9-40a0-9aa4-c91fe310287e")
> {noformat}
> \[1]: {noformat}
> def traverse(def path) {
>   def session = osgi.getService(org.apache.sling.jcr.api.SlingRepository.class).loginAdministrative(null)
>   try {
>     def rootNode = session.getRootNode()
>     def nb = rootNode.sessionDelegate.root.store.root
>     def p = ''
>     path.tokenize('/').each() {
>       p = p + '/' + it
>       nb = nb.getChildNode(it)
>       println "${p} ${nb}"
>     }
>   } finally {
>     session.logout()
>   }
> }
> traverse("/jcr:system/jcr:versionStorage/d9/9f/22/d99f2279-dcc9-40a0-9aa4-c91fe310287e/1.7")
> {noformat}
> \[2]: {noformat}
> def printNodeChildrenCache(def path) {
>   def session = osgi.getService(org.apache.sling.jcr.api.SlingRepository.class).loginAdministrative(null)
>   try {
>     def rootNode = session.getRootNode()
>     def cache = rootNode.sessionDelegate.root.store.nodeCache
>     def cacheMap = cache.asMap()
>     cacheMap.each{k,v -> 
>       if (k.toString().startsWith(path + "@")) {
>         println "${k}:${v}"
>         println "lastRev -> ${v.lastRevision}"
>       }
>     }
>   } finally {
>     session.logout()
>   }
> }
>  
> printNodeChildrenCache("/jcr:system/jcr:versionStorage/d9/9f/22/d99f2279-dcc9-40a0-9aa4-c91fe310287e")
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)