Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2014/11/26 10:37:12 UTC

[jira] [Comment Edited] (OAK-2268) Support index time Aggregation of repository nodes

    [ https://issues.apache.org/jira/browse/OAK-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225916#comment-14225916 ] 

Chetan Mehrotra edited comment on OAK-2268 at 11/26/14 9:36 AM:
----------------------------------------------------------------

Added support for index time aggregation with http://svn.apache.org/r1641771

h3. Aggregation Rules
Aggregation rules are defined as part of the index definition. A typical
fulltext aggregate rule looks like this:
 
{code}
{
  "assetIndex": {
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "type": "lucene",
    "async": "async",
    "aggregates" : {
      "nt:file" :{
        "include1" : { "path" : "jcr:content" }  
      },
      "test:AssetContent" : {
        "include1" : { "path" : "metadata" },
        "include2" : { "path" : "renditions/original/jcr:content" }
      },
      "test:PageContent" : {
        "include1" : { "path" : "*" },
        "include2" : { "path" : "*/*" },
        "include3" : { "path" : "*/*/*" }
      }
    }
  }
}
{code}
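
As a rough illustration of how the include paths in the rules above could be matched, here is a minimal Python sketch. It assumes that {{*}} matches exactly one path segment (as in the JR2 index aggregates) and that a pattern must match the whole relative path; the helper names {{matches}} and {{applicable}} are hypothetical and not part of Oak.

```python
# Hypothetical sketch: segment-wise matching of include paths.
# Assumption: "*" matches exactly one path segment, nothing more.

RULES = {
    "nt:file": ["jcr:content"],
    "test:AssetContent": ["metadata", "renditions/original/jcr:content"],
    "test:PageContent": ["*", "*/*", "*/*/*"],
}


def matches(pattern: str, rel_path: str) -> bool:
    """Return True if rel_path matches pattern segment by segment."""
    pattern_segments = pattern.split("/")
    path_segments = rel_path.split("/")
    if len(pattern_segments) != len(path_segments):
        return False
    return all(p == "*" or p == s
               for p, s in zip(pattern_segments, path_segments))


def applicable(rules: dict, node_type: str, rel_path: str) -> bool:
    """Return True if any include of node_type's rule covers rel_path."""
    return any(matches(p, rel_path) for p in rules.get(node_type, []))
```

Under these assumptions the test:PageContent rule covers descendants up to three levels deep and nothing below that, which is why the example needs three separate includes.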

Support for aggregates is similar to what JR2 provided [1]. The new additions are:

*A - Support for separate relative node aggregate fulltext field*
For evaluating queries like _jcr:contains(renditions/original, 'fox')_ one can configure an aggregation rule like the one below. The indexer then creates a separate fulltext field for the _renditions/original_ node within the parent document.
{code}
"test:AssetContent" : {
        "include1" : { "path" : "metadata" },
        "include2" : { "path" : "renditions/original", "relativeNode" : true }
      }
{code}
Aggregation done for such a node honours the aggregation rule for that node's type. So if _renditions/original_ is an nt:file, the aggregate also includes _renditions/original/jcr:content_.
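
To make the idea of a separate fulltext field more concrete, here is a small Python sketch of how a parent document could be assembled. The in-memory node layout, the field names ({{fulltext}}, {{fulltext:<path>}}) and the function names are assumptions made for illustration only; they are not the names used by the Lucene indexer. Whether the relative node's text is also added to the node-level field is not covered here; the sketch keeps it only in the separate field and does not re-apply the rules of the included node's type.

```python
# Hypothetical sketch: one document per aggregate root, with an extra
# fulltext field for every include flagged as relativeNode.

def resolve(node: dict, rel_path: str):
    """Walk rel_path below node; return None if any segment is missing."""
    for name in rel_path.split("/"):
        node = node.get("children", {}).get(name)
        if node is None:
            return None
    return node


def build_document(node: dict, includes: list) -> dict:
    """Collect text of node and its includes into document fields."""
    doc = {"fulltext": list(node.get("text", []))}
    for include in includes:
        child = resolve(node, include["path"])
        if child is None:
            continue
        if include.get("relativeNode"):
            # Assumed field naming; keeps the text addressable by path.
            field = "fulltext:" + include["path"]
            doc.setdefault(field, []).extend(child.get("text", []))
        else:
            doc["fulltext"].extend(child.get("text", []))
    return doc


asset_content = {
    "type": "test:AssetContent",
    "text": ["summer trip"],
    "children": {
        "metadata": {"type": "nt:unstructured", "text": ["mountain"]},
        "renditions": {"type": "nt:folder", "children": {
            "original": {"type": "nt:file", "text": ["fox"]},
        }},
    },
}
includes = [
    {"path": "metadata"},
    {"path": "renditions/original", "relativeNode": True},
]
doc = build_document(asset_content, includes)
```

With such a layout a query like _jcr:contains(renditions/original, 'fox')_ could be answered from the path-specific field alone, without matching text that only occurs in _metadata_.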

*B - Support for relative properties*
Relative properties are also supported by using the aggregation logic. However, their configuration is specified as part of the normal property configuration in the indexing rules.

{noformat}
jcr:contains(jcr:content/metadata/@format, 'image')
{noformat}

{code}
...
"properties": {
        "jcr:primaryType": "nt:unstructured",
        "prop1": {
          "jcr:primaryType": "nt:unstructured",
          "name": "jcr:content/metadata/format",
          "propertyIndex": true
        },
{code}

*C - Support for recursion limit*
By default the aggregation logic reapplies the aggregation rules to the aggregated node. For example, while aggregating nt:folder, if an nt:file node is found then the rules applicable to nt:file are applied as well. If the node type is the same, this can lead to recursion (see JCR-2989). One needs to explicitly enable such aggregation. However, that does not prevent the case where the recursion is multi-level, e.g. test:Asset includes test:AssetContent and test:AssetContent in turn contains test:Asset.

So in this implementation one can specify an upper limit for such re-aggregations at the rule level, which defaults to 5.

{code}
"test:AssetContent" : {
        "reaggregateLimit" : 5,
        "include1" : { "path" : "metadata" },
        "include2" : { "path" : "renditions/original", "relativeNode" : true }
      }
{code}
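
The effect of such a limit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the limit is interpreted as the maximum number of nested re-aggregations, an included node beyond the limit still contributes its own text but its rules are no longer reapplied, and the function names and node layout are hypothetical.

```python
# Hypothetical sketch: recursive aggregation bounded by a re-aggregation limit.

def resolve(node: dict, rel_path: str):
    """Walk rel_path below node; return None if any segment is missing."""
    for name in rel_path.split("/"):
        node = node.get("children", {}).get(name)
        if node is None:
            return None
    return node


def collect_text(node: dict, rules: dict, limit: int = 5, depth: int = 0) -> list:
    """Gather text of node plus included nodes, re-applying rules up to limit."""
    texts = list(node.get("text", []))
    for path in rules.get(node["type"], []):
        child = resolve(node, path)
        if child is None:
            continue
        if child["type"] in rules and depth < limit:
            # Re-aggregate: apply the rule of the included node's type.
            texts += collect_text(child, rules, limit, depth + 1)
        else:
            # Limit reached (or no rule): take only the node's own text.
            texts += child.get("text", [])
    return texts


def make_chain(length: int) -> dict:
    """Build test:Asset -> test:AssetContent -> test:Asset -> ... nodes."""
    node = None
    for i in reversed(range(length)):
        is_asset = i % 2 == 0
        child_name = "content" if is_asset else "asset"
        node = {
            "type": "test:Asset" if is_asset else "test:AssetContent",
            "text": ["t%d" % i],
            "children": {child_name: node} if node else {},
        }
    return node


RULES = {"test:Asset": ["content"], "test:AssetContent": ["asset"]}
chain = make_chain(10)
```

For the ten-level chain above, {{collect_text(chain, RULES)}} stops re-applying rules after five nested re-aggregations, so the multi-level cycle between test:Asset and test:AssetContent cannot run away even on very deep content.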


*What's not supported*
JR2 was able to support queries like the one below without explicit aggregation rules. This was done by performing a kind of join query at the Lucene level. Runtime aggregation in Oak so far also supports it by using an _intersecting cursor_ and firing two separate queries. However, this does not work with the new index time aggregation implementation unless explicit aggregation rules are configured.

{noformat}
/jcr:root/content//element(*, test:Asset)
    [(jcr:contains(., 'mountain')) 
            and (jcr:contains(jcr:content/metadata/@format, 'image'))]
{noformat}

*Implementation Note*
{{LuceneIndexEditor}} makes use of {{Aggregate.Matcher}} to determine which aggregation rules are applicable to the current node as it traverses down while performing the diff. If a matcher matches the current node then the {{AggregationRoot}} is marked dirty so that re-aggregation is performed later for that root (the diff is a depth-first traversal).

Further, the relative property support also makes use of {{Aggregate.Matcher}}.
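
The dirty-marking described in this note can be approximated with the following Python sketch. It is a simplification and not the actual {{LuceneIndexEditor}} logic: instead of carrying matchers down the tree during the diff, it looks at a changed path after the fact and checks each ancestor's rule. The {{node_types}} lookup table, the function names and the one-segment meaning of {{*}} are assumptions.

```python
# Hypothetical sketch: find the aggregate roots that need re-aggregation
# when a node at changed_path is modified.

def matches(pattern: str, rel_path: str) -> bool:
    """Segment-wise match; "*" is assumed to match exactly one segment."""
    pattern_segments = pattern.split("/")
    path_segments = rel_path.split("/")
    if len(pattern_segments) != len(path_segments):
        return False
    return all(p == "*" or p == s
               for p, s in zip(pattern_segments, path_segments))


def dirty_roots(changed_path: str, node_types: dict, rules: dict) -> set:
    """Return ancestor paths whose aggregation rule includes changed_path."""
    segments = changed_path.strip("/").split("/")
    roots = set()
    for i in range(1, len(segments)):
        root = "/" + "/".join(segments[:i])
        rel_path = "/".join(segments[i:])
        patterns = rules.get(node_types.get(root), [])
        if any(matches(p, rel_path) for p in patterns):
            roots.add(root)  # this root must be re-aggregated later
    return roots


RULES = {"nt:file": ["jcr:content"], "test:PageContent": ["*", "*/*"]}
NODE_TYPES = {
    "/content/a.jpg": "nt:file",
    "/content/page": "test:PageContent",
}
```

Collecting the roots in a set mirrors the point that re-aggregation happens once per root after the traversal, even if several included nodes below it changed.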

*Pending Work*
Most of the aggregation scenarios are handled and all current aggregation-related tests are passing.

I will update the Oak documentation with more details.

/cc [~alex.parvulescu] [~tmueller] [~mreutegg] [~amitj_76] [~teofili]

[1] http://wiki.apache.org/jackrabbit/IndexingConfiguration#Index_Aggregates


> Support index time Aggregation of repository nodes
> --------------------------------------------------
>
>                 Key: OAK-2268
>                 URL: https://issues.apache.org/jira/browse/OAK-2268
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: oak-lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.0.9, 1.2
>
>
> Currently Oak supports runtime aggregation of content, which works OK for some cases. However, as noted in [1], it might not perform well or might not give the expected result in some cases.
> JR2 supported index time aggregation [2] with the following features
> # Support for nodeType scoped aggregation rules
> # Recursive aggregation support (JCR-2989)
> Oak should also provide similar aggregation support. To start with we can support #1. Support for #2 can be added later depending on requirements.
> [1] http://markmail.org/thread/cyu7evezbi4u22gr
> [2] http://wiki.apache.org/jackrabbit/IndexingConfiguration#Index_Aggregates


