Posted to oak-dev@jackrabbit.apache.org by Chetan Mehrotra <ch...@gmail.com> on 2014/11/05 05:57:24 UTC

Features to be supported while enabling boost support in Lucene Full text index

Hi Team,

With OAK-2178 some basic support for boosting has been added. However,
Jackrabbit used to support much more fine-grained boosting [1]. So for
the boost feature to be usable in real-world scenarios, should we aim to
implement similar support, i.e. provide

1. Conditional boosting based on some criteria
2. Node level boosting based on NodeType

Q.1 - Should we support all or only some of that? It would introduce some
complexity, but for the feature to be useful these options probably need
to be supported.

Q.2 - Config format - If we need to support all (or some) of that, we
would need to decide on the index definition format.

Configuration Format
-----------------------------

As documented in [2], the new configuration format proposed and being
used with the Lucene Property Index looks like the following:

"assetIndex":
{
  "jcr:primaryType":"oak:QueryIndexDefinition",
  "declaringNodeTypes":"app:Asset",
  "includePropertyNames":["title", "type"],
  "type":"lucene",
  "async":"async",
  "fulltextEnabled":false,
  "orderedProps":["jcr:content/jcr:lastModified"],
  "properties": {
    "title" : { "boost" : 2.0 }
  }
}

This works fine for a property index, where we would restrict the
definition to a specific NodeType and specific property names.

However, for a full text index, which is more generic, we would need a
way to distinguish properties for specific nodeTypes.

If we need to use the same format to capture index rules at [2], then
one way would be to capture nodeType-scoped property definitions
separately:

--------------------
"properties": {
            "title" : { "boost" : 2.0 } /* Unscoped property */
        },
        "indexRules" : {
            "nt:unstructured" : {
                "properties" :{
                    "title" : { /* Scoped property */
                        "boost" : 1.5
                    }
                }
            },
            "nt:file" : {
                "boost" : "2.0",
                "condition" : "@priority = 'high'"
            }
        }
-----------------------

With the current design most of the conditions can be supported, except
ones involving an ancestor, as the Oak NodeState model does not allow
traversing up easily.
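
To make the intended lookup concrete, here is a minimal Python sketch of how
the scoped and unscoped entries above could be resolved. It is an
illustration only: none of the function names are Oak APIs, the rule that a
nodeType-scoped boost wins over an unscoped one is an assumption, and the
condition handling covers just the simple "@name = 'value'" form, evaluated
against the node's own properties (no ancestor access).

```python
import re

# Mirrors the proposal above; names below are hypothetical, not Oak APIs.
index_definition = {
    "properties": {"title": {"boost": 2.0}},  # unscoped property
    "indexRules": {
        "nt:unstructured": {"properties": {"title": {"boost": 1.5}}},  # scoped
        "nt:file": {"boost": "2.0", "condition": "@priority = 'high'"},
    },
}


def property_boost(definition, node_type, prop_name, default=1.0):
    """Return the boost for a property, preferring a nodeType-scoped entry.

    Assumption: scoped entries take precedence over unscoped ones.
    """
    rule = definition.get("indexRules", {}).get(node_type, {})
    scoped = rule.get("properties", {}).get(prop_name)
    if scoped and "boost" in scoped:
        return float(scoped["boost"])
    unscoped = definition.get("properties", {}).get(prop_name)
    if unscoped and "boost" in unscoped:
        return float(unscoped["boost"])
    return default


# Only the simple form "@name = 'value'" is recognised in this sketch.
_CONDITION = re.compile(r"^@([\w:]+)\s*=\s*'([^']*)'$")


def condition_matches(condition, node_props):
    """Evaluate a simple condition against the node's own properties."""
    match = _CONDITION.match(condition.strip())
    if match is None:
        raise ValueError("unsupported condition: " + condition)
    name, expected = match.groups()
    return node_props.get(name) == expected


def node_boost(definition, node_type, node_props, default=1.0):
    """Return the node-level boost if the rule has one and its condition holds."""
    rule = definition.get("indexRules", {}).get(node_type, {})
    if "boost" not in rule:
        return default
    condition = rule.get("condition")
    if condition is not None and not condition_matches(condition, node_props):
        return default
    return float(rule["boost"])
```

With this sketch, "title" on an nt:unstructured node resolves to 1.5, "title"
on any other node type falls back to 2.0, and an nt:file node is boosted only
when its own "priority" property equals "high".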

Thoughts?

Chetan Mehrotra
[1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
[2] http://jackrabbit.apache.org/oak/docs/query/lucene.html

Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Chetan,

First of all, thanks for all your great work on this.
I generally agree with you that we need to be on par with JR2 in terms of
capabilities.

Looking in more detail at the index configuration, what about the
following format:

--------------------------------
        "indexRules" : {
            "rule0" : {
                "name" : "title", /* Unscoped property */
                "boost" : 2.0
            },
            "rule1" : {
                "type" : "nt:unstructured",
                "name" : "title", /* Scoped property */
                "boost" : 1.5
            },
            "rule2" : {
                "type" : "nt:file",
                "name" : "title", /* Scoped property */
                "boost" : 2.0,
                "condition" : "@priority = 'high'"
            }
        }
--------------------------------

The rationale is to handle each rule using the same structure. What do you
think? Would it be feasible?
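
As a rough illustration of how such flat rules could be selected, here is a
small Python sketch. The function name is hypothetical, and how to pick a
winner when several rules match (rule0 and rule2 both apply to "title" on
nt:file) is left open here, since that is not settled yet.

```python
# The three rules from the format above, keyed by arbitrary rule names.
flat_rules = {
    "rule0": {"name": "title", "boost": 2.0},  # unscoped
    "rule1": {"type": "nt:unstructured", "name": "title", "boost": 1.5},
    "rule2": {
        "type": "nt:file",
        "name": "title",
        "boost": 2.0,
        "condition": "@priority = 'high'",
    },
}


def matching_rules(rules, node_type, prop_name):
    """Return the names of rules that apply to this property on this node type.

    A rule applies when its property name matches and it is either unscoped
    (no "type") or scoped to exactly this node type. Conditions are not
    evaluated in this sketch.
    """
    return [
        rule_name
        for rule_name, rule in rules.items()
        if rule["name"] == prop_name and rule.get("type") in (None, node_type)
    ]
```

Every lookup has to scan all rules, because the node type is a value inside
each rule rather than the key of the rule.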

Regards,
Tommaso


Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Michael Marth <mm...@adobe.com>.

Chetan,

> Given existing config is part of 1.0.8 we would need to support both
> but users would not be allowed to mix both approaches.

When you refer to “existing config”, do you mean only the part that configures boosting, or more?
If the former: given that 1.0.8 has only been out for a couple of days, I do not think we would create big problems by changing the config syntax in 1.0.9.

Michael

Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Chetan,

Thanks for the explanation (which makes sense). I think we should go with
your latest proposal then.

Regards,
Tommaso

2014-11-06 10:29 GMT+01:00 Chetan Mehrotra <ch...@gmail.com>:

> Hi Tommaso,
>
> On Thu, Nov 6, 2014 at 2:18 PM, Tommaso Teofili
> <to...@gmail.com> wrote:
> > the drawback is that you would have to define a similar structure for each
> > field to be boosted for each node type, the advantage is that it's
> > compliant with what we have in 1.0.8.
>
> I would also prefer that, but a couple of things need to be supported:
>
> 1. The JR2 index format supported NodeType inheritance, so an index rule
> defined earlier would supersede a definition defined later. Hence the
> need for an ordered list keyed by NodeType.
>
> 2. Support for regular expressions in property names - In JR2 the property
> name regex was scoped by nodeType. Turning it around would make it
> tricky.
>
> Chetan Mehrotra
>

Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Chetan Mehrotra <ch...@gmail.com>.

Hi Tommaso,

On Thu, Nov 6, 2014 at 2:18 PM, Tommaso Teofili
<to...@gmail.com> wrote:
> the drawback is that you would have to define a similar structure for each
> field to be boosted for each node type, the advantage is that it's
> compliant with what we have in 1.0.8.

I would also prefer that, but a couple of things need to be supported:

1. The JR2 index format supported NodeType inheritance, so an index rule
defined earlier would supersede a definition defined later. Hence the
need for an ordered list keyed by NodeType.

2. Support for regular expressions in property names - In JR2 the property
name regex was scoped by nodeType. Turning it around would make it
tricky.
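
A minimal Python sketch of these two requirements, under stated assumptions:
the supertype table is a simplified, hypothetical stand-in for what the node
type manager would provide, the rule list and function names are made up for
illustration, and only the first matching rule is consulted.

```python
import re

# Hypothetical, simplified supertype table (not read from a real repository).
SUPERTYPES = {
    "nt:file": ["nt:hierarchyNode"],
    "nt:hierarchyNode": ["nt:base"],
    "nt:unstructured": ["nt:base"],
    "nt:base": [],
}

# Ordered rules: earlier entries win. Each rule carries its own ordered list
# of (property name regex, config) pairs, so the regex is scoped by nodeType.
INDEX_RULES = [
    ("nt:file", [(r"jcr:.*", {"boost": 2.0})]),
    ("nt:base", [(r"title", {"boost": 1.5}), (r".*", {"boost": 1.0})]),
]


def is_node_type(node_type, candidate):
    """True if node_type equals candidate or inherits from it."""
    if node_type == candidate:
        return True
    return any(is_node_type(parent, candidate)
               for parent in SUPERTYPES.get(node_type, []))


def find_rule(node_type):
    """Return the first rule whose node type matches, honouring inheritance."""
    for rule_type, prop_patterns in INDEX_RULES:
        if is_node_type(node_type, rule_type):
            return rule_type, prop_patterns
    return None


def property_config(node_type, prop_name):
    """Return the config of the first matching property regex, or None."""
    rule = find_rule(node_type)
    if rule is None:
        return None
    _, prop_patterns = rule
    for pattern, config in prop_patterns:
        if re.fullmatch(pattern, prop_name):
            return config
    return None
```

Here an nt:file node hits the "nt:file" rule and never falls through to
"nt:base", which is why the order of the rules matters. An nt:unstructured
node matches "nt:base" through inheritance.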

Chetan Mehrotra

Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Tommaso Teofili <to...@gmail.com>.

2014-11-05 13:42 GMT+01:00 Chetan Mehrotra <ch...@gmail.com>:

> On Wed, Nov 5, 2014 at 3:30 PM, Marcel Reutegger <mr...@adobe.com>
> wrote:
> > with your configuration proposal we'd now have three different
> > places where a property can be specified:
> >
> > - includePropertyNames
> > - properties
> > - indexRules
> >
> > would it be possible to unify those definitions into a single one?
>
> Yes, that needs to be done. This happened because LucenePropertyIndex
> works in whitelist mode, while the Lucene full text index works in a more
> generic mode where you use some common settings for everything and then
> customize them for specific cases. Hence the need for
> 'includePropertyNames' to provide the whitelist of property names and
> 'properties' to capture property-specific changes.
>
> So, thinking more about it, I think it's better to use the JR config model
> and map it to the content tree. The proposed model would be something
> like:
>
> -----------------------------
>         "indexRules" : {
>             //need orderable child nodes to honour
>             //nodeType hierarchy
>             "jcr:primaryType": "nt:unstructured",
>             "nt:unstructured" : {
>                 "properties" :{
>                     "title" : { /* Scoped property */
>                         "boost" : 1.5
>                     }
>                 }
>             },
>             "nt:file" : {
>                 "boost" : "2.0",
>                 "condition" : "@priority = 'high'"
>             },
>             "app:Asset" : {
>                 "properties": {
>                     "jcr:primaryType":"oak:Unstructured",
>                     //relative property jcr:content/jcr:lastModified
>                     "jcr:content": {
>                         "jcr:primaryType":"oak:Unstructured",
>                         "jcr:lastModified": {
>                             "jcr:primaryType":"oak:Unstructured",
>                             "type":"Date"
>                         }
>                     }
>                 }
>             },
>             "nt:base" : {
>                 "properties" :{
>                      //need orderable property nodes
>                      //to support regexp
>                     "jcr:primaryType": "nt:unstructured",
>                     "offTime" : {
>                         "nodeScopeIndex" : false
>                     }
>                 }
>             }
>         }
> -----------------------------
>
> The above model would supersede the current config options:
>
> * declaringNodeTypes - This can be computed from the indexRules child node
> list
> * relativeProperties - The relative property names would also be
> scoped to their nodeType
> * includePropertyNames - A property would be included if it passes any
> of the index rules
>
> @Tommaso - I think using nodeType as the node name instead of
> arbitrary name would remove redundancy
>

ok


>
> Given existing config is part of 1.0.8 we would need to support both
> but users would not be allowed to mix both approaches.
>

Right. What about making the field name the common ancestor for both
unscoped and scoped properties?

"properties": {
            "title" : { "boost" : 2.0, /* Unscoped property */
                           "nt:unstructured" : { "boost" : 1.5 }, /* scoped property */
                           "nt:file" : { "boost" : 1.2 } /* scoped property */
                       }
        },

The drawback is that you would have to define a similar structure for each
field to be boosted for each node type; the advantage is that it's
compliant with what we have in 1.0.8.

Regards,
Tommaso


>
> Thoughts?
>
>
> Chetan Mehrotra
>

Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Chetan Mehrotra <ch...@gmail.com>.

On Wed, Nov 5, 2014 at 3:30 PM, Marcel Reutegger <mr...@adobe.com> wrote:
> with your configuration proposal we'd now have three different
> places where a property can be specified:
>
> - includePropertyNames
> - properties
> - indexRules
>
> would it be possible to unify those definitions into a single one?

Yes, that needs to be done. This happened because LucenePropertyIndex
works in whitelist mode, while the Lucene full text index works in a more
generic mode where you use some common settings for everything and then
customize them for specific cases. Hence the need for
'includePropertyNames' to provide the whitelist of property names and
'properties' to capture property-specific changes.

So, thinking more about it, I think it's better to use the JR config model
and map it to the content tree. The proposed model would be something
like:

-----------------------------
        "indexRules" : {
            //need orderable child nodes to honour
            //nodeType hierarchy
            "jcr:primaryType": "nt:unstructured",
            "nt:unstructured" : {
                "properties" :{
                    "title" : { /* Scoped property */
                        "boost" : 1.5
                    }
                }
            },
            "nt:file" : {
                "boost" : "2.0",
                "condition" : "@priority = 'high'"
            },
            "app:Asset" : {
                "properties": {
                    "jcr:primaryType":"oak:Unstructured",
                    //relative property jcr:content/jcr:lastModified
                    "jcr:content": {
                        "jcr:primaryType":"oak:Unstructured",
                        "jcr:lastModified": {
                            "jcr:primaryType":"oak:Unstructured",
                            "type":"Date"
                        }
                    }
                }
            },
            "nt:base" : {
                "properties" :{
                     //need orderable property nodes
                     //to support regexp
                    "jcr:primaryType": "nt:unstructured",
                    "offTime" : {
                        "nodeScopeIndex" : false
                    }
                }
            }
        }
-----------------------------

The above model would supersede the current config options:

* declaringNodeTypes - This can be computed from the indexRules child node list
* relativeProperties - The relative property names would also be
scoped to their nodeType
* includePropertyNames - A property would be included if it passes any
of the index rules
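
To illustrate how the existing options could be derived from the rules, here
is a short Python sketch. The function names are hypothetical, the rule data
is a trimmed copy of the model above, and "passes a rule" is simplified to
"is listed under that rule's properties" (no regex or relative paths).

```python
# Trimmed copy of the proposed indexRules node; plain values such as
# jcr:primaryType are properties, nested dicts stand for child nodes.
index_rules = {
    "jcr:primaryType": "nt:unstructured",
    "nt:unstructured": {"properties": {"title": {"boost": 1.5}}},
    "nt:file": {"boost": "2.0", "condition": "@priority = 'high'"},
    "nt:base": {"properties": {"offTime": {"nodeScopeIndex": False}}},
}


def declaring_node_types(rules):
    """Derive the node type list from the child nodes of indexRules."""
    return [name for name, value in rules.items() if isinstance(value, dict)]


def is_property_included(rules, prop_name):
    """A property is included if any rule lists it (simplified check)."""
    return any(
        prop_name in rule.get("properties", {})
        for rule in rules.values()
        if isinstance(rule, dict)
    )
```

The derived list keeps the child node order, which is what the nodeType
hierarchy handling relies on.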

@Tommaso - I think using nodeType as the node name instead of
arbitrary name would remove redundancy

Given existing config is part of 1.0.8 we would need to support both
but users would not be allowed to mix both approaches.

Thoughts?


Chetan Mehrotra

Re: Features to be supported while enabling boost support in Lucene Full text index

Posted by Marcel Reutegger <mr...@adobe.com>.

Hi,

in general I think it would be nice to have more control over
the boost values. I'm not sure if we need all the features we
have in Jackrabbit. Some of them are rarely used and not too
useful. I would rather implement the most useful features first
and then wait for feedback if more is needed.

With your configuration proposal we'd now have three different
places where a property can be specified:

- includePropertyNames
- properties
- indexRules

would it be possible to unify those definitions into a single one?

Regards
 Marcel
