Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2013/07/09 11:05:50 UTC

[jira] [Updated] (OAK-890) Query: advanced fulltext search conditions

     [ https://issues.apache.org/jira/browse/OAK-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Mueller updated OAK-890:
-------------------------------

    Description: 
Currently, the query engine does not use a fulltext index if there are multiple fulltext conditions combined with "or". Also, the QueryIndex interface does not support boosts, and does not support fulltext conditions on properties (just on nodes): Filter.getFulltextConditions is a collection of strings, combined with "and", but it does not say whether a condition applies to one property or to all properties. Also, the popular sorting by score (especially descending) is not currently supported.

[~mreutegg] and I discussed how we could support those features (including boost) in a way that is backward compatible with Jackrabbit 2.x, but without adding a lot of complexity. Example Jackrabbit 2.x queries:

{code}
/jcr:root/content//*[(@jcr:primaryType='page' 
  and (jcr:contains(jcr:content/@tags, 'it:blue') 
  or jcr:contains(jcr:content/@tags, '/tags/it/blue')))]

/jcr:root/content//element(*, nt:hierarchyNode)[
  (jcr:contains(jcr:content, 'SomeTextToSearch') 
  or jcr:contains(jcr:content/@jcr:title, 'SomeTextToSearch') 
  or jcr:contains(jcr:content/@jcr:description, 'SomeTextToSearch'))]
  /rep:excerpt(.) order by @jcr:score descending 
{code}

A possible solution is to extend the internal fulltext syntax to support those features. The internal fulltext syntax is the one used by Filter.getFulltextCondition (not the one used within the original XPath, SQL, or SQL-2 query). The proposed syntax (work in progress, just a rough draft so far) is:

{code}
FullTextSearch ::= Or
  ['order by score' [' desc']]
Or ::= And {' OR ' And}* 
And ::= Term {' ' Term}*
Term ::= '(' Or ')' | ['-'] SimpleTerm
SimpleTerm ::= [Property ':'] '"' Word {' ' Word}* '"' ['^' Boost]
Property ::= <property name>
Boost ::= <number>
{code}
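
To make the draft grammar more concrete, here is a minimal recursive-descent recognizer for it, written as a sketch only. The class name FulltextSyntaxSketch and its fields are hypothetical and not part of Oak. It assumes that a single space separates the condition from 'order by score', that phrases contain no embedded double quotes, that property names contain no spaces or parentheses, and that a boost consists of digits and dots.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recognizer for the draft grammar above; not part of Oak. */
class FulltextSyntaxSketch {
    private final String s;
    private int pos;
    /** Every SimpleTerm in reading order, including a leading '-' and a trailing boost. */
    final List<String> terms = new ArrayList<>();
    boolean orderByScore;
    boolean descending;

    private FulltextSyntaxSketch(String s) {
        this.s = s;
    }

    static FulltextSyntaxSketch parse(String condition) {
        FulltextSyntaxSketch p = new FulltextSyntaxSketch(condition);
        p.or();
        // Assumption: a single space separates the condition from 'order by score'.
        if (p.accept(" order by score")) {
            p.orderByScore = true;
            p.descending = p.accept(" desc");
        }
        if (p.pos != p.s.length()) {
            throw p.error("unexpected input");
        }
        return p;
    }

    // Or ::= And {' OR ' And}*
    private void or() {
        and();
        while (accept(" OR ")) {
            and();
        }
    }

    // And ::= Term {' ' Term}*
    private void and() {
        term();
        // A space continues the And unless it starts ' OR ' or the 'order by' part.
        while (s.startsWith(" ", pos) && !s.startsWith(" OR ", pos)
                && !s.startsWith(" order by score", pos)) {
            pos++;
            term();
        }
    }

    // Term ::= '(' Or ')' | ['-'] SimpleTerm
    private void term() {
        if (accept("(")) {
            or();
            if (!accept(")")) {
                throw error("expected ')'");
            }
            return;
        }
        int start = pos;
        accept("-");
        simpleTerm();
        terms.add(s.substring(start, pos));
    }

    // SimpleTerm ::= [Property ':'] '"' Word {' ' Word}* '"' ['^' Boost]
    private void simpleTerm() {
        int quote = s.indexOf('"', pos);
        if (quote < 0) {
            throw error("expected opening double quote");
        }
        if (quote > pos) {
            // Property names may contain ':' themselves, so the property ends
            // at the ':' directly in front of the opening quote.
            String property = s.substring(pos, quote - 1);
            if (s.charAt(quote - 1) != ':' || property.isEmpty()
                    || property.matches(".*[ ()].*")) {
                throw error("bad property name");
            }
        }
        int end = s.indexOf('"', quote + 1);
        if (end <= quote + 1) {
            throw error("empty or unterminated phrase");
        }
        pos = end + 1;
        if (accept("^")) {
            int numberStart = pos;
            while (pos < s.length()
                    && (Character.isDigit(s.charAt(pos)) || s.charAt(pos) == '.')) {
                pos++;
            }
            if (pos == numberStart) {
                throw error("expected boost");
            }
        }
    }

    private boolean accept(String token) {
        if (s.startsWith(token, pos)) {
            pos += token.length();
            return true;
        }
        return false;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos + " in: " + s);
    }
}
```

The recognizer only checks the shape of a condition and collects the simple terms; it does not build a tree. An index implementation that hands the condition to Lucene would probably need little more than the 'order by' handling in parse().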

The idea is that the syntax matches the syntax used by Lucene (except for the 'order by' part), so that the Lucene and Solr index implementations should get simpler (only need minimal parsing, possibly just the 'order by' part). Search terms (phrases, words) are always within double quotes. That means the above queries would result in the following conditions:

{code}
jcr:content/tags:"it:blue" 
OR jcr:content/tags:"/tags/it/blue"

jcr:content/*:"SomeTextToSearch" 
OR jcr:content/jcr:title:"SomeTextToSearch"
OR jcr:content/jcr:description:"SomeTextToSearch"
order by score desc
{code}
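
The mapping from the jcr:contains operands to the conditions above can be sketched as follows. This is an illustration under assumptions, not Oak code: the class FulltextConditionSketch and its methods are hypothetical, the rule "strip '@' for a property, append '/*' for a node" is inferred only from the two examples, and the parts are joined with single spaces rather than line breaks.

```java
import java.util.List;

/** Hypothetical helper that renders conditions in the proposed syntax; not part of Oak. */
class FulltextConditionSketch {

    /**
     * Maps the first operand of jcr:contains to a property selector.
     * Inferred from the examples: 'a/@b' becomes 'a/b', and a node path 'a' becomes 'a/*'.
     */
    static String propertyPath(String operand) {
        int slash = operand.lastIndexOf('/');
        String last = operand.substring(slash + 1);
        if (last.startsWith("@")) {
            return operand.substring(0, slash + 1) + last.substring(1);
        }
        return operand + "/*";
    }

    /** Renders one SimpleTerm; the phrase is assumed to contain no double quotes. */
    static String term(String operand, String phrase) {
        return propertyPath(operand) + ":\"" + phrase + "\"";
    }

    /** Combines already rendered terms with ' OR '. */
    static String anyOf(List<String> terms) {
        return String.join(" OR ", terms);
    }

    /** Appends the sort clause for descending score. */
    static String orderByScoreDesc(String condition) {
        return condition + " order by score desc";
    }
}
```

For the second query, calling term() for jcr:content, jcr:content/@jcr:title and jcr:content/@jcr:description, combining the results with anyOf() and passing that to orderByScoreDesc() yields the second condition above on a single line.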

It would also allow switching back from 

{code}
Collection<String> getFulltextConditions()
{code}
to 
{code}
String getFulltextCondition()
{code}
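
One way the switch could work, sketched under assumptions: because the proposed syntax has parentheses and expresses "and" as a space, a collection of conditions can be folded into a single string. The class FulltextConditionMerge is hypothetical, and the sketch assumes every element is already a valid condition in the proposed syntax without an 'order by' part.

```java
import java.util.Collection;
import java.util.stream.Collectors;

/** Hypothetical helper; not part of Oak. */
class FulltextConditionMerge {

    /**
     * Folds conditions that are implicitly combined with "and" into one string.
     * Each condition is wrapped in parentheses so that an inner ' OR ' keeps its scope.
     * An empty collection yields an empty string.
     */
    static String merge(Collection<String> conditions) {
        if (conditions.size() == 1) {
            return conditions.iterator().next();
        }
        return conditions.stream()
                .map(c -> "(" + c + ")")
                .collect(Collectors.joining(" "));
    }
}
```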


  was:
Currently, the query engine does not use a fulltext index if there are multiple fulltext conditions combined with "or". Also, the QueryIndex interface does not support boosts, and does not support fulltext conditions on properties (just on nodes): Filter.getFulltextConditions is a collection of strings, combined with "and", but it does not say whether a condition applies to one property or to all properties. Also, the popular sorting by score (especially descending) is not currently supported.

[~mreutegg] and I discussed how we could support those features (including boost) in a way that is backward compatible with Jackrabbit 2.x, but without adding a lot of complexity. Example Jackrabbit 2.x queries:

{code}
/jcr:root/content//*[(@jcr:primaryType='page' 
  and (jcr:contains(jcr:content/@tags, 'it:blue') 
  or jcr:contains(jcr:content/@tags, '/tags/it/blue')))]

/jcr:root/content//element(*, nt:hierarchyNode)[
  (jcr:contains(jcr:content, 'SomeTextToSearch') 
  or jcr:contains(jcr:content/@jcr:title, 'SomeTextToSearch') 
  or jcr:contains(jcr:content/@jcr:description, 'SomeTextToSearch'))]
  /rep:excerpt(.) order by @jcr:score descending 
{code}

A possible solution is to extend the internal fulltext syntax to support those features. The internal fulltext syntax is the one used by Filter.getFulltextCondition (not the one used within the original XPath, SQL, or SQL-2 query). The proposed syntax (work in progress, just a rough draft so far) is:

{code}
FullTextSearchLiteral ::= Disjunct {Space 'OR' Space Disjunct}* 
  ['order by score' [Space 'desc']]
Disjunct ::= Term {Space Term}*
Term ::= ['-'] SimpleTerm
SimpleTerm ::= [Property ':'] '"' Word {Space Word}* '"' ['^' Boost]
Property ::= <property name>
Boost ::= <number>
{code}

The idea is that the syntax matches the syntax used by Lucene where possible. Search terms (phrases, words) are always within double quotes. That means the above queries would result in the following conditions:

{code}
jcr:content/tags:"it:blue" 
OR jcr:content/tags:"/tags/it/blue"

jcr:content/*:"SomeTextToSearch" 
OR jcr:content/jcr:title:"SomeTextToSearch"
OR jcr:content/jcr:description:"SomeTextToSearch"
order by score desc
{code}

It would also allow switching back from 

{code}
Collection<String> getFulltextConditions()
{code}
to 
{code}
String getFulltextCondition()
{code}


    
> Query: advanced fulltext search conditions
> ------------------------------------------
>
>                 Key: OAK-890
>                 URL: https://issues.apache.org/jira/browse/OAK-890
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
