Posted to oak-dev@jackrabbit.apache.org by Chetan Mehrotra <ch...@gmail.com> on 2014/11/19 14:23:30 UTC

Disable parsing of fulltext string in QueryEngine

Following up on the earlier mail thread [1], but focusing on the fulltext
parsing that happens at the Query Engine level.

Consider a case where we search for "mountain is big" and assume that
no aggregation complexity is involved:

/jcr:root/content//element(*, test:Asset)[(jcr:contains(., 'mountain is big'))]

As per OAK-890, this gets broken into a full-text expression
which is an *and* of 'mountain', 'is', 'big'. LuceneIndex then sees an
already tokenized full-text phrase and constructs a Lucene
query like the one below:

+:fulltext:big +:fulltext:is +:fulltext:mountain

This query might not behave as expected if the analyzer is
configured with stop words, which would ignore 'is'.
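
To make the stop-word concern concrete, here is a minimal sketch in plain Python rather than Lucene. The `analyze` function, the `STOP_WORDS` set and the toy index are assumptions standing in for a configured analyzer. It also assumes that pre-split terms are looked up without being re-analyzed, which may not match what LuceneIndex actually does.

```python
# Toy model of the stop-word problem; this is not Lucene or Oak code.
STOP_WORDS = {"is", "the", "a"}  # assumed analyzer configuration


def analyze(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def matches_all(indexed_tokens, required_terms):
    """Model of '+term +term ...': every term must be present in the index."""
    return all(term in indexed_tokens for term in required_terms)


# Index time: the analyzer removes 'is' before the text is stored.
indexed = set(analyze("The mountain is big"))

# The query engine splits the string itself, so 'is' becomes a required term.
pre_split = "mountain is big".split()
split_hit = matches_all(indexed, pre_split)

# The raw string is handed to the index, which applies the same analyzer.
raw_hit = matches_all(indexed, analyze("mountain is big"))

print(split_hit, raw_hit)
```

In this model the pre-split query misses the document because 'is' was never indexed, while the raw string analyzed on the index side still matches.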

To avoid such cases it would be better if the QueryEngine does not
parse the fulltext string in any form and passes the string as is.

The only thing that would be lost in such a case is boost support.
That can possibly be handled at the LuceneIndex level.
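
As a rough idea of what handling boost on the index side could look like, the sketch below pulls per-term boosts out of a raw string. It assumes a `term^boost` suffix notation, and `parse_boosts` is a hypothetical helper, not an existing Oak or Lucene function. Quoted phrases, escaping and operators are deliberately left out.

```python
# Hypothetical helper: extract per-term boosts from a raw full-text string.
def parse_boosts(raw, default_boost=1.0):
    """Return (term, boost) pairs; 'term^2' is read as boost 2.0."""
    result = []
    for token in raw.split():
        term, sep, boost = token.rpartition("^")
        if sep and term:
            try:
                result.append((term, float(boost)))
                continue
            except ValueError:
                pass  # no number after '^': keep the token unchanged
        result.append((token, default_boost))
    return result


pairs = parse_boosts("mountain^2 is big")
print(pairs)
```

The stripped terms could then go through the index's own analyzer, with the boost applied to whatever query is built for each surviving term.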

Looking at the JR2 code, I think no such parsing was performed at that time
[2] and the text passed as part of the query is passed *as is* to the Lucene
QueryParser.

So should we disable the Fulltext parsing happening in QueryEngine?

Chetan Mehrotra
[1] http://markmail.org/thread/cyu7evezbi4u22gr
[2] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/JackrabbitQueryParser.java

Re: Disable parsing of fulltext string in QueryEngine

Posted by Chetan Mehrotra <ch...@gmail.com>.

After further discussion with Thomas it appears that the QueryEngine needs
to provide a different AST for fulltext expressions such that
LuceneIndex can access the non-tokenized expression. Opened OAK-2301
to track that.
Chetan Mehrotra

Re: Disable parsing of fulltext string in QueryEngine

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>I want to treat following two differently
>
>1. FulltextExpression created out of simple string term passed to a
>single jcr:contains in a query
>2. FulltextExpression by combining multiple jcr:contains
>
>For #1 it would be better to get access to raw string and pass it to
>Lucene analyzer for tokenization. For #2 it would be preferable to get
>the FullTextExpression AST such that it can be mapped to required
>Lucene query

The AST should have the same information as the raw string, so that you
should be able to easily generate the raw string from the AST.

>That can be done, but then how can I distinguish between a
>FulltextExpression created out of "mountain is big" and
>jcr:contains("title","mountain is big")? So if the fulltext expression can
>provide some hint about what it was constructed from, that might help

Do you mean jcr:contains(@title, 'mountain is big')? The
FullTextExpression AST for this is:

    FullTextAnd(
      FullTextTerm(propertyName="title", text="mountain"),
      FullTextTerm(propertyName="title", text="is"),
      FullTextTerm(propertyName="title", text="big")
    )

and toString is:

    title:"mountain" title:"is" title:"big"

Is this the correct representation?

Regards,

Thomas
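
The AST and its string form described above can be modelled with a short sketch. `Term` and `And` are simplified stand-ins written for illustration, not the real Oak classes, and `raw_text` is a hypothetical helper showing one way to regenerate a string from the terms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Term:
    """Stand-in for a FullTextTerm node (property name plus text)."""
    text: str
    property_name: Optional[str] = None

    def __str__(self):
        quoted = '"' + self.text + '"'
        return f"{self.property_name}:{quoted}" if self.property_name else quoted


@dataclass(frozen=True)
class And:
    """Stand-in for a FullTextAnd node."""
    terms: List[Term]

    def __str__(self):
        return " ".join(str(t) for t in self.terms)


def raw_text(expr):
    """Rebuild a whitespace-separated string from the term texts."""
    return " ".join(t.text for t in expr.terms)


expr = And([Term("mountain", "title"), Term("is", "title"), Term("big", "title")])
print(str(expr))
print(raw_text(expr))
```

Note that in this model the rebuilt string cannot say whether the three words came from one jcr:contains or from several combined conditions.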

Re: Disable parsing of fulltext string in QueryEngine

Posted by Chetan Mehrotra <ch...@gmail.com>.

Hi Thomas,

On Mon, Nov 24, 2014 at 3:30 PM, Thomas Mueller <mu...@adobe.com> wrote:
> (With "full-text parsing" I understand parsing a full-text expression,
> which consists of one or multiple "contains" conditions, into a
> FullTextExpression AST. If you have a different understanding, then please
> tell me.)

I want to treat the following two cases differently:

1. A FulltextExpression created out of a simple string term passed to a
single jcr:contains in a query
2. A FulltextExpression created by combining multiple jcr:contains

For #1 it would be better to get access to the raw string and pass it to
the Lucene analyzer for tokenization. For #2 it would be preferable to get
the FullTextExpression AST so that it can be mapped to the required
Lucene query.

On Mon, Nov 24, 2014 at 3:30 PM, Thomas Mueller <mu...@adobe.com> wrote:
> As for the "mountain is big" example, the problem seems to be the
> query-time aggregation, not the parsing of the expression in the query
> engine, and not the use of the FullTextExpression. If you want to get the
> term "mountain is big" from the FullTextExpression, use
> FullTextExpression.toString() ("Get the string representation of the
> condition.").

That can be done, but then how can I distinguish between a
FulltextExpression created out of "mountain is big" and
jcr:contains("title","mountain is big")? So if the fulltext expression can
provide some hint about what it was constructed from, that might help.

Chetan Mehrotra
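
One possible shape for such a hint is sketched below. `FullTextHint`, `from_single_contains`, `combine` and `plan` are hypothetical names invented for this illustration; nothing here reflects an actual Oak API or an agreed design. The idea is only that an expression built from a single jcr:contains keeps its raw string, while a combined expression does not.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FullTextHint:
    """Hypothetical expression: parsed terms plus an optional raw string."""
    terms: Tuple[str, ...]
    raw: Optional[str] = None  # set only for a single jcr:contains


def from_single_contains(text):
    """Case #1: keep the untouched string next to the split terms."""
    return FullTextHint(tuple(text.split()), raw=text)


def combine(*exprs):
    """Case #2: combining conditions drops the raw string; only terms remain."""
    return FullTextHint(tuple(t for e in exprs for t in e.terms))


def plan(expr):
    """Decide what the index would pass to its analyzer or query builder."""
    if expr.raw is not None:
        return ("analyze-raw", expr.raw)
    return ("map-ast", expr.terms)


single = from_single_contains("mountain is big")
both = combine(single, from_single_contains("snow"))
print(plan(single))
print(plan(both))
```

With this, the index can send the raw string to its own analyzer in case #1 and fall back to mapping the parsed terms in case #2.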

Re: Disable parsing of fulltext string in QueryEngine

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

(With "full-text parsing" I understand parsing a full-text expression,
which consists of one or multiple "contains" conditions, into a
FullTextExpression AST. If you have a different understanding, then please
tell me.)

Full-text parsing was originally needed internally (and was not public) in
the query engine, to support the case where no full-text index is
available. But then we improved and moved those classes to the QueryIndex
API (Filter.getFullTextConstraint) to support query-time aggregation. I
think we still want to support query-time aggregation, so we need to keep
it there. I understand we want to use index-time aggregation by default,
the same as Jackrabbit 2.x, but I don't think we want to drop support for
query-time aggregation, at least not yet.

As for the "mountain is big" example, the problem seems to be the
query-time aggregation, not the parsing of the expression in the query
engine, and not the use of the FullTextExpression. If you want to get the
term "mountain is big" from the FullTextExpression, use
FullTextExpression.toString() ("Get the string representation of the
condition.").

Regards,
Thomas

Re: Disable parsing of fulltext string in QueryEngine

Posted by Chetan Mehrotra <ch...@gmail.com>.

Thanks Davide for the feedback.

Would be helpful to get some more feedback on what should be done
there. So waiting for more feedback!
Chetan Mehrotra

Re: Disable parsing of fulltext string in QueryEngine

Posted by Davide Giannella <da...@apache.org>.

On 19/11/2014 13:23, Chetan Mehrotra wrote:
> ...
>
> So should we disable the Fulltext parsing happening in QueryEngine?
>
I think we should reproduce the OOTB behaviour as it was in JR2, as
customers updating from that will expect the same behaviour.

So I would first verify what that behaviour is and work accordingly.
Then if the behaviour is NOT to split the string, I would like to have
this option in a configurable way. In this way customers willing to
leverage the boost stuff could trigger some configuration.

Could it make sense? (I'm not expert in Lucene) :)

Cheers
Davide

Re: Disable parsing of fulltext string in QueryEngine

Posted by Alex Parvulescu <al...@gmail.com>.

Hi,

I fully agree with the idea that the Query Engine should not split the
search phrase into tokens.

If I remember correctly, this behavior is there to allow the default
full-text engine to work, so to keep those parts working (if needed) we can
simply move this simple tokenization mechanism to the index impl.

> So should we disable the Fulltext parsing happening in QueryEngine?
+1


alex