Posted to dev@jackrabbit.apache.org by Cédric Damioli <ce...@anyware-tech.com> on 2006/02/23 16:19:42 UTC

Lucene Analyzer not used when querying the index ?

Hi all,

I noticed that no Lucene Analyzer is used when querying the repository : 
when building the actual Lucene query the 
o.a.j.c.query.lucene.LuceneQueryBuilder does not make any use of the 
Analyzer (at least in my case).
So the input pattern is not correctly tokenized, and the query does not 
return the correct answer.

Let me describe my example: I'm using Chinese characters, say A and B. I 
set a property named "title" with the value "AB" (the two Chinese 
characters without any whitespace).
After indexing (with the default StandardAnalyzer) the text has been 
tokenized and the index contains at least three noticeable terms:
- one associated with the field _PROPERTIES and the value "title\uFFFFAB"
- one associated with the field FULL:title and the value "A"
- one associated with the field FULL:title and the value "B"

After that I try to execute an XPath Query like //*[jcr:contains(@title, 
'*AB*')]
I of course expected this query to return the previously set property, 
but I obtained no results.
After looking at the code, I can say that the Analyzer is not called for 
a WildcardQuery, so my "AB" is not tokenized and furthermore, it seems 
that the _PROPERTIES field is not used when searching, otherwise, I 
think it would match.

I know that StandardAnalyzer is not the best suited for handling Chinese 
text, but that's another story.
It seems to me that there may be a Jackrabbit problem here, so I wanted 
to have your feelings about this.

Regards,

-- 
Cédric Damioli
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

Re: Lucene Analyzer not used when querying the index ?

Posted by Cédric Damioli <ce...@anyware-tech.com>.

Thanks a lot Marcel for your answers,


Marcel Reutegger wrote:
> Cédric Damioli wrote:
>> Hi all,
>>
>> I noticed that no Lucene Analyzer is used when querying the 
>> repository : when building the actual Lucene query the 
>> o.a.j.c.query.lucene.LuceneQueryBuilder does not make any use of the 
>> Analyzer (at least in my case).
>
> in general the analyzer is used for the contains() function to 
> tokenize the fulltext query parameter. however there is one exception 
> to this rule: terms that use wildcards are not tokenized.
>
> the reason for this is a technical one. an analyzer that is based on a 
> grammar will not be able to process such tokens properly.
>
> e.g. if the grammar rule says 'a' 'b' and 'abc' are tokens then the 
> analyzer would be unable to determine if 'ab*' should be tokenized or 
> not.
>
>> Let me describe my example: I'm using Chinese characters, say A and B. 
>> I set a property named "title" with the value "AB" (the two Chinese 
>> characters without any whitespace).
>> After indexing (with the default StandardAnalyzer) the text has 
>> been tokenized and the index contains at least three noticeable terms:
>> - one associated with the field _PROPERTIES and the value "title\uFFFFAB"
>> - one associated with the field FULL:title and the value "A"
>> - one associated with the field FULL:title and the value "B"
>>
>> After that I try to execute an XPath Query like 
>> //*[jcr:contains(@title, '*AB*')]
>> I of course expected this query to return the previously set 
>> property, but I obtained no results.
>> After looking at the code, I can say that the Analyzer is not called 
>> for a WildcardQuery, so my "AB" is not tokenized and furthermore,
>
> if you execute the following query you will get the expected result:
> //*[jcr:contains(@title, 'AB')]
>
> assuming A and B are Chinese characters, they will get tokenized and 
> the fulltext query is actually a phrase match, similar to searching 
> for 'hello there'.
I actually can't use that query, because my application handles both 
Chinese and Latin-1 characters, and for Latin ones the query needs to 
be wildcarded; otherwise it would only match exact tokens, which is not 
what I want.

But I now understand the processing.
The approach that works for me is:
- First, tokenize my query string ("AB") using the same tokenizer as 
Jackrabbit (StandardTokenizer by default).
- Then build the XPath query with a separate clause for each 
token: //*[jcr:contains(@title, '*A*') and jcr:contains(@title, '*B*')]
- This query gives me the correct answer.

With this processing I can query the index with both Chinese and 
European strings.
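
The tokenize-then-build steps could be sketched in Java roughly as follows. This is only a minimal sketch under stated assumptions: the class ContainsQueryBuilder and its methods are hypothetical names, and tokenize() is a simplified stand-in (one token per CJK ideograph, one token per run of other letters or digits) rather than Lucene's StandardTokenizer. A real application would presumably run the same analyzer that the workspace index uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jackrabbit or Lucene.
final class ContainsQueryBuilder {

    // Simplified stand-in for the index tokenizer (an assumption):
    // each CJK ideograph is its own token, other letters and digits
    // are grouped into runs, everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(cp);
            } else {
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Moves the pending run of characters, if any, into the token list.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    // Builds one wildcarded jcr:contains() clause per token, joined with "and".
    // Tokens contain only letters and digits, so no quote escaping is needed.
    static String buildXPath(String property, String userInput) {
        List<String> clauses = new ArrayList<>();
        for (String token : tokenize(userInput)) {
            clauses.add("jcr:contains(@" + property + ", '*" + token + "*')");
        }
        if (clauses.isEmpty()) {
            throw new IllegalArgumentException("no searchable tokens in input");
        }
        return "//*[" + String.join(" and ", clauses) + "]";
    }
}
```

Calling ContainsQueryBuilder.buildXPath("title", "AB"), with A and B being two Chinese characters, yields the two-clause query shown above, while a Latin word such as "hello" yields a single '*hello*' clause.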

Thanks for your help

Regards,

-- 
Cédric Damioli
Information Systems Project Manager
CMS Solutions
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

Re: Lucene Analyzer not used when querying the index ?

Posted by Marcel Reutegger <ma...@gmx.net>.

Cédric Damioli wrote:
> Hi all,
> 
> I noticed that no Lucene Analyzer is used when querying the repository : 
> when building the actual Lucene query the 
> o.a.j.c.query.lucene.LuceneQueryBuilder does not make any use of the 
> Analyzer (at least in my case).

in general the analyzer is used for the contains() function to tokenize 
the fulltext query parameter. however there is one exception to this 
rule: terms that use wildcards are not tokenized.

the reason for this is a technical one. an analyzer that is based on a 
grammar will not be able to process such tokens properly.

e.g. if the grammar rule says 'a' 'b' and 'abc' are tokens then the 
analyzer would be unable to determine if 'ab*' should be tokenized or not.
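
The consequence for the example in this thread can be illustrated with a toy sketch. It is plain Java, not Lucene code: WildcardTermDemo and its methods are hypothetical names, and the matching is a simplification that assumes a wildcard pattern is compared against each whole indexed term, one term at a time.

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy model of wildcard matching against indexed terms (not Lucene's code).
final class WildcardTermDemo {

    // Matches a '*' / '?' wildcard pattern against one whole indexed term.
    static boolean matchesTerm(String wildcard, String term) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); ) {
            int cp = wildcard.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '*') {
                regex.append(".*");          // any run of characters
            } else if (cp == '?') {
                regex.append('.');           // exactly one character
            } else {
                // Literal character, quoted so regex metacharacters are inert.
                regex.append(Pattern.quote(new String(Character.toChars(cp))));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    // True if the pattern matches at least one of the indexed terms.
    static boolean matchesAnyTerm(String wildcard, List<String> terms) {
        for (String term : terms) {
            if (matchesTerm(wildcard, term)) {
                return true;
            }
        }
        return false;
    }
}
```

With the indexed terms "A" and "B" for FULL:title, the untokenized pattern '*AB*' matches neither term, because no single term contains both characters, whereas '*A*' and '*B*' each match one term.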

> Let me describe my example: I'm using Chinese characters, say A and B. I 
> set a property named "title" with the value "AB" (the two Chinese 
> characters without any whitespace).
> After indexing (with the default StandardAnalyzer) the text has been 
> tokenized and the index contains at least three noticeable terms:
> - one associated with the field _PROPERTIES and the value "title\uFFFFAB"
> - one associated with the field FULL:title and the value "A"
> - one associated with the field FULL:title and the value "B"
> 
> After that I try to execute an XPath Query like //*[jcr:contains(@title, 
> '*AB*')]
> I of course expected this query to return the previously set property, 
> but I obtained no results.
> After looking at the code, I can say that the Analyzer is not called for 
> a WildcardQuery, so my "AB" is not tokenized and furthermore,

if you execute the following query you will get the expected result:
//*[jcr:contains(@title, 'AB')]

assuming A and B are Chinese characters, they will get tokenized and the 
fulltext query is actually a phrase match, similar to searching for 
'hello there'.

> it seems 
> that the _PROPERTIES field is not used when searching, otherwise, I 
> think it would match.

the PROPERTIES field is only used for jcr:like and other operators.
e.g. you can search the workspace with the following query:
//*[jcr:like(@title, '%AB%')]

this will internally use the PROPERTIES field.

> I know that StandardAnalyzer is not the best suited for handling Chinese 
> text, but that's another story.

there might be implementations that are better suited for Chinese text, 
but I think it does a pretty good job.

> It seems to me that there may be a Jackrabbit problem here, so I wanted 
> to have your feelings about this.

What you described is imo expected behaviour in jackrabbit.

Regarding analyzers, you can configure one on a per-workspace basis and 
use any of the many available analyzers, e.g. from the Lucene website.
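
As a hedged illustration, the analyzer is typically set through the "analyzer" parameter of the SearchIndex element in the workspace configuration (workspace.xml). The fragment below is a sketch only: the CJKAnalyzer class name is an example and assumes that the corresponding Lucene analyzers jar is on the repository classpath, and the path value is the usual default rather than something taken from this thread.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Assumption: this analyzer class is available on the classpath. -->
  <param name="analyzer" value="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</SearchIndex>
```

Content indexed before such a change was tokenized with the previous analyzer, so the workspace index would presumably need to be rebuilt for queries to behave consistently.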

regards
  marcel