Posted to users@jackrabbit.apache.org by "Dunstall, Christopher" <cd...@csu.edu.au> on 2010/09/02 06:19:00 UTC

RE: Problems with hyphen in JSR-170 XPath query using jcr:contains

I've got the customised Analyzer and Tokenizer working, but it seems I'm back at square one, maybe even further back because now it looks like it's being case sensitive.

My Analyzer:

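import java.io.Reader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
  private static final Logger LOGGER = LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);

  public TokenStream tokenStream(String field, final Reader reader) {
    // Note: reader.toString() logs the Reader object, not the text being analyzed.
    LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ? reader.toString() : "") + "]");

    // Emit the entire input as a single token; the lower-case filter is left out on purpose for this test.
    TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
    return keywordTokenStream;
    //return (new LowerCaseFilter(keywordTokenStream));
  }
}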

My HyphenKeywordTokenizer class is practically a direct copy of KeywordTokenizer, where it emits the entire input as a single token.  As you can see above, I'm not using the lower case filter, just to see what happens.

Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named 'Bob' 'Arlington-Smythe'.

A search for 'Sophie-Anne' returns the user's record; however, a search for 'sophie-anne' returns nothing, and neither do 'Sophie-A', 'Sophie' or 'Sophie*'. Should I be using double quotes in the query now? From what H. Wilson has found, it doesn't look like that will solve the problem.

The query being used is:
//*[@sling:resourceType="sakai/user-profile" and (jcr:contains(., 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score descending
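
One possible reading of the results above, sketched in plain Java with no Lucene dependency: a keyword-style analyzer stores the whole value as one case-preserving term, so a plain term only matches when it is identical to that term, and a prefix only matches when its case agrees as well. The class and method names are hypothetical, and the idea that the query side lower-cases wildcard terms is an assumption that would need to be checked against the tokens actually produced at index time and at search time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical model of single-token (keyword-style) matching; not Lucene code. */
class KeywordMatchSketch {

    /** Keyword-style analysis: the whole value becomes one term, optionally lower-cased. */
    static String analyze(String value, boolean lowerCase) {
        return lowerCase ? value.toLowerCase(Locale.ROOT) : value;
    }

    /** A term query matches only when the query term equals an indexed term exactly. */
    static boolean termMatches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    /** A prefix query such as Sophie* matches any indexed term that starts with the prefix. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Index side without a lower-case filter: the single term keeps its case.
        List<String> index = Arrays.asList(analyze("Sophie-Anne", false));

        System.out.println("Sophie-Anne : " + termMatches(index, "Sophie-Anne"));  // identical term
        System.out.println("sophie-anne : " + termMatches(index, "sophie-anne"));  // case differs
        System.out.println("Sophie      : " + termMatches(index, "Sophie"));       // only part of the term
        System.out.println("Sophie*     : " + prefixMatches(index, "Sophie"));     // prefix, case kept
        System.out.println("sophie*     : " + prefixMatches(index, "sophie"));     // prefix lower-cased (assumed)
    }
}
```

If both sides apply the same lower-casing, the first two lookups agree. A partial value such as 'Sophie' still needs a wildcard, because the field holds a single term.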


Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: H. Wilson [mailto:wilsonh@randdss.com] 
Sent: Wednesday, 1 September 2010 6:47 AM
To: users@jackrabbit.apache.org
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains


On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>
>> Given the following parameters in the repository:
>>
>>    .North.South.East.WestLand
>>    .North.South.East.West_Land
>>    .North.South.East.West Land    //yes that's a space
>>
>> The following exact name, case sensitive queries worked as expected for each
>> of the three parameters:
>>
>>    filter.orJCRExpression ("jcr:like(@" + srchField
>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case sens.
> jcr:like does not depend on any analyser but on the stored field, so
> this is not strange that it still works.
I expected this too, I just try to be as thorough as possible when 
posting anywhere. I am disappointed enough I haven't figured this out on 
my own.
>> The following exact name query, case insensitive, worked for only the
>> parameter with a fullName with a whitespace character:
>>
>>    filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>
>> The following exact name queries, case insensitive, stopped working for the
>> fullnames WITHOUT a whitespace character:
>>
>>    filter.addContains ( srchField,
>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>
>> Again, the only change I made was to the analyzer, I didn't remove my
>> "workaround" yet, and I just want to confirm I properly changed the analyzer
>> to figure out how the tokens were working. Oh I should note, the output from
>> the Analyzer only showed one Token per field, which I believe is what we
>> were looking for. Which leaves me as perplexed as before.
>>
>> LowerCaseKeywordAnalyzer.java:
>>
>>    ...
>>
>>    public TokenStream tokenStream ( String field, final Reader reader  ) {
>>             System.out.println ("TOKEN STREAM for field: " + field);
>>             TokenStream keywordTokenStream = super.tokenStream (field,
>> reader);
>>
>>         //changed for testing
>>             TokenStream lowerCaseStream =  new LowerCaseFilter (
>> keywordTokenStream ) ;
>>             final Token reusableToken = new Token();
>>             try {
>>                 Token mytoken = lowerCaseStream.next (reusableToken);
>>                 while ( mytoken != null  ) {
>>                     System.out.println ("[" + mytoken.term() + "]");
>>                     mytoken = lowerCaseStream.next (mytoken);
>>                 }
>>                 //lowerCaseStream.reset();  //uncommenting this did not
>> change results.
>>             }
>>             catch  (IOException ioe) {
>>                 System.err.println ("ERROR: " + ioe.toString());
>>             }
>>
> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
> on the keywordTokenStream before using it again.
>
> Regards Ard
>
>>             return (new LowerCaseFilter ( keywordTokenStream ) );
>>         }
>>
>>    ...
I was real excited when I saw your email this morning. However, 
resetting keywordTokenStream as the last line in the "try" resulted in 
no change. I also tried uncommenting the lowerCaseStream.reset line in 
an act of desperation with no difference. I must be missing something 
completely obvious at this point... look at a problem too long and the 
obvious fails to jump out at you...

H. Wilson
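
A plain-Java sketch of why draining the stream for debugging can leave nothing for the indexer, and why reset() might not help: the token stream reads from a java.io.Reader, and a Reader that has been read to the end does not rewind by itself. No Lucene classes are involved and the helper names are hypothetical; the assumption is that the tokenizer's reset() does not rewind the underlying Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical demo: a Reader can be consumed only once. */
class ReaderDrainSketch {

    /** Reads everything that is left in the reader, much as a keyword tokenizer would. */
    static String drain(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = ".North.South.East.West Land";
        Reader reader = new StringReader(text);

        // First pass, e.g. the debug loop that prints the tokens.
        String forDebug = drain(reader);
        // Second pass over the same reader, e.g. the stream handed back to the indexer.
        String forIndexer = drain(reader);

        System.out.println("debug pass:   [" + forDebug + "]");
        System.out.println("indexer pass: [" + forIndexer + "]");

        // One way around it: keep the text and give each consumer its own reader.
        System.out.println("fresh reader: [" + drain(new StringReader(text)) + "]");
    }
}
```

In the analyzer this would mean printing tokens from a separate stream built on its own copy of the text, rather than from the stream that is returned to the caller.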
>> Thanks.
>>
>> H. Wilson
>>
>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wi...@randdss.com>    wrote:
>>>>   Ard,
>>>>
>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>> think
>>>> I was too worn out from my week and too excited to have code that
>>>> "worked"
>>>> to notice the obvious... this must be a workaround. However, I will need
>>>> a
>>>> little guidance on how to inspect the tokens. I have Luke, but never
>>>> really
>>>> understood how to use it properly. Could you give me a clear list of
>>>> steps,
>>>> or point me to a resource I missed, on how I would go about inspecting
>>>> tokens during insert/search? Thanks.
>>> I'd just print them to your console with Token#term() or use a
>>> debugger . If you do that during indexing and searching, I think you
>>> must see some difference in the token that explains *why* Lucene
>>> doesn't find a hit for your usecase with spaces.
>>>
>>> Luke is hard to use for the multi-index jackrabbit indexing, as well
>>> as the field value prefixing: It is unfortunate and not completely
>>> necessary any more but has some historical reasons from Lucene back in
>>> the days when it could not handle very many unique fieldnames
>>>
>>> Regards Ard
>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wi...@randdss.com>
>>>>>   wrote:
>>>>>>   OK, well I got the spaces part figured out, and will post it for
>>>>>> anyone
>>>>>> who
>>>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>>   During testing, I determined that if you performed the following query
>>>>>> for
>>>>>> the exact fullName property:
>>>>>>
>>>>>>     filter.addContains ( @fullName,
>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>
>>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>>> it
>>>>>> would return results:
>>>>>>
>>>>>>    filter.addContains ( @fullName,
>>>>>>    '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>> Lan*"));
>>>>> This does not make sense...see below
>>>>>
>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>> wanted, if a search string contained spaces, did not contain wild cards
>>>>>> and
>>>>>> the user was not concerned with case sensitivity, I used the
>>>>>> fn:lower-case.
>>>>>> So I ended up with the following excerpt (our clients wanted options
>>>>>> for
>>>>>> case sensitive and case insensitive searching) .
>>>>>>
>>>>>> public OurParameter[] getOurParameters (boolean
>>>>>> performCaseSensitiveSearch,
>>>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>>>> fullName
>>>>>>
>>>>>>    .....
>>>>>>
>>>>>>    if ( performCaseSensitiveSearch) {
>>>>>>
>>>>>>        //jcr:like for case sensitive
>>>>>>        filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>
>>>>>>    }
>>>>>>    else {
>>>>>>
>>>>>>        //only use fn:lower-case if there is spaces, with NO wild cards
>>>>>>
>>>>>>        if ( searchTerm.contains (" ") && !searchTerm.contains ("*") && !searchTerm.contains ("?") ) {
>>>>>>
>>>>>>            filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>        else {
>>>>>>
>>>>>>            //jcr:contains for case insensitive
>>>>>>            filter.addContains ( srchField,
>>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>    }
>>>>> This seems to me a workaround around the real problem, because, it
>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>> indexing (just store something) and during searching: just search in
>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>> something with Text.escapeIllegalXpathSearchChars though it seems that
>>>>> it should leave spaces untouched
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>
>>>>>>    ....
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Hope that helps anyone who needs it.
>>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>>>> OK so it looks like I have one other issue. Using the configuration
>>>>>>>> as
>>>>>>>> posted below and sticking to my previous examples, with the addition
>>>>>>>> of
>>>>>>>> one
>>>>>>>> with whitespace. With the following three in our repository:
>>>>>>>>
>>>>>>>>    .North.South.East.WestLand
>>>>>>>>    .North.South.East.West_Land
>>>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>>>
>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards:
>>>>>>>> the
>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>
>>>>>>>>    filter.addContains(@fullName,
>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am not
>>>>>>> sure. Perhaps you can put quotes around it, not sure if that works
>>>>>>> though
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>> creating
>>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e.
>>>>>>>> spaces),
>>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>>
>>>>>>>> H. Wilson

RE: Problems with hyphen in JSR-170 XPath query using jcr:contains

Posted by "Dunstall, Christopher" <CD...@csu.edu.au>.
Hi Ard,

I've returned to this problem after some time away from it...

If you recall, I have two users, Sophie-Anne and Sophie.

//*[@sling:resourceType='sakai/user-home' and (jcr:contains(public/*/*/*/*/*,'*Sophie*') or jcr:contains(public/*/*/*/*,'*Sophie*') or jcr:contains(public/*/*/*,'*Sophie*') or jcr:contains(public/*/*,'*Sophie*') or jcr:contains(public/*,'*Sophie*') or jcr:contains(pages/*/*/*/*/*,'*Sophie*') or jcr:contains(pages/*/*/*/*,'*Sophie*') or jcr:contains(pages/*/*/*,'*Sophie*') or jcr:contains(pages/*/*,'*Sophie*') or jcr:contains(pages/*,'*Sophie*'))] order by @jcr:score descending

Returns both users.

//*[@sling:resourceType='sakai/user-home' and (jcr:contains(public/*/*/*/*/*,'*Sophie-Anne*') or jcr:contains(public/*/*/*/*,'*Sophie-Anne*') or jcr:contains(public/*/*/*,'*Sophie-Anne*') or jcr:contains(public/*/*,'*Sophie-Anne*') or jcr:contains(public/*,'*Sophie-Anne*') or jcr:contains(pages/*/*/*/*/*,'*Sophie-Anne*') or jcr:contains(pages/*/*/*/*,'*Sophie-Anne*') or jcr:contains(pages/*/*/*,'*Sophie-Anne*') or jcr:contains(pages/*/*,'*Sophie-Anne*') or jcr:contains(pages/*,'*Sophie-Anne*'))] order by @jcr:score descending

Returns neither user.

Are you able to tell me how I can see the actual query being passed to Lucene? I need to see how the query is interpreted and executed by Lucene.

The Analyzer method was of no use to me, btw.

Regards,

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
Sent: Friday, 3 September 2010 5:45 PM
To: users@jackrabbit.apache.org; wilsonh@randdss.com
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Hello Wilson,

On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson <wi...@randdss.com> wrote:

> Some successful queries I ran in my unit tests (out of the 1200+ test
> queries I have ...) (all of these were tried once as shown and once as
> "string".toLowerCase() )
>
>   .North.South.East.West*
>   .North.South.East.West-*
>   .North.South.East.West-Land
>   *West-Land
>   .North*
>
>
> Unsuccessful include:
>
>   .North.South.East.West-Lan?
>   .North.South.East.West Land

I didn't look at the code, but I think the analyzer part is just fine. I
suspect the Jackrabbit query parser mangles dashes and spaces. I am,
however, not sure how you could avoid this; I'd have to look into it.
You might want to check what the JackrabbitQueryParser makes of your
'.North.South.East.West-Lan?' or '.North.South.East.West Land'.

Regards Ard
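
To make the suspicion above concrete, here is a hedged plain-Java sketch, not Jackrabbit or Lucene code. If a parser splits the search expression on whitespace before any analyzer runs, a keyword-style analyzer only ever sees the pieces, and neither piece equals the single indexed term. Whether JackrabbitQueryParser behaves this way for this input is exactly what would need checking; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of "split on whitespace first, analyze afterwards". */
class ParserSplitSketch {

    /** Assumed parser behaviour: split on whitespace, then keep each piece as one term. */
    static List<String> parseThenAnalyze(String expression) {
        List<String> terms = new ArrayList<>();
        for (String piece : expression.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece); // keyword analysis leaves each piece unchanged
            }
        }
        return terms;
    }

    /** Space is treated as AND: every query term must equal the single indexed term. */
    static boolean matches(String indexedTerm, String expression) {
        for (String term : parseThenAnalyze(expression)) {
            if (!term.equals(indexedTerm)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Index side: keyword analysis stores each full name as exactly one term.
        String[] fullNames = {
            ".North.South.East.WestLand",
            ".North.South.East.West_Land",
            ".North.South.East.West Land"
        };
        for (String name : fullNames) {
            System.out.println("[" + name + "] -> " + parseThenAnalyze(name)
                    + " match=" + matches(name, name));
        }
    }
}
```

Under this assumption only the name containing a space fails to match itself, which is the pattern reported for the exact-name searches.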

>
>
> Good Luck!
>
> *H. Wilson*
>
>
> On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
>>
>> Just to be clear, the Lowercase Filter makes it even worse, as searching
>> for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
>> filter, you actually got the record.
>>
>> Chris Dunstall | Service Support - Applications
>> Technology Integration/OLE Virtual Team
>> Division of Information Technology | Charles Sturt University | Bathurst,
>> NSW
>>
>> Ph: 02 63384818 | Fax: 02 63384181
>>
>>
>

Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Posted by "H. Wilson" <wi...@randdss.com>.
  Interesting twist... New test cases (no code modifications) show the 
following queries with trailing question marks DO work:

    ?North.South.East.West-Lan?
    *North.South.East.West-Lan?
    .North.*.West-Lan?

While the following one still does not:

    .North.South.East.West-Lan?
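
For reference, a small plain-Java sketch of ordinary wildcard semantics ('?' for exactly one character, '*' for any run of characters) applied to a single-token value. Under these semantics all four patterns above match .North.South.East.West-Land, which would suggest the difference arises before the wildcard is evaluated, for example during query parsing. The class and helper names are hypothetical, and this is not the Jackrabbit or Lucene implementation.

```java
import java.util.regex.Pattern;

/** Hypothetical wildcard matcher used only to state the expected semantics. */
class WildcardSketch {

    /** Translates a wildcard pattern to a regex: '*' becomes ".*", '?' becomes ".", the rest is literal. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True when the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        String term = ".North.South.East.West-Land";
        String[] patterns = {
            "?North.South.East.West-Lan?",
            "*North.South.East.West-Lan?",
            ".North.*.West-Lan?",
            ".North.South.East.West-Lan?"
        };
        for (String pattern : patterns) {
            System.out.println(pattern + " -> " + matches(pattern, term));
        }
    }
}
```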


*H. Wilson*
R&D Software Systems, Inc


On 09/03/2010 03:45 AM, Ard Schrijvers wrote:
> Hello Wilson,
>
> On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson<wi...@randdss.com>  wrote:
>
>> Some successful queries I ran in my unit tests (out of the 1200+ test
>> queries I have ...) (all of these were tried once as shown and once as
>> "string".toLowerCase() )
>>
>>    .North.South.East.West*
>>    .North.South.East.West-*
>>    .North.South.East.West-Land
>>    *West-Land
>>    .North*
>>
>>
>> Unsuccessful include:
>>
>>    .North.South.East.West-Lan?
>>    .North.South.East.West Land
> I didn't look at code, but I think the analyzer part is just fine. I
> suspect the jackrabbit queryparser to mangle dashes and spaces. I am
> how ever not sure how you could avoid this. I'd have to look into it.
> Though, you might want to check the JackrabbitQueryParser what it
> makes of your ' .North.South.East.West-Lan?' or
> '.North.South.East.West Land'
>
> Regards Ard
>
>>
>> Good Luck!
>>
>> *H. Wilson*
>>
>>
>> On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
>>> Just to be clear, the Lowercase Filter makes it even worse, as searching
>>> for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
>>> filter, you actually got the record.
>>>
>>> Chris Dunstall | Service Support - Applications
>>> Technology Integration/OLE Virtual Team
>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>> NSW
>>>
>>> Ph: 02 63384818 | Fax: 02 63384181
>>>
>>>
>>> -----Original Message-----
>>> From: Dunstall, Christopher [mailto:cdunstall@csu.edu.au]
>>> Sent: Thursday, 2 September 2010 2:19 PM
>>> To: users@jackrabbit.apache.org
>>> Subject: RE: Problems with hyphen in JSR-170 XPath query using
>>> jcr:contains
>>>
>>> I've got the customised Analyzer and Tokenizer working, but it seems I'm
>>> back at square one, maybe even further back because now it looks like it's
>>> being case sensitive.
>>>
>>> My Analyzer:
>>>
>>> public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
>>>    private static final Logger LOGGER =
>>> LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
>>>
>>>    public TokenStream tokenStream(String field, final Reader reader) {
>>>      LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ?
>>> reader.toString() : "") + "]");
>>>
>>>      TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
>>>      return keywordTokenStream;
>>>      //return (new LowerCaseFilter(keywordTokenStream));
>>>    }
>>> }
>>>
>>> My HyphenKeywordTokenizer class is practically a direct copy of
>>> KeywordTokenizer, where it emits the entire input as a single token.  As you
>>> can see above, I'm not using the lower case filter, just to see what
>>> happens.
>>>
>>> Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named
>>> 'Bob' 'Arlington-Smythe'.
>>>
>>> A search for 'Sophie-Anne' produces the user's record, however, a search
>>> for 'sophie-anne' does not (returns nothing), as does 'Sophie-A' and now,
>>> even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query
>>> now? From what H. Wilson has found, it doesn't look like it will solve the
>>> problem.
>>>
>>> The query being used is:
>>> //*[@sling:resourceType="sakai/user-profile" and (jcr:contains(.,
>>> 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score
>>> descending]
>>>
>>>
>>> Chris Dunstall | Service Support - Applications
>>> Technology Integration/OLE Virtual Team
>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>> NSW
>>>
>>> Ph: 02 63384818 | Fax: 02 63384181
>>>
>>>
>>> -----Original Message-----
>>> From: H. Wilson [mailto:wilsonh@randdss.com]
>>> Sent: Wednesday, 1 September 2010 6:47 AM
>>> To: users@jackrabbit.apache.org
>>> Subject: Re: Problems with hyphen in JSR-170 XPath query using
>>> jcr:contains
>>>
>>>
>>> On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>>>>> Given the following parameters in the repository:
>>>>>
>>>>>     .North.South.East.WestLand
>>>>>     .North.South.East.West_Land
>>>>>     .North.South.East.West Land    //yes that's a space
>>>>>
>>>>> The following exact name, case sensitive queries worked as expected for
>>>>> each
>>>>> of the three parameters:
>>>>>
>>>>>     filter.orJCRExpression ("jcr:like(@" + srchField
>>>>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case
>>>>> sens.
>>>> jcr:like does not depend on any analyser but on the stored field, so
>>>> this is not strange that it still works.
>>> I expected this too, I just try to be as thorough as possible when
>>> posting anywhere. I am disappointed enough I haven't figured this out on
>>> my own.
>>>>> The following exact name query, case insensitive, worked for only the
>>>>> parameter with a fullName with a whitespace character:
>>>>>
>>>>>     filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>
>>>>> The following exact name queries, case insensitive, stopped working for
>>>>> the fullnames WITHOUT a whitespace character:
>>>>>
>>>>>     filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>
>>>>> Again, the only change I made was to the analyzer, I didn't remove my
>>>>> "workaround" yet, and I just want to confirm I properly changed the
>>>>> analyzer to figure out how the tokens were working. Oh I should note, the
>>>>> output from the Analyzer only showed one Token per field, which I believe
>>>>> is what we were looking for. Which leaves me as perplexed as before.
>>>>>
>>>>> LowerCaseKeywordAnalyzer.java:
>>>>>
>>>>>     ...
>>>>>
>>>>>     public TokenStream tokenStream ( String field, final Reader reader  ) {
>>>>>              System.out.println ("TOKEN STREAM for field: " + field);
>>>>>              TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>>>>
>>>>>              //changed for testing
>>>>>              TokenStream lowerCaseStream =  new LowerCaseFilter ( keywordTokenStream ) ;
>>>>>              final Token reusableToken = new Token();
>>>>>              try {
>>>>>                  Token mytoken = lowerCaseStream.next (reusableToken);
>>>>>                  while ( mytoken != null  ) {
>>>>>                      System.out.println ("[" + mytoken.term() + "]");
>>>>>                      mytoken = lowerCaseStream.next (mytoken);
>>>>>                  }
>>>>>                  //lowerCaseStream.reset();  //uncommenting this did not change results.
>>>>>              }
>>>>>              catch  (IOException ioe) {
>>>>>                  System.err.println ("ERROR: " + ioe.toString());
>>>>>              }
>>>>>
>>>> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
>>>> on the keywordTokenStream before using it again.
>>>>
>>>> Regards Ard
>>>>
>>>>>              return (new LowerCaseFilter ( keywordTokenStream ) );
>>>>>          }
>>>>>
>>>>>     ...
>>> I was real excited when I saw your email this morning. However,
>>> resetting keywordTokenStream as the last line in the "try" resulted in
>>> no change. I also tried uncommenting the lowerCaseStream.reset line in
>>> an act of desperation with no difference. I must be missing something
>>> completely obvious at this point... look at a problem too long and the
>>> obvious fails to jump out at you...
>>>
>>> H. Wilson
>>>>> Thanks.
>>>>>
>>>>> H. Wilson
>>>>>
>>>>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>>>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wi...@randdss.com>
>>>>>> wrote:
>>>>>>>    Ard,
>>>>>>>
>>>>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>>>>> think I was too worn out from my week and too excited to have code that
>>>>>>> "worked" to notice the obvious... this must be a workaround. However, I
>>>>>>> will need a little guidance on how to inspect the tokens. I have Luke, but
>>>>>>> never really understood how to use it properly. Could you give me a clear
>>>>>>> list of steps, or point me to a resource I missed, on how I would go about
>>>>>>> inspecting tokens during insert/search? Thanks.
>>>>>> I'd just print them to your console with Token#term() or use a
>>>>>> debugger. If you do that during indexing and searching, I think you
>>>>>> should see some difference in the tokens that explains *why* Lucene
>>>>>> doesn't find a hit for your use case with spaces.
>>>>>>
>>>>>> Luke is hard to use with Jackrabbit's multi-index setup and its field
>>>>>> value prefixing. The prefixing is unfortunate and no longer strictly
>>>>>> necessary, but it exists for historical reasons: early Lucene versions
>>>>>> could not handle very many unique field names.
>>>>>>
>>>>>> Regards Ard
>>>>>>
>>>>>>> H. Wilson
>>>>>>>
>>>>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wi...@randdss.com>
>>>>>>>>    wrote:
>>>>>>>>> OK, well I got the spaces part figured out, and will post it for anyone
>>>>>>>>> who needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>>>>> During testing, I determined that if you performed the following query
>>>>>>>>> for the exact fullName property:
>>>>>>>>>
>>>>>>>>>     filter.addContains ( @fullName, '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>>>>
>>>>>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>>>>>> it would return results:
>>>>>>>>>
>>>>>>>>>     filter.addContains ( @fullName, '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>>>>> This does not make sense...see below
>>>>>>>>
>>>>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>>>>> wanted, if a search string contained spaces, did not contain wild cards
>>>>>>>>> and the user was not concerned with case sensitivity, I used the
>>>>>>>>> fn:lower-case. So I ended up with the following excerpt (our clients
>>>>>>>>> wanted options for case sensitive and case insensitive searching).
>>>>>>>>>
>>>>>>>>> public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
>>>>>>>>>         String searchTerm, String srchField ) { //srchField in this case was fullName
>>>>>>>>>
>>>>>>>>>     .....
>>>>>>>>>
>>>>>>>>>     if ( performCaseSensitiveSearch) {
>>>>>>>>>
>>>>>>>>>         //jcr:like for case sensitive
>>>>>>>>>         filter.orJCRExpression ("jcr:like(@" + srchField +", '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>>>>
>>>>>>>>>     }
>>>>>>>>>     else {
>>>>>>>>>
>>>>>>>>>         //only use fn:lower-case if there is spaces, with NO wild cards
>>>>>>>>>
>>>>>>>>>         if ( searchTerm.contains (" ") && !searchTerm.contains ("*") && !searchTerm.contains ("?") ) {
>>>>>>>>>
>>>>>>>>>             filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>>>>
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         else {
>>>>>>>>>
>>>>>>>>>             //jcr:contains for case insensitive
>>>>>>>>>             filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>>>>
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>     }
>>>>>>>> This seems to me a workaround for the real problem, because it
>>>>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>>>>> created by your analyser? Make sure you inspect the tokens during
>>>>>>>> indexing (just store something) and during searching (just search in
>>>>>>>> the property). I am quite sure you'll see the issue then. Perhaps it is
>>>>>>>> something with Text.escapeIllegalXpathSearchChars, though it seems that
>>>>>>>> it should leave spaces untouched.
>>>>>>>>
>>>>>>>> Regards Ard
>>>>>>>>
>>>>>>>>
>>>>>>>>>     ....
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hope that helps anyone who needs it.
>>>>>>>>>
>>>>>>>>> H. Wilson
>>>>>>>>>
>>>>>>>>>>> OK so it looks like I have one other issue. Using the configuration as
>>>>>>>>>>> posted below and sticking to my previous examples, with the addition of
>>>>>>>>>>> one with whitespace. With the following three in our repository:
>>>>>>>>>>>
>>>>>>>>>>>     .North.South.East.WestLand
>>>>>>>>>>>     .North.South.East.West_Land
>>>>>>>>>>>     .North.South.East.West Land    //yes that's a space
>>>>>>>>>>>
>>>>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards: the
>>>>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>>>>
>>>>>>>>>>>     filter.addContains(@fullName, '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am not
>>>>>>>>>> sure. Perhaps you can put quotes around it, not sure if that works though
>>>>>>>>>>
>>>>>>>>>> Regards Ard
>>>>>>>>>>
>>>>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be creating
>>>>>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e. spaces),
>>>>>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>>>>>
>>>>>>>>>>> H. Wilson

Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Posted by "H. Wilson" <wi...@randdss.com>.
Now this is interesting. Since today is a busy day for me, I "cheated"
and quickly copied the JackrabbitQueryParser method "parse" into one of
my unit tests so I could compare the Strings at the beginning of the
method and at the end. Given the following query Strings:

    ".North.South.East.West*"
    ".North.South.East.West-*"
    ".North.South.East.West-Lan?"
    ".North.South.East.West-Land"
    ".North.South.East.West Land"  //space
    ".North.South.East.West_Land"

None showed any change at the end of the method. When I tested the
method Text.escapeIllegalXpathSearchChars with the same query strings,
all were also returned unchanged except the one with the trailing
question mark, which was escaped:

    ".North.South.East.West-Lan\?"

So I looked up the Text class and found this comment in the javadoc:

    "Escapes illegal XPath search characters at the end of a string.
    Example:
    A search string like 'test?' will run into a ParseException
    documented in http://issues.apache.org/jira/browse/JCR-1248"

Following the link did not really help. It makes it sound like this is
considered resolved as of 1.5.0 (I am on 2.0.0).
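
The escaping behaviour reported above can be mimicked with a small
stand-alone sketch. This is not the Jackrabbit source: the helper name
escapeTrailingChar and the set of characters it treats as illegal are
assumptions, chosen only so that the six test strings behave as
described (nothing changes except a trailing '?').

```java
public class TrailingEscapeSketch {

    // Assumption: characters treated as illegal when they END a search string.
    private static final String ASSUMED_ILLEGAL = "!(:^[]{}?";

    // Hypothetical stand-in for Text.escapeIllegalXpathSearchChars:
    // only the last character of the string is inspected.
    static String escapeTrailingChar(String s) {
        if (s.isEmpty()) {
            return s;
        }
        char last = s.charAt(s.length() - 1);
        if (ASSUMED_ILLEGAL.indexOf(last) >= 0) {
            // Put a backslash in front of the offending last character.
            return s.substring(0, s.length() - 1) + '\\' + last;
        }
        return s;
    }

    public static void main(String[] args) {
        String[] queries = {
            ".North.South.East.West*",
            ".North.South.East.West-*",
            ".North.South.East.West-Lan?",
            ".North.South.East.West-Land",
            ".North.South.East.West Land",
            ".North.South.East.West_Land"
        };
        for (String q : queries) {
            System.out.println(q + " -> " + escapeTrailingChar(q));
        }
    }
}
```

Note that in this sketch a hyphen or a space in the middle of the string
is never touched, which is consistent with the unchanged results above.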

*H. Wilson*
R&D Software Systems, Inc.


On 09/03/2010 03:45 AM, Ard Schrijvers wrote:
> Hello Wilson,
>
> On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson<wi...@randdss.com>  wrote:
>
>> Some successful queries I ran in my unit tests (out of the 1200+ test
>> queries I have ...) (all of these were tried once as shown and once as
>> "string".toLowerCase() )
>>
>>    .North.South.East.West*
>>    .North.South.East.West-*
>>    .North.South.East.West-Land
>>    *West-Land
>>    .North*
>>
>>
>> Unsuccessful include:
>>
>>    .North.South.East.West-Lan?
>>    .North.South.East.West Land
> I haven't looked at the code, but I think the analyzer part is just fine. I
> suspect the Jackrabbit query parser mangles dashes and spaces. I am,
> however, not sure how you could avoid this; I'd have to look into it.
> You might want to check what the JackrabbitQueryParser makes of your
> '.North.South.East.West-Lan?' or '.North.South.East.West Land'.
>
> Regards Ard
>
>>
>> Good Luck!
>>
>> *H. Wilson*
>>
>>

Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello Wilson,

On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson <wi...@randdss.com> wrote:

> Some successful queries I ran in my unit tests (out of the 1200+ test
> queries I have ...) (all of these were tried once as shown and once as
> "string".toLowerCase() )
>
>   .North.South.East.West*
>   .North.South.East.West-*
>   .North.South.East.West-Land
>   *West-Land
>   .North*
>
>
> Unsuccessful include:
>
>   .North.South.East.West-Lan?
>   .North.South.East.West Land

I haven't looked at the code, but I think the analyzer part is just fine. I
suspect the Jackrabbit query parser mangles dashes and spaces. I am,
however, not sure how you could avoid this; I'd have to look into it.
You might want to check what the JackrabbitQueryParser makes of your
'.North.South.East.West-Lan?' or '.North.South.East.West Land'.
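
To make the suspicion about spaces concrete, here is a toy model; it is
not Jackrabbit or Lucene code. It assumes that the index keeps each
property value as one keyword token, and that the query side splits the
search text on whitespace and requires every piece to equal that token.
The class name and the splitting rule are assumptions for illustration
only, and the model says nothing about the trailing '?' case.

```java
public class KeywordMatchSketch {

    // Assumed query-side behaviour: whitespace separates terms and ALL
    // terms must match. The stored value is indexed whole, as one token.
    static boolean matches(String storedValue, String query) {
        for (String term : query.trim().split("\\s+")) {
            if (!storedValue.equals(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String hyphen = ".North.South.East.West-Land";
        String space = ".North.South.East.West Land";

        // One term, equal to the single stored token.
        System.out.println(matches(hyphen, hyphen));

        // Two terms (".North.South.East.West" and "Land"); neither
        // equals the single stored token that contains the space.
        System.out.println(matches(space, space));
    }
}
```

Under these assumptions a value that contains a space can never be found
by searching for its own text, which would match the failing
'.North.South.East.West Land' query.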

Regards Ard

>
>
> Good Luck!
>
> *H. Wilson*
>
>
> On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
>>
>> Just to be clear, the Lowercase Filter makes it even worse, as searching
>> for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
>> filter, you actually got the record.
>>
>> Chris Dunstall | Service Support - Applications
>> Technology Integration/OLE Virtual Team
>> Division of Information Technology | Charles Sturt University | Bathurst,
>> NSW
>>
>> Ph: 02 63384818 | Fax: 02 63384181
>>
>>
>> -----Original Message-----
>> From: Dunstall, Christopher [mailto:cdunstall@csu.edu.au]
>> Sent: Thursday, 2 September 2010 2:19 PM
>> To: users@jackrabbit.apache.org
>> Subject: RE: Problems with hyphen in JSR-170 XPath query using
>> jcr:contains
>>
>> I've got the customised Analyzer and Tokenizer working, but it seems I'm
>> back at square one, maybe even further back because now it looks like it's
>> being case sensitive.
>>
>> My Analyzer:
>>
>> public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
>>   private static final Logger LOGGER =
>> LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
>>
>>   public TokenStream tokenStream(String field, final Reader reader) {
>>     LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ?
>> reader.toString() : "") + "]");
>>
>>     TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
>>     return keywordTokenStream;
>>     //return (new LowerCaseFilter(keywordTokenStream));
>>   }
>> }
>>
>> My HyphenKeywordTokenizer class is practically a direct copy of
>> KeywordTokenizer, where it emits the entire input as a single token.  As you
>> can see above, I'm not using the lower case filter, just to see what
>> happens.
>>
>> Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named
>> 'Bob' 'Arlington-Smythe'.
>>
>> A search for 'Sophie-Anne' produces the user's record, however, a search
>> for 'sophie-anne' does not (returns nothing), as does 'Sophie-A' and now,
>> even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query
>> now? From what H. Wilson has found, it doesn't look like it will solve the
>> problem.
>>
>> The query being used is:
>> //*[@sling:resourceType="sakai/user-profile" and (jcr:contains(.,
>> 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score
>> descending]
>>
>>
>> Chris Dunstall | Service Support - Applications
>> Technology Integration/OLE Virtual Team
>> Division of Information Technology | Charles Sturt University | Bathurst,
>> NSW
>>
>> Ph: 02 63384818 | Fax: 02 63384181
>>
>>
>> -----Original Message-----
>> From: H. Wilson [mailto:wilsonh@randdss.com]
>> Sent: Wednesday, 1 September 2010 6:47 AM
>> To: users@jackrabbit.apache.org
>> Subject: Re: Problems with hyphen in JSR-170 XPath query using
>> jcr:contains
>>
>>
>> On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>>>>
>>>> Given the following parameters in the repository:
>>>>
>>>>    .North.South.East.WestLand
>>>>    .North.South.East.West_Land
>>>>    .North.South.East.West Land    //yes that's a space
>>>>
>>>> The following exact name, case sensitive queries worked as expected for
>>>> each of the three parameters:
>>>>
>>>>    filter.orJCRExpression ("jcr:like(@" + srchField
>>>>        +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case sens.
>>>
>>> jcr:like does not depend on any analyser but on the stored field, so
>>> it is not strange that it still works.
>>
>> I expected this too, I just try to be as thorough as possible when
>> posting anywhere. I am disappointed enough I haven't figured this out on
>> my own.
>>>>
>>>> The following exact name query, case insensitive, worked for only the
>>>> parameter with a fullName with a whitespace character:
>>>>
>>>>    filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>
>>>> The following exact name queries, case insensitive, stopped working for
>>>> the fullnames WITHOUT a whitespace character:
>>>>
>>>>    filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>
>>>> Again, the only change I made was to the analyzer, I didn't remove my
>>>> "workaround" yet, and I just want to confirm I properly changed the
>>>> analyzer to figure out how the tokens were working. Oh I should note, the
>>>> output from the Analyzer only showed one Token per field, which I believe
>>>> is what we were looking for. Which leaves me as perplexed as before.
>>>>
>>>> LowerCaseKeywordAnalyzer.java:
>>>>
>>>>    ...
>>>>
>>>>    public TokenStream tokenStream ( String field, final Reader reader  ) {
>>>>             System.out.println ("TOKEN STREAM for field: " + field);
>>>>             TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>>>
>>>>             //changed for testing
>>>>             TokenStream lowerCaseStream =  new LowerCaseFilter ( keywordTokenStream ) ;
>>>>             final Token reusableToken = new Token();
>>>>             try {
>>>>                 Token mytoken = lowerCaseStream.next (reusableToken);
>>>>                 while ( mytoken != null  ) {
>>>>                     System.out.println ("[" + mytoken.term() + "]");
>>>>                     mytoken = lowerCaseStream.next (mytoken);
>>>>                 }
>>>>                 //lowerCaseStream.reset();  //uncommenting this did not change results.
>>>>             }
>>>>             catch  (IOException ioe) {
>>>>                 System.err.println ("ERROR: " + ioe.toString());
>>>>             }
>>>>
>>> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
>>> on the keywordTokenStream before using it again.
>>>
>>> Regards Ard
>>>
>>>>             return (new LowerCaseFilter ( keywordTokenStream ) );
>>>>         }
>>>>
>>>>    ...
>>
>> I was real excited when I saw your email this morning. However,
>> resetting keywordTokenStream as the last line in the "try" resulted in
>> no change. I also tried uncommenting the lowerCaseStream.reset line in
>> an act of desperation with no difference. I must be missing something
>> completely obvious at this point... look at a problem too long and the
>> obvious fails to jump out at you...
>>
>> H. Wilson
>>>>
>>>> Thanks.
>>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>>>>
>>>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wi...@randdss.com>
>>>>> wrote:
>>>>>>
>>>>>>   Ard,
>>>>>>
>>>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>>>> think I was too worn out from my week and too excited to have code that
>>>>>> "worked" to notice the obvious... this must be a workaround. However, I
>>>>>> will need a little guidance on how to inspect the tokens. I have Luke, but
>>>>>> never really understood how to use it properly. Could you give me a clear
>>>>>> list of steps, or point me to a resource I missed, on how I would go about
>>>>>> inspecting tokens during insert/search? Thanks.
>>>>>
>>>>> I'd just print them to your console with Token#term() or use a
>>>>> debugger. If you do that during indexing and searching, I think you
>>>>> should see some difference in the tokens that explains *why* Lucene
>>>>> doesn't find a hit for your use case with spaces.
>>>>>
>>>>> Luke is hard to use with Jackrabbit's multi-index setup and its field
>>>>> value prefixing. The prefixing is unfortunate and no longer strictly
>>>>> necessary, but it exists for historical reasons: early Lucene versions
>>>>> could not handle very many unique field names.
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wi...@randdss.com>
>>>>>>>   wrote:
>>>>>>>>
>>>>>>>>   OK, well I got the spaces part figured out, and will post it for
>>>>>>>> anyone
>>>>>>>> who
>>>>>>>> needs it. Putting quotes around the spaces unfortunately did not
>>>>>>>> work.
>>>>>>>>   During testing, I determined that if you performed the following
>>>>>>>> query
>>>>>>>> for
>>>>>>>> the exact fullName property:
>>>>>>>>
>>>>>>>>     filter.addContains ( @fullName,
>>>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>>> Land"));
>>>>>>>>
>>>>>>>> It would return nothing. But tweak it a little and add a wildcard,
>>>>>>>> and
>>>>>>>> it
>>>>>>>> would return results:
>>>>>>>>
>>>>>>>>    filter.addContains ( @fullName,
>>>>>>>>    '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>>> Lan*"));
>>>>>>>
>>>>>>> This does not make sense...see below
>>>>>>>
>>>>>>>> But since I did not want to throw in wild cards where they might not
>>>>>>>> be
>>>>>>>> wanted, if a search string contained spaces, did not contain wild
>>>>>>>> cards
>>>>>>>> and
>>>>>>>> the user was not concerned with case sensitivity, I used the
>>>>>>>> fn:lower-case.
>>>>>>>> So I ended up with the following excerpt (our clients wanted options
>>>>>>>> for
>>>>>>>> case sensitive and case insensitive searching) .
>>>>>>>>
>>>>>>>> public OurParameter[] getOurParameters (boolean
>>>>>>>> performCaseSensitiveSearch,
>>>>>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>>>>>> fullName
>>>>>>>>
>>>>>>>>    .....
>>>>>>>>
>>>>>>>>    if ( performCaseSensitiveSearch) {
>>>>>>>>
>>>>>>>>        //jcr:like for case sensitive
>>>>>>>>        filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>>>
>>>>>>>>    }
>>>>>>>>    else {
>>>>>>>>
>>>>>>>>        //only use fn:lower-case if there is spaces, with NO wild
>>>>>>>> cards
>>>>>>>>
>>>>>>>>        if ( searchTerm.contains (" ")&&         !searchTerm.contains
>>>>>>>> ("*")&&
>>>>>>>>   !searchTerm.contains ("?") ) {
>>>>>>>>
>>>>>>>>            filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>>>>>
>>>>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>>>
>>>>>>>>        }
>>>>>>>>
>>>>>>>>        else {
>>>>>>>>
>>>>>>>>            //jcr:contains for case insensitive
>>>>>>>>            filter.addContains ( srchField,
>>>>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>>>
>>>>>>>>        }
>>>>>>>>
>>>>>>>>    }
>>>>>>>
>>>>>>> This seems to me a workaround around the real problem, because, it
>>>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>>>> indexing (just store something) and during searching: just search in
>>>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>>>> something with Text.escapeIllegalXpathSearchChars though it seems
>>>>>>> that
>>>>>>> it should leave spaces untouched
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>
>>>>>>>>    ....
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Hope that helps anyone who needs it.
>>>>>>>>
>>>>>>>> H. Wilson
>>>>>>>>
>>>>>>>>>> OK so it looks like I have one other issue. Using the
>>>>>>>>>> configuration
>>>>>>>>>> as
>>>>>>>>>> posted below and sticking to my previous examples, with the
>>>>>>>>>> addition
>>>>>>>>>> of
>>>>>>>>>> one
>>>>>>>>>> with whitespace. With the following three in our repository:
>>>>>>>>>>
>>>>>>>>>>    .North.South.East.WestLand
>>>>>>>>>>    .North.South.East.West_Land
>>>>>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>>>>>
>>>>>>>>>> ...using a jcr:contains, with exact name search with NO wild
>>>>>>>>>> cards:
>>>>>>>>>> the
>>>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>>>
>>>>>>>>>>    filter.addContains(@fullName,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>>>>> Land") +"'));
>>>>>>>>>
>>>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am
>>>>>>>>> not
>>>>>>>>> sure. Perhaps you can put quotes around it, not sure if that works
>>>>>>>>> though
>>>>>>>>>
>>>>>>>>> Regards Ard
>>>>>>>>>
>>>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>>>> creating
>>>>>>>>>> one token, plus combined with escaping the Illegal Characters
>>>>>>>>>> (i.e.
>>>>>>>>>> spaces),
>>>>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>>>>
>>>>>>>>>> H. Wilson
>

Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Posted by "H. Wilson" <wi...@randdss.com>.
  Chris,

I am still working on this too, so at best I can make more suggestions. I
am using TestNG to populate my tests with all sorts of wildcards and
whatnot. I added a hyphen example to my tests and was unable to replicate
your problem. I did notice a difference between your tokenStream method
and mine, though I am not sure it is the reason. Instead of your:

    TokenStream keywordTokenStream = new HyphenKeywordTokenizer (reader);

I have:

    TokenStream keywordTokenStream = super.tokenStream (field, reader);

I see you said it was nearly a direct copy of KeywordTokenizer. I just
thought I would suggest trying it that way first, to see whether that
works before customizing. My return statement was identical to your
commented-out one, and it seems to work fine.

You also mentioned it was outputting only one Token, as it should. May I
ask whether outputting the Tokens changed your search results compared
with before you output them? As you may see in previous messages on this
thread, when I began calling TokenStream.next() just to output the Token,
I got different results with a higher failure rate, even if I reset() the
TokenStream. If I commented out the section outputting the Tokens, it
worked as before. In the meantime I had to put a dirty hack/workaround in
my code to make searches containing whitespace function properly. Those
and trailing "?" searches are currently not working for me.
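
One possible reason why printing the Tokens changes the results can be
sketched with plain java.io, without any Lucene classes. A keyword-style
tokenizer takes its text from the Reader handed to tokenStream(), and a
Reader can normally be drained only once. The class name
ReaderConsumedDemo and the helpers readAll and rewind are made up for
illustration. Whether reset() on a particular Lucene TokenStream also
rewinds the underlying Reader is not established in this thread, so treat
this as an assumption to check rather than a diagnosis.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderConsumedDemo {

    // Drain the reader into one String, much as a keyword-style tokenizer
    // turns its whole input into a single token.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rewind the reader. A StringReader goes back to its start when no
    // mark was set; other Reader types may not support reset() at all.
    static void rewind(Reader reader) {
        try {
            reader.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader reader = new StringReader(".North.South.East.West Land");
        System.out.println("[" + readAll(reader) + "]"); // first pass sees the text
        System.out.println("[" + readAll(reader) + "]"); // second pass sees nothing
        rewind(reader);
        System.out.println("[" + readAll(reader) + "]"); // text is back after rewinding
    }
}
```

If the same effect applies inside tokenStream(), the stream returned to
the caller would yield no token once the debugging loop has run, unless
the Reader itself (not only the TokenStream wrapper) is rewound or the
debugging output is done on a separate copy of the text.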

Examples I tested with this last time, which I tweaked to include a 
hyphen test, (values in repo):

    .North.South.East.WestLand
    .North.South.East.West Land
    .North.South.East.West_Land
    .North.South.East.West-Land

Some successful queries I ran in my unit tests (out of the 1200+ test 
queries I have ...) (all of these were tried once as shown and once as 
"string".toLowerCase() )

    .North.South.East.West*
    .North.South.East.West-*
    .North.South.East.West-Land
    *West-Land
    .North*


Unsuccessful include:

    .North.South.East.West-Lan?
    .North.South.East.West Land
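
For comparison, here is a small stand-alone sketch of what one would
expect if every fullName really were indexed as a single lower-cased
token and the query were lower-cased the same way. It uses only
java.util.regex, not Lucene or Jackrabbit, and the class name
KeywordWildcardSketch and its methods are hypothetical. It only assumes
the usual wildcard meaning: "*" matches any run of characters and "?"
matches exactly one character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class KeywordWildcardSketch {

    // Translate a wildcard pattern into a regex: '*' becomes ".*",
    // '?' becomes ".", and everything else is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Match the lower-cased query against each whole lower-cased value,
    // i.e. each value is treated as one keyword token.
    static List<String> search(List<String> values, String query) {
        Pattern pattern = toRegex(query.toLowerCase(Locale.ROOT));
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(value.toLowerCase(Locale.ROOT)).matches()) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> repo = Arrays.asList(
                ".North.South.East.WestLand",
                ".North.South.East.West Land",
                ".North.South.East.West_Land",
                ".North.South.East.West-Land");
        for (String query : Arrays.asList(
                ".North.South.East.West-Lan?",
                ".North.South.East.West Land",
                "*West-Land")) {
            System.out.println(query + " -> " + search(repo, query));
        }
    }
}
```

Under that model both of the unsuccessful queries above would find
exactly one value each, which suggests the failures come from how the
query string is parsed or analyzed before it reaches the index rather
than from the single-token idea itself.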


Good Luck!

*H. Wilson*


On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
> Just to be clear, the Lowercase Filter makes it even worse, as searching for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the filter, you actually got the record.
>
> Chris Dunstall | Service Support - Applications
> Technology Integration/OLE Virtual Team
> Division of Information Technology | Charles Sturt University | Bathurst, NSW
>
> Ph: 02 63384818 | Fax: 02 63384181
>
>
> -----Original Message-----
> From: Dunstall, Christopher [mailto:cdunstall@csu.edu.au]
> Sent: Thursday, 2 September 2010 2:19 PM
> To: users@jackrabbit.apache.org
> Subject: RE: Problems with hyphen in JSR-170 XPath query using jcr:contains
>
> I've got the customised Analyzer and Tokenizer working, but it seems I'm back at square one, maybe even further back because now it looks like it's being case sensitive.
>
> My Analyzer:
>
> public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
>    private static final Logger LOGGER = LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
>
>    public TokenStream tokenStream(String field, final Reader reader) {
>      LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ? reader.toString() : "") + "]");
>
>      TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
>      return keywordTokenStream;
>      //return (new LowerCaseFilter(keywordTokenStream));
>    }
> }
>
> My HyphenKeywordTokenizer class is practically a direct copy of KeywordTokenizer, where it emits the entire input as a single token.  As you can see above, I'm not using the lower case filter, just to see what happens.
>
> Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named 'Bob' 'Arlington-Smythe'.
>
> A search for 'Sophie-Anne' produces the user's record, however, a search for 'sophie-anne' does not (returns nothing), as does 'Sophie-A' and now, even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query now? From what H. Wilson has found, it doesn't look like it will solve the problem.
>
> The query being used is:
> //*[@sling:resourceType="sakai/user-profile" and (jcr:contains(., 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score descending]
>
>
> Chris Dunstall | Service Support - Applications
> Technology Integration/OLE Virtual Team
> Division of Information Technology | Charles Sturt University | Bathurst, NSW
>
> Ph: 02 63384818 | Fax: 02 63384181
>
>
> -----Original Message-----
> From: H. Wilson [mailto:wilsonh@randdss.com]
> Sent: Wednesday, 1 September 2010 6:47 AM
> To: users@jackrabbit.apache.org
> Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
>
>
> On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>>> Given the following parameters in the repository:
>>>
>>>     .North.South.East.WestLand
>>>     .North.South.East.West_Land
>>>     .North.South.East.West Land    //yes that's a space
>>>
>>> The following exact name, case sensitive queries worked as expected for each
>>> of the three parameters:
>>>
>>>     filter.orJCRExpression ("jcr:like(@" + srchField
>>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case sens.
>> jcr:like does not depend on any analyser but on the stored field, so
>> this is not strange that it still works.
> I expected this too, I just try to be as thorough as possible when
> posting anywhere. I am disappointed enough I haven't figured this out on
> my own.
>>> The following exact name query, case insensitive, worked for only the
>>> parameter with a fullName with a whitespace character:
>>>
>>>     filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>
>>> The following exact name queries, case insensitive, stopped working for the
>>> fullnames WITHOUT a whitespace character:
>>>
>>>     filter.addContains ( srchField,
>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>
>>> Again, the only change I made was to the analyzer, I didn't remove my
>>> "workaround" yet, and I just want to confirm I properly changed the analyzer
>>> to figure out how the tokens were working. Oh I should note, the output from
>>> the Analyzer only showed one Token per field, which I believe is what we
>>> were looking for. Which leaves me as perplexed as before.
>>>
>>> LowerCaseKeywordAnalyzer.java:
>>>
>>>     ...
>>>
>>>     public TokenStream tokenStream ( String field, final Reader reader  ) {
>>>              System.out.println ("TOKEN STREAM for field: " + field);
>>>              TokenStream keywordTokenStream = super.tokenStream (field,
>>> reader);
>>>
>>>          //changed for testing
>>>              TokenStream lowerCaseStream =  new LowerCaseFilter (
>>> keywordTokenStream ) ;
>>>              final Token reusableToken = new Token();
>>>              try {
>>>                  Token mytoken = lowerCaseStream.next (reusableToken);
>>>                  while ( mytoken != null  ) {
>>>                      System.out.println ("[" + mytoken.term() + "]");
>>>                      mytoken = lowerCaseStream.next (mytoken);
>>>                  }
>>>                  //lowerCaseStream.reset();  //uncommenting this did not
>>> change results.
>>>              }
>>>              catch  (IOException ioe) {
>>>                  System.err.println ("ERROR: " + ioe.toString());
>>>              }
>>>
>> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
>> on the keywordTokenStream before using it again.
>>
>> Regards Ard
>>
>>>              return (new LowerCaseFilter ( keywordTokenStream ) );
>>>          }
>>>
>>>     ...
> I was real excited when I saw your email this morning. However,
> resetting keywordTokenStream as the last line in the "try" resulted in
> no change. I also tried uncommenting the lowerCaseStream.reset line in
> an act of desperation with no difference. I must be missing something
> completely obvious at this point... look at a problem too long and the
> obvious fails to jump out at you...
>
> H. Wilson
>>> Thanks.
>>>
>>> H. Wilson
>>>
>>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wi...@randdss.com>     wrote:
>>>>>    Ard,
>>>>>
>>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>>> think
>>>>> I was too worn out from my week and too excited to have code that
>>>>> "worked"
>>>>> to notice the obvious... this must be a workaround. However, I will need
>>>>> a
>>>>> little guidance on how to inspect the tokens. I have Luke, but never
>>>>> really
>>>>> understood how to use it properly. Could you give me a clear list of
>>>>> steps,
>>>>> or point me to a resource I missed, on how I would go about inspecting
>>>>> tokens during insert/search? Thanks.
>>>> I'd just print them to your console with Token#term() or use a
>>>> debugger . If you do that during indexing and searching, I think you
>>>> must see some difference in the token that explains *why* Lucene
>>>> doesn't find a hit for your usecase with spaces.
>>>>
>>>> Luke is hard to use for the multi-index jackrabbit indexing, as well
>>>> as the field value prefixing: It is unfortunate and not completely
>>>> necessary any more but has some historical reasons from Lucene back in
>>>> the days when it could not handle very many unique fieldnames
>>>>
>>>> Regards Ard
>>>>
>>>>> H. Wilson
>>>>>
>>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wi...@randdss.com>
>>>>>>    wrote:
>>>>>>>    OK, well I got the spaces part figured out, and will post it for
>>>>>>> anyone
>>>>>>> who
>>>>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>>>    During testing, I determined that if you performed the following query
>>>>>>> for
>>>>>>> the exact fullName property:
>>>>>>>
>>>>>>>      filter.addContains ( @fullName,
>>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>>
>>>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>>>> it
>>>>>>> would return results:
>>>>>>>
>>>>>>>     filter.addContains ( @fullName,
>>>>>>>     '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>> Lan*"));
>>>>>> This does not make sense...see below
>>>>>>
>>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>>> wanted, if a search string contained spaces, did not contain wild cards
>>>>>>> and
>>>>>>> the user was not concerned with case sensitivity, I used the
>>>>>>> fn:lower-case.
>>>>>>> So I ended up with the following excerpt (our clients wanted options
>>>>>>> for
>>>>>>> case sensitive and case insensitive searching) .
>>>>>>>
>>>>>>> public OurParameter[] getOurParameters (boolean
>>>>>>> performCaseSensitiveSearch,
>>>>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>>>>> fullName
>>>>>>>
>>>>>>>     .....
>>>>>>>
>>>>>>>     if ( performCaseSensitiveSearch) {
>>>>>>>
>>>>>>>         //jcr:like for case sensitive
>>>>>>>         filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>>
>>>>>>>     }
>>>>>>>     else {
>>>>>>>
>>>>>>>         //only use fn:lower-case if there is spaces, with NO wild cards
>>>>>>>
>>>>>>>         if ( searchTerm.contains (" ")&&         !searchTerm.contains
>>>>>>> ("*")&&
>>>>>>>    !searchTerm.contains ("?") ) {
>>>>>>>
>>>>>>>             filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>>
>>>>>>>         }
>>>>>>>
>>>>>>>         else {
>>>>>>>
>>>>>>>             //jcr:contains for case insensitive
>>>>>>>             filter.addContains ( srchField,
>>>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>>
>>>>>>>         }
>>>>>>>
>>>>>>>     }
>>>>>> This seems to me a workaround around the real problem, because, it
>>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>>> indexing (just store something) and during searching: just search in
>>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>>> something with Text.escapeIllegalXpathSearchChars though it seems that
>>>>>> it should leave spaces untouched
>>>>>>
>>>>>> Regards Ard
>>>>>>
>>>>>>
>>>>>>>     ....
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> Hope that helps anyone who needs it.
>>>>>>>
>>>>>>> H. Wilson
>>>>>>>
>>>>>>>>> OK so it looks like I have one other issue. Using the configuration
>>>>>>>>> as
>>>>>>>>> posted below and sticking to my previous examples, with the addition
>>>>>>>>> of
>>>>>>>>> one
>>>>>>>>> with whitespace. With the following three in our repository:
>>>>>>>>>
>>>>>>>>>     .North.South.East.WestLand
>>>>>>>>>     .North.South.East.West_Land
>>>>>>>>>     .North.South.East.West Land    //yes that's a space
>>>>>>>>>
>>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards:
>>>>>>>>> the
>>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>>
>>>>>>>>>     filter.addContains(@fullName,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>>>> Land") +"'));
>>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am not
>>>>>>>> sure. Perhaps you can put quotes around it, not sure if that works
>>>>>>>> though
>>>>>>>>
>>>>>>>> Regards Ard
>>>>>>>>
>>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>>> creating
>>>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e.
>>>>>>>>> spaces),
>>>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>>>
>>>>>>>>> H. Wilson

RE: Problems with hyphen in JSR-170 XPath query using jcr:contains

Posted by "Dunstall, Christopher" <cd...@csu.edu.au>.
Just to be clear, the Lowercase Filter makes it even worse, as searching for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the filter, you actually got the record.
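
One thing that may be worth ruling out, sketched below with plain Java
collections rather than Lucene: an exact keyword term only matches when
the text was normalized the same way at index time and at query time. If
the existing index entries were written before the lower-case filter was
added, a lower-cased query term would no longer equal the stored
mixed-case term. The class AnalyzerSymmetrySketch and its methods are
hypothetical, and this thread does not confirm that this is what is
happening here.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

class AnalyzerSymmetrySketch {

    // Two stand-ins for analyzers: one keeps the text as is, one lower-cases it.
    static final UnaryOperator<String> KEEP_CASE = s -> s;
    static final UnaryOperator<String> LOWER_CASE = s -> s.toLowerCase(Locale.ROOT);

    // "Index" each value as a single term after index-time normalization.
    static Set<String> index(UnaryOperator<String> indexTime, String... values) {
        Set<String> terms = new HashSet<>();
        for (String value : values) {
            terms.add(indexTime.apply(value));
        }
        return terms;
    }

    // An exact-term lookup after query-time normalization.
    static boolean matches(Set<String> terms, UnaryOperator<String> queryTime, String query) {
        return terms.contains(queryTime.apply(query));
    }

    public static void main(String[] args) {
        // Index written without the filter, queried with and without it.
        Set<String> oldIndex = index(KEEP_CASE, "Sophie-Anne", "Arlington-Smythe");
        System.out.println(matches(oldIndex, KEEP_CASE, "Sophie-Anne"));
        System.out.println(matches(oldIndex, LOWER_CASE, "Sophie-Anne"));

        // Index rebuilt with the filter, queried with the filter.
        Set<String> newIndex = index(LOWER_CASE, "Sophie-Anne", "Arlington-Smythe");
        System.out.println(matches(newIndex, LOWER_CASE, "sophie-anne"));
    }
}
```

The first and third lookups succeed and the second fails, because only
in the second case do the two sides disagree about case.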

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: Dunstall, Christopher [mailto:cdunstall@csu.edu.au] 
Sent: Thursday, 2 September 2010 2:19 PM
To: users@jackrabbit.apache.org
Subject: RE: Problems with hyphen in JSR-170 XPath query using jcr:contains

I've got the customised Analyzer and Tokenizer working, but it seems I'm back at square one, maybe even further back because now it looks like it's being case sensitive.

My Analyzer:

public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
  private static final Logger LOGGER = LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
  
  public TokenStream tokenStream(String field, final Reader reader) {
    LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ? reader.toString() : "") + "]");
    
    TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
    return keywordTokenStream;
    //return (new LowerCaseFilter(keywordTokenStream));
  }
}

My HyphenKeywordTokenizer class is practically a direct copy of KeywordTokenizer, where it emits the entire input as a single token.  As you can see above, I'm not using the lower case filter, just to see what happens.

Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named 'Bob' 'Arlington-Smythe'.

A search for 'Sophie-Anne' produces the user's record, however, a search for 'sophie-anne' does not (returns nothing), as does 'Sophie-A' and now, even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query now? From what H. Wilson has found, it doesn't look like it will solve the problem.

The query being used is:
//*[@sling:resourceType="sakai/user-profile" and (jcr:contains(., 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score descending]


Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181

