
Posted to java-user@lucene.apache.org by cbowditch <bo...@hotmail.com> on 2009/07/29 11:56:13 UTC

$ or £ symbols are excluded from Search Query

Hi All,

I am using Lucene 2.2.0 and have created an index containing text with
values in $, £ and euros. However, I can't search for text that includes $, £
or euro. I checked the index with Luke and can see the $ and £ symbols in
the index, but when I ask Luke to explain the structure of the query, it always
excludes the $ and £ symbols from the query. I read the help on special
characters, which said \ can be used to escape them. However, $ and £ are not
listed as special characters, and escaping them made no difference. Can anyone
tell me how I can search my index for $ or £?

Thanks,

Chris
-- 
View this message in context: http://www.nabble.com/%24-or-%C2%A3-symbols-are-excluded-from-Search-Query-tp24716042p24716042.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: $ or £ symbols are excluded from Search Query

Posted by Erick Erickson <er...@gmail.com>.

When you say "using Luke", you're also using a particular analyzer. I forget
which one Luke defaults to, but it may well be stripping out your
special characters.

It's all about your analyzer, and I'm 90% certain you're using one that
strips out these characters when *querying*. You can make Luke
use a different analyzer: there's a drop-down that lets you select one.

HTH
Erick


Re: $ or £ symbols are excluded from Search Query

Posted by Erick Erickson <er...@gmail.com>.

WhitespaceAnalyzer won't fold case. It won't strip any "odd" characters out.
It won't, in fact, do anything except break on white space. You might want
to write your own analyzer that incorporates some of the filters,
especially LowerCaseFilter.
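
As a rough illustration of the token stream such an analyzer would produce, here is a minimal plain-Java sketch. It does not use Lucene at all: the class and method names (WhitespaceLowerCaseSketch, whitespaceTokens, lowerCase) are hypothetical stand-ins for a whitespace tokenizer followed by a lowercasing filter, and the sample text is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java stand-in, NOT Lucene code: it only mimics the token output of a
// whitespace tokenizer followed by a lowercasing filter.
class WhitespaceLowerCaseSketch {

    // Split on runs of whitespace only; every other character is kept.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Lowercase each token; symbols such as $ and the pound sign are untouched.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, written as an escape to avoid encoding issues.
        String text = "Price: $100 or \u00A380 TODAY";
        System.out.println(whitespaceTokens(text));
        System.out.println(lowerCase(whitespaceTokens(text)));
    }
}
```

In a real analyzer the same two steps would be chained inside the Analyzer subclass. The point here is only that lowercasing the whitespace-separated tokens leaves $ and £ in place.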

On Wed, Jul 29, 2009 at 9:04 AM, cbowditch <bo...@hotmail.com> wrote:

> Thanks for the suggestions. I had tried SimpleAnalyzer and StandardAnalyzer
> within Luke. When I switched to WhitespaceAnalyzer the $ and £ symbols were
> maintained.
>
> Within my own application we seem to be using a custom Analyzer that
> subclasses Analyzer. What is the implication of switching the base class to
> WhitespaceAnalyzer?
>
> Thanks,
>
> Chris

Re: $ or £ symbols are excluded from Search Query

Posted by AHMET ARSLAN <io...@yahoo.com>.


> Within my own application we seem to be using a custom
> Analyzer that subclasses Analyzer. What is the implication of switching the
> base class to WhitespaceAnalyzer?

You said that you can see those characters in the Lucene index, right? If so, you can use the same custom analyzer for query parsing that you use for indexing without problems, because it means the analyzer didn't eat those characters during indexing. I am assuming your fields are analyzed/tokenized fields.
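
To make the reasoning concrete, here is a small plain-Java sketch of why index-time and query-time analysis need to agree. It is not Lucene code: the two toy "analyzers" (keepSymbols, lettersOnly), the matches helper and the sample text are all hypothetical, and a set of strings stands in for the index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in, NOT Lucene code: two toy "analyzers" and a set of
// indexed terms, to show why index-time and query-time analysis must agree.
class AnalyzerMismatchSketch {

    // Toy whitespace analysis: split on whitespace, keep all other characters.
    static List<String> keepSymbols(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Toy letter-only analysis: anything that is not a letter ends a token.
    static List<String> lettersOnly(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // A query matches here only if every analyzed query term was indexed.
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Index-time analysis kept the symbols (\u00A3 is the pound sign).
        Set<String> indexed = new HashSet<>(keepSymbols("$ahmet$ paid \u00A380"));
        // Same analysis at query time: the term is found.
        System.out.println(matches(indexed, keepSymbols("$ahmet$")));
        // Different analysis at query time: "$ahmet$" becomes "ahmet".
        System.out.println(matches(indexed, lettersOnly("$ahmet$")));
    }
}
```

Run as written, this prints true and then false: the second lookup fails because the query term was reduced to "ahmet" while the indexed term is still "$ahmet$".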

If you want to build an analyzer from scratch, the Lucene in Action book says a lot about it.


      


Re: $ or £ symbols are excluded from Search Query

Posted by cbowditch <bo...@hotmail.com>.



Ahmet Arslan wrote:
> 
> 
>> Can anyone tell me how I can search my index for $ or £.
>  
> The $, £ and euro characters are not among the reserved characters in
> QueryParser. I just verified it using the code below (in Lucene 2.4.1):
> 
> org.apache.lucene.queryParser.QueryParser qp = new
> org.apache.lucene.queryParser.QueryParser("title", new
> WhitespaceAnalyzer());
> Query q = qp.parse("$ahmet$ AND £arslan£ te$s£t");
> System.out.println(q.toString());
> 
> The output is: +title:$ahmet$ +title:£arslan£ title:te$s£t
> 
> Probably your analyzer is eating up those characters. Are you using
> StandardAnalyzer or SimpleAnalyzer? LetterTokenizer and StandardTokenizer
> break/split words at those characters. If that's the cause of the
> problem, use something like WhitespaceAnalyzer or construct your queries
> programmatically using the Lucene Query API, e.g. TermQuery.
> 

Thanks for the suggestions. I had tried SimpleAnalyzer and StandardAnalyzer
within Luke. When I switched to WhitespaceAnalyzer the $ and £ symbols were
maintained.

Within my own application we seem to be using a custom Analyzer that
subclasses Analyzer. What is the implication of switching the base class to
WhitespaceAnalyzer?

Thanks,

Chris
-- 
View this message in context: http://www.nabble.com/%24-or-%C2%A3-symbols-are-excluded-from-Search-Query-tp24716042p24718799.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: $ or £ symbols are excluded from Search Query

Posted by AHMET ARSLAN <io...@yahoo.com>.

> Can anyone tell me how I can search my index for $ or £.
 
The $, £ and euro characters are not among the reserved characters in QueryParser. I just verified it using the code below (in Lucene 2.4.1):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser qp = new QueryParser("title", new WhitespaceAnalyzer());
Query q = qp.parse("$ahmet$ AND £arslan£ te$s£t");
System.out.println(q.toString());

The output is: +title:$ahmet$ +title:£arslan£ title:te$s£t

Probably your analyzer is eating up those characters. Are you using StandardAnalyzer or SimpleAnalyzer? LetterTokenizer and StandardTokenizer break/split words at those characters. If that's the cause of the problem, use something like WhitespaceAnalyzer or construct your queries programmatically using the Lucene Query API, e.g. TermQuery.
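
The difference between the two splitting styles can be seen in a short plain-Java sketch. This is not Lucene code and only imitates the behaviour described above: letterTokens treats every non-letter as a separator (a simplification; StandardTokenizer's actual rules are more involved), while whitespaceTokens splits on whitespace only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in, NOT Lucene code: it imitates the splitting rules
// described above so the difference in output is visible.
class TokenizerComparisonSketch {

    // Letter-based splitting: a token is a maximal run of letters, so
    // currency symbols and digits act as separators and are dropped.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Whitespace-based splitting: only whitespace separates tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, written as an escape to avoid encoding issues.
        String query = "$ahmet$ te$s\u00A3t";
        System.out.println(letterTokens(query));
        System.out.println(whitespaceTokens(query));
    }
}
```

For the sample query, the letter-based version yields the tokens ahmet, te, s and t, while the whitespace-based version keeps $ahmet$ and te$s£t intact.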

Also, why not switch to the latest version? The Lucene wiki pages on improving indexing and searching speed [1][2] advise using the latest version.

[1] http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
[2] http://wiki.apache.org/lucene-java/ImproveSearchingSpeed




      
