Posted to java-user@lucene.apache.org by "Pillinger, Adrian" <ap...@dolby.co.uk> on 2006/08/10 17:49:49 UTC

Special characters

I am indexing some text in a java object that is "%772B" with the
standard analyser and Lucene 2.

Should I be able to search for this with the same text as the query, or
do I need to do any escaping of characters?

Thanks

Adrian

-----------------------------------------
This message (including any attachments) may contain confidential
information intended for a specific individual and purpose.  If you
are not the intended recipient, delete this message.  If you are
not the intended recipient, disclosing, copying, distributing, or
taking any action based on this message is strictly prohibited.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Special characters

Posted by Adrian Pillinger <ap...@dolby.co.uk>.
Thanks for the replies on my question.

In the end I've taken the StandardAnalyzer grammar, modified it and
generated a new analyser with JavaCC. Seems to be working a treat!

Adrian

On 11 Aug 2006, at 14:32, Erik Hatcher wrote:

>
> On Aug 11, 2006, at 1:23 AM, Martin Braun wrote:
>> Hello Adrian,
>>
>>>> I am indexing some text in a java object that is "%772B" with the
>>>> standard analyser and Lucene 2.
>>>>
>>>> Should I be able to search for this with the same text as the query, or
>>>> do I need to do any escaping of characters?
>>
>> Besides Luke there are the AnalyzerUtils from the LIA book, (you can
>> download the source code examples here:
>> http://www.lucenebook.com/LuceneInAction.zip
>
> You can also try out analysis just using "ant AnalyzerDemo", like this:
>
> $ ant AnalyzerDemo
> Buildfile: build.xml
>
> check-environment:
>
> compile:
>
> build-test-index:
>
> build-perf-index:
>
> prepare:
>
> AnalyzerDemo:
>      [echo]
>      [echo]       Demonstrates analysis of sample text.
>      [echo]
>      [echo]       Refer to the "Analysis" chapter for much more on this
>      [echo]       extremely crucial topic.
>      [echo]
>     [input] Press return to continue...
>
>     [input] String to analyze: [This string will be analyzed.]
> %772B
>      [echo] Running lia.analysis.AnalyzerDemo...
>      [java] Analyzing "%772B"
>      [java]   WhitespaceAnalyzer:
>      [java]     [%772B]
>
>      [java]   SimpleAnalyzer:
>      [java]     [b]
>
>      [java]   StopAnalyzer:
>      [java]     [b]
>
>      [java]   StandardAnalyzer:
>      [java]     [772b]
>
>
> BUILD SUCCESSFUL
> Total time: 7 seconds
>
>
>




Re: Special characters

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 11, 2006, at 1:23 AM, Martin Braun wrote:
> Hello Adrian,
>
>>> I am indexing some text in a java object that is "%772B" with the
>>> standard analyser and Lucene 2.
>>>
>>> Should I be able to search for this with the same text as the query, or
>>> do I need to do any escaping of characters?
>
> Besides Luke there are the AnalyzerUtils from the LIA book, (you can
> download the source code examples here:
> http://www.lucenebook.com/LuceneInAction.zip

You can also try out analysis just using "ant AnalyzerDemo", like this:

$ ant AnalyzerDemo
Buildfile: build.xml

check-environment:

compile:

build-test-index:

build-perf-index:

prepare:

AnalyzerDemo:
      [echo]
      [echo]       Demonstrates analysis of sample text.
      [echo]
      [echo]       Refer to the "Analysis" chapter for much more on this
      [echo]       extremely crucial topic.
      [echo]
     [input] Press return to continue...

     [input] String to analyze: [This string will be analyzed.]
%772B
      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "%772B"
      [java]   WhitespaceAnalyzer:
      [java]     [%772B]

      [java]   SimpleAnalyzer:
      [java]     [b]

      [java]   StopAnalyzer:
      [java]     [b]

      [java]   StandardAnalyzer:
      [java]     [772b]


BUILD SUCCESSFUL
Total time: 7 seconds
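
The differences in the output above come down to which characters each
analyzer treats as part of a token. As a rough illustration only, the
following plain-Java sketch imitates three of the strategies without using
Lucene at all. The class and method names (TokenizerSketch, whitespaceLike,
lettersOnly, alphanumeric) are made up for this example, and the real
StandardAnalyzer applies a much richer JavaCC grammar plus stop-word
removal, so treat this as an approximation that merely agrees on this input.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of three tokenizing strategies; NOT Lucene code.
class TokenizerSketch {

    // Similar in spirit to WhitespaceAnalyzer: split on whitespace only.
    static List<String> whitespaceLike(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.trim().split("\\s+")) {
            if (part.length() > 0) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Similar in spirit to SimpleAnalyzer: lowercased runs of letters.
    static List<String> lettersOnly(String text) {
        return runs(text, false);
    }

    // Crude stand-in for StandardAnalyzer on simple input:
    // lowercased runs of letters and digits.
    static List<String> alphanumeric(String text) {
        return runs(text, true);
    }

    // Collects maximal runs of "kept" characters, lowercasing as it goes.
    private static List<String> runs(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetter(c)
                    || (keepDigits && Character.isDigit(c));
            if (keep) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "%772B";
        System.out.println(whitespaceLike(text)); // [%772B]
        System.out.println(lettersOnly(text));    // [b]
        System.out.println(alphanumeric(text));   // [772b]
    }
}
```

The '%' survives only when the tokenizer splits on whitespace alone, which
is why a query for "%772B" cannot match a term that was indexed as "772b"
unless the query goes through the same analysis.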




Re: Special characters

Posted by Martin Braun <mb...@uni-hd.de>.
Hello Adrian,

>> I am indexing some text in a java object that is "%772B" with the
>> standard analyser and Lucene 2.
>>
>> Should I be able to search for this with the same text as the query, or
>> do I need to do any escaping of characters?

Besides Luke, there is the AnalyzerUtils class from the LIA book (you can
download the source code examples here:
http://www.lucenebook.com/LuceneInAction.zip).

You'll just have to customize the test-class and you'll get an output
like this:


Analzying "%772B"
	org.apache.lucene.analysis.standard.StandardAnalyzer:
		[772b]


1: [772b:1->5:<ALPHANUM>]

1: [772b]


 Analzying "%772B"
	org.apache.lucene.analysis.KeywordAnalyzer:
		[%772B]


1: [%772B:0->5:word]

1: [%772B]

hth,
martin




Re: Special characters

Posted by Erick Erickson <er...@gmail.com>.
See below...

On 8/10/06, Pillinger, Adrian <ap...@dolby.co.uk> wrote:
>
> I am indexing some text in a java object that is "%772B" with the
> standard analyser and Lucene 2.
>
> Should I be able to search for this with the same text as the query, or
> do I need to do any escaping of characters?


Probably not, because I doubt that you'll have the '%' in the index (but I
admit I don't know for sure). Get Luke and check to be sure
(http://www.getopt.org/luke/). That will tell you exactly what is in the
index. I suspect you'll find "772B" but the '%' will simply be absent.

Also, watch capitalization. The StandardAnalyzer lowercases your stream as I
remember....
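
On the escaping half of the question: '%' is not among the characters that
the Lucene query parser syntax documents as special
(+ - && || ! ( ) { } [ ] ^ " ~ * ? : \), so the query string itself should
not need escaping; the mismatch comes from analysis. QueryParser also has a
static escape(String) helper for terms that do contain those characters.
The sketch below is a stand-alone imitation of that idea, not Lucene's
implementation. The class name and the exact character list are
assumptions, and escaping single '&' and '|' is a deliberately
conservative choice.

```java
// Stand-alone imitation of query-syntax escaping; NOT Lucene's own code.
class QueryEscapeSketch {

    // Assumed list of characters the query syntax treats as special.
    private static final String SPECIAL = "\\+-!(){}[]^\"~*?:&|";

    // Prefixes every special character with a backslash.
    static String escape(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("%772B"));   // %772B (nothing to escape)
        System.out.println(escape("(1+1):2")); // \(1\+1\)\:2
    }
}
```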

You probably want a different analyzer for *both* indexing and searching if
you really need to search such strings. Try WhitespaceAnalyzer, and perhaps
store your values UN_TOKENIZED (but watch the latter: this assumes you're
controlling your tokens yourself and not relying on the analyzer to break up
your input stream).

And if you want to treat different fields differently, think about
PerFieldAnalyzerWrapper.
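
To make the per-field idea concrete, here is a small plain-Java sketch of
the dispatch involved: a default analysis function plus overrides keyed by
field name. It does not use the Lucene API; PerFieldSketch, keyword,
alphanumeric and the field names "code" and "body" are all hypothetical
and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration of per-field analysis; NOT the Lucene API.
class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField =
            new HashMap<String, Function<String, List<String>>>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers an override for one field name.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Uses the field's override if present, otherwise the default.
    List<String> analyze(String field, String text) {
        Function<String, List<String>> analyzer = perField.get(field);
        return (analyzer != null ? analyzer : defaultAnalyzer).apply(text);
    }

    // Keeps the whole value as a single, unchanged token.
    static List<String> keyword(String text) {
        return Arrays.asList(text);
    }

    // Lowercased runs of letters and digits.
    static List<String> alphanumeric(String text) {
        List<String> tokens = new ArrayList<String>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (String part : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (part.length() > 0) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldSketch analyzers = new PerFieldSketch(PerFieldSketch::alphanumeric);
        analyzers.addAnalyzer("code", PerFieldSketch::keyword);
        System.out.println(analyzers.analyze("code", "%772B")); // [%772B]
        System.out.println(analyzers.analyze("body", "%772B")); // [772b]
    }
}
```

Whatever mechanism is used, the same per-field mapping has to be applied
at both index time and query time, otherwise the terms will not line up.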

Best
Erick


> Thanks
>
> Adrian
>