Posted to java-user@lucene.apache.org by "Pillinger, Adrian" <ap...@dolby.co.uk> on 2006/08/10 17:49:49 UTC
Special characters
I am indexing some text in a java object that is "%772B" with the
standard analyser and Lucene 2.
Should I be able to search for this with the same text as the query, or
do I need to do any escaping of characters?
Thanks
Adrian
-----------------------------------------
This message (including any attachments) may contain confidential
information intended for a specific individual and purpose. If you
are not the intended recipient, delete this message. If you are
not the intended recipient, disclosing, copying, distributing, or
taking any action based on this message is strictly prohibited.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Special characters
Posted by Adrian Pillinger <ap...@dolby.co.uk>.
Thanks for the replies on my question.
In the end I've taken the StandardAnalyzer grammar, modified it, and
generated a new analyser with JavaCC. Seems to be working a treat!
Adrian
On 11 Aug 2006, at 14:32, Erik Hatcher wrote:
>
> On Aug 11, 2006, at 1:23 AM, Martin Braun wrote:
>> Hello Adrian,
>>
>>>> I am indexing some text in a java object that is "%772B" with the
>>>> standard analyser and Lucene 2.
>>>>
>>>> Should I be able to search for this with the same text as the
>>>> query, or
>>>> do I need to do any escaping of characters?
>>
>> Besides Luke there are the AnalyzerUtils from the LIA book, (you can
>> download the source code examples here:
>> http://www.lucenebook.com/LuceneInAction.zip
>
> You can also try out analysis just using "ant AnalyzerDemo", like
> this:
>
> $ ant AnalyzerDemo
> Buildfile: build.xml
>
> check-environment:
>
> compile:
>
> build-test-index:
>
> build-perf-index:
>
> prepare:
>
> AnalyzerDemo:
> [echo]
> [echo] Demonstrates analysis of sample text.
> [echo]
> [echo] Refer to the "Analysis" chapter for much more on
> this
> [echo] extremely crucial topic.
> [echo]
> [input] Press return to continue...
>
> [input] String to analyze: [This string will be analyzed.]
> %772B
> [echo] Running lia.analysis.AnalyzerDemo...
> [java] Analyzing "%772B"
> [java] WhitespaceAnalyzer:
> [java] [%772B]
>
> [java] SimpleAnalyzer:
> [java] [b]
>
> [java] StopAnalyzer:
> [java] [b]
>
> [java] StandardAnalyzer:
> [java] [772b]
>
>
> BUILD SUCCESSFUL
> Total time: 7 seconds
Re: Special characters
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 11, 2006, at 1:23 AM, Martin Braun wrote:
> Hello Adrian,
>
>>> I am indexing some text in a java object that is "%772B" with the
>>> standard analyser and Lucene 2.
>>>
>>> Should I be able to search for this with the same text as the
>>> query, or
>>> do I need to do any escaping of characters?
>
> Besides Luke there are the AnalyzerUtils from the LIA book, (you can
> download the source code examples here:
> http://www.lucenebook.com/LuceneInAction.zip
You can also try out analysis just using "ant AnalyzerDemo", like this:
$ ant AnalyzerDemo
Buildfile: build.xml
check-environment:
compile:
build-test-index:
build-perf-index:
prepare:
AnalyzerDemo:
[echo]
[echo] Demonstrates analysis of sample text.
[echo]
[echo] Refer to the "Analysis" chapter for much more on this
[echo] extremely crucial topic.
[echo]
[input] Press return to continue...
[input] String to analyze: [This string will be analyzed.]
%772B
[echo] Running lia.analysis.AnalyzerDemo...
[java] Analyzing "%772B"
[java] WhitespaceAnalyzer:
[java] [%772B]
[java] SimpleAnalyzer:
[java] [b]
[java] StopAnalyzer:
[java] [b]
[java] StandardAnalyzer:
[java] [772b]
BUILD SUCCESSFUL
Total time: 7 seconds
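To make the demo output above easier to reason about, here is a minimal plain-Java sketch of the three token-splitting behaviours. It does not use Lucene at all: the class and method names are hypothetical, and the rules are rough approximations of what WhitespaceAnalyzer, SimpleAnalyzer and StandardAnalyzer do to this particular input, not their real implementations (StandardAnalyzer's grammar in particular is far richer).

```java
import java.util.ArrayList;
import java.util.List;

public class AnalyzerSketch {

    /** Splits on whitespace only and keeps every other character (roughly WhitespaceAnalyzer). */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    /** Keeps runs of letters, lowercased (roughly SimpleAnalyzer). */
    static List<String> letterTokens(String text) {
        return runs(text, false);
    }

    /** Keeps runs of letters and digits, lowercased (a crude stand-in for StandardAnalyzer). */
    static List<String> alphanumericTokens(String text) {
        return runs(text, true);
    }

    // Collects maximal runs of "kept" characters; anything else ends the current token.
    private static List<String> runs(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean keep = Character.isLetter(c) || (keepDigits && Character.isDigit(c));
            if (keep) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String text = "%772B";
        System.out.println("whitespace:   " + whitespaceTokens(text));   // [%772B]
        System.out.println("letters:      " + letterTokens(text));       // [b]
        System.out.println("alphanumeric: " + alphanumericTokens(text)); // [772b]
    }
}
```

The '%' only survives when nothing but whitespace separates tokens, which matches the WhitespaceAnalyzer line in the output above.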
Re: Special characters
Posted by Martin Braun <mb...@uni-hd.de>.
Hello Adrian,
>> I am indexing some text in a java object that is "%772B" with the
>> standard analyser and Lucene 2.
>>
>> Should I be able to search for this with the same text as the query, or
>> do I need to do any escaping of characters?
Besides Luke there are the AnalyzerUtils from the LIA book (you can
download the source code examples here:
http://www.lucenebook.com/LuceneInAction.zip).
You'll just have to customize the test class and you'll get output
like this:
Analzying "%772B"
org.apache.lucene.analysis.standard.StandardAnalyzer:
[772b]
1: [772b:1->5:<ALPHANUM>]
1: [772b]
Analzying "%772B"
org.apache.lucene.analysis.KeywordAnalyzer:
[%772B]
1: [%772B:0->5:word]
1: [%772B]
hth,
martin
Re: Special characters
Posted by Erick Erickson <er...@gmail.com>.
See below...
On 8/10/06, Pillinger, Adrian <ap...@dolby.co.uk> wrote:
>
> I am indexing some text in a java object that is "%772B" with the
> standard analyser and Lucene 2.
>
> Should I be able to search for this with the same text as the query, or
> do I need to do any escaping of characters?
Probably not, because I doubt that you'll have the '%' in the index (but I
admit I don't know for sure). Get Luke and check (
http://www.getopt.org/luke/). That will tell you exactly what is in the
index. I suspect you'll find "772B" but the '%' will simply be absent.
Also, watch capitalization. The StandardAnalyzer lowercases your stream, as I
remember.
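On the escaping half of the question: as far as I can tell from the query syntax documentation, '%' is not one of the query parser's special characters, so any loss of the '%' would come from analysis rather than from query syntax. The following is a minimal plain-Java sketch of backslash-escaping, assuming the special characters are the ones that documentation lists (+ - && || ! ( ) { } [ ] ^ " ~ * ? : \). The class name is hypothetical, and escaping a single '&' or '|' is more conservative than strictly needed. In real code Lucene's own QueryParser.escape would be the thing to reach for.

```java
public class QueryEscapeSketch {

    // Characters treated as special here; an assumption taken from the
    // Lucene query syntax documentation, not from Lucene's source code.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    /** Prefixes every special character with a backslash. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) out.append('\\');
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("%772B")); // unchanged: '%' is not in the list
        System.out.println(escape("a+b:c")); // '+' and ':' gain a backslash
    }
}
```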
You probably want a different analyzer for *both* indexing and searching if
you really need to search such strings. Try WhitespaceAnalyzer, and perhaps
store your values UN_TOKENIZED (but watch the latter: it assumes you're
controlling your tokens yourself and not relying on the analyzer to break up
your input stream).
And if you want to treat different fields differently, think about
PerFieldAnalyzerWrapper.
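The per-field idea can be modelled in a few lines of plain Java: one default analysis function plus per-field overrides. This is only a sketch of the concept, not Lucene's PerFieldAnalyzerWrapper API; the class, the field names "code" and "body", and both analysis functions are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {

    /** Whole value as a single unchanged token (the idea behind UN_TOKENIZED). */
    static List<String> keyword(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }

    /** Lowercased runs of letters and digits; a crude stand-in for StandardAnalyzer. */
    static List<String> alphanumeric(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();
    private final Function<String, List<String>> defaultAnalyzer;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    /** Uses the field's own analyzer if one was registered, otherwise the default. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::alphanumeric);
        wrapper.addAnalyzer("code", PerFieldSketch::keyword);
        System.out.println(wrapper.analyze("code", "%772B")); // [%772B]
        System.out.println(wrapper.analyze("body", "%772B")); // [772b]
    }
}
```

Whichever rule a field gets, the point is that the same rule is applied when indexing and when analysing the query text, so "%772B" produces the same token on both sides.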
Best
Erick
> Thanks
>
> Adrian