You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by syedfa <fa...@gmail.com> on 2009/12/27 23:48:13 UTC

Basic question about indexing certain words

Dear fellow Java developers:

I have a very basic question about indexing text using Lucene.  I am
indexing a large amount of text, that includes names that contain certain
punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.)  Will the punctuation
throw off the indexer in any way, such that it breaks up the tokens when
they shouldn't be, or will the indexer simply treat the punctuation inside
the names as any other character, and the presence of the punctuation will
not in any way hinder a user's ability to search for that name?  Are there
any precautions that I should take to avoid any problems?  

I hope this question is clear and makes sense. 

Thanks in advance to all who reply.

-- 
View this message in context: http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Basic question about indexing certain words

Posted by syedfa <fa...@gmail.com>.
Thanks very much for such a detailed reply, I didn't realize that there was
so much to this subject.  I understand the issue a bit better now!

Take care.


Erick Erickson wrote:
> 
> It depends completely on what analyzer you use. Conceptually, an Analyzer
> is composed of a Tokenizer followed by any number of Filters. So the
> input stream is broken up by the Tokenizer, then each token has one or
> more Filters applied (e.g. LowerCaseFilter, StopWordFilter)..
> 
> The reason I'm not answering your question directly is that I can't. If
> you
> choose, say, a WhitespaceAnalyzer, which is built from a
> WhitespaceTokenizer,
> then your hyphens and apostrophes will pass through as-is, and your tokens
> (the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
> capitals
> and all.
> 
> If you choose StandardAnalyzer, built on StandardTokenizer and several
> filters
> your tokens would be "jane" "doe" "smith" "sa" "eed" (note lower-casing as
> well).
> 
> You can build your own Analyzers to process text however you please.
> Lucene
> In Action has quite a thorough explanation of this process, you'll save
> yourself
> a bunch of time by reading those sections. You can get the second edition
> of that book in electronic form from Manning through their early access
> program.
> 
> Until you understand this process well, I'd recommend that you be very,
> very
> sure that you use the *same* analyzer for both indexing and searching or
> your
> results will be...surprising.
> 
> Think about getting a copy of Luke to examine your indexes, that tool
> makes
> it
> easy to see the effects of various Analyzers. Google Lucene Luke....
> 
> Finally, you can easily use *different* analyzers for different fields
> within a
> document, see PerFieldAnalyzerWrapper.
> 
> HTH
> Erick
> 
> On Sun, Dec 27, 2009 at 5:48 PM, syedfa <fa...@gmail.com> wrote:
> 
>>
>> Dear fellow Java developers:
>>
>> I have a very basic question about indexing text using Lucene.  I am
>> indexing a large amount of text, that includes names that contain certain
>> punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.)  Will the punctuation
>> throw off the indexer in any way, such that it breaks up the tokens when
>> they shouldn't be, or will the indexer simply treat the punctuation
>> inside
>> the names as any other character, and the presence of the punctuation
>> will
>> not in any way hinder a user's ability to search for that name?  Are
>> there
>> any precautions that I should take to avoid any problems?
>>
>> I hope this question is clear and makes sense.
>>
>> Thanks in advance to all who reply.
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26938117.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Basic question about indexing certain words

Posted by Erick Erickson <er...@gmail.com>.
It depends completely on what analyzer you use. Conceptually, an Analyzer
is composed of a Tokenizer followed by any number of Filters. So the
input stream is broken up by the Tokenizer, then each token has one or
more Filters applied (e.g. LowerCaseFilter, StopWordFilter)..

The reason I'm not answering your question directly is that I can't. If you
choose, say, a WhitespaceAnalyzer, which is built from a
WhitespaceTokenizer,
then your hyphens and apostrophes will pass through as-is, and your tokens
(the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
capitals
and all.

If you choose StandardAnalyzer, built on StandardTokenizer and several
filters
your tokens would be "jane" "doe" "smith" "sa" "eed" (note lower-casing as
well).

You can build your own Analyzers to process text however you please. Lucene
In Action has quite a thorough explanation of this process, you'll save
yourself
a bunch of time by reading those sections. You can get the second edition
of that book in electronic form from Manning through their early access
program.

Until you understand this process well, I'd recommend that you be very, very
sure that you use the *same* analyzer for both indexing and searching or
your
results will be...surprising.

Think about getting a copy of Luke to examine your indexes, that tool makes
it
easy to see the effects of various Analyzers. Google Lucene Luke....

Finally, you can easily use *different* analyzers for different fields
within a
document, see PerFieldAnalyzerWrapper.

HTH
Erick

On Sun, Dec 27, 2009 at 5:48 PM, syedfa <fa...@gmail.com> wrote:

>
> Dear fellow Java developers:
>
> I have a very basic question about indexing text using Lucene.  I am
> indexing a large amount of text, that includes names that contain certain
> punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.)  Will the punctuation
> throw off the indexer in any way, such that it breaks up the tokens when
> they shouldn't be, or will the indexer simply treat the punctuation inside
> the names as any other character, and the presence of the punctuation will
> not in any way hinder a user's ability to search for that name?  Are there
> any precautions that I should take to avoid any problems?
>
> I hope this question is clear and makes sense.
>
> Thanks in advance to all who reply.
>
> --
> View this message in context:
> http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>