Posted to java-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/09/30 05:54:43 UTC
Whitespace/Standard Analyzer and punctuation
I would like my searches to match "John Smith" when John Smith appears in a
document, but not when the two words are separated by punctuation. For
example, when I was using StandardAnalyzer, "John. Smith" was matching, which
is wrong for me. Right now I am using WhitespaceAnalyzer and instead searching
for "John Smith", "John Smith.", "John Smith,", etc., which seems like a dumb
thing to be doing. Can I split off the punctuation but keep the analyzer aware
of where the punctuation occurred relative to my matching term?
Thanks.
Re: Whitespace/Standard Analyzer and punctuation
Posted by Karl Wettin <ka...@gmail.com>.
You could look into modifying the standard tokenizer lexer code to
handle punctuation (there is a patch in the issue tracker for the old
JavaCC grammar to handle punctuation). There is also the GATE NLP
project, which has a fairly nice sentence splitter you might find
useful. Add a large position increment between your sentences, and
limit your searches by how much distance (slop) you allow for a hit.
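
To make the position-increment idea concrete, here is a minimal standalone
sketch in plain Java. It does not use the Lucene API: the class, the Token
type, the PUNCTUATION_GAP value of 100 and the simplified in-order slop check
are all assumptions made for illustration. In Lucene itself the same effect
would come from setting the position increment on tokens in a custom filter
and from the slop on a phrase query. Only trailing punctuation on a word
opens a gap here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: none of these names are Lucene classes.
class PunctuationGapSketch {

    /** A term together with its position in the token stream. */
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Assumed gap size; it only needs to be larger than any slop you allow.
    static final int PUNCTUATION_GAP = 100;

    /** Splits on whitespace, strips punctuation, and opens a gap after it. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        int increment = 1;
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Remove leading and trailing characters that are not letters or digits.
            String term = raw.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (!term.isEmpty()) {
                position += increment;
                tokens.add(new Token(term.toLowerCase(Locale.ROOT), position));
                increment = 1;
            }
            // Trailing punctuation pushes the next token far away.
            if (!Character.isLetterOrDigit(raw.charAt(raw.length() - 1))) {
                increment = 1 + PUNCTUATION_GAP;
            }
        }
        return tokens;
    }

    /**
     * Simplified phrase match: the query terms must occur as consecutive
     * tokens, in order, and the total extra distance between them must not
     * exceed the slop. Positions inside the query itself are ignored.
     */
    static boolean matchesPhrase(List<Token> doc, String phrase, int slop) {
        List<Token> query = tokenize(phrase);
        if (query.isEmpty()) {
            return false;
        }
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            int extra = 0;
            boolean termsMatch = true;
            for (int k = 0; k < query.size(); k++) {
                Token current = doc.get(start + k);
                if (!current.term.equals(query.get(k).term)) {
                    termsMatch = false;
                    break;
                }
                if (k > 0) {
                    extra += current.position - doc.get(start + k - 1).position - 1;
                }
            }
            if (termsMatch && extra <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> adjacent = tokenize("I met John Smith, today.");
        List<Token> separated = tokenize("Ask John. Smith left.");
        System.out.println(matchesPhrase(adjacent, "John Smith", 0));
        System.out.println(matchesPhrase(separated, "John Smith", 0));
    }
}
```

With a slop of 0 the first document matches and the second does not, because
the period after "John" moves "Smith" 100 positions further away. Because
punctuation is stripped from the terms, a single query for "John Smith" also
covers "John Smith." and "John Smith," without listing each variant.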
I hope this helps.
karl
On 30 Sep 2009, at 05:54, Max Lynch wrote:
> I would like my searches to match "John Smith" when John Smith is in a
> document, but not separated with punctuation. For example, when I was
> using StandardAnalyzer, "John. Smith" was matching, which is wrong for
> me. Right now I am using WhitespaceAnalyzer but instead searching for
> "John Smith" "John Smith." "John Smith," etc., which seems like a dumb
> thing to be doing. Can I separate the punctuation but keep the analyzer
> aware of where the punctuation occurred in my matching term?
>
> Thanks.