You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/09/30 05:54:43 UTC

Whitespace/Standard Analyzer and punctuation

I would like my searches to match "John Smith" when John Smith is in a
document, but not separated with punctuation.  For example, when I was using
StandardAnalyzer, "John. Smith" was matching, which is wrong for me.  Right
now I am using WhitespaceAnalyzer but instead searching for "John Smith"
"John Smith." "John Smith," etc., which seems like a dumb thing to be
doing.  Can I separate the punctuation but keep the analyzer aware of where
the punctuation occurred in my matching term?

Thanks.

Re: Whitespace/Standard Analyzer and punctuation

Posted by Karl Wettin <ka...@gmail.com>.
You could look in to modifying the standard tokenizer lexer code to  
handle punctuation (there is a patch in the isssue tracker for the old  
javacc grammer to handle punctuation) and there is also the Gate NLP  
project which has a fairly nice sentence splitter you might find  
useful. Add a whole bunch of position increment between your sentences  
and limit your searches to how much distance you allow for a hit.

I hope this helps.


        karl


30 sep 2009 kl. 05.54 skrev Max Lynch:

> I would like my searches to match "John Smith" when John Smith is in a
> document, but not separated with punctuation.  For example, when I  
> was using
> StandardAnalyzer, "John. Smith" was matching, which is wrong for  
> me.  Right
> now I am using WhitespaceAnalyzer but instead searching for "John  
> Smith"
> "John Smith." "John Smith," etc., which seems like a dumb thing to be
> doing.  Can I separate the punctuation but keep the analyzer aware  
> of where
> the punctuation occurred in my matching term?
>
> Thanks.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org