Posted to java-user@lucene.apache.org by Carsten Schnober <sc...@ids-mannheim.de> on 2012/11/23 15:36:54 UTC

Specialized Analyzer for names

Hi,
I'm indexing names in a dedicated Lucene field and I wonder which
analyzer to use for that purpose. Typically, the names are in the format
"John Smith", so the WhitespaceAnalyzer is likely the best in most
cases. The field type to choose seems to be the TextField.
Or, would you rather recommend using the KeywordAnalyzer? I'm a bit
cautious about that because I'm afraid of wildcard or regex queries such
as "*Smith" or ".*Smith" respectively.

However, there might also be special cases and spelling exceptions of
all kinds, e.g. "Smith, John", "John 'Hammer' Smith", "Abd al-Aziz",
"Stan van Hoop" and whatever else one could imagine. Is there a special
Analyzer that is optimized for dealing with such cases or do I have to do
normalization beforehand?
I see that such special characters and spellings can easily be covered
by the right queries, but that requires the user to know the exact
spelling, which is what I'm trying to spare her.
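
For illustration, the "normalization beforehand" option could look roughly
like the plain-Java sketch below. It is not a Lucene API: the class name
NameNormalizer and every rule in it (swapping around a single comma,
stripping diacritics, turning quotes and hyphens into spaces, lowercasing)
are assumptions chosen to cover the example spellings above, and real data
will need more rules.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: canonicalizes a name string
// before it is handed to the indexer or to the query builder.
public class NameNormalizer {

    /** Returns a lowercase, single-space-separated form of a personal name. */
    public static String normalize(String raw) {
        String s = raw.trim();

        // Assumption: exactly one comma means "Last, First"; swap the parts.
        int comma = s.indexOf(',');
        if (comma >= 0 && comma == s.lastIndexOf(',')) {
            s = s.substring(comma + 1).trim() + " " + s.substring(0, comma).trim();
        }

        // Assumption: diacritics are dropped (decompose, then remove marks).
        s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");

        // Quotes, hyphens, and other punctuation become token separators.
        s = s.replaceAll("[^\\p{L}\\p{Nd}]+", " ");

        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

The same function would have to be applied to the user's query string as
well, so that both sides meet on the normalized form.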

Best regards,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Specialized Analyzer for names

Posted by Ian Lea <ia...@gmail.com>.

I'd use StandardAnalyzer or ClassicAnalyzer.  It also depends on how you
want to search.  You probably want a query for "John Smith" to match
"John Smith" and "Smith, John" but maybe not "John Brown and Sam
Smith".  The latter is a problem.  You can partially work around it by
using a BooleanQuery made up of a phrase query and/or a SpanNearQuery
with a small slop and inOrder set to true, plus a general catch-all
clause, with boosts on the first two.
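
To make the effect of that combination concrete, here is a toy model in
plain Java. It deliberately does not call Lucene: the class
TieredNameMatch, the three weights, and the simplified slop rule (number
of extra tokens inside the matched span) are all assumptions standing in
for a boosted PhraseQuery, a boosted SpanNearQuery and an unboosted
catch-all clause whose scores add up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-in for three optional BooleanQuery clauses; not Lucene code.
public class TieredNameMatch {
    // Hypothetical weights playing the role of clause boosts.
    static final double PHRASE_BOOST = 3.0;
    static final double NEAR_BOOST = 2.0;
    static final double CATCH_ALL = 1.0;

    // Lowercases and splits on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // True if the query tokens occur in order with at most `slop`
    // other tokens inside the matched span (slop 0 = exact phrase).
    static boolean inOrderWithin(List<String> doc, List<String> query, int slop) {
        if (query.isEmpty()) return false;
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(query.get(0))) continue;
            int pos = start;
            boolean found = true;
            for (int q = 1; q < query.size() && found; q++) {
                found = false;
                for (int i = pos + 1; i < doc.size(); i++) {
                    if (doc.get(i).equals(query.get(q))) { pos = i; found = true; break; }
                }
            }
            if (found && (pos - start + 1) - query.size() <= slop) return true;
        }
        return false;
    }

    // Sums the weights of every tier that matches.
    static double score(String docText, String queryText, int slop) {
        List<String> doc = tokens(docText);
        List<String> query = tokens(queryText);
        double score = 0.0;
        if (inOrderWithin(doc, query, 0)) score += PHRASE_BOOST;
        if (inOrderWithin(doc, query, slop)) score += NEAR_BOOST;
        // Assumption: the catch-all requires every query term, anywhere.
        if (!query.isEmpty() && doc.containsAll(query)) score += CATCH_ALL;
        return score;
    }
}
```

With a slop of 1 and the query "John Smith", this model ranks "John
Smith" first and "John 'Hammer' Smith" second, but "Smith, John" and
"John Brown and Sam Smith" both fall through to the catch-all and tie,
which is why the workaround is only partial.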

If this is real-world data, there will always be exceptions and problems.


--
Ian.


