Posted to general@lucene.apache.org by Lee Goddard <le...@gmail.com> on 2014/06/10 17:08:37 UTC

indexing Guides? Indexing names

Could you recommend a good guide on constructing an index — analyzers, 
filters....

I've inherited a set-up that indexes company names. It does a great job 
on 1,000 names or so, but when I put in a million or more, the search 
results make no sense.

My test search is searching 'A & B Household' to target 'A & B 
Households' — when I have a million records (of several tens of million 
to come), I see the name has an equal score to other names with 
different initials.

Is it possible to weight the individual initials as words?

Would you recommend employing a stemmer?

Thanks in anticipation
Lee

Re: indexing Guides? Indexing names

Posted by Lee <le...@gmail.com>.
On 10/06/2014 18:40, Ted Dunning wrote:
>
 > On Tue, Jun 10, 2014 at 8:08 AM, Lee Goddard <leegee@gmail.com> wrote:
 >
 > Is it possible to weight the individual initials as words?
 >
 > Would you recommend employing a stemmer?
 >
 >
 > Yes it is definitely possible.  But don't just use any stemmer.  You
 > need to adapt something so that you preserve initial letters and
 > likely use heuristics such as possibly preserving case.

Am I going to have to write a parser in Java for that, or is it a matter 
of combining what is in the box? I've previously created indexes of photos 
(my own parser) and indexes of documents, but indexing a single company 
name is quite a new idea to me.

 > You will also probably want to include alternative forms in other
 > fields.  These would include nicknames, stock symbols and
 > abbreviations.

Not in this — it's simply an interface to find information held by the 
state on the affairs of a company, so the alternative forms are of the 
final element of the company registered name: it might be 'Limited' but 
people may search 'ltd', it may be 'SE' but people may search 'european'.
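
A minimal sketch of that last point, in plain Java with no Lucene dependency: 
the final token of a name is mapped to one canonical legal form, so that 
'Limited' and 'ltd' (or 'SE' and 'european') end up as the same term. The 
class name LegalFormNormalizer and the four-entry table are hypothetical; a 
real table would come from the registry's own list of legal forms. In Lucene 
itself this kind of mapping would more usually live in a synonym filter in 
the analysis chain, applied identically at index time and at query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps variants of a trailing legal form to one canonical token. */
final class LegalFormNormalizer {

    // Illustrative table only (an assumption, not a complete list of legal forms).
    private static final Map<String, String> CANONICAL = Map.of(
            "limited", "ltd",
            "ltd", "ltd",
            "european", "se",
            "se", "se");

    /** Lower-cases the name, splits it on whitespace, and rewrites only the final token. */
    static List<String> normalize(String companyName) {
        String trimmed = companyName.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        List<String> tokens = new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        int last = tokens.size() - 1;
        // Drop trailing full stops so that "ltd." is treated like "ltd".
        String lastToken = tokens.get(last).replaceAll("\\.+$", "");
        tokens.set(last, CANONICAL.getOrDefault(lastToken, lastToken));
        return tokens;
    }
}
```

With this, normalize("Acme Limited") and normalize("Acme ltd.") produce the 
same token list, while tokens earlier in the name are left alone.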

TIA
Lee

Re: indexing Guides? Indexing names

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 10, 2014 at 8:08 AM, Lee Goddard <le...@gmail.com> wrote:

> Is it possible to weight the individual initials as words?
>
> Would you recommend employing a stemmer?
>
>
Yes it is definitely possible.  But don't just use any stemmer.  You need
to adapt something so that you preserve initial letters and likely use
heuristics such as possibly preserving case.

You will also probably want to include alternative forms in other fields.
 These would include nicknames, stock symbols and abbreviations.
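
As a rough illustration of preserving initial letters while still stemming 
the longer words, here is a plain-Java sketch with no Lucene dependency. 
Everything in it is an assumption made for the example: the class name 
InitialAwareAnalyzer, the trailing-"s" rule standing in for a real stemmer, 
the choice to upper-case initials, and the initialWeight parameter. It is 
not how Lucene scores documents; it only shows the shape of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical analysis step: keeps one-character initials as tokens, lightly stems longer words. */
final class InitialAwareAnalyzer {

    static List<String> analyze(String name) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of anything that is not a letter or digit, so "&" and punctuation vanish.
        for (String raw : name.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            if (raw.length() == 1) {
                // A single character is treated as an initial: kept, upper-cased, never stemmed.
                tokens.add(raw.toUpperCase(Locale.ROOT));
            } else {
                tokens.add(stripPlural(raw.toLowerCase(Locale.ROOT)));
            }
        }
        return tokens;
    }

    // Stand-in for a real stemmer: removes one trailing "s" from words longer than three letters.
    private static String stripPlural(String word) {
        boolean plural = word.length() > 3 && word.endsWith("s") && !word.endsWith("ss");
        return plural ? word.substring(0, word.length() - 1) : word;
    }

    /** Sums the weights of query tokens found in the candidate; initials count initialWeight, words 1.0. */
    static double score(String query, String candidate, double initialWeight) {
        List<String> candidateTokens = analyze(candidate);
        double total = 0.0;
        for (String token : analyze(query)) {
            if (candidateTokens.contains(token)) {
                total += token.length() == 1 ? initialWeight : 1.0;
            }
        }
        return total;
    }
}
```

Under these assumptions 'A & B Household' and 'A & B Households' analyze to 
the same tokens, and with an initialWeight above 1.0 a name that shares the 
initials outranks one that shares only the word 'household'.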