Posted to general@lucene.apache.org by Lee Goddard <le...@gmail.com> on 2014/06/10 17:08:37 UTC
indexing Guides? Indexing names
Could you recommend a good guide on constructing an index: analyzers,
filters, and so on?
I've inherited a set-up that indexes company names. It does a great job
on 1,000 names or so, but when I put in a million or more, the results
make no sense.
My test search is 'A & B Household', intended to match 'A & B
Households'. With a million records loaded (of several tens of millions
to come), I see that name has the same score as other names with
different initials.
Is it possible to weight the individual initials as words?
Would you recommend employing a stemmer?
Thanks in anticipation
Lee
Re: indexing Guides? Indexing names
Posted by Lee <le...@gmail.com>.
On 10/06/2014 18:40, Ted Dunning wrote:
>
> On Tue, Jun 10, 2014 at 8:08 AM, Lee Goddard <le...@gmail.com> wrote:
>
> Is it possible to weight the individual initials as words?
>
> Would you recommend employing a stemmer?
>
>
> Yes, it is definitely possible. But don't just use any stemmer. You
> need to adapt something so that it preserves initial letters and
> likely uses heuristics, such as possibly preserving case.
Am I going to have to write a parser in Java for that, or is it a matter
of combining what is in the box? I've previously created indexes of photos
(my own parser) and indexes of documents, but indexing a single company
name is quite a new idea to me.
> You will also probably want to include alternative forms in other
> fields. These would include nicknames, stock symbols and
> abbreviations.
Not in this case. It's simply an interface to find information held by
the state on the affairs of a company, so the alternative forms are those
of the final element of the registered company name: it might be 'Limited'
but people may search 'ltd', or it may be 'SE' but people may search
'european'.
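As a rough illustration of the suffix handling described above, here is a
minimal plain-Java sketch that maps variants of the final element of a name
to one canonical token, so that 'Limited' and 'ltd' end up as the same term
at index time and at query time. The class name CompanySuffixNormalizer, the
normalize method and the small mapping table are assumptions made up for
this example; none of it is Lucene API, and a real index would more likely
do this inside the analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Lucene): canonicalises the company-type
// suffix, i.e. the final element of a registered company name.
final class CompanySuffixNormalizer {

    // Illustrative table only, built from the two examples in the thread.
    private static final Map<String, String> CANONICAL = Map.of(
            "limited", "ltd",
            "ltd", "ltd",
            "european", "se",
            "se", "se");

    // Lower-cases and splits on whitespace; only the LAST token is mapped,
    // so "Limited Edition Prints" keeps "limited" as an ordinary word.
    static List<String> normalize(String name) {
        String[] raw = name.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < raw.length; i++) {
            // Strip trailing dots so "ltd." is treated like "ltd".
            String token = raw[i].replaceAll("\\.+$", "");
            if (token.isEmpty()) {
                continue;
            }
            boolean last = (i == raw.length - 1);
            tokens.add(last ? CANONICAL.getOrDefault(token, token) : token);
        }
        return tokens;
    }
}
```

Applying the same normalize call to both the stored name and the user's
query is what makes the variants meet; mapping only the last token is a
design choice taken from the description above, not a requirement.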
TIA
Lee
Re: indexing Guides? Indexing names
Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 10, 2014 at 8:08 AM, Lee Goddard <le...@gmail.com> wrote:
> Is it possible to weight the individual initials as words?
>
> Would you recommend employing a stemmer?
>
>
Yes, it is definitely possible. But don't just use any stemmer. You need
to adapt something so that it preserves initial letters and likely uses
heuristics, such as possibly preserving case.
You will also probably want to include alternative forms in other fields.
These would include nicknames, stock symbols and abbreviations.
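To make the advice about preserving initials concrete, here is a small
plain-Java sketch, not the poster's code and not Lucene API. It assumes a
hypothetical NameTokens class with two made-up methods: analyze keeps every
single-character token as an upper-cased initial and applies a deliberately
crude plural-stripping rule only to longer words, and score gives matching
initials a caller-chosen weight. The stemming rule and the scoring are
stand-ins chosen for illustration; Lucene's real scoring works differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: tokenise a company name so that initials survive.
final class NameTokens {

    // Splits on anything that is not a letter or digit (so "&" disappears),
    // keeps one-character tokens as upper-case initials, and lower-cases
    // longer words before stripping a single trailing "s".
    static List<String> analyze(String name) {
        List<String> out = new ArrayList<>();
        for (String raw : name.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields "" when the name starts with a separator
            }
            if (raw.length() == 1) {
                out.add(raw.toUpperCase(Locale.ROOT)); // initial: never stemmed or dropped
            } else {
                String t = raw.toLowerCase(Locale.ROOT);
                // Crude stand-in for a stemmer: "households" -> "household".
                if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
                    t = t.substring(0, t.length() - 1);
                }
                out.add(t);
            }
        }
        return out;
    }

    // Toy overlap score: each query token found in the candidate adds 1.0,
    // or initialWeight if the token is a one-character initial.
    static double score(String query, String candidate, double initialWeight) {
        List<String> candidateTokens = analyze(candidate);
        double total = 0.0;
        for (String token : analyze(query)) {
            if (candidateTokens.contains(token)) {
                total += (token.length() == 1) ? initialWeight : 1.0;
            }
        }
        return total;
    }
}
```

With an initial weight of 2.0, the query 'A & B Household' scores 5.0
against 'A & B Households' and 1.0 against 'C & D Households', so names
that share the initials rank above names that share only the common word.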