Posted to java-user@lucene.apache.org by Dan Quaroni <dq...@OPENRATINGS.com> on 2003/08/14 17:58:14 UTC

Indexing documents with multiple values for 1 field

I saw a post that sort of touched on my question, I think, but it didn't
seem quite the same...

What's the best way to index a document with multiple values for the same
field?  I'm trying to optimize search time and accuracy.

We have a database of companies that we want to be able to search on, and
the fields will include company name, address, and telephone number.  Some
companies have more than one name, though.  For example, BMG is also known
as Bertelsmann Music Group.  Our users need to be able to search on either
of these names and find a match.  In our raw data, these different names are
in separate fields for alternate names...  But which is a better way to
implement this in Lucene:

A) Duplicate documents by using all the same data except for the name (i.e.
1 document for BMG at 123 fake street and 1 document for Bertelsmann Music
Group at 123 fake street)

B) Create 5 fields for alternate names (which 80% of companies don't have at
all, so they'd be empty) and then, when doing a search query, search for the
same thing across all 6 fields?  (i.e. name:BMG OR altname1:BMG OR
altname2:BMG... etc)

C) Put all of the alternate names together into the name field (i.e. BMG
Bertelsmann Music Group).  Is there anything to delimit the different names
with so that they would be treated as separate entities?
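
For reference, the query in option B could be assembled roughly as follows. This
is only a sketch: the class MultiFieldQuerySketch and the method anyNameField are
hypothetical helpers, not part of Lucene, and the field names are assumed from
the example above. It only builds the query string and does not escape special
query characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds the option B query string.
final class MultiFieldQuerySketch {

    // Assumed field names: one primary name field plus five alternates.
    static final String[] NAME_FIELDS = {
        "name", "altname1", "altname2", "altname3", "altname4", "altname5"
    };

    // ORs the same term across every name field.
    // Multi-word terms are wrapped in double quotes so they stay one phrase.
    static String anyNameField(String term) {
        String value = term.contains(" ") ? "\"" + term + "\"" : term;
        List<String> clauses = new ArrayList<>();
        for (String field : NAME_FIELDS) {
            clauses.add(field + ":" + value);
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(anyNameField("BMG"));
    }
}
```

Every search pays for six clauses this way, even though most documents would
leave the alternate fields empty.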

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Indexing documents with multiple values for 1 field

Posted by Victor Hadianto <vi...@nuix.com.au>.
> C) Put all of the alternate names together into the name field (i.e. BMG
> Bertelsmann Music Group).  Is there anything to delimit the different names
> with so that they would be treated as separate entities?

This is what I would do. If the values in this field are not going to be
large, and you don't need the field for later display (i.e. an UnStored field
is enough), then put all the variant names in the "Name" field. This is the
easiest to do, and you don't have to OR together a lot of fields (or use a
multi-field query) later.

Also you can do:
doc.add(Field.UnStored("Name", "BMG"));
doc.add(Field.UnStored("Name", "Bertelsmann Music Group"));

and that will be the same as:
doc.add(Field.UnStored("Name", "BMG Bertelsmann Music Group"));
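
To illustrate why the two forms behave alike, here is a toy model of a document
as a map from field name to token list. It is not Lucene code:
MultiValuedFieldSketch and its methods are made up for illustration, and the
whitespace-and-lowercase tokenizing is an assumption standing in for whatever
analyzer you actually use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model, not Lucene: a document maps each field name to its tokens.
final class MultiValuedFieldSketch {

    private final Map<String, List<String>> tokensByField = new LinkedHashMap<>();

    // Adding a value to a field that already exists appends its tokens
    // to that field instead of replacing the earlier value.
    void add(String field, String value) {
        List<String> tokens =
            tokensByField.computeIfAbsent(field, k -> new ArrayList<>());
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
    }

    // All tokens indexed under the field, in insertion order.
    List<String> tokens(String field) {
        return tokensByField.getOrDefault(field, new ArrayList<>());
    }

    // True if the single term occurs anywhere in the field.
    boolean matches(String field, String term) {
        return tokens(field).contains(term.toLowerCase(Locale.ROOT));
    }
}
```

In this model, adding "BMG" and "Bertelsmann Music Group" separately to "Name"
gives the same token list as adding "BMG Bertelsmann Music Group" once, so a
search for either name matches the same document. Real phrase matching across
the boundary between two values depends on the Lucene version and analyzer.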


HTH,

Victor Hadianto



