Posted to java-user@lucene.apache.org by Seid Mohammed <se...@gmail.com> on 2009/03/03 07:13:07 UTC

How to index Named Entities

I want to index document contents in two ways: once as plain
content, and once as named entities.
The scenario is like this.
If I have the document "the source of Nile is Ethiopia",
then I want to index "source" as normal content, "Nile" as a river
name, and "Ethiopia" as a country name, so that later, if I ask the
question "where is the source of Nile", it retrieves Ethiopia as the
answer.

Note: I will have lists of river names, country names, and so on, so
that during indexing I can compare every word of a document against my
lists.

Thanks a lot

Seid M
-- 
"RABI ZIDNI ILMA"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to index Named Entities

Posted by Grant Ingersoll <gs...@apache.org>.

Have a look at the TeeTokenFilter and the SinkTokenizer. You could
extend/implement those to do a lookup in your list and, whenever
you have a match, add the token to the sink, which then allows you to
index a separate field containing your named entities. The
TeeTokenFilter and SinkTokenizer are located in the contrib/analysis
package of the latest Lucene release. Alternatively, you could
implement a TokenFilter that adds a payload to a term whenever it
comes across a named entity.
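
To make the tee/sink idea concrete, here is a minimal sketch in plain
Java. It deliberately does not use the Lucene API: the class name
NamedEntityFieldsSketch, the GAZETTEER map, and the "type:token" format
are all assumptions made for illustration. It only shows the routing
logic: every token goes to the content stream, and tokens found in the
lookup list are also copied to a second stream that could feed a
separate named-entity field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only: no Lucene classes are used here.
class NamedEntityFieldsSketch {

    // Hypothetical gazetteer: lower-cased surface form -> entity type.
    static final Map<String, String> GAZETTEER = new LinkedHashMap<>();
    static {
        GAZETTEER.put("nile", "river");
        GAZETTEER.put("ethiopia", "country");
    }

    // Holds the two token streams that would feed two separate index fields.
    static final class Fields {
        final List<String> content = new ArrayList<>();   // every token
        final List<String> entities = new ArrayList<>();  // "type:token", matches only
    }

    static Fields analyze(String text) {
        Fields fields = new Fields();
        // Split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator produces one empty string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            fields.content.add(token);           // the "tee": always pass through
            String type = GAZETTEER.get(token);
            if (type != null) {
                fields.entities.add(type + ":" + token); // the "sink": matches only
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Fields f = analyze("the source of Nile is Ethiopia");
        System.out.println("content  = " + f.content);
        System.out.println("entities = " + f.entities);
    }
}
```

In a real analyzer the same lookup would presumably sit inside a
TokenFilter, and the second list would become the token stream of the
named-entity field. Note that this single-word lookup cannot match
multi-word names, which is one way list-based matching gets brittle.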

As another option, you might look into preprocessing with OpenNLP,
LingPipe, or a similar tool that can go beyond plain list lookup
for named entities. List-based approaches are useful, but they also
tend to be brittle.

<shameless_somewhat_self_serving_but_hopefully_useful_plug>
Using OpenNLP is described in my book, http://manning.com/ingersoll/,
in chapter 5, and I believe Tom (my coauthor) even has code in there
for plugging OpenNLP into the Lucene analysis process.
</shameless_somewhat_self_serving_but_hopefully_useful_plug>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search
