Posted to java-user@lucene.apache.org by Seid Mohammed <se...@gmail.com> on 2009/03/03 07:13:07 UTC
How to index Named Entities
I want to index document contents in two ways: once as plain
content, and once as named entities.
The scenario is like this.
If I have the document "the source of Nile is Ethiopia",
then I want to index "source" as normal content, "Nile" as a river
name, and "Ethiopia" as a country name, so that later, if I ask the
question "where is the source of Nile", it retrieves Ethiopia as the
answer.
Note: I will have lists of river names, country names, and so on, so that
during indexing I can compare every word of a document against my lists.
thanks a lot
Seid M
--
"RABI ZIDNI ILMA"
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: How to index Named Entities
Posted by Grant Ingersoll <gs...@apache.org>.
Have a look at TeeTokenFilter and SinkTokenizer. You could
extend/implement those to do a lookup in your lists, and when
you have a match, add the token to the sink, which then lets you
index a separate field containing your named entities. TeeTokenFilter
and SinkTokenizer are located in the contrib/analysis package of the
latest Lucene release. Alternatively, you could implement a TokenFilter
that adds a payload to a term whenever it comes across a named entity.
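For illustration, the list-lookup idea can be sketched in plain Java,
independent of Lucene. This is only a rough sketch under stated
assumptions: the class EntityFieldSplitter, the method split, the field
names "content", "river", and "country", and the tiny word lists are all
hypothetical and are not part of the Lucene API. Every token goes to the
main content field (the "tee"), and tokens found in a list are also
collected into a per-type field (the "sink").

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene classes.
public class EntityFieldSplitter {

    // Hypothetical word lists; real ones would be loaded from your own files.
    static final Map<String, Set<String>> LISTS = new LinkedHashMap<>();
    static {
        LISTS.put("river", Set.of("nile", "danube"));
        LISTS.put("country", Set.of("ethiopia", "egypt"));
    }

    // Splits text into tokens and returns one token list per field name.
    static Map<String, List<String>> split(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("content", new ArrayList<>());
        for (String type : LISTS.keySet()) {
            fields.put(type, new ArrayList<>());
        }
        // Naive tokenization on non-word characters (ASCII-oriented assumption).
        for (String raw : text.split("\\W+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            // "Tee": every token still reaches the normal content field.
            fields.get("content").add(token);
            // "Sink": tokens found in a list are also kept for that entity field.
            for (Map.Entry<String, Set<String>> list : LISTS.entrySet()) {
                if (list.getValue().contains(token)) {
                    fields.get(list.getKey()).add(token);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("the source of Nile is Ethiopia"));
        // prints {content=[the, source, of, nile, is, ethiopia], river=[nile], country=[ethiopia]}
    }
}
```

Each resulting list would then be indexed as its own field of the same
document, so a query can restrict itself to the "river" or "country"
field. Single-word lookup like this misses multi-word names, which is
one reason list-based matching tends to be brittle.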
You might also look into preprocessing with OpenNLP, LingPipe, or a
similar tool, which can go beyond simple list lookup for named
entities. List-based approaches are useful, but they also tend to be
brittle.
<shameless_somewhat_self_serving_but_hopefully_useful_plug>
Using OpenNLP is described in chapter 5 of my book
(http://manning.com/ingersoll/), and I believe Tom (my coauthor) even
has code in there for plugging OpenNLP into the Lucene analysis process.
</shameless_somewhat_self_serving_but_hopefully_useful_plug>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search