Posted to general@lucene.apache.org by 1world1love <jd...@yahoo.com> on 2007/12/20 15:48:12 UTC

advice on integrating NLP engine during indexing

Greetings all. I am new to Lucene and am looking for a little
advice/direction/feedback on what I am trying to do. I want to index and
query millions of documents that are unstructured and resemble
crime/police/psychiatric reports; no problem, Lucene is perfect for this.

The trick is that I need to exclude certain terms from the index such as
those terms that are negated or information that could potentially identify
people. I have a collection of natural language processing tools that are
able to tag or remove/replace such terms.

I need to design the indexing such that I can feed each document through
these tools and then incorporate the results into the indexing strategy.

As an example, if I have a report that has the phrase: "Mr. Smith has no
history of violence against women prior to this event"

The NLP engine would recognize the name Smith and the negation of the term
"violence" and would tag them as such. I would then like to exclude those
terms from the indexing as seems prudent.
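
A minimal sketch of that exclusion step, assuming the NLP engine hands back one tag per token. The names RedactionSketch, TaggedToken, and Tag are hypothetical, and the code uses only the Java standard library rather than any Lucene API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch only: drops tokens that a (hypothetical) NLP engine has tagged before they reach the index. */
final class RedactionSketch {

    /** Hypothetical tags; a real engine will have its own label set. */
    enum Tag { NONE, PERSON_NAME, NEGATED }

    /** One token of NLP output: the surface text plus the tag assigned to it. */
    static final class TaggedToken {
        final String text;
        final Tag tag;

        TaggedToken(String text, Tag tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    /** Returns the lowercased terms to index, skipping names and negated terms. */
    static List<String> indexableTerms(List<TaggedToken> tokens) {
        List<String> terms = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (token.tag == Tag.NONE) {
                terms.add(token.text.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Convenience factory for the example below. */
    static TaggedToken t(String text, Tag tag) {
        return new TaggedToken(text, tag);
    }

    public static void main(String[] args) {
        // Assumed engine output for part of the example sentence.
        List<TaggedToken> tagged = Arrays.asList(
            t("Smith", Tag.PERSON_NAME), t("has", Tag.NONE), t("no", Tag.NONE),
            t("history", Tag.NONE), t("of", Tag.NONE), t("violence", Tag.NEGATED));
        System.out.println(indexableTerms(tagged));
    }
}
```

Whether a removed token should leave a position gap, so that phrase queries do not match across a redaction, is a separate design decision that this sketch ignores.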

Another strategy I would like to look at is to include the tags in the index
to incorporate them into the search engine. That is to say, whether a subject
"likely" has a history of violence, "may" have a history of violence, or
"does not" have a history of violence.
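
One way to picture this second strategy is a separate field per assertion, searched alongside the body text. The sketch below is a toy in-memory fielded index in plain Java, not Lucene; the class name AssertionFieldIndex, the field name "violence_history", and the values "likely", "may", and "does_not" are all illustrative assumptions:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: a toy fielded inverted index that stores NLP assertion tags in their own field. */
final class AssertionFieldIndex {

    // field name -> term -> ids of the documents that contain the term in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    /** Records that the given document has the given term in the given field. */
    void add(int docId, String field, String term) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    /** Returns the ids of documents whose field contains the term (empty if none). */
    Set<Integer> search(String field, String term) {
        Map<String, Set<Integer>> terms = postings.get(field);
        if (terms == null || !terms.containsKey(term)) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(terms.get(term));
    }

    public static void main(String[] args) {
        AssertionFieldIndex index = new AssertionFieldIndex();
        // Document 1: the NLP engine decided the subject does not have a history of violence.
        index.add(1, "body", "history");
        index.add(1, "violence_history", "does_not");
        // Document 2: the engine decided a history of violence is likely.
        index.add(2, "body", "history");
        index.add(2, "violence_history", "likely");

        System.out.println(index.search("violence_history", "likely"));
    }
}
```

Keeping the assertion in its own field means a query for the body term alone still finds both documents, while a query on the assertion field narrows the result to the documents the engine labeled that way.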

I assume that I will need to design a custom analyzer to do this, but I was
hoping to solicit any comments, advice, or general suggestions before I get
started.

Thanks in advance,

--
View this message in context: http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14437913.html
Sent from the Lucene - General mailing list archive at Nabble.com.

RE: advice on integrating NLP engine during indexing

Posted by 1world1love <jd...@yahoo.com>.

Hi James. Ira's link is a good starting point. There is another algorithm
called NegEx used in parsing medical texts that was published out of the
University of Pittsburgh. You can find a high-level description here:
http://healthinformatics.wikispaces.com/NegEx+Algorithm
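
To give a feel for the general idea, here is a heavily simplified sketch in the spirit of trigger-based negation detection. It is not the published NegEx algorithm: the trigger words, the terminator words, and the five-token window are assumptions chosen for illustration, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch only: marks tokens that fall within a fixed window after a negation trigger. */
final class SimpleNegationTagger {

    // Illustrative lists; a real system would use a curated, domain-specific vocabulary.
    private static final Set<String> TRIGGERS =
        new HashSet<>(Arrays.asList("no", "not", "without", "denies"));
    private static final Set<String> TERMINATORS =
        new HashSet<>(Arrays.asList("but", "however", "although"));

    // Assumed scope: how many tokens after a trigger are treated as negated.
    private static final int WINDOW = 5;

    /** Returns a flag per token: true if the token lies inside a negation window. */
    static boolean[] negated(String[] tokens) {
        boolean[] result = new boolean[tokens.length];
        int remaining = 0; // tokens still covered by the most recent trigger
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i].toLowerCase(Locale.ROOT);
            if (TRIGGERS.contains(token)) {
                remaining = WINDOW;   // open (or restart) a negation window
            } else if (TERMINATORS.contains(token)) {
                remaining = 0;        // a terminator closes the window early
            } else if (remaining > 0) {
                result[i] = true;
                remaining--;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens =
            "Mr. Smith has no history of violence against women prior to this event".split(" ");
        boolean[] flags = negated(tokens);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + (flags[i] ? "  [negated]" : ""));
        }
    }
}
```

With these assumed settings, "violence" in the example sentence falls inside the window opened by "no", while "Smith" and "prior" do not. A fixed window is crude, which is one reason such methods need tuning per domain.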

Although much of the research in the field is being done in medical
informatics, the general principles are really universal as long as you have
a good understanding of the domain vocabulary. You could probably search
PubMed for current literature on the subject.

As to the question of accuracy, I have found that most of the published
results are based on a "best case scenario" and that any method will need to
be tweaked for a particular problem to get the best results. You will
probably never find a method that is perfectly accurate, even a human-based one.
My philosophy when evaluating these algorithms is "Don't let the perfect be
the enemy of the good".

j

James-10 wrote:
> 
> Hi,
> 
> I can't answer your question -- sorry!  But, I was curious about the NLP
> you
> describe.  Are there algorithms available for determining negation
> automatically, and are they accurate?
> 
> Sincerely,
> James
> 
> 

-- 
View this message in context: http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14443277.html
Sent from the Lucene - General mailing list archive at Nabble.com.

RE: advice on integrating NLP engine during indexing

Posted by James Ryley <ja...@ryley.com>.

Hi,

I can't answer your question -- sorry!  But, I was curious about the NLP you
describe.  Are there algorithms available for determining negation
automatically, and are they accurate?

Sincerely,
James

Re: advice on integrating NLP engine during indexing

Posted by Grant Ingersoll <gs...@apache.org>.

FYI: you will get a broader audience on java-user; this list is mostly
for discussion of higher-level Lucene things that affect two or more
of the Lucene projects.

That being said, a custom analyzer is the way to go to redact the
appropriate information.  If you have your files in some sort of
markup, you can easily create fields to contain the various metadata
that you have generated (e.g., history of violence).  One new thing
that I have been intrigued with for use in NLP applications is the new
TeeTokenFilter and SinkTokenizer, which can be used to siphon off
interesting tokens for other fields based on the tokens of an existing
field.  This can save on the need to reanalyze content over and over
for different analysis needs.  This is, however, advanced usage for
now (although I hope it will become more common).
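
The tee/sink idea can be pictured with a small plain-Java sketch. It deliberately does not use Lucene's TeeTokenFilter or SinkTokenizer classes, whose signatures are not reproduced here; TeeSketch, TeeIterator, and the "interesting" vocabulary are hypothetical names chosen for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Sketch only: one pass over a token stream feeds a main field and a second "sink" field. */
final class TeeSketch {

    /** Passes every token through unchanged and copies the interesting ones into a sink list. */
    static final class TeeIterator implements Iterator<String> {
        private final Iterator<String> source;
        private final Predicate<String> interesting;
        private final List<String> sink;

        TeeIterator(Iterator<String> source, Predicate<String> interesting, List<String> sink) {
            this.source = source;
            this.interesting = interesting;
            this.sink = sink;
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public String next() {
            String token = source.next();
            if (interesting.test(token)) {
                sink.add(token); // siphon off a copy for the second field
            }
            return token;
        }
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("history", "of", "violence", "and", "assault");
        // Assumed vocabulary of terms worth copying into a separate field.
        Set<String> concepts = new HashSet<>(Arrays.asList("violence", "assault"));

        List<String> conceptField = new ArrayList<>();
        List<String> bodyField = new ArrayList<>();
        Iterator<String> tee = new TeeIterator(tokens.iterator(), concepts::contains, conceptField);
        while (tee.hasNext()) {
            bodyField.add(tee.next()); // the main field still sees every token
        }

        System.out.println("body: " + bodyField);
        System.out.println("concepts: " + conceptField);
    }
}
```

The point of the pattern is that the content is tokenized once: the main field consumes the full stream while the second field is filled as a side effect, rather than by analyzing the same text again.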

Cheers
Grant

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ