
Posted to solr-user@lucene.apache.org by lee carroll <le...@googlemail.com> on 2011/01/10 18:42:34 UTC

first steps in nlp

Hi

I'm indexing a set of documents which have a conversational writing style.
In particular, the authors are very fond of listing facts in a variety of
ways (this is to keep a human reader interested), but it's causing my index
trouble.

For example, instead of listing facts like "the house is white, the castle is
pretty",

we get "the house is the complete opposite of black and the castle is not
ugly".

What are the best approaches to resolving these sorts of issues? Even just
handling "not" correctly would be a good start.


cheers lee c

Re: first steps in nlp

Posted by lee carroll <le...@googlemail.com>.

Just to be more explicit in terms of using synonyms. Our thinking was
something like:

1 analyse the texts for patterns such as "not x" and list these out
2 in a synonyms txt file, list what are in effect antonyms, e.g.
      not pretty -> ugly
      not ugly -> pretty
      not lively -> quiet
      not very nice -> ugly
      etc
3 use a synonym filter referencing the antonyms at index time only.
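
A minimal sketch of the antonym idea in the steps above, written as plain
Python pre-processing rather than as a Solr synonym filter. The table and the
function name rewrite_negations are assumptions made for illustration; the
entries just mirror the examples listed in step 2.

```python
import re

# Hypothetical antonym table; the entries mirror the examples in step 2.
ANTONYMS = {
    "not very nice": "ugly",
    "not pretty": "ugly",
    "not ugly": "pretty",
    "not lively": "quiet",
}

# One alternation, longest phrase first, matched on word boundaries so that
# a longer phrase is always tried before any shorter one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(ANTONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def rewrite_negations(text: str) -> str:
    """Replace each known "not x" phrase with its listed antonym."""
    return _PATTERN.sub(lambda m: ANTONYMS[m.group(1).lower()], text)


if __name__ == "__main__":
    print(rewrite_negations("the castle is not ugly"))
```

Anything not in the table passes through unchanged, which is also the
weakness of the approach: it only covers the phrases somebody has listed.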

However, the language in the text is probably more complex than the simple
phrases above, and NLP seems to promise a lot :-) Should we venture down
that route instead?

cheers lee c



Re: first steps in nlp

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 10, 2011, at 5:04 PM, lee carroll wrote:

> Hi Grant,
> 
> Its a search relevancy problem. For example:
> 
> a document about london reads like
> 
> London is not very good for a peaceful break.
> 
> we analyse this at the (i can't remember the technical term) is it lexical
> level? (bloody hell i think you may have wrote the book !) anyway which
> produces tokens in our index of say
> 
> "London good peaceful holiday"
> 
> users search for cities which would be nice for them to take a holiday in
> say the search is
> "good for a peaceful break"
> 
> and bang london is top. talk about a relevancy problem :-)

First question: why are you getting rid of "not"?  Despite its reputation as a "stopword", it carries a significant amount of meaning for you.  If you keep it, you could probably do some phrase-based searching that would help in some cases.
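
To illustrate the point about keeping "not", here is a minimal sketch in plain Python. It is not Solr's analysis chain: the stopword list, the function names analyse and contains_phrase, and the choice to treat "very" as a stopword are all assumptions, and unlike a real phrase query it ignores the position gaps left by removed stopwords.

```python
import re

# Assumed stopword list for illustration; "not" is deliberately absent.
STOPWORDS = {"a", "an", "the", "is", "for", "very"}


def analyse(text: str) -> list[str]:
    """Lowercase, split on non-letters, drop stopwords (but keep "not")."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def contains_phrase(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """True if query_tokens occur as a contiguous run in doc_tokens."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


if __name__ == "__main__":
    doc = analyse("London is not very good for a peaceful break.")
    print(doc)
    # With "not" retained, a phrase check can flag the negated document.
    print(contains_phrase(doc, analyse("not good")))
```

With "not" still in the index, a phrase check for "not good" can be used to demote or exclude the London document instead of ranking it first.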

> 
> now i was thinking of using phrase matches in the synonyms file but is that
> the best approach or could nlp help here?

I suppose it could.  During indexing, you could detect that the sentence has a negative connotation and change it to "bad for a peaceful break" or something like that.  I'm not aware of any system that does that.  You could also use some sentiment analysis to determine that the sentence is negative and then tag it as such, so that your query can take that into account.  Payloads and/or marker tokens would likely help here.
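
One way the marker-token idea could look, as a rough sketch in plain Python (payloads are not covered). The cue words, the three-token scope and the NOT_ prefix are all assumptions chosen for illustration, not something Solr provides out of the box.

```python
import re

NEGATORS = {"not", "no", "never"}  # assumed negation cue words
PUNCT = set(".,;!?")               # punctuation ends the negation scope
SCOPE = 3                          # assumed: mark up to 3 tokens after a cue


def mark_negation(text: str) -> list[str]:
    """Prefix the tokens that follow a negation cue with NOT_.

    The cue word and punctuation are dropped from the output.
    """
    tokens = re.findall(r"[a-z]+|[.,;!?]", text.lower())
    out: list[str] = []
    remaining = 0  # tokens still inside the current negation scope
    for tok in tokens:
        if tok in NEGATORS:
            remaining = SCOPE
        elif tok in PUNCT:
            remaining = 0
        elif remaining > 0:
            out.append("NOT_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    print(mark_negation("London is not very good for a peaceful break."))
```

Indexed this way, the London sentence carries NOT_good rather than good, so a plain query for "good" no longer matches that token, and a query could choose to penalise NOT_-prefixed terms.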

-Grant



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search

RE: first steps in nlp

Posted by Hong-Thai Nguyen <Ho...@polyspot.com>.

Hi,

Absolutely, this problem is squarely within the scope of NLP. Handling negation ("not"), passive voice and tense (past, future, ...) needs more advanced linguistic analysis (morpho-syntactic, at the phrase level) than simple tokenization enhanced with stemming or lemmatization. The output of this kind of analysis is normally a tree-like structure.
Beware that this work is quite expensive.

Best,
-------------------
Hong-Thai


Re: first steps in nlp

Posted by lee carroll <le...@googlemail.com>.

Hi Grant,

It's a search relevancy problem. For example:

A document about London reads like

London is not very good for a peaceful break.

We analyse this at the lexical level (I can't remember the technical term,
is that it? Bloody hell, I think you may have written the book!), which
produces tokens in our index of, say,

"London good peaceful holiday"

Users search for cities which would be nice for them to take a holiday in.
Say the search is

"good for a peaceful break"

and bang, London is top. Talk about a relevancy problem :-)

Now, I was thinking of using phrase matches in the synonyms file, but is that
the best approach, or could NLP help here?

cheers lee





Re: first steps in nlp

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 10, 2011, at 12:42 PM, lee carroll wrote:

> Hi
> 
> I'm indexing a set of documents which have a conversational writing style.
> In particular the authors are very fond
> of listing facts in a variety of ways (this is to keep a human reader
> interested) but its causing my index trouble.
> 
> For example instead of listing facts like: the house is white, the castle is
> pretty.
> 
> We get the house is the complete opposite of black and the castle is not
> ugly.
> 
> What are the best approaches to resolve these sorts of issues. Even if its
> just handling "not" correctly would be a good start
> 

Hmm, good problem.  I guess I'd start by stepping back and asking: what is the problem you are trying to solve?  You've stated, I think, one half of the problem, namely that your authors have a conversational style, but you haven't stated what your users are expecting to do with this information.  Is this a pure search app?  Is it something else that is just backed by Solr but where the user would never do a search?

Do you have a relevance problem?  Also, what is your notion of handling "not" correctly?  In other words, more details are welcome!

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com