You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Seid Mohammed <se...@gmail.com> on 2009/03/05 15:49:10 UTC

similarity function

For my work, I have read an article stating that " Answer type can be
automatically constructed by Indexing Different Questions and Answer
types. Later, when an unseen question apears, answer type for this
question will be found with the help of 'similarity function'
computation"

so I am clear with the arguement above. my problem is,
1. how can I index individual questions and Answer types as is ( not tokenized
2. how can I calculate the similarity between indexed questions and
and unseen questions (question of any type that can be asked latter)

to make things clear: the senario is
1. Who is the president of UN
              Answer type <Person>
2. When will the presidency of Meles Zenawi hold?
          Answer Type <Date>
these two will be indexed and
and later an unseen question like
who is the president of Kenya
          should match the first question and so that will have answer
type of <Person>

I appricate any help

Seid M

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: similarity function

Posted by patrick o'leary <pj...@pjaol.com>.
Sounds like your most difficult part will be the question parser using POS.

This is kind of old school but use something like the AliceBot AIML library
http://en.wikipedia.org/wiki/AIML

Where the subjective terms can be extracted from the questions, and indexed
separately.

Or as Grant and others suggest use OpenNLP (which rocks) or LingPipe
(LingPipe license is a little bit of a pain)
for entity extraction.

An interesting way to look at the data would be to construct 3 fields,
Original_Question, Question_base, Subject

Doc:
Original_Question: Who is the president of the UN
Question_base: Who is the president of
Question_base: Who is
Subject: the president of the UN
Subject: the president
Subject: the UN
/Doc

And similarity can be somewhat easier to calculate with similar question
bases, subjects, etc

P


On Thu, Mar 5, 2009 at 3:05 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Hi Seid,
>
> Do you have a reference for the article?  I've done some QA in my day, but
> don't recall reading that one.
>
> At any rate, I do think it is possible to do what you are after.  See
> below.
>
> On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:
>
>  For my work, I have read an article stating that " Answer type can be
>> automatically constructed by Indexing Different Questions and Answer
>> types. Later, when an unseen question apears, answer type for this
>> question will be found with the help of 'similarity function'
>> computation"
>>
>> so I am clear with the arguement above. my problem is,
>> 1. how can I index individual questions and Answer types as is ( not
>> tokenized
>>
>
> I'm not sure you want this, but when constructing your Field, just use the
> NOT_ANALYZED option.
>
>
>> 2. how can I calculate the similarity between indexed questions and
>> and unseen questions (question of any type that can be asked latter)
>>
>
> In line with #1, I think you might be better off to actually tokenize the
> question as one one field, and the answer type as a second field.  Then, you
> can let Lucene calculate similarity via it's normal query mechanisms.  In
> this case, I would like try experimenting with things like: exact match,
> phrase queries with slop, etc.  That way, not only can you match "Who is the
> president of UN" but you might also match on things that are a bit fuzzier.
>  To do this, you might need to have several fields per document with
> variations.  I could also see using Lucene's payload mechanism as well.
>
> But, as Vasu said, you will likely need other parts too, like OpenNLP.
>
> HTH,
> Grant
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: similarity function

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Seid,

Do you have a reference for the article?  I've done some QA in my day,  
but don't recall reading that one.

At any rate, I do think it is possible to do what you are after.  See  
below.

On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:

> For my work, I have read an article stating that " Answer type can be
> automatically constructed by Indexing Different Questions and Answer
> types. Later, when an unseen question apears, answer type for this
> question will be found with the help of 'similarity function'
> computation"
>
> so I am clear with the arguement above. my problem is,
> 1. how can I index individual questions and Answer types as is ( not  
> tokenized

I'm not sure you want this, but when constructing your Field, just use  
the NOT_ANALYZED option.

>
> 2. how can I calculate the similarity between indexed questions and
> and unseen questions (question of any type that can be asked latter)

In line with #1, I think you might be better off to actually tokenize  
the question as one one field, and the answer type as a second field.   
Then, you can let Lucene calculate similarity via it's normal query  
mechanisms.  In this case, I would like try experimenting with things  
like: exact match, phrase queries with slop, etc.  That way, not only  
can you match "Who is the president of UN" but you might also match on  
things that are a bit fuzzier.  To do this, you might need to have  
several fields per document with variations.  I could also see using  
Lucene's payload mechanism as well.

But, as Vasu said, you will likely need other parts too, like OpenNLP.

HTH,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: similarity function

Posted by Vasudevan Comandur <vc...@gmail.com>.
Hi,

   The very fact that you are trying to answer factoid questions to start
with, it is better to use OpenNLP components to identify
   NER (Named Entity recognition) in the document and use those tags as part
of your indexing process.

REgards
 Vasu


On Thu, Mar 5, 2009 at 8:19 PM, Seid Mohammed <se...@gmail.com> wrote:

> For my work, I have read an article stating that " Answer type can be
> automatically constructed by Indexing Different Questions and Answer
> types. Later, when an unseen question apears, answer type for this
> question will be found with the help of 'similarity function'
> computation"
>
> so I am clear with the arguement above. my problem is,
> 1. how can I index individual questions and Answer types as is ( not
> tokenized
> 2. how can I calculate the similarity between indexed questions and
> and unseen questions (question of any type that can be asked latter)
>
> to make things clear: the senario is
> 1. Who is the president of UN
>              Answer type <Person>
> 2. When will the presidency of Meles Zenawi hold?
>          Answer Type <Date>
> these two will be indexed and
> and later an unseen question like
> who is the president of Kenya
>          should match the first question and so that will have answer
> type of <Person>
>
> I appricate any help
>
> Seid M
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>