Posted to user@hbase.apache.org by "Puri, Aseem" <As...@Honeywell.com> on 2009/08/18 13:36:23 UTC
Hbase for question answer modeling
Hello
I am working on a model in which I have to manage questions and their
answers.
I created two columns: one stores the question and the other stores its
answer.
People will ask questions, so when a new question comes in I want to
run a MapReduce job that checks whether a similar question already
exists.
If a similar question has already been asked, the MapReduce job will find
the existing question and return the answer already stored with it. I
also want to append the new question to the similar question that is
already in my table.
If the question is different, I will store it in a new row, and its
answer will be provided by an expert and stored.
I know Hadoop HBase has a write-once, read-many property, so I can't
append to existing data.
I have two other options:
1. Manage each new similar question with the help of timestamps.
2. When a new similar question comes in, create a new column qualifier
and store it in the same row.
Please suggest which approach I should follow, and which would help
my MapReduce operation, where I have to compare the new question
against every question that already exists. If some other approach
would help, please suggest it.
Regards
Aseem Puri
Re: Hbase for question answer modeling
Posted by Yabo-Arber Xu <ar...@gmail.com>.
Hi JG,
Thanks for your information. I will dig more.
Best,
Arber
On Thu, Aug 20, 2009 at 12:25 AM, Jonathan Gray <jl...@streamy.com> wrote:
> Arber,
>
> I don't have any links to papers handy, unfortunately. Quite honestly
> there is a TON of research on this subject. My recommendation is to dig
> around ACM, you can find many papers related to duplicate detection. If you
> don't have an ACM membership to the archives, digging around Google should
> still yield some results.
>
> Generally an online dupe detection system would take advantage of some kind
> of "signature" or dimensional reduction that permits a level of fuzzy
> matching. The implementations vary greatly depending on the domain, for
> example near-duplicate image detection is a heavily researched field as well
> as text-based.
>
> As I said, this topic is well beyond the scope of this mailing list. A bit
> of legwork should yield more papers than you can possibly read :)
>
> JG
Re: Hbase for question answer modeling
Posted by Jonathan Gray <jl...@streamy.com>.
Arber,
I don't have any links to papers handy, unfortunately. Quite honestly,
there is a TON of research on this subject. My recommendation is to dig
around ACM, where you can find many papers related to duplicate
detection. If you don't have an ACM membership for the archives, digging
around Google should still yield some results.
Generally an online dupe detection system would take advantage of some
kind of "signature" or dimensional reduction that permits a level of
fuzzy matching. The implementations vary greatly depending on the
domain; for example, near-duplicate image detection is as heavily
researched a field as text-based detection.
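As a rough illustration of a "signature" that permits fuzzy matching, here is a minimal SimHash-style sketch in Python. SimHash is one well-known technique of this kind; the thread does not say which method was meant, so the choice of technique, the function names, and the 64-bit size are all assumptions for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(text):
    """Compute a 64-bit SimHash-style fingerprint from the word tokens of text."""
    weights = [0] * BITS
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Texts that share most of their words tend to produce fingerprints with a small Hamming distance, so a distance threshold (to be tuned on real data) can stand in for a full pairwise comparison.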
As I said, this topic is well beyond the scope of this mailing list. A
bit of legwork should yield more papers than you can possibly read :)
JG
Re: Hbase for question answer modeling
Posted by Yabo-Arber Xu <ar...@gmail.com>.
Hi JG,
Sorry for interrupting the ongoing topic, but I am quite interested in the
online dup detection method you mentioned. Could you please elaborate on it
a bit, or point me to some links that I can follow up on?
Best,
Arber
On Wed, Aug 19, 2009 at 1:51 AM, Jonathan Gray <jl...@streamy.com> wrote:
> You didn't talk much about how you plan on doing dupe-detection of
> questions, but there are some interesting ways to generate signatures which
> could turn into your row keys, then you could actually do some kind of
> online duplicate detecting of already answered questions. That's beyond the
> scope of this mailing list, however.
>
Re: Hbase for question answer modeling
Posted by Jonathan Gray <jl...@streamy.com>.
I'm having a little difficulty totally understanding your requirements,
but let me take a stab.
You basically want a mapping from 1 to N QUESTIONS to a single ANSWER?
When a new question comes in, you run an MR job that scans all existing
questions and does some kind of similarity metric against them to try to
find existing matches, and if one is found, add the new question to the
list of questions for that answer, and return the answer.
The first big question I have is whether you expect this
question-matching query to run in real time, or whether this is an
offline batch process. Remember, MapReduce is not for real-time queries.
Even VERY simple jobs will always run for several seconds, if not tens
of seconds.
But it seems like you would need to scan the entire table, and run
something like a cosine similarity against every single question in it.
That's going to be a much longer running job, depending on how many
questions already exist, and certainly not real-time.
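To make the cost of that full scan concrete, here is a minimal single-machine sketch of cosine similarity over raw word counts, compared against every stored question. It is plain Python, not a MapReduce job, and the tokenizer, the function names, and the 0.8 threshold are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity of two texts using raw term-frequency vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(new_question, existing_questions, threshold=0.8):
    """Scan every stored question; return (question, score) or None."""
    best = None
    for q in existing_questions:
        score = cosine_similarity(new_question, q)
        if score >= threshold and (best is None or score > best[1]):
            best = (q, score)
    return best
```

Every call to `best_match` touches every stored question, which is why the running time grows with the size of the table.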
As for actually storing the questions, you should create two column
families, "questions" and "answer". For each question, you insert a
column into the "questions" family. The "answer" family would always
have a single column (only a single answer, right?). Then you can very
easily query for all questions, and they will be grouped by row (I'm not
sure what your row key will be).
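The layout described above can be sketched with nested Python dictionaries (row key, then column family, then qualifier). This is only a model of the row structure, not the HBase client API; the qualifier names `q1`, `q2`, ... and `text` are hypothetical.

```python
def _row(table, row_key):
    """Return the row for row_key, creating both column families if needed."""
    return table.setdefault(row_key, {"questions": {}, "answer": {}})


def add_question(table, row_key, question):
    """Store a question as a new column (qualifier) in the 'questions' family."""
    row = _row(table, row_key)
    # Hypothetical naming scheme: q1, q2, ... in insertion order.
    qualifier = "q%d" % (len(row["questions"]) + 1)
    row["questions"][qualifier] = question
    return qualifier


def set_answer(table, row_key, answer):
    """Store the single answer for this row in the 'answer' family."""
    # The qualifier name 'text' is an assumption for this sketch.
    _row(table, row_key)["answer"]["text"] = answer


def questions_for(table, row_key):
    """Return all question variants stored in one row."""
    return list(table[row_key]["questions"].values())
```

Adding a similar question becomes a new column in an existing row rather than an append to an existing value, so nothing already stored has to be rewritten.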
You didn't talk much about how you plan on doing dupe-detection of
questions, but there are some interesting ways to generate signatures
which could turn into your row keys, then you could actually do some
kind of online duplicate detecting of already answered questions.
That's beyond the scope of this mailing list, however.
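As one hedged illustration of a signature that could serve as a row key, the sketch below normalizes a question (lowercase, drop a few stopwords, sort the unique words) and hashes the result. The stopword list, the use of SHA-1, and the function names are assumptions, and this only catches questions that are identical after normalization, not fuzzier matches.

```python
import hashlib
import re

# A tiny illustrative stopword list; a real system would need a better one.
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "do", "does", "i", "to", "of", "in"})


def question_signature(question):
    """Return a hex digest that is stable under case, word order and stopwords."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    normalized = sorted(set(t for t in tokens if t not in STOPWORDS))
    return hashlib.sha1(" ".join(normalized).encode("utf-8")).hexdigest()


def lookup(table, question):
    """Look up a stored row by signature; 'table' is a plain dict in this sketch."""
    return table.get(question_signature(question))
```

With the signature as the row key, checking whether a question has already been answered is a single keyed lookup instead of a scan over the whole table.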
Hope that helps. If you need more help, please provide more detail.
JG