Posted to user@mahout.apache.org by Stefan Henß <st...@googlemail.com> on 2011/06/09 22:18:07 UTC

Re: Automatically extracted Mahout FAQs

Hello everyone,

A few weeks ago I introduced some research we are currently doing. 
It takes a large corpus of mailing lists, clusters the threads using 
LDA, and uses the resulting models to select the most relevant Q/As 
from each cluster to form topic-focused FAQs.
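
A minimal sketch of the selection idea, not the actual system: it assumes an LDA model has already assigned every thread a topic mixture (a list of weights, one per topic). Each thread is put in the cluster of its dominant topic, and the threads with the highest weight for that topic are kept as FAQ candidates. The function name, thread IDs, and weights are hypothetical.

```python
from collections import defaultdict


def top_threads_per_topic(thread_topics, per_topic=2):
    """Group threads by dominant topic and keep the strongest ones.

    thread_topics maps a thread id to its topic mixture, i.e. a list of
    weights (one per topic) as an LDA model would produce.
    """
    clusters = defaultdict(list)
    for thread_id, mixture in thread_topics.items():
        # The dominant topic is the index of the largest weight.
        topic = max(range(len(mixture)), key=mixture.__getitem__)
        clusters[topic].append((mixture[topic], thread_id))

    # Within each cluster, rank threads by how strongly they belong to it.
    return {
        topic: [tid for _, tid in sorted(members, reverse=True)[:per_topic]]
        for topic, members in clusters.items()
    }


# Hypothetical mixtures standing in for real LDA output.
thread_topics = {
    "thread-a": [0.90, 0.10],
    "thread-b": [0.60, 0.40],
    "thread-c": [0.20, 0.80],
    "thread-d": [0.75, 0.25],
}
faq_candidates = top_threads_per_topic(thread_topics)
```

Ranking by the dominant topic's weight is only one possible relevance measure; a real system would likely combine it with other signals such as formatting quality.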

We’ve now created a tool for reviewing the generated FAQs. Within 
approx. 10 minutes, you can select and reformulate good question/answer 
pairs found by our system. Eventually, you will be able to download the 
FAQ in HTML or XML. We will also use your selected questions to further 
evaluate and improve our system.

You can review the FAQ generated from the cluster mainly relating to 
Mahout at http://faqcluster.com/review/1. Selections persist beyond the 
session, so the mailing list can collaborate on the review.

We are eagerly looking forward to receiving your feedback on the review 
process and system.

Yours sincerely,

Stefan and Martin
University of Darmstadt, Germany


On 23.02.2011 06:15, Stefan Henß wrote:
> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to 
> automatically extract FAQs from unstructured data.
>
> For this I've built a system that automatically performs the following:
> - Load thousands of conversations from forums and mailing lists 
> (ignoring the categories given there).
> - Build a categorization based solely on the conversations' texts (by 
> clustering).
> - Pick the best-modelled categories as the basis for one FAQ each.
> - For each question (first entry in a conversation), find the best 
> reply among its answers.
> - Select the most relevant and well-formatted question/answer pairs 
> for each FAQ.
>
> Most of the steps rely almost completely on the data from the 
> categorization step, which is obtained using the latent Dirichlet 
> allocation model.
>
> For the evaluation part I'd like to ask you to have a look at one 
> or two FAQs and maybe comment on how well the questions 
> matched the FAQ's title, how relevant they were, etc.
>
>
> Here's the direct link to the Mahout FAQs: 
> http://faqcluster.com/mahout-data
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
>
> Thanks for your help
>
> Stefan
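
For the best-reply step in the list quoted above, one plausible approach (an assumption for illustration, not necessarily what the system does) is to compare topic mixtures: the reply whose mixture is closest to the question's is taken as the best answer. The sketch below uses cosine similarity; all names and numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_reply(question_mix, replies):
    """Return the id of the reply most topically similar to the question.

    replies is a list of (reply_id, topic_mixture) pairs.
    """
    return max(replies, key=lambda r: cosine(question_mix, r[1]))[0]


# Hypothetical topic mixtures for one question and its three replies.
question_mix = [0.8, 0.2]
replies = [
    ("reply-1", [0.1, 0.9]),
    ("reply-2", [0.7, 0.3]),
    ("reply-3", [0.5, 0.5]),
]
chosen = best_reply(question_mix, replies)
```

Topical similarity alone cannot tell a correct answer from an on-topic but unhelpful one, so this would only serve as a first filter.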

Re: Automatically extracted Mahout FAQs

Posted by Lance Norskog <go...@gmail.com>.

This is just amazingly wonderful.




-- 
Lance Norskog
goksron@gmail.com