Posted to user@couchdb.apache.org by Stefan Henß <st...@googlemail.com> on 2011/02/23 05:24:48 UTC

Automatically extracted CouchDB FAQs

Hi everybody,

I'm currently doing research for my bachelor thesis on how to 
automatically extract FAQs from unstructured data.

For this I've built a system that automatically performs the following 
steps:
- Load thousands of conversations from forums and mailing lists 
(ignoring the categories already used there).
- Build a categorization based solely on the conversations' texts (by 
clustering).
- Pick the best-modelled categories, each as the basis for one FAQ.
- For each question (the first entry in a conversation), find the best 
reply among its answers.
- Select the most relevant and best-formatted question/answer pairs for 
each FAQ.
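
The clustering step above can be illustrated with a minimal sketch. The
post does not say which features or clustering algorithm the system uses,
so TF-IDF weighting, cosine similarity and the greedy grouping below are
assumptions chosen for brevity, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_clusters(texts, threshold=0.1):
    """Put each text into the first cluster whose first member is similar
    enough; otherwise open a new cluster. Returns lists of text indices."""
    vectors = tfidf_vectors(texts)
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up thread openers, for illustration only.
example_threads = [
    "How do I set up replication between two databases?",
    "Replication between databases fails with a timeout",
    "Which ruby gem works with rails and ActiveRecord?",
    "Looking for a rails gem for ActiveRecord models",
]
print(greedy_clusters(example_threads))
```

A real system would more likely use an established clustering method on
far more data; the sketch only shows how threads can be grouped from
their texts alone, without using the source categories.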

For the evaluation, I'd like to ask you to have a look at one or two 
FAQs and perhaps comment on how well the questions matched the FAQ's 
title, how relevant they were, etc.


Here's the direct link to the CouchDB FAQs: 
http://faqcluster.com/couchdb-view-document-doc-couch

And here is what I consider quite a good example: 
http://faqcluster.com/question1516894006

(There are some other interesting FAQs as well at http://faqcluster.com/)


Thanks for your help

Stefan

Re: Automatically extracted CouchDB FAQs

Posted by Stefan Henß <st...@googlemail.com>.
Hi Eli,

the subtitle is definitely misleading. It should only give an idea of 
the topics contained in the FAQ, not what it is limited to :-)

I do remove generic English terms before the clustering, but not 
mailing-list-specific terms. In fact those are the ones I'm trying to 
find :-) To validate that the clustering works properly, I use threads 
from a number of different mailing lists (currently 8) as the data 
basis and assign no labels to them. So those common words are my best 
hint for "rebuilding" the original mailing lists.
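
One way to quantify how well the clusters "rebuild" the original lists is
cluster purity. This is only a sketch of a standard measure; the thread
does not say which metric is actually used, and the function name and the
list names in the example are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, source_lists):
    """Fraction of threads that belong to the majority source list of
    their cluster. 1.0 means every cluster maps to exactly one list."""
    if len(cluster_ids) != len(source_lists):
        raise ValueError("need one cluster id and one source list per thread")
    if not cluster_ids:
        return 0.0
    # Count, per cluster, how many threads came from each mailing list.
    by_cluster = {}
    for cid, source in zip(cluster_ids, source_lists):
        by_cluster.setdefault(cid, Counter())[source] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hidden labels (the list each thread came from) are only used here,
# after clustering, never as input to the clustering itself.
clusters = [0, 0, 0, 1, 1, 1]
sources = ["couchdb", "couchdb", "other-list",
           "other-list", "other-list", "other-list"]
print(cluster_purity(clusters, sources))
```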

But I still agree with your point. After the first clusters are found 
(hopefully including a 100% precise CouchDB FAQ), I run the mining 
algorithm again on the set of threads in each cluster to generate the 
second-level categorization. At this point I should definitely remove 
words that are too generic for this cluster, as they can only distort 
the further analysis. Thanks for pointing this out.

Best regards,
Stefan

On 23.02.2011 21:10, Eli Stevens (Gmail) wrote:
> Interesting project.  :)
>
> I didn't get a very strong sense of correlation between the topic
> categories and the questions in them.  For example,
>
> http://faqcluster.com/couchdb-replication-couch-databases-database
> "Questions & Answers about Couchdb, Couch, Replication, Databases and Database."
>
> Had the following question:
>
> http://faqcluster.com/question1996757514
>
> "I'm looking for a recommendation for ruby gem that will enable me to
> use couchdb from rails. I'd like to have couch documents be modeled by
> ActiveRecord."
>
> This didn't have any mention of replication (or databases), so I can
> only guess that it was clustering on "couch" or "couchdb".
>
> Do you do any screening of common terms from the clustering?  I'd
> imagine that if you looked at the user@couchdb mailing list, you could
> find a list of very common terms (like couch, couchdb, database, etc.)
> and discard or ignore those when trying to cluster the messages (in
> the same way that words like "the" and "and" shouldn't be used).
> Basically, a per-mailing-list set of generic terms.
>
> The questions and answers themselves seemed to be a nice, readable "I
> have X problem" "here is an answer" pair, so that was cool.  :)
>
> HTH,
> Eli


Re: Automatically extracted CouchDB FAQs

Posted by "Eli Stevens (Gmail)" <wi...@gmail.com>.
Interesting project.  :)

I didn't get a very strong sense of correlation between the topic
categories and the questions in them.  For example,

http://faqcluster.com/couchdb-replication-couch-databases-database
"Questions & Answers about Couchdb, Couch, Replication, Databases and Database."

Had the following question:

http://faqcluster.com/question1996757514

"I'm looking for a recommendation for ruby gem that will enable me to
use couchdb from rails. I'd like to have couch documents be modeled by
ActiveRecord."

This didn't have any mention of replication (or databases), so I can
only guess that it was clustering on "couch" or "couchdb".

Do you do any screening of common terms from the clustering?  I'd
imagine that if you looked at the user@couchdb mailing list, you could
find a list of very common terms (like couch, couchdb, database, etc.)
and discard or ignore those when trying to cluster the messages (in
the same way that words like "the" and "and" shouldn't be used).
Basically, a per-mailing-list set of generic terms.
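
A minimal sketch of that idea, assuming a simple document-frequency
cutoff: any term that appears in at least a given fraction of a list's
messages is treated as generic for that list. The function names and the
0.75 cutoff are hypothetical, not something the FAQ system is known to
use.

```python
import re
from collections import Counter


def words(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def generic_terms(messages, min_fraction=0.75):
    """Terms that occur in at least min_fraction of the messages."""
    if not messages:
        return set()
    doc_freq = Counter()
    for message in messages:
        # Count each term once per message, however often it repeats.
        doc_freq.update(set(words(message)))
    cutoff = min_fraction * len(messages)
    return {term for term, count in doc_freq.items() if count >= cutoff}


def strip_generic(message, generic):
    """Tokens of the message with the list-specific generic terms removed."""
    return [term for term in words(message) if term not in generic]


# Made-up messages standing in for one mailing list.
couchdb_messages = [
    "couchdb replication between two couchdb servers",
    "couchdb view returns no rows",
    "ruby gem for couchdb",
    "couchdb view with reduce",
]
generic = generic_terms(couchdb_messages)
print(generic)
print(strip_generic(couchdb_messages[0], generic))
```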

The questions and answers themselves seemed to be a nice, readable "I
have X problem" "here is an answer" pair, so that was cool.  :)

HTH,
Eli

