Posted to user@couchdb.apache.org by Stefan Henß <st...@googlemail.com> on 2011/02/23 05:24:48 UTC
Automatically extracted CouchDB FAQs
Hi everybody,
I'm currently doing research for my bachelor thesis on how to
automatically extract FAQs from unstructured data.
For this I've built a system that automatically does the following:
- Load thousands of conversations from forums and mailing lists (ignoring
the categories that exist there).
- Build a categorization based solely on the conversations' texts (by
clustering).
- Pick the best-modelled categories as the basis for one FAQ each.
- For each question (the first entry in a conversation), find the best
reply among its answers.
- Select the most relevant and well-formatted question/answer pairs for
each FAQ.
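As a rough illustration of the "find the best reply" step, here is a minimal sketch. It assumes that the best reply is the one whose bag-of-words is most similar (by cosine similarity) to the question; that scoring rule and the names `tokenize`, `cosine` and `best_reply` are assumptions made for this example, not a description of the actual system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reply(question, replies):
    """Return the reply most similar to the question, or None if there are none.

    Assumption: lexical overlap with the question is a usable proxy for
    answer quality. A real system would likely use more signals.
    """
    if not replies:
        return None
    q = Counter(tokenize(question))
    return max(replies, key=lambda r: cosine(q, Counter(tokenize(r))))
```

Calling `best_reply(first_post, answers)` on one conversation returns the reply that shares the most vocabulary with the question, so a short "Thanks!" message scores zero and is never preferred over a substantive answer.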
For the evaluation part I'd like to ask you to have a look at one or
two FAQs and maybe comment on how well the questions matched
the FAQ's title, how relevant they were, etc.
Here's the direct link to the CouchDB FAQs:
http://faqcluster.com/couchdb-view-document-doc-couch
And here is, in my opinion, a quite good example:
http://faqcluster.com/question1516894006
(There are some other interesting FAQs as well at http://faqcluster.com/)
Thanks for your help
Stefan
Re: Automatically extracted CouchDB FAQs
Posted by Stefan Henß <st...@googlemail.com>.
Hi Eli,
the subtitle is definitely misleading. It should only give an idea of
the topics contained in the FAQ, not what it is limited to :-)
I do remove generic English terms before the clustering, but not
mailing-list-specific terms. In fact those are the ones I'm trying to find :-)
In order to validate that the clustering is working properly, I use
threads from a bunch of different mailing lists (currently 8) as the data
basis and assign no labels to them. So those common words are my best
hint in "rebuilding" the original mailing lists.
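One way to quantify how well the clusters "rebuild" the original lists is cluster purity. The sketch below is only an illustration under that assumption: purity is a standard measure, but nothing here says it is the measure actually used, and the function name `cluster_purity` is made up for the example.

```python
from collections import Counter


def cluster_purity(cluster_ids, source_lists):
    """Fraction of threads that fall in a cluster dominated by their own list.

    cluster_ids[i] is the cluster assigned to thread i, and source_lists[i]
    is the mailing list the thread really came from (hidden during clustering).
    """
    if len(cluster_ids) != len(source_lists):
        raise ValueError("cluster_ids and source_lists must have equal length")
    if not cluster_ids:
        raise ValueError("need at least one thread")

    # Count, per cluster, how many threads came from each mailing list.
    per_cluster = {}
    for cid, src in zip(cluster_ids, source_lists):
        per_cluster.setdefault(cid, Counter())[src] += 1

    # Each cluster contributes the size of its largest single-list group.
    majority = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return majority / len(cluster_ids)
```

A purity of 1.0 would mean every cluster contains threads from exactly one mailing list.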
But I still agree with your point. After the first clusters are found
(hopefully including a 100% precise CouchDB FAQ), I run the mining
algorithm again on the set of threads for each cluster to generate the
second-level categorization. At this point I should definitely remove
words that are too generic for this cluster, as they can only distort
the further analysis. Thanks for pointing this out.
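A minimal sketch of that filtering step, assuming "too generic" means "occurs in more than a given share of the cluster's threads". The 80% threshold and the names `generic_terms` and `strip_generic` are assumptions for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generic_terms(threads, max_doc_ratio=0.8):
    """Return terms that occur in more than max_doc_ratio of the threads."""
    doc_freq = Counter()
    for text in threads:
        # Count each term once per thread (document frequency).
        doc_freq.update(set(tokenize(text)))
    limit = max_doc_ratio * len(threads)
    return {term for term, n in doc_freq.items() if n > limit}


def strip_generic(threads, max_doc_ratio=0.8):
    """Tokenize each thread and drop the cluster-wide generic terms."""
    generic = generic_terms(threads, max_doc_ratio)
    return [[t for t in tokenize(text) if t not in generic] for text in threads]
```

Run on the threads of a single first-level cluster, this drops terms such as "couchdb" that appear almost everywhere in that cluster, so the second-level clustering has to rely on the more specific vocabulary.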
Best regards,
Stefan
On 23.02.2011 21:10, Eli Stevens (Gmail) wrote:
> Interesting project. :)
>
> I didn't get a very strong sense of correlation between the topic
> categories and the questions in them. For example,
>
> http://faqcluster.com/couchdb-replication-couch-databases-database
> "Questions & Answers about Couchdb, Couch, Replication, Databases and Database."
>
> Had the following question:
>
> http://faqcluster.com/question1996757514
>
> "I'm looking for a recommendation for ruby gem that will enable me to
> use couchdb from rails. I'd like to have couch documents be modeled by
> ActiveRecord."
>
> This didn't have any mention of replication (or databases), so I can
> only guess that it was clustering on "couch" or "couchdb".
>
> Do you do any screening of common terms from the clustering? I'd
> imagine that if you looked at the user@couchdb mailing list, you could
> find a list of very common terms (like couch, couchdb, database, etc.)
> and discard or ignore those when trying to cluster the messages (in
> the same way that words like "the" and "and" shouldn't be used).
> Basically, a per-mailing-list set of generic terms.
>
> The questions and answers themselves seemed to be a nice, readable "I
> have X problem" "here is an answer" pair, so that was cool. :)
>
> HTH,
> Eli
Re: Automatically extracted CouchDB FAQs
Posted by "Eli Stevens (Gmail)" <wi...@gmail.com>.
Interesting project. :)
I didn't get a very strong sense of correlation between the topic
categories and the questions in them. For example,
http://faqcluster.com/couchdb-replication-couch-databases-database
"Questions & Answers about Couchdb, Couch, Replication, Databases and Database."
Had the following question:
http://faqcluster.com/question1996757514
"I'm looking for a recommendation for ruby gem that will enable me to
use couchdb from rails. I'd like to have couch documents be modeled by
ActiveRecord."
This didn't have any mention of replication (or databases), so I can
only guess that it was clustering on "couch" or "couchdb".
Do you do any screening of common terms from the clustering? I'd
imagine that if you looked at the user@couchdb mailing list, you could
find a list of very common terms (like couch, couchdb, database, etc.)
and discard or ignore those when trying to cluster the messages (in
the same way that words like "the" and "and" shouldn't be used).
Basically, a per-mailing-list set of generic terms.
The questions and answers themselves seemed to be a nice, readable "I
have X problem" "here is an answer" pair, so that was cool. :)
HTH,
Eli