Posted to user@hbase.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/03/01 19:03:58 UTC

Re: Stack Overflow?

Hi,


----- Original Message ----
> From: Stack <st...@duboce.net>

> > Regarding this:
> >> Going  forward I for one was going to try and mine our archives more at
> >>  least for dealing with the repeats.
> >
> > Do you mean manually, or did  you have something more sophisticated in mind?
> 
> No sophistication  other than my use of the snazzy hadoop-search.com tool.
> 
> Do you have  something in mind?  Could we be making better use of the
> sematext  summaries?

Hm... we already index HBase and other Digests on search-hadoop.com.
I was thinking more along the lines of mining the ML archives and doing 
automatic Q&A extraction.
I don't know how difficult it would be.  Maybe the input would be too noisy 
(people don't ask proper questions, answers are not full sentences, quote 
characters prefixing lines from old messages add a layer of complexity...), but 
that's what I thought you might have meant.
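
As a rough illustration of the quote-prefix problem mentioned above, here is a minimal sketch that measures how deeply a line is quoted and keeps only the lines a sender wrote. The function names are hypothetical, and it assumes plain '>' markers; attribution lines ("On ... wrote:") and re-wrapped quotes would need extra handling.

```python
import re

# A leading run of '>' quote markers, each optionally followed by one space.
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s?)+)")


def split_quote_depth(line):
    """Return (depth, text): the number of '>' markers and the remaining text."""
    match = QUOTE_PREFIX.match(line)
    if not match:
        return 0, line
    depth = match.group(1).count(">")
    return depth, line[match.end():]


def new_text_only(message):
    """Keep only unquoted lines, i.e. the text added by the sender."""
    kept = []
    for line in message.splitlines():
        depth, text = split_quote_depth(line)
        if depth == 0:
            kept.append(text)
    return kept
```

Depth 0 would be the new reply, and higher depths could be matched back to earlier messages in the thread.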

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

Re: Generating FAQ's from Stack Overflow?

Posted by Ted Dunning <td...@maprtech.com>.

On Mon, Mar 14, 2011 at 8:05 PM, Andrew Look <al...@shopzilla.com> wrote:

> This way coherent responses could be chained together in order to aggregate
> more useful information, while people replying on tangents or spamming would
> tend to get left out.
>

Interesting point.

> Thoughts on how mahout might help create such an adjacency matrix?

Yes.  There are cooccurrence counters in Mahout that can help a lot with
this.  I will be visiting your facility on Wednesday if you would like to
talk about this more.

> Obviously cosine similarity would still form the distances between each
> reply in a given thread, but it seems like having some way of weighting each
> term’s specificity would help too – i.e. SGD or SVM are more specific than
> classifier, and classifier is more specific than Mahout since we’re looking
> at the mahout mailing list...

Yes.  Good idea.

Re: Generating FAQ's from Stack Overflow?

Posted by Andrew Look <al...@shopzilla.com>.

Interesting, faqcluster.com does appear to be a successful application of Mahout. Categorizing by question+all replies seems like a smart approach.

Do you think that choosing the best answer based on the cosine similarity of included terms is the best way to go?  Is choosing a single answer even the best approach? It seems that in many cases, a coherent answer to a question emerges after a number of people have replied to the question at hand.

For instance, at http://faqcluster.com/question-521113443 :
    * Q: Is there a SVM classifier implemented in Mahout?
    * A: See also o.a.m.classifier.sgd.TrainNewsGroups

While in the source conversation, a number of useful pieces of information (even additional questions) are divulged in between the question and faqcluster’s chosen answer: http://lucene.472066.n3.nabble.com/SVM-classifier-td2028948.html
    * Q1: Is there a SVM classifier implemented in Mahout?
    * A1: No. But the SGD classifier should have similar characteristics. There is also a rough draft of an SVM implementation available as a patch.
    * Q2: where can I know more about SGD classifier? mahout wiki did not help :(
    * A2i: https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression  Sorry about that.  What queries did you use?
    * A2ii: See also o.a.m.classifier.sgd.TrainNewsGroups

Looking at this, I think that an interesting approach (to extracting the most useful information from a thread) would be to take the original question and all replies, and form an adjacency matrix:

Match(Q1, A1) = (SVM, classifier)
Match(Q1, Q2) = (classifier, mahout)
Match(A1, Q2) = (SGD, classifier)
...

This way coherent responses could be chained together in order to aggregate more useful information, while people replying on tangents or spamming would tend to get left out.
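
To make the Match(...) idea concrete, here is a minimal sketch in plain Python (not Mahout) that builds the table of shared terms for every pair of messages in a thread. The tokenizer, the tiny stopword list, and the function names are assumptions made for illustration only.

```python
import re
from itertools import combinations

# Tiny illustrative stopword list (an assumption; a real one would be larger).
STOPWORDS = {"a", "an", "the", "is", "in", "there", "but", "as", "of", "no",
             "i", "can", "did", "not", "about", "more", "also", "should",
             "have", "where", "know"}


def terms(text):
    """Lowercase the text and return its set of non-stopword terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def match_matrix(messages):
    """Map each pair of message labels to the set of terms they share."""
    term_sets = {label: terms(text) for label, text in messages.items()}
    matrix = {}
    for a, b in combinations(term_sets, 2):
        shared = term_sets[a] & term_sets[b]
        if shared:  # leave out pairs with nothing in common
            matrix[(a, b)] = shared
    return matrix


thread = {
    "Q1": "Is there a SVM classifier implemented in Mahout?",
    "A1": "No. But the SGD classifier should have similar characteristics. "
          "There is also a rough draft of an SVM implementation available as a patch.",
    "Q2": "where can I know more about SGD classifier? mahout wiki did not help :(",
}
matches = match_matrix(thread)
```

Pairs that share no terms are simply absent from the result, which is how tangents and spam would drop out of a chain.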

Thoughts on how mahout might help create such an adjacency matrix? Obviously cosine similarity would still form the distances between each reply in a given thread, but it seems like having some way of weighting each term’s specificity would help too – i.e. SGD or SVM are more specific than classifier, and classifier is more specific than Mahout since we’re looking at the mahout mailing list...
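
One common way to weight a term's specificity is inverse document frequency (IDF): terms that appear in nearly every message on the list (such as "mahout") get a low weight, and rarer terms (such as "svm") get a higher one. Below is a minimal plain-Python sketch of IDF-weighted cosine similarity. It uses a smoothed IDF variant, the function names are hypothetical, and it is not Mahout's implementation.

```python
import math
from collections import Counter


def idf_weights(docs):
    """docs: list of token lists. Rarer terms receive larger weights."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so a term present in every document still has weight 1.0.
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def weighted_vector(tokens, idf):
    """Sparse vector of term frequency times IDF; unknown terms get weight 0."""
    return {t: count * idf.get(t, 0.0) for t, count in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["mahout", "svm", "classifier"],
    ["mahout", "sgd", "classifier"],
    ["mahout", "wiki"],
]
idf = idf_weights(docs)
similarity = cosine(weighted_vector(docs[0], idf), weighted_vector(docs[1], idf))
```

On this toy corpus the weights come out ordered as svm > classifier > mahout, which matches the specificity ordering described above.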

- Andrew

On 3/14/11 6:39 PM, "Ted Dunning" <td...@maprtech.com> wrote:

I found it.  The student in question was named Stefan Henß.  See here for details: http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3C4D660038.2000807@gmail.com%3E

The results were quite surprisingly good for how simple the techniques used are.

On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <td...@maprtech.com> wrote:
I have looked but can't find the postings by a student who recently posted about their FAQ extraction program.  The results were pretty good in terms of precision and the extracted answers were very nice.  The methods used were quite simple.

Does anybody else remember this interchange?  Did it not occur here?  Did I imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:
Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout, I've been looking for a problem to play
around on mahout with :)

On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
>
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
>
> I think automatic question extraction is a quite ambitious goal.
>
> Friso
>
>
>
> On 1 mrt 2011, at 19:12, Stack wrote:
>
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <ot...@yahoo.com> wrote:
>>>> Do you have  something in mind?  Could we be making better use of the
>>>> sematext  summaries?
>>>
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but
>>> that's what I thought you might have meant.
>>>
>>
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
>
>

Re: Stack Overflow?

Posted by Ted Dunning <td...@maprtech.com>.

I found it.  The student in question was named Stefan Henß.  See here for
details:
http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3C4D660038.2000807@gmail.com%3E

The results were quite surprisingly good for how simple the techniques used
are.


On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <td...@maprtech.com> wrote:

> I have looked but can't find the postings by a student who recently posted
> about their FAQ extraction program.  The results were pretty good in terms
> of precision and the extracted answers were very nice.  The methods used
> were quite simple.
>
> Does anybody else remember this interchange?  Did it not occur here?  Did I
> imagine it?
>
> On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:
>
>> Is there any easy way to export this data from sematext / stack overflow?
>> Or is web crawling/scraping the way to go here?
>>
>> This is a good use case for Mahout, I've been looking for a problem to
>> play
>> around on mahout with :)
>>
>>
>> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
>> wrote:
>>
>> > You could try using Apache Mahout to at least cluster the messages into
>> groups
>> > of similar ones based on text features. That should be doable. Given the
>> > groups, you could manually extract questions (the clusters with most
>> threads
>> > could be the most frequently asked). Also, if you manage to get this to
>> work
>> > nicely, it could be a nice tool for other projects as well. Would be a
>> fun
>> > exercise anyways...
>> >
>> > I am starting to toy with Mahout for another pet project. Once I get
>> more
>> > comfortable with it, I might be able to take this on (not a promise).
>> >
>> > I think automatic question extraction is a quite ambitious goal.
>> >
>> > Friso
>> >
>> >
>> >
>> > On 1 mrt 2011, at 19:12, Stack wrote:
>> >
>> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> >> <ot...@yahoo.com> wrote:
>> >>>> Do you have  something in mind?  Could we be making better use of the
>> >>>> sematext  summaries?
>> >>>
>> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> >>> I was thinking more along the lines of mining the ML archives and
>> doing
>> >>> automatic Q&A extraction.
>> >>> I don't know how difficult it would be.  Maybe the input would be too
>> noisy
>> >>> (people don't ask proper questions, answers are not full sentences,
>> quote
>> >>> characters prefixing lines from old messages add a layer of
>> complexity...),
>> >>> but
>> >>> that's what I thought you might have meant.
>> >>>
>> >>
>> >> That'd be a nice addition to the docs.  Our FAQ is in need of
>> >> updating.  This would be a nice undertaking if someone was up for
>> >> taking it on.
>> >> St.Ack
>> >
>> >
>>
>>
>

Re: Stack Overflow?

Posted by Ted Dunning <td...@maprtech.com>.

I have looked but can't find the postings by a student who recently posted
about their FAQ extraction program.  The results were pretty good in terms
of precision and the extracted answers were very nice.  The methods used
were quite simple.

Does anybody else remember this interchange?  Did it not occur here?  Did I
imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:

> Is there any easy way to export this data from sematext / stack overflow?
> Or is web crawling/scraping the way to go here?
>
> This is a good use case for Mahout, I've been looking for a problem to play
> around on mahout with :)
>
>
> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
> wrote:
>
> > You could try using Apache Mahout to at least cluster the messages into
> groups
> > of similar ones based on text features. That should be doable. Given the
> > groups, you could manually extract questions (the clusters with most
> threads
> > could be the most frequently asked). Also, if you manage to get this to
> work
> > nicely, it could be a nice tool for other projects as well. Would be a
> fun
> > exercise anyways...
> >
> > I am starting to toy with Mahout for another pet project. Once I get more
> > comfortable with it, I might be able to take this on (not a promise).
> >
> > I think automatic question extraction is a quite ambitious goal.
> >
> > Friso
> >
> >
> >
> > On 1 mrt 2011, at 19:12, Stack wrote:
> >
> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
> >> <ot...@yahoo.com> wrote:
> >>>> Do you have  something in mind?  Could we be making better use of the
> >>>> sematext  summaries?
> >>>
> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
> >>> I was thinking more along the lines of mining the ML archives and doing
> >>> automatic Q&A extraction.
> >>> I don't know how difficult it would be.  Maybe the input would be too
> noisy
> >>> (people don't ask proper questions, answers are not full sentences,
> quote
> >>> characters prefixing lines from old messages add a layer of
> complexity...),
> >>> but
> >>> that's what I thought you might have meant.
> >>>
> >>
> >> That'd be a nice addition to the docs.  Our FAQ is in need of
> >> updating.  This would be a nice undertaking if someone was up for
> >> taking it on.
> >> St.Ack
> >
> >
>
>

Re: Stack Overflow?

Posted by Andrew Look <al...@shopzilla.com>.

Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout, I've been looking for a problem to play
around on mahout with :)


On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
> 
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
> 
> I think automatic question extraction is a quite ambitious goal.
> 
> Friso
> 
> 
> 
> On 1 mrt 2011, at 19:12, Stack wrote:
> 
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <ot...@yahoo.com> wrote:
>>>> Do you have  something in mind?  Could we be making better use of the
>>>> sematext  summaries?
>>> 
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but
>>> that's what I thought you might have meant.
>>> 
>> 
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
> 
>

Re: Stack Overflow?

Posted by Friso van Vollenhoven <fv...@xebia.com>.

You could try using Apache Mahout to at least cluster the messages into groups of similar ones based on text features. That should be doable. Given the groups, you could manually extract questions (the clusters with most threads could be the most frequently asked). Also, if you manage to get this to work nicely, it could be a nice tool for other projects as well. Would be a fun exercise anyways...

I am starting to toy with Mahout for another pet project. Once I get more comfortable with it, I might be able to take this on (not a promise).

I think automatic question extraction is a quite ambitious goal.

Friso

On 1 mrt 2011, at 19:12, Stack wrote:

> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>>> Do you have  something in mind?  Could we be making better use of the
>>> sematext  summaries?
>> 
>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> I was thinking more along the lines of mining the ML archives and doing
>> automatic Q&A extraction.
>> I don't know how difficult it would be.  Maybe the input would be too noisy
>> (people don't ask proper questions, answers are not full sentences, quote
>> characters prefixing lines from old messages add a layer of complexity...), but
>> that's what I thought you might have meant.
>> 
> 
> That'd be a nice addition to the docs.  Our FAQ is in need of
> updating.  This would be a nice undertaking if someone was up for
> taking it on.
> St.Ack

Re: Stack Overflow?

Posted by Stack <st...@duboce.net>.

On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>> Do you have  something in mind?  Could we be making better use of the
>> sematext  summaries?
>
> Hm... we already index HBase and other Digests on search-hadoop.com.
> I was thinking more along the lines of mining the ML archives and doing
> automatic Q&A extraction.
> I don't know how difficult it would be.  Maybe the input would be too noisy
> (people don't ask proper questions, answers are not full sentences, quote
> characters prefixing lines from old messages add a layer of complexity...), but
> that's what I thought you might have meant.
>

That'd be a nice addition to the docs.  Our FAQ is in need of
updating.  This would be a nice undertaking if someone was up for
taking it on.
St.Ack