Posted to user@hbase.apache.org by "Buttler, David" <bu...@llnl.gov> on 2011/02/24 02:31:32 UTC

Stack Overflow?

Hi all,
It seems that we are getting a lot of repeated questions now.  Perhaps it would be useful to start migrating the simple questions off to stackoverflow (or whichever stack exchange website is most appropriate), and just pointing people there?  Obviously there are still a lot of questions that can't be migrated over since they require detailed examination of logs, but some solid answers over there may help keep the signal high on this list.

Dave

Re: Stack Overflow?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi David,

When I see people asking questions that others have asked before (and received 
answers) I tend to point them to those questions/answers via a tool, so they 
become aware of the tool, hopefully start using it, and thus check before asking 
next time around.  For Lucene, Solr, etc. I point people to appropriate search 
results on http://search-lucene.com and for HBase and friends I'd point people 
to search results or specific ML threads over on http://search-hadoop.com/ .  
That way a parallel knowledge base doesn't have to exist.

Otis

----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: "Buttler, David" <bu...@llnl.gov>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Wed, February 23, 2011 8:31:32 PM
> Subject: Stack Overflow?
> 
> Hi all,
> It seems that we are getting a lot of repeated questions now.  Perhaps it
> would be useful to start migrating the simple questions off to stackoverflow
> (or whichever stack exchange website is most appropriate), and just pointing
> people there?  Obviously there are still a lot of questions that can't be
> migrated over since they require detailed examination of logs, but some solid
> answers over there may help keep the signal high on this list.
> 
> Dave
> 
>

Re: Generating FAQ's from Stack Overflow?

Posted by Ted Dunning <td...@maprtech.com>.

On Mon, Mar 14, 2011 at 8:05 PM, Andrew Look <al...@shopzilla.com> wrote:

> This way coherent responses could be chained together in order to aggregate
> more useful information, while people replying on tangents or spamming would
> tend to get left out.
>

Interesting point.

> Thoughts on how mahout might help create such an adjacency matrix?

Yes.  There are cooccurrence counters in Mahout that can help a lot with
this.  I will be visiting your facility on Wednesday if you would like to
talk about this more.

> Obviously cosine similarity would still form the distances between each
> reply in a given thread, but it seems like having some way of weighting each
> term’s specificity would help too – i.e. SGD or SVM are more specific than
> classifier, and classifier is more specific than Mahout since we’re looking
> at the mahout mailing list...

yes.  Good idea.

Re: Generating FAQ's from Stack Overflow?

Posted by Andrew Look <al...@shopzilla.com>.

Interesting, faqcluster.com does appear to be a successful application of Mahout. Categorizing by question+all replies seems like a smart approach.

Do you think that choosing the best answer based on the cosine similarity of included terms is the best way to go?  Is choosing a single answer even the best approach? It seems that in many cases, a coherent answer to a question emerges after a number of people have replied to the question at hand.

For instance, at http://faqcluster.com/question-521113443 :
    * Q: Is there a SVM classifier implemented in Mahout?
    * A: See also o.a.m.classifier.sgd.TrainNewsGroups

While in the source conversation, a number of useful pieces of information (even additional questions) are divulged in between the question and faqcluster’s chosen answer: http://lucene.472066.n3.nabble.com/SVM-classifier-td2028948.html
    * Q1: Is there a SVM classifier implemented in Mahout?
    * A1: No. But the SGD classifier should have similar characteristics. There is also a rough draft of an SVM implementation available as a patch.
    * Q2: where can I know more about SGD classifier? mahout wiki did not help :(
    * A2i: https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression  Sorry about that.  What queries did you use?
    * A2ii: See also o.a.m.classifier.sgd.TrainNewsGroups

Looking at this, I think that an interesting approach (to extracting the most useful information from a thread) would be to take the original question and all replies, and form an adjacency matrix:

Match(Q1, A1) = (SVM, classifier)
Match(Q1, Q2) = (classifier, mahout)
Match(A1, Q2) = (SGD, classifier)
...

This way coherent responses could be chained together in order to aggregate more useful information, while people replying on tangents or spamming would tend to get left out.
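
As a rough illustration of the pairwise matches above, here is a minimal sketch in plain Python (not Mahout). The names `STOPWORDS`, `terms`, and `shared_terms` are hypothetical, and the stop-word list is a tiny assumption chosen for this example only.

```python
import re
from itertools import combinations

# Tiny illustrative stop-word list (an assumption; a real list would be larger).
STOPWORDS = {"a", "an", "the", "is", "there", "in", "of", "as", "but", "no"}


def terms(text):
    """Lower-case a message and return its set of non-stop-word terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def shared_terms(messages):
    """Map each pair of message labels to the set of terms both messages use."""
    term_sets = {label: terms(body) for label, body in messages.items()}
    return {
        (a, b): term_sets[a] & term_sets[b]
        for a, b in combinations(term_sets, 2)
    }


# The first three messages of the SVM thread quoted above.
thread = {
    "Q1": "Is there a SVM classifier implemented in Mahout?",
    "A1": "No. But the SGD classifier should have similar characteristics. "
          "There is also a rough draft of an SVM implementation available as a patch.",
    "Q2": "where can I know more about SGD classifier? mahout wiki did not help :(",
}

matches = shared_terms(thread)
for pair, common in matches.items():
    print(pair, sorted(common))
```

Pairs with an empty set would be the candidates for tangents or spam. Chaining the non-empty pairs into a single answer is left out of this sketch.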

Thoughts on how mahout might help create such an adjacency matrix? Obviously cosine similarity would still form the distances between each reply in a given thread, but it seems like having some way of weighting each term’s specificity would help too – i.e. SGD or SVM are more specific than classifier, and classifier is more specific than Mahout since we’re looking at the mahout mailing list...
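
One common way to weight each term's specificity is inverse document frequency (IDF), and the sketch below shows that idea combined with cosine similarity. It is plain Python rather than Mahout, the function names are hypothetical, the four-message corpus is made up for illustration, and the smoothed IDF formula is just one of several variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def idf_weights(docs):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + count)) + 1.0 for t, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector for one message, as a term -> weight dict."""
    tf = Counter(tokenize(text))
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up corpus: "mahout" is in every message, "classifier" in two, "svm" in one.
corpus = [
    "Is there a SVM classifier implemented in Mahout?",
    "The SGD classifier in Mahout should have similar characteristics.",
    "Where can I learn more about the Mahout wiki?",
    "Mahout clustering question about k-means",
]

idf = idf_weights(corpus)
vectors = [tfidf_vector(doc, idf) for doc in corpus]
print(idf["svm"], idf["classifier"], idf["mahout"])
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[3]))
```

With this weighting, a term that appears in every message on the list (such as "mahout") contributes least to the similarity, which matches the ordering described above.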

- Andrew

On 3/14/11 6:39 PM, "Ted Dunning" <td...@maprtech.com> wrote:

I found it.  The student in question was named Stefan Henß.  See here for details: http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3C4D660038.2000807@gmail.com%3E

The results were quite surprisingly good for how simple the techniques used are.

On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <td...@maprtech.com> wrote:
I have looked but can't find the postings by a student who recently posted about their FAQ extraction program.  The results were pretty good in terms of precision and the extracted answers were very nice.  The methods used were quite simple.

Does anybody else remember this interchange?  Did it not occur here?  Did I imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:
Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout, I've been looking for a problem to play
around on mahout with :)

On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
>
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
>
> I think automatic question extraction is a quite ambitious goal.
>
> Friso
>
>
>
> On 1 mrt 2011, at 19:12, Stack wrote:
>
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <ot...@yahoo.com> wrote:
>>>> Do you have  something in mind?  Could we be making better use of the
>>>> sematext  summaries?
>>>
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but
>>> that's what I thought you might have meant.
>>>
>>
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
>
>

Re: Stack Overflow?

Posted by Ted Dunning <td...@maprtech.com>.

I found it.  The student in question was named Stefan Henß.  See here for
details:
http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3C4D660038.2000807@gmail.com%3E

The results were quite surprisingly good for how simple the techniques used
are.


On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <td...@maprtech.com> wrote:

> I have looked but can't find the postings by a student who recently posted
> about their FAQ extraction program.  The results were pretty good in terms
> of precision and the extracted answers were very nice.  The methods used
> were quite simple.
>
> Does anybody else remember this interchange?  Did it not occur here?  Did I
> imagine it?
>
> On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:
>
>> Is there any easy way to export this data from sematext / stack overflow?
>> Or is web crawling/scraping the way to go here?
>>
>> This is a good use case for Mahout, I've been looking for a problem to
>> play
>> around on mahout with :)
>>
>>
>> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
>> wrote:
>>
>> > You could try using Apache Mahout to at least cluster the messages into
>> groups
>> > of similar ones based on text features. That should be doable. Given the
>> > groups, you could manually extract questions (the clusters with most
>> threads
>> > could be the most frequently asked). Also, if you manage to get this to
>> work
>> > nicely, it could be a nice tool for other projects as well. Would be a
>> fun
>> > exercise anyways...
>> >
>> > I am starting to toy with Mahout for another pet project. Once I get
>> more
>> > comfortable with it, I might be able to take this on (not a promise).
>> >
>> > I think automatic question extraction is a quite ambitious goal.
>> >
>> > Friso
>> >
>> >
>> >
>> > On 1 mrt 2011, at 19:12, Stack wrote:
>> >
>> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> >> <ot...@yahoo.com> wrote:
>> >>>> Do you have  something in mind?  Could we be making better use of the
>> >>>> sematext  summaries?
>> >>>
>> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> >>> I was thinking more along the lines of mining the ML archives and
>> doing
>> >>> automatic Q&A extraction.
>> >>> I don't know how difficult it would be.  Maybe the input would be too
>> noisy
>> >>> (people don't ask proper questions, answers are not full sentences,
>> quote
>> >>> characters prefixing lines from old messages add a layer of
>> complexity...),
>> >>> but
>> >>> that's what I thought you might have meant.
>> >>>
>> >>
>> >> That'd be a nice addition to the docs.  Our FAQ is in need of
>> >> updating.  This would be a nice undertaking if someone was up for
>> >> taking it on.
>> >> St.Ack
>> >
>> >
>>
>>
>

Re: Stack Overflow?

Posted by Ted Dunning <td...@maprtech.com>.

I have looked but can't find the postings by a student who recently posted
about their FAQ extraction program.  The results were pretty good in terms
of precision and the extracted answers were very nice.  The methods used
were quite simple.

Does anybody else remember this interchange?  Did it not occur here?  Did I
imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:

> Is there any easy way to export this data from sematext / stack overflow?
> Or is web crawling/scraping the way to go here?
>
> This is a good use case for Mahout, I've been looking for a problem to play
> around on mahout with :)
>
>
> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
> wrote:
>
> > You could try using Apache Mahout to at least cluster the messages into
> groups
> > of similar ones based on text features. That should be doable. Given the
> > groups, you could manually extract questions (the clusters with most
> threads
> > could be the most frequently asked). Also, if you manage to get this to
> work
> > nicely, it could be a nice tool for other projects as well. Would be a
> fun
> > exercise anyways...
> >
> > I am starting to toy with Mahout for another pet project. Once I get more
> > comfortable with it, I might be able to take this on (not a promise).
> >
> > I think automatic question extraction is a quite ambitious goal.
> >
> > Friso
> >
> >
> >
> > On 1 mrt 2011, at 19:12, Stack wrote:
> >
> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
> >> <ot...@yahoo.com> wrote:
> >>>> Do you have  something in mind?  Could we be making better use of the
> >>>> sematext  summaries?
> >>>
> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
> >>> I was thinking more along the lines of mining the ML archives and doing
> >>> automatic Q&A extraction.
> >>> I don't know how difficult it would be.  Maybe the input would be too
> noisy
> >>> (people don't ask proper questions, answers are not full sentences,
> quote
> >>> characters prefixing lines from old messages add a layer of
> complexity...),
> >>> but
> >>> that's what I thought you might have meant.
> >>>
> >>
> >> That'd be a nice addition to the docs.  Our FAQ is in need of
> >> updating.  This would be a nice undertaking if someone was up for
> >> taking it on.
> >> St.Ack
> >
> >
>
>

Re: Stack Overflow?

Posted by Andrew Look <al...@shopzilla.com>.

Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout, I've been looking for a problem to play
around on mahout with :)


On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fv...@xebia.com>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
> 
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
> 
> I think automatic question extraction is a quite ambitious goal.
> 
> Friso
> 
> 
> 
> On 1 mrt 2011, at 19:12, Stack wrote:
> 
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <ot...@yahoo.com> wrote:
>>>> Do you have  something in mind?  Could we be making better use of the
>>>> sematext  summaries?
>>> 
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but
>>> that's what I thought you might have meant.
>>> 
>> 
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
> 
>

Re: Stack Overflow?

Posted by Friso van Vollenhoven <fv...@xebia.com>.

You could try using Apache Mahout to at least cluster the messages into groups of similar ones based on text features. That should be doable. Given the groups, you could manually extract questions (the clusters with most threads could be the most frequently asked). Also, if you manage to get this to work nicely, it could be a nice tool for other projects as well. Would be a fun exercise anyways...

I am starting to toy with Mahout for another pet project. Once I get more comfortable with it, I might be able to take this on (not a promise).

I think automatic question extraction is a quite ambitious goal.

Friso

On 1 mrt 2011, at 19:12, Stack wrote:

> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>>> Do you have  something in mind?  Could we be making better use of the
>>> sematext  summaries?
>> 
>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> I was thinking more along the lines of mining the ML archives and doing
>> automatic Q&A extraction.
>> I don't know how difficult it would be.  Maybe the input would be too noisy
>> (people don't ask proper questions, answers are not full sentences, quote
>> characters prefixing lines from old messages add a layer of complexity...), but
>> that's what I thought you might have meant.
>> 
> 
> That'd be a nice addition to the docs.  Our FAQ is in need of
> updating.  This would be a nice undertaking if someone was up for
> taking it on.
> St.Ack

Re: Stack Overflow?

Posted by Stack <st...@duboce.net>.

On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>> Do you have  something in mind?  Could we be making better use of the
>> sematext  summaries?
>
> Hm... we already index HBase and other Digests on search-hadoop.com.
> I was thinking more along the lines of mining the ML archives and doing
> automatic Q&A extraction.
> I don't know how difficult it would be.  Maybe the input would be too noisy
> (people don't ask proper questions, answers are not full sentences, quote
> characters prefixing lines from old messages add a layer of complexity...), but
> that's what I thought you might have meant.
>

That'd be a nice addition to the docs.  Our FAQ is in need of
updating.  This would be a nice undertaking if someone was up for
taking it on.
St.Ack

Re: Stack Overflow?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,


----- Original Message ----
> From: Stack <st...@duboce.net>

> > Regarding this:
> >> Going forward I for one was going to try and mine our archives more at
> >> least for dealing with the repeats.
> >
> > Do you mean manually, or did you have something more sophisticated in mind?
> 
> No sophistication other than my use of the snazzy search-hadoop.com tool.
> 
> Do you have something in mind?  Could we be making better use of the
> sematext summaries?

Hm... we already index HBase and other Digests on search-hadoop.com.
I was thinking more along the lines of mining the ML archives and doing 
automatic Q&A extraction.
I don't know how difficult it would be.  Maybe the input would be too noisy 
(people don't ask proper questions, answers are not full sentences, quote 
characters prefixing lines from old messages add a layer of complexity...), but 
that's what I thought you might have meant.
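
A minimal sketch of dealing with the quote-prefix part of that noise, assuming plain-text bodies that mark quoted lines with ">" and use single-line "On ... wrote:" attributions. The names `QUOTE_PREFIX`, `ATTRIBUTION`, and `strip_quotes` are hypothetical, and attributions that wrap across two lines would slip through this heuristic.

```python
import re

# A line that starts with one or more '>' quote markers (optionally spaced).
QUOTE_PREFIX = re.compile(r"^\s*(>\s?)+")
# A single-line attribution such as "On Tue, Mar 1, 2011 at 10:03 AM, X wrote:".
ATTRIBUTION = re.compile(r"^On .* wrote:\s*$")


def strip_quotes(body):
    """Return only the lines the sender wrote, dropping quoted text."""
    kept = []
    for line in body.splitlines():
        if QUOTE_PREFIX.match(line) or ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


sample = (
    "On Tue, Mar 1, 2011 at 10:03 AM, Someone wrote:\n"
    "> I was thinking of mining the archives.\n"
    ">> Do you have something in mind?\n"
    "\n"
    "That'd be a nice addition to the docs.\n"
)

print(strip_quotes(sample))
```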

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

Re: Stack Overflow?

Posted by Stack <st...@duboce.net>.

On Thu, Feb 24, 2011 at 7:01 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Ha!  That's exactly what I was saying (or trying to say) in my reply.
>

Smile

> Regarding this:
>> Going forward I for one was going to try and mine our archives more at
>> least for dealing with the repeats.
>
> Do you mean manually, or did you have something more sophisticated in mind?
>

No sophistication other than my use of the snazzy search-hadoop.com tool.

Do you have something in mind?  Could we be making better use of the
sematext summaries?

St.Ack

Re: Stack Overflow?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Ha!  That's exactly what I was saying (or trying to say) in my reply.

Regarding this:
> Going forward I for one was going to try and mine our archives more at
> least for dealing with the repeats.

Do you mean manually, or did you have something more sophisticated in mind?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - HBase
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Stack <st...@duboce.net>
> To: user@hbase.apache.org
> Cc: "Buttler, David" <bu...@llnl.gov>
> Sent: Thu, February 24, 2011 12:16:50 AM
> Subject: Re: Stack Overflow?
> 
> Hey David:
> 
> Yeah, a few of us have started to refer to the 'two week cycle' where
> it seems the same questions come around again.
> 
> Karl Fogel's Producing Open Source Software,
> http://producingoss.com/en/producingoss.pdf, has a good section on
> this topic.  In it he advocates 'Conspicuous Use of Archives':
> 
> "Use those archives as much as possible, and as conspicuously as
> possible. Even when you know the
> answer to some question off the top of your head, if you think there's
> a reference in the archives that
> contains the answer, spend the time to dig it up and present it. Every
> time you do that in a publicly
> visible way, some people learn for the first time that the archives
> are there, and that searching in them
> can produce answers. Also, by referring to the archives instead of
> rewriting the advice, you reinforce
> the social norm against duplicating information. Why have the same
> answer in two different places?"
> [Pg 105]
> 
> The new search-hadoop.com search box that is at the top right hand
> corner of hbase.apache.org since the new maven-generated 0.90.0 hbase
> website went up, makes the digging in archives quite a bit easier.
> 
> Going forward I for one was going to try and mine our archives more at
> least for dealing with the repeats.
> 
> Thanks for bringing up this topic David,
> St.Ack
> 
> 
> 
> On Wed, Feb 23, 2011 at 5:31 PM, Buttler, David <bu...@llnl.gov> wrote:
> > Hi all,
> > It seems that we are getting a lot of repeated questions now.  Perhaps it
> > would be useful to start migrating the simple questions off to stackoverflow
> > (or whichever stack exchange website is most appropriate), and just pointing
> > people there?  Obviously there are still a lot of questions that can't be
> > migrated over since they require detailed examination of logs, but some solid
> > answers over there may help keep the signal high on this list.
> >
> > Dave
> >
> >
>

Re: Stack Overflow?

Posted by Stack <st...@duboce.net>.

Hey David:

Yeah, a few of us have started to refer to the 'two week cycle' where
it seems the same questions come around again.

Karl Fogel's Producing Open Source Software,
http://producingoss.com/en/producingoss.pdf, has a good section on
this topic.  In it he advocates 'Conspicuous Use of Archives':

"Use those archives as much as possible, and as conspicuously as
possible. Even when you know the
answer to some question off the top of your head, if you think there's
a reference in the archives that
contains the answer, spend the time to dig it up and present it. Every
time you do that in a publicly
visible way, some people learn for the first time that the archives
are there, and that searching in them
can produce answers. Also, by referring to the archives instead of
rewriting the advice, you reinforce
the social norm against duplicating information. Why have the same
answer in two different places?"
[Pg 105]

The new search-hadoop.com search box that is at the top right hand
corner of hbase.apache.org since the new maven-generated 0.90.0 hbase
website went up, makes the digging in archives quite a bit easier.

Going forward I for one was going to try and mine our archives more at
least for dealing with the repeats.

Thanks for bringing up this topic David,
St.Ack

On Wed, Feb 23, 2011 at 5:31 PM, Buttler, David <bu...@llnl.gov> wrote:
> Hi all,
> It seems that we are getting a lot of repeated questions now.  Perhaps it would be useful to start migrating the simple questions off to stackoverflow (or whichever stack exchange website is most appropriate), and just pointing people there?  Obviously there are still a lot of questions that can't be migrated over since they require detailed examination of logs, but some solid answers over there may help keep the signal high on this list.
>
> Dave
>
>