Posted to user@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2010/11/08 14:01:17 UTC

Moving a twitter conversation to the mailing list

I'm moving a twitter conversation to the mailing list so that it doesn't 
vanish in the short-lived microblogging sphere.

To summarize, @alansaid is looking for an implementation of the 
EM-algorithm as described here: 
https://cwiki.apache.org/confluence/display/MAHOUT/Expectation+Maximization. 
I could only point him to an unsuccessful attempt at implementing PLSI 
at https://issues.apache.org/jira/browse/MAHOUT-106. While this one 
worked for tiny examples, it clearly didn't scale and it had some parts 
of the algorithm wrong IMHO. @sbourke tweeted about using it despite the 
scalability issues, but I would clearly discourage anyone from doing this.

Nevertheless, if Alan manages to make this work and scale, I think it 
would make a very nice contribution to Mahout. I guess we'd be willing 
to help, so Alan, if you need support, just ask on dev@. There's also a 
Mahout hackathon planned in Berlin; maybe that would be a good 
opportunity to work collaboratively on that implementation.

--sebastian

RE: Moving a twitter conversation to the mailing list

Posted by Alan Said <Al...@dai-labor.de>.
As Sebastian mentions, I'm going to try to make a scalable implementation. Being a Hadoop/Mahout newbie, however, I'm not really sure how difficult this might end up being.

I intend to do a very general implementation which could be used for (Hy)PLSA as described here: http://www.dai-labor.de/en/publication/403

/Alan
-- 
***************************************
M.Sc.(Eng.) Alan Said
Competence Center Information Retrieval & Machine Learning 
Technische Universität Berlin / DAI-Lab 
Sekr. TEL 14 Ernst-Reuter-Platz 7
10587 Berlin / Germany
Phone:  0049 - 30 - 314 74072
Fax:    0049 - 30 - 314 74003
E-mail: alan.said@dai-lab.de
http://www.dai-labor.de
***************************************


Re: Moving a twitter conversation to the mailing list

Posted by Drew Farris <dr...@apache.org>.
FWIW, Jimmy Lin's book has a chapter on MapReduce-based EM algorithms
(http://www.umiacs.umd.edu/~jimmylin/book.html)
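
For anyone wondering what a MapReduce-style EM looks like in outline, here is
a minimal single-machine sketch in Python. It is not taken from the book or
from Mahout; the function names, the choice of a 1-D Gaussian mixture, and the
variance floor are all assumptions made for illustration. The "mapper" emits
per-point sufficient statistics (the E-step), and the "reducer" sums them per
component and re-estimates the parameters (the M-step).

```python
import math
from collections import defaultdict


def e_step_map(x, weights, means, variances):
    """Mapper (E-step): emit (component, sufficient statistics) for one point."""
    dens = [
        w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    for k, d in enumerate(dens):
        r = d / total  # responsibility of component k for x
        yield k, (r, r * x, r * x * x)


def m_step_reduce(stats, n):
    """Reducer (M-step): sum the statistics of one component, re-estimate it."""
    r_sum = sum(s[0] for s in stats)
    x_sum = sum(s[1] for s in stats)
    xx_sum = sum(s[2] for s in stats)
    mean = x_sum / r_sum
    # Hypothetical variance floor, only here to avoid a collapse to zero.
    var = max(xx_sum / r_sum - mean ** 2, 1e-6)
    return r_sum / n, mean, var


def em_iteration(data, weights, means, variances):
    """One EM pass: map over the points, group by component, then reduce."""
    grouped = defaultdict(list)
    for x in data:
        for k, stats in e_step_map(x, weights, means, variances):
            grouped[k].append(stats)
    params = [m_step_reduce(grouped[k], len(data)) for k in range(len(weights))]
    new_w, new_m, new_v = zip(*params)
    return list(new_w), list(new_m), list(new_v)
```

Calling em_iteration repeatedly, feeding each result back in, plays the role
of the driver loop; on a cluster each pass would presumably be one job, with
the current parameters shipped to every mapper.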


Re: Moving a twitter conversation to the mailing list

Posted by Grant Ingersoll <gs...@apache.org>.
Just realized this morning that I have my acronyms backwards:  They have a Max. Entropy algorithm, not an EM algorithm.

On Nov 8, 2010, at 9:07 PM, Grant Ingersoll wrote:

> The EM topic is interesting, as OpenNLP is in the process of moving towards Incubation (http://wiki.apache.org/incubator/OpenNLPProposal) at the ASF and they have an EM implementation.  I've talked to them about bringing it into Mahout, but they are not interested in the extra complexity at the moment since it would add a lot of dependencies.  We, however, could do the heavy lifting by taking it and making it scale, if it is possible.
> 
> -Grant
> 


Re: Moving a twitter conversation to the mailing list

Posted by Grant Ingersoll <gs...@apache.org>.
The EM topic is interesting, as OpenNLP is in the process of moving towards Incubation (http://wiki.apache.org/incubator/OpenNLPProposal) at the ASF and they have an EM implementation.  I've talked to them about bringing it into Mahout, but they are not interested in the extra complexity at the moment since it would add a lot of dependencies.  We, however, could do the heavy lifting by taking it and making it scale, if it is possible.

-Grant



--------------------------
Grant Ingersoll
http://www.lucidimagination.com


Re: Moving a twitter conversation to the mailing list

Posted by Robin Anil <ro...@gmail.com>.
Or even output clusters as a graph of features or tokens. A lot of interesting
things can be done here. The Graphviz folks (AT&T) were also once interested in
helping adapt their code to accommodate large data.

Again, sorry about the thread hijack.

EM is also a big and long-forgotten TODO on Mahout's wiki.


On Mon, Nov 8, 2010 at 10:45 PM, Ted Dunning <te...@gmail.com> wrote:

> Graphviz is an excellent suggestion, partly because lots of tools can read
> Graphviz-formatted data (OmniGraffle, for instance). That would be a natural
> kind of output from the HMM code, for instance.

Re: Moving a twitter conversation to the mailing list

Posted by Ted Dunning <te...@gmail.com>.
Graphviz is an excellent suggestion, partly because lots of tools can read
Graphviz-formatted data (OmniGraffle, for instance). That would be a natural
kind of output from the HMM code, for instance.

On Mon, Nov 8, 2010 at 9:08 AM, Robin Anil <ro...@gmail.com> wrote:

> On Mon, Nov 8, 2010 at 9:11 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > I use R for this when using Mahout to build models or clusters or whatever.
> > Works great.
> >
> > Visualization is not something that Mahout will ever have much of, largely
> > because other projects are so good at this. The goal of Mahout should be to
> > facilitate the use of these other tools.
> >
> There is another one - Graphviz, which scales reasonably well for large
> data. Yes, Mahout is not going to be adding any GUI tools. Exporters to
> external tools is also what I had in mind.
>

Re: Moving a twitter conversation to the mailing list

Posted by Robin Anil <ro...@gmail.com>.
On Mon, Nov 8, 2010 at 9:11 PM, Ted Dunning <te...@gmail.com> wrote:

> I use R for this when using Mahout to build models or clusters or whatever.
> Works great.
>
> Visualization is not something that Mahout will ever have much of, largely
> because other projects are so good at this. The goal of Mahout should be to
> facilitate the use of these other tools.
>
There is another one - Graphviz, which scales reasonably well for large
data. Yes, Mahout is not going to be adding any GUI tools. Exporters to
external tools are also what I had in mind.
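
To make the exporter idea a bit more concrete, here is a rough sketch in
Python of writing clusters out as Graphviz DOT text. It is not Mahout code;
the function name and the input layout (a dict mapping a cluster id to a list
of (term, weight) pairs) are assumptions for illustration only.

```python
def clusters_to_dot(clusters, graph_name="clusters"):
    """Render {cluster_id: [(term, weight), ...]} as Graphviz DOT text."""

    def quote(value):
        # DOT quoted IDs: escape backslashes and double quotes.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return '"' + text + '"'

    lines = ["graph " + quote(graph_name) + " {"]
    for cluster_id, terms in sorted(clusters.items()):
        cluster_node = quote("cluster " + str(cluster_id))
        lines.append("  " + cluster_node + " [shape=box];")
        for term, weight in terms:
            # One undirected edge per (cluster, term), labelled with the weight.
            lines.append(
                "  %s -- %s [label=%s];"
                % (cluster_node, quote(term), quote("%.2f" % weight))
            )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = {0: [("hadoop", 0.8), ("mapreduce", 0.6)], 1: [("matrix", 0.9)]}
    print(clusters_to_dot(example))
```

The printed text can be saved to a file and handed to the Graphviz tools (for
example dot or neato). A term that shows up in several clusters becomes a
single shared node, which gives a simple graph of clusters and features.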

Re: Moving a twitter conversation to the mailing list

Posted by Ted Dunning <te...@gmail.com>.
That is actually an interesting idea, not least because R scripts that show
how to suck in Mahout output would be educational even for non-R users.

On Mon, Nov 8, 2010 at 7:54 AM, zaki rahaman <za...@gmail.com> wrote:

> +1 for R.
>
> I think everyone has pet tools for visualization. I remember reading
> somewhere else on the list about having a "Labs" section in Mahout; it would
> be great if there was a 'contrib' section of sorts for some R
> scripts/packages (and/or other tools for other viz methods). I totally
> understand avoiding feature creep and keeping the quality bar high, but
> maybe this could also help keep project priorities and future directions
> somewhat organized (I know that other mechanisms like JIRA help with this
> too).
>
> --
> Zaki Rahaman
>

Re: Moving a twitter conversation to the mailing list

Posted by zaki rahaman <za...@gmail.com>.
+1 for R.

I think everyone has pet tools for visualization. I remember reading
somewhere else on the list about having a "Labs" section in Mahout; it would
be great if there was a 'contrib' section of sorts for some R
scripts/packages (and/or other tools for other viz methods). I totally
understand avoiding feature creep and keeping the quality bar high, but
maybe this could also help keep project priorities and future directions
somewhat organized (I know that other mechanisms like JIRA help with this
too).

On Mon, Nov 8, 2010 at 10:41 AM, Ted Dunning <te...@gmail.com> wrote:

> I use R for this when using Mahout to build models or clusters or whatever.
> Works great.
>
> Visualization is not something that Mahout will ever have much of, largely
> because other projects are so good at this. The goal of Mahout should be to
> facilitate the use of these other tools.
>
> On Mon, Nov 8, 2010 at 5:07 AM, Steven Bourke <sb...@gmail.com> wrote:
>
> > Some of my colleagues were moaning about mahout because
> > it didn't have something to show pretty clusters etc.
> >
>



-- 
Zaki Rahaman

Re: Moving a twitter conversation to the mailing list

Posted by Ted Dunning <te...@gmail.com>.
I use R for this when using Mahout to build models or clusters or whatever.
Works great.

Visualization is not something that Mahout will ever have much of, largely
because other projects are so good at this. The goal of Mahout should be to
facilitate the use of these other tools.

On Mon, Nov 8, 2010 at 5:07 AM, Steven Bourke <sb...@gmail.com> wrote:

> Some of my colleagues were moaning about mahout because
> it didn't have something to show pretty clusters etc.
>

Re: Moving a twitter conversation to the mailing list

Posted by Robin Anil <ro...@gmail.com>.
Hi Steven, I can help you with the visualization part. I was the one who
posted it as a GSoC project, but never got anyone to take it up.

Robin


On Mon, Nov 8, 2010 at 6:37 PM, Steven Bourke <sb...@gmail.com> wrote:

> Hi (sbourke = me)
>
> If anyone wants to give a hand helping to incorporate visualizations, I'd be
> interested in getting involved. I know there was a JIRA issue for it
> somewhere, but it never really progressed. Some of my colleagues were moaning
> about Mahout because it didn't have something to show pretty clusters etc.
>

Re: Moving a twitter conversation to the mailing list

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Steve,

please send a new mail for this. It's very interesting but unrelated to 
the topic of this mail thread.

--sebastian

On 08.11.2010 14:07, Steven Bourke wrote:
> Hi (sbourke = me)
>
> If anyone wants to give a hand helping to incorporate visualizations, I'd be
> interested in getting involved. I know there was a JIRA issue for it
> somewhere, but it never really progressed. Some of my colleagues were moaning
> about Mahout because it didn't have something to show pretty clusters etc.


Re: Moving a twitter conversation to the mailing list

Posted by Steven Bourke <sb...@gmail.com>.
Hi (sbourke = me)

If anyone wants to give a hand helping to incorporate visualizations, I'd be
interested in getting involved. I know there was a JIRA issue for it
somewhere, but it never really progressed. Some of my colleagues were moaning
about Mahout because it didn't have something to show pretty clusters etc.


