Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2011/08/17 11:48:31 UTC

Issues piling up

Hi all, I'm again seeing the issue count tend to pile up. I try to run
through regularly to resolve anything addressed to me, and even things that
aren't but that I am confident enough to fix. It would be great if everyone
could do the same in a spare 1-2 hours this week, if only to say "yes, go
ahead on that patch" or "no I don't think this is a good idea". Especially
the committers who have not been active in a while.

To me, this is the most essential work we can do, because without responses
from those with power to commit, new community members get the message that
their contributions are ignored, or that nobody's home. That's no good.
Understanding that individuals may not have time to actively write their own
new changes and improvements, it seems that the least we can all do is
involve and respond to external input, to bring in those who want to make
changes.

I'd also like to sweep through the issues that have not been touched in 6+
months and close some that just do not seem to be getting any traction or
attention. The theory is that closing stuff that by all accounts won't get
looked at better communicates what's coming in the project, and focuses
attention on issues that might get looked at.

Before I start that, though, I would welcome anyone to peek at everything
that's open and assign, comment on, ping, etc. anything that needs to be kept
alive.

Re: Issues piling up

Posted by Shannon Quinn <sq...@gatech.edu>.

My issues of interest:

MAHOUT-516: Eigencuts produces unexpected results
This comes down to finding a decent heuristic for automatically determining
the degree fed to the Lanczos solver.

MAHOUT-517: Eigencuts needs an output format
Something better than System.out.println()

MAHOUT-518: Implement Affinity Preprocessing for Eigencuts and Spectral 
KMeans
Have a map job sit in front of eigencuts/spectral k-means that converts 
some standard input format (perhaps CSV?) into the affinity matrix used 
by the algorithms.

MAHOUT-524: DisplaySpectralKMeans example fails
Something strange in the clusters shown in the example.

MAHOUT-537: Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
Somewhat on hold until a later version of Hadoop.

I'm sorry for my lack of activity; my summer internship with Google kept 
me busier than I'd expected. However, with the return of PhD research, 
my advisor wants to very quickly--as in, the next couple months--push 
out a prototype for our framework that uses Mahout, so these issues 
should be attended to very, very soon.

Shannon

On 8/17/11 6:44 AM, Grant Ingersoll wrote:
> More later, but…
>
> On Aug 17, 2011, at 5:13 AM, Sebastian Schelter wrote:
>
>> MAHOUT-767 Improve RowSimilarityJob performance for count-based distance measures
>>
>> Currently working on that.
> I'm looking to test what you have on the ASF mail archives in the coming few weeks.
>

Re: Issues piling up

Posted by Grant Ingersoll <gs...@apache.org>.

More later, but…

On Aug 17, 2011, at 5:13 AM, Sebastian Schelter wrote:

> 
> MAHOUT-767 Improve RowSimilarityJob performance for count-based distance measures
> 
> Currently working on that.

I'm looking to test what you have on the ASF mail archives in the coming few weeks.

Re: Issues piling up

Posted by Frank Scholten <fr...@frankscholten.nl>.

MAHOUT-612: Simplify configuring and running Mahout MapReduce jobs
from Java using Java bean configuration

Currently working on Fuzzy K-Means configuration

On Wed, Aug 17, 2011 at 4:51 PM, Ted Dunning <te...@gmail.com> wrote:
> Just holler.
>
> On Wed, Aug 17, 2011 at 3:13 AM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> MAHOUT-773 Implement Random Walk with Restarts
>>
>> I have a working patch, but Ted suggested the use of random projections,
>> I'll have a look into that if I get hold of a linear algebra guru :)
>>
>

Re: Issues piling up

Posted by Grant Ingersoll <gs...@apache.org>.

On Aug 17, 2011, at 11:09 AM, Sebastian Schelter wrote:

> Thanks for offering your help, but I guess it needs something more than someone helping by mail...
> 
> I need someone to sit down with me in my office and answer lots of possibly very embarrassing questions :)

I doubt you are alone in those questions.


> 
> --sebastian
> 
> On 17.08.2011 16:51, Ted Dunning wrote:
>> Just holler.
>> 
>> On Wed, Aug 17, 2011 at 3:13 AM, Sebastian Schelter <ssc@apache.org> wrote:
>> 
>>    MAHOUT-773 Implement Random Walk with Restarts
>> 
>>    I have a working patch, but Ted suggested the use of random
>>    projections, I'll have a look into that if I get hold of a linear
>>    algebra guru :)
>> 
>> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: Issues piling up

Posted by Sebastian Schelter <ss...@apache.org>.

Thanks for offering your help, but I guess it needs something more than
someone helping by mail...

I need someone to sit down with me in my office and answer lots of
possibly very embarrassing questions :)

--sebastian

On 17.08.2011 16:51, Ted Dunning wrote:
> Just holler.
>
> On Wed, Aug 17, 2011 at 3:13 AM, Sebastian Schelter <ssc@apache.org> wrote:
>
>     MAHOUT-773 Implement Random Walk with Restarts
>
>     I have a working patch, but Ted suggested the use of random
>     projections, I'll have a look into that if I get hold of a linear
>     algebra guru :)
>
>

Re: Issues piling up

Posted by Ted Dunning <te...@gmail.com>.

Just holler.

On Wed, Aug 17, 2011 at 3:13 AM, Sebastian Schelter <ss...@apache.org> wrote:

> MAHOUT-773 Implement Random Walk with Restarts
>
> I have a working patch, but Ted suggested the use of random projections,
> I'll have a look into that if I get hold of a linear algebra guru :)
>

Re: Issues piling up

Posted by Sebastian Schelter <ss...@apache.org>.

I ran through the issues and compiled a list of tasks I'll take care of;
please keep them open.


MAHOUT-710 Implementing K-Trusses

I'll pick up work on that in a few weeks.


MAHOUT-777 Improve TransposeJob to use a Combiner

Patch needs review from Jake (or anyone else willing to do that)


MAHOUT-767 Improve RowSimilarityJob performance for count-based distance 
measures

Currently working on that.


MAHOUT-773 Implement Random Walk with Restarts

I have a working patch, but Ted suggested the use of random projections.
I'll have a look into that if I get hold of a linear algebra guru :)


MAHOUT-737 Implicit Alternating Least Squares SVD

This one needs input from Tamas Jambor. The code looked good, but he has
to rework it to avoid a matrix inversion.


MAHOUT-609 Add an option to make RecommenderJob write out its computed
item similarities

This should be a small change that I'll add in some spare time or before 
the next release.

--sebastian

On 17.08.2011 11:48, Sean Owen wrote:
> Hi all, I'm again seeing the issue count tend to pile up. I try to run
> through regularly to resolve anything addressed to me, and even things that
> aren't but that I am confident enough to fix. It would be great if everyone
> could do the same in a spare 1-2 hours this week, if only to say "yes, go
> ahead on that patch" or "no I don't think this is a good idea". Especially
> the committers who have not been active in a while.
>
> To me, this is the most essential work we can do, because without responses
> from those with power to commit, new community members get the message that
> their contributions are ignored, or that nobody's home. That's no good.
> Understanding that individuals may not have time to actively write their own
> new changes and improvements, it seems that the least we can all do is
> involve and respond to external input, to bring in those who want to make
> changes.
>
> I'd also like to sweep through the issues that have not been touched in 6+
> months and close some that just do not seem to be getting any traction or
> attention. The theory is that closing stuff that by all accounts won't get
> looked at better communicates what's coming in the project, and focuses
> attention on issues that might get looked at.
>
> Before I start that though, would welcome anyone to peek at everything
> that's open and assign, comment, ping, etc. anything that needs to be kept
> alive.
>

Re: Issues piling up

Posted by Sean Owen <sr...@gmail.com>.

I think all the renewed activity is great. Next week, I will update JIRA to
reflect these comments, and perhaps close out some items that are not
mentioned here.

We seem to make a release after about 6 months or 150 JIRA issues. There are
no hard rules about that, but it seems like a fine pace to date. That would put
us at a new release around January next year by default. Hey, if there's a
surge of activity, let's make it sooner.

On Thu, Aug 18, 2011 at 2:44 PM, Grant Ingersoll <gs...@apache.org> wrote:

> I intend to get through M-688 and M-627 soon.  I'd appreciate some other
> eyeballs on M-627.    I think M-399 warrants more interest, but I also seem
> to recall Jake saying he has a pretty significant overhaul of LDA coming
> anyway, so it may not be worth the time.
>
> Seems like with this push, we could get to 0.6 sometime in late Sept or
> Oct.?
>
>
> On Aug 17, 2011, at 4:48 AM, Sean Owen wrote:
>
> > Hi all, I'm again seeing the issue count tend to pile up. I try to run
> > through regularly to resolve anything addressed to me, and even things that
> > aren't but that I am confident enough to fix. It would be great if everyone
> > could do the same in a spare 1-2 hours this week, if only to say "yes, go
> > ahead on that patch" or "no I don't think this is a good idea". Especially
> > the committers who have not been active in a while.
> >
> > To me, this is the most essential work we can do, because without responses
> > from those with power to commit, new community members get the message that
> > their contributions are ignored, or that nobody's home. That's no good.
> > Understanding that individuals may not have time to actively write their own
> > new changes and improvements, it seems that the least we can all do is
> > involve and respond to external input, to bring in those who want to make
> > changes.
> >
> > I'd also like to sweep through the issues that have not been touched in 6+
> > months and close some that just do not seem to be getting any traction or
> > attention. The theory is that closing stuff that by all accounts won't get
> > looked at better communicates what's coming in the project, and focuses
> > attention on issues that might get looked at.
> >
> > Before I start that though, would welcome anyone to peek at everything
> > that's open and assign, comment, ping, etc. anything that needs to be kept
> > alive.
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>

Re: Issues piling up

Posted by Grant Ingersoll <gs...@apache.org>.

I intend to get through M-688 and M-627 soon. I'd appreciate some other eyeballs on M-627. I think M-399 warrants more interest, but I also seem to recall Jake saying he has a pretty significant overhaul of LDA coming anyway, so it may not be worth the time.

Seems like with this push, we could get to 0.6 sometime in late Sept or Oct.?


On Aug 17, 2011, at 4:48 AM, Sean Owen wrote:

> Hi all, I'm again seeing the issue count tend to pile up. I try to run
> through regularly to resolve anything addressed to me, and even things that
> aren't but that I am confident enough to fix. It would be great if everyone
> could do the same in a spare 1-2 hours this week, if only to say "yes, go
> ahead on that patch" or "no I don't think this is a good idea". Especially
> the committers who have not been active in a while.
> 
> To me, this is the most essential work we can do, because without responses
> from those with power to commit, new community members get the message that
> their contributions are ignored, or that nobody's home. That's no good.
> Understanding that individuals may not have time to actively write their own
> new changes and improvements, it seems that the least we can all do is
> involve and respond to external input, to bring in those who want to make
> changes.
> 
> I'd also like to sweep through the issues that have not been touched in 6+
> months and close some that just do not seem to be getting any traction or
> attention. The theory is that closing stuff that by all accounts won't get
> looked at better communicates what's coming in the project, and focuses
> attention on issues that might get looked at.
> 
> Before I start that though, would welcome anyone to peek at everything
> that's open and assign, comment, ping, etc. anything that needs to be kept
> alive.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: Issues piling up

Posted by Ted Dunning <te...@gmail.com>.

Here are my thoughts so far:

http://dl.dropbox.com/u/36863361/sd-2.pdf

and tex source:

http://dl.dropbox.com/u/36863361/sd-2.tex

I think that this gets rid of the QR steps.  I am still debugging the case
of a singular matrix, but that shouldn't apply to any real cases.

On Wed, Aug 17, 2011 at 12:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I will take a look, although there seems to be a lot of new stuff I don't
> have time to read the science for.
>
> On top of that, I have for some time been planning some improvements to SSVD
> scaling and getting rid of current limitations, such as:
>
> -- SSVD-wide enhancements: to allow better wide scaling, in summary to
> billions of non-zero elements per row:
>    -- remove the limitation of at least k+p rows per map task without causing
> "supersplits", by allowing blocked QR pushdown to reducers (or perhaps even
> automatic pushdown; I am not sure if it is possible).
>    -- I have already used SSVD code that equips the vector with a preprocessor
> via the Configured Hadoop interface, allowing on-the-fly random projection,
> which makes it possible to randomly project very long rows without ever
> loading them into memory.
>
> -- "SSVD-tall" improvements: to allow more vertical scaling (currently
> thought to be at about a billion rows with a lot of memory) by introducing
> more bottom-up divide-and-conquer QR steps in the middle.
>
> Unfortunately, I see most of those improvements (except probably the
> preprocessor improvement, and perhaps QR pushdown) as a purely theoretical
> challenge, as I have yet to find a use case for them either myself or in
> public; hence it is merely a theoretical interest in scale right now. A dense
> matrix of even a million by a million is already a 5 to 8 TB input file, which
> is a challenge for me to find, much less benchmark on a thousand-node cluster,
> and this case is thought to be already well covered even by the current code.
> A potential challenge to it is high deviation in the number of nonzero
> elements in the input (so that it may be a million on average with spikes to
> a billion or so, which would mean an 8G sized vector).
>
> Given that I seem to be buried in ever-increasing work and household tasks, I
> don't see myself doing much of that, except for what improvements already
> exist on the side, in the next 6 months or so.
>
> -d
>
> On Wed, Aug 17, 2011 at 2:48 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > Hi all, I'm again seeing the issue count tend to pile up. I try to run
> > through regularly to resolve anything addressed to me, and even things that
> > aren't but that I am confident enough to fix. It would be great if everyone
> > could do the same in a spare 1-2 hours this week, if only to say "yes, go
> > ahead on that patch" or "no I don't think this is a good idea". Especially
> > the committers who have not been active in a while.
> >
> > To me, this is the most essential work we can do, because without responses
> > from those with power to commit, new community members get the message that
> > their contributions are ignored, or that nobody's home. That's no good.
> > Understanding that individuals may not have time to actively write their own
> > new changes and improvements, it seems that the least we can all do is
> > involve and respond to external input, to bring in those who want to make
> > changes.
> >
> > I'd also like to sweep through the issues that have not been touched in 6+
> > months and close some that just do not seem to be getting any traction or
> > attention. The theory is that closing stuff that by all accounts won't get
> > looked at better communicates what's coming in the project, and focuses
> > attention on issues that might get looked at.
> >
> > Before I start that though, would welcome anyone to peek at everything
> > that's open and assign, comment, ping, etc. anything that needs to be kept
> > alive.
> >
>

Re: Issues piling up

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I will take a look, although there seems to be a lot of new stuff I don't have
time to read the science for.

On top of that, I have for some time been planning some improvements to SSVD
scaling and getting rid of current limitations, such as:

-- SSVD-wide enhancements: to allow better wide scaling, in summary to
billions of non-zero elements per row:
    -- remove the limitation of at least k+p rows per map task without causing
"supersplits", by allowing blocked QR pushdown to reducers (or perhaps even
automatic pushdown; I am not sure if it is possible).
    -- I have already used SSVD code that equips the vector with a preprocessor
via the Configured Hadoop interface, allowing on-the-fly random projection,
which makes it possible to randomly project very long rows without ever
loading them into memory.

-- "SSVD-tall" improvements: to allow more vertical scaling (currently
thought to be at about a billion rows with a lot of memory) by introducing
more bottom-up divide-and-conquer QR steps in the middle.
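
The on-the-fly random projection mentioned in the list above can be sketched roughly as follows. This is only an illustration in Python, not the SSVD code itself: the function names, the seeding scheme, and the choice of Gaussian entries are all assumptions made for the example. The point is that the row of the random matrix belonging to a given column index is regenerated from a seed whenever a nonzero in that column streams past, so only the k+p accumulators are held in memory, never the full input row.

```python
import random


def omega_row(seed, col, width):
    """Regenerate row `col` of an implicit random matrix with `width` columns.

    The seeding scheme is hypothetical; any deterministic function of
    (seed, col) would serve the same purpose.
    """
    rng = random.Random(seed * 1_000_003 + col)
    return [rng.gauss(0.0, 1.0) for _ in range(width)]


def project_row(nonzeros, width, seed=42):
    """Project one sparse row onto `width` (think k+p) dimensions.

    `nonzeros` is any iterable of (column_index, value) pairs, so it can be
    a generator reading from disk. Memory use is O(width), independent of
    how many nonzero elements the row has.
    """
    y = [0.0] * width
    for col, value in nonzeros:
        for j, w in enumerate(omega_row(seed, col, width)):
            y[j] += value * w
    return y


if __name__ == "__main__":
    # A generator stands in for a very long row that is never materialised.
    stream = ((col, 1.0) for col in range(0, 100_000, 1_000))
    print(project_row(stream, width=4))
```

Because the same seed always regenerates the same rows of the random matrix, different map tasks can project different rows of the input consistently without sharing any state.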

Unfortunately, I see most of those improvements (except probably the
preprocessor improvement, and perhaps QR pushdown) as a purely theoretical
challenge, as I have yet to find a use case for them either myself or in
public; hence it is merely a theoretical interest in scale right now. A dense
matrix of even a million by a million is already a 5 to 8 TB input file, which
is a challenge for me to find, much less benchmark on a thousand-node cluster,
and this case is thought to be already well covered even by the current code.
A potential challenge to it is high deviation in the number of nonzero
elements in the input (so that it may be a million on average with spikes to
a billion or so, which would mean an 8G sized vector).
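
The sizes quoted above follow from simple arithmetic. A short check in Python, assuming 8-byte double-precision values and decimal units (1 TB = 10^12 bytes); both are assumptions for illustration rather than something stated in the message:

```python
def dense_bytes(rows, cols, bytes_per_element=8):
    """Raw payload size of a dense rows x cols block, ignoring file overhead."""
    return rows * cols * bytes_per_element


# Dense million-by-million matrix, in terabytes.
matrix_tb = dense_bytes(10**6, 10**6) / 10**12

# A single row with a billion nonzero values, in gigabytes (values only).
vector_gb = dense_bytes(1, 10**9) / 10**9

print(matrix_tb, "TB")
print(vector_gb, "GB")
```

Under these assumptions the dense case comes to 8 TB, the upper end of the quoted range, and the billion-element row comes to 8 GB before counting any index storage.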

Given that I seem to be buried in ever-increasing work and household tasks, I
don't see myself doing much of that, except for what improvements already
exist on the side, in the next 6 months or so.

-d

On Wed, Aug 17, 2011 at 2:48 AM, Sean Owen <sr...@gmail.com> wrote:

> Hi all, I'm again seeing the issue count tend to pile up. I try to run
> through regularly to resolve anything addressed to me, and even things that
> aren't but that I am confident enough to fix. It would be great if everyone
> could do the same in a spare 1-2 hours this week, if only to say "yes, go
> ahead on that patch" or "no I don't think this is a good idea". Especially
> the committers who have not been active in a while.
>
> To me, this is the most essential work we can do, because without responses
> from those with power to commit, new community members get the message that
> their contributions are ignored, or that nobody's home. That's no good.
> Understanding that individuals may not have time to actively write their own
> new changes and improvements, it seems that the least we can all do is
> involve and respond to external input, to bring in those who want to make
> changes.
>
> I'd also like to sweep through the issues that have not been touched in 6+
> months and close some that just do not seem to be getting any traction or
> attention. The theory is that closing stuff that by all accounts won't get
> looked at better communicates what's coming in the project, and focuses
> attention on issues that might get looked at.
>
> Before I start that though, would welcome anyone to peek at everything
> that's open and assign, comment, ping, etc. anything that needs to be kept
> alive.
>