Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2011/09/01 20:43:34 UTC

Phases in AbstractJob

Other than opening the code and looking, is there a way to register our phases so that one could find out from the command line what they are?  For instance, I think that, for now, my application can skip the first two phases of the RecommenderJob. But it seems a bit awkward to say --startPhase 2, given that a future release could add a new phase and I would then have to go check the code.  Not the end of the world, but it seems error-prone and not readily maintainable.

As a bonus, it would be nice if one could also know where each phase expects things to be and in what format.  Would it make sense to have an equivalent of prepareJob, say registerJob, that runs up front and whose result can then be dumped out so that one could see the phases, their inputs, etc.?

-Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
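
The registerJob idea in the message above could look roughly like the following. This is only a sketch under stated assumptions: PhaseRegistry, registerPhase, indexOf and describe are hypothetical names, not part of Mahout's AbstractJob, and the phase names and paths are invented placeholders rather than the real phases of RecommenderJob.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in Mahout's AbstractJob.
final class PhaseRegistry {

    // One registered phase: a stable name plus where it reads input and in what format.
    static final class Phase {
        final String name;
        final String inputPath;
        final String inputFormat;

        Phase(String name, String inputPath, String inputFormat) {
            this.name = name;
            this.inputPath = inputPath;
            this.inputFormat = inputFormat;
        }
    }

    private final List<Phase> phases = new ArrayList<>();

    // Registers a phase up front, before any job runs; call order is execution order.
    void registerPhase(String name, String inputPath, String inputFormat) {
        if (indexOf(name) >= 0) {
            throw new IllegalArgumentException("Duplicate phase name: " + name);
        }
        phases.add(new Phase(name, inputPath, inputFormat));
    }

    // Returns the 0-based position of a named phase, or -1 if it was never registered.
    int indexOf(String name) {
        for (int i = 0; i < phases.size(); i++) {
            if (phases.get(i).name.equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Renders one line per phase, suitable for printing from a command-line flag.
    String describe() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < phases.size(); i++) {
            Phase p = phases.get(i);
            sb.append(i).append(": ").append(p.name)
              .append(" reads ").append(p.inputPath)
              .append(" (").append(p.inputFormat).append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseRegistry registry = new PhaseRegistry();
        // Placeholder phases and paths, invented for illustration only.
        registry.registerPhase("prepare", "input/", "text");
        registry.registerPhase("similarity", "temp/prepared", "SequenceFile");
        registry.registerPhase("recommend", "temp/similarity", "SequenceFile");
        System.out.print(registry.describe());
    }
}
```

With names registered this way, a caller could ask to start at "similarity" and have indexOf resolve it to whatever number that phase has in the installed release, instead of hard-coding --startPhase 2.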

Re: Phases in AbstractJob

Posted by Ted Dunning <te...@gmail.com>.

And FlumeJava would be even more useful.

As it stands, map-reduce based functions cannot be composed across multiple
function invocations.  For a library to ever be useful, we need something
like a Java-based flow rewriter.

On Thu, Sep 1, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:

> On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:
>
> > That's completely right. The use case is more for restarting a failed job
> > rather than configuring the pipeline. You "really" want to do something
> > different like piece together your own job.
>
> yeah, this is the downside to our big monolithic drivers.  Oozie or others
> might be useful here.

Re: Phases in AbstractJob

Posted by Lance Norskog <go...@gmail.com>.

Perfect example about the common file formats problem:
TopKStringPatterns.java. The FPGrowth jobs leave a SequenceFile of
TopKStringPatterns, a multi-level data format. Nothing reads it.

On Fri, Sep 2, 2011 at 8:09 PM, Lance Norskog <go...@gmail.com> wrote:

> Spitting out a Hamake file or Oozie file should be straightforward.
>
> As a first step I would standardize all of the arguments, and pick a list
> of N Writables as "1st class" sequence files: if a job gets one of these, it
> should know what to do.


-- 
Lance Norskog
goksron@gmail.com

Re: Phases in AbstractJob

Posted by Lance Norskog <go...@gmail.com>.

Spitting out a Hamake file or Oozie file should be straightforward.

As a first step I would standardize all of the arguments, and pick a list
of N Writables as "1st class" sequence files: if a job gets one of these, it
should know what to do.

On Thu, Sep 1, 2011 at 4:37 PM, Sebastian Schelter <ss...@apache.org> wrote:

> A first step in the right direction might be better tooling for creating
> the appropriate input data for our algorithms.
>
> We should have a job that creates the user-item-matrix for the
> recommendation stuff from CSV data with support for sampling, normalization,
> etc. I already wrote something like this for myself. I also started work on
> something like this for creating adjacency matrices in the graph package.
>
> Ideally most of our algorithms should be distributed linear algebra
> operations on distributed matrices (where possible).
>
> For example RowSimilarityJob is only a fancy way of computing A'A,
> ItemSimilarityJob is just a wrapper around that and RecommenderJob adds
> another multiplication with A' on the right. In the graph mining package
> PageRank and RandomWalkWithRestart are just eigenvector computations of the
> stochastified adjacency matrix.
>
> So I'd say we don't only need better job configuration but also a clearer
> separation between code that executes an algorithm and code that just
> converts data (wherever possible).
>
> --sebastian


-- 
Lance Norskog
goksron@gmail.com

Re: Phases in AbstractJob

Posted by Sebastian Schelter <ss...@apache.org>.

A first step in the right direction might be better tooling for
creating the appropriate input data for our algorithms.

We should have a job that creates the user-item-matrix for the 
recommendation stuff from CSV data with support for sampling, 
normalization, etc. I already wrote something like this for myself. I 
also started work on something like this for creating adjacency matrices 
in the graph package.

Ideally most of our algorithms should be distributed linear algebra 
operations on distributed matrices (where possible).

For example RowSimilarityJob is only a fancy way of computing A'A, 
ItemSimilarityJob is just a wrapper around that and RecommenderJob adds 
another multiplication with A' on the right. In the graph mining package 
PageRank and RandomWalkWithRestart are just eigenvector computations of 
the stochastified adjacency matrix.
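
To make the A'A remark concrete, here is a small in-memory sketch. It assumes A is a dense users-by-items matrix with 1 for an interaction and 0 otherwise, so that A'A is the item-by-item co-occurrence matrix. CooccurrenceSketch and gram are hypothetical names; this is not how RowSimilarityJob is implemented, since it uses plain dot products only and involves no map-reduce.

```java
// Hypothetical illustration of A'A on a toy matrix; not Mahout code.
final class CooccurrenceSketch {

    // Computes A'A for a dense users x items matrix.
    // Entry [i][j] is the sum over users u of a[u][i] * a[u][j].
    static double[][] gram(double[][] a) {
        int items = a[0].length;
        double[][] result = new double[items][items];
        for (double[] userRow : a) {
            for (int i = 0; i < items; i++) {
                if (userRow[i] == 0.0) {
                    continue; // this user contributes nothing to row i
                }
                for (int j = 0; j < items; j++) {
                    result[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data: 3 users (rows) by 3 items (columns).
        double[][] a = {
            {1, 1, 0},
            {0, 1, 1},
            {1, 1, 1},
        };
        for (double[] row : gram(a)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Each user row adds its outer product with itself to the result, which is also the usual way to split this product across mappers: one row of A at a time.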

So I'd say we don't only need better job configuration but also a
clearer separation between code that executes an algorithm and code that
just converts data (wherever possible).

--sebastian

On 02.09.2011 00:34, Grant Ingersoll wrote:
> On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:
>
>> That's completely right. The use case is more for restarting a failed job
>> rather than configuring the pipeline. You "really" want to do something
>> different like piece together your own job.
> yeah, this is the downside to our big monolithic drivers.  Oozie or others might be useful here.
>
>> This could be as complex as we want -- it could be its own project, defining
>> a slightly-higher-level definition language for MR. In fact there are
>> already one or two like that.
> I was just thinking a registerJob to complement prepareJob might be useful and simple and hook into the AbstractJob/ CLI params
>
>> I like the idea... somehow I think you'll find it hard to implement across
>> all the jobs since they're not even all in the same "format" at this point!
> +1.  Standardizing this stuff is important.

Re: Phases in AbstractJob

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:

> That's completely right. The use case is more for restarting a failed job
> rather than configuring the pipeline. You "really" want to do something
> different like piece together your own job.

yeah, this is the downside to our big monolithic drivers.  Oozie or others might be useful here.

> 
> This could be as complex as we want -- it could be its own project, defining
> a slightly-higher-level definition language for MR. In fact there are
> already one or two like that.

I was just thinking a registerJob to complement prepareJob might be useful and simple, and could hook into the AbstractJob/CLI params.

> 
> I like the idea... somehow I think you'll find it hard to implement across
> all the jobs since they're not even all in the same "format" at this point!

+1.  Standardizing this stuff is important.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: Phases in AbstractJob

Posted by Sean Owen <sr...@gmail.com>.

That's completely right. The use case is more for restarting a failed job
rather than configuring the pipeline. You "really" want to do something
different like piece together your own job.
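
As a rough illustration of the restart use case, and of why a bare phase number is fragile, the sketch below selects phases by a start/end window. It assumes phases are numbered from 0 in execution order; PhaseWindowSketch, phasesToRun and the phase names are all hypothetical, not taken from any Mahout job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the phase names and the helper are invented.
final class PhaseWindowSketch {

    // Returns the phases whose 0-based index falls in [startPhase, endPhase].
    static List<String> phasesToRun(List<String> phases, int startPhase, int endPhase) {
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < phases.size(); i++) {
            if (i >= startPhase && i <= endPhase) {
                selected.add(phases.get(i));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // The same job in two imagined releases; the second inserts one phase.
        List<String> releaseA = Arrays.asList("prepare", "similarity", "recommend");
        List<String> releaseB = Arrays.asList("prepare", "normalize", "similarity", "recommend");

        // The same start number selects different work in each release.
        System.out.println(phasesToRun(releaseA, 2, Integer.MAX_VALUE));
        System.out.println(phasesToRun(releaseB, 2, Integer.MAX_VALUE));
    }
}
```

Restarting a failed run works because the caller has just seen which phase failed; using the same number as a standing configuration value breaks as soon as the list of phases changes.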

This could be as complex as we want -- it could be its own project, defining
a slightly-higher-level definition language for MR. In fact there are
already one or two like that.

I like the idea... somehow I think you'll find it hard to implement across
all the jobs since they're not even all in the same "format" at this point!
