Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/05/12 18:48:06 UTC

Clustering Jobs and Drivers

With the recent removal of all the clustering Job classes we have 
introduced a fatal bug in all of the synthetic control examples. In the 
original implementation, the Jobs were responsible for deleting their 
output directory before running the various Drivers, which did not 
delete output themselves. When the clustering Jobs were removed, this 
deletion responsibility moved to the Drivers. The problem is that the 
synthetic control examples convert the input file into the output/data 
directory before calling the clustering Driver, which now deletes the 
output directory (and with it its own input), causing file-not-found 
errors. I see 4 possible solutions:

   1. Reinstate the Job files, giving them the responsibility to delete
      their output directory and removing that responsibility from all
      Drivers. This will involve some code duplication in the Job and
      Driver main methods, which can be addressed by refactoring.
   2. Leave the Drivers as-is and just remove their output deletion.
      This puts a bit more burden on the user but makes constructing job
      chains with clustering computations possible.
   3. Modify synthetic control examples to use a different, non-output,
      directory for the converted data and leave the Drivers alone.
   4. Finally, since chains of clustering jobs usually call the driver's
      static methods and not main, just move output deletion to main.
      The rub here is that sequences of command-line invocations would
      be problematic.
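
The failure mode described above can be reproduced with a small sketch. It is not Mahout code: the class and method names (OutputDeletionSketch, convertInput, runDriver) are hypothetical, and the local filesystem stands in for HDFS. It only shows why a driver that clears its output directory breaks when its input lives under output/data.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the synthetic control flow; local files replace HDFS.
class OutputDeletionSketch {

  // Recursively delete a directory tree if it exists.
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order so children are deleted before their parents.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  // Example preprocessing step: writes the converted input under <base>/data.
  static Path convertInput(Path base) throws IOException {
    Path data = base.resolve("data");
    Files.createDirectories(data);
    Path converted = data.resolve("part-00000");
    Files.write(converted, "1.0 2.0 3.0\n".getBytes(StandardCharsets.UTF_8));
    return converted;
  }

  // Driver that clears its output directory first, as after the Job removal.
  static void runDriver(Path input, Path output) throws IOException {
    deleteRecursively(output);
    Files.createDirectories(output);
    if (!Files.exists(input)) {
      throw new FileNotFoundException("Input does not exist: " + input);
    }
    // Placeholder for the real clustering work.
    Files.copy(input, output.resolve("clusters-0"));
  }
}
```

Calling runDriver(convertInput(output), output) fails because the converted file sits inside the directory the driver deletes. Passing a base directory other than output to convertInput, as in option 3, avoids the problem without touching the driver.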

I'd like to move towards a consistent pattern across Mahout (the reason 
for removing Jobs in the first place). I'm leaning towards #1 but would 
like some feedback esp. from Sean and Robin who (iirc) started this ball 
rolling.

Re: Clustering Jobs and Drivers

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I see you added the -w option to kmeans and fuzzyk so I have added it to 
the other drivers. Turns out canopy was the only driver actually 
deleting output by default so I've fixed that too. This is more like #4 
and is also compatible with Ted's comment on #2.
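
A minimal sketch of that arrangement, assuming -w means "overwrite existing output": deletion happens only in main when the flag is present, and the static runJob method never deletes anything. The class name, argument order, and local-filesystem calls are illustrative assumptions, not the actual Mahout driver code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical driver; local files stand in for HDFS paths.
class OverwriteFlagSketch {

  // Recursively delete a directory tree if it exists.
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order so children are deleted before their parents.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  // API entry point: never deletes output, so job chains can share directories.
  // Files.copy throws FileAlreadyExistsException if the result is already there.
  static void runJob(Path input, Path output) throws IOException {
    Files.createDirectories(output);
    // Placeholder for the real clustering work.
    Files.copy(input, output.resolve("clusters-0"));
  }

  // Command-line entry point: <input> <output> [-w]
  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.err.println("usage: <input> <output> [-w]");
      return;
    }
    Path input = Paths.get(args[0]);
    Path output = Paths.get(args[1]);
    boolean overwrite = args.length > 2 && "-w".equals(args[2]);
    if (overwrite) {
      // Only an explicit request from the user removes existing output.
      deleteRecursively(output);
    }
    runJob(input, output);
  }
}
```

Code that chains jobs through runJob keeps control of its own directories. A command-line user who passes -w while the input sits under the output directory would still delete that input, so the flag does not replace keeping the two apart.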

On 5/12/10 10:07 AM, Robin Anil wrote:
> From a user perspective it's all just a -w flag they need to provide to the
> Driver. So don't reinstate any class just to delete the file. I would say
> leave drivers.runJob as is and offer the option explicitly in the main
> class (for the command line); for those who use the API, the burden is on
> them to delete the output.
>
>
> Or
>
> Provide a boolean flag to delete output. I have seen this behaviour in Lucene,
> where to create an index by overwriting you have to do something like
> IndexWriter(path, true);
>
> Robin


Re: Clustering Jobs and Drivers

Posted by Robin Anil <ro...@gmail.com>.
From a user perspective it's all just a -w flag they need to provide to the
Driver. So don't reinstate any class just to delete the file. I would say
leave drivers.runJob as is and offer the option explicitly in the main
class (for the command line); for those who use the API, the burden is on
them to delete the output.


Or

Provide a boolean flag to delete output. I have seen this behaviour in Lucene,
where to create an index by overwriting you have to do something like
IndexWriter(path, true);

Robin

Re: Clustering Jobs and Drivers

Posted by Ted Dunning <te...@gmail.com>.
I think that (2) is more compatible with the Hadoop mindset.  Jobs shouldn't
overwrite or delete output except for temp files.
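
As a rough sketch of that convention (not Mahout or Hadoop code; the class name and local-filesystem calls are assumptions for illustration), a job can refuse to start when its output directory already exists and clean up only the temp files it created itself:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical job; local files stand in for HDFS paths.
class NoOverwriteSketch {

  static void runJob(Path input, Path output) throws IOException {
    // Refuse to touch existing output; the caller decides what to do with it.
    if (Files.exists(output)) {
      throw new FileAlreadyExistsException(output.toString());
    }
    Path temp = Files.createTempDirectory("job-temp");
    Path staged = temp.resolve("clusters-0");
    try {
      // Placeholder for the real work: stage the result in the temp directory.
      Files.copy(input, staged);
      Files.createDirectories(output);
      Files.move(staged, output.resolve("clusters-0"));
    } finally {
      // Temp files are the only thing this job deletes.
      Files.deleteIfExists(staged);
      Files.deleteIfExists(temp);
    }
  }
}
```

With this shape a second run against the same output directory fails fast instead of silently replacing earlier results, which is the extra burden on the user that option 2 accepts.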

On Wed, May 12, 2010 at 9:48 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

>  2. Leave the Drivers as-is and just remove their output deletion.
>     This puts a bit more burden on the user but makes constructing job
>     chains with clustering computations possible.
>