Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2012/11/08 19:48:30 UTC

Run multiple kmeans jobs at once from the same bash script as part of top down clustering

When doing top down clustering, I'm running a first pass of kmeans, and
then splitting the different clusters off into their own directories with
clusterpp. So I have a bunch of input directories that I want to run kmeans
jobs on at the same time.

Can I do that from a bash script? Right now I'm running over each input
directory with a for loop, and each kmeans job is waiting for completion
before the next one starts.

If I can't do it with a script, could I do it in Java without having to
modify the Mahout source?

Thanks for the help!

Re: Run multiple kmeans jobs at once from the same bash script as part of top down clustering

Posted by Matt Molek <mp...@gmail.com>.
I know it's more complicated since there are multiple jobs within one run
of kmeans clustering, but with other Hadoop jobs, I've done something along
the lines of:

// submit() returns immediately instead of blocking until the job finishes
for (Job job : parallelJobs) {
    job.submit();
}

And then I just watch that list of jobs and wait for them all to complete.
That's the sort of thing I want to be able to do with KMeans on multiple
separate datasets.
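
The submit-then-watch pattern described above can be sketched roughly as follows. This is a minimal illustration, not Mahout or Hadoop code: PollableJob is a hypothetical stand-in interface introduced only so the example compiles on its own. With real Hadoop jobs, org.apache.hadoop.mapreduce.Job provides submit() and isComplete() that would play the same roles. The class name and the poll interval are assumptions as well.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the two job methods the pattern relies on.
interface PollableJob {
    void submit() throws Exception;         // start the job and return immediately
    boolean isComplete() throws Exception;  // non-blocking status check
}

class SubmitThenPollSketch {
    /** Submits every job up front, then polls until all of them report completion. */
    static void runAll(List<? extends PollableJob> jobs, long pollMillis) throws Exception {
        // Phase 1: start everything without waiting.
        for (PollableJob job : jobs) {
            job.submit();
        }
        // Phase 2: watch the list and drop jobs as they finish.
        List<PollableJob> pending = new ArrayList<>(jobs);
        while (!pending.isEmpty()) {
            for (Iterator<PollableJob> it = pending.iterator(); it.hasNext(); ) {
                if (it.next().isComplete()) {
                    it.remove();
                }
            }
            if (!pending.isEmpty()) {
                Thread.sleep(pollMillis);
            }
        }
    }
}
```

This only works directly when each unit of work is a single job. A kmeans run launches several jobs in sequence, which is why the pattern does not carry over as-is.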


On Tue, Nov 20, 2012 at 11:58 AM, Matt Molek <mp...@gmail.com> wrote:

> I've given up on the CLI and I'm trying to do this in java now, but it
> looks like I can't launch multiple KMeans drivers at once since
> KMeansDriver and many of its underlying classes are static. Am I right that
> that will cause problems? (Sorry for the beginner question. I'm not too
> familiar with concurrency in java).
>
> I'd really like to be able to launch multiple clustering runs at the same
> time since launching them one at a time and waiting for each to finish is
> killing my overall performance.

Re: Run multiple kmeans jobs at once from the same bash script as part of top down clustering

Posted by Matt Molek <mp...@gmail.com>.
I've given up on the CLI and I'm trying to do this in Java now, but it
looks like I can't launch multiple KMeans drivers at once, since the entry
points of KMeansDriver and many of its underlying classes are static
methods. Am I right that that will cause problems? (Sorry for the beginner
question. I'm not too familiar with concurrency in Java.)

I'd really like to be able to launch multiple clustering runs at the same
time since launching them one at a time and waiting for each to finish is
killing my overall performance.
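
For reference, one common way to run several blocking calls at once in Java is a thread pool, sketched below. Static methods are not a concurrency problem in themselves; trouble only arises if they share mutable static state. Whether that is the case for KMeansDriver is not established in this thread, so it would need to be checked before relying on this. The class name, the runAll helper, the directory names, and the placeholder lambda are all hypothetical; the lambda marks where a blocking clustering call would go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelClusteringSketch {
    /**
     * Runs one blocking task per input directory on a fixed-size thread pool
     * and returns the results in the same order as the inputs.
     */
    static <R> List<R> runAll(List<String> inputDirs,
                              Function<String, R> blockingRun,
                              int maxConcurrent) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String dir : inputDirs) {
                Callable<R> task = () -> blockingRun.apply(dir);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                // get() blocks; a failed task surfaces as an ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical per-cluster directories produced by the first pass.
        List<String> dirs = Arrays.asList("clusters/0", "clusters/1", "clusters/2");
        List<String> outputs = runAll(dirs, dir -> {
            // Placeholder: the real blocking clustering call for `dir` would go here.
            return dir + "/kmeans-output";
        }, 3);
        System.out.println(outputs);
    }
}
```

The pool size caps how many runs are in flight at once, which keeps a large number of input directories from flooding the cluster with jobs.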

