Posted to dev@mahout.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2008/03/06 13:59:03 UTC

[jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

     [ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss reassigned MAHOUT-11:
---------------------------------

    Assignee: Dawid Weiss

> Static fields used throughout clustering code (Canopy, K-Means).
> ----------------------------------------------------------------
>
>                 Key: MAHOUT-11
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-11
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current code the information is exchanged via static fields (for example, the distance measure and thresholds for Canopies are static fields). Is it always true in Hadoop that one job runs inside one JVM with exclusive access? I haven't seen it anywhere in Hadoop documentation and my impression was that everything uses JobConf to pass configuration to jobs, but jobs are configured on a per-object basis (a job is an object, a mapper is an object and everything else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is a limitation and bug in our code that needs to be addressed.
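
To make the concern concrete, here is a minimal sketch of how static configuration can be clobbered when two differently configured jobs share one JVM. The class and field names (StaticCanopyConfig, t1, t2) are hypothetical stand-ins and are not taken from the Mahout source; whether Hadoop ever runs two jobs in one JVM is exactly the open question above.

```java
// Hypothetical stand-in for clustering code that keeps its thresholds in
// static fields. Names are illustrative only, not the actual Mahout classes.
final class StaticCanopyConfig {
    static double t1;
    static double t2;

    // Overwrites the single, JVM-wide copy of the thresholds.
    static void configure(double newT1, double newT2) {
        t1 = newT1;
        t2 = newT2;
    }
}

class StaticStateDemo {
    public static void main(String[] args) {
        // "Job A" configures its thresholds.
        StaticCanopyConfig.configure(3.0, 1.5);
        // "Job B" (assumed to share the JVM) configures different ones.
        StaticCanopyConfig.configure(8.0, 4.0);
        // Job A's values are gone: both jobs now read job B's thresholds.
        System.out.println("t1 seen by job A: " + StaticCanopyConfig.t1);
        System.out.println("t2 seen by job A: " + StaticCanopyConfig.t2);
    }
}
```

If each job really does get its own JVM, the second configure call happens in a different process and the problem never shows up.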

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

I do see a few advantages of using static variables, actually -- I just wasn't 
sure if it's contractual for Hadoop jobs to run in isolation from other jobs. 
This is a refactoring rather than a functionality improvement, so I'll leave the 
issue open for some time; once I get a spare minute I'll look at Hadoop's code 
and see what's cooking there.

D.


Jeff Eastman wrote:
> Dawid,
> 
> I'm not sure either, as it seems to work on deployed jobs where each
> process only uses a single configuration of distance measure. I'm sure
> one can easily create use cases where different t1 and t2 values are
> required and this will break the static approach. I was going to move
> the static variables back into the object and require each instance to
> be configured individually, but I got sidetracked into vectors and
> matrices and have not gotten to it. 
> 
> Go for it,
> Jeff

RE: [jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

Posted by Jeff Eastman <je...@collab.net>.

Dawid,

I'm not sure either, as it seems to work on deployed jobs where each
process only uses a single configuration of distance measure. I'm sure
one can easily create use cases where different t1 and t2 values are
required and this will break the static approach. I was going to move
the static variables back into the object and require each instance to
be configured individually, but I got sidetracked into vectors and
matrices and have not gotten to it. 

Go for it,
Jeff
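
A rough sketch of what moving the static variables back into the object might look like, with each instance configured individually. Everything here is an assumption made for illustration: the CanopySettings class, the "canopy.t1" and "canopy.t2" keys, and the plain Map that stands in for Hadoop's JobConf so the example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-instance holder for the canopy thresholds.
// Nothing is static, so two instances cannot overwrite each other.
final class CanopySettings {
    private final double t1;
    private final double t2;

    CanopySettings(double t1, double t2) {
        this.t1 = t1;
        this.t2 = t2;
    }

    // Builds settings from a per-job configuration. The Map is a stand-in
    // for a JobConf, and the key names are made up for this sketch.
    static CanopySettings fromConf(Map<String, String> conf) {
        return new CanopySettings(
                Double.parseDouble(conf.get("canopy.t1")),
                Double.parseDouble(conf.get("canopy.t2")));
    }

    double t1() { return t1; }
    double t2() { return t2; }
}

class InstanceConfigDemo {
    public static void main(String[] args) {
        Map<String, String> confA = new HashMap<>();
        confA.put("canopy.t1", "3.0");
        confA.put("canopy.t2", "1.5");

        Map<String, String> confB = new HashMap<>();
        confB.put("canopy.t1", "8.0");
        confB.put("canopy.t2", "4.0");

        // Each "job" owns its own settings object.
        CanopySettings a = CanopySettings.fromConf(confA);
        CanopySettings b = CanopySettings.fromConf(confB);

        System.out.println("job A t1: " + a.t1());
        System.out.println("job B t1: " + b.t1());
    }
}
```

With this shape, a mapper or reducer would read its thresholds from the configuration it is handed when it is set up, so it no longer matters whether two jobs share a JVM.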
