Posted to common-dev@hadoop.apache.org by "Kai Mosebach (JIRA)" <ji...@apache.org> on 2008/08/22 11:44:44 UTC

[jira] Updated: (HADOOP-3999) Need to add host capabilities / abilities

     [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Mosebach updated HADOOP-3999:
---------------------------------

    Description: 
The MapReduce paradigm is limited to running jobs that fit the lowest common denominator of all nodes in the cluster.

On the one hand this is intended (cloud computing: throw simple jobs in, never mind which node runs them).
On the other hand it limits the possibilities considerably. For instance, if data could or had to be fed to a third-party interface such as MATLAB, R, or Bioconductor, many more kinds of jobs could be solved via Hadoop.

Furthermore, it could be useful to know the OS, the architecture, and the performance of each node relative to the rest of the cluster (a performance ranking).
For example, if a sub-cluster of nodes with high compute performance or a sub-cluster of nodes with very fast disk I/O were known, the task tracker could select these nodes according to a so-called job profile (e.g. compute-heavy job / disk-I/O-heavy job), which a developer can usually estimate in advance.

To achieve this, node capabilities could be introduced and stored in the DFS, giving you

a1.) basic information about each node (OS, ARCH)
a2.) more detailed information (additional software, path to the software, version)
a3.) KPIs collected about the node (disk I/O, CPU power, memory)
a4.) network throughput to neighbor hosts, which might allow generating a network performance map over the cluster
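
Items a1 to a4 could be captured in a per-node record along the following lines. This is only a minimal sketch in Python for illustration: the class NodeCapabilities, its field names, and the JSON serialization are assumptions made here, not an existing Hadoop API. Only the a1 fields are filled in automatically; a2 to a4 would need collectors that this issue does not specify.

```python
import json
import platform
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class NodeCapabilities:
    """Hypothetical per-node capability record (items a1 to a4)."""
    hostname: str
    os_name: str                                               # a1: operating system
    arch: str                                                  # a1: CPU architecture
    software: Dict[str, dict] = field(default_factory=dict)    # a2: name -> {"path", "version"}
    metrics: Dict[str, float] = field(default_factory=dict)    # a3: e.g. disk I/O, CPU, memory
    net_mb_s: Dict[str, float] = field(default_factory=dict)   # a4: neighbor host -> throughput

    def to_json(self) -> str:
        """Serialize the record so it could be stored as a small file."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "NodeCapabilities":
        """Rebuild a record from the JSON produced by to_json()."""
        return cls(**json.loads(text))


def describe_local_node() -> NodeCapabilities:
    """Fill in only the a1 fields, using the standard platform module."""
    return NodeCapabilities(
        hostname=platform.node(),
        os_name=platform.system(),
        arch=platform.machine(),
    )


if __name__ == "__main__":
    node = describe_local_node()
    # Illustrative values only; a real collector would have to measure these.
    node.software["R"] = {"path": "/usr/bin/R", "version": "unknown"}
    print(node.to_json())
```

Each node could write such a record to a well-known location in the DFS; the exact layout is left open here.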

This would allow you to

b1.) generate jobs that have a profile (compute-intensive, disk-I/O-intensive, network-I/O-intensive)
b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with MATLAB only)
b3.) generate a performance map of the cluster (sub clusters of fast disk nodes, sub clusters of fast CPU nodes, network-speed-relation-map between nodes)
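
The matching implied by b1 to b3 could look roughly like the sketch below. It is written in Python for illustration only; Node, JobProfile, eligible, rank_nodes and sub_cluster are hypothetical names, the scores are assumed to be normalized numbers where higher is better, and nothing here reflects how Hadoop actually schedules tasks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """Hypothetical view of one node's stored capabilities."""
    name: str
    os_name: str
    software: Set[str] = field(default_factory=set)
    scores: Dict[str, float] = field(default_factory=dict)  # e.g. "cpu", "disk_io", "net_io"


@dataclass
class JobProfile:
    """Hypothetical job profile: one dominant resource plus hard requirements."""
    intensive: str = "cpu"                                    # b1: dominant resource
    required_os: Optional[str] = None                         # b2: OS constraint
    required_software: Set[str] = field(default_factory=set)  # b2: software constraint


def eligible(node: Node, job: JobProfile) -> bool:
    """A node qualifies only if it meets every hard requirement (b2)."""
    if job.required_os is not None and node.os_name != job.required_os:
        return False
    return job.required_software <= node.software


def rank_nodes(nodes: List[Node], job: JobProfile) -> List[Node]:
    """Keep eligible nodes, best score for the job's dominant resource first (b1)."""
    candidates = [n for n in nodes if eligible(n, job)]
    return sorted(candidates, key=lambda n: n.scores.get(job.intensive, 0.0), reverse=True)


def sub_cluster(nodes: List[Node], metric: str, threshold: float) -> List[str]:
    """Names of nodes whose score for one metric reaches a threshold (b3)."""
    return [n.name for n in nodes if n.scores.get(metric, 0.0) >= threshold]


if __name__ == "__main__":
    # Illustrative cluster; all names and scores are made up.
    cluster = [
        Node("a", "Linux", {"R"}, {"cpu": 0.9, "disk_io": 0.2}),
        Node("b", "Linux", {"R", "MATLAB"}, {"cpu": 0.5, "disk_io": 0.8}),
        Node("c", "Windows", {"MATLAB"}, {"cpu": 1.0}),
    ]
    job = JobProfile(intensive="cpu", required_os="Linux", required_software={"R"})
    print([n.name for n in rank_nodes(cluster, job)])
    print(sub_cluster(cluster, "disk_io", 0.5))
```

Treating software and OS as hard filters and the performance scores as a soft ranking keeps the two concerns separate; a real scheduler would also have to weigh data locality, which this sketch ignores.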

From step b3) you could then even derive statistical information that could be fed back into the DFS NameNode, to see whether data could be stored on fast-disk sub-clusters only (that might need to be a tool outside of Hadoop core, though).

  was:
The MapReduce paradigma is limited to run MapReduce jobs with the lowest common factor of all nodes in the cluster.

On the one hand this is wanted (cloud computing, throw simple jobs in, nevermind who does it)
On the other hand this is limiting the possibilities quite a lot, for instance if you had data which could/needs to be fed to a 3rd party interface like Mathlab, R, BioConductor you could solve a lot more jobs via hadoop.

Furthermore it could be interesting to know about the OS, the architecture, the performance of the node in relation to the rest of the cluster. (Performance ranking)
i.e. if i'd know about a sub cluster of very computing performant nodes or a sub cluster of very fast disk-io nodes, the task tracker could select these nodes regarding a so called job profile (i.e. heavy computing job / heavy disk-io job), which can usually be estimated by a developer before.

To achieve this, node capabilities could be introduced and stored in the DFS, giving you

a1.) basic information about each node (OS, ARCH)
a2.) more sophisticated infos (additional software, path to software, version). 
a3.) PKI collected about the node (disc-io, cpu power, memory)
a4.) network throughput to neighbor hosts, which might allow generating a network performance map over the cluster

This would allow you to

b1.) generate jobs that have a profile (computing intensive)
b2.) generate jovs that have software dependencies (run on Linux only, run on nodes with MathLab only)
b3.) generate a performance map of the cluster (sub clusters of fast disk nodes, sub clusters of fast cpu nodes, network-speed-relation-map between nodes)

From step b3) you could then even acquire statistical information which could again be fed into the DFS Namenode to see if we could store data on fast disk subclusters only (that might need to be a tool outside of hadoop core though)


> Need to add host capabilities / abilities
> -----------------------------------------
>
>                 Key: HADOOP-3999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3999
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>         Environment: Any
>            Reporter: Kai Mosebach
>
> The MapReduce paradigm is limited to running jobs that fit the lowest common denominator of all nodes in the cluster.
> On the one hand this is intended (cloud computing: throw simple jobs in, never mind which node runs them).
> On the other hand it limits the possibilities considerably. For instance, if data could or had to be fed to a third-party interface such as MATLAB, R, or Bioconductor, many more kinds of jobs could be solved via Hadoop.
> Furthermore, it could be useful to know the OS, the architecture, and the performance of each node relative to the rest of the cluster (a performance ranking).
> For example, if a sub-cluster of nodes with high compute performance or a sub-cluster of nodes with very fast disk I/O were known, the task tracker could select these nodes according to a so-called job profile (e.g. compute-heavy job / disk-I/O-heavy job), which a developer can usually estimate in advance.
> To achieve this, node capabilities could be introduced and stored in the DFS, giving you
> a1.) basic information about each node (OS, ARCH)
> a2.) more detailed information (additional software, path to the software, version)
> a3.) KPIs collected about the node (disk I/O, CPU power, memory)
> a4.) network throughput to neighbor hosts, which might allow generating a network performance map over the cluster
> This would allow you to
> b1.) generate jobs that have a profile (compute-intensive, disk-I/O-intensive, network-I/O-intensive)
> b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with MATLAB only)
> b3.) generate a performance map of the cluster (sub clusters of fast disk nodes, sub clusters of fast CPU nodes, network-speed-relation-map between nodes)
> From step b3) you could then even derive statistical information that could be fed back into the DFS NameNode, to see whether data could be stored on fast-disk sub-clusters only (that might need to be a tool outside of Hadoop core, though).

-- 
This message is automatically generated by JIRA.