Posted to common-dev@hadoop.apache.org by "Kai Mosebach (JIRA)" <ji...@apache.org> on 2008/08/22 11:42:46 UTC

[jira] Created: (HADOOP-3999) Need to add host capabilities / abilities

Need to add host capabilities / abilities
----------------------------------------

                 Key: HADOOP-3999
                 URL: https://issues.apache.org/jira/browse/HADOOP-3999
             Project: Hadoop Core
          Issue Type: Improvement
          Components: metrics
         Environment: Any
            Reporter: Kai Mosebach


The MapReduce paradigm is limited to running MapReduce jobs against the lowest common denominator of all nodes in the cluster.

On the one hand this is intentional (cloud computing: throw simple jobs in, never mind who runs them).
On the other hand it limits the possibilities quite a lot; for instance, if you had data that could or needs to be fed to a third-party interface like MATLAB, R or BioConductor, you could solve a lot more jobs via Hadoop.

Furthermore it could be interesting to know the OS, the architecture and the performance of a node in relation to the rest of the cluster (performance ranking).
I.e. if I knew about a sub-cluster of nodes with very fast CPUs, or a sub-cluster with very fast disk I/O, the task tracker could select these nodes according to a so-called job profile (i.e. heavy computing job / heavy disk I/O job), which a developer can usually estimate in advance.

To achieve this, node capabilities could be introduced and stored in the DFS (a possible key/value layout is sketched below), giving you

a1.) basic information about each node (OS, architecture)
a2.) more sophisticated info (additional software, path to the software, version)
a3.) KPIs collected about the node (disk I/O, CPU power, memory)
a4.) network throughput to neighboring hosts, which might allow generating a network performance map of the cluster

This would allow you to

b1.) generate jobs that have a profile (computing intensive)
b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with MATLAB only)
b3.) generate a performance map of the cluster (sub-clusters of fast disk nodes, sub-clusters of fast CPU nodes, a network-speed relation map between nodes)

From step b3) you could then even acquire statistical information, which could in turn be fed to the DFS NameNode to see whether data could be stored on fast-disk sub-clusters only (though that might need to be a tool outside of Hadoop core).
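
For illustration only, here is a minimal sketch (Java) of how such node capabilities might be laid out as key/value pairs. Nothing here is an existing Hadoop structure; all key names and values are made up, loosely following the a1)-a4) items above.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example of the capability entries a single node might publish.
// Key names and values are illustrative only.
public class NodeCapabilitiesExample {

    public static Map<String, String> exampleCapabilities() {
        Map<String, String> caps = new LinkedHashMap<String, String>();
        // a1) basic information about the node
        caps.put("capability.os.name", System.getProperty("os.name"));
        caps.put("capability.os.arch", System.getProperty("os.arch"));
        // a2) additional software, path and version
        caps.put("capability.software.r.path", "/usr/bin/R");
        caps.put("capability.software.r.version", "2.7.2");
        // a3) performance indicators
        caps.put("capability.performance.dhrystone", "1250000");
        caps.put("capability.performance.diskwrite", "85");      // MB/s
        caps.put("capability.hardware.memory", "8192");          // MB
        // a4) network throughput to a neighboring host
        caps.put("capability.network.throughput.node17", "940"); // Mbit/s
        return caps;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : exampleCapabilities().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}

A job profile (b1/b2) would then essentially be a set of constraints over keys like these.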



[jira] Commented: (HADOOP-3999) Need to add host capabilities / abilities

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639117#action_12639117 ] 

Kai Mosebach commented on HADOOP-3999:
--------------------------------------

Thanks a lot for your comment!

Regarding 1.): I'm currently implementing a plugin system that lets us load arbitrary plugin classes, each of which has to implement the CapabilityPlugin class. It works with Maps, so the plugins are quite free in what they put into the results.
This is necessary since many benchmarks are only available under a non-Apache license (e.g. SciMark 2), and this way they can still be used. Furthermore, I think it makes sense to define which key(s) from the CapabilityPlugin are the relevant ones for your scheduler (e.g. the capability.performance.dhrystone value combined with capability.performance.diskwrite and capability.hardware.memory might be interesting for one use case, other combinations for others). A good default setting is important here, but it should be tweakable, at least for testing. The plugin system should also be able to handle shell scripts/tools, since some benchmarks (I/O etc.) are nearly impossible to do in Java.
Furthermore, this system can hold software information as well as other data at the same time. It will have aging (since we don't want to run some tests, e.g. performance tests, on every start) and serialization of the data (a minimal sketch of the plugin contract follows below).
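
For illustration, here is a minimal sketch of what such a Map-based plugin contract might look like. The CapabilityPlugin name and the Map-based results come from the description above; the exact method name, signature and example plugin are assumptions, not the actual patch.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shape of the plugin contract; not the actual patch.
public interface CapabilityPlugin {
    /** Collects capability key/value pairs for the local node. */
    Map<String, String> collect();
}

// A trivial example plugin that only reports OS name and architecture.
class OsCapabilityPlugin implements CapabilityPlugin {
    public Map<String, String> collect() {
        Map<String, String> result = new LinkedHashMap<String, String>();
        result.put("capability.os.name", System.getProperty("os.name"));
        result.put("capability.os.arch", System.getProperty("os.arch"));
        return result;
    }
}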

I assume this system fits into other domains (besides software/hardware) as well.

Regarding 2.): I see this danger as well. Anyway, I think it still makes a lot of sense if you can assume you have a special tool on site that you can use (as we do - we use a lot of biological add-ons, which you don't want to reinvent ;).
Further down the road, if we see super-clouds that need to handle multiple customers with different needs / specs / service levels, we should also be able to differentiate between nodes (I call them individualized nodes at this point).
Looking at smaller setups / test setups with a lot of heterogeneity (as we have here), we could be better off if we can make the scheduler stop using machines that are needed for other work.
Regarding "work near the data": not only does the scheduler have to know about the specs of the nodes, the DFS could make use of them as well (it should actually prefer fast-I/O machines eventually).

For friends I often use this metaphor: different people live in the cloud, e.g. workers, scientists, housewives. So why give mathematical problems to the housewife and ironing jobs to the scientists?

Regarding 3.) (and 2.): maybe the performance system is - in the beginning - more useful for core developers and performance tweakers than for my biologist neighbors who were just forced to develop in Java.




[jira] Updated: (HADOOP-3999) Need to add host capabilities / abilities

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Mosebach updated HADOOP-3999:
---------------------------------

    Description: 
The MapReduce paradigm is limited to running MapReduce jobs against the lowest common denominator of all nodes in the cluster.

On the one hand this is intentional (cloud computing: throw simple jobs in, never mind who runs them).
On the other hand it limits the possibilities quite a lot; for instance, if you had data that could or needs to be fed to a third-party interface like MATLAB, R or BioConductor, you could solve a lot more jobs via Hadoop.

Furthermore it could be interesting to know the OS, the architecture and the performance of a node in relation to the rest of the cluster (performance ranking).
I.e. if I knew about a sub-cluster of nodes with very fast CPUs, or a sub-cluster with very fast disk I/O, the task tracker could select these nodes according to a so-called job profile (i.e. heavy computing job / heavy disk I/O job), which a developer can usually estimate in advance.

To achieve this, node capabilities could be introduced and stored in the DFS, giving you

a1.) basic information about each node (OS, architecture)
a2.) more sophisticated info (additional software, path to the software, version)
a3.) KPIs collected about the node (disk I/O, CPU power, memory)
a4.) network throughput to neighboring hosts, which might allow generating a network performance map of the cluster

This would allow you to

b1.) generate jobs that have a profile (computing intensive, disk I/O intensive, net I/O intensive)
b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with MATLAB only)
b3.) generate a performance map of the cluster (sub-clusters of fast disk nodes, sub-clusters of fast CPU nodes, a network-speed relation map between nodes)

From step b3) you could then even acquire statistical information, which could in turn be fed to the DFS NameNode to see whether data could be stored on fast-disk sub-clusters only (though that might need to be a tool outside of Hadoop core).





[jira] Commented: (HADOOP-3999) Need to add host capabilities / abilities

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651962#action_12651962 ] 

Kai Mosebach commented on HADOOP-3999:
--------------------------------------

The basic capability plugin system is done so far, but I have some structural problems/questions that you hopefully might be able to help me out with:

Status:
- I currently run the collector plugins in both the DataNode startup and the TaskTracker startup.
- The results should be persisted locally with a timestamp, so that expensive plugins (like searching for a binary, hard-disk performance checks etc.) are not run too often (a rough sketch follows below).
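
As a rough sketch of the persist-with-a-timestamp idea (the file format, the TTL handling and the reuse of the CapabilityPlugin type sketched earlier are assumptions, not the actual patch):

import java.io.*;
import java.util.Properties;

// Hypothetical helper: caches plugin results on local disk and only re-runs
// an expensive plugin when the cached copy is older than a maximum age.
public class CapabilityCache {

    private final File cacheFile;
    private final long maxAgeMillis;

    public CapabilityCache(File cacheFile, long maxAgeMillis) {
        this.cacheFile = cacheFile;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns cached results if they are fresh enough, otherwise runs the plugin and stores the results. */
    public Properties getOrCollect(CapabilityPlugin plugin) throws IOException {
        if (cacheFile.exists()
            && System.currentTimeMillis() - cacheFile.lastModified() < maxAgeMillis) {
            Properties cached = new Properties();
            InputStream in = new FileInputStream(cacheFile);
            try { cached.load(in); } finally { in.close(); }
            return cached;
        }
        Properties fresh = new Properties();
        fresh.putAll(plugin.collect());                    // Map-based plugin from the earlier sketch
        OutputStream out = new FileOutputStream(cacheFile);
        try { fresh.store(out, "node capabilities"); } finally { out.close(); }
        return fresh;
    }
}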

Questions:
- Where should I put the configuration so that it is available throughout the cluster (especially to the NameNode and to the JobTracker)? Would DatanodeInfo be a good place?
- Would it make sense to merge the capabilities with the generic conf structure?
- Plugins (.class, shell and Perl scripts) currently reside in $HADOOP_HOME/plugins. I am not quite happy with that and not yet sure where to place them in the build stack. Any recommendations? Maybe $HADOOP_HOME/bin/plugins?




[jira] Commented: (HADOOP-3999) Need to add host capabilities / abilities

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638999#action_12638999 ] 

Kai Mosebach commented on HADOOP-3999:
--------------------------------------

First implementation is on the way, extending the global config structure with information acquired by performance/software-check plugins.
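
As a rough illustration of what "extending the global config structure" could mean: Configuration.set(String, String) is the real Hadoop API, while the glue class and the capability keys are assumptions.

import java.util.Map;

import org.apache.hadoop.conf.Configuration;

// Hypothetical glue code: copy collected capability values into the node's
// Hadoop Configuration so other components can read them as ordinary conf keys.
public class CapabilityConfLoader {
    public static void mergeInto(Configuration conf, Map<String, String> capabilities) {
        for (Map.Entry<String, String> e : capabilities.entrySet()) {
            conf.set(e.getKey(), e.getValue());   // e.g. "capability.performance.diskwrite" = "85"
        }
    }
}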



[jira] Updated: (HADOOP-3999) Dynamic host configuration system (via node side plugins)

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Mosebach updated HADOOP-3999:
---------------------------------

    Attachment: cloud_divide.jpg

- Nodes collect local information (functions / performance indicators / other) via plugins.
- We assume the job scheduler knows this information about the (now individualized) nodes.
- The cloud might be logically split up into several sections, e.g. functional ones providing some special software or having some special capability (a toy sketch of such grouping follows below).
- The scheduler can now provide different quality levels (service levels, software) as well as quantity levels (performance, bandwidth) to the customer.
- Customers can now submit differently "profiled" jobs and, depending on the profile they submit, be charged at a different cost.
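
A toy sketch of the "logically split up" idea, grouping nodes into sections by a single capability key; all class names and keys here are hypothetical:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: split the cloud into logical sections depending on
// whether a node advertises a particular piece of special software.
public class CloudSections {
    public static Map<String, List<String>> groupBySoftware(
            Map<String, Map<String, String>> capabilitiesByNode) {
        Map<String, List<String>> sections = new HashMap<String, List<String>>();
        for (Map.Entry<String, Map<String, String>> node : capabilitiesByNode.entrySet()) {
            String section = node.getValue().containsKey("capability.software.r.path")
                    ? "r-nodes" : "generic-nodes";
            List<String> members = sections.get(section);
            if (members == null) {
                members = new ArrayList<String>();
                sections.put(section, members);
            }
            members.add(node.getKey());           // node name, e.g. "node17"
        }
        return sections;
    }
}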





[jira] Updated: (HADOOP-3999) Dynamic host configuration system (via node side plugins)

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Mosebach updated HADOOP-3999:
---------------------------------

    Component/s: conf
                 benchmarks
        Summary: Dynamic host configuration system (via node side plugins)  (was: Need to add host capabilities / abilities)



[jira] Updated: (HADOOP-3999) Need to add host capabilities / abilities

Posted by "Kai Mosebach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Mosebach updated HADOOP-3999:
---------------------------------

    Description: 
The MapReduce paradigm is limited to running MapReduce jobs against the lowest common denominator of all nodes in the cluster.

On the one hand this is intentional (cloud computing: throw simple jobs in, never mind who runs them).
On the other hand it limits the possibilities quite a lot; for instance, if you had data that could or needs to be fed to a third-party interface like MATLAB, R or BioConductor, you could solve a lot more jobs via Hadoop.

Furthermore it could be interesting to know the OS, the architecture and the performance of a node in relation to the rest of the cluster (performance ranking).
I.e. if I knew about a sub-cluster of nodes with very fast CPUs, or a sub-cluster with very fast disk I/O, the job tracker could select these nodes according to a so-called job profile (i.e. my job is a heavy computing job / heavy disk I/O job), which a developer can usually estimate in advance.

To achieve this, node capabilities could be introduced and stored in the DFS, giving you

a1.) basic information about each node (OS, architecture)
a2.) more sophisticated info (additional software, path to the software, version)
a3.) KPIs collected about the node (disk I/O, CPU power, memory)
a4.) network throughput to neighboring hosts, which might allow generating a network performance map of the cluster

This would allow you to

b1.) generate jobs that have a profile (computing intensive, disk I/O intensive, net I/O intensive)
b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with MATLAB only)
b3.) generate a performance map of the cluster (sub-clusters of fast disk nodes, sub-clusters of fast CPU nodes, a network-speed relation map between nodes)

From step b3) you could then even acquire statistical information, which could in turn be fed to the DFS NameNode to see whether data could be stored on fast-disk sub-clusters only (though that might need to be a tool outside of Hadoop core).





[jira] Commented: (HADOOP-3999) Need to add host capabilities / abilities

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639040#action_12639040 ] 

Steve Loughran commented on HADOOP-3999:
----------------------------------------

1. This would be good if it could be easily extended; rather than a hard-coded set of values, clients could add other (key, value) info for schedulers to use: things like expected availability for cycle-scavenging task trackers, and other extensions that custom schedulers could use. It could also integrate with diagnostics.
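
For instance (purely illustrative, not an existing scheduler API), a custom scheduler could filter task trackers on any such optional key, e.g. an expected-availability value:

import java.util.Map;

// Illustrative only: a custom scheduler checking an arbitrary (key, value)
// capability entry, e.g. the expected availability of a cycle-scavenging tracker.
public class CapabilityFilter {
    /** Returns true if the tracker advertises at least the required expected availability (0..1). */
    public static boolean isAvailableEnough(Map<String, String> trackerCapabilities,
                                            double requiredAvailability) {
        String value = trackerCapabilities.get("capability.expected-availability");
        if (value == null) {
            return false;                         // tracker did not report this optional key
        }
        try {
            return Double.parseDouble(value) >= requiredAvailability;
        } catch (NumberFormatException e) {
            return false;                         // malformed value: treat as unavailable
        }
    }
}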

2. There's a danger here in trying to do a full grid scheduler. Why danger? It is hard to get right, and there are other tools and products that can already do a lot of this. Hadoop likes to push work near the data and works best if the work is all Java.

3. Developers are surprisingly bad at estimating workload, especially if you have a few layers between you and the MR jobs. The best metric for how long / how CPU-intensive / how I/O-intensive a job will be is "what it was like last time".
