Posted to common-user@hadoop.apache.org by "David J. O'Dell" <do...@videoegg.com> on 2009/11/18 17:20:47 UTC

names or ips in rack awareness script?

I'm trying to figure out if I should use IP addresses or DNS names in my 
rack awareness script.

It's easier for me to use DNS names because we have the row and rack 
number in the name, which means I can dynamically determine the rack 
without having to manually update the list when adding nodes.

However, this won't work if the script is passed IPs as arguments.
Does anyone know what is passed to the script (IPs or DNS names)?

Relevant docs:
http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html#Hadoop+Rack+Awareness
and
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/net/DNSToSwitchMapping.html#resolve(java.util.List)



Re: execute multiple MR jobs

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
JobClient (.18) / Job(.20) class apis should help you achieve this.

Amogh
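[Since `hadoop jar` propagates the driver's exit status (drivers using JobClient.runJob or Job.waitForCompletion typically exit nonzero on failure), the decision can also be made at the shell level. A minimal sketch of that control flow; run_job is a stub standing in for a real job submission, and all names are placeholders, not anything from this thread:]

```shell
#!/bin/bash
# Chain MR jobs where running the next job depends on the previous result.
# run_job stands in for "hadoop jar myjobs.jar <driver-class> ..." so the
# control flow can be exercised anywhere; swap in real submissions.
run_job() {
    echo "running $1"
    return 0   # a real job's exit status would come from hadoop jar
}

if run_job job1 ; then
    # Decide dynamically: with real jobs you might instead test for output,
    # e.g. "hadoop fs -test -e /stage1/part-00000", or read a job counter.
    run_job job2
else
    echo "job1 failed; skipping job2" >&2
fi
```

[The same decision can be made in Java: Job.waitForCompletion(true) returns a boolean, and the finished job's counters can be inspected before submitting the next job.]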


On 11/19/09 1:40 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
I am going to execute multiple MapReduce jobs in sequence, but whether or not to execute a given job in the sequence cannot be determined beforehand; it depends on the result of the previous job. Does anyone have ideas on how to do this dynamically?

P.S. I guess Cascading could help, but I haven't got the point of Cascading yet. I'd appreciate it if someone could give me some hints on this.


Gang Luo
---------
Department of Computer Science
Duke University
(919)316-0993
gang.luo@duke.edu



----- Original Message ----
From: Edward Capriolo <ed...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2009/11/18 (Wed) 1:02:35 PM
Subject: Re: names or ips in rack awareness script?

On Wed, Nov 18, 2009 at 11:28 AM, Michael Thomas <th...@hep.caltech.edu> wrote:
> IPs are passed to the rack awareness script.  We use 'dig' to do the reverse
> lookup to find the hostname, as we also embed the rack id in the worker node
> hostnames.
>
> --Mike
>
> On 11/18/2009 08:20 AM, David J. O'Dell wrote:
>>
>> I'm trying to figure out if I should use ip addresses or dns names in my
>> rack awareness script.
>>
>> Its easier for me to use dns names because we have the row and rack
>> number in the name which means I can dynamically determine the rack
>> without having to manually update the list when adding nodes.
>>
>> However this won't work if the script is passed ips as arguments.
>> Does anyone know what is being passed on to the script(ip's or dns names)
>>
>> Relevant docs:
>>
>> http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html#Hadoop+Rack+Awareness
>>
>> and
>>
>> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/net/DNSToSwitchMapping.html#resolve(java.util.List)
>>
>>
>>
>
>
>

It was never clear to me whether an IP or a hostname would be needed, so I
specified IPs, short hostnames, and long hostnames just to be safe. And
you know things sometimes change with Hadoop ::wink-wink::

I have been meaning to plug my topology script for a while (as I think
it is pretty cool). I separated my topology script from my topology
data like so:

topology.sh
HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
    echo -n "/default-rack "
  else
    echo -n "$result "
  fi
done

topology.data
hadoopdata1.ec.com     /dc1/rack1
hadoopdata1                   /dc1/rack1
10.1.1.1                       /dc1/rack1

It is great if your hostname reflects the rack name in some parsable
format! Then you do not need to maintain a topology data file like I
do. As of now I generate it from our asset DB.

Good luck!





execute multiple MR jobs

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
I am going to execute multiple MapReduce jobs in sequence, but whether or not to execute a given job in the sequence cannot be determined beforehand; it depends on the result of the previous job. Does anyone have ideas on how to do this dynamically?

P.S. I guess Cascading could help, but I haven't got the point of Cascading yet. I'd appreciate it if someone could give me some hints on this.

 
Gang Luo
---------
Department of Computer Science
Duke University
(919)316-0993
gang.luo@duke.edu



----- Original Message ----
From: Edward Capriolo <ed...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2009/11/18 (Wed) 1:02:35 PM
Subject: Re: names or ips in rack awareness script?

On Wed, Nov 18, 2009 at 11:28 AM, Michael Thomas <th...@hep.caltech.edu> wrote:
> IPs are passed to the rack awareness script.  We use 'dig' to do the reverse
> lookup to find the hostname, as we also embed the rack id in the worker node
> hostnames.
>
> --Mike
>
> On 11/18/2009 08:20 AM, David J. O'Dell wrote:
>>
>> I'm trying to figure out if I should use ip addresses or dns names in my
>> rack awareness script.
>>
>> Its easier for me to use dns names because we have the row and rack
>> number in the name which means I can dynamically determine the rack
>> without having to manually update the list when adding nodes.
>>
>> However this won't work if the script is passed ips as arguments.
>> Does anyone know what is being passed on to the script(ip's or dns names)
>>
>> Relevant docs:
>>
>> http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html#Hadoop+Rack+Awareness
>>
>> and
>>
>> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/net/DNSToSwitchMapping.html#resolve(java.util.List)
>>
>>
>>
>
>
>

It was never clear to me whether an IP or a hostname would be needed, so I
specified IPs, short hostnames, and long hostnames just to be safe. And
you know things sometimes change with Hadoop ::wink-wink::

I have been meaning to plug my topology script for a while (as I think
it is pretty cool). I separated my topology script from my topology
data like so:

topology.sh
HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
    echo -n "/default-rack "
  else
    echo -n "$result "
  fi
done

topology.data
hadoopdata1.ec.com     /dc1/rack1
hadoopdata1                   /dc1/rack1
10.1.1.1                       /dc1/rack1

It is great if your hostname reflects the rack name in some parsable
format! Then you do not need to maintain a topology data file like I
do. As of now I generate it from our asset DB.

Good luck!




Re: names or ips in rack awareness script?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On 11/18/09 10:02 AM, "Edward Capriolo" <ed...@gmail.com> wrote:
> It was never clear to me what would be needed ip vs hostname. I
> specified ip, short hostnames, and long hostnames just to be safe. And
> you know things sometimes change with hadoop ::wink-wink::

IIRC, everything is pretty much passed around as IP addresses internally.  The
topology script is expected to take IPs as input.  But even so, it is
probably smart to plan for both IPs and names.

FWIW, I have a small example of what we used at Yahoo! and its
output buried in my Hadoop 24/7 presentation from ApacheCon this year.
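[A sketch of that defensive approach: normalize each argument to a hostname first, so the same mapping logic works whether the framework hands the script IPs or names. The getent call is an assumption about the resolver setup, not something from this thread:]

```shell
#!/bin/bash
# Topology-script front end that tolerates both IPs and hostnames.
is_ip() {
    # Crude IPv4 check; good enough for a topology script's inputs.
    [[ "$1" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]
}

for arg in "$@" ; do
    if is_ip "$arg" ; then
        # Reverse-resolve; fall back to the raw IP if the lookup fails.
        host=$(getent hosts "$arg" | awk '{print $2}')
        host=${host:-$arg}
    else
        host=$arg
    fi
    # ...derive the rack from $host as in the scripts elsewhere in the
    # thread; the fixed answer below is just a placeholder.
    echo -n "/default-rack "
done
```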


Re: names or ips in rack awareness script?

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Nov 18, 2009 at 11:28 AM, Michael Thomas <th...@hep.caltech.edu> wrote:
> IPs are passed to the rack awareness script.  We use 'dig' to do the reverse
> lookup to find the hostname, as we also embed the rack id in the worker node
> hostnames.
>
> --Mike
>
> On 11/18/2009 08:20 AM, David J. O'Dell wrote:
>>
>> I'm trying to figure out if I should use ip addresses or dns names in my
>> rack awareness script.
>>
>> Its easier for me to use dns names because we have the row and rack
>> number in the name which means I can dynamically determine the rack
>> without having to manually update the list when adding nodes.
>>
>> However this won't work if the script is passed ips as arguments.
>> Does anyone know what is being passed on to the script(ip's or dns names)
>>
>> Relevant docs:
>>
>> http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html#Hadoop+Rack+Awareness
>>
>> and
>>
>> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/net/DNSToSwitchMapping.html#resolve(java.util.List)
>>
>>
>>
>
>
>

It was never clear to me whether an IP or a hostname would be needed, so I
specified IPs, short hostnames, and long hostnames just to be safe. And
you know things sometimes change with Hadoop ::wink-wink::

I have been meaning to plug my topology script for a while (as I think
it is pretty cool). I separated my topology script from my topology
data like so:

topology.sh
HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
    echo -n "/default-rack "
  else
    echo -n "$result "
  fi
done

topology.data
hadoopdata1.ec.com     /dc1/rack1
hadoopdata1                   /dc1/rack1
10.1.1.1                       /dc1/rack1

It is great if your hostname reflects the rack name in some parsable
format! Then you do not need to maintain a topology data file like I
do. As of now I generate it from our asset DB.
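[To make the lookup behaviour concrete, here is a self-contained rerun of the table-lookup logic in topology.sh, with the topology.data contents inlined so it can be tried without the files in place:]

```shell
#!/bin/bash
# Standalone version of the lookup in topology.sh above: the last matching
# entry wins, and unknown nodes map to /default-rack.
TOPO_DATA='hadoopdata1.ec.com /dc1/rack1
hadoopdata1 /dc1/rack1
10.1.1.1 /dc1/rack1'

lookup_rack() {    # $1 = hostname or IP; prints its rack
    local result=""
    while read -r key rack ; do
        if [ "$key" = "$1" ] ; then
            result=$rack
        fi
    done <<< "$TOPO_DATA"
    echo "${result:-/default-rack}"
}

lookup_rack 10.1.1.1       # prints /dc1/rack1
lookup_rack unknown-host   # prints /default-rack
```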

Good luck!

Re: Can I change the block size and then restart?

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, Nov 19, 2009 at 11:24 AM, Raymond Jennings III
<ra...@yahoo.com> wrote:
> Can I just change the block size in the config and restart, or do I have to reformat? It's okay if what is currently in the file system stays at the old block size, if that's possible.
>
>
>
>

Raymond,

The block size is actually a per-file setting. Each client can set its
own block size, unless you marked the variable as <final> in
your configuration.

Changing the value does not change existing blocks; it only serves as
the default for new files.
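[Since the setting is per-file, a client can pass it when writing new data. A sketch of the CLI side; dfs.block.size is the 0.20-era key, and the file and paths are placeholders:]

```shell
#!/bin/bash
# 128 MB expressed in bytes, the unit dfs.block.size expects.
BLOCK=$((128 * 1024 * 1024))
echo "$BLOCK"   # prints 134217728

# Guarded so the sketch runs even where no Hadoop client is installed.
if command -v hadoop >/dev/null 2>&1 ; then
    # New file written with a 128 MB block size; existing files keep theirs.
    hadoop fs -D dfs.block.size=$BLOCK -put bigfile.dat /data/bigfile.dat
    # Verify what block size a file actually has:
    hadoop fsck /data/bigfile.dat -files -blocks
fi
```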

Edward

Re: names or ips in rack awareness script?

Posted by Edward Capriolo <ed...@gmail.com>.
Feel free to add this here:

http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

On Thu, Nov 19, 2009 at 11:18 AM, Michael Thomas <th...@hep.caltech.edu> wrote:
> Steve Loughran wrote:
>> Michael Thomas wrote:
>>> IPs are passed to the rack awareness script.  We use 'dig' to do the
>>> reverse lookup to find the hostname, as we also embed the rack id in
>>> the worker node hostnames.
>>>
>>
>> It might be nice to have some example scripts up on the wiki, to give
>> people a good starting place
>
> If somebody with write access to the wiki would like to add it, here is
> the one we use on our Rocks cluster.
>
> --Mike
>
>
> #!/bin/sh
>
> # The default rule assumes that the nodes are connected to the PDU
> # and switch located in the same rack.  Only the exceptions need
> # to be listed here.
>
> # In our Rocks cluster, nodes are named "compute-X-Y", where X is the
> # Rack identifier and Y is the vertical position of the node within
> # the rack.
>
> for ip in $@ ; do
>     hostname=`nslookup $ip | grep "name =" | awk '{print $4}' | sed -e 's/\.local\.$//'`
>     case $hostname in
>         compute-5-8)
>             # Exception: This node had to be rewired into
>             # an adjacent rack
>             rack="/Rack4"
>             ;;
>         *)
>             rack=`echo $hostname | sed -e 's/^[a-z]*-\([0-9]*\)-[0-9]*.*/\/Rack\1/'`
>             ;;
>     esac
>     echo $rack
> done
>

Can I change the block size and then restart?

Posted by Raymond Jennings III <ra...@yahoo.com>.
Can I just change the block size in the config and restart, or do I have to reformat? It's okay if what is currently in the file system stays at the old block size, if that's possible.


      

Re: names or ips in rack awareness script?

Posted by Michael Thomas <th...@hep.caltech.edu>.
Steve Loughran wrote:
> Michael Thomas wrote:
>> IPs are passed to the rack awareness script.  We use 'dig' to do the
>> reverse lookup to find the hostname, as we also embed the rack id in
>> the worker node hostnames.
>>
> 
> It might be nice to have some example scripts up on the wiki, to give
> people a good starting place

If somebody with write access to the wiki would like to add it, here is
the one we use on our Rocks cluster.

--Mike


#!/bin/sh

# The default rule assumes that the nodes are connected to the PDU
# and switch located in the same rack.  Only the exceptions need
# to be listed here.

# In our Rocks cluster, nodes are named "compute-X-Y", where X is the
# Rack identifier and Y is the vertical position of the node within
# the rack.

for ip in $@ ; do
    hostname=`nslookup $ip | grep "name =" | awk '{print $4}' | sed -e 's/\.local\.$//'`
    case $hostname in
        compute-5-8)
            # Exception: This node had to be rewired into
            # an adjacent rack
            rack="/Rack4"
            ;;
        *)
            rack=`echo $hostname | sed -e 's/^[a-z]*-\([0-9]*\)-[0-9]*.*/\/Rack\1/'`
            ;;
    esac
    echo $rack
done
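[The hostname-to-rack step can be tried on its own; this just isolates the sed rule from the script above:]

```shell
#!/bin/bash
# Isolated "compute-X-Y" -> "/RackX" mapping from the Rocks script above.
to_rack() {
    echo "$1" | sed -e 's/^[a-z]*-\([0-9]*\)-[0-9]*.*/\/Rack\1/'
}

to_rack compute-3-12   # prints /Rack3
to_rack compute-10-1   # prints /Rack10
```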

Re: names or ips in rack awareness script?

Posted by Steve Loughran <st...@apache.org>.
Michael Thomas wrote:
> IPs are passed to the rack awareness script.  We use 'dig' to do the 
> reverse lookup to find the hostname, as we also embed the rack id in the 
> worker node hostnames.
> 

It might be nice to have some example scripts up on the wiki, to give 
people a good starting place

Re: names or ips in rack awareness script?

Posted by Michael Thomas <th...@hep.caltech.edu>.
IPs are passed to the rack awareness script.  We use 'dig' to do the 
reverse lookup to find the hostname, as we also embed the rack id in the 
worker node hostnames.

--Mike

On 11/18/2009 08:20 AM, David J. O'Dell wrote:
> I'm trying to figure out if I should use ip addresses or dns names in my
> rack awareness script.
>
> Its easier for me to use dns names because we have the row and rack
> number in the name which means I can dynamically determine the rack
> without having to manually update the list when adding nodes.
>
> However this won't work if the script is passed ips as arguments.
> Does anyone know what is being passed on to the script(ip's or dns names)
>
> Relevant docs:
> http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html#Hadoop+Rack+Awareness
>
> and
> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/net/DNSToSwitchMapping.html#resolve(java.util.List)
>
>
>