
Posted to dev@whirr.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2011/01/04 06:55:46 UTC

[jira] Created: (WHIRR-189) Hadoop on EC2 should use all available local storage

Hadoop on EC2 should use all available local storage
----------------------------------------------------

                 Key: WHIRR-189
                 URL: https://issues.apache.org/jira/browse/WHIRR-189
             Project: Whirr
          Issue Type: Improvement
          Components: service/hadoop
            Reporter: Tom White
             Fix For: 0.3.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Adrian Cole (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977413#action_12977413 ] 

Adrian Cole commented on WHIRR-189:
-----------------------------------

The jclouds NodeMetadata object has a hardware.volumes collection you could use to determine which volumes are available.


[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079247#comment-13079247 ] 

Andrei Savu commented on WHIRR-189:
-----------------------------------

Should we try to push this in 0.6.0? Looks good to me. 


[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019986#comment-13019986 ] 

Tom White commented on WHIRR-189:
---------------------------------

Thanks Adrian. I opened http://code.google.com/p/jclouds/issues/detail?id=529 for this.

BTW does jclouds know which ephemeral devices are not mapped by default for each instance size? It would be nice to be able to say "map all devices".


[jira] [Updated] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated WHIRR-189:
----------------------------

    Attachment: WHIRR-189.patch

Here's an initial patch for this. It works for Hadoop on EC2. The idea is that there's a function called prepare_all_disks() which is passed volume information, and which creates directories /data0, /data1, etc. for the service to use. This moves more cloud-specific code out of the scripts.
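As a rough illustration of the idea only (this is not the function from WHIRR-189.patch, whose arguments are not shown in this thread), a function of that shape might look like the sketch below. The "mountpoint,device" argument format and the DATA_PREFIX variable are assumptions made for the example; DATA_PREFIX exists only so the sketch can be tried without root.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the prepare_all_disks() from WHIRR-189.patch.
# Assumed input: one "mountpoint,device" pair per argument, for volumes
# that are already mounted.
# DATA_PREFIX is an assumption for experimentation; leave it empty to
# create /data0, /data1, ... at the filesystem root.
prepare_all_disks() {
  local prefix=${DATA_PREFIX:-}
  local index=0 pair mount_point
  for pair in "$@"; do
    mount_point=${pair%%,*}   # text before the first comma
    # Point dataN at the existing mount point unless dataN already exists
    if [ ! -e "$prefix/data$index" ]; then
      ln -s "$mount_point" "$prefix/data$index"
    fi
    index=$((index + 1))
  done
}
```

With an empty DATA_PREFIX, calling `prepare_all_disks /mnt,/dev/sdb /mnt2,/dev/sdc` as root would create /data0 and /data1 as symlinks to /mnt and /mnt2.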


[jira] Updated: (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrei Savu updated WHIRR-189:
------------------------------

    Fix Version/s:     (was: 0.4.0)


[jira] [Updated] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated WHIRR-189:
----------------------------

    Attachment: WHIRR-189.patch

Refreshed patch. Needs some testing.


[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050702#comment-13050702 ] 

Andrei Savu commented on WHIRR-189:
-----------------------------------

Looks good. It seems like this patch incorporates WHIRR-328. Do we want that, or should we search for a better way of avoiding EBS (e.g. by rewriting parts of the templateBuilder in Whirr)?



[jira] Issue Comment Edited: (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977263#action_12977263 ] 

Olivier Grisel edited comment on WHIRR-189 at 1/4/11 1:17 PM:
--------------------------------------------------------------

FYI the local drive that has the most space on the m1.small instances is not mounted on / but on /media/ephemeral0

[ec2-user@ip-10-203-86-244 ec2-user]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  1.8G  6.1G  23% /
tmpfs                 840M     0  840M   0% /dev/shm
/dev/xvda2            147G  6.4G  133G   5% /media/ephemeral0


Both drives have the same (read) speed though:

[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda1
/dev/xvda1:
 Timing cached reads:   3966 MB in  2.05 seconds = 1939.33 MB/sec
 Timing buffered disk reads:  372 MB in  3.02 seconds = 123.23 MB/sec

[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda2
/dev/xvda2:
 Timing cached reads:   4442 MB in  2.00 seconds = 2222.59 MB/sec
 Timing buffered disk reads:  376 MB in  3.00 seconds = 125.19 MB/sec
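The observation above (the largest local volume is not the root filesystem) could be checked from a script along the following lines. This is only a sketch: the function name largest_local_mount is made up for the example, it expects `df -P -k` output on stdin (POSIX format, so the columns are predictable), and it does not handle mount points that contain spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P -k` output on stdin and print the mount
# point of the /dev/* filesystem with the most available space.
# Columns in -P output: Filesystem, 1024-blocks, Used, Available,
# Capacity, Mounted on.
largest_local_mount() {
  awk 'NR > 1 && $1 ~ /^\/dev\// && $4 + 0 > best { best = $4 + 0; mount = $6 }
       END { if (mount != "") print mount }'
}

# Typical use on a running instance:
#   df -P -k | largest_local_mount
```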





[jira] Commented: (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977263#action_12977263 ] 

Olivier Grisel commented on WHIRR-189:
--------------------------------------

FYI the local drive that has the most space on the m1.small instances is not mounted on / but on /media/ephemeral0

[ec2-user@ip-10-203-86-244 ec2-user]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  1.8G  6.1G  23% /
tmpfs                 840M     0  840M   0% /dev/shm
/dev/xvda2            147G  6.4G  133G   5% /media/ephemeral0
[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda2

Both drives have the same (read) speed though:

/dev/xvda1:
 Timing cached reads:   3966 MB in  2.05 seconds = 1939.33 MB/sec
 Timing buffered disk reads:  372 MB in  3.02 seconds = 123.23 MB/sec

/dev/xvda2:
 Timing cached reads:   4442 MB in  2.00 seconds = 2222.59 MB/sec
 Timing buffered disk reads:  376 MB in  3.00 seconds = 125.19 MB/sec
[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda1




[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018974#comment-13018974 ] 

Tom White commented on WHIRR-189:
---------------------------------

On EC2, I noticed that for an m1.large instance jclouds reports that there are two local volumes, /dev/sdb and /dev/sdc (for EBS-backed images), even though /dev/sdc is not present on the instance. This is explained by the second note on http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?instance-storage-concepts.html. We can use EC2TemplateOptions to map all the ephemeral devices (but it's odd that jclouds reports a device that isn't actually mapped). This will require EC2-specific code that knows about the ephemeral devices on each instance size - I wonder if jclouds abstracts this?

Once the above is sorted out, I imagine the implementation would include all the non-boot-device volumes for its storage. HadoopConfigurationBuilder would set dfs.data.dir, dfs.name.dir, and mapred.local.dir to use all the volumes. And the volumes would need mounting/symlinking (and possibly formatting in the case of EC2) using scripts like (this code is based on code from the Python scripts):

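{code}
# TODO: make sure that mkfs.xfs is installed
# Usage: prep_disk <mount-dir> <device> [automount]
function prep_disk() {
  mount=$1
  device=$2
  automount=${3:-false}

  # A device counts as mounted if it appears in the first column of /proc/mounts
  if grep -q "^$device " /proc/mounts; then
    echo "$device is mounted"
    if [ ! -d "$mount" ]; then
      echo "No mount"
      # Symlink the requested directory to wherever the device is already mounted
      ln -s "$(awk -v dev="$device" '$1 == dev {print $2; exit}' /proc/mounts)" "$mount"
    fi
  else
    # Not mounted: format the device, then mount it at the requested directory
    echo "warning: ERASING CONTENTS OF $device"
    mkfs.xfs -f "$device"
    mkdir -p "$mount"
    mount -o defaults,noatime "$device" "$mount"
    if $automount ; then
      # Remount automatically after a reboot
      echo "$device $mount xfs defaults,noatime 0 0" >> /etc/fstab
    fi
  fi
}

prep_disk /data1 /dev/sdb true
prep_disk /data2 /dev/sdc true
{code}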
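For the configuration side, dfs.data.dir, dfs.name.dir, and mapred.local.dir each accept a comma-separated list of directories, so the values could be assembled roughly as follows. This is a sketch only: the helper name build_dir_list and the sub-directory names are placeholders invented for the example, not what HadoopConfigurationBuilder does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append <subdir> to each data directory and join the
# results with commas.
# Usage: build_dir_list <subdir> <data-dir>...
build_dir_list() {
  local subdir=$1; shift
  local list="" dir
  for dir in "$@"; do
    # ${list:+$list,} adds a comma separator only when list is non-empty
    list="${list:+$list,}$dir$subdir"
  done
  echo "$list"
}

# Example values; the sub-directory names are placeholders.
DFS_DATA_DIR=$(build_dir_list /hadoop/hdfs/data /data1 /data2)
DFS_NAME_DIR=$(build_dir_list /hadoop/hdfs/name /data1 /data2)
MAPRED_LOCAL_DIR=$(build_dir_list /hadoop/mapred/local /data1 /data2)
echo "dfs.data.dir=$DFS_DATA_DIR"
# prints: dfs.data.dir=/data1/hadoop/hdfs/data,/data2/hadoop/hdfs/data
```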


[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Tibor Kiss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050596#comment-13050596 ] 

Tibor Kiss commented on WHIRR-189:
----------------------------------

+1
I tried this patch and it works on EC2.


[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Adrian Cole (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019334#comment-13019334 ] 

Adrian Cole commented on WHIRR-189:
-----------------------------------

@tom I think this is a bug we'll have to test in jclouds. Right now, we don't verify that the volumes listed are actually available. Do you mind raising an issue on this? Thanks.


[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all available local storage

Posted by "Adrian Cole (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079271#comment-13079271 ] 

Adrian Cole commented on WHIRR-189:
-----------------------------------

Any way we can get an integration test for this? E.g. verify through some HDFS call that the size is correct?
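One possible shape for such a check, offered as a sketch rather than a proposed test: compare the cluster-wide "Configured Capacity" figure reported by `hadoop dfsadmin -report` against a minimum expected size. The function name check_configured_capacity and the threshold are assumptions made for the example; the function reads the report from stdin so it can be exercised without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical check: read `hadoop dfsadmin -report` output on stdin and
# succeed only if the first "Configured Capacity" line reports at least
# <min-bytes> bytes.
# Usage: hadoop dfsadmin -report | check_configured_capacity <min-bytes>
check_configured_capacity() {
  local min_bytes=$1 capacity
  # A report line looks like: "Configured Capacity: 1234567890 (1.15 GB)"
  capacity=$(awk -F': *' '/^Configured Capacity:/ {split($2, a, " "); print a[1]; exit}')
  [ -n "$capacity" ] && [ "$capacity" -ge "$min_bytes" ]
}
```

The minimum would have to be derived from the instance type, for example the summed size of the ephemeral volumes on each datanode, which is outside the scope of this sketch.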
