Posted to dev@whirr.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2011/01/04 06:55:46 UTC
[jira] Created: (WHIRR-189) Hadoop on EC2 should use all available
local storage
Hadoop on EC2 should use all available local storage
----------------------------------------------------
Key: WHIRR-189
URL: https://issues.apache.org/jira/browse/WHIRR-189
Project: Whirr
Issue Type: Improvement
Components: service/hadoop
Reporter: Tom White
Fix For: 0.3.0
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Adrian Cole (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977413#action_12977413 ]
Adrian Cole commented on WHIRR-189:
-----------------------------------
The jclouds NodeMetadata object has a hardware.volumes collection that you could use to determine which volumes are available.
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079247#comment-13079247 ]
Andrei Savu commented on WHIRR-189:
-----------------------------------
Should we try to push this in 0.6.0? Looks good to me.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Tom White (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019986#comment-13019986 ]
Tom White commented on WHIRR-189:
---------------------------------
Thanks Adrian. I opened http://code.google.com/p/jclouds/issues/detail?id=529 for this.
BTW does jclouds know which ephemeral devices are not mapped by default for each instance size? It would be nice to be able to say "map all devices".
[jira] [Updated] (WHIRR-189) Hadoop on EC2 should use all available
local storage
Posted by "Tom White (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom White updated WHIRR-189:
----------------------------
Attachment: WHIRR-189.patch
Here's an initial patch for this. Hadoop on EC2 works. The idea is that there's a function called prepare_all_disks() which is passed volume information, and which creates directories /data0, /data1, etc. for the service to use. This moves more cloud-specific code out of the scripts.
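The patch itself is not reproduced in this thread, so the following is only a minimal sketch of the idea, not the code from WHIRR-189.patch. It assumes the volume information arrives as a list of already-mounted mount points, and it uses a hypothetical DATA_ROOT variable so the function can be tried without root:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: not the code from WHIRR-189.patch.
# DATA_ROOT is an assumed override; left empty, the links land in /.
DATA_ROOT=${DATA_ROOT:-}

# prepare_all_disks <mount-point>...
# Links each given mount point to $DATA_ROOT/data0, $DATA_ROOT/data1, ...
# in argument order, leaving any existing entry untouched.
function prepare_all_disks() {
  local i=0 mount_point data_dir
  for mount_point in "$@"; do
    data_dir="$DATA_ROOT/data$i"
    if [ ! -e "$data_dir" ]; then
      ln -s "$mount_point" "$data_dir"
    fi
    i=$((i + 1))
  done
}
```

With DATA_ROOT unset, calling prepare_all_disks /mnt /media/ephemeral0 would link /data0 to /mnt and /data1 to /media/ephemeral0, so the service only ever needs to know about the /dataN names.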
[jira] Updated: (WHIRR-189) Hadoop on EC2 should use all available
local storage
Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrei Savu updated WHIRR-189:
------------------------------
Fix Version/s: (was: 0.4.0)
[jira] [Updated] (WHIRR-189) Hadoop on EC2 should use all available
local storage
Posted by "Tom White (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom White updated WHIRR-189:
----------------------------
Attachment: WHIRR-189.patch
Refreshed patch. Needs some testing.
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050702#comment-13050702 ]
Andrei Savu commented on WHIRR-189:
-----------------------------------
Looks good. It seems like this patch incorporates WHIRR-328. Do we want that, or should we search for a better way of avoiding EBS (e.g. by rewriting parts of the templateBuilder in Whirr)?
[jira] Issue Comment Edited: (WHIRR-189) Hadoop on EC2 should use
all available local storage
Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977263#action_12977263 ]
Olivier Grisel edited comment on WHIRR-189 at 1/4/11 1:17 PM:
--------------------------------------------------------------
FYI the local drive that has the most space on the m1.small instances is not mounted on / but on /media/ephemeral0
[ec2-user@ip-10-203-86-244 ec2-user]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  1.8G  6.1G  23% /
tmpfs           840M     0  840M   0% /dev/shm
/dev/xvda2      147G  6.4G  133G   5% /media/ephemeral0
Both drives have the same (read) speed though:
[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda1
/dev/xvda1:
 Timing cached reads: 3966 MB in 2.05 seconds = 1939.33 MB/sec
 Timing buffered disk reads: 372 MB in 3.02 seconds = 123.23 MB/sec
[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda2
/dev/xvda2:
 Timing cached reads: 4442 MB in 2.00 seconds = 2222.59 MB/sec
 Timing buffered disk reads: 376 MB in 3.00 seconds = 125.19 MB/sec
[jira] Commented: (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977263#action_12977263 ]
Olivier Grisel commented on WHIRR-189:
--------------------------------------
FYI the local drive that has the most space on the m1.small instances is not mounted on / but on /media/ephemeral0
[ec2-user@ip-10-203-86-244 ec2-user]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  1.8G  6.1G  23% /
tmpfs           840M     0  840M   0% /dev/shm
/dev/xvda2      147G  6.4G  133G   5% /media/ephemeral0
Both drives have the same (read) speed though:
[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda1
/dev/xvda1:
 Timing cached reads: 3966 MB in 2.05 seconds = 1939.33 MB/sec
 Timing buffered disk reads: 372 MB in 3.02 seconds = 123.23 MB/sec
[ec2-user@ip-10-203-86-244 ec2-user]$ sudo /sbin/hdparm -tT /dev/xvda2
/dev/xvda2:
 Timing cached reads: 4442 MB in 2.00 seconds = 2222.59 MB/sec
 Timing buffered disk reads: 376 MB in 3.00 seconds = 125.19 MB/sec
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Tom White (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018974#comment-13018974 ]
Tom White commented on WHIRR-189:
---------------------------------
On EC2, I noticed that for an m1.large instance jclouds reports two local volumes, /dev/sdb and /dev/sdc (for EBS-backed images), even though /dev/sdc is not present on the instance. This is explained by the second note on http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?instance-storage-concepts.html. We can use EC2TemplateOptions to map all the ephemeral devices (but it's odd that jclouds reports a device that isn't actually mapped). This will require EC2-specific code that knows about the ephemeral devices on each instance size - I wonder if jclouds abstracts this?
Once the above is sorted out, I imagine the implementation would include all the non-boot-device volumes for its storage. HadoopConfigurationBuilder would set dfs.data.dir, dfs.name.dir, and mapred.local.dir to use all the volumes. And the volumes would need mounting/symlinking (and possibly formatting in the case of EC2) using scripts like (this code is based on code from the Python scripts):
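{code}
# TODO: make sure that mkfs.xfs is installed
# Usage: prep_disk <mount-dir> <device> [automount]
function prep_disk() {
  local mount=$1
  local device=$2
  local automount=${3:-false}
  # Look the device up in /proc/mounts. Testing the output of
  # "mountpoint -q" does not work: it prints nothing, so the test is
  # always false and the device would always be reformatted.
  local current
  current=$(awk -v dev="$device" '$1 == dev {print $2; exit}' /proc/mounts)
  if [ -n "$current" ]; then
    echo "$device is mounted"
    if [ ! -d "$mount" ]; then
      echo "No mount"
      # Expose the existing mount point under the expected name.
      ln -s "$current" "$mount"
    fi
  else
    echo "warning: ERASING CONTENTS OF $device"
    mkfs.xfs -f "$device"
    if [ ! -e "$mount" ]; then
      mkdir "$mount"
    fi
    mount -o defaults,noatime "$device" "$mount"
    if $automount ; then
      # Remount the volume automatically after a reboot.
      echo "$device $mount xfs defaults,noatime 0 0" >> /etc/fstab
    fi
  fi
}
prep_disk /data1 /dev/sdb true
prep_disk /data2 /dev/sdc true
{code}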
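For the configuration side, here is a rough sketch of how the three properties could be pointed at every prepared volume. The property names are the ones mentioned above; the join_dirs helper, the /data1 and /data2 volumes, and the subdirectory names are hypothetical and chosen only for illustration:

```shell
#!/usr/bin/env bash
# join_dirs <suffix> <volume>...
# Prints a comma-separated list with <suffix> appended to every volume.
function join_dirs() {
  local suffix=$1
  shift
  local result="" volume
  for volume in "$@"; do
    # Add a comma separator only once the list is non-empty.
    result="${result:+$result,}$volume$suffix"
  done
  echo "$result"
}

# Hypothetical volumes and subdirectory layout, for illustration only.
volumes=(/data1 /data2)
echo "dfs.data.dir=$(join_dirs /hadoop/hdfs/data "${volumes[@]}")"
echo "dfs.name.dir=$(join_dirs /hadoop/hdfs/name "${volumes[@]}")"
echo "mapred.local.dir=$(join_dirs /hadoop/mapred/local "${volumes[@]}")"
```

Hadoop accepts a comma-separated list of directories for each of these properties, which is why a single joined string per property is enough.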
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Tibor Kiss (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050596#comment-13050596 ]
Tibor Kiss commented on WHIRR-189:
----------------------------------
+1
I tried this patch and it works on EC2.
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Adrian Cole (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019334#comment-13019334 ]
Adrian Cole commented on WHIRR-189:
-----------------------------------
@tom I think this is a bug we'll have to test in jclouds. Right now, we don't verify that the volumes listed are actually available. Would you mind raising an issue on this? Thanks.
[jira] [Commented] (WHIRR-189) Hadoop on EC2 should use all
available local storage
Posted by "Adrian Cole (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/WHIRR-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079271#comment-13079271 ]
Adrian Cole commented on WHIRR-189:
-----------------------------------
Any way we can get an integration test for this? E.g. verify through some HDFS call that the size is correct?
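One possible shape for such a check, as a sketch only: read the cluster capacity from a dfsadmin report and compare it with the minimum expected for the instance type. Both helper names are hypothetical, and the parsing assumes the report contains a line of the form "Configured Capacity: <bytes> (<human-readable size>)":

```shell
#!/usr/bin/env bash
# configured_capacity: reads a dfsadmin report on stdin and prints the
# byte count from the first "Configured Capacity:" line.
# Assumed line format: "Configured Capacity: <bytes> (<human size>)".
function configured_capacity() {
  awk '/^Configured Capacity:/ { print $3; exit }'
}

# check_capacity <actual-bytes> <minimum-bytes>
# Succeeds only when the reported capacity reaches the expected minimum.
function check_capacity() {
  [ "$1" -ge "$2" ]
}
```

On a running cluster the report could come from "hadoop dfsadmin -report | configured_capacity", with the minimum derived from the number of datanodes and the local storage each instance size is supposed to provide.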