Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/03/29 17:26:24 UTC

[Hadoop Wiki] Update of "DiskSetup" by SteveLoughran

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "DiskSetup" page has been changed by SteveLoughran.
The comment on this change is: stuff on temp directories and logging.
http://wiki.apache.org/hadoop/DiskSetup?action=diff&rev1=5&rev2=6

--------------------------------------------------

  
  == Configuring Hadoop ==
  
- Pass a list of disks to the dfs.data.dir parameter, Hadoop will use all of the disk that are available.
+ Pass a list of disks to the `dfs.data.dir` parameter and Hadoop will use all of the disks that are available. When one goes offline it is taken out of consideration; Hadoop does not check for the disk coming back, and assumes it is "gone".
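+ 
+ For illustration, a minimal sketch of the relevant `hdfs-site.xml` entry, assuming three data disks mounted at hypothetical `/data/1` to `/data/3` mount points; substitute your own:
+ 
+ {{{
+ <property>
+   <name>dfs.data.dir</name>
+   <!-- comma-separated list: one directory per physical disk -->
+   <value>/data/1/dfs/data,/data/2/dfs/data,/data/3/dfs/data</value>
+ </property>
+ }}}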
  
+ === Logging ===
+ 
+  * Don't log to the root directory, as having a machine that does not boot because the logs are overflowing can be inconvenient.
+  * Have a plan to clean up log output; otherwise jobs that log too much to the console will fill up the log directories.
+  * Get your developers to use the commons-logging APIs in their MapReduce code, so that you can turn logging up or down without recompiling the code; they can run in debug mode on their test machines while you run at WARN level in production (see the sketch after this list).
+  * Some JVMs (JRockit) seem to log more. Tune your Log4j settings for your JVM, and only capture the stuff you really want. 
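+ 
+ As a minimal sketch (the class name and method are hypothetical), MapReduce code using commons-logging so that the level can be changed in `log4j.properties` without a recompile:
+ 
+ {{{
+ import org.apache.commons.logging.Log;
+ import org.apache.commons.logging.LogFactory;
+ 
+ public class RecordMapper {
+   // One logger per class; its level is set in log4j.properties, not in code.
+   private static final Log LOG = LogFactory.getLog(RecordMapper.class);
+ 
+   public void process(String record) {
+     // Guard debug messages so production runs at WARN pay nothing for them.
+     if (LOG.isDebugEnabled()) {
+       LOG.debug("processing record: " + record);
+     }
+     if (record == null || record.length() == 0) {
+       LOG.warn("empty record encountered; skipping");
+       return;
+     }
+     // ... actual map logic goes here ...
+   }
+ }
+ }}}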
+ 
+ === Do not keep stuff under /tmp ===
+ 
+ Hadoop defaults to keeping things under `/tmp` so that you can play with Hadoop without filling up your disk. This is dangerous in a production cluster, as any automated cleanup cron job (and you will need one) will eventually delete stuff in `/tmp`, at which point your Hadoop cluster is in trouble.
+ 
+  * Plan the disk layout and configure Hadoop to store its data in stable locations, preferably off the root disk; see the sketch below.
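+ 
+ For illustration, a hypothetical `core-site.xml` entry moving the working area off `/tmp` (the `/grid/0` path is made up; pick a directory on one of your data disks, and check the other `*.dir` properties derived from it):
+ 
+ {{{
+ <property>
+   <name>hadoop.tmp.dir</name>
+   <!-- the default is under /tmp; point it somewhere a cron job will not clean -->
+   <value>/grid/0/hadoop/tmp</value>
+ </property>
+ }}}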
+  
  == Underlying File System Options ==
  
- If mount the disks as noatime, then the file access times aren't written back; this speeds up reads. There is also relatime, which stores some access time information, but is not as slow as the classic atime attribute. Remember that any access time information kept by Hadoop is independent of the atime attribute of individual blocks, so Hadoop does not care what your settings are here. If you are mounting disks purely for hadoop, use noatime.
+ If you mount the disks as `noatime`, the file access times aren't written back; this speeds up reads. There is also `relatime`, which stores some access time information, but is not as slow as the classic `atime` attribute. Remember that any access time information kept by Hadoop is independent of the `atime` attribute of individual blocks, so Hadoop does not care what your settings are here. If you are mounting disks purely for Hadoop, use `noatime`.
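+ 
+ For example, a hypothetical `/etc/fstab` entry for a dedicated data disk (the device and mount point are illustrative):
+ 
+ {{{
+ # device     mount point  fs    options           dump  pass
+ /dev/sdb1    /data/1      ext3  defaults,noatime  0     2
+ }}}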
  
- Formatting and tuning options are important. Using tunefs to set the reserve to zero percent can save you over 25 GigaBytes on a 1 TeraByte disk. Also the underlying file system is going to have many large files, you can get more space by lowering the number of inodes at format time.
+ Formatting and tuning options are important. Using `tune2fs` (`tunefs` on BSD) to set the reserved-block percentage to zero can save you over 25 gigabytes on a 1 terabyte disk. Also, as the underlying file system is going to hold many large files, you can get more space by lowering the number of inodes at format time.
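+ 
+ A hypothetical example for an ext3 data disk (`/dev/sdb1` is illustrative; check the defaults shipped with your distribution before copying the numbers):
+ 
+ {{{
+ # format with fewer inodes: one inode per 128KB of data instead of the usual 16KB
+ mkfs.ext3 -i 131072 /dev/sdb1
+ # reclaim the ~5% of blocks normally reserved for root
+ tune2fs -m 0 /dev/sdb1
+ }}}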
+ 
  === Ext3 ===
  
- Yahoo! has publicly stated they use ext3. Regardless of the merits of the filesystem, that means that HDFS-on-ext3 has been publicly tested at a bigger scale than any other underlying filesystem.
+ Yahoo! has publicly stated they use ext3. Regardless of the merits of the filesystem, that means that HDFS-on-ext3 has been publicly tested at a bigger scale than any other underlying filesystem that we know of.
  
  
  === XFS ===