Posted to common-user@hadoop.apache.org by "Marcel Mitsuto F. S." <ma...@dizorg.net> on 2013/04/02 22:35:00 UTC

hadoop datanode kernel build and HDFS multiplier factor

Hi hadoopers,

I just got my hands on ten servers (hp 2950 iii) that were replaced by a
newer set of servers, and these will be our production grid servers.

This grid will compute exographic metrics from webserver access logs, such
as geolocation, ISP, and other metrics related to our portal's audience, to
support our operations and content delivery teams with metrics complementary
to what Google Analytics and Omniture already provide. The daily log
rotation should be around 400GB of uncompressed Apache CustomLog. We won't
hold the raw data in HDFS, as that would raise hardware requirements to a
level we can't yet commit to; we're going to MapReduce these raw logs into
meaningful metrics.

They all have 6 slots for 15K SAS HDDs, and I've already asked the hardware
team to install the CentOS distribution on a RAID1 pair of 73GB disks. The
remaining 4 slots will be filled with 300GB 15K SAS HDDs, which I want
handled by Hadoop, ending up with 1.2TB per node, or 8 x 1.2TB = 9.6TB of
total DataNode storage: 2 servers for the NN, SNN, and JobTracker, and 8
DN/TT servers.

Now come the questions:

#1: I've been following the list, and there have been some questions about
building a custom kernel for this hardware with different I/O scheduler
approaches. I have yet to customize a kernel to replace our stock CentOS 6
kernel with alternative I/O schedulers, if that looks likely to improve
performance and maximize throughput. Should I do it?

#2: With 400GB of raw input data, 9.6TB of total HDFS storage, and daily or
maybe hourly batch jobs, what should be the optimal multiplier (replication
factor) for redundant copies of HDFS blocks? Does the answer to #1 affect
the value I'd configure in #2 to get optimal HDFS usage and meet the
processing time requirements for our batch jobs?

Thank you for your attention and time!

Best regards,
Marcel

Re: hadoop datanode kernel build and HDFS multiplier factor

Posted by Yanbo Liang <ya...@gmail.com>.
I have done similar experiments tuning Hadoop performance. Many factors
influence performance, such as the Hadoop configuration, the JVM, and the
OS.

For Linux-kernel-related factors, we found two main points of attention:
1. Every file system read operation triggers a disk write to maintain the
last access time, so disabling this bookkeeping (mounting the file system
with the noatime attribute) improves performance; see the first sketch
below.
2. We also experimented with different Linux kernel I/O schedulers to
measure their influence on Hadoop performance. For us, the CFQ scheduler
performed better than the deadline scheduler; see the second sketch below.
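
For the noatime change, a minimal sketch (the device name and mount point
are hypothetical; adjust them to your layout) is to add the option in
/etc/fstab and remount, so the file system stops issuing an access-time
write on every read:

    # /etc/fstab entry for a DataNode data disk, with atime updates disabled
    /dev/sdb1  /data/1  ext4  defaults,noatime  0 0

    # apply to an already-mounted file system without rebooting
    mount -o remount,noatime /data/1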
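
For the schedulers, note that you likely don't need a custom kernel build at
all: the stock CentOS 6 kernel already includes cfq, deadline, and noop, and
the scheduler can be switched per device at runtime through sysfs (sdb below
is a hypothetical data disk):

    # list available schedulers; the active one is shown in brackets
    cat /sys/block/sdb/queue/scheduler

    # switch this disk's scheduler at runtime
    echo cfq > /sys/block/sdb/queue/scheduler

To make the choice permanent for all devices, add elevator=cfq to the kernel
line in the boot loader configuration.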

However, these experiments are closely tied to the hardware and operating
system environment, so our results are just for your reference.
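
On question #2, the replication factor itself is plain HDFS configuration
rather than a kernel matter: it is set cluster-wide by dfs.replication in
hdfs-site.xml (the default is 3) and can be changed afterwards per path. A
sketch, assuming the standard Hadoop 1.x configuration layout:

    <!-- hdfs-site.xml: number of copies kept of each HDFS block -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # change the replication factor of data already stored in HDFS
    hadoop fs -setrep -R 2 /path/in/hdfs

As a capacity check: with 9.6TB of raw DataNode storage and a factor of 3,
you get roughly 3.2TB of usable HDFS space before job output and overhead.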


2013/4/3 Marcel Mitsuto F. S. <ma...@dizorg.net>

> Hi hadoopers,
>
> I just got my hands on ten servers (hp 2950 iii) that were replaced by a
> newer set of servers, and these will be our production grid servers.
>
> This grid will compute exographic metrics from webserver access logs,
> such as geolocation, ISP, and other metrics related to our portal's
> audience, to support our operations and content delivery teams with
> metrics complementary to what Google Analytics and Omniture already
> provide. The daily log rotation should be around 400GB of uncompressed
> Apache CustomLog. We won't hold the raw data in HDFS, as that would raise
> hardware requirements to a level we can't yet commit to; we're going to
> MapReduce these raw logs into meaningful metrics.
>
> They all have 6 slots for 15K SAS HDDs, and I've already asked the
> hardware team to install the CentOS distribution on a RAID1 pair of 73GB
> disks. The remaining 4 slots will be filled with 300GB 15K SAS HDDs,
> which I want handled by Hadoop, ending up with 1.2TB per node, or 8 x
> 1.2TB = 9.6TB of total DataNode storage: 2 servers for the NN, SNN, and
> JobTracker, and 8 DN/TT servers.
>
> Now come the questions:
>
> #1: I've been following the list, and there have been some questions
> about building a custom kernel for this hardware with different I/O
> scheduler approaches. I have yet to customize a kernel to replace our
> stock CentOS 6 kernel with alternative I/O schedulers, if that looks
> likely to improve performance and maximize throughput. Should I do it?
>
> #2: With 400GB of raw input data, 9.6TB of total HDFS storage, and daily
> or maybe hourly batch jobs, what should be the optimal multiplier
> (replication factor) for redundant copies of HDFS blocks? Does the answer
> to #1 affect the value I'd configure in #2 to get optimal HDFS usage and
> meet the processing time requirements for our batch jobs?
>
> Thank you for your attention and time!
>
> Best regards,
> Marcel
>
