Posted to common-user@hadoop.apache.org by "Brian C. Huffman" <bh...@etinternational.com> on 2014/11/12 17:36:01 UTC

Datanode disk configuration

All,

I'm setting up a 4-node Hadoop 2.5.1 cluster.  Each node has the 
following drives:
1 - 500GB drive (OS disk)
1 - 500GB drive
1 - 2 TB drive
1 - 3 TB drive.

In past experience I've had lots of issues with non-uniform drive sizes 
for HDFS, but unfortunately it wasn't an option to get all 3TB or 2TB 
drives for this cluster.

My thought is to set up the 2TB and 3TB drives as HDFS and the 500GB 
drive as intermediate data.  Most of our jobs don't make large use of 
intermediate data, but at least this way, I get a good amount of space 
(2TB) per node before I run into issues.  Then I may end up using the 
AvailableSpaceVolumeChoosingPolicy to help with balancing the blocks.
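
(For reference, a minimal hdfs-site.xml sketch of that layout; the mount
points are hypothetical names and the threshold/fraction values are only
illustrative:)

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk2tb/hdfs/data,/data/disk3tb/hdfs/data</value>
</property>
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <!-- volumes whose free space differs by less than this many bytes count as balanced -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <!-- fraction of new block writes directed to the volumes with more free space -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>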

If necessary I could put intermediate data on one of the OS partitions 
(/home).  But this doesn't seem ideal.
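
(A hedged aside: if the jobs run as MapReduce on YARN, the intermediate and
shuffle data lives under the NodeManager local dirs, so pointing those at the
spare 500GB disk instead of /home would look roughly like this in
yarn-site.xml; the paths are hypothetical mount points:)

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/disk500gb/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data/disk500gb/yarn/logs</value>
</property>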

Anybody have any recommendations regarding the optimal use of storage in 
this scenario?

Thanks,
Brian

Re: Datanode disk configuration

Posted by daemeon reiydelle <da...@gmail.com>.
I would consider a JBOD layout with a 16-64 MB stride. This would be the choice where
one or more steps (e.g. MR) will be I/O bound. Otherwise one or more tasks
will be hit with the low read/write throughput of having large amounts of data
behind a single spindle.
On Nov 12, 2014 8:37 AM, "Brian C. Huffman" <bh...@etinternational.com>
wrote:

>  All,
>
> I'm setting up a 4-node Hadoop 2.5.1 cluster.  Each node has the following
> drives:
> 1 - 500GB drive (OS disk)
> 1 - 500GB drive
> 1 - 2 TB drive
> 1 - 3 TB drive.
>
> In past experience I've had lots of issues with non-uniform drive sizes
> for HDFS, but unfortunately it wasn't an option to get all 3TB or 2TB
> drives for this cluster.
>
> My thought is to set up the 2TB and 3TB drives as HDFS and the 500GB drive
> as intermediate data.  Most of our jobs don't make large use of
> intermediate data, but at least this way, I get a good amount of space
> (2TB) per node before I run into issues.  Then I may end up using the AvailableSpaceVolumeChoosingPolicy
> to help with balancing the blocks.
>
> If necessary I could put intermediate data on one of the OS partitions
> (/home).  But this doesn't seem ideal.
>
> Anybody have any recommendations regarding the optimal use of storage in
> this scenario?
>
> Thanks,
> Brian
>

Re: Datanode disk configuration

Posted by daemeon reiydelle <da...@gmail.com>.
Yes. That is why you should consider striping the disks with RAID 0 (JBOD).

“The race is not to the swift, nor the battle to the strong, but to
those who can see it coming and jump aside.” - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872
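
(To make the difference concrete, the two options mostly differ in how many
entries dfs.datanode.data.dir gets; a sketch, with hypothetical mount names:)

<!-- JBOD: one data dir per physical disk; HDFS spreads blocks across them -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk2tb/hdfs/data,/data/disk3tb/hdfs/data</value>
</property>

<!-- RAID 0 across the disks (e.g. an md device): one striped volume, one data dir -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/md0/hdfs/data</value>
</property>

With a single striped volume the per-volume balancing question goes away, but
losing any one member disk takes the whole volume offline.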

On Wed, Nov 12, 2014 at 9:09 AM, Brian C. Huffman <
bhuffman@etinternational.com> wrote:

>  That will make the volume balancing easy, but couldn't it hurt
> performance?  My understanding is that there would be three write threads
> pointing to the 3TB disk and 2 threads pointing to the 2TB disk.
>
> Would it be better from a performance perspective to include the 500GB
> drive in the configuration and just use the
> AvailableSpaceVolumeChoosingPolicy from the beginning?
>
> Thanks,
> Brian
>
> On 11/12/2014 11:47 AM, Leonid Fedotov wrote:
>
> Create 1 Tb partitions for 2 and 3 TB drives and you will have 5 mount
> points same size.
>
>
> Thank you!
>
>
> Sincerely,
>
> Leonid Fedotov
>
> Systems Architect - Professional Services
>
> lfedotov@hortonworks.com
>
> office: +1 855 846 7866 ext 292
>
> mobile: +1 650 430 1673
>
> On Wed, Nov 12, 2014 at 8:36 AM, Brian C. Huffman <
> bhuffman@etinternational.com> wrote:
>
>>  All,
>>
>> I'm setting up a 4-node Hadoop 2.5.1 cluster.  Each node has the
>> following drives:
>> 1 - 500GB drive (OS disk)
>> 1 - 500GB drive
>> 1 - 2 TB drive
>> 1 - 3 TB drive.
>>
>> In past experience I've had lots of issues with non-uniform drive sizes
>> for HDFS, but unfortunately it wasn't an option to get all 3TB or 2TB
>> drives for this cluster.
>>
>> My thought is to set up the 2TB and 3TB drives as HDFS and the 500GB
>> drive as intermediate data.  Most of our jobs don't make large use of
>> intermediate data, but at least this way, I get a good amount of space
>> (2TB) per node before I run into issues.  Then I may end up using the AvailableSpaceVolumeChoosingPolicy
>> to help with balancing the blocks.
>>
>> If necessary I could put intermediate data on one of the OS partitions
>> (/home).  But this doesn't seem ideal.
>>
>> Anybody have any recommendations regarding the optimal use of storage in
>> this scenario?
>>
>> Thanks,
>> Brian
>>
>
>

Re: Datanode disk configuration

Posted by "Brian C. Huffman" <bh...@etinternational.com>.
That will make the volume balancing easy, but couldn't it hurt 
performance?  My understanding is that there would be three write 
threads pointing to the 3TB disk and two pointing to the 2TB disk.

Would it be better from a performance perspective to include the 500GB 
drive in the configuration and just use the 
AvailableSpaceVolumeChoosingPolicy from the beginning?

Thanks,
Brian

On 11/12/2014 11:47 AM, Leonid Fedotov wrote:
> Create 1 Tb partitions for 2 and 3 TB drives and you will have 5 mount 
> points same size.
>
>
> Thank you!
>
>
> Sincerely,
>
> Leonid Fedotov
>
> Systems Architect - Professional Services
>
> lfedotov@hortonworks.com
>
> office: +1 855 846 7866 ext 292
>
> mobile: +1 650 430 1673
>
>
> On Wed, Nov 12, 2014 at 8:36 AM, Brian C. Huffman 
> <bhuffman@etinternational.com> 
> wrote:
>
>     All,
>
>     I'm setting up a 4-node Hadoop 2.5.1 cluster.  Each node has the
>     following drives:
>     1 - 500GB drive (OS disk)
>     1 - 500GB drive
>     1 - 2 TB drive
>     1 - 3 TB drive.
>
>     In past experience I've had lots of issues with non-uniform drive
>     sizes for HDFS, but unfortunately it wasn't an option to get all
>     3TB or 2TB drives for this cluster.
>
>     My thought is to set up the 2TB and 3TB drives as HDFS and the
>     500GB drive as intermediate data.  Most of our jobs don't make
>     large use of intermediate data, but at least this way, I get a
>     good amount of space (2TB) per node before I run into issues. 
>     Then I may end up using the AvailableSpaceVolumeChoosingPolicy to
>     help with balancing the blocks.
>
>     If necessary I could put intermediate data on one of the OS
>     partitions (/home).  But this doesn't seem ideal.
>
>     Anybody have any recommendations regarding the optimal use of
>     storage in this scenario?
>
>     Thanks,
>     Brian
>
>
>


Re: Datanode disk configuration

Posted by Leonid Fedotov <lf...@hortonworks.com>.
Create 1 TB partitions on the 2 and 3 TB drives and you will have 5 mount
points of the same size.
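
(As a sketch, that layout would appear in hdfs-site.xml as five equally sized
data dirs; the mount points below are hypothetical:)

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/d1/hdfs/data,/data/d2/hdfs/data,/data/d3/hdfs/data,/data/d4/hdfs/data,/data/d5/hdfs/data</value>
</property>

With equal-sized volumes, the default round-robin volume choosing policy keeps
them filling at roughly the same rate.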


Thank you!


Sincerely,

Leonid Fedotov

Systems Architect - Professional Services

lfedotov@hortonworks.com

office: +1 855 846 7866 ext 292

mobile: +1 650 430 1673

On Wed, Nov 12, 2014 at 8:36 AM, Brian C. Huffman <
bhuffman@etinternational.com> wrote:

>  All,
>
> I'm setting up a 4-node Hadoop 2.5.1 cluster.  Each node has the following
> drives:
> 1 - 500GB drive (OS disk)
> 1 - 500GB drive
> 1 - 2 TB drive
> 1 - 3 TB drive.
>
> In past experience I've had lots of issues with non-uniform drive sizes
> for HDFS, but unfortunately it wasn't an option to get all 3TB or 2TB
> drives for this cluster.
>
> My thought is to set up the 2TB and 3TB drives as HDFS and the 500GB drive
> as intermediate data.  Most of our jobs don't make large use of
> intermediate data, but at least this way, I get a good amount of space
> (2TB) per node before I run into issues.  Then I may end up using the AvailableSpaceVolumeChoosingPolicy
> to help with balancing the blocks.
>
> If necessary I could put intermediate data on one of the OS partitions
> (/home).  But this doesn't seem ideal.
>
> Anybody have any recommendations regarding the optimal use of storage in
> this scenario?
>
> Thanks,
> Brian
>
