
Posted to common-dev@hadoop.apache.org by Raghu Angadi <ra...@yahoo-inc.com> on 2006/11/30 01:39:41 UTC

minor change in dataNode handling of multiple directories.

As part of the "Version upgrade" related changes, I am thinking of strictly
requiring that the datanode be able to lock _all_ the configured directories
instead of any one of them.

Currently, if multiple data directories are specified for a datanode, it
tries to lock a file in each of the directories. If it fails to lock
some of the directories, it will use the directories that it could lock.
It looks like this flexibility was included mainly for convenience in the
config file.
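
To make the two behaviors concrete, here is a minimal Java sketch. It is not the actual datanode code: the class name DataDirLocker, the lock file name in_use.lock, and the requireAll flag are all hypothetical, chosen only for illustration. With requireAll=false it keeps whatever directories it could lock (the current behavior as described above); with requireAll=true it fails unless every directory was locked (the proposed behavior). A missing directory still surfaces as an IOException in both modes.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the real datanode implementation. */
final class DataDirLocker {
    /** Hypothetical lock file name, assumed only for this sketch. */
    static final String LOCK_FILE = "in_use.lock";

    /**
     * Tries to lock a file in every configured directory.
     * requireAll=false: return the directories that could be locked.
     * requireAll=true: throw unless all directories were locked.
     */
    static Map<File, FileLock> lockDirs(List<File> dirs, boolean requireAll)
            throws IOException {
        Map<File, FileLock> held = new LinkedHashMap<>();
        List<File> failed = new ArrayList<>();
        for (File dir : dirs) {
            FileLock lock = tryLock(dir);
            if (lock != null) {
                held.put(dir, lock);
            } else {
                failed.add(dir);
            }
        }
        if (held.isEmpty() || (requireAll && !failed.isEmpty())) {
            release(held); // do not keep a partial set of locks on failure
            throw new IOException("Could not lock directories: " + failed);
        }
        return held;
    }

    /** Returns the lock, or null if someone else already holds it. */
    static FileLock tryLock(File dir) throws IOException {
        // Throws an IOException subclass if the directory does not exist.
        RandomAccessFile raf = new RandomAccessFile(new File(dir, LOCK_FILE), "rw");
        try {
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                raf.close(); // held by another process
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            raf.close(); // already locked inside this JVM
            return null;
        } catch (IOException e) {
            raf.close();
            throw e;
        }
    }

    /** Closing a lock's channel releases the lock. */
    static void release(Map<File, FileLock> held) throws IOException {
        for (FileLock lock : held.values()) {
            lock.channel().close();
        }
    }
}
```

A caller would pass the configured list once at startup and keep the returned locks open for the life of the process.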

This might not affect anyone, but let us know your opinions.

Note that all directories have the same storage id, so each individual
directory is not complete by itself but is one part of a single storage.

Raghu.

Re: minor change in dataNode handling of multiple directories.

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

We want to be able to support fail-in-place too, i.e. a machine should
remain usable with one dead drive. It sounds like this is a step in the
wrong direction.

Perhaps we should just allow a node to upgrade new directories that
appear later? We need to be sure snapshotting works as expected in this
case too...

I think it is worth solving this more complicated problem.

Upgrades should not be possible unless enough of the FS is reachable
to leave safemode, IMO. This means we'll need to be able to test for
this before we upgrade. Fun!
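
As a rough illustration of what such a pre-upgrade test could look like, here is a small Java sketch. It assumes the safemode exit condition can be reduced to "a given fraction of blocks is reachable"; the class UpgradePrecheck, the method enoughReachable, and the threshold parameter are all hypothetical and do not correspond to any existing Hadoop API.

```java
/** Hypothetical pre-upgrade check; not an existing Hadoop API. */
final class UpgradePrecheck {
    /**
     * Returns true if the reachable fraction of blocks meets the threshold.
     * The threshold is an assumed, caller-supplied fraction between 0 and 1.
     */
    static boolean enoughReachable(long reachableBlocks, long totalBlocks,
                                   double threshold) {
        if (totalBlocks == 0) {
            return true; // an empty file system has nothing to lose
        }
        return (double) reachableBlocks / totalBlocks >= threshold;
    }
}
```

The upgrade would then be refused whenever this check returns false, so that a node with missing directories cannot start an upgrade on an incomplete file system.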

On Nov 29, 2006, at 6:03 PM, Bryan A. P. Pendleton wrote:

> I would prefer this proposal not be implemented. The current way things
> work makes it possible to configure, centrally, a list of all directories
> that _could_ be used for storage. Since there's no easy way to do per-node
> configurations (nor would it be desirable, IMO, in this case), the
> directories config ends up being the list of all possibly usable
> directories. Many of my cluster nodes are configured using "rocksclusters":
> they will have a uniform set of mounts created, one for each physical
> drive, at boot/re-install. If I specify in my config the list of all
> directories up to the largest number of drives a machine will ever have,
> then I get easy drop-in use, regardless of variations in nodes in the
> cluster. I have been relying on the current behavior to keep me sane.
>
> OTOH, I wouldn't oppose making this the default behavior, with a
> configuration param that would set things back to the old behavior.

Re: minor change in dataNode handling of multiple directories.

Posted by Raghu Angadi <ra...@yahoo-inc.com>.

Does anyone have a config where some data directories don't exist at
all? The current datanode does not work in that case: it throws an
IOException. The current code only tolerates a directory that exists but
could not be locked. Yes, we could decide not to throw the exception if
the directory does not exist.
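
If we did decide to tolerate missing directories, one way would be to filter the configured list before any locking is attempted. The Java sketch below is only an illustration of that idea; the class DataDirFilter and the method existingDirs are hypothetical names, not part of the current code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the datanode code. */
final class DataDirFilter {
    /** Keeps only the configured entries that exist and are directories. */
    static List<File> existingDirs(List<File> configured) {
        List<File> present = new ArrayList<>();
        for (File dir : configured) {
            if (dir.isDirectory()) {
                present.add(dir);
            }
        }
        return present;
    }
}
```

With this, a central config could list every directory a node might ever have, and nodes with fewer drives would simply skip the missing entries.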

For now I am just going to keep the same behavior as before.

Raghu.

> Konstantin Shvachko wrote:
>> Good point.
>> I think we should document it (Javadoc?) making it a feature rather 
>> than a side effect.

Re: minor change in dataNode handling of multiple directories.

Posted by Raghu Angadi <ra...@yahoo-inc.com>.

Yes, we will retain the existing behavior.

Raghu.

Konstantin Shvachko wrote:
> Good point.
> I think we should document it (Javadoc?) making it a feature rather than 
> a side effect.

Re: minor change in dataNode handling of multiple directories.

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.

Good point.
I think we should document it (Javadoc?) making it a feature rather than 
a side effect.

Bryan A. P. Pendleton wrote:

> I would prefer this proposal not be implemented. The current way things
> work makes it possible to configure, centrally, a list of all directories
> that _could_ be used for storage. Since there's no easy way to do per-node
> configurations (nor would it be desirable, IMO, in this case), the
> directories config ends up being the list of all possibly usable
> directories. Many of my cluster nodes are configured using "rocksclusters":
> they will have a uniform set of mounts created, one for each physical
> drive, at boot/re-install. If I specify in my config the list of all
> directories up to the largest number of drives a machine will ever have,
> then I get easy drop-in use, regardless of variations in nodes in the
> cluster. I have been relying on the current behavior to keep me sane.
>
> OTOH, I wouldn't oppose making this the default behavior, with a
> configuration param that would set things back to the old behavior.

Re: minor change in dataNode handling of multiple directories.

Posted by "Bryan A. P. Pendleton" <bp...@geekdom.net>.

I would prefer this proposal not be implemented. The current way things work
makes it possible to configure, centrally, a list of all directories that
_could_ be used for storage. Since there's no easy way to do per-node
configurations (nor would it be desirable, IMO, in this case), the
directories config ends up being the list of all possibly usable
directories. Many of my cluster nodes are configured using "rocksclusters":
they will have a uniform set of mounts created, one for each physical drive,
at boot/re-install. If I specify in my config the list of all directories up
to the largest number of drives a machine will ever have, then I get easy
drop-in use, regardless of variations in nodes in the cluster. I have been
relying on the current behavior to keep me sane.

OTOH, I wouldn't oppose making this the default behavior, with a
configuration param that would set things back to the old behavior.

On 11/29/06, Raghu Angadi <ra...@yahoo-inc.com> wrote:
>
>
> As part of the "Version upgrade" related changes, I am thinking of strictly
> requiring that the datanode be able to lock _all_ the configured directories
> instead of any one of them.
>
> Currently, if multiple data directories are specified for a datanode, it
> tries to lock a file in each of the directories. If it fails to lock
> some of the directories, it will use the directories that it could lock.
> It looks like this flexibility was included mainly for convenience in the
> config file.
>
> This might not affect anyone, but let us know your opinions.
>
> Note that all directories have the same storage id, so each individual
> directory is not complete by itself but is one part of a single storage.
>
> Raghu.
>

-- 
Bryan A. P. Pendleton
Ph: (877) geek-1-bp