Posted to common-dev@hadoop.apache.org by "Robert Chansler (JIRA)" <ji...@apache.org> on 2008/03/19 23:36:24 UTC

[jira] Created: (HADOOP-3052) distch -- tool to do parallel ch*

distch -- tool to do parallel ch* 
----------------------------------

                 Key: HADOOP-3052
                 URL: https://issues.apache.org/jira/browse/HADOOP-3052
             Project: Hadoop Core
          Issue Type: Task
          Components: dfs
    Affects Versions: 0.16.1
            Reporter: Robert Chansler
            Assignee: Tsz Wo (Nicholas), SZE
             Fix For: 0.16.2


Build a tool to do parallel ch{mod,grp,own} on files.

This would have the advantage over the shell -R commands that name node syncs from multiple clients are effectively batched.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Rob Weltman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646610#action_12646610 ] 

Rob Weltman commented on HADOOP-3052:
-------------------------------------

Alternate/better approach in HADOOP-3194




[jira] Updated: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye updated HADOOP-3052:
-------------------------------------


There doesn't appear to be much point in releasing and supporting this artifact. A patch can be made available as a one-off for those who need it.



[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583134#action_12583134 ] 

Raghu Angadi commented on HADOOP-3052:
--------------------------------------

> is this really something that we need to optimize?
The main use case considered is when a big cluster is upgraded to 0.16.

I hope this is more of a one-time utility than a real supported tool like distcp.




[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580649#action_12580649 ] 

Doug Cutting commented on HADOOP-3052:
--------------------------------------

Wouldn't this be a DDoS attack on the namenode?



[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583130#action_12583130 ] 

Doug Cutting commented on HADOOP-3052:
--------------------------------------

My concern is two-part: (1) Is this really something that we need to optimize?  Is single-threaded 'chmod -R' so slow that applications are spending a significant amount of their time in it?  And (2) is it perhaps a feature that someone who runs 'chmod -R' isn't able to overwhelm the namenode?  The namenode is often shared between multiple mapreduce clusters (e.g. under HOD), but a single mapreduce cluster running a distributed 'chmod -R' could overwhelm the namenode and prevent other applications from making progress.



[jira] Updated: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3052:
-------------------------------------------

    Attachment: 3052_20080411.patch

3052_20080411.patch: my testing program.  Someone may find it useful.



[jira] Resolved: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye resolved HADOOP-3052.
--------------------------------------

    Resolution: Won't Fix



[jira] Updated: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3052:
----------------------------------

    Fix Version/s:     (was: 0.16.2)
                   0.17.0

This isn't a bug fix, so moving to 0.17.



[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583110#action_12583110 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3052:
------------------------------------------------

> Wouldn't this be a DDoS attack on the namenode?

I have written a distch map/reduce program for testing.  You are right that it makes the NameNode very busy when the number of files/dirs is huge.  The question is: how should we prevent DDoS?  Any user could simply write a program to launch a huge number of accesses.



[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583116#action_12583116 ] 

Raghu Angadi commented on HADOOP-3052:
--------------------------------------

> This would have the advantage over the shell -R commands in that name node syncs from multiple clients are effectively batched.
For this advantage, does the program need to be map/reduce? The client just needs to run multiple threads. As long as the number of client threads is comparable to the number of handlers in the NameNode, it goes as fast as it can.
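
A minimal sketch of the multi-threaded client idea above, written in Python rather than against the Hadoop API. `apply_change` is a hypothetical stand-in for whatever call issues one chmod/chown/chgrp request, and `num_threads` is an assumed tuning knob that would be set close to the NameNode handler count; neither name comes from this issue or from the attached patch.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_ch(paths, apply_change, num_threads=10):
    """Apply apply_change(path) to every path using a fixed pool of threads.

    apply_change is a hypothetical callable standing in for a single
    chmod/chown/chgrp request; num_threads is an assumed tuning knob.
    Returns a list of (path, error) pairs for the requests that failed.
    """
    failures = []

    def run_one(path):
        try:
            apply_change(path)
            return None
        except Exception as exc:  # record the failure and keep going
            return (path, exc)

    # At most num_threads requests are in flight at any time.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for result in pool.map(run_one, paths):
            if result is not None:
                failures.append(result)
    return failures
```

Because the pool size bounds the number of concurrent requests, a single client of this shape cannot put more than `num_threads` requests in flight against the NameNode at once.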



[jira] Commented: (HADOOP-3052) distch -- tool to do parallel ch*

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583150#action_12583150 ] 

Allen Wittenauer commented on HADOOP-3052:
------------------------------------------

We estimated that it would take over a week with single-threaded chowns to set permissions on one of our bigger clusters.

Using the test distch code, we're seeing timings like 1 hour 9 minutes, 33 seconds for 198382 files using 100 nodes.
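
For reference, the arithmetic behind that timing can be worked out as below. The inputs are the figures quoted above; the helper name `throughput` is only illustrative, and the per-node figure assumes the work was spread evenly over the 100 nodes.

```python
def throughput(files, hours, minutes, seconds, nodes):
    """Return (elapsed_seconds, files_per_second, files_per_second_per_node)."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    rate = files / elapsed
    return elapsed, rate, rate / nodes


elapsed, rate, per_node = throughput(198382, 1, 9, 33, 100)
print(f"{elapsed} s elapsed, {rate:.1f} files/s, {per_node:.2f} files/s per node")
# prints: 4173 s elapsed, 47.5 files/s, 0.48 files/s per node
```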
