Posted to common-dev@hadoop.apache.org by "stack@archive.org (JIRA)" <ji...@apache.org> on 2007/01/06 03:07:27 UTC

[jira] Created: (HADOOP-862) Add handling of s3 to CopyFile tool

Add handling of s3 to CopyFile tool
-----------------------------------

                 Key: HADOOP-862
                 URL: https://issues.apache.org/jira/browse/HADOOP-862
             Project: Hadoop
          Issue Type: Improvement
          Components: util
    Affects Versions: 0.10.0
            Reporter: stack@archive.org
            Priority: Minor


CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462676 ] 

stack@archive.org commented on HADOOP-862:
------------------------------------------

Attached is a first cut at adding s3 handling to CopyFiles.

Here's the list of changes:

+ Allow hdfs or dfs URI schemes (Used to be dfs only).
+ Changed the usage message so filesystem is generic URI (rather than namenode:port | local).
+ getFileSysName was removed.  Use FileSystem.get with fs URI instead.
+ getMapCount:  Moved duplicated code for figuring number of maps here.
+ toURI: Added. Have (duplicated) tests of URIness go via here instead.
+ CopyFilesReducer: Removed two instances.  Does nothing.
+ Added testing of URIness to members of file-of-source URIs.
+ Minor javadoc and formatting changes.
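
The idea behind the toURI change (one place that tests URIness, applied to command-line arguments and to each member of a file-of-source URIs) can be sketched roughly as below. This is an illustrative Python sketch, not the code in the patch: the names to_uri, check_source_list and ALLOWED_SCHEMES are hypothetical, and the scheme set is an assumption (in particular "file" is a guess at how a local filesystem would be spelled).

```python
from urllib.parse import urlsplit

# Hypothetical set of accepted schemes, for illustration only.
# The schemes the real tool accepts are whatever the patch defines.
ALLOWED_SCHEMES = {"hdfs", "dfs", "s3", "http", "file"}


def to_uri(text):
    """Parse text as a URI, rejecting a missing or unknown scheme."""
    parts = urlsplit(text)
    if not parts.scheme:
        raise ValueError("not a URI (no scheme): " + text)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported scheme: " + parts.scheme)
    return parts


def check_source_list(lines):
    """Run every non-blank line of a file-of-source-URIs through to_uri."""
    return [to_uri(line.strip()) for line in lines if line.strip()]
```

Routing both kinds of input through one helper means a bad entry in the source list fails the same way a bad command-line argument does.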

It's lightly tested.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.0
>            Reporter: stack@archive.org
>            Priority: Minor
>         Attachments: copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.


        

[jira] Updated: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-862:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Michael!

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3-4.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack@archive.org updated HADOOP-862:
-------------------------------------

        Fix Version/s: 0.11.0
    Affects Version/s:     (was: 0.10.0)
                       0.10.1
               Status: Patch Available  (was: Open)

Marking issue with 'patch available'.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469210 ] 

Tom White commented on HADOOP-862:
----------------------------------

I just tried using this patch, and I managed to copy some local files to the S3 file system without trouble.

Looking at the code, I noticed that the -fs option doesn't seem to be used any longer, so it can be dropped. Other than that, it looks fine to me.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.0
>            Reporter: stack@archive.org
>            Priority: Minor
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469853 ] 

Hadoop QA commented on HADOOP-862:
----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12350237/copyfiles-s3-4.diff applied and successfully tested against trunk revision r502694.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3-4.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Updated: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack@archive.org updated HADOOP-862:
-------------------------------------

    Attachment: copyfiles-s3-4.diff

New patch to fix broken unit test.  Removes 'dfs' scheme.  Only 'hdfs' allowed from here on out.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3-4.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



Re: [jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by Nigel Daley <nd...@yahoo-inc.com>.
org.apache.hadoop.mapred.TestMiniMRLocalFS hung the process.  I'm  
restarting now...


On Feb 2, 2007, at 11:38 AM, Doug Cutting (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469841 ]
>
> Doug Cutting commented on HADOOP-862:
> -------------------------------------
>
>> Mr 'Hadoop QA' [ ... ]
>
> Please, call him "Nigel".
>
>> Add handling of s3 to CopyFile tool
>> -----------------------------------
>>
>>                 Key: HADOOP-862
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>>             Project: Hadoop
>>          Issue Type: Improvement
>>          Components: util
>>    Affects Versions: 0.10.1
>>            Reporter: stack@archive.org
>>            Priority: Minor
>>             Fix For: 0.11.0
>>
>>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3-4.diff, copyfiles-s3.diff
>>
>>
>> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>


[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469841 ] 

Doug Cutting commented on HADOOP-862:
-------------------------------------

> Mr 'Hadoop QA' [ ... ]

Please, call him "Nigel".

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3-4.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469685 ] 

Hadoop QA commented on HADOOP-862:
----------------------------------

-1, because 3 attempts failed to build and test the latest attachment (http://issues.apache.org/jira/secure/attachment/12350196/copyfiles-s3-3.diff) against trunk revision r502402. Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469836 ] 

stack@archive.org commented on HADOOP-862:
------------------------------------------

Mr 'Hadoop QA', do I have to do anything special to re-trigger your auto-application and test of version 4 of the patch?  Thanks.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3-4.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Updated: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack@archive.org updated HADOOP-862:
-------------------------------------

    Attachment: copyfiles-s3-3.diff

Fix usage string (suggested by Tom White's review).

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.0
>            Reporter: stack@archive.org
>            Priority: Minor
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463198 ] 

stack@archive.org commented on HADOOP-862:
------------------------------------------

Updated patch.

+ Renamed DFSCopyFilesMapper as FSCopyFilesMapper
+ If no scheme, use 'default' (the value of 'fs.default.name' in hadoop-site.xml).
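
The "no scheme means default filesystem" rule above can be sketched as follows. This is an illustrative Python sketch rather than the patch's code: qualify is a hypothetical name, the default value is passed in as an argument instead of being read from hadoop-site.xml, and it is assumed to be written as a URI. The sketch only handles absolute paths; how relative paths get resolved is not covered here.

```python
from urllib.parse import urlsplit


def qualify(path_or_uri, default_fs):
    """Return the argument unchanged if it has a scheme, else prefix default_fs.

    default_fs stands in for the value of fs.default.name and is assumed
    to be a URI such as "hdfs://namenode:9000" (a made-up example value).
    """
    if urlsplit(path_or_uri).scheme:
        # Already names its filesystem, e.g. s3://... or hdfs://...
        return path_or_uri
    if not path_or_uri.startswith("/"):
        # Assumption of this sketch: only absolute paths are supported.
        raise ValueError("relative paths are not handled: " + path_or_uri)
    return default_fs.rstrip("/") + path_or_uri
```

With hdfs as the default, a bare /user/stack/outputs/segments is read as an hdfs path, while an explicit s3:// URI passes through untouched.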

I ran more extensive tests, copying from hdfs to s3 and back again, and from http into both s3 and hdfs (distcp is a nice tool).  For example, here is the output from copying a small nutch segment from hdfs to s3 (below, hdfs was set as the fs.default.name filesystem):

stack@debord:~/checkouts/hadoop$ ./bin/hadoop fs -lsr outputs/segments
/user/stack/outputs/segments/20070108213341-test        <dir>
/user/stack/outputs/segments/20070108213341-test/crawl_fetch    <dir>
/user/stack/outputs/segments/20070108213341-test/crawl_fetch/part-00000 <dir>
/user/stack/outputs/segments/20070108213341-test/crawl_fetch/part-00000/data    <r 1>   1187
/user/stack/outputs/segments/20070108213341-test/crawl_fetch/part-00000/index   <r 1>   234
/user/stack/outputs/segments/20070108213341-test/crawl_parse    <dir>
/user/stack/outputs/segments/20070108213341-test/crawl_parse/part-00000 <r 1>   9010
/user/stack/outputs/segments/20070108213341-test/parse_data     <dir>
/user/stack/outputs/segments/20070108213341-test/parse_data/part-00000  <dir>
/user/stack/outputs/segments/20070108213341-test/parse_data/part-00000/data     <r 1>   4630
/user/stack/outputs/segments/20070108213341-test/parse_data/part-00000/index    <r 1>   234
/user/stack/outputs/segments/20070108213341-test/parse_text     <dir>
/user/stack/outputs/segments/20070108213341-test/parse_text/part-00000  <dir>
/user/stack/outputs/segments/20070108213341-test/parse_text/part-00000/data     <r 1>   6180
/user/stack/outputs/segments/20070108213341-test/parse_text/part-00000/index    <r 1>   234

Here's the copy to an s3 directory named segments-bkup:

% ./bin/hadoop distcp /user/stack/outputs/segments s3://KEY:SECRET@BUCKET/segments-bkup

Here's the listing of the s3 content:

stack@debord:~/checkouts/hadoop$ ./bin/hadoop fs -fs s3://KEY:SECRET@BUCKET/segments-bkup -lsr /segments-bkup/
/segments-bkup/20070108213341-test      <dir>
/segments-bkup/20070108213341-test/crawl_fetch  <dir>
/segments-bkup/20070108213341-test/crawl_fetch/part-00000       <dir>
/segments-bkup/20070108213341-test/crawl_fetch/part-00000/data  <r 1>   1187
/segments-bkup/20070108213341-test/crawl_fetch/part-00000/index <r 1>   234
/segments-bkup/20070108213341-test/crawl_parse  <dir>
/segments-bkup/20070108213341-test/crawl_parse/part-00000       <r 1>   9010
/segments-bkup/20070108213341-test/parse_data   <dir>
/segments-bkup/20070108213341-test/parse_data/part-00000        <dir>
/segments-bkup/20070108213341-test/parse_data/part-00000/data   <r 1>   4630
/segments-bkup/20070108213341-test/parse_data/part-00000/index  <r 1>   234
/segments-bkup/20070108213341-test/parse_text   <dir>
/segments-bkup/20070108213341-test/parse_text/part-00000        <dir>
/segments-bkup/20070108213341-test/parse_text/part-00000/data   <r 1>   6180
/segments-bkup/20070108213341-test/parse_text/part-00000/index  <r 1>   234
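
For readers unfamiliar with the s3://KEY:SECRET@BUCKET/... form used in the commands above, here is how such a URI breaks into its parts. This is an illustrative Python sketch with a hypothetical helper name (split_s3_uri); it is not how Hadoop itself parses the URI, only a picture of which piece is which.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split s3://KEY:SECRET@BUCKET/path into (key, secret, bucket, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError("not an s3 URI: " + uri)
    # Split the authority by hand so the bucket keeps its original case.
    userinfo, at, bucket = parts.netloc.rpartition("@")
    if not at:
        raise ValueError("no KEY:SECRET@ credentials in URI")
    key, _, secret = userinfo.partition(":")
    return key, secret, bucket, parts.path
```

Applied to the destination above, the path component is /segments-bkup, which is the directory the -lsr listing walks.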

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.0
>            Reporter: stack@archive.org
>            Priority: Minor
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.


        

[jira] Commented: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469681 ] 

stack@archive.org commented on HADOOP-862:
------------------------------------------

Thanks for the review Tom.

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.1
>            Reporter: stack@archive.org
>            Priority: Minor
>             Fix For: 0.11.0
>
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3-3.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.



[jira] Updated: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack@archive.org updated HADOOP-862:
-------------------------------------

    Attachment: copyfiles-s3.diff

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.0
>            Reporter: stack@archive.org
>            Priority: Minor
>         Attachments: copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.


        

[jira] Updated: (HADOOP-862) Add handling of s3 to CopyFile tool

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack@archive.org updated HADOOP-862:
-------------------------------------

    Attachment: copyfiles-s3-2.diff

> Add handling of s3 to CopyFile tool
> -----------------------------------
>
>                 Key: HADOOP-862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-862
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 0.10.0
>            Reporter: stack@archive.org
>            Priority: Minor
>         Attachments: copyfiles-s3-2.diff, copyfiles-s3.diff
>
>
> CopyFile is a useful tool for doing bulk copies.  It doesn't have handling for the recently added s3 filesystem.
