Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2007/09/06 09:06:38 UTC

[jira] Resolved: (HADOOP-495) distcp deficiencies and bugs

     [ https://issues.apache.org/jira/browse/HADOOP-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HADOOP-495.
----------------------------------

       Resolution: Duplicate
    Fix Version/s: 0.15.0
         Assignee: Chris Douglas  (was: Arun C Murthy)

Fixed by HADOOP-1569.

> distcp deficiencies and bugs
> ----------------------------
>
>                 Key: HADOOP-495
>                 URL: https://issues.apache.org/jira/browse/HADOOP-495
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.5.0
>            Reporter: Sameer Paranjpye
>            Assignee: Chris Douglas
>             Fix For: 0.15.0
>
>
> distcp as currently implemented has several deficiencies and bugs, which I encountered when trying to use it to import logs from HTTP servers into my local DFS cluster. In general, it is user-unfriendly and does not report errors comprehensibly.
> Here's a list of things that can be improved:
> 1) There isn't a man page that explains the various command line options. We should have one.
> 2) Malformed URLs cause a NullPointerException to be thrown, with no error message stating what went wrong.
> 3) Relative paths for the local filesystem are not handled at all
> 4) The scheme used for HDFS URLs is dfs://; it ought to be hdfs://, since 'dfs' is far too general an acronym to use in URLs.
> 5) If a copy to the local filesystem is specified with a relative path, for instance
>     ./bin/hadoop distcp dfs://localhost:8020/foo.txt foo.txt
> then the job runs successfully but the file is nowhere to be seen. It looks like the file gets copied to the map/reduce
> job's current working directory.
> 6) If a copy to a dfs is specified and the namenode cannot be resolved, the job fails with an IOException; no comprehensible error message is printed.
> 7) If an HTTP URI has a query component, it is disregarded when constructing the destination file name. For instance, if one specifies the following two URLs to be copied in a file list:
>   http://myhost.mydomain.com/files.cgi?n=/logs/foo.txt
>   http://myhost.mydomain.com/files.cgi?n=/logs/bar.txt
> a single file called 'files.cgi' is created and overwritten by one or both source files; it's not clear which. The destination
> path name should be constructed the way 'wget' does it, using the filename+query part of the URL, escaping characters as necessary.
> 8) It looks like, if a list of URLs is specified in a file, distcp runs a separate map/reduce job for each entry in the file. Why?
> It seems one could do a straight copy for local files, since that task needs to run locally, followed by a single MR job that
> copies the HDFS and HTTP URLs.
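For illustration, the wget-style destination naming proposed in item 7 could be sketched roughly as below. This is a hypothetical helper, not distcp's actual code: it keeps the last path segment plus the query string, escaping characters that are unsafe in file names, so the two example URLs no longer collide.

```java
import java.net.URI;

// Hypothetical sketch of wget-style destination naming (item 7 above):
// destination name = last path segment + query component, with '/'
// percent-escaped since it cannot appear in a Unix file name.
public class DestName {
    static String wgetStyleName(String url) {
        URI u = URI.create(url);
        String path = u.getPath();
        // Take everything after the last '/' in the path.
        String base = path.substring(path.lastIndexOf('/') + 1);
        String query = u.getRawQuery();
        if (query == null || query.isEmpty()) {
            return base;
        }
        return base + "?" + query.replace("/", "%2F");
    }

    public static void main(String[] args) {
        // The two URLs from the report now map to distinct names
        // instead of both becoming 'files.cgi'.
        System.out.println(wgetStyleName("http://myhost.mydomain.com/files.cgi?n=/logs/foo.txt"));
        System.out.println(wgetStyleName("http://myhost.mydomain.com/files.cgi?n=/logs/bar.txt"));
    }
}
```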

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.