Posted to common-dev@hadoop.apache.org by "Andrew Hitchcock (JIRA)" <ji...@apache.org> on 2009/04/07 09:04:12 UTC

[jira] Created: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

distributed cache doesn't work with other distributed file systems
------------------------------------------------------------------

                 Key: HADOOP-5635
                 URL: https://issues.apache.org/jira/browse/HADOOP-5635
             Project: Hadoop Core
          Issue Type: Bug
          Components: filecache
            Reporter: Andrew Hitchcock
            Priority: Minor


Currently the DistributedCache does a check to see if the file to be included is an HDFS URI. If the URI isn't in HDFS, it returns the default filesystem. This prevents using other distributed file systems -- such as s3, s3n, or kfs -- with the distributed cache. When a user tries to use one of those filesystems, it reports an error that it can't find the path in HDFS.
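For illustration, the resolution logic being described can be sketched in plain Java. This is a minimal sketch, not the actual Hadoop source; the method names are hypothetical, and real code would return a FileSystem rather than a scheme string:

```java
import java.net.URI;

public class CacheUriCheck {

    // Sketch of the pre-patch behaviour: any URI whose scheme is not
    // "hdfs" is silently resolved against the default filesystem, so an
    // s3n:// or kfs:// path ends up being looked for in HDFS.
    static String buggyFilesystemFor(URI uri, String defaultScheme) {
        if (!"hdfs".equals(uri.getScheme())) {
            return defaultScheme;  // wrong for s3, s3n, kfs, ...
        }
        return uri.getScheme();
    }

    // Sketch of the intended behaviour: honour the URI's own scheme,
    // falling back to the default filesystem only for scheme-less paths.
    static String fixedFilesystemFor(URI uri, String defaultScheme) {
        String scheme = uri.getScheme();
        return (scheme == null) ? defaultScheme : scheme;
    }
}
```

With the buggy check, `s3n://bucket/data.txt` resolves to the default (HDFS) filesystem and the lookup fails; with the fixed check it resolves to s3n.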

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Andrew Hitchcock (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Hitchcock updated HADOOP-5635:
-------------------------------------

    Attachment: fix-distributed-cache.patch

This patch removes the check for HDFS and lets any file system through. The onus is on the user to ensure that the file system is globally available on all nodes.



[jira] Commented: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Andrew Hitchcock (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696765#action_12696765 ] 

Andrew Hitchcock commented on HADOOP-5635:
------------------------------------------

It sounds like what you want is a new feature, whereas this patch just fixes a bug.

Currently the behavior is not right. If a user specifies a non-HDFS URI for distributed cache then the job will fail because the tasks look for the file in HDFS. This patch fixes that for cases when the user specifies a URI to another distributed file system. With the patch, if a user specifies KFS or S3N (and the file system is properly configured) then the job will succeed. The behavior for specifying a URI not accessible on every machine remains unchanged: the job will fail as tasks are unable to reach the URI.

I think a feature for administrators to restrict distributed cache access to certain file systems should be a new Jira.



[jira] Commented: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705004#action_12705004 ] 

Tom White commented on HADOOP-5635:
-----------------------------------

Andrew,

This looks like a good change to me. Have you thought about how to write a unit test for this?

Also, the documentation in DistributedCache should be updated to remove HDFS assumptions. 



[jira] Updated: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-5635:
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I've just committed this. Thanks Andrew!



[jira] Commented: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Andrew Hitchcock (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713176#action_12713176 ] 

Andrew Hitchcock commented on HADOOP-5635:
------------------------------------------

The failing tests seem unrelated.



[jira] Updated: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Andrew Hitchcock (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Hitchcock updated HADOOP-5635:
-------------------------------------

    Attachment: HADOOP-5635.patch

I've updated the patch. It includes a unit test, fixes the error message in StreamJob, and updates some Javadocs that I had previously missed.



[jira] Commented: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713032#action_12713032 ] 

Hadoop QA commented on HADOOP-5635:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12408837/HADOOP-5635.patch
  against trunk revision 778388.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/403/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/403/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/403/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/403/console

This message is automatically generated.



[jira] Commented: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718604#action_12718604 ] 

Hudson commented on HADOOP-5635:
--------------------------------

Integrated in Hadoop-trunk #863 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/863/])
    



[jira] Commented: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Craig Macdonald (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696585#action_12696585 ] 

Craig Macdonald commented on HADOOP-5635:
-----------------------------------------

Wouldn't it be generally better if Hadoop were configured with a list of shared file systems? Then, when the administrator permitted it, users could use shared NFS filesystems as sources and targets for MapReduce jobs. E.g., in our setup, /local/ and /users/ are shared to all nodes. If we wanted to run a quick MapReduce test on stuff stored in /local/, we would currently have to copy it to the DFS, when it would be fine to run it as is.

{noformat}
<name>fs.shared.filesystems</name>
<value>hdfs://,file://users/,file://local/</value>
{noformat}
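Spelled out as a full configuration entry, the proposal might look like the following. Note that fs.shared.filesystems is a hypothetical property suggested here, not an existing Hadoop configuration key:

{noformat}
<!-- Hypothetical property for this proposal; not an actual Hadoop key. -->
<property>
  <name>fs.shared.filesystems</name>
  <value>hdfs://,file://users/,file://local/</value>
  <description>Filesystems the administrator declares to be visible
  from every node, and hence usable for job input, output, and the
  distributed cache.</description>
</property>
{noformat}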



[jira] Updated: (HADOOP-5635) distributed cache doesn't work with other distributed file systems

Posted by "Andrew Hitchcock (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Hitchcock updated HADOOP-5635:
-------------------------------------

             Assignee: Andrew Hitchcock
    Affects Version/s: 0.20.0
               Status: Patch Available  (was: Open)
