Posted to common-issues@hadoop.apache.org by "Colin Patrick McCabe (JIRA)" <ji...@apache.org> on 2015/05/01 01:46:07 UTC

[jira] [Commented] (HADOOP-9984) FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by default

    [ https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522497#comment-14522497 ] 

Colin Patrick McCabe commented on HADOOP-9984:
----------------------------------------------

I've asked some people more familiar with the upper layers of the stack to comment on the security issues of symlinks.  When we last asked, those issues were quite real and very scary.  It's worth noting that even Linux software, which has had to deal with symlinks for decades, still often has security vulnerabilities caused by symlinks.

The globStatus issue is essentially the same issue as this one.  Should globStatus resolve symlinks or not?  In the case of globStatus, things are even worse if you choose to resolve symlinks, since then you can glob for '*foo' and get back 'bar'.  A lot of software breaks if globs return file names that the glob doesn't match.  A lot of users get highly confused, as well, when using FsShell.  I do not think globStatus should resolve symlinks, but the same group of "I don't want to ever think about a FileStatus type other than file or dir" people argued in favor of it.
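
To make the '*foo' / 'bar' surprise concrete, here is a minimal sketch using java.nio on a local POSIX filesystem as a stand-in.  It is an analogy only, not Hadoop's globStatus: the class name GlobSymlinkDemo, the helper glob(), and the file names are all hypothetical, and it assumes the platform allows creating symlinks.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a local-filesystem analogy, not the Hadoop API.
public class GlobSymlinkDemo {

    // Lists the names in dir that match the glob pattern.  When resolve is
    // true, each match is replaced by the name of its fully resolved target.
    static List<String> glob(Path dir, String pattern, boolean resolve) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, pattern)) {
            for (Path p : stream) {
                Path shown = resolve ? p.toRealPath() : p; // toRealPath follows symlinks
                names.add(shown.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("globdemo");
        Path bar = Files.createFile(dir.resolve("bar"));
        Files.createSymbolicLink(dir.resolve("linkfoo"), bar); // linkfoo -> bar

        System.out.println(glob(dir, "*foo", false)); // prints [linkfoo]
        System.out.println(glob(dir, "*foo", true));  // prints [bar]
    }
}
```

With resolution turned on, the caller asked for '*foo' and got back a name that the pattern does not match, which is the behavior that tends to break callers and confuse shell users.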

With regard to adding new APIs: let's be honest.  HDFS users take years to update to using new APIs, if they ever do.  Moving to new APIs is a huge pain because it means that they have to drop compatibility with older versions of Hadoop.  For example, Apache Spark is still supporting Hadoop 1.x, so they won't use any API newer than that.  Admittedly, this is kind of an extreme example, but even projects with more reasonable compat policies like HBase will want to wait a year or two before dropping support for a Hadoop release.  And even when the compatibility window opens up to use the new API, they have to understand why they should use it and make an active effort to do so.  I think the number of people who would use listStatus2, if it existed, is extremely small.

> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by default
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-9984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9984
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs
>    Affects Versions: 2.1.0-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch, HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch, HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch, HADOOP-9984.013.patch, HADOOP-9984.014.patch, HADOOP-9984.015.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that many existing HDFS clients would be broken by listStatus and globStatus returning symlinks.  One example is applications that assume that !FileStatus#isFile implies that the inode is a directory.  As we discussed in HADOOP-9972 and HADOOP-9912, we should default these APIs to returning resolved paths.
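
The "!isFile implies directory" assumption from the issue description can be sketched as follows.  This is a self-contained illustration, not Hadoop code: Entry and Kind are hypothetical stand-ins for a listing entry, and classifyLegacy / classifyStrict are made-up helper names.

```java
// Hypothetical demo class; Entry is a stand-in, not Hadoop's FileStatus.
public class SymlinkAwareListing {

    enum Kind { FILE, DIRECTORY, SYMLINK }

    // Minimal listing entry with the three predicates the discussion relies on.
    static final class Entry {
        final String name;
        final Kind kind;

        Entry(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }

        boolean isFile() { return kind == Kind.FILE; }
        boolean isDirectory() { return kind == Kind.DIRECTORY; }
        boolean isSymlink() { return kind == Kind.SYMLINK; }
    }

    // Legacy client logic: anything that is not a file is treated as a directory.
    static String classifyLegacy(Entry e) {
        return e.isFile() ? "file" : "dir";
    }

    // Symlink-aware logic: checks each type explicitly.
    static String classifyStrict(Entry e) {
        if (e.isFile()) return "file";
        if (e.isDirectory()) return "dir";
        return "symlink";
    }

    public static void main(String[] args) {
        Entry link = new Entry("current", Kind.SYMLINK);
        System.out.println(classifyLegacy(link)); // prints dir
        System.out.println(classifyStrict(link)); // prints symlink
    }
}
```

A client written in the legacy style would try to recurse into the symlink as if it were a directory, which is the kind of breakage that motivates returning resolved paths by default.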



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)