Posted to hdfs-dev@hadoop.apache.org by Charles Baker <cb...@sdl.com> on 2011/10/28 18:46:33 UTC

relative symbolic links in HDFS

Hey guys. We are in the early stages of planning and evaluating a hadoop
'cold-storage' cluster for medium to long term storage of mixed data (small
to large files, zips, tar, etc...) and tons of symlinks. We do realize that
small files aren't ideal in HDFS but it's for long-term storage and beats the
cost of more NetApps by potentially several hundred thousand dollars by
leveraging existing equipment. We are already successfully using Hadoop and
the MapReduce framework in a different project and have developed quite a bit
of in-house expertise when it comes to Hadoop.

 

Since this use-case is preserving and restoring an arbitrary directory
structure, I have been evaluating 0.21.0's support of symlinks and found that
although it happily creates relative symlinks, the code that is called to
retrieve the symlink 'FileContext.getFileLinkStatus()' always converts the
relative Path object to an absolute one through the use of the
qualifySymlinkTarget() method. I was easily able to work around this
limitation by changing the one line of code that calls this function from:

fi.setSymlink(qualifySymlinkTarget(fs, p, fi.getSymlink()));

to:

fi.setSymlink(fi.getSymlink());

However, it has made us curious as to why the decision was made to always
return the absolute path of a symlink in the first place. Is it that attempts
to open the targets of relative symlinks throw exceptions, and it saves the
user the work of constructing the absolute path since that's the general
use-case? Or does this workaround violate some internal assumptions of the
code or ideas about how a URI should behave (even though relative paths are
implicitly supported by the URI object)? Any insight you guys can shed on
this would be great. I've tested the above change by adding support for
symlinks (into and out of HDFS) to FsShell.copyToLocal() and copyFromLocal()
using a mixed bag of relative and absolute symlinks and symlinks->symlinks
and have so far found no ill effects.

 

Thanks!

 

-Chuck

 

www.sdl.com/sdl-vision

SDL PLC confidential, all rights reserved.
If you are not the intended recipient of this mail SDL requests and requires that you delete it without acting upon or copying any of its contents, and we further request that you advise us.
SDL PLC is a public limited company registered in England and Wales.  Registered number: 02675207.
Registered address: Globe House, Clivemont Road, Maidenhead, Berkshire SL6 7DY, UK.

RE: relative symbolic links in HDFS

Posted by Charles Baker <cb...@sdl.com>.
Oh, and sorry about the signature. The mailserver injects that
automatically...

-Chuck


Re: relative symbolic links in HDFS

Posted by Daryn Sharp <da...@yahoo-inc.com>.
Universal support for FileContext & symlinks in all commands should be coming "soon".  A few jiras that removed complications were recently committed or are in the process of being committed.  Copy commands will require some extra parameters to control whether symlinks are dereferenced.

Daryn




Re: relative symbolic links in HDFS

Posted by Eli Collins <el...@cloudera.com>.
On Tue, Nov 1, 2011 at 8:19 AM, Daryn Sharp <da...@yahoo-inc.com> wrote:
> On Oct 31, 2011, at 5:41 PM, Eli Collins wrote:
>>> Regardless, I do think it makes sense to have a convenience method to get the
>>> raw path that was supplied at symlink creation. The first thing I tried was
>>> Path#toString() so that I guess is pretty intuitive but I can't comment on
>>> whether that would break compatibility.
>>>
>>
>> Would a method Path#getPathPart that returns just the path part of the
>> URI be sufficient?  This would be similar to java's URI#getPath
>> (remember in Hadoop Path == URI) which just returns the path part of a
>> URI.
>>
>> (It's unfortunate that the Path class is named "Path" since now we
>> don't have a good name for just the path part).
>
> I replied to another message in this thread regarding tracking the actual path.  If I understand you correctly: In the FsShell, returning just the path component wouldn't be of much use.  It needs the exact path/uri, as provided by the user, so its custom tracking can be removed.
>

How about filing a jira which indicates the tracking code in FsShell
you'd like to remove and we can come up with an API that works.

Thanks,
Eli

> Daryn

Re: relative symbolic links in HDFS

Posted by Daryn Sharp <da...@yahoo-inc.com>.
On Oct 31, 2011, at 5:41 PM, Eli Collins wrote:
>> Regardless, I do think it makes sense to have a convenience method to get the
>> raw path that was supplied at symlink creation. The first thing I tried was
>> Path#toString() so that I guess is pretty intuitive but I can't comment on
>> whether that would break compatibility.
>> 
> 
> Would a method Path#getPathPart that returns just the path part of the
> URI be sufficient?  This would be similar to java's URI#getPath
> (remember in Hadoop Path == URI) which just returns the path part of a
> URI.
> 
> (It's unfortunate that the Path class is named "Path" since now we
> don't have a good name for just the path part).

I replied to another message in this thread regarding tracking the actual path.  If I understand you correctly: In the FsShell, returning just the path component wouldn't be of much use.  It needs the exact path/uri, as provided by the user, so its custom tracking can be removed.

Daryn

Re: relative symbolic links in HDFS

Posted by Eli Collins <el...@cloudera.com>.
On Mon, Oct 31, 2011 at 2:19 PM, Charles Baker <cb...@sdl.com> wrote:
> I did a hasty test initially of getLinkTarget() but forgot to also use the
> same for the input path to FileContext#createSymlink() so yeah, turns out it
> does indeed work. Sorry about that. Looks like I won't need to modify
> FileContext after all which is good :)
>
> The rationale of keeping things consistent so as not to break compatibility
> makes sense, it just isn't that intuitive coming at it from a 'fresh'
> perspective. Was the original idea to return the symlink information in
> getFileStatus() instead of having to access it via  getFileLinkStatus()?
> Maybe it's naive but it seems like you could just rename getFileLinkStatus()
> to getFileStatus() and none would be the wiser...

You need both, getFileStatus is like stat(2) and getFileLinkStatus is
like lstat(2). getFileStatus resolves all symlinks in the path, ie you
want the FileStatus of the file that the path points to (regardless of
links), while getFileLinkStatus, if called on a symlink, will give you
the FileStatus of the link (not what it points to). Only applications
that are link-aware need to use getFileLinkStatus (otherwise links are
resolved transparently to the caller).
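
The same split can be seen outside Hadoop. The following is only a rough analogy using java.nio.file on a local filesystem that supports symlinks, not the FileContext API itself; the class and method names are made up for illustration. Reading attributes normally follows the link (stat-like), while passing LinkOption.NOFOLLOW_LINKS describes the link itself (lstat-like).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical demo class; a local-filesystem analogy, not Hadoop code.
class StatVsLstatDemo {
    // Returns {isSymbolicLink when following the link, isSymbolicLink when not following}.
    static boolean[] probe() {
        try {
            Path dir = Files.createTempDirectory("statdemo");
            Path file = Files.createFile(dir.resolve("file"));
            // Relative target: "link" points at the sibling entry "file".
            Path link = Files.createSymbolicLink(dir.resolve("link"), file.getFileName());

            // Like stat(2): the link is resolved and the target file is described.
            BasicFileAttributes followed =
                Files.readAttributes(link, BasicFileAttributes.class);
            // Like lstat(2): the link itself is described.
            BasicFileAttributes notFollowed =
                Files.readAttributes(link, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);

            return new boolean[] { followed.isSymbolicLink(), notFollowed.isSymbolicLink() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = probe();
        System.out.println("following the link, isSymbolicLink = " + r[0]);
        System.out.println("not following,      isSymbolicLink = " + r[1]);
    }
}
```

A caller that never passes NOFOLLOW_LINKS never notices that a link was involved, which is the "resolved transparently" case described above.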

>
> Regardless, I do think it makes sense to have a convenience method to get the
> raw path that was supplied at symlink creation. The first thing I tried was
> Path#toString() so that I guess is pretty intuitive but I can't comment on
> whether that would break compatibility.
>

Would a method Path#getPathPart that returns just the path part of the
URI be sufficient?  This would be similar to java's URI#getPath
(remember in Hadoop Path == URI) which just returns the path part of a
URI.

(It's unfortunate that the Path class is named "Path" since now we
don't have a good name for just the path part).
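
For reference, this is what java.net.URI#getPath does with a qualified and an unqualified input. It is a small sketch using only java.net.URI; Path#getPathPart is just the name proposed above and does not exist, and the demo class name is made up.

```java
import java.net.URI;

// Hypothetical demo class; uses only java.net.URI, not Hadoop's Path.
class PathPartDemo {
    // Returns just the path component, dropping scheme and authority if present.
    static String pathPart(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        System.out.println(pathPart("hdfs://myhost:123/some/file"));
        System.out.println(pathPart("../../cbaker"));
    }
}
```

Note that getPath only strips the scheme and authority. If the URI was already resolved to an absolute form, the original relative spelling is not recoverable from it.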

Thanks,
Eli



RE: relative symbolic links in HDFS

Posted by Charles Baker <cb...@sdl.com>.
I did a hasty test initially of getLinkTarget() but forgot to also use the
same for the input path to FileContext#createSymlink() so yeah, turns out it
does indeed work. Sorry about that. Looks like I won't need to modify
FileContext after all which is good :)

The rationale of keeping things consistent so as not to break compatibility
makes sense, it just isn't that intuitive coming at it from a 'fresh'
perspective. Was the original idea to return the symlink information in
getFileStatus() instead of having to access it via  getFileLinkStatus()?
Maybe it's naive but it seems like you could just rename getFileLinkStatus()
to getFileStatus() and none would be the wiser...

Regardless, I do think it makes sense to have a convenience method to get the
raw path that was supplied at symlink creation. The first thing I tried was
Path#toString() so that I guess is pretty intuitive but I can't comment on
whether that would break compatibility.

Thanks!

-Chuck




Re: relative symbolic links in HDFS

Posted by Daryn Sharp <da...@yahoo-inc.com>.
On Oct 31, 2011, at 1:45 PM, Eli Collins wrote:

> On Mon, Oct 31, 2011 at 9:27 AM, Charles Baker <cb...@sdl.com> wrote:
>> But even
>> if it did return the relative path, it seems counter-intuitive to me. I agree
>> with Daryn and expect the behavior of getFileLinkStatus() to return the
>> symlink as is and not presume that I wanted it qualified. If I wanted a
>> qualified path for a symlink, I would expect to call Path.makeQualified() to
>> do so.
> 
> It does this because getFileStatus always returns fully qualified
> paths in HDFS, and we don't want to make callers check the type and
> care about the method that was used to obtain the FileStatus, eg to
> know whether it contains a fully qualified path or not.
> 
> I think the original rationale for why FileStatus objects always
> have fully qualified paths is so they can be passed around w/o callers
> having to do future work to access them, ie we didn't want to disassociate
> the path from the file system it exists on. Note that in Hadoop
> "Paths" are actually URIs, vs file system paths (a subset of URIs).
> 
> Regardless of the rationale, changing getFileStatus to return objects
> w/o fully qualified paths would break compatibility with a lot of
> existing programs. It would also hinder people porting to FileContext
> which tries to be consistent with FileSystem.
> 
> Would a new method on FileStatus or Path that returns the unqualified
> version of the path (ie w/o the scheme and authority, and w/o
> resolving relative paths relative to the FileContext) work?  Ie the
> FileStatus could return the contents of the HdfsFileStatus w/o making
> it fully qualified.

Off the top of my head:  If we can't change the FileStatus behavior, I think FsShell would be well served by a Path#getRawUri() method that returned the exact uri/string used to instantiate it.  As long as FileStatus preserves the path it's given, I think it would work well.  The various Path ctors would need to be modified to update the rawUri by tacking on directory components, or removing them.

I don't suppose it'd be ok for Path#toString() to return the stringified raw uri? :)
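
A rough sketch of the idea, assuming a standalone wrapper rather than a change to Path itself. RawTrackingPath and its methods are hypothetical names, and java.net.URI stands in for Hadoop's Path. The object keeps the caller's string untouched next to the qualified form.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop: remembers the exact string the
// caller supplied alongside the URI obtained by qualifying it against a base.
final class RawTrackingPath {
    private final String raw;     // exactly what the caller typed
    private final URI qualified;  // resolved and normalized against the base

    RawTrackingPath(URI base, String raw) {
        this.raw = raw;
        this.qualified = base.resolve(raw);
    }

    String getRaw() { return raw; }

    URI getQualified() { return qualified; }

    public static void main(String[] args) {
        // Hypothetical working directory on a hypothetical namenode.
        URI cwd = URI.create("hdfs://myhost:123/user/cbaker/");
        RawTrackingPath p = new RawTrackingPath(cwd, "foo/../bar");
        System.out.println("raw:       " + p.getRaw());
        System.out.println("qualified: " + p.getQualified());
    }
}
```

With something along these lines a shell could print getRaw() in its output and use getQualified() for the actual operation, without tracking the user's path separately.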

Daryn

Re: relative symbolic links in HDFS

Posted by Eli Collins <el...@cloudera.com>.
On Mon, Oct 31, 2011 at 9:27 AM, Charles Baker <cb...@sdl.com> wrote:
> Hey guys. Thanks for the replies. Fully qualified symbolic links are
> problematic in that when we wish to restore a directory structure containing
> symlinks from HDFS to local filesystem, the relativity is lost. For instance:
>
> /user/cbaker/foo/
>                link1 -> ../../cbaker
>
> The current behavior of getFileLinkStatus() results in a path for link 1
> being:
>
> /user/cbaker
>
> Not:
>
> ../../cbaker
>
>
> Also, some symlinks may point to non-existent locations within HDFS which
> only have relevance to the local filesystem. This appears as though it could
> (though I haven't tested yet) result in an exception when the attempt is made
> to qualify it. If I get a chance, I'll try it out later today.
>
> FileContext.getLinkTarget() doesn't work for this case since it returns only
> the final component of the target, not the complete relative path.

Really?  FC#getLinkTarget should return the target verbatim, as
specified by the user when creating the link:

Eg see test testCreateLinkToDotDotPrefix:
 fc.createSymlink(new Path("../file"), link, false);
 ...
 assertEquals(new Path("../file"), fc.getLinkTarget(link));
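
A local-filesystem analogy of the same round trip, using java.nio.file instead of FileContext (the class name is made up, and a filesystem with symlink support is assumed): reading the link back returns the target as it was written, while resolving the link gives an absolute location.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; a local-filesystem analogy, not Hadoop code.
class LinkTargetDemo {
    // Creates dir/file and dir/sub/link -> ../file, then returns
    // {target as stored in the link, fully resolved location}.
    static Path[] storedAndResolved() {
        try {
            Path dir = Files.createTempDirectory("linkdemo");
            Files.createFile(dir.resolve("file"));
            Path sub = Files.createDirectory(dir.resolve("sub"));
            Path link = Files.createSymbolicLink(sub.resolve("link"), Paths.get("../file"));

            Path stored = Files.readSymbolicLink(link); // the target exactly as created
            Path resolved = link.toRealPath();          // absolute path with links followed
            return new Path[] { stored, resolved };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path[] r = storedAndResolved();
        System.out.println("stored:   " + r[0]);
        System.out.println("resolved: " + r[1]);
    }
}
```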


> But even
> if it did return the relative path, it seems counter-intuitive to me. I agree
> with Daryn and expect the behavior of getFileLinkStatus() to return the
> symlink as is and not presume that I wanted it qualified. If I wanted a
> qualified path for a symlink, I would expect to call Path.makeQualified() to
> do so.

It does this because getFileStatus always returns fully qualified
paths in HDFS, and we don't want to make callers check the type and
care about the method that was used to obtain the FileStatus, eg to
know whether it contains a fully qualified path or not.

I think the original rationale for why FileStatus objects always
have fully qualified paths is so they can be passed around w/o callers
having to do future work to access them, ie we didn't want to disassociate
the path from the file system it exists on. Note that in Hadoop
"Paths" are actually URIs, vs file system paths (a subset of URIs).

Regardless of the rationale, changing getFileStatus to return objects
w/o fully qualified paths would break compatibility with a lot of
existing programs. It would also hinder people porting to FileContext
which tries to be consistent with FileSystem.

Would a new method on FileStatus or Path that returns the unqualified
version of the path (ie w/o the scheme and authority, and w/o
resolving relative paths relative to the FileContext) work?  Ie the
FileStatus could return the contents of the HdfsFileStatus w/o making
it fully qualified.

Thanks,
Eli

RE: relative symbolic links in HDFS

Posted by Charles Baker <cb...@sdl.com>.
Hey guys. Thanks for the replies. Fully qualified symbolic links are
problematic in that when we wish to restore a directory structure containing
symlinks from HDFS to local filesystem, the relativity is lost. For instance:

/user/cbaker/foo/
                link1 -> ../../cbaker

The current behavior of getFileLinkStatus() results in a path for link 1
being:

/user/cbaker

Not:

../../cbaker
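
The effect can be reproduced with plain java.net.URI. This is only a sketch of the resolution step, not the actual qualifySymlinkTarget() code; the host and port are placeholders and the class and method names are made up.

```java
import java.net.URI;

// Hypothetical demo class; shows URI resolution only, not Hadoop's implementation.
class LinkQualifyDemo {
    // Resolves a link target against the directory containing the link,
    // roughly what qualifying a relative target amounts to.
    static URI qualify(URI linkParentDir, String target) {
        return linkParentDir.resolve(target);
    }

    public static void main(String[] args) {
        // Placeholder namenode address; the directory matches the example above.
        URI parent = URI.create("hdfs://myhost:123/user/cbaker/foo/");
        String stored = "../../cbaker";
        URI qualified = qualify(parent, stored);
        System.out.println(stored + " -> " + qualified);
        System.out.println("path part: " + qualified.getPath());
    }
}
```

Once the target has been resolved this way, nothing in the result says whether the link was originally written as ../../cbaker or as /user/cbaker, so the relative form can't be restored on the local side.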


Also, some symlinks may point to non-existent locations within HDFS which
only have relevance to the local filesystem. This appears as though it could
(though I haven't tested yet) result in an exception when the attempt is made
to qualify it. If I get a chance, I'll try it out later today.

FileContext.getLinkTarget() doesn't work for this case since it returns only
the final component of the target, not the complete relative path. But even
if it did return the relative path, it seems counter-intuitive to me. I agree
with Daryn and expect the behavior of getFileLinkStatus() to return the
symlink as is and not presume that I wanted it qualified. If I wanted a
qualified path for a symlink, I would expect to call Path.makeQualified() to
do so. 

Insofar as porting FsShell to FileContext, I've only modified it to support
our use-case. I haven't gone to the extent of fully porting it to
FileContext. Though I'd love to, unfortunately I'm too busy right now to
contribute :(

Thanks!

-Chuck





Re: relative symbolic links in HDFS

Posted by Daryn Sharp <da...@yahoo-inc.com>.
It's generally been a problem that filesystem operations mangle paths to be something other than what the user provided.  FsShell has to go to some (unnecessary, imho) lengths to independently track the user's given path so the output paths will match what the user provided.  Not displaying the user-given path makes it difficult/impossible for scripts to accurately parse the output for the results of an operation on the given paths.

I like getLinkTarget returning the exact target, but I'd also like a FileStatus to return the given path both in the case of a normal path and a symlink.  If the user needs a fully qualified path for an operation, my opinion is that they should request it explicitly.

Daryn




Re: relative symbolic links in HDFS

Posted by Eli Collins <el...@cloudera.com>.
Hey Chuck,

Why is it problematic for your use that the symlink is stored in
FileStatus fully qualified - you'd like FileStatus#getSymlink to
return the same Path that you used as the target in createSymlink?

The current behavior exists so that getFileLinkStatus is consistent with
getFileStatus(new Path("/some/file")) which returns a fully qualified
path (e.g. hdfs://myhost:123/some/file).   Note that you can use
FileContext#getLinkTarget to return the path used when creating the
link. Some more background is in the design doc:
https://issues.apache.org/jira/secure/attachment/12434745/design-doc-v4.txt
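
For reference, "fully qualified" here just means the scheme and authority
of the filesystem are filled in. A minimal illustration with plain
java.net.URI (not the Hadoop Path class; QualifyDemo and qualify() are
names invented for this sketch):

```java
import java.net.URI;

public class QualifyDemo {
    // Fill in scheme and authority from the filesystem's URI.
    // Plain java.net.URI, shown only to illustrate the idea.
    static URI qualify(URI fsUri, String path) {
        return fsUri.resolve(path);
    }

    public static void main(String[] args) {
        URI fsUri = URI.create("hdfs://myhost:123/");
        System.out.println(qualify(fsUri, "/some/file"));
    }
}
```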

There's a jira for porting FsShell to FileContext (HADOOP-6424), if
you have a patch (even partial) feel free to post it to the jira.
Note that since symlinks are not implemented in FileSystem, clients
that use FileSystem to access paths with symlinks will fail.

Btw when looking at the code you pointed out I noticed a bug in link
resolution (HADOOP-7783), thanks!

Thanks,
Eli

