Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/05/17 03:01:57 UTC

[jira] Commented: (HADOOP-3173) inconsistent globbing support for dfs commands

    [ https://issues.apache.org/jira/browse/HADOOP-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597672#action_12597672 ] 

Chris Douglas commented on HADOOP-3173:
---------------------------------------

I mostly agree with Hairong; this is easy to do programmatically, and while there are a few alternatives (a different escape character, URI encoding, new "literal" FsShell commands, etc.), most appear to make the general case worse to accommodate a fairly esoteric use case.

On the other hand, there are only a few places (FsShell and FileInputFormat, mainly) where we call globStatus, and in each case a String is converted to a Path before being converted back into a String in globStatus. Without the conversion, the pattern syntax could mandate that the path separator be '/' independently of the Path syntax. Unfortunately, actually effecting this change is awkward, primarily because one must still create a Path from the glob string to obtain the FileSystem to resolve it against. If the glob string creates a Path to be resolved against a FileSystem other than the default, then the scheme, authority, etc. must be excised from the original string to preserve the escaping, which ultimately duplicates much of the URI parsing already happening in Path. Particularly for FileInputFormat and its users, pulling out all the Path dependencies (i.e. changing users of the globbing API) is a huge job with a modest payoff.
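
For concreteness, a little demo of the round-trip I mean (org.apache.hadoop.fs.Path is real; the class is just a throwaway, and the FileSystem resolution step is elided):

    import org.apache.hadoop.fs.Path;

    public class GlobRoundTrip {
      public static void main(String[] args) {
        // The glob arrives as a String (e.g. FsShell argv); the user
        // tries to escape the '*':
        String userGlob = "/user/rajive/a/\\*";
        // FsShell/FileInputFormat wrap it in a Path, which runs
        // normalizePath and URI normalization over it...
        Path p = new Path(userGlob);
        // ...and inside globStatus it is immediately a String again, so
        // GlobFilter compiles the normalized path, not what was typed:
        System.out.println(p.toUri().getPath());
      }
    }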

Since Path(String) already isolates this segment, we could introduce Path::getRawPath, which would preserve the path as it was before Path::normalizePath and URI::normalize. With this, globStatus would resolve Path::getRawPath instead of p.toUri().getPath(). Unfortunately, this would mean that globStatus(p) might return different results than globStatus(new Path(p.toString())), which means FileInputFormat would still have this issue. Even if Path(Path, String) and its variants preserved a raw path, their semantics would be unclear. In Path(Path, String), is the raw path equal to the raw path from the second arg only if it is absolute? Is the raw path from the first arg preserved in some way? We could just assert that the raw path differs from p.toUri().getPath() only if the Path was created with Path(String), but this could be confusing when creating globs from a base path (i.e. Path(Path, String) or, possibly more confusing, Path(String, Path)). The URI normalization also removes all the ".." and "." entries in the Path, which the regexp would then have to handle (e.g. "a/b/../c*" is resolved to "a/c*" now, but using the raw path, GlobFilter would accept "a/b/dd/c" since '.' matches GlobFilter::PAT_ANY). That said, FileInputFormats and all Strings that were once Paths wouldn't have to deal with this, while utilities like FsShell could match "a/b/../c" as a regexp, which might not be a bad thing.
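
To make that concrete, something along these lines; the field and helper names (rawPath, isolateRawPath) are hypothetical, standing in for the segment Path(String) already isolates, and this is a sketch rather than a tested patch:

    // Hypothetical sketch inside org.apache.hadoop.fs.Path:
    private String rawPath;                        // set only by Path(String)

    public Path(String pathString) {
      // capture the path segment before normalizePath/URI.normalize run
      this.rawPath = isolateRawPath(pathString);   // hypothetical helper
      // ... existing scheme/authority/path parsing unchanged ...
    }

    /** Pre-normalization path if this Path came from a String; the
     *  normalized form otherwise, so behavior elsewhere is unchanged. */
    public String getRawPath() {
      return (rawPath != null) ? rawPath : toUri().getPath();
    }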

If we want to fix this, I'd propose adding Path::getRawPath, which would be used in FileSystem::globStatus but could only differ from p.toUri().getPath() when the Path was created from a String. This covers cases where one wants to create a Path regexp manually and use it as a glob (as in FsShell), but should not change behavior elsewhere.
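
The corresponding change in FileSystem::globStatus would then be a one-liner (again a sketch; the rest of the method stays as it is):

    // In FileSystem::globStatus(Path pathPattern, PathFilter filter):
    // compile the glob from the preserved string rather than the
    // normalized URI path.
    String filename = pathPattern.getRawPath();
    // was: String filename = pathPattern.toUri().getPath();

Assuming GlobFilter's existing '\' escape handling, FsShell could then pass an escaped glob straight through, e.g. fs.globStatus(new Path("/user/rajive/a/\\*")) would match only the literal "/user/rajive/a/*" directory, while Paths built any other way would behave exactly as they do today.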

Thoughts?

> inconsistent globbing support for dfs commands
> ----------------------------------------------
>
>                 Key: HADOOP-3173
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3173
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>         Environment: Hadoop 0.16.1
>            Reporter: Rajiv Chittajallu
>             Fix For: 0.18.0
>
>
> hadoop dfs -mkdir /user/*/bar creates a directory "/user/*/bar", and you can't delete /user/* since -rmr expands the glob:
> $ hadoop dfs -mkdir /user/rajive/a/*/foo
> $ hadoop dfs -ls /user/rajive/a
> Found 4 items
> /user/rajive/a/*	<dir>		2008-04-04 16:09	rwx------	rajive	users
> /user/rajive/a/b	<dir>		2008-04-04 16:08	rwx------	rajive	users
> /user/rajive/a/c	<dir>		2008-04-04 16:08	rwx------	rajive	users
> /user/rajive/a/d	<dir>		2008-04-04 16:08	rwx------	rajive	users
> $ hadoop dfs -ls /user/rajive/a/*
> /user/rajive/a/*/foo	<dir>		2008-04-04 16:09	rwx------	rajive	users
> $ hadoop dfs -rmr /user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> I am not able to escape the '*' to keep it from being expanded.
> $ hadoop dfs -rmr '/user/rajive/a/*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr  '/user/rajive/a/\*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr  /user/rajive/a/\* 
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.