Posted to common-user@hadoop.apache.org by Tom White <to...@gmail.com> on 2008/03/20 17:43:17 UTC

Input file globbing

I'm trying to use file globbing to select various input paths, like so:

conf.setInputPath(new Path("mr/input/glob/2008/02/{02,08}"));

But this gives an exception:

Exception in thread "main" java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob {02 at 3
	at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1023)
	at org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1008)
	at org.apache.hadoop.fs.FileSystem$GlobFilter.<init>(FileSystem.java:926)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:826)
	at org.apache.hadoop.fs.FileSystem.globPaths(FileSystem.java:873)
	at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:131)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:541)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:809)

Looking at the code for JobConf.getInputPaths I see it tokenizes using
a comma as the delimiter, producing two paths
"mr/input/glob/2008/02/{02" and "08}". This looks like a bug to me.
I'm surprised as this feature has been around for some time - are
folks not using it like this?
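
To illustrate the splitting described above, here is a minimal standalone sketch. It does not reproduce Hadoop's code: the class NaiveCommaSplit and the method naiveSplit are hypothetical names, and the sketch only assumes that the comma-separated path list is tokenized with no awareness of glob braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NaiveCommaSplit {

    // Hypothetical helper (not a Hadoop API): splits a path list on every
    // comma, including commas that sit inside a {..} glob set.
    static List<String> naiveSplit(String paths) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(paths, ",");
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints two fragments: "mr/input/glob/2008/02/{02" and "08}"
        for (String p : naiveSplit("mr/input/glob/2008/02/{02,08}")) {
            System.out.println(p);
        }
    }
}
```

The first fragment, "mr/input/glob/2008/02/{02", has an opening brace with no closing brace, which matches the "for glob {02" part of the exception message.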

Tom

Re: Input file globbing

Posted by Tom White <to...@gmail.com>.
Thanks Hairong,

I've just created https://issues.apache.org/jira/browse/HADOOP-3064 for this.

Tom

On 20/03/2008, Hairong Kuang <ha...@yahoo-inc.com> wrote:
> Yes, this is a bug. It occurs only when a job's input path contains a
>  {...} closure. JobConf.getInputPaths interprets
>  mr/input/glob/2008/02/{02,08} as two input paths:
>  mr/input/glob/2008/02/{02 and 08}. Let's see how to fix it.
>
>
>  Hairong


-- 
Blog: http://www.lexemetech.com/

Re: Input file globbing

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
Yes, this is a bug. It occurs only when a job's input path contains a
{...} closure. JobConf.getInputPaths interprets
mr/input/glob/2008/02/{02,08} as two input paths:
mr/input/glob/2008/02/{02 and 08}. Let's see how to fix it.
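
One possible direction, sketched here purely as an illustration and not as the fix adopted for this bug, is to split on a comma only when it falls outside any {...} group. The class BraceAwareSplit and the method split are hypothetical names, and the sketch assumes braces are never escaped.

```java
import java.util.ArrayList;
import java.util.List;

public class BraceAwareSplit {

    // Hypothetical helper (not a Hadoop API): splits on commas that are
    // outside {..} groups, so a glob set stays inside a single path.
    // Assumption: braces are not escaped in the input.
    static List<String> split(String paths) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open '{'
        for (int i = 0; i < paths.length(); i++) {
            char c = paths.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0) {
                // Top-level comma: end of one path (empty entries are skipped).
                if (current.length() > 0) {
                    result.add(current.toString());
                }
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the glob path as one entry, then "mr/other" as a second entry.
        for (String p : split("mr/input/glob/2008/02/{02,08},mr/other")) {
            System.out.println(p);
        }
    }
}
```

With this approach a list of plain paths still splits as before, while a path such as mr/input/glob/2008/02/{02,08} is passed to the glob matcher intact.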

Hairong


On 3/20/08 9:43 AM, "Tom White" <to...@gmail.com> wrote:

> I'm trying to use file globbing to select various input paths, like so:
> 
> conf.setInputPath(new Path("mr/input/glob/2008/02/{02,08}"));
> 
> But this gives an exception:
> 
> Exception in thread "main" java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob {02 at 3
> at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1023)
> at org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1008)
> at org.apache.hadoop.fs.FileSystem$GlobFilter.<init>(FileSystem.java:926)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:826)
> at org.apache.hadoop.fs.FileSystem.globPaths(FileSystem.java:873)
> at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:131)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:541)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:809)
> 
> Looking at the code for JobConf.getInputPaths I see it tokenizes using
> a comma as the delimiter, producing two paths
> "mr/input/glob/2008/02/{02" and "08}". This looks like a bug to me.
> I'm surprised as this feature has been around for some time - are
> folks not using it like this?
> 
> Tom