Posted to common-dev@hadoop.apache.org by "Ian Nowland (JIRA)" <ji...@apache.org> on 2009/05/16 04:05:45 UTC

[jira] Updated: (HADOOP-5805) problem using top level s3 buckets as input/output directories

     [ https://issues.apache.org/jira/browse/HADOOP-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Nowland updated HADOOP-5805:
--------------------------------

    Attachment: HADOOP-5805-0.patch

There are two problems here. 

The first is that S3N currently requires a terminating slash on the URI to indicate the root of a bucket. That is, it accepts s3n://infocloud-input/ but not s3n://infocloud-input. The attached patch fixes this by allowing either form.
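To illustrate the idea (this is a standalone sketch, not the actual patch, and pathToKey here is a simplified stand-in for NativeS3FileSystem.pathToKey): both an empty URI path and a bare "/" can be normalized to the same bucket-root key instead of rejecting the slash-less form.

```java
import java.net.URI;

public class S3nPathSketch {
    // Simplified stand-in for NativeS3FileSystem.pathToKey: maps an s3n URI
    // to an S3 object key, where the empty string denotes the bucket root.
    static String pathToKey(URI uri) {
        String path = uri.getPath();  // "" for s3n://bucket, "/" for s3n://bucket/
        if (path.isEmpty() || path.equals("/")) {
            return "";                // treat both forms as the bucket root
        }
        return path.substring(1);     // strip the leading slash for object keys
    }

    public static void main(String[] args) {
        System.out.println(pathToKey(URI.create("s3n://infocloud-input")));        // root
        System.out.println(pathToKey(URI.create("s3n://infocloud-input/")));       // root
        System.out.println(pathToKey(URI.create("s3n://infocloud-input/sub/key"))); // "sub/key"
    }
}
```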

This fixes the input bucket case but not the output one. 

The second problem is that S3N requires a bucket to already exist before it can use it. But if you then attempt to use that bucket's root as the job output, you get the standard Hadoop behavior: FileOutputFormat throws a FileAlreadyExistsException, even if the bucket is empty, because the bucket's root directory "/" does exist. To me the ideal fix for this second problem is to change FileOutputFormat so that it does not throw when the output directory exists but is empty. However, that would be a fairly large change to established behavior, so I did not include it in this more trivial patch.
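The suggested (but not submitted) FileOutputFormat change could look roughly like the following. This is a hypothetical sketch using java.io.File in place of Hadoop's FileSystem API, and checkOutputDir is an invented name, not the real checkOutputSpecs:

```java
import java.io.File;
import java.io.IOException;

public class OutputCheckSketch {
    // Hypothetical variant of the output check: an existing output directory
    // is only rejected if it is non-empty. An empty directory -- such as the
    // root "/" of a freshly created, empty bucket -- is allowed.
    static void checkOutputDir(File dir) throws IOException {
        if (dir.exists()) {
            String[] children = dir.list();
            if (children != null && children.length > 0) {
                throw new IOException(
                    "Output directory " + dir + " already exists and is not empty");
            }
            // exists but empty: permit the job to write here
        }
    }
}
```

The current behavior, by contrast, throws as soon as FileSystem.exists() returns true for the output path, which is always the case for a bucket root.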

As an aside, since each AWS account is limited to 100 buckets, you generally don't want to write the output of each job to a new bucket anyway.


> problem using top level s3 buckets as input/output directories
> --------------------------------------------------------------
>
>                 Key: HADOOP-5805
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5805
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 0.18.3
>         Environment: ec2, cloudera AMI, 20 nodes
>            Reporter: Arun Jacob
>         Attachments: HADOOP-5805-0.patch
>
>
> When I specify top level s3 buckets as input or output directories, I get the following exception.
> hadoop jar subject-map-reduce.jar s3n://infocloud-input s3n://infocloud-output
> java.lang.IllegalArgumentException: Path must be absolute: s3n://infocloud-output
>         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.pathToKey(NativeS3FileSystem.java:246)
>         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:319)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>         at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:109)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:738)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
>         at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.run(SubjectMRDriver.java:63)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.main(SubjectMRDriver.java:25)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
> The workaround is to specify input/output buckets with sub-directories:
>  
> hadoop jar subject-map-reduce.jar s3n://infocloud-input/input-subdir  s3n://infocloud-output/output-subdir

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.