Posted to mapreduce-user@hadoop.apache.org by Dan Yi <dy...@mediosystems.com> on 2012/07/20 02:44:41 UTC
use S3 as input to MR job
I have an MR job that reads files on Amazon S3 and processes the data on local HDFS. The files are gzipped text files (.gz). I tried to set up the job as below, but it won't work. Does anyone know what might be wrong? Do I need an extra step to unzip the files first? Thanks.
String S3_LOCATION = "s3n://access_key:private_key@bucket_name";

protected void prepareHadoopJob() throws Exception {
    this.getHadoopJob().setMapperClass(Mapper1.class);
    this.getHadoopJob().setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(this.getHadoopJob(), new Path(S3_LOCATION));
    // Map-only job writing directly to an HBase table.
    this.getHadoopJob().setNumReduceTasks(0);
    this.getHadoopJob().setOutputFormatClass(TableOutputFormat.class);
    this.getHadoopJob().getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, myTable.getTableName());
    this.getHadoopJob().setOutputKeyClass(ImmutableBytesWritable.class);
    this.getHadoopJob().setOutputValueClass(Put.class);
}
Dan Yi | Software Engineer, Analytics Engineering
Medio Systems Inc | 701 Pike St. #1500 Seattle, WA 98101
Predictive Analytics for a Connected World
Re: use S3 as input to MR job
Posted by Marcos Ortiz <ml...@uci.cu>.
Are you sure your MR code is prepared to work with multiple files? This example (WordCount) works with a single input. You should take a look at the MultipleInputs API for this; a rough sketch is below.
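For instance (an untested sketch; the paths and the MapperA/MapperB classes are made up, just to show the shape of the API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// job is an existing org.apache.hadoop.mapreduce.Job instance.
// Each input path can be bound to its own InputFormat and Mapper;
// MapperA and MapperB are hypothetical mapper classes.
MultipleInputs.addInputPath(job, new Path("s3n://bucket_name/input/a"),
        TextInputFormat.class, MapperA.class);
MultipleInputs.addInputPath(job, new Path("s3n://bucket_name/input/b"),
        TextInputFormat.class, MapperB.class);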
Best wishes
On 02/10/2012 6:05, Ben Kim wrote:
> I'm having a similar issue
>
> [rest of quoted message trimmed; Ben Kim's full post, Harsh J's reply, and
> Dan Yi's original message all appear elsewhere in this thread]
--
Marcos Ortiz Valmaseda,
Data Engineer && Senior System Administrator at UCI
Blog: http://marcosluis2186.posterous.com
Linkedin: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
Re: use S3 as input to MR job
Posted by Ben Kim <be...@gmail.com>.
I'm having a similar issue.

I'm running a wordcount MR as follows:

hadoop jar WordCount.jar wordcount.WordCountDriver s3n://bucket/wordcount/input s3n://bucket/wordcount/output

s3n://bucket/wordcount/input is an S3 path that contains the input files.

However, I get the following NPE:
12/10/02 18:56:23 INFO mapred.JobClient: map 0% reduce 0%
12/10/02 18:56:54 INFO mapred.JobClient: map 50% reduce 0%
12/10/02 18:56:56 INFO mapred.JobClient: Task Id : attempt_201210021853_0001_m_000001_0, Status : FAILED
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
    at java.io.BufferedInputStream.close(BufferedInputStream.java:451)
    at java.io.FilterInputStream.close(FilterInputStream.java:155)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.close(LineRecordReader.java:144)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:497)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
The MR job runs fine if I specify a more specific input path, such as
s3n://bucket/wordcount/input/file.txt.
What I want is to be able to pass S3 folders as parameters.
Does anyone know how to do this?
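One workaround I am considering is to list the folder myself in the driver and add each file as its own input path (a rough, untested sketch; `job` stands in for my actual Job object):

import java.net.URI;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Expand the S3 "folder" ourselves and add each file individually,
// instead of passing the bare prefix to the job.
FileSystem fs = FileSystem.get(URI.create("s3n://bucket"), job.getConfiguration());
for (FileStatus stat : fs.listStatus(new Path("s3n://bucket/wordcount/input"))) {
    if (!stat.isDir()) {
        FileInputFormat.addInputPath(job, stat.getPath());
    }
}

But I would rather pass the folder directly, if that is supposed to work.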
Best regards,
Ben Kim
On Fri, Jul 20, 2012 at 10:33 AM, Harsh J <ha...@cloudera.com> wrote:
> Dan,
>
> Can you share your error? Plain .gz files (not .tar.gz) are natively
> supported by Hadoop via its GzipCodec, so if you are facing an error, I
> believe it is caused by something other than compression.
>
>
> On Fri, Jul 20, 2012 at 6:14 AM, Dan Yi <dy...@mediosystems.com> wrote:
>
>> [quoted original message and signature trimmed; see Dan Yi's post at the
>> top of the thread]
>
> --
> Harsh J
>
--
Benjamin Kim
benkimkimben at gmail
Re: use S3 as input to MR job
Posted by Harsh J <ha...@cloudera.com>.
Dan,
Can you share your error? Plain .gz files (not .tar.gz) are natively
supported by Hadoop via its GzipCodec, so if you are facing an error, I
believe it is caused by something other than compression.
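As a quick sanity check (a rough, untested sketch; the bucket path is made up), you can ask Hadoop's CompressionCodecFactory which codec it resolves for a .gz path, and read one line through it the way a record reader would:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzipCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // TextInputFormat consults the same factory to pick a codec
        // from the file extension.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        Path path = new Path("s3n://bucket_name/some/file.gz"); // made-up path
        CompressionCodec codec = factory.getCodec(path); // GzipCodec for .gz
        System.out.println("Resolved codec: " + codec);

        // Read one decompressed line, the way LineRecordReader would.
        try (InputStream raw = path.getFileSystem(conf).open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(codec.createInputStream(raw)))) {
            System.out.println(reader.readLine());
        }
    }
}

If that prints GzipCodec and a sensible first line, compression is not your problem.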
On Fri, Jul 20, 2012 at 6:14 AM, Dan Yi <dy...@mediosystems.com> wrote:
> [quoted original message and signature trimmed; see Dan Yi's post at the
> top of the thread]
--
Harsh J