Posted to common-user@hadoop.apache.org by Stuart White <st...@gmail.com> on 2008/12/14 03:32:36 UTC

Simple data transformations in Hadoop?

(I'm quite new to hadoop and map/reduce, so some of these questions
might not make complete sense.)

I want to perform simple data transforms on large datasets, and it
seems Hadoop is an appropriate tool.  As a simple example, let's say I
want to read every line of a text file, uppercase it, and write it
out.

First question: would Hadoop be an appropriate tool for something like this?

What is the best way to model this type of work in Hadoop?

I'm thinking my mappers will accept a Long key that represents the
byte offset into the input file, and a Text value that represents the
line in the file.

I *could* simply uppercase the text lines and write them to an output
file directly in the mapper (and not use any reducers).  So, there's a
question: is it considered bad practice to write output files directly
from mappers?

Assuming it's advisable in this example to write a file directly in
the mapper - how should the mapper create a unique output partition
file name?  Is there a way for a mapper to know its index in the total
# of mappers?

Assuming it's inadvisable to write a file directly in the mapper - I
can output the records to the reducers using the same key and using
the uppercased data as the value.  Then, in my reducer, should I write
a file?  Or should I collect() the records in the reducers and let
hadoop write the output?

If I let hadoop write the output, is there a way to prevent hadoop
from writing the key to the output file?  I may want to perform
several transformations, one-after-another, on a set of data, and I
don't want to place a superfluous key at the front of every record for
each pass of the data.

I appreciate any feedback anyone has to offer.

Re: Simple data transformations in Hadoop?

Posted by Delip Rao <de...@gmail.com>.
On Sat, Dec 13, 2008 at 9:32 PM, Stuart White <st...@gmail.com> wrote:
> (I'm quite new to hadoop and map/reduce, so some of these questions
> might not make complete sense.)
>
> I want to perform simple data transforms on large datasets, and it
> seems Hadoop is an appropriate tool.  As a simple example, let's say I
> want to read every line of a text file, uppercase it, and write it
> out.
>
> First question: would Hadoop be an appropriate tool for something like this?

Yes. Very appropriate.

>
> What is the best way to model this type of work in Hadoop?

Start with Hadoop's WordCount example in the tutorial and modify it to fit
your requirements.

>
> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.
>
> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers).  So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?

Technically, you could do this by opening a file writer in configure(),
doing the writes in map(), and closing the writer in close() (see the
sketch below). But to me this seems contorted when the Hadoop framework
already gives you something straightforward.

>
> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name?
> Is there a way for a mapper to know its index in the total
> # of mappers?

Use mapred.task.id to create a unique file name per mapper.
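
Putting those two pieces together, here is an untested sketch against the
old org.apache.hadoop.mapred API (the "upper-" file name prefix and the use
of FileOutputFormat.getWorkOutputPath() for the side-file directory are just
one way to do it, not the only one):

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch: a mapper that writes its own output file instead of using
  // collect().  The file is opened in configure() and closed in close().
  public class DirectWriteMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private FSDataOutputStream out;

    public void configure(JobConf job) {
      try {
        // mapred.task.id is unique per task attempt, so it is safe to
        // use in the file name.
        String taskId = job.get("mapred.task.id");
        Path dir = FileOutputFormat.getWorkOutputPath(job);
        FileSystem fs = dir.getFileSystem(job);
        out = fs.create(new Path(dir, "upper-" + taskId));
      } catch (IOException e) {
        throw new RuntimeException("could not open side file", e);
      }
    }

    public void map(LongWritable offset, Text line,
        OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
        throws IOException {
      out.write(line.toString().toUpperCase().getBytes("UTF-8"));
      out.write('\n');
    }

    public void close() throws IOException {
      out.close();  // flush and close the side file when the task finishes
    }
  }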

>
> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value.  Then, in my reducer, should I write
> a file?  Or should I collect() the records in the reducers and let
> hadoop write the output?
>
> If I let hadoop write the output, is there a way to prevent hadoop
> from writing the key to the output file?  I may want to perform
> several transformations, one-after-another, on a set of data, and I
> don't want to place a superfluous key at the front of every record for
> each pass of the data.
>

Just use collect() and TextOutputFormat. The key your mapper receives is,
as you correctly noted, the byte offset of the line in the input file, but
you don't have to carry it into the output: collect() a null key (or a
NullWritable) and TextOutputFormat will write only the value, so no
superfluous key gets prefixed to your records. Another way to think of it
is that, with TextOutputFormat, the 'value's you collect() are simply
appended, one per line, to the reduce output.
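
If you do go through a reduce phase, a pass-through reducer that drops the
key might look roughly like this (untested sketch, same old API assumed;
the driver would also need setOutputKeyClass(NullWritable.class) and
setOutputValueClass(Text.class)):

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch: pass the (already uppercased) values through and emit a
  // NullWritable key so TextOutputFormat writes only the value.
  public class PassThroughReducer extends MapReduceBase
      implements Reducer<LongWritable, Text, NullWritable, Text> {

    public void reduce(LongWritable offset, Iterator<Text> values,
        OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        output.collect(NullWritable.get(), values.next());
      }
    }
  }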

> I appreciate any feedback anyone has to offer.
>

Re: Simple data transformations in Hadoop?

Posted by Owen O'Malley <om...@apache.org>.
On Dec 13, 2008, at 6:32 PM, Stuart White wrote:

> First question: would Hadoop be an appropriate tool for something  
> like this?

Very

> What is the best way to model this type of work in Hadoop?

As a map-only job with number of reduces = 0.

> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.

Sure, just use TextInputFormat. You'll want to set the minimum split size
(mapred.min.split.size) to a large number so that you get exactly one map
per input file.
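
In the driver that is just (sketch; any value larger than your biggest
input file works in place of Long.MAX_VALUE, and the wrapper class is only
there so the snippet stands alone):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class SplitSettings {
    // Force one map per input file by making the minimum split size
    // larger than any file we expect to see.
    static void configureSplits(JobConf conf) {
      conf.setInputFormat(TextInputFormat.class);
      conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
    }
  }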

> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers).  So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?

You could do it directly, but I would suggest that using the  
TextOutputFormat is easier.

Your map should just do:
   collect(null, upperCaseLine);

Assuming the number of reduces is 0, the map output skips the sort and goes
straight through the OutputCollector to the output format.
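
Spelled out, the whole mapper is roughly (untested sketch against the old
org.apache.hadoop.mapred API, using NullWritable.get() rather than a bare
null for the key):

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch: uppercase each input line and emit it with no key.
  public class UpperCaseMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text upper = new Text();

    public void map(LongWritable offset, Text line,
        OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      upper.set(line.toString().toUpperCase());
      output.collect(NullWritable.get(), upper);
    }
  }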

> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name?  Is there a way for a mapper to know its index in the total
> # of mappers?

Get mapred.task.partition from the configuration.
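
For example, inside the mapper's configure() (sketch):

  public void configure(JobConf job) {
    // 0-based index of this map task within the job.
    int myIndex = job.getInt("mapred.task.partition", -1);
    // ... e.g. use it to build a unique file name such as "part-" + myIndex
  }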

> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value.  Then, in my reducer, should I write
> a file?  Or should I collect() the records in the reducers and let
> hadoop write the output?

See above, but note that with no reduces the data is not sorted. If you pass
a null or NullWritable key to TextOutputFormat, it will write only the value
(no key and no tab separator).
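
A driver for the map-only version might look roughly like this (untested
sketch; it reuses the hypothetical UpperCaseMapper from the sketch above,
and the input/output paths come from the command line):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.TextOutputFormat;

  // Sketch of a driver for the map-only uppercase job.
  public class UpperCaseJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(UpperCaseJob.class);
      conf.setJobName("uppercase");

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
      conf.setMapperClass(UpperCaseMapper.class);
      conf.setNumReduceTasks(0);                   // map-only: no shuffle, no sort

      conf.setOutputKeyClass(NullWritable.class);  // TextOutputFormat drops the key
      conf.setOutputValueClass(Text.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
  }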

-- Owen

Re: issue map/reduce job to linux hadoop cluster from MS Windows, Eclipse

Posted by Aaron Kimball <aa...@cloudera.com>.
Songting,

If you set mapred.job.tracker to "jobtrackeraddr:9001" and fs.default.name to
"hdfs://hdfsservername:9000/" in the conf, it will connect to the remote
cluster and run the job there. The trick is that you'll need to submit the job
from a jar file, not from the loose .class files that Eclipse generates by
default.
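
In code, that looks roughly like this (sketch; the hostnames, ports and jar
path are placeholders for your own values):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  // Sketch: point the client at a remote JobTracker and NameNode,
  // and tell it which jar to ship to the cluster.
  public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      conf.set("mapred.job.tracker", "jobtrackeraddr:9001");
      conf.set("fs.default.name", "hdfs://hdfsservername:9000/");
      conf.setJar("myjob.jar");  // jar containing your mapper/reducer classes
      // ... set the input/output formats, paths, mapper class, etc. ...
      JobClient.runJob(conf);
    }
  }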
- Aaron

On Sat, Dec 13, 2008 at 7:07 PM, Songting Chen <ke...@yahoo.com> wrote:

> Is it possible to do that?
>
> I can access files at HDFS by specifying the URI below.
> FileSystem fileSys = FileSystem.get(new URI("hdfs://server:9000"), conf);
>
> But I don't know how to do that for JobConf.
>
> Thanks,
> -Songting
>

issue map/reduce job to linux hadoop cluster from MS Windows, Eclipse

Posted by Songting Chen <ke...@yahoo.com>.
Is it possible to do that?

I can access files at HDFS by specifying the URI below.
FileSystem fileSys = FileSystem.get(new URI("hdfs://server:9000"), conf); 

But I don't know how to do that for JobConf.

Thanks,
-Songting