Posted to common-user@hadoop.apache.org by Jeroen Verhagen <je...@gmail.com> on 2007/06/27 15:16:18 UTC

Some basic questions from a noob

Hi all,

I have some simple questions that I would like answered to get a
better understanding of what Hadoop/MapReduce is.

I noticed in the code of the WordCount example:

   conf.setInputPath(new Path((String) other_args.get(0)));
   conf.setOutputPath(new Path((String) other_args.get(1)));

Does working with Hadoop always involve taking a set of files in one
directory as input and producing a set of files in another directory
as output? Are the names of the files in the input and output
directories insignificant?

How do you handle the end result of a set of MapReduce tasks? If the
result is a set of files, do you have to use another MapReduce task
that writes not to a file (in the DFS, for example) but to a simple
String, so you can display something on a web page? Or do you have to
read the resulting files directly?

If my gigantic set of input files keeps growing, do I have to
re-MapReduce the whole input set to get a single result set? Or can I
just MapReduce the incremental part and use another MapReduce task to
combine x result sets into a single result?
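
For example (I'm just guessing at the API here; the Merge class and
the paths below are made up), could a combining job simply take the
old and new result directories as input, something like this:

   JobConf merge = new JobConf(Merge.class);     // hypothetical job class
   merge.addInputPath(new Path("/results/old")); // previous result set
   merge.addInputPath(new Path("/results/new")); // incremental run's results
   merge.setOutputPath(new Path("/results/combined"));
   JobClient.runJob(merge);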

Thanks for any help!

-- 

regards,

Jeroen

Re: Some basic questions from a noob

Posted by "Peter W." <pe...@marketingbrokers.com>.
Jeroen,

I'm also a noob but making slight progress.

JobConf will always send MapReduce output to a specified Path, but I
think if you setOutputValueClass(Text.class) it's possible to later
change the destination from a file to a stream?

Or, run a separate job with only one reduce, which should write the
simplified output to a single file that you can then open as a stream.
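
Something like this might work (untested; the paths are made up, and
I'm assuming the mapper, reducer and formats are configured elsewhere):

   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapred.JobClient;
   import org.apache.hadoop.mapred.JobConf;

   JobConf conf = new JobConf(WordCount.class);
   conf.setNumReduceTasks(1);          // one reduce => one output file
   conf.setInputPath(new Path("/wordcount/in"));
   conf.setOutputPath(new Path("/wordcount/out"));
   JobClient.runJob(conf);             // blocks until the job completes

   // the single reduce writes its output to part-00000
   FileSystem fs = FileSystem.get(conf);
   FSDataInputStream in =
       fs.open(new Path("/wordcount/out/part-00000"));
   BufferedReader reader =
       new BufferedReader(new InputStreamReader(in));
   String line;
   while ((line = reader.readLine()) != null) {
       System.out.println(line);       // or build a String for a web page
   }
   reader.close();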

Recent threads mention that there is no built-in chaining of jobs, so
shell scripting is another way to combine file results.
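
If you'd rather stay in Java than shell script, JobClient.runJob()
blocks until the job finishes, so you can also run jobs back to back
in one driver (FirstPass, SecondPass and the paths below are made up;
same imports as above):

   JobConf first = new JobConf(FirstPass.class);
   first.setInputPath(new Path("/data/in"));
   first.setOutputPath(new Path("/data/tmp"));
   JobClient.runJob(first);            // returns when the first job is done

   JobConf second = new JobConf(SecondPass.class);
   second.setInputPath(new Path("/data/tmp")); // the first job's output
   second.setOutputPath(new Path("/data/out"));
   JobClient.runJob(second);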

Hope that helps,

Peter W.


On Jun 27, 2007, at 6:16 AM, Jeroen Verhagen wrote:

> Does working with Hadoop always involve taking a set of files in one
> directory as input and producing a set of files in another directory
> as output? Are the names of the files in the input and output
> directories insignificant?
>
> How do you handle the end result of a set of MapReduce tasks? If the
> result is a set of files, do you have to use another MapReduce task
> that writes not to a file (in the DFS, for example) but to a simple
> String, so you can display something on a web page? Or do you have to
> read the resulting files directly?