Posted to common-user@hadoop.apache.org by Christian Schneider <cs...@gmail.com> on 2013/03/13 19:36:18 UTC

Write huge values in Reduce Phase. "Hacked Outputformat" vs. "Direct write to HDFS" vs. ???

Hi all,
I'm not sure which approach to use. Currently I have two. Could you have a
look and tell me which one is best?

*Problem:*
As "value" of the Reduce phase I get a List with *a lot* of values (large
then the heap size).
For a legacy system I need to create a file like this:

key1 value1, value2, value3, .... valueN
key2 value1, value2, value3, .... valueN

N > 1,000,000

During my research and from some other mails, I came up with these two solutions:


*Solutions:*
*a) "Hackekd Outputformat" *

As described in [0], one solution is to write a custom OutputFormat. The key is
emitted only with the first value, after that only "null":

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  boolean firstValue = true;
  while (values.hasNext()) {
    // emit the key only with the first value; afterwards the OutputFormat sees null
    output.collect(firstValue ? key : null, values.next()); // collect() may be called N times
    firstValue = false;
  }
}

With this the OutputFormat can recognize the "line change" and only then print a "\n".

The nice thing about this approach is that we follow the whole normal path: Map > ... >
Reduce > OutputFormat > HDFS.
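
To make the idea more concrete, here is a minimal sketch of how such an
OutputFormat could look (old mapred API to match the reducer above; the class
name and the exact separators are only placeholders I made up, they are not
from [0]):

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class LineContinuationOutputFormat extends FileOutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    final DataOutputStream out = fs.create(file, progress);

    return new RecordWriter<Text, Text>() {
      private boolean firstLine = true;

      @Override
      public void write(Text key, Text value) throws IOException {
        if (key != null) {
          // a non-null key marks the start of a new output line
          if (!firstLine) {
            out.writeBytes("\n");
          }
          out.write(key.getBytes(), 0, key.getLength());
          out.writeBytes("\t");
          firstLine = false;
        } else {
          // null key means: continue the current line
          out.writeBytes(", ");
        }
        out.write(value.getBytes(), 0, value.getLength());
      }

      @Override
      public void close(Reporter reporter) throws IOException {
        if (!firstLine) {
          out.writeBytes("\n");
        }
        out.close();
      }
    };
  }
}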


*b) "Write to HDFS in the reducer"*

As Harsh J mentioned here [1], it is possible to write to HDFS directly in the
reduce phase.
He also gave a link [2] to the Hadoop FAQ which says it is "possible" to
do that.

With this information I implemented this reducer:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.Random;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserToAppReducer extends Reducer<Text, Text, Text, Text> {
  private static final int BUFFER_SIZE = 5 * 1024 * 1024;

  private BufferedWriter br;

  @Override
  protected void setup(final Context context) throws IOException, InterruptedException {
    final FileSystem fs = FileSystem.get(context.getConfiguration());
    final Path outputPath = FileOutputFormat.getOutputPath(context);

    // one side file per reduce task attempt, suffixed with a random number
    final String fileName = "reducer" + context.getTaskAttemptID().getId()
        + "_" + context.getTaskAttemptID().getTaskID().getId()
        + "_" + new Random(System.currentTimeMillis()).nextInt(10000);

    this.br = new BufferedWriter(
        new OutputStreamWriter(fs.create(new Path(outputPath, fileName))), BUFFER_SIZE);
  }

  @Override
  protected void reduce(final Text appId, final Iterable<Text> userIds, final Context context)
      throws IOException, InterruptedException {
    // stream the values straight to HDFS instead of collecting them in memory
    this.br.append(appId.toString());
    this.br.append('\t');

    for (final Text text : userIds) {
      this.br.append(text.toString());
      this.br.append('\t');
    }

    this.br.append('\n');
  }

  @Override
  protected void cleanup(final Context context) throws IOException, InterruptedException {
    this.br.close();
  }
}
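
If I read the Hadoop FAQ [2] correctly, such side files should be created
under the task's work output path, so that failed or speculative attempts get
cleaned up by the output committer. In that case setup() could maybe look like
this instead (untested sketch, the file-name scheme is just an example):

  @Override
  protected void setup(final Context context) throws IOException, InterruptedException {
    // getWorkOutputPath() is from org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    // files created here are promoted to the real output dir only on task commit
    final Path workPath = FileOutputFormat.getWorkOutputPath(context);
    final FileSystem fs = workPath.getFileSystem(context.getConfiguration());

    // name the side file after the task id, so parallel reducers never collide
    final String fileName = "reducer-" + context.getTaskAttemptID().getTaskID().getId();

    this.br = new BufferedWriter(
        new OutputStreamWriter(fs.create(new Path(workPath, fileName))), BUFFER_SIZE);
  }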

*Question:*
Both ways run fine, but which approach should I take, and are there
alternative ways?

Thanks a lot.

Best Regards,
Christian.

[0] http://stackoverflow.com/questions/10140171/handling-large-output-values-from-reduce-step-in-hadoop
[1] http://mail-archives.apache.org/mod_mbox/hadoop-user/201303.mbox/%3CCAOcnVr1r-VcoSe-YFBpNe3qmqvXSUT7z3NfABA0FzbNS_MmVgQ%40mail.gmail.com%3E
[2] http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F


P.S.: Sorry for using html :)