Posted to common-user@hadoop.apache.org by Kevin Peterson <kp...@biz360.com> on 2009/05/29 01:25:22 UTC

MultipleOutputs or MultipleTextOutputFormat?

I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:

...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for
the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or
MultipleTextOutputFormat. I'm having trouble understanding how I set the
output path for these two classes. It seems like MultipleTextOutputFormat is
about partitioning data into different files within the same directory based
on the key, rather than into different directories as I need. Could I get the
behavior I want by specifying date/batch as my filename, setting the output
path to some temporary work directory, and then moving /work/* to /content?
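
To make the date/batch-as-filename idea concrete, here is a minimal sketch of
the path-building step only. BucketPaths and relativePath are hypothetical
names, not part of Hadoop. The assumption (untested here) is that a string like
this could be returned from an override of
MultipleTextOutputFormat.generateFileNameForKeyValue(key, value, name), with
the content month and batch date derived from the key, and that a returned
name containing "/" ends up as subdirectories under the job output path.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not a Hadoop class): builds a file name of the form
 * "<yyyyMM>/batch_<yyyyMMdd>/<leaf>", relative to the job output path.
 */
final class BucketPaths {
    private static final Pattern MONTH = Pattern.compile("\\d{6}"); // e.g. 200904
    private static final Pattern DAY = Pattern.compile("\\d{8}");   // e.g. 20090429

    private BucketPaths() {}

    /**
     * @param contentMonth month bucket the record belongs to, as yyyyMM
     * @param batchDate    date of the batch that produced the record, as yyyyMMdd
     * @param leafName     default leaf file name, e.g. "part-00000"
     */
    static String relativePath(String contentMonth, String batchDate, String leafName) {
        if (!MONTH.matcher(contentMonth).matches()) {
            throw new IllegalArgumentException("expected yyyyMM, got: " + contentMonth);
        }
        if (!DAY.matcher(batchDate).matches()) {
            throw new IllegalArgumentException("expected yyyyMMdd, got: " + batchDate);
        }
        // Month bucket first, then the batch directory, then the leaf file name.
        return contentMonth + "/batch_" + batchDate + "/" + leafName;
    }
}
```

With the output path set to /work, a record for the April bucket from the
20090501 batch would be named 200904/batch_20090501/part-00000, so moving
/work/* to /content would reproduce the layout shown above.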

MultipleOutputs seems to be more about outputting all the data in different
formats, but it's supposed to be simpler to use. From reading it, it seems to be
better documented and the API makes more sense (choosing the output
explicitly in the map or reduce, rather than hiding this decision in the
output format), but I don't see any way to set a file name. If I am using
TextOutputFormat, I see no way to put the outputs into different directories.