Posted to mapreduce-user@hadoop.apache.org by Alan Miller <so...@squareplanet.de> on 2010/05/12 23:07:22 UTC
tab-delimited output
Hi all,
How can I write tab-delimited output files from my reducer?
My reducer gets Text/Text key/vals like:
hostX_2010-05-01 varA=valA1,varB=valB1,varC=valC1
hostX_2010-05-01 varA=valA2,varB=valB2,varC=valC2
hostX_2010-05-01 varA=valA3,varB=valB3,varC=valC3
...
hostY_2010-05-01 varA=valA1,varB=valB1,varC=valC1
hostY_2010-05-01 varA=valA2,varB=valB2,varC=valC2
hostY_2010-05-01 varA=valA3,varB=valB3,varC=valC3
...
After my reducer calculates the daily averages of varA, varB and varC,
I want to write a tab-delimited file with lines like:
hostX varA-Avg varB-Avg varC-Avg ....
hostY varA-Avg varB-Avg varC-Avg ....
Thanks,
Alan
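The averaging step described in this message can be sketched in plain Java, without any Hadoop classes. This is only an illustration: the class name DailyAverages and the method average() are hypothetical, and it assumes every value in a "name=value" pair parses as a number, which the original post does not state.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): averages "name=value" pairs
// across all records that share one reducer key.
class DailyAverages {

    // Each record is assumed to look like "varA=1.0,varB=2.0,varC=3.0".
    static Map<String, Double> average(List<String> records) {
        Map<String, Double> sums = new LinkedHashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String record : records) {
            for (String pair : record.split(",")) {
                String[] kv = pair.split("=", 2);
                String name = kv[0].trim();
                double value = Double.parseDouble(kv[1].trim());
                sums.merge(name, value, Double::sum);
                counts.merge(name, 1, Integer::sum);
            }
        }
        // Divide each running sum by the number of samples seen for it.
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            averages.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
        }
        return averages;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "varA=1.0,varB=10.0,varC=100.0",
            "varA=2.0,varB=20.0,varC=200.0",
            "varA=3.0,varB=30.0,varC=300.0");
        System.out.println(average(records));
    }
}
```

In a real reducer the same loop would run over the Iterable of Text values passed to reduce(), one call per key.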
Re: tab-delimited output
Posted by Alan Miller <so...@squareplanet.de>.
Thanks Alex,
For question 2, I was able to implement a custom OutputFormat that
lets me write some header lines to a file and then write multiple
tab-delimited values per line, as I wanted.
I had to extend FileOutputFormat and implement my own
write(), close() and getRecordWriter().
Question 1 is still open for me, though: how do I separate reducer
outputs based on a substring of the reducer's key?
In my Driver class I now use
job.setOutputFormatClass(MyOutputFormat.class)
so I can't use MultipleOutputs to dissect the outputs.
Is there a way to make MyOutputFormat work like MultipleOutputs?
getRecordWriter() calls job.getConfiguration(), so could I do something
like this:
1. set a new filename in my reduce() via conf.set("fileprefix", "2010-05-01_day");
2. read the new filename in getRecordWriter() via conf.get("fileprefix");
Alan
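One possible alternative to passing the file prefix through the Configuration is to let the record writer derive the target file from each key, which avoids depending on whether a value set in reduce() is visible to getRecordWriter(). The sketch below shows only that routing idea in plain Java. DateRoutingWriter, contents() and fileCount() are hypothetical names, and in-memory StringWriters stand in for the per-date HDFS files a real RecordWriter would open.

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a custom RecordWriter. It keeps one output per
// date and writes the header lines the first time a date is seen.
class DateRoutingWriter {
    private final String header;
    private final Map<String, StringWriter> outputs = new TreeMap<>();

    DateRoutingWriter(String header) {
        this.header = header;
    }

    // key is assumed to be "host_YYYY-MM-DD"; tabbedValues is already tab-joined.
    void write(String key, String tabbedValues) {
        int i = key.lastIndexOf('_');
        String host = key.substring(0, i);
        String date = key.substring(i + 1);
        StringWriter out = outputs.get(date);
        if (out == null) {
            // First record for this date: open the output and emit the header.
            out = new StringWriter();
            out.write(header + "\n");
            outputs.put(date, out);
        }
        out.write(host + "\t" + tabbedValues + "\n");
    }

    // Text written so far for one date, or "" if that date was never seen.
    String contents(String date) {
        StringWriter out = outputs.get(date);
        return out == null ? "" : out.toString();
    }

    int fileCount() {
        return outputs.size();
    }

    public static void main(String[] args) {
        DateRoutingWriter w = new DateRoutingWriter("# host\tvarA-Avg\tvarB-Avg");
        w.write("hostX_2010-05-01", "1.0\t2.0");
        w.write("hostY_2010-05-01", "3.0\t4.0");
        w.write("hostX_2010-05-02", "5.0\t6.0");
        System.out.print(w.contents("2010-05-01"));
    }
}
```

A real implementation would also need to close every open stream in close() and give each file a name that is unique per reduce task.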
On 05/13/2010 12:29 AM, Alex Kozlov wrote:
> Hi Alan,
>
> Unless you run your job with a single reducer you will not be able to
> do this. Think scalable: you should always add '-r-NNNNN' to the end
> to allow for multiple reducers and you can use custom partitioner to
> make sure each host goes to a single reducer. MultipleOutputs can do
> the rest, meaning the 'YYYY-MM-DD' prefix. 2 looks like a simple
> aggregation job: the key should be the host name, and you need just to
> aggregate the values for each host x YYYY-MM-DD pair and write them
> into separate 'YYYY-MM-DD-r-NNNNN' files. You can also do secondary
> sort to make sure the YYYY-MM-DD values come in order: this way you do
> not need to aggregate them in memory. See Reducer.java
> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html>
> for details.
>
> Alex K
Re: tab-delimited output
Posted by Alex Kozlov <al...@cloudera.com>.
Hi Alan,
Unless you run your job with a single reducer, you will not be able to do
this. Think scalable: you should always add '-r-NNNNN' to the end of the file
name to allow for multiple reducers, and you can use a custom partitioner to
make sure each host goes to a single reducer. MultipleOutputs can do the rest,
meaning the 'YYYY-MM-DD' prefix. Question 2 looks like a simple aggregation
job: the key should be the host name, and you just need to aggregate the
values for each host x YYYY-MM-DD pair and write them into separate
'YYYY-MM-DD-r-NNNNN' files. You can also do a secondary sort to make sure the
YYYY-MM-DD values arrive in order; this way you do not need to aggregate them
in memory. See Reducer.java
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html>
for details.
Alex K
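The partitioning and file-naming advice above can be sketched in plain Java. HostRouting and its methods are hypothetical names, and the key layout "host_YYYY-MM-DD" is taken from the original question. In a real job the partition() logic would sit inside a Partitioner subclass's getPartition(key, value, numPartitions) method, and the hash expression follows the same pattern Hadoop's HashPartitioner uses.

```java
import java.util.Locale;

// Hypothetical helper: routes keys such as "hostX_2010-05-01" by host and
// names output files by date.
class HostRouting {

    static String host(String key) {
        return key.substring(0, key.lastIndexOf('_'));
    }

    static String date(String key) {
        return key.substring(key.lastIndexOf('_') + 1);
    }

    // The same host always maps to the same reducer index, whatever the date.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int partition(String key, int numReducers) {
        return (host(key).hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Builds a name in the 'YYYY-MM-DD-r-NNNNN' pattern mentioned above.
    static String outputName(String key, int reducerIndex) {
        return String.format(Locale.ROOT, "%s-r-%05d", date(key), reducerIndex);
    }

    public static void main(String[] args) {
        String key = "hostX_2010-05-01";
        int reducer = partition(key, 4);
        System.out.println(key + " goes to " + outputName(key, reducer));
    }
}
```

With more than one reducer, each date still ends up split across several '-r-NNNNN' files, one per reducer that saw a host for that date.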
On Wed, May 12, 2010 at 3:04 PM, Alan Miller <so...@squareplanet.de> wrote:
> Hi Alex,
>
> The tab isn't the issue (yet). I guess it's really 2 questions I have.
> Using the reducer inputs already mentioned.
>
> 1. How do I generate multiple output files named YYYY-MM-DD.txt
> 2. Each file should contain
> a. one line per host
> b. each line with host avg1 avg2 avg3 ....
>
> Alan
Re: tab-delimited output
Posted by Alan Miller <so...@squareplanet.de>.
Hi Alex,
The tab isn't the issue (yet). I guess I really have 2 questions,
using the reducer inputs already mentioned:
1. How do I generate multiple output files named YYYY-MM-DD.txt?
2. Each file should contain
   a. one line per host
   b. each line with host avg1 avg2 avg3 ....
Alan
On 05/12/2010 11:50 PM, Alex Kozlov wrote:
> Hi Alan,
>
> Is the problem that you want your 'value' vals to be tab separated?
> This is entirely under control of your reducer.
>
> Alex K
Re: tab-delimited output
Posted by Alex Kozlov <al...@cloudera.com>.
Hi Alan,
Is the problem that you want your 'value' vals to be tab-separated? This
is entirely under the control of your reducer.
Alex K
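A minimal sketch of what "under the control of your reducer" could mean in practice: the reducer builds the tab-joined line itself before emitting it. TabLine and format() are hypothetical names, and the two-decimal formatting is an arbitrary choice for the example. As far as I know, TextOutputFormat also puts a tab between key and value by default, so emitting the host as the key and the joined averages as the value should give the same layout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: joins a host name and its averages with tab characters.
class TabLine {

    static String format(String host, List<Double> averages) {
        StringBuilder sb = new StringBuilder(host);
        for (double avg : averages) {
            // Locale.ROOT keeps '.' as the decimal separator on every machine.
            sb.append('\t').append(String.format(Locale.ROOT, "%.2f", avg));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("hostX", Arrays.asList(1.5, 2.25, 3.0)));
    }
}
```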
On Wed, May 12, 2010 at 2:07 PM, Alan Miller <so...@squareplanet.de> wrote:
> Hi all,
>
> How can I write tab-delimited output files from my reducer?
>
> My reducer gets Text/Text key/vals like:
>
> hostX_2010-05-01 varA=valA1,varB=valB1,varC=valC1
> hostX_2010-05-01 varA=valA2,varB=valB2,varC=valC2
> hostX_2010-05-01 varA=valA3,varB=valB3,varC=valC3
> ...
> hostY_2010-05-01 varA=valA1,varB=valB1,varC=valC1
> hostY_2010-05-01 varA=valA2,varB=valB2,varC=valC2
> hostY_2010-05-01 varA=valA3,varB=valB3,varC=valC3
> ...
>
> After my reducer calcs the daily averages of varA,B,C
> I want to write a tab-delimited file with lines like:
>
> hostX varA-Avg varB-Avg varC-Avg ....
> hostY varA-Avg varB-Avg varC-Avg ....
>
>
> Thanks,
> Alan
>