Posted to mapreduce-user@hadoop.apache.org by Dino Kečo <di...@gmail.com> on 2011/08/18 10:30:20 UTC
MultipleOutputs is not working properly when dfs.block.size is changed
Hi all,
I have been working on Hadoop jobs that write output into multiple
files. In the Hadoop API I found the class MultipleOutputs, which
implements this functionality.
My use case is to change the HDFS block size for one job to increase
parallelism, and I am doing that with the dfs.block.size configuration
property. When I change this property, part of the output file is
missing (the last couple of lines; in some cases half of a line is
missing).
I did some debugging and everything looks fine right up to the call to
outputs.write("successful", KEY, VALUE);
For the output format I am using TextOutputFormat.
When I remove MultipleOutputs from my code, everything works fine.
Is there something I am doing wrong, or is there an issue with
MultipleOutputs?
regards,
dino
Re: MultipleOutputs is not working properly when dfs.block.size is changed
Posted by Dino Kečo <di...@gmail.com>.
Hi Harsh,
I am using CDH3u0 (Hadoop 0.20.2).
I can't share my code because of company rules, but these are the steps
I perform:
CASE 1:
- Use TextInputFormat to read content from a file
- Perform a record transformation in the mapper
- Write the output using TextOutputFormat
While running this job I pass the -Ddfs.block.size parameter using
GenericOptionsParser.
In this case everything works as expected.
CASE 2:
- Use TextInputFormat to read content from a file
- Perform a record transformation in the mapper
- If the transformation succeeds, write the output to the "successful"
file using MultipleOutputs
- If the transformation fails, write the output to the "failed" file
using MultipleOutputs
In the mapper's setup method I create an instance of MultipleOutputs
(MultipleOutputs outputs = new MultipleOutputs(context)). In the map
method I call outputs.write("successful", K, V) or
outputs.write("failed", K, V) based on the result of the transformation
logic.
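In code the mapper looks roughly like this (class names changed, and
transform() is a placeholder for the real logic). Note the
MultipleOutputs javadoc asks for outputs.close() in cleanup(), since
MultipleOutputs manages its own RecordWriters:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (transform(value)) {
            outputs.write("successful", new Text(key.toString()), value);
        } else {
            outputs.write("failed", new Text(key.toString()), value);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // MultipleOutputs keeps its own RecordWriters; without this
        // close() the last buffered records are never flushed to HDFS.
        outputs.close();
    }

    private boolean transform(Text value) {
        // Stand-in for the real transformation logic.
        return true;
    }
}
```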
I configure the multiple outputs through GenericOptionsParser:
-Dmapreduce.inputformat.class=org.apache.hadoop.mapreduce.lib.input.TextInputFormat
-Dmapreduce.map.class=MyMapper
-Dmapreduce.multipleoutputs="successful failed"
-Dmapreduce.multipleoutputs.namedOutput.successful.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-Dmapreduce.multipleoutputs.namedOutput.successful.key=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.successful.value=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.failed.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-Dmapreduce.multipleoutputs.namedOutput.failed.key=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.failed.value=org.apache.hadoop.io.Text
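Put together, the submission looks roughly like this (jar name, driver
class, paths and the 32 MB block size are made up; the remaining
namedOutput.* properties from the list above go on the same command
line):

```shell
hadoop jar myjob.jar MyDriver \
  -Ddfs.block.size=33554432 \
  -Dmapreduce.map.class=MyMapper \
  -Dmapreduce.multipleoutputs="successful failed" \
  /input /output
```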
While running this job I pass the -Ddfs.block.size parameter using
GenericOptionsParser. Depending on the block size, I lose data in the
output file. In some cases half of a line is missing, in some cases the
last couple of lines. One more thing I have noticed: the output file
size is always exactly <integer> * <block_size>. There is never a
partially filled block.
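That pattern looks to me like unflushed buffered output: whatever has
been flushed to the stream survives, and whatever is still sitting in
an in-memory buffer when the writer is abandoned is lost. The same
effect can be seen in plain Java, no Hadoop involved (FlushDemo and the
8 KB buffer size are just for illustration):

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class FlushDemo {

    // Writes one line through a buffered writer and reports the
    // on-disk file length before and after close().
    static long[] lengths() throws IOException {
        File f = File.createTempFile("flushdemo", ".txt");
        f.deleteOnExit();
        BufferedWriter w = new BufferedWriter(new FileWriter(f), 8192);
        w.write("this line fits entirely in the 8 KB buffer\n");
        long before = f.length(); // still 0: bytes live in the buffer only
        w.close();                // flushes the buffer to disk
        long after = f.length();  // now > 0
        return new long[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        long[] l = lengths();
        System.out.println("before close: " + l[0]
                + ", after close: " + l[1]);
    }
}
```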
Hope this helps.
thanks,
dino
Re: MultipleOutputs is not working properly when dfs.block.size is changed
Posted by Harsh J <ha...@cloudera.com>.
Dino,
Need some more information:
- Version of Hadoop?
- Do you have a runnable sample test case to reproduce this? Or can
you describe roughly the steps you are performing to create an output?
FWIW, I ran the trunk's MO tests and those seem to pass for both APIs,
but they do not change dfs.block.size, although I fail to see the
relation between these.
--
Harsh J