Posted to mapreduce-user@hadoop.apache.org by Dino Kečo <di...@gmail.com> on 2011/08/18 10:30:20 UTC

MultipleOutputs is not working properly when dfs.block.size is changed

Hi all,

I have been working on Hadoop jobs that write output into multiple
files. In the Hadoop API I found the MultipleOutputs class, which
implements this functionality.

My use case is to change the HDFS block size in one job to increase
parallelism, which I do via the dfs.block.size configuration property.
When I change this property, part of the output file is missing (the
last couple of lines; in some cases half of a line).

I did some debugging, and everything looks fine right before the call to
outputs.write("successful", KEY, VALUE);
For the output format I am using TextOutputFormat.

When I remove MultipleOutputs from my code, everything works as expected.

Is there something I am doing wrong, or is there an issue with
MultipleOutputs?

regards,
dino

Re: MultipleOutputs is not working properly when dfs.block.size is changed

Posted by Dino Kečo <di...@gmail.com>.
Hi Harsh,

I am using CDH3_U0 (0.20.2 hadoop version).

I can't share my code because of company rules, but these are the steps
I perform:
CASE1:
 - Use text input format to read content from file
 - Perform record transformation in mapper
 - Write output using text output format

While running this step I pass the -Ddfs.block.size parameter using
GenericOptionsParser.

In this case everything works as expected.
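I can't paste the real driver, but it is essentially the standard Tool/ToolRunner boilerplate (a simplified sketch with made-up class and path names; the real code differs):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TransformJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains every -D option parsed by
    // GenericOptionsParser, including -Ddfs.block.size, so the job's
    // output files are written with that block size.
    Job job = new Job(getConf(), "record-transform");
    job.setJarByClass(TransformJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setNumReduceTasks(0);  // map-only transformation
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner runs GenericOptionsParser over args before calling run()
    System.exit(ToolRunner.run(new Configuration(), new TransformJob(), args));
  }
}
```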

CASE2:
 - Use text input format to read content from file
 - Perform record transformation in mapper
 - If the transformation succeeds, write the output to the "successful"
file using multiple outputs
 - If the transformation fails, write the output to the "failed" file
using multiple outputs

In the mapper's setup method I create an instance of MultipleOutputs
(MultipleOutputs outputs = new MultipleOutputs(context)). In the map
method I call outputs.write("successful", K, V) or
outputs.write("failed", K, V) based on the result of the transformation
logic.
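Simplified, the mapper looks roughly like this (the class name matches the MyMapper from the flags below; the transform() helper is a placeholder for the real logic, which I can't share):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
  private MultipleOutputs<Text, Text> outputs;

  @Override
  protected void setup(Context context) {
    outputs = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      Text transformed = transform(value);  // placeholder transformation
      outputs.write("successful", new Text(key.toString()), transformed);
    } catch (Exception e) {
      outputs.write("failed", new Text(key.toString()), value);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // MultipleOutputs keeps its own record writers; close() flushes and
    // closes them. Per the Javadoc it must be called in cleanup(),
    // otherwise buffered trailing output can be lost.
    outputs.close();
  }

  private Text transform(Text value) {
    return value;  // stands in for the real record transformation
  }
}
```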

I configure the multiple outputs through GenericOptionsParser:

-Dmapreduce.inputformat.class=org.apache.hadoop.mapreduce.lib.input.TextInputFormat
-Dmapreduce.map.class=MyMapper
-Dmapreduce.multipleoutputs="successful failed"
-Dmapreduce.multipleoutputs.namedOutput.successful.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-Dmapreduce.multipleoutputs.namedOutput.successful.key=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.successful.value=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.failed.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-Dmapreduce.multipleoutputs.namedOutput.failed.key=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.failed.value=org.apache.hadoop.io.Text
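(These property names are the ones that MultipleOutputs.addNamedOutput writes under the hood, so the equivalent programmatic setup in a driver would be roughly the following sketch; the class name is made up:)

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NamedOutputsConfig {
  // Registers the same two named outputs that the -D flags above declare.
  public static void configure(Job job) {
    MultipleOutputs.addNamedOutput(job, "successful",
        TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "failed",
        TextOutputFormat.class, Text.class, Text.class);
  }
}
```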

While running this step I pass the -Ddfs.block.size parameter using
GenericOptionsParser. Depending on the block size, I lose data in the
output file. In some cases half of a line is missing, in some cases the
last couple of lines. One thing I have noticed is that the file size is
always equal to <integer>*<block_size>; there is never a partially
filled block.

Hope this helps.

thanks,
dino



On Thu, Aug 18, 2011 at 12:09 PM, Harsh J <ha...@cloudera.com> wrote:

> Dino,
>
> Need some more information:
> - Version of Hadoop?
> - Do you have a runnable sample test case to reproduce this? Or can
> you describe roughly the steps you are performing to create an output?
>
> FWIW, I ran the trunk's MO tests and those seem to pass for both APIs,
> but they do not change dfs.block.size, although I fail to see the
> relation between these.
>
> On Thu, Aug 18, 2011 at 2:00 PM, Dino Kečo <di...@gmail.com> wrote:
> > Hi all,
> > I have been working on hadoop jobs which are writing output into multiple
> > files. In Hadoop API I have found class MultipleOutputs which implement
> this
> > functionality.
> > My use case is to change hdfs block size in one job to increase
> parallelism
> > and I am doing that using dfs.block.size configuration property. Part of
> > output file is missing when I change this property (couple of last lines
> in
> > some cases half of line is missing).
> > I was doing debugging and everything looks fine before calling
> outputs.write
> > ("successful", KEY, VALUE);
> > For output format I am using TextOutputFormat.
> > When I remove MultipleOutputs from my code everything is working ok.
> > Is there something i am doing wrong or there is issue with multiple
> outputs
> > ?
> > regards,
> > dino
> >
>
>
>
> --
> Harsh J
>

Re: MultipleOutputs is not working properly when dfs.block.size is changed

Posted by Harsh J <ha...@cloudera.com>.
Dino,

Need some more information:
- Version of Hadoop?
- Do you have a runnable sample test case to reproduce this? Or can
you describe roughly the steps you are performing to create an output?

FWIW, I ran the trunk's MO tests and those seem to pass for both APIs,
but they do not change dfs.block.size, although I fail to see the
relation between these.

On Thu, Aug 18, 2011 at 2:00 PM, Dino Kečo <di...@gmail.com> wrote:
> Hi all,
> I have been working on hadoop jobs which are writing output into multiple
> files. In Hadoop API I have found class MultipleOutputs which implement this
> functionality.
> My use case is to change hdfs block size in one job to increase parallelism
> and I am doing that using dfs.block.size configuration property. Part of
> output file is missing when I change this property (couple of last lines in
> some cases half of line is missing).
> I was doing debugging and everything looks fine before calling outputs.write
> ("successful", KEY, VALUE);
> For output format I am using TextOutputFormat.
> When I remove MultipleOutputs from my code everything is working ok.
> Is there something i am doing wrong or there is issue with multiple outputs
> ?
> regards,
> dino
>



-- 
Harsh J