Posted to common-user@hadoop.apache.org by David Batista <ds...@xldb.di.fc.ul.pt> on 2009/05/28 18:41:30 UTC

Reduce() time takes ~4x Map()

Hi everyone,

I'm processing XML files of around 500MB each, containing several
documents. For the map() function I pass in one document from the XML
file, which takes some time to process depending on its size - I'm
applying NER (named-entity recognition) to the texts.

Each document has a unique identifier, so I use that identifier as the
key and the result of parsing the document, as a single string, as the
value.

So at the end of the map() function:
output.collect(new Text(identifier), new Text(outputString));

Usually the outputString is around 1k-5k in size.
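For context, a minimal sketch of the map() method (the input types and
the two helper methods are simplifications, not the real code):

public void map(LongWritable offset, Text document,
                OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // extractIdentifier() and applyNER() stand in for the real parsing/NER code.
    String identifier = extractIdentifier(document.toString());
    String outputString = applyNER(document.toString());
    output.collect(new Text(identifier), new Text(outputString));
}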

reduce():

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // Identity reduce: keys are unique, so this just forwards each value.
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}

I ran a test using only 1 machine with 8 cores and only 1 XML file: it
took around 3 hours to process all the maps and ~12 hours for the
reduces!

The XML file has 139,945 documents.

I set the JobConf to 1000 map tasks and 200 reduce tasks.
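Roughly like this (a sketch; note that setNumMapTasks() is only a hint
to the framework, while setNumReduceTasks() is exact):

JobConf conf = new JobConf(MyJob.class);  // MyJob stands in for my job class
conf.setNumMapTasks(1000);    // mapred.map.tasks (a hint, not a hard limit)
conf.setNumReduceTasks(200);  // mapred.reduce.tasks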

I took a look at the graphs on the web interface during the reduce
phase, and indeed it's the copy phase that takes most of the time; the
sort and reduce phases finish almost instantly.

Why does the copy phase take so long? I understand that the copies are
made over HTTP, and the data is in really small chunks of 1k-5k, but
even so, with everything on the same physical machine, shouldn't it
have been faster?

Any suggestions on what might be causing the copies in reduce to take so long?
--
./david

Re: Reduce() time takes ~4x Map()

Posted by David Batista <ds...@gmail.com>.
Hi Jason,

Yes, my keys are unique, that I'm certain of.

Actually, I don't need them sorted, but it would save me some extra
work if the output came out under the same file name as the input. For
instance:

having as input:

big_xml_file1.xml
big_xml_file2.xml
...
big_xml_file40.xml

each containing several documents, it would be good to have the output
data grouped according to the file each document came from:

output_xml_file1.xml
output_xml_file2.xml
...
output_xml_file40.xml


I ran a test just now, while writing this email, and it seems it's
working as expected. What I did was use this as the output format:

conf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

which I found here:
http://www.mail-archive.com/core-user@hadoop.apache.org/msg05707.html

The output data is stored in a file with the same name as the input
file it came from, and I've set the number of reduces to 0 (a
suggestion from Miles Osborne), so no sorting is done.
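For reference, a minimal sketch of such an output format (the exact key
layout is an assumption on my part; the linked thread has the original):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file named after its source file,
// assuming the map emits keys of the form "big_xml_file1.xml/doc-id".
public class KeyBasedMultipleTextOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString().split("/", 2)[0];
    }
}

and in the job setup:

conf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);
conf.setNumReduceTasks(0);  // map-only job: no sort, no shuffle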

It seems everything is working perfectly now!

Thanks for all the feedback

--
./david



2009/5/29 jason hadoop <ja...@gmail.com>:
> At the minimal level, enable map output compression
> (mapred.compress.map.output); it may make some difference.
> Sorting is very expensive when there are many keys and the values are large.
> Are you quite certain your keys are unique?
> Also, do you need them sorted by document id?

Re: Reduce() time takes ~4x Map()

Posted by jason hadoop <ja...@gmail.com>.
At the minimal level, enable map output compression
(mapred.compress.map.output); it may make some difference.
Sorting is very expensive when there are many keys and the values are large.
Are you quite certain your keys are unique?
Also, do you need them sorted by document id?
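Programmatically, via JobConf, that would look something like this (the
codec choice is just an example):

conf.setCompressMapOutput(true);  // mapred.compress.map.output=true
// Optionally pick the codec (mapred.map.output.compression.codec):
conf.setMapOutputCompressorClass(org.apache.hadoop.io.compress.GzipCodec.class);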


On Thu, May 28, 2009 at 8:51 PM, Jothi Padmanabhan <jo...@yahoo-inc.com> wrote:

> Hi David,
>
> If you go to the JobTracker history, click on this job, and then do
> "Analyse This Job", you should be able to get the split-up timings for
> the individual phases of the map and reduce tasks, including the
> average, best, and worst times. Could you provide those numbers so that
> we can get a better idea of how the job progressed?
>
> Jothi


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com, a community for Hadoop Professionals

Re: Reduce() time takes ~4x Map()

Posted by David Batista <ds...@gmail.com>.
Hi Jothi, thanks for your answer.

I have:

Average time taken by Map tasks: 37sec

Average time taken by Shuffle: 35mins, 12sec

Average time taken by Reduce tasks: 1sec

--
./david



2009/5/29 Jothi Padmanabhan <jo...@yahoo-inc.com>:
> Hi David,
>
> If you go to the JobTracker history, click on this job, and then do
> "Analyse This Job", you should be able to get the split-up timings for
> the individual phases of the map and reduce tasks, including the
> average, best, and worst times. Could you provide those numbers so that
> we can get a better idea of how the job progressed?
>
> Jothi

Re: Reduce() time takes ~4x Map()

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
Hi David,

If you go to the JobTracker history, click on this job, and then do
"Analyse This Job", you should be able to get the split-up timings for
the individual phases of the map and reduce tasks, including the
average, best, and worst times. Could you provide those numbers so that
we can get a better idea of how the job progressed?

Jothi

