Posted to user@nutch.apache.org by Tamer Yousef <TY...@boardreader.com> on 2015/01/14 15:49:39 UTC

Parse map step executes on one node only

Hi All
I have a question regarding the parse step: the map part executes on one machine only (the primary namenode) and is NOT spread across the other machines in the cluster, while the reduce part is spread across the machines based on the value of "mapred.reduce.tasks", which is set to 9 in the crawl script.
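For reference, a rough sketch of how the crawl script wires this in (the variable names and segment path below are illustrative, not copied from the 1.9 script; check your own bin/crawl for the exact form):

```shell
# Illustrative only -- names are hypothetical; see your bin/crawl for specifics.
NUM_TASKS=9   # one reducer per node on a 9-machine cluster

# The script passes the property to each Nutch job, including parse:
bin/nutch parse -D mapred.reduce.tasks=$NUM_TASKS "$SEGMENT"
```

Note that the map side is not governed by this property; map parallelism follows from the input splits of the segment being parsed.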

Here is more info about what I have:
------------------------------------------------
Nutch 1.9 on Hadoop 1.2.1. No Solr parsing. CentOS 6.5. My cluster has 9 machines.
The size of the data I'm working on so far is:
# of URLs unfetched:   15,886,229
# of URLs fetched:      2,316,187

The most important step, the fetch, was correctly spread across the 9 machines in both the map part and the reduce part.
But when it comes to parsing, all processing goes to one machine during the map phase, and this causes the parse step to take over 5 hours, compared to 2.6 hours for fetching.

The number of map tasks completed on this single machine was 342, and the number of reduce tasks was 9 (the reduce phase was correctly spread across the 9 machines).
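As a side note (my own back-of-the-envelope, not from the thread): in Hadoop 1.x the map count is driven by the input splits of the segment, roughly one map per 64 MB HDFS block under default settings, so 342 maps would correspond to roughly 21 GB of parse input:

```shell
# Rough sanity check, assuming default 64 MB blocks and one map per block:
# 342 map tasks * 64 MB per split = total parse input in MB.
echo $(( 342 * 64 ))   # prints 21888 (MB), i.e. about 21 GB
```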

Most of my configs are the stock defaults; I have not altered the URL filters or anything else.

Any tips to speed this up are appreciated!
Thanks!
Tamer


RE: Parse map step executes on one node only

Posted by Tamer Yousef <TY...@boardreader.com>.
I think I figured it out: it is distributed after all. I had assumed the server listed under the Status column was the one handling the task; I had to drill into each task to see which machine was actually executing it.
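For anyone else tripped up by this: the JobTracker's Status column is not the executing host; you have to open the individual task attempt, whose detail page lists the machine. From the command line, Hadoop 1.x can list the attempt IDs to drill into (the job ID below is made up):

```shell
# Job ID is hypothetical -- substitute the real one from the JobTracker UI.
# Lists the attempt IDs of completed map tasks; each attempt's detail page
# in the JobTracker shows which tasktracker machine ran it.
hadoop job -list-attempt-ids job_201501140001_0042 map completed
```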


RE: Parse map step executes on one node only

Posted by Tamer Yousef <TY...@boardreader.com>.
Any feedback on this, guys? The last parse job for the next iteration had over 700 tasks on the same machine and eventually failed. How can I distribute the parse tasks across these machines?

thanks
