Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2005/12/12 23:50:45 UTC
[jira] Created: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
mapreduce segment generator generates 50% fewer URLs than expected
--------------------------------------------------------------------
Key: NUTCH-136
URL: http://issues.apache.org/jira/browse/NUTCH-136
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
We noticed that segments generated with the MapReduce segment generator contain only 50% of the expected URLs. We had a crawldb with 40,000 URLs, but the generate command created a segment of only 20,000 pages. The same happened with the topN parameter: we always got around 50% of the expected URLs.
I tested PartitionUrlByHost and it appears to do its job. However, we fixed the problem by changing two things:
First, we set the partitioner to a plain HashPartitioner.
Second, we changed Generator.java line 48 from:
limit = job.getLong("crawl.topN", Long.MAX_VALUE) / job.getNumReduceTasks();
to:
limit = job.getLong("crawl.topN", Long.MAX_VALUE);
Now it works as expected.
Does anyone have an idea what the real source of this problem might be?
In general, this bug causes all MapReduce users to fetch only 50% of their URLs per iteration.
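To make the suspicion concrete, here is a hypothetical sketch (not the actual Nutch code; the class and method names are illustrative only) of how dividing topN by the number of reduce tasks interacts with uneven partitions: each reducer caps its output at topN / numReduceTasks, so when partitions are skewed (as per-host partitioning can produce), some reducers run dry while others hit their cap, and the total falls short of topN.

```java
public class GeneratorLimitSketch {
    // Illustrative model of the per-reducer limit: each reducer emits at
    // most topN / numPartitions records, so skewed partitions leave some
    // of the topN budget unused.
    static long totalGenerated(long topN, long[] partitionSizes) {
        long perReducerLimit = topN / partitionSizes.length;
        long total = 0;
        for (long size : partitionSizes) {
            total += Math.min(size, perReducerLimit);
        }
        return total;
    }

    public static void main(String[] args) {
        // Balanced partitions: the full topN is reached.
        System.out.println(totalGenerated(40000, new long[]{30000, 30000})); // 40000
        // Skewed partitions: one reducer hits its cap, the other runs dry.
        System.out.println(totalGenerated(40000, new long[]{55000, 5000}));  // 25000
    }
}
```

Removing the division (as in the workaround above) makes each reducer willing to emit up to the full topN, which hides the skew at the cost of possibly overshooting topN.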
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ]
Doug Cutting commented on NUTCH-136:
------------------------------------
The mapred-default.xml file is actually the best place to set these.
[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Florent Gluck (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363886 ]
Florent Gluck commented on NUTCH-136:
-------------------------------------
On my setup of 5 boxes (4 slaves, 1 master), I can confirm that Dominik Friedrich's suggestion fixes the missing URLs I had been encountering for a while.
I simply moved the following properties from nutch-site.xml to mapred-default.xml:
<property>
<name>mapred.map.tasks</name>
<value>100</value>
<description>The default number of map tasks per job. Typically set
to a prime several times greater than number of available hosts.
Ignored when mapred.job.tracker is "local".
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>40</value>
<description>The default number of reduce tasks per job. Typically set
to a prime close to the number of available hosts. Ignored when
mapred.job.tracker is "local".
</description>
</property>
After injecting 100,000 URLs and doing a single-pass crawl, I grepped the logs on my 4 slaves and confirmed that the fetch attempts add up to exactly 100,000. Therefore, there is no need to modify Generator.java.
I also ran some tests with protocol-http and protocol-httpclient and verified that they give similar results: no missing URLs in either case.
--Florent
[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Dominik Friedrich (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363198 ]
Dominik Friedrich commented on NUTCH-136:
-----------------------------------------
I think the correct solution would be to move all mapred settings from nutch-site.xml into mapred-default.xml, which is read before the job.xml files.
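The reasoning rests on resource load order: a resource loaded after job.xml (nutch-site.xml) silently overrides the job's settings, while one loaded before it (mapred-default.xml) is overridden by them. Here is a simplified, hypothetical model of that last-load-wins behavior; it is not the actual Hadoop Configuration class, just an illustration of the precedence rule.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigPrecedenceSketch {
    // Simplified model of last-load-wins config resolution: resources are
    // applied in load order, so a later resource overrides an earlier one.
    // Each resource is a {key, value} pair for brevity.
    public static String effectiveValue(String[][] resourcesInLoadOrder, String key) {
        Map<String, String> conf = new HashMap<>();
        for (String[] pair : resourcesInLoadOrder) {
            if (pair[0].equals(key)) {
                conf.put(pair[0], pair[1]); // later load wins
            }
        }
        return conf.get(key);
    }

    public static void main(String[] args) {
        // mapred-default.xml loads before job.xml, so the job's value survives.
        System.out.println(effectiveValue(new String[][]{
                {"mapred.reduce.tasks", "40"},  // mapred-default.xml
                {"mapred.reduce.tasks", "12"}}, // job.xml
                "mapred.reduce.tasks"));
    }
}
```

With the settings in nutch-site.xml instead, a third entry loaded after job.xml would win, which is exactly the override that causes the mismatch described below.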
[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Mike Smith (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363587 ]
Mike Smith commented on NUTCH-136:
----------------------------------
I have had the same problem. Florent suggested using protocol-http instead of protocol-httpclient; this fixed the problem on a single machine, but I still see the same problem with multiple data nodes using NDFS. Commenting out line 211 didn't help. Here are my results:

Injected URLs: 80,000
Only one machine is a datanode: 70,000 fetched pages
map tasks: 3, reduce tasks: 3, threads: 250

Injected URLs: 80,000
3 machines are datanodes; all machines participated in the fetching, judging by the task tracker logs on the three machines: 20,000 fetched pages
map tasks: 12, reduce tasks: 6, threads: 250

Injected URLs: 5,000
3 machines are datanodes; all machines participated in the fetching: 1,200 fetched pages
map tasks: 12, reduce tasks: 6, threads: 250

Injected URLs: 1,000
3 machines are datanodes; all machines participated in the fetching: 240 fetched pages

Injected URLs: 1,000
Only one machine is a datanode: 800 fetched pages
map tasks: 3, reduce tasks: 3, threads: 250

Thanks, Mike
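A quick arithmetic check of the figures above (an editorial aside, not part of the original report): the multi-datanode runs lose far more than the 50% in the issue title, clustering around 24-25% fetched, while the single-datanode runs reach 80-87%.

```java
public class FetchRatioCheck {
    // Fraction of injected URLs that were actually fetched in each run.
    static double ratio(int fetched, int injected) {
        return (double) fetched / injected;
    }

    public static void main(String[] args) {
        System.out.println(ratio(70000, 80000)); // single datanode: 0.875
        System.out.println(ratio(20000, 80000)); // 3 datanodes: 0.25
        System.out.println(ratio(1200, 5000));   // 3 datanodes: 0.24
        System.out.println(ratio(240, 1000));    // 3 datanodes: 0.24
        System.out.println(ratio(800, 1000));    // single datanode: 0.8
    }
}
```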
[jira] Updated: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=all ]
Doug Cutting updated NUTCH-136:
-------------------------------
Comment: was deleted
[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Dominik Friedrich (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363194 ]
Dominik Friedrich commented on NUTCH-136:
-----------------------------------------
It took me some hours, but I finally solved the mystery. The problem is this line:
177 numLists = job.getNumMapTasks(); // a partition per fetch task
in combination with this one:
211 job.setNumReduceTasks(numLists);
and the fact that nutch-site.xml overrides job.xml settings.
In my case, the box with the jobtracker, where I start the job, has map.tasks=12 and reduce.tasks=4 defined in nutch-site.xml. The other three boxes have neither setting in their nutch-site.xml. When the second job of the generator tool starts, the jobtracker creates only 4 reduce tasks, because reduce.tasks=4 in nutch-site.xml overrides the job.xml on that box. But the map tasks on the other 3 boxes read 12 reduce tasks from the job.xml, so they each create 12 partitions. When the 4 reduce tasks run, they read only partitions 0-3 on those 3 boxes, so 3 × 8 partitions get lost.
I solved this problem by removing line 211.
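The loss mechanism can be sketched with a toy simulation (hypothetical names; a key modulo stands in for the real hash partitioner): map tasks split their output into one partition per expected reducer, but if fewer reducers actually run, the higher-numbered partitions are simply never read.

```java
public class PartitionMismatchSketch {
    // Toy model of the mismatch: mappers partition assuming
    // mapSidePartitions reducers, but only actualReducers run, so records
    // in partitions >= actualReducers are never collected.
    static double fractionRead(int mapSidePartitions, int actualReducers, int records) {
        int read = 0;
        for (int key = 0; key < records; key++) {
            int partition = key % mapSidePartitions; // stand-in for the hash partitioner
            if (partition < actualReducers) {
                read++;
            }
        }
        return (double) read / records;
    }

    public static void main(String[] args) {
        // 12 map-side partitions but only 4 reducers: only 4/12 of the data survives.
        System.out.println(fractionRead(12, 4, 120000)); // ≈ 0.333
        // Matching counts: nothing is lost.
        System.out.println(fractionRead(4, 4, 120000));  // 1.0
    }
}
```

This matches the observed behavior: the loss ratio depends on the gap between the configured and actual reduce task counts, not on a fixed 50%.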
[jira] Closed: (NUTCH-136) mapreduce segment generator generates 50% fewer URLs than expected
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-136?page=all ]
Andrzej Bialecki closed NUTCH-136:
-----------------------------------
Resolution: Duplicate
Thank you for investigating this. I'm closing this issue; further discussion should continue in NUTCH-186.