Posted to dev@nutch.apache.org by "Shubham Gupta (JIRA)" <ji...@apache.org> on 2017/05/03 12:18:04 UTC
[jira] [Created] (NUTCH-2384) nutch 2.3.1 unable to fetch all
documents with hadoop 2.7.1
Shubham Gupta created NUTCH-2384:
------------------------------------
Summary: nutch 2.3.1 unable to fetch all documents with hadoop 2.7.1
Key: NUTCH-2384
URL: https://issues.apache.org/jira/browse/NUTCH-2384
Project: Nutch
Issue Type: Test
Components: nutchNewbie
Affects Versions: 2.3.1
Environment: nutch 2.3.1 + hadoop 2.7.1 + mongodb
Reporter: Shubham Gupta
Fix For: 2.4
Hey,
I am testing the Nutch crawler in a local environment as well as on a Hadoop cluster.
In the local environment, I run the following command:
bin/nutch fetch -all -crawlId <table-name>
It fetches all the URLs present in the queue, and I have been able to crawl over 100,000 URLs (from 5000 seed URLs).
However, when I run the same project on the Hadoop cluster, I cannot reach even the 100,000 mark. It has fetched only 45,000 URLs (from 1100 seed URLs).
When tested with 5000 seed URLs, it again fetched only about the same amount.
The plugins used in Nutch are as follows:
protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic
The settings I am using with the hadoop cluster are as follows:
MAPRED-SITE.XML:
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1800m</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx712m</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>master:50030</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx800m</value>
</property>
YARN-SITE.XML:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>minimum memory allocated to containers.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>5120</value>
<description>maximum memory allocated to containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>12288</value>
<description>max memory allocated to nodemanager.</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>100</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
The system has 6 GB of RAM, and the available network bandwidth is 4 Mb/sec.
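
For reference, here is a rough sketch of what the memory settings above imply for the number of concurrent map containers. It is only back-of-the-envelope arithmetic, not a diagnosis. It assumes a single NodeManager, that the 6 GB figure is that node's physical RAM, and that the scheduler rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb. The helper names normalized and max_containers are made up for this illustration.

```python
# Values copied from the mapred-site.xml and yarn-site.xml settings in this report.
NODEMANAGER_MB = 12288       # yarn.nodemanager.resource.memory-mb
PHYSICAL_RAM_MB = 6 * 1024   # "6 GB" of RAM (assumed here to be per node)
MIN_ALLOC_MB = 1024          # yarn.scheduler.minimum-allocation-mb
MAP_MB = 1024                # mapreduce.map.memory.mb
AM_MB = 1024                 # yarn.app.mapreduce.am.resource.mb


def normalized(request_mb, min_alloc_mb=MIN_ALLOC_MB):
    """Round a request up to a multiple of the minimum allocation.

    Assumption: this mirrors how the scheduler normalizes container sizes.
    """
    return -(-request_mb // min_alloc_mb) * min_alloc_mb


def max_containers(budget_mb, container_mb):
    """How many containers of the given size fit in the memory budget."""
    return budget_mb // normalized(container_mb)


# Reserve one container for the MapReduce ApplicationMaster first.
yarn_budget = NODEMANAGER_MB - normalized(AM_MB)
ram_budget = PHYSICAL_RAM_MB - normalized(AM_MB)

print("map containers YARN may schedule:", max_containers(yarn_budget, MAP_MB))
print("map containers that fit in physical RAM:", max_containers(ram_budget, MAP_MB))
```

Under these assumptions, YARN is allowed to schedule more map containers than would fit in 6 GB of physical memory, which may be worth ruling out as a factor.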
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)