Posted to dev@nutch.apache.org by "Shubham Gupta (JIRA)" <ji...@apache.org> on 2017/05/03 12:18:04 UTC
[jira] [Created] (NUTCH-2384) nutch 2.3.1 unable to fetch all documents with hadoop 2.7.1

Shubham Gupta created NUTCH-2384:
------------------------------------

             Summary: nutch 2.3.1 unable to fetch all documents with hadoop 2.7.1
                 Key: NUTCH-2384
                 URL: https://issues.apache.org/jira/browse/NUTCH-2384
             Project: Nutch
          Issue Type: Test
          Components: nutchNewbie
    Affects Versions: 2.3.1
         Environment: nutch 2.3.1 + hadoop 2.7.1 + mongodb
            Reporter: Shubham Gupta
             Fix For: 2.4


Hey, 

I am testing the Nutch crawler in a local environment as well as on a Hadoop cluster.

When testing in the local environment, i.e. using the following command:
bin/nutch fetch -all -crawlId <table-name>
Nutch ends up fetching all the URLs present in the queue, and I have been able to crawl over 100,000 URLs (from 5000 seed URLs).

However, when I run the same project on the Hadoop cluster, I am not able to reach even the 100,000 mark. It has fetched only 45,000 URLs (from 1100 seed URLs).
When tested with 5000 seed URLs, it fetched a similar amount of data.
The plugins used in Nutch are as follows:
protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic

The settings I am using with the hadoop cluster are as follows:

MAPRED-SITE.XML:

<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1800m</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx712m</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>master:50030</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx800m</value>
</property>
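
As a rough sanity check of the values above, the sketch below compares each JVM heap limit (-Xmx) with the size of the container it runs in. It is only an illustration: the `SETTINGS` table and the `heap_ratio` helper are names made up for this example, and the figure of about 0.8 mentioned in the comments is a common rule of thumb, not a Hadoop requirement.

```python
# Heap-to-container ratios for the mapred-site.xml values listed above.
# Each entry is (container size in MB, JVM -Xmx in MB), copied from the config.
SETTINGS = {
    "map": (1024, 712),      # mapreduce.map.memory.mb / mapreduce.map.java.opts
    "reduce": (2048, 1800),  # mapreduce.reduce.memory.mb / mapreduce.reduce.java.opts
    "am": (1024, 800),       # yarn.app.mapreduce.am.resource.mb / ...command-opts
}


def heap_ratio(container_mb, heap_mb):
    """Fraction of the container's memory that the JVM heap may use."""
    return heap_mb / container_mb


for name, (container_mb, heap_mb) in SETTINGS.items():
    # A ratio well above ~0.8 (a rule of thumb, not a hard limit) leaves little
    # room for non-heap JVM memory inside the container.
    print(f"{name}: heap/container = {heap_ratio(container_mb, heap_mb):.2f}")
```

With these values the ratios come out at 0.70 (map), 0.88 (reduce) and 0.78 (application master), so the reduce heap leaves the least headroom inside its container.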


YARN-SITE.XML:

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
    <description>minimum memory allocated to containers.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>5120</value>
    <description>maximum memory allocated to containers.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
    <description>max memory allocated to nodemanager.</description>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>100</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
</property>

The system has 6 GB of RAM, and the available network bandwidth is 4 Mb/sec.
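
For reference, the arithmetic below compares the memory that the NodeManager advertises to YARN (12288 MB) with the 6 GB of RAM mentioned above. It is a minimal sketch under stated assumptions: the 6 GB is taken to be the physical RAM of a single node, one 1024 MB application master runs alongside the tasks, and `max_containers` is a hypothetical helper written for this example. It does not establish whether the difference is related to the fetch shortfall.

```python
def max_containers(node_mb, container_mb, am_mb=1024):
    """Containers of container_mb that fit on a node next to one application master."""
    return max((node_mb - am_mb) // container_mb, 0)


ADVERTISED_MB = 12288    # yarn.nodemanager.resource.memory-mb from yarn-site.xml
PHYSICAL_MB = 6 * 1024   # assumption: the 6 GB of RAM belongs to a single node

for label, node_mb in (("advertised", ADVERTISED_MB), ("physical", PHYSICAL_MB)):
    maps = max_containers(node_mb, 1024)      # mapreduce.map.memory.mb
    reduces = max_containers(node_mb, 2048)   # mapreduce.reduce.memory.mb
    print(f"{label}: up to {maps} map or {reduces} reduce containers")
```

Under these assumptions YARN may schedule up to 11 map (or 5 reduce) containers on the node, while the physical RAM only covers 5 map (or 2 reduce) containers of the configured sizes.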



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)