Posted to dev@nutch.apache.org by "Shubham Gupta (JIRA)" <ji...@apache.org> on 2017/05/09 04:21:04 UTC

[jira] [Updated] (NUTCH-2384) nutch 2.3.1 job not properly interacting with hadoop 2.7.1

     [ https://issues.apache.org/jira/browse/NUTCH-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Gupta updated NUTCH-2384:
---------------------------------
    Description: 
Hey, 

I am testing the Nutch crawler both in a local environment and on a Hadoop cluster.

In local mode the crawler is able to fetch millions of documents, but the Hadoop job built by running "ant clean runtime" fails to do so.

While testing in the local environment, i.e. using the following command:
bin/nutch fetch -all -crawlId <table-name>
it fetches all the URLs present in the queue, and I have been able to crawl over 100,000 URLs (from 5,000 seed URLs).

Whereas when I run the same project on the Hadoop cluster, I cannot even reach the 100,000 mark; it has fetched only about 45,000 URLs (from 1,100 seed URLs). Even when tested with 5,000 seed URLs, it fetched only a similar amount of data.
The plugins used in Nutch are as follows:
protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic
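
For reference, this plugin list would normally be configured via the `plugin.includes` property in nutch-site.xml; a sketch of that property, assuming the value above is used unchanged:

```xml
<!-- nutch-site.xml: enable the plugin set listed above -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```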

The settings I am using on the Hadoop cluster are as follows:

MAPRED-SITE.XML:

<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1800m</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx712m</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>master:50030</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx800m</value>
</property>
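
One thing worth double-checking in a setup like this is that each -Xmx heap fits inside its container allocation with some headroom. A small sanity-check sketch over the values above (the ~80% rule of thumb is a common guideline, not a Hadoop requirement):

```python
# Sanity-check that each JVM heap (-Xmx) fits inside its YARN container,
# using the values from the mapred-site.xml settings above.
def heap_fits(container_mb, xmx_mb, headroom=0.8):
    """True if the heap is at most ~80% of the container allocation."""
    return xmx_mb <= container_mb * headroom

settings = {
    "map":    (1024, 712),   # mapreduce.map.memory.mb,    -Xmx712m
    "reduce": (2048, 1800),  # mapreduce.reduce.memory.mb, -Xmx1800m
    "am":     (1024, 800),   # yarn.app.mapreduce.am.resource.mb, -Xmx800m
}

for role, (container_mb, xmx_mb) in settings.items():
    print(role, heap_fits(container_mb, xmx_mb))
```

By this rule of thumb, the reduce setting stands out: an -Xmx1800m heap inside a 2048 MB container leaves little room for JVM overhead, which can get reducer containers killed for exceeding physical memory.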


YARN-SITE.XML:

<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>Minimum memory allocated to a container.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>5120</value>
<description>Maximum memory allocated to a container.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>12288</value>
<description>Maximum memory allocated to the NodeManager.</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>100</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers.</description>
</property>
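
Given these limits, it can help to work out how many containers one node can actually run at once. A rough calculation sketch using the numbers above (it assumes the scheduler rounds requests up to a multiple of the 1024 MB minimum allocation, which is the usual default increment behavior):

```python
# Rough per-node container concurrency estimate, from the yarn-site.xml above.
NODE_MB   = 12288  # yarn.nodemanager.resource.memory-mb
MIN_ALLOC = 1024   # yarn.scheduler.minimum-allocation-mb

def round_up(request_mb, step=MIN_ALLOC):
    """Round a container request up to a multiple of the minimum allocation."""
    return -(-request_mb // step) * step

map_mb    = round_up(1024)  # mapreduce.map.memory.mb
reduce_mb = round_up(2048)  # mapreduce.reduce.memory.mb
am_mb     = round_up(1024)  # yarn.app.mapreduce.am.resource.mb

# After the AM takes its container, how many map or reduce tasks fit?
remaining = NODE_MB - am_mb
print(remaining // map_mb)     # 11 concurrent map containers
print(remaining // reduce_mb)  # 5 concurrent reduce containers
```

Note that the 12288 MB advertised per NodeManager exceeds the 6 GB of physical RAM reported below, so YARN may schedule more containers than the machine can actually hold.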

The RAM available to the system is 6 GB, and the available network bandwidth is 4 Mb/s.
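
Since the two runs differ in fetch volume, it may also help to sanity-check what the 4 Mb/s link can sustain. A back-of-the-envelope sketch (the 50 KB average page size is an assumption, not a figure from this report):

```python
# Back-of-the-envelope fetch-throughput estimate for the 4 Mb/s link.
LINK_MBPS   = 4    # megabits per second, from the environment above
AVG_PAGE_KB = 50   # assumed average fetched page size (hypothetical)

bytes_per_sec  = LINK_MBPS * 1_000_000 / 8                       # 500,000 B/s
pages_per_hour = int(bytes_per_sec * 3600 / (AVG_PAGE_KB * 1000))
print(pages_per_hour)  # ~36,000 pages/hour at full link utilization
```

Whether bandwidth is actually the bottleneck depends on run duration and real page sizes, but this gives a ceiling to compare the 45,000-URL result against.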

        Summary: nutch 2.3.1 job not properly interacting with hadoop 2.7.1  (was: nutch 2.3.1 unable to fetch all documents with hadoop 2.7.1)

> nutch 2.3.1 job not properly interacting with hadoop 2.7.1
> ----------------------------------------------------------
>
>                 Key: NUTCH-2384
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2384
>             Project: Nutch
>          Issue Type: Test
>          Components: nutchNewbie
>    Affects Versions: 2.3.1
>         Environment: nutch 2.3.1 + hadoop 2.7.1 + mongodb
>            Reporter: Shubham Gupta
>             Fix For: 2.4
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)