Posted to user@nutch.apache.org by Thomas Anderson <t....@gmail.com> on 2011/02/21 06:16:44 UTC

No URLs to fetch - check your seed list and URL filters

I am learning how to set up Nutch to crawl a website by following
http://wiki.apache.org/nutch/NutchHadoopTutorial. When testing a crawl
of the URL http://lucene.apache.org as described in the tutorial, I
keep getting `No URLs to fetch - check your seed list and URL
filters.'

The command used to crawl the sample website is

    bin/nutch crawl lucene.apache.org -dir lucene.apache.org -depth 3

The lucene.apache.org directory contains a file named urls, which
contains the single URL http://lucene.apache.org.
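
For reference, the seed directory was prepared roughly like this (a
sketch; since the job runs on the Hadoop cluster, the directory also
has to be copied into HDFS, and the destination path here is only an
example):

    mkdir lucene.apache.org
    # one seed URL per line
    echo "http://lucene.apache.org" > lucene.apache.org/urls
    # copy the seed directory into HDFS so the cluster can read it
    bin/hadoop fs -put lucene.apache.org lucene.apache.org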

The relevant Nutch/Hadoop settings are:

masters:
  cloud1

slaves:
  cloud2
  cloud3
  cloud4

hdfs-site.xml:

  <property>
    <name>dfs.name.dir</name>
    <value>/home/cloud/dfs/name</value>
  </property>

core-site.xml:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cloud1:9000</value>
  </property>

mapred-site.xml:

  <property>
    <name>mapred.job.tracker</name>
    <value>cloud1:9001</value>
  </property>
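
(For completeness, each of these property blocks sits inside the
top-level <configuration> element of its file; conf/mapred-site.xml,
for example, looks roughly like this:)

  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>cloud1:9001</value>
    </property>
  </configuration>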

crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/
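
One way to sanity-check the filter (a sketch assuming Nutch 1.x; the
checker class and its options may differ in other versions) is to pipe
the seed URL through the URL filter checker and see whether it prints
a leading + (accepted) or - (rejected):

    # prints "+<url>" if the URL passes all configured filters
    echo "http://lucene.apache.org" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined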

Where should I look to find out why Nutch does not fetch any pages?

Thanks

Re: No URLs to fetch - check your seed list and URL filters

Posted by Thomas Anderson <t....@gmail.com>.
Thanks. I think this problem is solved now. The cause was that the
Nutch conf files were missing on the other Hadoop machines. Every node
in the cluster needs the Nutch configuration, because the crawl is
submitted as a MapReduce job and its tasks run on the slave nodes.
Simply copying the nutch-* conf files and crawl-urlfilter.txt to all
Hadoop machines solved the issue.
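
As a rough sketch of that copy step (assuming Nutch is installed at
the same path on every node; /home/cloud/nutch is only a placeholder):

    # push the Nutch configuration to each slave node
    for host in cloud2 cloud3 cloud4; do
      scp conf/nutch-*.xml conf/crawl-urlfilter.txt $host:/home/cloud/nutch/conf/
    done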

Thanks again for the help.

On Mon, Feb 21, 2011 at 1:35 PM, Ibrahim Alkharashi
<kh...@kacst.edu.sa> wrote:
> 1. I am not sure that you can use the same directory the source of your
> seed urls and as destination for the crawl process
>
> try
>   bin/nutch crawl dir1 -dir dir2 -depth 3
>
> 2. I remembered that I had problem with the tailing slash at the end of
> the url at crawl-urlfilter.txt
>
> try
> +^http://([a-z0-9]*\.)*apache.org
>
> Ibrahim
>
> On Mon, 2011-02-21 at 13:16 +0800, Thomas Anderson wrote:
>> I learn setting up nutch to crawl a website through
>> http://wiki.apache.org/nutch/NutchHadoopTutorial. When testing to
>> crawl the url http://lucene.apache.org as described in tutorial, I
>> keep getting `No URLs to fetch - check your seed list and URL
>> filters.'
>>
>> The command used to crawl the sample website is
>>
>>     bin/nutch crawl lucene.apache.org -dir lucene.apache.org -depth 3
>>
>> Inside lucene.apache.org, it contains a file named urls, which points
>> to the url http://lucene.apache.org.
>>
>> The setting in nutch
>>
>> masters:
>>   cloud1
>>
>> slaves:
>>   cloud2
>>   cloud3
>>   cloud4
>>
>> hdfs-site.xml:
>>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/home/cloud/dfs/name</value>
>>   </property>
>>
>> core-site.xml:
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://cloud1:9000</value>
>>   </property>
>>
>> map-reduce.xml:
>>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>cloud1:9001</value>
>>   </property>
>>
>> crawl-urlfilter.txt
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://([a-z0-9]*\.)*apache.org/
>>
>>  Where should I check for reasons why nutch does not fetch any pages?
>>
>> Thanks
>
>
>

Re: No URLs to fetch - check your seed list and URL filters

Posted by Ibrahim Alkharashi <kh...@kacst.edu.sa>.
1. I am not sure that you can use the same directory as both the source
of your seed URLs and the destination of the crawl process.

try 
   bin/nutch crawl dir1 -dir dir2 -depth 3
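
For example, keeping the seed directory from your mail and a separate
output directory (the output name is just illustrative):

   bin/nutch crawl lucene.apache.org -dir crawl.lucene -depth 3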

2. I remember that I had a problem with the trailing slash at the end of
the URL pattern in crawl-urlfilter.txt: if the seed URL is written without
a trailing slash (as http://lucene.apache.org is here), a pattern that
requires one will reject it.

try 
+^http://([a-z0-9]*\.)*apache.org

Ibrahim

On Mon, 2011-02-21 at 13:16 +0800, Thomas Anderson wrote:
> I learn setting up nutch to crawl a website through
> http://wiki.apache.org/nutch/NutchHadoopTutorial. When testing to
> crawl the url http://lucene.apache.org as described in tutorial, I
> keep getting `No URLs to fetch - check your seed list and URL
> filters.'
> 
> The command used to crawl the sample website is
> 
>     bin/nutch crawl lucene.apache.org -dir lucene.apache.org -depth 3
> 
> Inside lucene.apache.org, it contains a file named urls, which points
> to the url http://lucene.apache.org.
> 
> The setting in nutch
> 
> masters:
>   cloud1
> 
> slaves:
>   cloud2
>   cloud3
>   cloud4
> 
> hdfs-site.xml:
> 
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/home/cloud/dfs/name</value>
>   </property>
> 
> core-site.xml:
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://cloud1:9000</value>
>   </property>
> 
> map-reduce.xml:
> 
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>cloud1:9001</value>
>   </property>
> 
> crawl-urlfilter.txt
> 
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*apache.org/
> 
>  Where should I check for reasons why nutch does not fetch any pages?
> 
> Thanks