Posted to user@nutch.apache.org by Thomas Anderson <t....@gmail.com> on 2011/02/21 06:16:44 UTC
No URLs to fetch - check your seed list and URL filters
I am learning to set up Nutch to crawl a website by following
http://wiki.apache.org/nutch/NutchHadoopTutorial. When testing a crawl
of http://lucene.apache.org as described in the tutorial, I keep
getting `No URLs to fetch - check your seed list and URL filters.'
The command used to crawl the sample website is
bin/nutch crawl lucene.apache.org -dir lucene.apache.org -depth 3
Inside the lucene.apache.org directory there is a file named urls,
which contains the url http://lucene.apache.org.
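For reference, a minimal sketch of the seed layout the crawl command expects: a directory holding a plain-text file with one URL per line. The names seeds and urls below are illustrative choices, not required by Nutch; the thread itself uses a directory named lucene.apache.org.

```shell
# Sketch of a seed layout (directory and file names are arbitrary;
# "seeds" and "urls" are just illustrative here).
mkdir -p seeds
echo 'http://lucene.apache.org/' > seeds/urls   # one seed URL per line
# The crawl is then pointed at the directory, e.g.:
#   bin/nutch crawl seeds -dir crawl_out -depth 3
```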
The settings in Nutch:
masters:
cloud1
slaves:
cloud2
cloud3
cloud4
hdfs-site.xml:
<property>
<name>dfs.name.dir</name>
<value>/home/cloud/dfs/name</value>
</property>
core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://cloud1:9000</value>
</property>
mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>cloud1:9001</value>
</property>
crawl-urlfilter.txt:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/
Where should I check for reasons why nutch does not fetch any pages?
Thanks
Re: No URLs to fetch - check your seed list and URL filters
Posted by Thomas Anderson <t....@gmail.com>.
Thanks. I think this problem is solved now. The cause was that the
Nutch conf files were missing on the other Hadoop machines. Every node
in the cluster needs the conf info, because the crawl is submitted as
a MapReduce job whose tasks run on the slaves. Simply copying nutch-*
and crawl-urlfilter.txt to all Hadoop machines solved the issue.
Thanks again for the help.
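The copy step described above can be sketched as a small loop. The hostnames cloud2-cloud4 come from the slaves file in this thread; the conf path is an assumption, so adjust it to your install. echo is left in so the commands can be reviewed first; drop it to actually run the scp.

```shell
# Dry-run sketch: push the Nutch conf to every slave node.
# NUTCH_CONF is an assumed install path; cloud2..cloud4 are the
# slave hostnames from this thread's slaves file.
NUTCH_CONF=/home/cloud/nutch/conf
for host in cloud2 cloud3 cloud4; do
  # echo makes this a dry run; remove it to perform the copy
  echo scp "$NUTCH_CONF"/nutch-* "$NUTCH_CONF"/crawl-urlfilter.txt \
       "$host:$NUTCH_CONF/"
done
```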
Re: No URLs to fetch - check your seed list and URL filters
Posted by Ibrahim Alkharashi <kh...@kacst.edu.sa>.
1. I am not sure that you can use the same directory as both the source
of your seed urls and the destination of the crawl process.
try
bin/nutch crawl dir1 -dir dir2 -depth 3
2. I remember that I had a problem with the trailing slash at the end of
the url in crawl-urlfilter.txt.
try
+^http://([a-z0-9]*\.)*apache.org
Ibrahim
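The trailing-slash point can be checked quickly with plain grep -E. This is not Nutch's actual filter chain, but the regex flavor is close enough for a sanity test: the original pattern ends in "/", so a seed written without a trailing slash is rejected, while dropping the slash accepts both forms.

```shell
# Compare the two filter patterns against a slash-less seed URL.
orig='^http://([a-z0-9]*\.)*apache.org/'
fixed='^http://([a-z0-9]*\.)*apache.org'
echo 'http://lucene.apache.org' | grep -qE "$orig"  && echo accepted || echo rejected
# -> rejected (URL lacks the trailing slash the pattern requires)
echo 'http://lucene.apache.org' | grep -qE "$fixed" && echo accepted || echo rejected
# -> accepted
```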