Posted to user@nutch.apache.org by Chia-Hung Lin <cl...@googlemail.com> on 2011/02/16 10:53:44 UTC
Fetching pages question
I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
start crawling web pages. The crawl command works:
bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3
>& crawl.log
But when I switch to the lower-level commands described in the Whole-web
Crawling section, the Fetching step does not take effect. The result of
the command
bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments
shows
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
But no segments dir is generated in the target folder, unlike with the
crawl command; creating the segments dir manually beforehand gives the
same result.
I searched the internet; other people with similar issues
(http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
solved them by copying conf files to the slave nodes. However, I do not
use Hadoop; I simply downloaded Nutch 1.1 and executed the commands as
instructed in the tutorial.
Is setting up a Hadoop cluster necessary in order to crawl web pages? If
not, what might cause this issue, and how can it be fixed?
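For reference, the lower-level sequence the tutorial describes begins with an inject before the generate step. A minimal sketch, assuming the Nutch 1.1 layout and the paths used in the session above:

```shell
# Inject the seed urls into the crawldb
# (urls/ is the directory holding the seed list file).
bin/nutch inject ../test-domain/crawldb ../test-domain/urls

# Generate a fetch list from the crawldb;
# this creates a timestamped directory under segments/.
bin/nutch generate ../test-domain/crawldb ../test-domain/segments
```

If generate reports "0 records selected for fetching", the crawldb is typically empty (inject was skipped or pointed at the wrong directory) or the url filters rejected every entry.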
Thanks
--
ChiaHung Lin @ nuk, tw.
Re: Fetching pages question
Posted by Chia-Hung Lin <cl...@googlemail.com>.
Sorry, my fault. The first step, which injects urls into the webdb,
should indeed point at the urls directory:
bin/nutch inject ./test-domain/crawldb ./test-domain/urls # urls contains the url pointing to the target domain
but the wrong command I had previously executed was
bin/nutch inject ./test-domain/crawldb ./test-domain
I was under the impression that I had run the stats command for
that earlier inject command
bin/nutch readdb ./test-domain/crawldb -stats
but in fact it was only issued when testing after the crawl command.
Without the urls file, which contains the target urls, the stats
command throws
Exception in thread "main" java.lang.NullPointerException
at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
With urls included, the generate command produces e.g.
...
Generator: segment: ./test-domain/segments/20110216184146
...
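With a segment generated, the remaining steps from the tutorial (fetching and updating the crawldb) can then be run against it. A rough sketch, assuming the paths above; the segment name is whatever generate printed:

```shell
# Pick the most recently generated segment
# (segment dirs are named by timestamp).
s1=$(ls -d ./test-domain/segments/2* | tail -1)

# Fetch the pages listed in the segment.
bin/nutch fetch $s1

# Fold the fetch results back into the crawldb
# so the next generate can select newly discovered urls.
bin/nutch updatedb ./test-domain/crawldb $s1
```

Depending on the fetcher.parse setting, a separate `bin/nutch parse $s1` step may also be needed before updatedb.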
I think the problem is solved now.
Thanks again for the help.
2011/2/16 Markus Jelsma <ma...@openindex.io>:
> Did you inject?
> Is your regex-urlfilter.txt alright?
>
> On Wednesday 16 February 2011 10:53:44 Chia-Hung Lin wrote:
>> I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
>> start crawling web pages. The crawl command works:
>>
>> bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3
>>
>> >& crawl.log
>>
>> But when I switch to the lower-level commands described in the Whole-web
>> Crawling section, the Fetching step does not take effect. The result of
>> the command
>>
>> bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments
>>
>> shows
>>
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: 0 records selected for fetching, exiting ...
>>
>> But no segments dir is generated in the target folder, unlike with the
>> crawl command; creating the segments dir manually beforehand gives the
>> same result.
>>
>> I searched the internet; other people with similar issues
>> (http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
>> solved them by copying conf files to the slave nodes. However, I do not
>> use Hadoop; I simply downloaded Nutch 1.1 and executed the commands as
>> instructed in the tutorial.
>>
>> Is setting up a Hadoop cluster necessary in order to crawl web pages? If
>> not, what might cause this issue, and how can it be fixed?
>>
>> Thanks
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
--
ChiaHung Lin @ nuk, tw.
Re: Fetching pages question
Posted by Markus Jelsma <ma...@openindex.io>.
Did you inject?
Is your regex-urlfilter.txt alright?
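A hypothetical regex-urlfilter.txt restricted to a single target domain might look like the following; the domain name here is purely illustrative:

```
# accept urls under the target host (example.com is a placeholder)
+^http://([a-z0-9]*\.)*example.com/

# reject everything else
-.
```

Rules are tried in order and the first matching +/- pattern decides; a trailing `-.` line rejects any url not explicitly accepted, which silently empties the fetch list if the accept pattern does not match the seed urls.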
On Wednesday 16 February 2011 10:53:44 Chia-Hung Lin wrote:
> I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
> start crawling web pages. The crawl command works:
>
> bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3
>
> >& crawl.log
>
> But when I switch to the lower-level commands described in the Whole-web
> Crawling section, the Fetching step does not take effect. The result of
> the command
>
> bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments
>
> shows
>
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
>
> But no segments dir is generated in the target folder, unlike with the
> crawl command; creating the segments dir manually beforehand gives the
> same result.
>
> I searched the internet; other people with similar issues
> (http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
> solved them by copying conf files to the slave nodes. However, I do not
> use Hadoop; I simply downloaded Nutch 1.1 and executed the commands as
> instructed in the tutorial.
>
> Is setting up a Hadoop cluster necessary in order to crawl web pages? If
> not, what might cause this issue, and how can it be fixed?
>
> Thanks
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350