Posted to user@nutch.apache.org by Chia-Hung Lin <cl...@googlemail.com> on 2011/02/16 10:53:44 UTC

Fetching pages question

I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
start crawling web pages. Crawling with the crawl command works:

    bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3 >& crawl.log

But when I switch to the lower-level commands described in the Whole-web
Crawling section, the Fetching step has no effect. The command

    bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments

shows

    Generator: Selecting best-scoring urls due for fetch.
    Generator: starting
    Generator: filtering: true
    Generator: normalizing: true
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: 0 records selected for fetching, exiting ...

But in the target folder no segments directory is created, as there is when
using the crawl command; creating the segments directory manually beforehand
gives the same result.
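
One check I can think of (a sketch, assuming the readdb tool of Nutch 1.1 in
local mode) is to dump the crawldb statistics and see whether there is
anything for the generator to select:

    # show crawldb statistics; a total URL count of 0 means nothing can be generated
    bin/nutch readdb ../test-domain/crawldb -stats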

Searching the internet, I found other people with similar issues
(http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
who solved them by copying the conf files to the slave nodes. However, I do
not use Hadoop; I simply downloaded Nutch 1.1 and executed the commands as
instructed in the tutorial.

Is setting up a Hadoop cluster necessary in order to crawl web pages? If not,
what might cause this issue and how can it be fixed?

Thanks

-- 
ChiaHung Lin @ nuk, tw.

Re: Fetching pages question

Posted by Chia-Hung Lin <cl...@googlemail.com>.
Sorry, my fault. The first step, which injects URLs into the crawldb, should
indeed be given the urls directory:

    bin/nutch inject ./test-domain/crawldb ./test-domain/urls  # urls contains the URL pointing to the target domain

but the wrong command I had actually executed was

    bin/nutch inject ./test-domain/crawldb ./test-domain
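
For anyone hitting the same thing: that second argument is expected to be a
directory of plain-text seed files, one URL per line. A minimal sketch of
setting it up (the file name seed.txt and the URL are only placeholders):

    # create the seed directory the injector reads from
    mkdir -p ./test-domain/urls
    echo 'http://www.example.com/' > ./test-domain/urls/seed.txt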

I was also under the impression that I had run the stats command for that
inject run,

    bin/nutch readdb ./test-domain/crawldb -stats

but in fact I had only issued it when testing after the crawl command.

When the crawldb is injected without the urls file that holds the target
URLs, the stats command throws

    Exception in thread "main" java.lang.NullPointerException
    	    at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
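
Re-running the inject step with the urls directory and then the stats command
clears that up. A sketch of the sequence, using the same paths as above:

    # inject the seed URLs, then confirm the crawldb is populated
    bin/nutch inject ./test-domain/crawldb ./test-domain/urls
    bin/nutch readdb ./test-domain/crawldb -stats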

With the urls directory included, the generate command produces output such as

    ...
    Generator: segment: ./test-domain/segments/20110216184146
    ...
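
From there the remaining lower-level steps of the Whole-web Crawling section
should apply as usual. A rough sketch, assuming Nutch 1.1 in local mode and
using the segment path printed by generate:

    # fetch the newly generated segment, then fold the results back into the crawldb
    s1=./test-domain/segments/20110216184146
    bin/nutch fetch $s1
    bin/nutch updatedb ./test-domain/crawldb $s1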

I think the problem is solved now.

Thanks again for the help.



2011/2/16 Markus Jelsma <ma...@openindex.io>:
> Did you inject?
> Is your regex-urlfilter.txt alright?
>
> On Wednesday 16 February 2011 10:53:44 Chia-Hung Lin wrote:
>> I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
>> start crawling web pages. Crawling with the crawl command works:
>>
>>     bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3 >& crawl.log
>>
>> But when I switch to the lower-level commands described in the Whole-web
>> Crawling section, the Fetching step has no effect. The command
>>
>>     bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments
>>
>> shows
>>
>>     Generator: Selecting best-scoring urls due for fetch.
>>     Generator: starting
>>     Generator: filtering: true
>>     Generator: normalizing: true
>>     Generator: jobtracker is 'local', generating exactly one partition.
>>     Generator: 0 records selected for fetching, exiting ...
>>
>>  But in the target folder no segments directory is created, as there is
>> when using the crawl command; creating the segments directory manually
>> beforehand gives the same result.
>>
>> Searching the internet, I found other people with similar issues
>> (http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
>> who solved them by copying the conf files to the slave nodes. However, I
>> do not use Hadoop; I simply downloaded Nutch 1.1 and executed the
>> commands as instructed in the tutorial.
>>
>> Is setting up a Hadoop cluster necessary in order to crawl web pages? If
>> not, what might cause this issue and how can it be fixed?
>>
>> Thanks
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
ChiaHung Lin @ nuk, tw.

Re: Fetching pages question

Posted by Markus Jelsma <ma...@openindex.io>.
Did you inject?
Is your regex-urlfilter.txt alright?
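
If the default "+." line in conf/regex-urlfilter.txt was replaced as the
tutorial suggests, it should look roughly like this (a sketch; example.com
stands in for your target domain):

    # conf/regex-urlfilter.txt (excerpt) -- accept only URLs under the target host
    +^http://([a-z0-9]*\.)*example.com/

If a URL matches no + rule it is filtered out, which would also lead to 0
records at generate time.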

On Wednesday 16 February 2011 10:53:44 Chia-Hung Lin wrote:
> I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
> start crawling web pages. Crawling with the crawl command works:
> 
>     bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3 >& crawl.log
> 
> But when I switch to the lower-level commands described in the Whole-web
> Crawling section, the Fetching step has no effect. The command
> 
>     bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments
> 
> shows
> 
>     Generator: Selecting best-scoring urls due for fetch.
>     Generator: starting
>     Generator: filtering: true
>     Generator: normalizing: true
>     Generator: jobtracker is 'local', generating exactly one partition.
>     Generator: 0 records selected for fetching, exiting ...
> 
>  But in the target folder no segments directory is created, as there is
> when using the crawl command; creating the segments directory manually
> beforehand gives the same result.
> 
> Searching the internet, I found other people with similar issues
> (http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
> who solved them by copying the conf files to the slave nodes. However, I
> do not use Hadoop; I simply downloaded Nutch 1.1 and executed the
> commands as instructed in the tutorial.
> 
> Is setting up a Hadoop cluster necessary in order to crawl web pages? If
> not, what might cause this issue and how can it be fixed?
> 
> Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350