Posted to user@nutch.apache.org by Thomas Anderson <t....@gmail.com> on 2011/02/18 11:11:05 UTC

Nutch search result

I followed the NutchTutorial and got search working, but I have
several questions.

1st, is it possible for a website to set up restrictions so that
nutch cannot fetch its pages, or can only fetch them under some
conditions? If so, which file (e.g. robots.txt?) would nutch
respect in order to avoid fetching specific pages? Or what else might
prevent nutch from fetching a website (e.g. a website that generates
dynamic content without static links)?

2nd, after testing to fetch several pages from wikipedia, the search
query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
../wiki_dir returns

    Total hits: 1
     0 20110218171640/http://en.wikipedia.org/wiki/IBM
    IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the
free encyclopedia Jump to:  ...

This seemingly does not relate to apache; is there any reason that might
explain why it returns IBM? Or might any of the execution steps below have
gone wrong?

    bin/nutch inject ../wiki/crawldb urls

    bin/nutch generate ../wiki/crawldb ../wiki/segments
    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`

    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`

    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`

    bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments
    bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb
../wiki/segments/*
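
(As a side note, crawldb statistics can be dumped after each round to see how
many pages were actually fetched; a small sketch, assuming the readdb options
available in Nutch 1.1:

    bin/nutch readdb ../wiki/crawldb -stats

The status counts in the output show how many urls are fetched versus still
unfetched.)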

In addition, why does only the third round of 'generate, fetch, and
updatedb' actually fetch pages, while the second round only reports
that it is done?

The second round message

    Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
    Fetcher: starting
    Fetcher: segment: ../wiki/segments/20110218171338
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records + hit by time limit :0
    -finishing thread FetcherThread, activeThreads=1
    fetching http://en.wikipedia.org/wiki/Main_Page
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
    -finishing thread FetcherThread, activeThreads=0
    -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=0
    Fetcher: done
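
(The warning at the top of this log is about agent configuration. A minimal
sketch of the two properties in conf/nutch-site.xml, using a made-up agent
name:

    <property>
      <name>http.agent.name</name>
      <value>mytestcrawler</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>mytestcrawler,*</value>
    </property>

The warning goes away once the http.agent.name value is listed first in the
http.robots.agents list.)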

Thanks.

Re: Nutch search result

Posted by al...@aim.com.
> 2nd, after testing to fetch several pages from wikipedia, the search
> query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
> ../wiki_dir returns

It returns a result for the keyword apache because that url has "apache" in it.
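
(One way to check why the IBM page matched is to dump the segment it came
from; a rough sketch, assuming the readseg options of Nutch 1.1 and the
segment name shown in the result above:

    bin/nutch readseg -dump ../wiki/segments/20110218171640 /tmp/seg_dump
    grep -i apache /tmp/seg_dump/dump

The dump should show where "apache" occurs for that url, e.g. in the page
text or anchor text.)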



> crawling with crawl command (bin/nutch crawl urls -dir crawl -depth 3
> -topN 50), it actually fetches some pages e.g. `fetching
> http://www.plurk.com/t/Brazil'). I am confused the differences between
> using crawl command and step-by-step crawling.

In order to get the same fetching with the step-by-step approach you need to
run the generate/fetch/updatedb cycle 3 times, because you used depth 3 in
the crawl command.
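
(A rough shell sketch of the step-by-step equivalent of that crawl command,
using the same -dir crawl layout; the topN value and paths are just
illustrative:

    # assumes a seed list in urls/, output under crawl/
    bin/nutch inject crawl/crawldb urls
    for round in 1 2 3; do               # one pass per unit of -depth
        bin/nutch generate crawl/crawldb crawl/segments -topN 50
        segment=`ls -d crawl/segments/2* | tail -1`
        bin/nutch fetch $segment
        bin/nutch updatedb crawl/crawldb $segment
    done
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Each pass through the loop corresponds to one level of -depth, which is why
three fetch rounds are needed to match -depth 3.)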


 


Re: Nutch search result

Posted by Thomas Anderson <t....@gmail.com>.
The version used is nutch 1.1. OS is debian testing. Java version is 1.6.0_23.

The first question arises from testing a fetch of plurk.com. The url
specified at the inject stage contains only e.g. http://plurk.com.
After going through the steps described in the tutorial, I noticed that no
`fetching http:// ... ' lines were displayed on the console. But when
crawling with the crawl command (bin/nutch crawl urls -dir crawl -depth 3
-topN 50), it actually fetches some pages, e.g. `fetching
http://www.plurk.com/t/Brazil'. I am confused about the differences between
using the crawl command and step-by-step crawling.

When fetching wikipedia, the url specified is http://en.wikipedia.org.
No IBM-related url exists in the seed list. The file containing the wiki url
resides under the wiki folder, which also stores the crawldb, segments, etc.
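
(For reference, a typical layout for this kind of step-by-step crawl keeps
the seed list separate from the crawl data; the names below are only an
example:

    urls/seed.txt      # one url per line, e.g. http://en.wikipedia.org/
    wiki/crawldb/      # built by inject and updatedb
    wiki/segments/     # one sub-directory per generate/fetch round
    wiki/linkdb/       # built by invertlinks

Where the seed file physically lives should not matter, as long as the
directory passed to inject contains only seed lists.)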

Thanks for help.

On Fri, Feb 18, 2011 at 7:27 PM, McGibbney, Lewis John
<Le...@gcu.ac.uk> wrote:
> Hi Thomas
>
> Firstly which dist are you using?
>
> _______________________________________
> From: Thomas Anderson [t.dt.aanderson@gmail.com]
> Sent: 18 February 2011 10:11
> To: user@nutch.apache.org
> Subject: Nutch search result
>
> I follow the NutchTutorial and get the search worked, but I have
> several questions.
>
> 1st, is it possible for a website to setup some restriction so that
> nutch can not fetch its pages or the pages fetched is limited under
> some condition? If so, what file (e.g. robots.txt?) nutch would
> respect in order to avoid fetching specific pages?
>
> For this can you please specify your use scenario. If you have a website with certain areas which you wish not to be crawled, then I would assume a robots file would suffice. Conversely, if you wish to restrict Nutch from crawling certain pages of specific domains, then I imagine you would be looking at a different config of crawl-urlfilter.
>
>
> 2nd, after testing to fetch several pages from wikipedia, the search
> query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
> ../wiki_dir returns
>
>    Total hits: 1
>     0 20110218171640/http://en.wikipedia.org/wiki/IBM
>    IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the
> free encyclopedia Jump to:  ...
>
> I'm afraid that I completely lose you here. Have you specified some IBM page within your /wiki_dir? If so, it might be the case that Nutch has not fetched pages for a certain reason, e.g. politeness rules. Can anyone advise on this please?
>
>
>
> This seeming does not relate to apache, any reason that may explain
> the reason it returns IBM? Or any execution step below may go wrong?
>
>    bin/nutch inject ../wiki/crawldb urls
>
>    bin/nutch generate ../wiki/crawldb ../wiki/segments
>    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
>    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
>
>    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
>    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
>    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
>
>    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
>    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
>    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
>
>    bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments
>    bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb
> ../wiki/segments/*
>
> In addition, why only the third round 'generate, fetch, and updatedb'
> will actually fetch pages while the second round only replies it is
> done?
>
> The second round message
>
>    Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
>    Fetcher: starting
>    Fetcher: segment: ../wiki/segments/20110218171338
>    Fetcher: threads: 10
>    QueueFeeder finished: total 1 records + hit by time limit :0
>    -finishing thread FetcherThread, activeThreads=1
>    fetching http://en.wikipedia.org/wiki/Main_Page
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -finishing thread FetcherThread, activeThreads=1
>    -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>    -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>    -finishing thread FetcherThread, activeThreads=0
>    -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>    -activeThreads=0
>    Fetcher: done
>
> Thanks.
>

RE: Nutch search result

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.
Hi Thomas

Firstly which dist are you using?

_______________________________________
From: Thomas Anderson [t.dt.aanderson@gmail.com]
Sent: 18 February 2011 10:11
To: user@nutch.apache.org
Subject: Nutch search result

I follow the NutchTutorial and get the search worked, but I have
several questions.

1st, is it possible for a website to setup some restriction so that
nutch can not fetch its pages or the pages fetched is limited under
some condition? If so, what file (e.g. robots.txt?) nutch would
respect in order to avoid fetching specific pages?

For this can you please specify your use scenario. If you have a website with certain areas which you wish not to be crawled, then I would assume a robots file would suffice. Conversely, if you wish to restrict Nutch from crawling certain pages of specific domains, then I imagine you would be looking at a different config of crawl-urlfilter.
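
For example (agent name, domain and patterns below are only placeholders): a
site can keep parts of itself out of a crawl with rules in its robots.txt,
while on the Nutch side a line in conf/crawl-urlfilter.txt (used by the crawl
command; the step-by-step tools read regex-urlfilter.txt) can exclude urls by
regular expression:

    # robots.txt served by the web site
    User-agent: *
    Disallow: /private/

    # conf/crawl-urlfilter.txt entry on the Nutch side
    -^http://([a-z0-9]*\.)*example.com/private/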


2nd, after testing to fetch several pages from wikipedia, the search
query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
../wiki_dir returns

    Total hits: 1
     0 20110218171640/http://en.wikipedia.org/wiki/IBM
    IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the
free encyclopedia Jump to:  ...

I'm afraid that I completely lose you here. Have you specified some IBM page within your /wiki_dir? If so, it might be the case that Nutch has not fetched pages for a certain reason, e.g. politeness rules. Can anyone advise on this please?
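
(On the politeness side, the relevant setting lives in nutch-site.xml as
well; a small sketch, assuming the Nutch 1.1 property name and an
illustrative value:

    <property>
      <name>fetcher.server.delay</name>
      <!-- seconds to wait between requests to the same host -->
      <value>5.0</value>
    </property>

A larger delay makes the crawl slower but politer to the target host.)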



This seeming does not relate to apache, any reason that may explain
the reason it returns IBM? Or any execution step below may go wrong?

    bin/nutch inject ../wiki/crawldb urls

    bin/nutch generate ../wiki/crawldb ../wiki/segments
    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`

    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`

    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
    bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
    bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`

    bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments
    bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb
../wiki/segments/*

In addition, why only the third round 'generate, fetch, and updatedb'
will actually fetch pages while the second round only replies it is
done?

The second round message

    Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
    Fetcher: starting
    Fetcher: segment: ../wiki/segments/20110218171338
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records + hit by time limit :0
    -finishing thread FetcherThread, activeThreads=1
    fetching http://en.wikipedia.org/wiki/Main_Page
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
    -finishing thread FetcherThread, activeThreads=0
    -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=0
    Fetcher: done

Thanks.
