Posted to user@nutch.apache.org by Thomas Anderson <t....@gmail.com> on 2011/02/18 11:11:05 UTC
Nutch search result
I followed the NutchTutorial and got search working, but I have
several questions.
First, is it possible for a website to set up restrictions so that
Nutch cannot fetch its pages, or can only fetch them under
some condition? If so, which file (e.g. robots.txt?) would Nutch
respect in order to avoid fetching specific pages? Or what might
prevent Nutch from fetching a website (e.g. a website that generates
dynamic content without static links)?
Second, after fetching several test pages from wikipedia, a search
query such as bin/nutch org.apache.nutch.searcher.NutchBean apache
../wiki_dir returns
Total hits: 1
0 20110218171640/http://en.wikipedia.org/wiki/IBM
IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the
free encyclopedia Jump to: ...
This seemingly does not relate to apache; is there any reason that
might explain why it returns IBM? Or did any execution step below go wrong?
bin/nutch inject ../wiki/crawldb urls
bin/nutch generate ../wiki/crawldb ../wiki/segments
bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments
bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb
../wiki/segments/*
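As an aside on the steps above, the `ls -d ../wiki/segments/2* | tail -1` idiom selects the most recently created segment: segment directories are named by timestamp, so lexicographic sort order matches creation order and `tail -1` yields the newest. A self-contained illustration (with made-up segment names):

```shell
# Segment directories are timestamps, so lexicographic sort order
# matches creation order and `tail -1` yields the newest one.
# (Hypothetical segment names, for illustration only.)
mkdir -p segs/20110218171338 segs/20110218171640
latest=$(ls -d segs/2* | tail -1)
echo "$latest"   # segs/20110218171640
```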
In addition, why does only the third round of 'generate, fetch, and
updatedb' actually fetch pages, while the second round only reports
that it is done?
The second round message
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: ../wiki/segments/20110218171338
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=1
fetching http://en.wikipedia.org/wiki/Main_Page
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
Thanks.
Re: Nutch search result
Posted by al...@aim.com.
> 2nd, after testing to fetch several pages from wikipedia, the search
> query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
> ../wiki_dir returns
It returns a result for the keyword apache because that URL has "apache" in it.
>crawling with crawl command (bin/nutch crawl urls -dir crawl -depth 3
>-topN 50), it actually fetches some pages e.g. `fetching
>http://www.plurk.com/t/Brazil'). I am confused the differences between
>using crawl command and step-by-step crawling.
In order to get the same fetching with the step-by-step approach, you need to run the generate/fetch/updatedb cycle 3 times, because you passed -depth 3 to the crawl command.
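The step-by-step equivalent of -depth 3 can be sketched as a loop over the same tutorial commands (a sketch only; it assumes the directory layout from the steps above and a Nutch 1.1 bin/nutch on the path):

```
# One generate/fetch/updatedb round per depth level; the crawl
# command with -depth 3 performs three such rounds internally.
bin/nutch inject ../wiki/crawldb urls
for depth in 1 2 3; do
  bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 50
  segment=`ls -d ../wiki/segments/2* | tail -1`   # newest segment
  bin/nutch fetch "$segment"
  bin/nutch updatedb ../wiki/crawldb "$segment"
done
```

After the loop, invertlinks and index run once, as in the original step-by-step sequence.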
-----Original Message-----
From: Thomas Anderson <t....@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Feb 18, 2011 9:10 pm
Subject: Re: Nutch search result
Re: Nutch search result
Posted by Thomas Anderson <t....@gmail.com>.
The version used is Nutch 1.1. The OS is Debian testing. The Java version is 1.6.0_23.
The first question arises from testing the fetching of plurk.com. The
url specified at the inject stage contains only e.g. http://plurk.com.
After going through the steps described in the tutorial, I noticed
that no `fetching http:// ... ' messages were displayed on the
console. But when crawling with the crawl command (bin/nutch crawl
urls -dir crawl -depth 3 -topN 50), it actually fetches some pages
(e.g. `fetching http://www.plurk.com/t/Brazil'). I am confused about
the differences between using the crawl command and step-by-step crawling.
When fetching wikipedia, the url specified is http://en.wikipedia.org.
No IBM-related url exists. But the file containing the wiki url
resides under the wiki folder, which also stores the crawldb,
segments, etc.
Thanks for the help.
On Fri, Feb 18, 2011 at 7:27 PM, McGibbney, Lewis John
<Le...@gcu.ac.uk> wrote:
RE: Nutch search result
Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.
Hi Thomas
Firstly which dist are you using?
_______________________________________
From: Thomas Anderson [t.dt.aanderson@gmail.com]
Sent: 18 February 2011 10:11
To: user@nutch.apache.org
Subject: Nutch search result
I follow the NutchTutorial and get the search worked, but I have
several questions.
1st, is it possible for a website to setup some restriction so that
nutch can not fetch its pages or the pages fetched is limited under
some condition? If so, what file (e.g. robots.txt?) nutch would
respect in order to avoid fetching specific pages?
For this can you please specify your usage scenario. If you have a website with certain areas which you wish not to be crawled, then I would assume a robots.txt file would suffice. Inversely, if you wish to restrict Nutch from crawling certain pages of specific domains, then I imagine you would be looking at a different config of crawl-urlfilter
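For instance (hypothetical excerpts; adapt paths and domains): on the site side, the owner can block crawlers in robots.txt, which Nutch honours by default:

```
User-agent: *
Disallow: /private/
```

while on the crawler side, a regex line in conf/crawl-urlfilter.txt (the filter read by the crawl command; the step-by-step tools read regex-urlfilter.txt) can exclude URLs:

```
# skip everything under /private/ on this (example) host
-^http://www\.example\.com/private/
```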
2nd, after testing to fetch several pages from wikipedia, the search
query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
../wiki_dir returns
Total hits: 1
0 20110218171640/http://en.wikipedia.org/wiki/IBM
IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the
free encyclopedia Jump to: ...
I'm afraid that I completely lose you here. Have you specified some IBM page within your /wiki_dir? If so, it might be the case that Nutch has not fetched pages for a certain reason, e.g. politeness rules. Can anyone advise on this please?
Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems
Glasgow Caledonian University is a registered Scottish charity, number SC021474
Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html