
Posted to user@nutch.apache.org by "Håvard W. Kongsgård" <h....@niap.no> on 2005/12/09 01:57:54 UTC

Problem with fetching segment

I followed the media-style.com quick tutorial, but when I try to fetch
my segment, the fetch is killed!

I have tried setting the system clock forward by 30 days, and no
anti-virus software is running on the systems.
Systems: SUSE 9.2 and SUSE 10

# bin/nutch fetch segments/20060109014654/
060109 014714 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-default.xml
060109 014715 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-site.xml
060109 014715 No FS indicated, using default:local
060109 014715 Plugins: looking in: /home/hkongsgaard/nutch-0.7.1/plugins
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/query-more
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-site/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-html/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-text/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-ext
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-pdf
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-rss
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/index-more
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-js
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060109 014715 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-ftp
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-msword
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/creativecommons
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/ontology
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-file
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-http/plugin.xml
060109 014715 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/clustering-carrot2
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/language-identifier
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-prefix
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-url/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/index-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-httpclient
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
060109 014715 http.proxy.port = 8080
060109 014715 http.timeout = 10000
060109 014715 http.content.limit = -1
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page
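
For readers checking whether a fetch really finished, the two status lines at the end of the log can be cross-checked against each other. The sketch below parses the first status line and recomputes the rates printed in the second one. The function name `summarize` is made up for this illustration, and reading "kb/s" as kilobits per second is an assumption inferred from the numbers in this log, not taken from the Nutch source.

```python
import re

# Pattern for the fetcher's summary line as it appears in the log above.
# \s+ before "bytes" tolerates a mail client wrapping the line there.
STATUS_RE = re.compile(
    r"status: segment (\d+), (\d+) pages, (\d+) errors, (\d+)\s+bytes, (\d+) ms"
)

def summarize(status_line: str) -> dict:
    """Recompute the derived rates from the raw counts in a status line."""
    m = STATUS_RE.search(status_line)
    if m is None:
        raise ValueError("not a fetcher status line")
    segment = m.group(1)
    pages, errors, nbytes, millis = (int(g) for g in m.groups()[1:])
    seconds = millis / 1000.0
    return {
        "segment": segment,
        "pages": pages,
        "errors": errors,
        "pages_per_s": pages / seconds,
        # Assumption: "kb/s" means kilobits per second (bytes * 8 / 1024).
        "kbit_per_s": nbytes * 8 / 1024 / seconds,
        "bytes_per_page": nbytes / pages if pages else 0.0,
    }

line = ("060109 014724 status: segment 20060109014654, 3 pages, 0 errors, "
        "51033 bytes, 8309 ms")
stats = summarize(line)
print(stats)
```

A run with "3 pages, 0 errors" and matching rates suggests the fetcher completed normally and simply had only three URLs to fetch.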

Re: Problem with fetching segment

Posted by "Håvard W. Kongsgård" <h....@niap.no>.

Sorry, I misunderstood how whole-web crawling works.

One more question: how do I re-fetch the failed URLs (those that failed
with java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later.)?

Is this controlled by the following property?

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>
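
As a rough illustration of what the property description says, the interval arithmetic can be sketched as follows. The helpers `next_fetch_due` and `is_due` are hypothetical and are not Nutch code. The sketch only models a fixed interval in days, and it does not settle whether URLs that failed with RetryLater follow the same schedule.

```python
from datetime import datetime, timedelta

def next_fetch_due(last_fetch: datetime, interval_days: int = 30) -> datetime:
    """Earliest time a page would be due again under a fixed interval."""
    return last_fetch + timedelta(days=interval_days)

def is_due(last_fetch: datetime, now: datetime, interval_days: int = 30) -> bool:
    """True once the interval has fully elapsed since the last fetch."""
    return now >= next_fetch_due(last_fetch, interval_days)

# Hypothetical fetch time, chosen to match the date of the first post.
fetched_at = datetime(2005, 12, 9, 1, 57)
print(is_due(fetched_at, datetime(2005, 12, 10)))  # one day later
print(is_due(fetched_at, datetime(2006, 1, 9)))    # clock moved a month ahead
```

Under this model, moving the system clock forward by at least the interval makes every previously fetched page count as due again.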




Re: Problem with fetching segment

Posted by Stefan Groschupf <sg...@media-style.com>.

Sorry, I still do not understand what your problem is; maybe it is time
for the weekend... :-)

Your very first mail shows exactly the same thing in the log:
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null

Isn't that the same as
> 060109 154712 fetching http://www.niap.no/magasinet/layout/set/print

In any case, those are just logging statements. What makes you think
that something crashed?

Stefan





Re: Problem with fetching segment

Posted by "Håvard W. Kongsgård" <h....@niap.no>.

But when I fetch the other domains (www.sf.net <http://www.sf.net/>, ...),
the only output is

060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page

There is no output like
060109 154712 fetching http://www.niap.no/magasinet/layout/set/print
060109 154712 fetching http://www.niap.no/magasinet/kontakt_oss
060109 154712 fetching http://www.niap.no/magasinet/ezinfo/about
060109 154712 fetching http://www.niap.no/index.php/magasinet/nyheter/midt_sten



Re: Problem with fetching segment

Posted by Stefan Groschupf <sg...@media-style.com>.

> What is java.net.SocketTimeoutException?

It means the fetcher could not connect to the server.

In general, you are hammering your web server, and it may block the IP
of your server.
You can configure how many threads are allowed to load pages from one
host at the same time.
For an intranet crawl it is a good idea to use fewer threads (maybe just
as many as you plan to use at the same time for the host), e.g.
fetcherThreads = 2, maxThreadsPerHost = 2.
If you have more threads, you should increase the retry/delay
configuration, because when a host is already busy with the maximum
number of threads per host, the thread is delayed.
If a thread is delayed too often, you get "Exceeded
http.max.delays: retry later."

Sometimes I ask myself whether queue-based fetching would not be better
than the current implementation; however, this is difficult to change.
HTH
Stefan
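
The interaction described above between the number of fetcher threads, the per-host thread limit, and the delay budget can be sketched with a toy simulation. This is a simplified model, not the Nutch fetcher: every URL is assumed to live on one host, one tick stands for one server delay, and `simulate` and all of its parameters are names invented for this illustration.

```python
from collections import deque

def simulate(n_urls, n_threads, threads_per_host, max_delays, fetch_ticks):
    """Toy model of per-host throttling; returns (fetched, retry_later)."""
    if min(n_threads, threads_per_host, fetch_ticks) < 1:
        raise ValueError("thread counts and fetch_ticks must be >= 1")
    queue = deque(range(n_urls))
    waiting = []    # [url, delays] held by threads waiting for the host
    fetching = []   # [url, ticks_left] held by threads using the host
    fetched, retry_later = [], []
    while queue or waiting or fetching:
        # 1. Advance running fetches; finished ones free their host slot.
        for job in fetching:
            job[1] -= 1
        fetched += [url for url, left in fetching if left == 0]
        fetching = [job for job in fetching if job[1] > 0]
        # 2. Waiting threads, longest wait first, take a free slot or are
        #    delayed once more; too many delays means "retry later".
        still_waiting = []
        for url, delays in waiting:
            if len(fetching) < threads_per_host:
                fetching.append([url, fetch_ticks])
            elif delays + 1 > max_delays:
                retry_later.append(url)
            else:
                still_waiting.append([url, delays + 1])
        waiting = still_waiting
        # 3. Idle threads pick up the next URLs from the queue.
        while queue and len(waiting) + len(fetching) < n_threads:
            waiting.append([queue.popleft(), 0])
    return fetched, retry_later

# Three threads, one allowed per host, small delay budget.
print(simulate(n_urls=6, n_threads=3, threads_per_host=1, max_delays=2, fetch_ticks=3))
# Same setup with a larger delay budget.
print(simulate(n_urls=6, n_threads=3, threads_per_host=1, max_delays=10, fetch_ticks=3))
# Thread count matched to the per-host limit.
print(simulate(n_urls=6, n_threads=2, threads_per_host=2, max_delays=2, fetch_ticks=3))
```

In this model, extra threads that cannot get a host slot only burn through their delay budget, so either a larger budget or a thread count matched to the per-host limit lets every URL through.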

Re: Problem with fetching segment

Posted by "Håvard W. Kongsgård" <h....@niap.no>.

When I fed my own domain into the database, the segment fetch output
looked like this:


-.-.-.-.-.-.-.-.-.-.-.-.-
060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte
060109 154622 fetching http://www.niap.no/magasinet/nyheter/afrika
060109 154622 fetching http://www.niap.no/magasinet/nyheter/asia_australia
060109 154622 fetching http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya
060109 154622 fetching http://www.niap.no/magasinet/rss/feed/magasinet_rss1
060109 154622 fetching http://www.niap.no/magasinet/content/search
060109 154622 fetching http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap
060109 154622 fetching http://www.niap.no/magasinet/nyheter/europa/russland/stalin_vender_tilbake
060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika
060109 154626 fetch okay, but can't parse http://www.niap.no/magasinet/rss/feed/magasinet_rss1, reason: failed(2,203): Content-Type not text/html: text/xml
060109 154626 fetching http://www.niap.no/magasinet/nyheter/midtoesten/irak/al_queida
060109 154633 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 154633 fetching http://www.niap.no/magasinet/niap/test
060109 154639 fetching http://www.niap.no/magasinet/nyheter/europa/italia/pave_benedict_xvi
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/asia_australia failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/afrika failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetching http://www.niap.no/magasinet/nyheter/soer_amerika
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/palestina_israel/israel_bekymret_for_landets_internasjonale_image failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/content/search failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetching http://www.niap.no/index.php/magasinet/nyheter/s_r_amerika

-.-.-.-.-.-.-
But then

-.-.-.-.-.-
060109 154714 fetch of http://phpadsnew.niap.no/adx.js failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154714 fetching http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria
060109 154722 fetch of http://www.niap.org/ failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
060109 154724 fetch of http://www.niap.no/index.php/magasinet/nyheter/nord_amerika failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/kontakt_oss failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/magasinet/om_magasinet failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/layout/set/print failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154729 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154730 status: segment 20060109154516, 12 pages, 31 errors, 181559 bytes, 68511 ms
060109 154730 status: 0.17515436 pages/s, 20.703678 kb/s, 15129.917 bytes/page

-.-.-.-.-.-
What is  java.net.SocketTimeoutException?
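
To see at a glance how the failures in a fetch log like the one above split between "retry later" and real connection problems, the failed fetches can be tallied by exception class. The following is a minimal sketch, assuming each log record starts with a "yymmdd hhmmss" timestamp and that a mail client may have hard-wrapped records across lines. The helper names `join_wrapped` and `tally_failures` are invented for this illustration, and the sample is a short excerpt from the log.

```python
import re
from collections import Counter

# A record starts with a "yymmdd hhmmss " timestamp; any other non-empty
# line is treated as a continuation created by hard wrapping.
RECORD_START = re.compile(r"^\d{6} \d{6} ")
FAILURE = re.compile(r"fetch of (\S+) failed with: (.*)$")
EXC_CLASS = re.compile(r"([\w.]+(?:Exception|RetryLater)):")

def join_wrapped(lines):
    """Re-join hard-wrapped log lines into one string per record."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records

def tally_failures(lines):
    """Count failed fetches by the innermost exception class named."""
    counts = Counter()
    for record in join_wrapped(lines):
        m = FAILURE.search(record)
        if m:
            classes = EXC_CLASS.findall(m.group(2))
            counts[classes[-1] if classes else "unknown"] += 1
    return counts

sample = """\
060109 154722 fetch of http://www.niap.org/ failed with:
java.lang.Exception: java.net.SocketTimeoutException: connect timed out
060109 154724 fetch of
http://www.niap.no/index.php/magasinet/nyheter/nord_amerika failed with:
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet failed with:
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later.
""".splitlines()

counts = tally_failures(sample)
for name, n in counts.most_common():
    print(n, name)
```

Run over the whole log, a tally dominated by RetryLater points at throttling on a single host rather than at a broken network connection.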





Re: Problem with fetching segment

Posted by "Håvard W. Kongsgård" <h....@niap.no>.

Isn't the fetcher supposed to fetch all the documents from the URLs
provided in the urls.txt file?
The fetch process takes only a few seconds, and the whole quick tutorial
is done in a minute.




Re: Problem with fetching segment

Posted by Stefan Groschupf <sg...@media-style.com>.

I cannot see any problems in your log; it successfully fetched 3 pages.
Can you provide a more specific problem description?
