Posted to user@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2006/04/08 14:41:24 UTC

fetching stuck in the middle of processing

Hi,

My fetch job is stuck somewhere in the middle of
fetching. When I took a look at the fetch log, I
found the following error messages.

Any idea why this happens? Should I stop the current
session and restart crawling?

By the way, my content limit is set to
unlimited: "<name>http.content.limit</name>
<value>-1</value>". Could that be the reason?
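For reference, the full property block in nutch-site.xml would look roughly like this (the description text here is mine, not copied from the Nutch defaults; -1 disables truncation of downloaded content):

```xml
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum length in bytes of downloaded content;
  -1 means the content is never truncated.</description>
</property>
```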

thanks

"
060408 072825 SEVERE error writing output: java.io.IOException: key out of order: 33079 after 33079
java.io.IOException: key out of order: 33079 after 33079
	at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
	at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
	at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:318)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:301)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:160)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:394)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:528)
"

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: fetching stuck in the middle of processing

Posted by Michael Ji <fj...@yahoo.com>.
hi Andrzej:

My Linux disk usage is as follows:
"
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda3            132645028  60431892  65475076  48% /
/dev/sda1               101089     20607     75263  22% /boot
none                   2060348         0   2060348   0% /dev/shm
"

I don't see any sign of running out of disk space.

thanks,

Michael

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Michael Ji wrote:
> > Hi,
> >
> > My fetch job is stuck somewhere in the middle of
> > fetching. When I took a look at the fetch log, I
> got
> > the following error messages.
> >
> > Any idea why this happens? Should I stop the
> > current session and restart crawling?
> >
> > By the way, my content limit is set to
> > unlimited: "<name>http.content.limit</name>
> > <value>-1</value>". Could that be the reason?
> >   
> 
> These particular problems happen, among other
> reasons, when you run out of disk 
> space - please check that you have enough disk
> space, including on your /tmp 
> partition.
> 



Re: fetching stuck in the middle of processing

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Ji wrote:
> Hi,
>
> My fetch job is stuck somewhere in the middle of
> fetching. When I took a look at the fetch log, I
> found the following error messages.
>
> Any idea why this happens? Should I stop the
> current session and restart crawling?
>
> By the way, my content limit is set to
> unlimited: "<name>http.content.limit</name>
> <value>-1</value>". Could that be the reason?
>   

These particular problems happen, among other reasons, when you run out of 
disk space - please check that you have enough disk space, including on your 
/tmp partition.
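A quick way to rule this out (a generic check, not Nutch-specific) is to look at free space on both /tmp and the partition holding the Nutch data directory:

```shell
# Show free space on /tmp and on the filesystem backing the current
# directory (run this from the Nutch data directory); an 'Avail' value
# near zero on either line can surface as low-level write failures.
df -h /tmp .
```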

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: refetching interval

Posted by Michael Ji <fj...@yahoo.com>.
Hi Gal:

Yes, I have db.max.per.host set to 1.

Another interesting thing I found: when I added
debug printouts of page information in
FetchListTool.java during generation, I checked the
log and found "...Next fetch: Fri Apr 14 19:49:3...".
This webdb was generated on April 9, and the refetch
interval is set to 1 day.

Shouldn't the "Next fetch" date be around April 10th?

Why does this happen?
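The expected date can be checked with quick date arithmetic (plain java.time, not Nutch code): with an interval of 1 day, a page fetched on April 9 should come due around April 10, so a next-fetch of April 14 suggests the pages in the db still carry a larger interval (for example, one assigned before the config change):

```java
import java.time.LocalDate;

public class NextFetchSketch {
    // The next-fetch date is simply the last fetch date plus the interval in days.
    static LocalDate nextFetch(LocalDate lastFetch, int intervalDays) {
        return lastFetch.plusDays(intervalDays);
    }

    public static void main(String[] args) {
        // Interval of 1 day: due the next day.
        System.out.println(nextFetch(LocalDate.of(2006, 4, 9), 1)); // 2006-04-10
        // A "Next fetch" of April 14 would instead imply a 5-day interval.
        System.out.println(nextFetch(LocalDate.of(2006, 4, 9), 5)); // 2006-04-14
    }
}
```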

thanks,

Michael,

--- Gal Nitzan <gn...@usa.net> wrote:

> 
> What about db.max.per.host? is it set to -1 ?
> 
> 
> -----Original Message-----
> From: Michael Ji [mailto:fji_00@yahoo.com] 
> Sent: Monday, April 10, 2006 3:18 AM
> To: nutch-user@lucene.apache.org
> Subject: refetching interval
> 
> hi there,
> 
> I have a webdb with over 60,000 pages (counted with
> the nutch admin dumptxt command), and the refetch
> interval is set to 1 day:
> 
> <property>
>   <name>db.default.fetch.interval</name>
>   <value>1</value>
>   <description>The default number of days between
> re-fetches of a page.
>   </description>
> </property>
> 
> But when I crawl based on this webdb the next day,
> the generate log shows only around 8,000 pages
> generated for fetching, and about 7,500 pages are
> actually fetched.
> 
> Why does it behave like that? Shouldn't all 60,000
> pages be fetched this time?
> 
> thanks,
> 
> Michael,
> 
> 
> 
> 



RE: refetching interval

Posted by Gal Nitzan <gn...@usa.net>.
What about db.max.per.host? is it set to -1 ?
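For reference, the corresponding entry in nutch-site.xml would look roughly like this (property name as used in this thread; the description text is mine; -1 disables the per-host cap, which otherwise can shrink a 60,000-page db to a few thousand generated URLs when the pages cluster on a few hosts):

```xml
<property>
  <name>db.max.per.host</name>
  <value>-1</value>
  <description>Maximum number of pages per host selected into a
  fetchlist; -1 means no per-host limit.</description>
</property>
```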


-----Original Message-----
From: Michael Ji [mailto:fji_00@yahoo.com] 
Sent: Monday, April 10, 2006 3:18 AM
To: nutch-user@lucene.apache.org
Subject: refetching interval

hi there,

I have a webdb with over 60,000 pages (counted with
the nutch admin dumptxt command), and the refetch
interval is set to 1 day:

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between
re-fetches of a page.
  </description>
</property>

But when I crawl based on this webdb the next day,
the generate log shows only around 8,000 pages
generated for fetching, and about 7,500 pages are
actually fetched.

Why does it behave like that? Shouldn't all 60,000
pages be fetched this time?

thanks,

Michael,




refetching interval

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

I have a webdb with over 60,000 pages (counted with
the nutch admin dumptxt command), and the refetch
interval is set to 1 day:

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between
re-fetches of a page.
  </description>
</property>

But when I crawl based on this webdb the next day,
the generate log shows only around 8,000 pages
generated for fetching, and about 7,500 pages are
actually fetched.

Why does it behave like that? Shouldn't all 60,000
pages be fetched this time?

thanks,

Michael,
