Posted to user@nutch.apache.org by th...@yahoo.co.uk on 2006/03/01 17:56:02 UTC

Exception from crawl command

Hi,

I've been experimenting with Nutch and Lucene;
everything was working fine, but now I'm getting an
exception thrown from the crawl command.

The command manages a few fetch cycles but then I get
the following message:

060301 161128 status: segment 20060301161046, 38
pages, 0 errors, 856591 bytes, 41199 ms
060301 161128 status: 0.92235243 pages/s, 162.43396
kb/s, 22541.87 bytes/page
060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
060301 161129 Updating for
C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
060301 161129 Processing document 0
060301 161130 Finishing update
060301 161130 Processing pagesByURL: Sorted 952
instructions in 0.02 seconds.
060301 161130 Processing pagesByURL: Sorted 47600.0
instructions/second
java.io.IOException: already exists:
C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
	at
org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
	at
org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
	at
org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
	at
org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
	at
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
	at
org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
Exception in thread "main" 

Does anyone have any idea what the problem is likely
to be?  I am running Nutch 0.7.1.

thanks,


Julian.
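For context, the "already exists" message above is the writer refusing to
create an output directory that is still present, most likely left behind by
an update pass that did not finish cleanly (whether from a previous run or an
earlier cycle of the same crawl). The sketch below is not the actual Nutch
source, just a minimal illustration of that failure mode; the path is taken
from the log above and the class name is made up for the example.

import java.io.File;
import java.io.IOException;

// Minimal sketch of the failure mode (not the Nutch MapFile.Writer itself):
// the constructor refuses to create an output directory that already exists,
// so a webdb.new left behind by an interrupted update makes the next
// database close fail with "already exists".
public class MapDirWriterSketch {

    private final File dir;

    public MapDirWriterSketch(String dirName) throws IOException {
        dir = new File(dirName);
        if (dir.exists()) {
            // This is the check the stack trace above runs into.
            throw new IOException("already exists: " + dirName);
        }
        if (!dir.mkdirs()) {
            throw new IOException("could not create: " + dirName);
        }
    }

    public static void main(String[] args) throws IOException {
        String path = "LIVE/db/webdb.new/pagesByURL"; // example path
        new MapDirWriterSketch(path); // succeeds the first time
        new MapDirWriterSketch(path); // throws "already exists: ..."
    }
}

In other words, once an earlier cycle has partly written webdb.new, the next
update step hits the existing directory and aborts exactly as in the trace.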

Re: Exception from crawl command

Posted by th...@yahoo.co.uk.
Hi,

I haven't tried the 0.8 version yet; I might give it a
look if I can find the time.

I've investigated the problem a little more, and it
seems to be related to having a high value for
"http.content.limit" and parsing PDF files.  (The site
probably only exceeds the default limit for PDF
files, so it might just be files above that size.)

I'm hoping to avoid indexing the PDFs, so I'm not
going to worry about it at the moment.

thanks,


Julian.
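For anyone who later hits the same combination, one way to keep PDFs out of
the crawl entirely is sketched below, using the same nutch-site.xml
properties quoted further down in this thread. It simply drops pdf from the
parse plugins and keeps the content limit finite; the 65536 value is assumed
to be the stock default and should be checked against nutch-default.xml.

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Same plugin list as in the config below, but without
  parse-pdf, so oversized PDF documents are never parsed.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>Keep downloads capped instead of -1 (unlimited);
  65536 is assumed to be the default and should be verified against
  nutch-default.xml.</description>
</property>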


--- sudhendra seshachala <su...@yahoo.com> wrote:

> Okay.
>   Have you tried the 0.8 version? It seems to be
> more stable than the 0.7.x you are using.
>   It is a bit different too, with Hadoop and Nutch
> being separate.
>   I had a few issues using 0.7.x, but with the
> nightly build (0.8) I was up to speed comparatively
> sooner.
>    
>   I hope this helps. I am not trying to walk away
> from the problem, just noting that the next release
> is more stable and, moreover, there is no backward
> compatibility for 0.8.x. (That is what I read in one
> of the mails in the archive.) You are better off
> using 0.8.
>    
>   Thanks
>   Sudhi

Re: Exception from crawl command

Posted by sudhendra seshachala <su...@yahoo.com>.
Okay.
  Have you tried the 0.8 version? It seems to be more stable than the 0.7.x you are using.
  It is a bit different too, with Hadoop and Nutch being separate.
  I had a few issues using 0.7.x, but with the nightly build (0.8) I was up to speed comparatively sooner.
   
  I hope this helps. I am not trying to walk away from the problem, just noting that the next release is more stable and, moreover, there is no backward compatibility for 0.8.x. (That is what I read in one of the mails in the archive.) You are better off using 0.8.
   
  Thanks
  Sudhi
   
  

throwawayuseridfor-nutch@yahoo.co.uk wrote:
  Hi,

Sorry for the fumbled reply. I've tried deleting the
directory and starting the crawl from scratch a number
of times, with very similar results.

The system seems to generate the exception after the
fetch block of the output, at an apparently arbitrary
depth. It leaves the directory with a db
folder containing:

Mar 2 09:30 dbreadlock
Mar 2 09:31 dbwritelock
Mar 2 09:30 webdb
Mar 2 09:31 webdb.new

The webdb.new folder contains:

Mar 2 09:30 pagesByURL
Mar 2 09:30 stats
Mar 2 09:31 tmp

I have the following set in my nutch-site.xml file:

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the
  RegexUrlNormalizer class.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will
  be truncated; otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain
  text via HTTP, and basic indexing and search plugins.
  </description>
</property>

I don't think any of this should cause the problem. 
I'm going to try reinstalling and setting everything
up again, but if anyone has any idea what the problem
might be then please let me know.

cheers,


Julian.


--- sudhendra seshachala wrote:

> Delete the folder/database and then re-issue the
> crawl command.
> The database/folder gets created when the crawl is
> run.
> I am a recent user too, but I did get the same
> message, and I corrected it by deleting the folder.
> If anyone has better ideas, please share.
> 
> Thanks
> 
> throwawayuseridfor-nutch@yahoo.co.uk wrote:
> Hi,
> 
> I've been experimenting with Nutch and Lucene;
> everything was working fine, but now I'm getting an
> exception thrown from the crawl command.
> 
> The command manages a few fetch cycles but then I
> get
> the following message:
> 
> 060301 161128 status: segment 20060301161046, 38
> pages, 0 errors, 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396
> kb/s, 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for
> C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952
> instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0
> instructions/second
> java.io.IOException: already exists:
> C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
> at
> org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
> at
>
org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
> at
>
org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
> at
>
org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
> at
>
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
> at
>
org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main" 
> 
> Does anyone have any idea what the problem is
> likely
> to be?  I am running Nutch 0.7.1.
> 
> thanks,
> 
> 
> Julian.
> 
> 
> 
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/
> 
> 
> 
> 




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

Re: Exception from crawl command

Posted by th...@yahoo.co.uk.
Hi,

Sorry for the fumbled reply. I've tried deleting the
directory and starting the crawl from scratch a number
of times, with very similar results.

The system seems to generate the exception after the
fetch block of the output, at an apparently arbitrary
depth. It leaves the directory with a db
folder containing:

Mar  2 09:30 dbreadlock
Mar  2 09:31 dbwritelock
Mar  2 09:30 webdb
Mar  2 09:31 webdb.new

The webdb.new folder contains:

Mar  2 09:30 pagesByURL
Mar  2 09:30 stats
Mar  2 09:31 tmp

I have the following set in my nutch-site.xml file:

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the
  RegexUrlNormalizer class.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will
  be truncated; otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain
  text via HTTP, and basic indexing and search plugins.
  </description>
</property>

I don't think any of this should cause the problem. 
I'm going to try reinstalling and setting everything
up again, but if anyone has any idea what the problem
might be then please let me know.

cheers,


Julian.


--- sudhendra seshachala <su...@yahoo.com> wrote:

> Delete the folder/database and then re-issue the
> crawl command.
>   The database/folder gets created when the crawl is
> run.
>   I am a recent user too, but I did get the same
> message, and I corrected it by deleting the folder.
> If anyone has better ideas, please share.
>    
>   Thanks
>    
>   throwawayuseridfor-nutch@yahoo.co.uk wrote:
>   Hi,
> 
> I've been experimenting with Nutch and Lucene;
> everything was working fine, but now I'm getting an
> exception thrown from the crawl command.
> 
> The command manages a few fetch cycles but then I
> get
> the following message:
> 
> 060301 161128 status: segment 20060301161046, 38
> pages, 0 errors, 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396
> kb/s, 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for
> C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952
> instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0
> instructions/second
> java.io.IOException: already exists:
> C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
> at
> org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
> at
>
org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
> at
>
org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
> at
>
org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
> at
>
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
> at
>
org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main" 
> 
> Does anyone have any idea what the problem is
> likely
> to be?  I am running Nutch 0.7.1.
> 
> thanks,
> 
> 
> Julian.
> 
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
> 		


Re: Exception from crawl command

Posted by th...@yahoo.co.uk.
--- sudhendra seshachala <su...@yahoo.com> wrote:

> Delete the folder/database and then re-issue the
> crawl command.
>   The database/folder gets created when the crawl is
> run.
>   I am a recent user too, but I did get the same
> message, and I corrected it by deleting the folder.
> If anyone has better ideas, please share.
>    
>   Thanks
>    
>   throwawayuseridfor-nutch@yahoo.co.uk wrote:
>   Hi,
> 
> I've been experimenting with Nutch and Lucene;
> everything was working fine, but now I'm getting an
> exception thrown from the crawl command.
> 
> The command manages a few fetch cycles but then I
> get
> the following message:
> 
> 060301 161128 status: segment 20060301161046, 38
> pages, 0 errors, 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396
> kb/s, 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for
> C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952
> instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0
> instructions/second
> java.io.IOException: already exists:
> C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
> at
> org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
> at
>
org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
> at
>
org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
> at
>
org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
> at
>
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
> at
>
org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main" 
> 
> Does anyone have any idea what the problem is
> likely
> to be?  I am running Nutch 0.7.1.
> 
> thanks,
> 
> 
> Julian.
> 
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
> 		


Re: Exception from crawl command

Posted by sudhendra seshachala <su...@yahoo.com>.
Delete the folder/database and then re-issue the crawl command.
  The database/folder gets created when the crawl is run.
  I am a recent user too, but I did get the same message, and I corrected it by deleting the folder. If anyone has better ideas, please share.
   
  Thanks
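
For anyone scripting the cleanup described above, a small stand-alone helper
along these lines would do it. This is only a sketch, not part of Nutch, and
the class name is made up; point it at the crawl output directory (e.g. the
LIVE folder from the logs above), or at db\webdb.new alone for the narrower
fix, before re-issuing the crawl.

import java.io.File;

// Hypothetical stand-alone helper (not part of Nutch): recursively delete a
// stale crawl database directory so the next crawl/updatedb run can recreate
// it from scratch.
public class CleanStaleDb {

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        if (!f.delete()) {
            System.err.println("could not delete " + f);
        }
    }

    public static void main(String[] args) {
        // Pass the directory to remove on the command line, for example
        // C:\PF\nutch-0.7.1\LIVE or C:\PF\nutch-0.7.1\LIVE\db\webdb.new.
        if (args.length != 1) {
            System.err.println("usage: java CleanStaleDb <crawl-dir>");
            return;
        }
        deleteRecursively(new File(args[0]));
    }
}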
   
  throwawayuseridfor-nutch@yahoo.co.uk wrote:
  Hi,

I've been experimenting with Nutch and Lucene;
everything was working fine, but now I'm getting an
exception thrown from the crawl command.

The command manages a few fetch cycles but then I get
the following message:

060301 161128 status: segment 20060301161046, 38
pages, 0 errors, 856591 bytes, 41199 ms
060301 161128 status: 0.92235243 pages/s, 162.43396
kb/s, 22541.87 bytes/page
060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
060301 161129 Updating for
C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
060301 161129 Processing document 0
060301 161130 Finishing update
060301 161130 Processing pagesByURL: Sorted 952
instructions in 0.02 seconds.
060301 161130 Processing pagesByURL: Sorted 47600.0
instructions/second
java.io.IOException: already exists:
C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
at
org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
at
org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at
org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at
org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
at
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
at
org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
Exception in thread "main" 

Does anyone have any idea what the problem is likely
to be?  I am running Nutch 0.7.1.

thanks,


Julian.



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		