Posted to dev@nutch.apache.org by "Stephen Cross (JIRA)" <ji...@apache.org> on 2005/10/19 14:43:44 UTC
[jira] Created: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
-------------------------------------------------------------------------------------------------------------
Key: NUTCH-117
URL: http://issues.apache.org/jira/browse/NUTCH-117
Project: Nutch
Type: Bug
Versions: 0.7.1, 0.7, 0.6
Environment: Windows 2000 P4 1.70GHz 512MB RAM
Java 1.5.0_05
Reporter: Stephen Cross
Priority: Critical
I started a crawl from the command line using Nutch 0.7.1.
nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20
After crawling for over 15 hours, the crawl crashed with the following exception:
051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms
051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438
051019 050544 Processing document 0
051019 050544 Finishing update
051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
This was on the 14th segment of the requested depth of 20. A quick Google search on the exception brings up a few earlier posts with the same error but no definitive answer; it seems to have been occurring since Nutch 0.6.
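The top frame of the trace suggests that the writer refuses to open a directory that is already present. Below is a minimal sketch of that kind of guard, assuming (this is not confirmed against MapFile.java) that it is a plain existence check. The class and method names are hypothetical and only illustrate how a leftover directory from an earlier step would trigger the message.

```java
import java.io.File;
import java.io.IOException;

class ExistsGuardSketch {
    // Hypothetical stand-in for a writer that refuses to reuse an existing directory.
    static void openWriter(File dir) throws IOException {
        if (dir.exists()) {
            throw new IOException("already exists: " + dir);
        }
        if (!dir.mkdirs()) {
            throw new IOException("could not create: " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"),
                "pagesByURL-sketch-" + System.nanoTime());
        openWriter(dir);              // first call creates the directory
        try {
            openWriter(dir);          // second call finds the leftover directory
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } finally {
            dir.delete();             // clean up the empty scratch directory
        }
    }
}
```

If the guard works this way, the exception points at a cleanup step that failed earlier rather than at the writer itself.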
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
Posted by "Stephen Cross (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-117?page=comments#action_12332724 ]
Stephen Cross commented on NUTCH-117:
-------------------------------------
I think this is the same problem as a couple of other Nutch issues already in JIRA:
NUTCH-94: MapFile.Writer throwing 'File exists error'
http://issues.apache.org/jira/browse/NUTCH-94
NUTCH-96: MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM
http://issues.apache.org/jira/browse/NUTCH-96
[jira] Commented: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
Posted by "Mike Alulin (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-117?page=comments#action_12363898 ]
Mike Alulin commented on NUTCH-117:
-----------------------------------
I have the same issue on my new production system, although the same code works on dev and on the old production system without any problems.
The solution for this bug is to uncomment "pageDb.close();" in the WebDBWriter.java file. Otherwise the reader keeps the webdb.new\pagesByURL\data file locked, and sometimes the file cannot be deleted.
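The failure mode described above can be illustrated with a small sketch. It is not the WebDBWriter code: the class and method names are hypothetical, and a plain FileInputStream stands in for the reader. The point is only the ordering, that the reader is closed before the directory is deleted. On Windows an open stream generally prevents the file from being deleted, so the directory would survive and a later writer would find that it already exists. On many other platforms the delete succeeds even with an open handle, which may be one reason the problem shows up on some systems and not on others.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class CloseBeforeDeleteSketch {
    // Deletes dir and the files directly inside it; returns true only if dir is gone.
    static boolean deleteDir(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {
            for (File child : children) {
                child.delete();
            }
        }
        dir.delete();
        return !dir.exists();
    }

    // Writes a small data file, reads it back, closes the reader, then deletes the directory.
    static boolean replaceDir(File dir) throws IOException {
        if (!dir.mkdirs()) {
            throw new IOException("could not create: " + dir);
        }
        File data = new File(dir, "data");
        try (FileOutputStream out = new FileOutputStream(data)) {
            out.write(42);
        }
        // try-with-resources closes the reader before the delete below runs;
        // this plays the role of the re-enabled close() call.
        try (FileInputStream reader = new FileInputStream(data)) {
            reader.read();
        }
        return deleteDir(dir);
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"),
                "webdb-sketch-" + System.nanoTime());
        System.out.println("directory removed: " + replaceDir(dir));
    }
}
```

Checking the result of the delete, as deleteDir does here, also makes a failed cleanup visible at the point where it happens instead of later, when the next writer is opened.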
[jira] Commented: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
Posted by "Spike Wang (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-117?page=comments#action_12363204 ]
Spike Wang commented on NUTCH-117:
----------------------------------
I have the same problem when running the crawling functionality of Nutch 0.7 in WAS 5.1 using IBM JDK 1.4, but it runs very well on Tomcat 5.0.28 using Sun JDK 1.4.
Exception in thread "main" java.io.IOException: already exists: %CRAWL_RESULT_HOME%\db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
I checked and debugged the code and found that the relevant files have not been released at the point where they are deleted.
I may add a tested patch to solve this problem.
[jira] Closed: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
Posted by "Piotr Kosiorowski (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-117?page=all ]
Piotr Kosiorowski closed NUTCH-117:
-----------------------------------
Fix Version: 0.7.2-dev
Resolution: Fixed
Assign To: Piotr Kosiorowski
Applied the fix by Mike. Also reported off-list by Michal Karwanski.
[jira] Commented: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
Posted by "Nick Jacobsen (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-117?page=comments#action_12332720 ]
Nick Jacobsen commented on NUTCH-117:
-------------------------------------
I had a similar issue, and it seems (guessing here) to be related to some sort of race condition on file handles. I was running the Nutch crawler while doing some heavy processing (compiling Java on OpenBSD in a virtual machine), and 19 out of 20 times Nutch would crash with that or a similar error, always related to some sort of file not found and sometimes access denied. As soon as I stopped the heavy processing, my Nutch errors went down to 1 out of every 20 runs.
Based on this, I have concluded that it is some sort of file handle race condition. Also, for those of you wondering, it did not matter whether I was running 1 or 30 threads; I had the same problems.
Hope this helps a little.