Posted to user@nutch.apache.org by Denis Haskin <de...@haskinferguson.net> on 2005/10/04 18:50:47 UTC

Always getting "Impossible condition" now...

I've been trying to do some experimentation with nutch 0.7.1 (this is on 
Windows 2000).

I set things up to crawl a local drive (well, actually a network mapped 
drive) and it seemed to work fine.  I let it run for a bit but then 
aborted it because I wanted to adjust something.

I deleted all the crawl-* directories, but now when I try to run it I am 
always getting this error:

051004 120331 Updating 
D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
Exception in thread "main" java.io.IOException: Impossible condition: 
directories 
D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and 
D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot 
exist simultaneously

The complete crawl output is below.  I am baffled by why this is happening.

My urls file just has:
file:///d:/

Thanks for any assistance you can provide...

dwh

--- output from crawl ---

D:\workspaces\work\nutch-0.7.1>java -classpath conf;nutch-0.7.1.jar;build\classes;lib;lib\commons-logging-api-1.0.4.jar;lib\concurrent-1.3.4.jar;lib\jakarta-oro-2.0.7.jar;lib\jetty-5.1.2.jar;lib\junit-3.8.1.jar;lib\lucene-1.9-rc1-dev.jar;lib\lucene-misc-1.9-rc1-dev.jar;lib\servlet-api.jar;lib\taglibs-i18n.jar;lib\taglibs-i18n.tld;lib\xerces-2_6_2-apis.jar;lib\xerces-2_6_2.jar;. org.apache.nutch.tools.CrawlTool crawl urls
051004 120328 parsing 
file:/D:/workspaces/work/nutch-0.7.1/conf/nutch-default.xml
051004 120328 parsing 
file:/D:/workspaces/work/nutch-0.7.1/conf/crawl-tool.xml
051004 120328 parsing 
file:/D:/workspaces/work/nutch-0.7.1/conf/nutch-site.xml
051004 120328 No FS indicated, using default:local
051004 120328 crawl started in: crawl-20051004120328
051004 120328 rootUrlFile = urls
051004 120328 threads = 10
051004 120328 depth = 5
051004 120328 Created webdb at 
LocalFS,D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
051004 120328 Starting URL processing
051004 120328 Plugins: looking in: D:\workspaces\work\nutch-0.7.1\plugins
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\clustering-carrot2
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\creativecommons
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\index-basic\plugin.xml
051004 120328 impl: point=org.apache.nutch.indexer.IndexingFilter 
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\index-more
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\language-identifier
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\ontology
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\parse-ext
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\parse-html\plugin.xml
051004 120328 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.html.HtmlParser
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\parse-js
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\parse-msword
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\parse-pdf
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\parse-rss
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\parse-text\plugin.xml
051004 120328 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.text.TextParser
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\protocol-file\plugin.xml
051004 120328 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.file.File
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\protocol-ftp
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\protocol-http\plugin.xml
051004 120328 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.http.Http
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\protocol-httpclient
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\query-basic\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.basic.BasicQueryFilter
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\query-more
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\query-site\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.site.SiteQueryFilter
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\query-url\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.url.URLQueryFilter
051004 120328 not including: 
D:\workspaces\work\nutch-0.7.1\plugins\urlfilter-prefix
051004 120328 parsing: 
D:\workspaces\work\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
051004 120328 impl: point=org.apache.nutch.net.URLFilter 
class=org.apache.nutch.net.RegexURLFilter
051004 120328 found resource crawl-urlfilter.txt at 
file:/D:/workspaces/work/nutch-0.7.1/conf/crawl-urlfilter.txt
051004 120328 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051004 120328 Added 1 pages
051004 120328 Processing pagesByURL: Sorted 1 instructions in 0.0 seconds.
051004 120328 Processing pagesByURL: Sorted Infinity instructions/second
051004 120328 Processing pagesByURL: Merged to new DB containing 1 
records in 0.0 seconds
051004 120328 Processing pagesByURL: Merged Infinity records/second
051004 120328 Processing pagesByMD5: Sorted 1 instructions in 0.031 seconds.
051004 120328 Processing pagesByMD5: Sorted 32.25806451612903 
instructions/second
051004 120328 Processing pagesByMD5: Merged to new DB containing 1 
records in 0.0 seconds
051004 120328 Processing pagesByMD5: Merged Infinity records/second
051004 120328 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051004 120328 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.
051004 120328 FetchListTool started
051004 120329 Processing pagesByURL: Sorted 1 instructions in 0.047 seconds.
051004 120329 Processing pagesByURL: Sorted 21.27659574468085 
instructions/second
051004 120329 Processing pagesByURL: Merged to new DB containing 1 
records in 0.0 seconds
051004 120329 Processing pagesByURL: Merged Infinity records/second
051004 120329 Processing pagesByMD5: Sorted 1 instructions in 0.016 seconds.
051004 120329 Processing pagesByMD5: Sorted 62.5 instructions/second
051004 120329 Processing pagesByMD5: Merged to new DB containing 1 
records in 0.0 seconds
051004 120329 Processing pagesByMD5: Merged Infinity records/second
051004 120329 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051004 120329 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
051004 120329 Processing D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\segments\20051004120328\fetchlist.unsorted: Sorted 1 entries in 0.015 seconds.
051004 120329 Processing D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\segments\20051004120328\fetchlist.unsorted: Sorted 66.66666666666667 entries/second
051004 120329 Overall processing: Sorted 1 entries in 0.015 seconds.
051004 120329 Overall processing: Sorted 0.015 entries/second
051004 120329 FetchListTool completed
051004 120329 logging at INFO
051004 120329 fetching file:///d:/
051004 120330 status: segment 20051004120328, 1 pages, 0 errors, 11062 
bytes, 1000 ms
051004 120330 status: 1.0 pages/s, 86.421875 kb/s, 11062.0 bytes/page
051004 120331 Updating 
D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
Exception in thread "main" java.io.IOException: Impossible condition: directories D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot exist simultaneously
        at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1484)
        at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1457)
        at 
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:360)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)


Re: Always getting "Impossible condition" now...

Posted by Denis Haskin <de...@haskinferguson.net>.
This sounded pretty likely, but it doesn't seem to have had an effect.  
I excluded the entire directory tree nutch was in from the virus 
scanner (that didn't seem to help), and then also tried stopping the 
antivirus service entirely; still no effect.

Dang...

But thanks,

dwh


Russell Mayor wrote:

>You may be experiencing a problem that I did recently.
>
>When nutch deletes unneeded directory trees (like webdb.old), it does so
>with a recursive delete, but it often does not check whether the individual
>file-delete calls succeeded. As a result, directories that are thought to
>have been deleted can still be present.
>
>In my case, I found that other programs running on my machine (Win 2000),
>such as a virus checker, could hold a lock on one of the files that nutch
>was trying to delete, so Windows would stop nutch from deleting it. The
>result was the exception that you report.
>
>In my case I got around the problem by instructing the virus checker not to
>check nutch's working files.
>
>Russell
>  
>


Re: Always getting "Impossible condition" now...

Posted by Russell Mayor <ru...@gmail.com>.
You may be experiencing a problem that I did recently.

When nutch deletes unneeded directory trees (like webdb.old), it does so
with a recursive delete, but it often does not check whether the individual
file-delete calls succeeded. As a result, directories that are thought to
have been deleted can still be present.

In my case, I found that other programs running on my machine (Win 2000),
such as a virus checker, could hold a lock on one of the files that nutch
was trying to delete, so Windows would stop nutch from deleting it. The
result was the exception that you report.

In my case I got around the problem by instructing the virus checker not to
check nutch's working files.
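
To illustrate the failure mode, here is a minimal sketch (not Nutch's
actual code; CheckedDelete and deleteRecursively are made-up names).
java.io.File.delete() reports failure by returning false rather than
throwing, so a recursive delete that ignores that boolean can leave a
locked file, and therefore webdb.old itself, in place while the caller
believes the tree is gone:

import java.io.File;
import java.io.IOException;

public class CheckedDelete {

  // Recursively delete a file or directory tree, failing loudly
  // instead of silently when a delete call does not succeed
  // (e.g. because a virus scanner holds a lock on a file).
  public static void deleteRecursively(File f) throws IOException {
    if (f.isDirectory()) {
      File[] children = f.listFiles();
      if (children != null) {
        for (int i = 0; i < children.length; i++) {
          deleteRecursively(children[i]);
        }
      }
    }
    // File.delete() signals failure via its return value, not an
    // exception; ignoring it is how webdb.old can "survive" a delete.
    if (!f.delete()) {
      throw new IOException("Could not delete " + f.getPath());
    }
  }
}

With a check like this, a virus-scanner lock would surface as an
immediate IOException at delete time instead of as the confusing
"Impossible condition" error on the next run.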

Russell

On 10/4/05, Denis Haskin <de...@haskinferguson.net> wrote:
>
> I was deleting the whole crawl-2005... directory tree (all of them
> that I have). I still get the error.
>
> Thanks,
>
> dwh
>
>
> Gal Nitzan wrote:
>
> > Just delete
> > D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old
> >
> > Gal
> >
>
>

Re: Always getting "Impossible condition" now...

Posted by Denis Haskin <de...@haskinferguson.net>.
I was deleting the whole crawl-2005... directory tree (all of them 
that I have).  I still get the error.

Thanks,

dwh


Gal Nitzan wrote:

> Just delete 
> D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old
>
> Gal
>


Re: Always getting "Impossible condition" now...

Posted by Gal Nitzan <gn...@usa.net>.
Denis Haskin wrote:
> I've been trying to do some experimentation with nutch 0.7.1 (this is 
> on Windows 2000).
>
> I set things up to crawl a local drive (well, actually a network 
> mapped drive) and it seemed to work fine.  I let it run for a bit but 
> then aborted it because I wanted to adjust something.
>
> I deleted all the crawl-* directories, but now when I try to run it I 
> am always getting this error:
>
> 051004 120331 Updating 
> D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
> Exception in thread "main" java.io.IOException: Impossible condition: 
> directories 
> D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and 
> D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot 
> exist simultaneously
>
> The complete crawl output is below.  I am baffled by why this is 
> happening.
>
> My urls file just has:
> file:///d:/
>
> Thanks for any assistance you can provide...
>
> dwh
>
> [crawl output snipped; identical to the log in the original post above]
>
Just delete D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old
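
For example, from a Windows command prompt (using the path from your
log; the crawl-... directory name will differ on each run):

rmdir /s /q "D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old"

(rmdir /s /q removes the whole directory tree without prompting.)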

Gal