Posted to user@nutch.apache.org by Pantelis <pk...@hotmail.com> on 2012/03/12 13:32:54 UTC

Re: Exception in thread "main" java.io.IOException: Job failed!

Hi,
I am having the same problem (newbie to Nutch too).
I am using Nutch 1.4 on Windows 7 with Cygwin.
If I understand correctly, the crawling process should create segments, and
each of those segments corresponds to a folder under
NUTCH_HOME/runtime/local/crawl/segment_number. Under each segment_number
folder a parse_data folder should then be created, but apparently it is not.
My linkdb folder (NUTCH_HOME/runtime/local/linkdb) is also empty.
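For reference, if I read the docs correctly, a fully fetched and parsed
segment should contain subfolders along these lines (using one of my
segment timestamps as an example):

$ ls crawl/segments/20120312133811
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text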

Output follows:


$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-03-12 13:38:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-03-12 13:38:09, elapsed: 00:00:02
Generator: starting at 2012-03-12 13:38:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133811
Generator: finished at 2012-03-12 13:38:12, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:12
Fetcher: segment: crawl/segments/20120312133811
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:17, elapsed: 00:00:04
ParseSegment: starting at 2012-03-12 13:38:17
ParseSegment: segment: crawl/segments/20120312133811
Parsing: http://nutch.apache.org/
ParseSegment: finished at 2012-03-12 13:38:18, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:18
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133811]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:19, elapsed: 00:00:01
Generator: starting at 2012-03-12 13:38:19
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133822
Generator: finished at 2012-03-12 13:38:23, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:23
Fetcher: segment: crawl/segments/20120312133822
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/wiki.html
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.apache.org/
fetching http://www.eu.apachecon.com/c/aceu2009/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with:
java.net.UnknownHostException: www.eu.apachecon.com
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552304945
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552303927
  now           = 1331552304949
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552305950
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552305953
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552306955
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552306957
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552307958
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552307959
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552308961
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552308963
  0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://nutch.apache.org/mailing_lists.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:31, elapsed: 00:00:08
ParseSegment: starting at 2012-03-12 13:38:31
ParseSegment: segment: crawl/segments/20120312133822
Parsing: http://nutch.apache.org/mailing_lists.html
Parsing: http://nutch.apache.org/wiki.html
Parsing: http://www.apache.org/
Parsing: http://www.apache.org/dyn/closer.cgi/nutch/
ParseSegment: finished at 2012-03-12 13:38:33, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133822]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:34, elapsed: 00:00:01
Generator: starting at 2012-03-12 13:38:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133836
Generator: finished at 2012-03-12 13:38:38, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:38
Fetcher: segment: crawl/segments/20120312133836
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://hadoop.apache.org/
Using queue mode : byHost
fetching http://nutch.apache.org/index.html
Using queue mode : byHost
fetching http://www.apache.org/licenses/
Using queue mode : byHost
Using queue mode : byHost
fetching http://tika.apache.org/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552319434
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552320435
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552321436
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552322438
  0. http://www.apache.org/foundation/sponsorship.html
fetching http://www.apache.org/foundation/sponsorship.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:45, elapsed: 00:00:07
ParseSegment: starting at 2012-03-12 13:38:45
ParseSegment: segment: crawl/segments/20120312133836
Parsing: http://hadoop.apache.org/
Parsing: http://nutch.apache.org/index.html
Parsing: http://tika.apache.org/
Parsing: http://www.apache.org/foundation/sponsorship.html
Parsing: http://www.apache.org/licenses/
ParseSegment: finished at 2012-03-12 13:38:46, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:46
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133836]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:48, elapsed: 00:00:01
LinkDb: starting at 2012-03-12 13:38:48
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312131223
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132729
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132952
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133110
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133255
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133409
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133811
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133822
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133836
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312131223/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132729/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132952/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133110/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133255/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133409/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
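In case it helps with diagnosing this, a quick shell loop (just a sketch,
run from runtime/local) can list which segments are missing parse_data:

$ for s in crawl/segments/*; do [ -d "$s/parse_data" ] || echo "$s"; done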


Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Pantelis <pk...@hotmail.com>.
Hi, I think I managed to address this issue.
What I did was to also add
+^http://([a-z0-9]*\.)*apache.org/
to regex-urlfilter.txt in $NUTCH_HOME/conf.
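As far as I understand, the rules in regex-urlfilter.txt are applied top
to bottom and the first matching rule wins, so the accept line has to
come before any catch-all reject, roughly like this (the comments are
mine):

# accept URLs on apache.org hosts
+^http://([a-z0-9]*\.)*apache.org/
# reject everything else
-.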
I guess both regex-urlfilter.txt and nutch-site.xml need to be kept in
sync in both locations, i.e.
$NUTCH_HOME/conf and $NUTCH_HOME/runtime/local/conf.
Is that correct?
In any case, this was the only modification I made and the crawling worked.
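For what it's worth, since the job runs from runtime/local, it reads the
copies under $NUTCH_HOME/runtime/local/conf. I believe rebuilding the
runtime from the top-level directory copies conf changes over, so
something like this should keep the two locations in sync (a sketch):

$ cd $NUTCH_HOME
$ ant runtime    # rebuilds runtime/local, copying conf/ changes along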
