Posted to user@nutch.apache.org by Daniel Bourrion <da...@univ-angers.fr> on 2012/02/22 16:17:42 UTC

Exception in thread "main" java.io.IOException: Job failed!

Hi.
I'm a French librarian (that explains the bad English coming now... :) )

I'm a newbie on Nutch, which looks like exactly what I'm searching for (an 
open-source solution that can crawl our specific domain and have its 
crawl results pushed into Solr).

I've installed a test Nutch using http://wiki.apache.org/nutch/NutchTutorial
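
For reference, the setup that tutorial walks through boils down to a few
steps. This is only a sketch (the seed file name and the Solr URL are just
examples), but it is the shape a local test crawl takes:

-----
# seed list: one URL per line
mkdir urls
echo 'http://bu.univ-angers.fr/' > urls/seed.txt

# give the crawler a name via http.agent.name in conf/nutch-site.xml,
# restrict conf/regex-urlfilter.txt to your own domain, then:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

# once a Solr instance is running, the same command can also index as it goes
# (the -solr flag is what the "solrUrl is not set" line below refers to):
bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 3 -topN 5
-----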


I got an error, but I don't really understand it, nor where to start 
looking to correct what causes it.

Here's a copy of the error messages - any help welcome.
Best

--------------------------------------------------
daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ 
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-02-22 16:06:04
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
Generator: starting at 2012-02-22 16:06:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120222160609
Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 
'http.robots.agents' property.
Fetcher: starting at 2012-02-22 16:06:10
Fetcher: segment: crawl/segments/20120222160609
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://bu.univ-angers.fr/
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.face-ecran.fr/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
ParseSegment: starting at 2012-02-22 16:06:13
ParseSegment: segment: crawl/segments/20120222160609
Parsing: http://bu.univ-angers.fr/
Parsing: http://www.face-ecran.fr/
Exception in thread "main" java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)


--------------------------------------------------
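
One detail in the output above: the Fetcher warning about 'http.agent.name'
and 'http.robots.agents' is only a warning, but it goes away once both
properties are set consistently in conf/nutch-site.xml, with the agent name
listed first. A sketch, using a placeholder agent name:

-----
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyTestCrawler,*</value>
</property>
-----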

-- 
Avec mes salutations les plus cordiales.
__

Daniel Bourrion, conservateur des bibliothèques
Responsable de la bibliothèque numérique
Ligne directe : 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex

***********************************
" Et par le pouvoir d'un mot
Je recommence ma vie "
                        Paul Eluard
***********************************
blog perso : http://archives.face-ecran.fr/


Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Daniel Bourrion <da...@univ-angers.fr>.
Disk space must be OK:

daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local/logs$ df -h
Sys. de fichiers            Taille  Uti. Disp. Uti% Monté sur
/dev/sda5              73G   65G  3,8G  95% /

Access rights: as I'm testing on my laptop, all files in 
apache-nutch-1.4-bin/ are 777.

Hmm... Time for tea, and to try to understand ;)
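
For what it's worth, the "Cannot run program 'chmod' ... error=12, Cannot
allocate memory" lines in the quoted log below usually point at RAM rather
than disk or permissions: the JVM cannot fork the external chmod process
because too little free memory is left. A few quick checks (a sketch;
NUTCH_HEAPSIZE is the environment variable the bin/nutch script reads to
size the heap, if I remember the script correctly):

-----
free -m                       # how much RAM and swap is really free
tail -n 100 logs/hadoop.log   # the full stack traces end up here
# retry with a smaller heap so fork() has room to succeed:
NUTCH_HEAPSIZE=500 bin/nutch crawl urls -dir crawl -depth 3 -topN 5
-----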


On 23/02/2012 11:47, remi tassing wrote:
> disk size issue?
> access rights?
>
> On Thu, Feb 23, 2012 at 12:39 PM, Daniel Bourrion<
> daniel.bourrion@univ-angers.fr>  wrote:
>
>> Hi Markus
>> Thx for help.
>>
>> (Hope i'm not boring everybody)
>>
>> I've erase everything in crawl/
>>
>> Launching my nutch, got now
>>
>> -----
>> CrawlDb update: 404 purging: false
>> CrawlDb update: Merging segment data into db.
>>
>> Exception in thread "main" java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
>>     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>
>> -----
>>
>>
>> Into the logs, got
>>
>> ____
>>
>>
>> 2012-02-23 11:25:48,803 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
>> false
>> 2012-02-23 11:25:48,804 INFO  crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2012-02-23 11:25:49,353 INFO  regex.RegexURLNormalizer - can't find rules
>> for scope 'crawldb', using default
>> 2012-02-23 11:25:49,560 INFO  regex.RegexURLNormalizer - can't find rules
>> for scope 'crawldb', using default
>> 2012-02-23 11:25:49,985 WARN  mapred.LocalJobRunner - job_local_0007
>> java.io.IOException: Cannot run program "chmod": java.io.IOException:
>> error=12, Cannot allocate memory
>>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
>>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>>     at org.apache.hadoop.util.Shell.run(Shell.java:134)
>>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
>>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
>>     at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
>>     at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
>>     at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
>> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
>>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
>>     at java.lang.ProcessImpl.start(ProcessImpl.java:81)
>>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
>>     ... 15 more
>> _____
>>
>>

-- 
Avec mes salutations les plus cordiales.
__

Daniel Bourrion, conservateur des bibliothèques
Responsable de la bibliothèque numérique
Ligne directe : 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex

***********************************
" Et par le pouvoir d'un mot
Je recommence ma vie "
                        Paul Eluard
***********************************
blog perso : http://archives.face-ecran.fr/


Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Pantelis <pk...@hotmail.com>.
Hi, I think I managed to address this issue.
What I did was to also add
+^http://([a-z0-9]*\.)*apache.org/
to regex-urlfilter.txt in $NUTCH_HOME/conf.
I guess both files, regex-urlfilter.txt AND nutch-site.xml, need to be
updated concurrently in both locations, i.e.
$NUTCH_HOME/conf and $NUTCH_HOME/runtime/local/conf.
Is that correct?
In any case, this was the only modification I made and the crawling worked.
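
As far as I can tell, when running from runtime/local it is
runtime/local/conf that bin/nutch puts on the classpath, while the top-level
$NUTCH_HOME/conf is the copy that gets written into runtime/ when the
runtime is rebuilt with ant; so in a binary install, editing
runtime/local/conf is what actually takes effect. The usual shape of the
domain restriction is to replace the final catch-all "+." line of
regex-urlfilter.txt with the domain rule:

-----
# accept anything under apache.org; URLs matching no rule are rejected
+^http://([a-z0-9]*\.)*apache.org/
-----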

--
View this message in context: http://lucene.472066.n3.nabble.com/Exception-in-thread-main-java-io-IOException-Job-failed-tp3766765p3821757.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Pantelis <pk...@hotmail.com>.
Hi.
I am having the same problem (newbie to Nutch too). I'm using Nutch 1.4 on
Windows 7 with Cygwin.
If I understand correctly, the crawling process should create segments, and
each of those segments corresponds to a folder under
NUTCH_HOME/runtime/local/crawl/segment_number.
Then, under each segment_number folder, a parse_data folder should be
created, which apparently is not being created.
My linkdb folder is empty (NUTCH_HOME/runtime/local/linkdb).
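
That is the right mental model: a segment that has gone through the whole
generate/fetch/parse cycle normally ends up with roughly these subfolders
(a sketch from a healthy 1.x crawl; the readseg options are from memory):

-----
crawl/segments/20120312133811/
    content/  crawl_fetch/  crawl_generate/  crawl_parse/  parse_data/  parse_text/

# quick ways to see which segments are complete:
ls crawl/segments/*/
bin/nutch readseg -list -dir crawl/segments
-----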

Output follows 


$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-03-12 13:38:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-03-12 13:38:09, elapsed: 00:00:02
Generator: starting at 2012-03-12 13:38:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133811
Generator: finished at 2012-03-12 13:38:12, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:12
Fetcher: segment: crawl/segments/20120312133811
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:17, elapsed: 00:00:04
ParseSegment: starting at 2012-03-12 13:38:17
ParseSegment: segment: crawl/segments/20120312133811
Parsing: http://nutch.apache.org/
ParseSegment: finished at 2012-03-12 13:38:18, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:18
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133811]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:19, elapsed: 00:00:01
Generator: starting at 2012-03-12 13:38:19
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133822
Generator: finished at 2012-03-12 13:38:23, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:23
Fetcher: segment: crawl/segments/20120312133822
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/wiki.html
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.apache.org/
fetching http://www.eu.apachecon.com/c/aceu2009/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with:
java.net.UnknownHostException: www.eu.apachecon.com
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552304945
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552303927
  now           = 1331552304949
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552305950
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552305953
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552306955
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552306957
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552307958
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552307959
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552308961
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552308963
  0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://nutch.apache.org/mailing_lists.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:31, elapsed: 00:00:08
ParseSegment: starting at 2012-03-12 13:38:31
ParseSegment: segment: crawl/segments/20120312133822
Parsing: http://nutch.apache.org/mailing_lists.html
Parsing: http://nutch.apache.org/wiki.html
Parsing: http://www.apache.org/
Parsing: http://www.apache.org/dyn/closer.cgi/nutch/
ParseSegment: finished at 2012-03-12 13:38:33, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133822]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:34, elapsed: 00:00:01
Generator: starting at 2012-03-12 13:38:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133836
Generator: finished at 2012-03-12 13:38:38, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:38
Fetcher: segment: crawl/segments/20120312133836
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://hadoop.apache.org/
Using queue mode : byHost
fetching http://nutch.apache.org/index.html
Using queue mode : byHost
fetching http://www.apache.org/licenses/
Using queue mode : byHost
Using queue mode : byHost
fetching http://tika.apache.org/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552319434
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552320435
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552321436
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552322438
  0. http://www.apache.org/foundation/sponsorship.html
fetching http://www.apache.org/foundation/sponsorship.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:45, elapsed: 00:00:07
ParseSegment: starting at 2012-03-12 13:38:45
ParseSegment: segment: crawl/segments/20120312133836
Parsing: http://hadoop.apache.org/
Parsing: http://nutch.apache.org/index.html
Parsing: http://tika.apache.org/
Parsing: http://www.apache.org/foundation/sponsorship.html
Parsing: http://www.apache.org/licenses/
ParseSegment: finished at 2012-03-12 13:38:46, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:46
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133836]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:48, elapsed: 00:00:01
LinkDb: starting at 2012-03-12 13:38:48
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312131223
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132729
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132952
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133110
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133255
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133409
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133811
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133822
LinkDb: adding segment:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133836
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312131223/parse_data
Input path does not exist:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132729/parse_data
Input path does not exist:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132952/parse_data
Input path does not exist:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133110/parse_data
Input path does not exist:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133255/parse_data
Input path does not exist:
file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133409/parse_data
        at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
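
The failure at the end is the same pattern Markus points out further down
the thread: segments left over from earlier runs that never reached the
parse step have no parse_data, so LinkDb refuses them. One way to clean up
before re-running (a sketch, assuming the crawl/ layout shown above):

-----
# drop any segment that has no parse_data directory
for s in crawl/segments/*; do
    [ -d "$s/parse_data" ] || rm -r "$s"
done
-----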

--
View this message in context: http://lucene.472066.n3.nabble.com/Exception-in-thread-main-java-io-IOException-Job-failed-tp3766765p3819113.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Daniel Bourrion <da...@univ-angers.fr>.
Wow, crawling works much better - it works, indeed, now that I've replaced 
OpenJDK with the Sun Java 6 JDK (I'm on Ubuntu).

Thanks
D
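
For anyone on Ubuntu hitting the same thing, switching the JVM once the Sun
JDK is installed comes down to (a sketch; the JAVA_HOME path is just the
conventional location of the sun-java6-jdk package):

-----
java -version                            # check which JDK is currently active
sudo update-alternatives --config java   # pick the Sun JDK from the list
export JAVA_HOME=/usr/lib/jvm/java-6-sun
-----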

On 23/02/2012 11:47, remi tassing wrote:
> disk size issue?
> access rights?
>
> On Thu, Feb 23, 2012 at 12:39 PM, Daniel Bourrion<
> daniel.bourrion@univ-angers.fr>  wrote:
>
>> Hi Markus
>> Thx for help.
>>
>> (Hope i'm not boring everybody)
>>
>> I've erase everything in crawl/
>>
>> Launching my nutch, got now
>>
>> -----
>> CrawlDb update: 404 purging: false
>> CrawlDb update: Merging segment data into db.
>>
>> Exception in thread "main" java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
>>     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>
>> -----
>>
>>
>> Into the logs, got
>>
>> ____
>>
>>
>> 2012-02-23 11:25:48,803 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
>> false
>> 2012-02-23 11:25:48,804 INFO  crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2012-02-23 11:25:49,353 INFO  regex.RegexURLNormalizer - can't find rules
>> for scope 'crawldb', using default
>> 2012-02-23 11:25:49,560 INFO  regex.RegexURLNormalizer - can't find rules
>> for scope 'crawldb', using default
>> 2012-02-23 11:25:49,985 WARN  mapred.LocalJobRunner - job_local_0007
>> java.io.IOException: Cannot run program "chmod": java.io.IOException:
>> error=12, Cannot allocate memory
>>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
>>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>>     at org.apache.hadoop.util.Shell.run(Shell.java:134)
>>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
>>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
>>     at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
>>     at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
>>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
>>     at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
>> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
>>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
>>     at java.lang.ProcessImpl.start(ProcessImpl.java:81)
>>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
>>     ... 15 more
>> _____
>>
>>
>>
>> On 23/02/2012 10:01, Markus Jelsma wrote:
>>
>>> Unfetched, unparsed or just a bad corrupt segment. Remove that segment
>>> and try
>>> again.
>>>
>>>   Many thanks Remi.
>>>> Finally, after un reboot og the computer (I send my question just before
>>>> leaving my desk), Nutch started to crawl (amazing :))) )
>>>>
>>>> But now, during the crawl process, I got that :
>>>>
>>>> -----
>>>>
>>>> LinkDb: adding segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222161934 LinkDb: adding segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120223093525 LinkDb: adding segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222153642 LinkDb: adding segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222154459 Exception in thread "main"
>>>> org.apache.hadoop.mapred.**InvalidInputException: Input path does not
>>>> exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222160234/parse_data Input path does not exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222160609/parse_data Input path does not exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222153805/parse_data Input path does not exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222155532/parse_data Input path does not exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222160132/parse_data Input path does not exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222153642/parse_data Input path does not exist:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120222154459/parse_data at
>>>> org.apache.hadoop.mapred.**FileInputFormat.listStatus(**
>>>> FileInputFormat.java:19
>>>> 0) at
>>>> org.apache.hadoop.mapred.**SequenceFileInputFormat.**
>>>> listStatus(SequenceFileInp
>>>> utFormat.java:44) at
>>>> org.apache.hadoop.mapred.**FileInputFormat.getSplits(**
>>>> FileInputFormat.java:201
>>>> ) at
>>>> org.apache.hadoop.mapred.**JobClient.writeOldSplits(**
>>>> JobClient.java:810)
>>>>       at
>>>> org.apache.hadoop.mapred.**JobClient.submitJobInternal(**
>>>> JobClient.java:781)
>>>>       at org.apache.hadoop.mapred.**JobClient.submitJob(JobClient.**
>>>> java:730)
>>>>       at org.apache.hadoop.mapred.**JobClient.runJob(JobClient.**
>>>> java:1249)
>>>>       at org.apache.nutch.crawl.LinkDb.**invert(LinkDb.java:175)
>>>>       at org.apache.nutch.crawl.LinkDb.**invert(LinkDb.java:149)
>>>>       at org.apache.nutch.crawl.Crawl.**run(Crawl.java:143)
>>>>       at org.apache.hadoop.util.**ToolRunner.run(ToolRunner.**java:65)
>>>>       at org.apache.nutch.crawl.Crawl.**main(Crawl.java:55)
>>>>
>>>> -----
>>>>
>>>> and nothing special in the logs :
>>>>
>>>> last lines are :
>>>>
>>>>
>>>> 2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished
>>>> at 2012-02-23 09:46:42, elapsed: 00:00:01
>>>> 2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at
>>>> 2012-02-23 09:46:42
>>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
>>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
>>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
>>>> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments/
>>>> 20120223093220 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222160234 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223093302 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222160609 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222153805 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222155532 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223094427 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223093618 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223094552 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223094500 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222160132 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223093649 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223093210 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222161934 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120223093525 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222153642 2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb:
>>>> adding
>>>> segment:
>>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>>> local/crawl/segments
>>>> /20120222154459
>>>>
>>>> On 22/02/2012 16:36, remi tassing wrote:
>>>>
>>>>> Hey Daniel,
>>>>>
>>>>> You can find more output log in logs/Hadoop files
>>>>>
>>>>> Remi
>>>>>
>>>>> On Wednesday, February 22, 2012, Daniel Bourrion<
>>>>>
>>>>> daniel.bourrion@univ-angers.fr**>    wrote:
>>>>>
>>>>>> Hi.
>>>>>> I'm a french librarian (that explains the bad english coming now... :)
>>>>>> )
>>>>>>
>>>>>> Newbie on Nutch, that looks exactly what i'm searching for (an
>>>>>> opensource
>>>>>>
>>>>> solution that should crawl our specific domaine and have it's crawl
>>>>> results pushed into Solr).
>>>>>
>>>>>   I've install a test nutch using
>>>>>> http://wiki.apache.org/nutch/**NutchTutorial<http://wiki.apache.org/nutch/NutchTutorial>
>>>>>>
>>>>>>
>>>>>> Got an error but I don't really know it nor understand where to try to
>>>>>>
>>>>> correct what causes that.
>>>>>
>>>>>   Here's a copy of the error messages - any help welcome.
>>>>>> Best
>>>>>>
>>>>>> ------------------------------**--------------------
>>>>>> daniel@daniel-linux:~/Bureau/**apache-nutch-1.4-bin/runtime/**local$
>>>>>>
>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>>
>>>>>   solrUrl is not set, indexing will be skipped...
>>>>>> crawl started in: crawl
>>>>>> rootUrlDir = urls
>>>>>> threads = 10
>>>>>> depth = 3
>>>>>> solrUrl=null
>>>>>> topN = 5
>>>>>> Injector: starting at 2012-02-22 16:06:04
>>>>>> Injector: crawlDb: crawl/crawldb
>>>>>> Injector: urlDir: urls
>>>>>> Injector: Converting injected urls to crawl db entries.
>>>>>> Injector: Merging injected urls into crawl db.
>>>>>> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
>>>>>> Generator: starting at 2012-02-22 16:06:06
>>>>>> Generator: Selecting best-scoring urls due for fetch.
>>>>>> Generator: filtering: true
>>>>>> Generator: normalizing: true
>>>>>> Generator: topN: 5
>>>>>> Generator: jobtracker is 'local', generating exactly one partition.
>>>>>> Generator: Partitioning selected urls for politeness.
>>>>>> Generator: segment: crawl/segments/20120222160609
>>>>>> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
>>>>>> Fetcher: Your 'http.agent.name' value should be listed first in
>>>>>>
>>>>> 'http.robots.agents' property.
>>>>>
>>>>>   Fetcher: starting at 2012-02-22 16:06:10
>>>>>> Fetcher: segment: crawl/segments/20120222160609
>>>>>> Using queue mode : byHost
>>>>>> Fetcher: threads: 10
>>>>>> Fetcher: time-out divisor: 2
>>>>>> QueueFeeder finished: total 2 records + hit by time limit :0
>>>>>> Using queue mode : byHost
>>>>>> Using queue mode : byHost
>>>>>> Using queue mode : byHost
>>>>>> fetching http://bu.univ-angers.fr/
>>>>>> Using queue mode : byHost
>>>>>> Using queue mode : byHost
>>>>>> fetching http://www.face-ecran.fr/
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> Using queue mode : byHost
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> Using queue mode : byHost
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> Using queue mode : byHost
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> Using queue mode : byHost
>>>>>> Using queue mode : byHost
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>>> Fetcher: throughput threshold: -1
>>>>>> Fetcher: throughput threshold retries: 5
>>>>>> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
>>>>>> -finishing thread FetcherThread, activeThreads=1
>>>>>> -finishing thread FetcherThread, activeThreads=0
>>>>>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>>>> -activeThreads=0
>>>>>> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
>>>>>> ParseSegment: starting at 2012-02-22 16:06:13
>>>>>> ParseSegment: segment: crawl/segments/20120222160609
>>>>>> Parsing: http://bu.univ-angers.fr/
>>>>>> Parsing: http://www.face-ecran.fr/
>>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>>
>>>>>>      at org.apache.hadoop.mapred.**JobClient.runJob(JobClient.**
>>>>>> java:1252)
>>>>>>      at org.apache.nutch.parse.**ParseSegment.parse(**
>>>>>> ParseSegment.java:157)
>>>>>>      at org.apache.nutch.crawl.Crawl.**run(Crawl.java:138)
>>>>>>      at org.apache.hadoop.util.**ToolRunner.run(ToolRunner.**java:65)
>>>>>>      at org.apache.nutch.crawl.Crawl.**main(Crawl.java:55)
>>>>>>
>>>>>> ------------------------------**--------------------
>>>>>>
>>>>>> --
>>>>>> Avec mes salutations les plus cordiales.
>>>>>> __
>>>>>>
>>>>>> Daniel Bourrion, conservateur des bibliothèques
>>>>>> Responsable de la bibliothèque numérique
>>>>>> Ligne directe : 02.44.68.80.50
>>>>>> SCD Université d'Angers - http://bu.univ-angers.fr
>>>>>> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>>>>>>
>>>>>> *************************************
>>>>>> " Et par le pouvoir d'un mot
>>>>>> Je recommence ma vie "
>>>>>>
>>>>>>                         Paul Eluard
>>>>>>
>>>>>> *************************************
>>>>>> blog perso : http://archives.face-ecran.fr/
>>>>>>
>> --
>> Avec mes salutations les plus cordiales.
>> __
>>
>> Daniel Bourrion, conservateur des bibliothèques
>> Responsable de la bibliothèque numérique
>> Ligne directe : 02.44.68.80.50
>> SCD Université d'Angers - http://bu.univ-angers.fr
>> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>>
>> *************************************
>> " Et par le pouvoir d'un mot
>> Je recommence ma vie "
>>                        Paul Eluard
>> *************************************
>> blog perso : http://archives.face-ecran.fr/
>>
>>

-- 
Avec mes salutations les plus cordiales.
__

Daniel Bourrion, conservateur des bibliothèques
Responsable de la bibliothèque numérique
Ligne directe : 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex

***********************************
" Et par le pouvoir d'un mot
Je recommence ma vie "
                        Paul Eluard
***********************************
blog perso : http://archives.face-ecran.fr/


Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by remi tassing <ta...@gmail.com>.
disk size issue?
access rights?

On Thu, Feb 23, 2012 at 12:39 PM, Daniel Bourrion <
daniel.bourrion@univ-angers.fr> wrote:

> Hi Markus
> Thx for help.
>
> (Hope i'm not boring everybody)
>
> I've erase everything in crawl/
>
> Launching my nutch, got now
>
> -----
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
>
> Exception in thread "main" java.io.IOException: Job failed!
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
>    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
>    at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>
> -----
>
>
> Into the logs, got
>
> ____
>
>
> 2012-02-23 11:25:48,803 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
> false
> 2012-02-23 11:25:48,804 INFO  crawl.CrawlDb - CrawlDb update: Merging
> segment data into db.
> 2012-02-23 11:25:49,353 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
> 2012-02-23 11:25:49,560 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
> 2012-02-23 11:25:49,985 WARN  mapred.LocalJobRunner - job_local_0007
> java.io.IOException: Cannot run program "chmod": java.io.IOException:
> error=12, Cannot allocate memory
>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
>    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>    at org.apache.hadoop.util.Shell.run(Shell.java:134)
>    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>    at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
>    at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
>    at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
>    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
>    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
>    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
>    at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
>    at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
>    at java.lang.ProcessImpl.start(ProcessImpl.java:81)
>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
>    ... 15 more
> _____
>
>
>
> On 23/02/2012 10:01, Markus Jelsma wrote:
>
>> Unfetched, unparsed or just a bad corrupt segment. Remove that segment
>> and try
>> again.
>>
>>  Many thanks Remi.
>>>
>>> Finally, after un reboot og the computer (I send my question just before
>>> leaving my desk), Nutch started to crawl (amazing :))) )
>>>
>>> But now, during the crawl process, I got that :
>>>
>>> -----
>>>
>>> LinkDb: adding segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222161934 LinkDb: adding segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120223093525 LinkDb: adding segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222153642 LinkDb: adding segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222154459 Exception in thread "main"
>>> org.apache.hadoop.mapred.**InvalidInputException: Input path does not
>>> exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222160234/parse_data Input path does not exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222160609/parse_data Input path does not exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222153805/parse_data Input path does not exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222155532/parse_data Input path does not exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222160132/parse_data Input path does not exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222153642/parse_data Input path does not exist:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120222154459/parse_data at
>>> org.apache.hadoop.mapred.**FileInputFormat.listStatus(**
>>> FileInputFormat.java:19
>>> 0) at
>>> org.apache.hadoop.mapred.**SequenceFileInputFormat.**
>>> listStatus(SequenceFileInp
>>> utFormat.java:44) at
>>> org.apache.hadoop.mapred.**FileInputFormat.getSplits(**
>>> FileInputFormat.java:201
>>> ) at
>>> org.apache.hadoop.mapred.**JobClient.writeOldSplits(**
>>> JobClient.java:810)
>>>      at
>>> org.apache.hadoop.mapred.**JobClient.submitJobInternal(**
>>> JobClient.java:781)
>>>      at org.apache.hadoop.mapred.**JobClient.submitJob(JobClient.**
>>> java:730)
>>>      at org.apache.hadoop.mapred.**JobClient.runJob(JobClient.**
>>> java:1249)
>>>      at org.apache.nutch.crawl.LinkDb.**invert(LinkDb.java:175)
>>>      at org.apache.nutch.crawl.LinkDb.**invert(LinkDb.java:149)
>>>      at org.apache.nutch.crawl.Crawl.**run(Crawl.java:143)
>>>      at org.apache.hadoop.util.**ToolRunner.run(ToolRunner.**java:65)
>>>      at org.apache.nutch.crawl.Crawl.**main(Crawl.java:55)
>>>
>>> -----
>>>
>>> and nothing special in the logs :
>>>
>>> last lines are :
>>>
>>>
>>> 2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished
>>> at 2012-02-23 09:46:42, elapsed: 00:00:01
>>> 2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at
>>> 2012-02-23 09:46:42
>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments/
>>> 20120223093220 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222160234 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223093302 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222160609 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222153805 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222155532 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223094427 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223093618 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223094552 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223094500 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222160132 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223093649 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223093210 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222161934 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120223093525 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222153642 2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb:
>>> adding
>>> segment:
>>> file:/home/daniel/Bureau/**apache-nutch-1.4-bin/runtime/**
>>> local/crawl/segments
>>> /20120222154459
>>>
>>> On 22/02/2012 16:36, remi tassing wrote:
>>>
>>>> Hey Daniel,
>>>>
>>>> You can find more output log in logs/Hadoop files
>>>>
>>>> Remi
>>>>
>>>> On Wednesday, February 22, 2012, Daniel Bourrion<
>>>>
>>>> daniel.bourrion@univ-angers.fr**>   wrote:
>>>>
>>>>> Hi.
>>>>> I'm a french librarian (that explains the bad english coming now... :)
>>>>> )
>>>>>
>>>>> Newbie on Nutch, that looks exactly what i'm searching for (an
>>>>> opensource
>>>>>
>>>> solution that should crawl our specific domaine and have it's crawl
>>>> results pushed into Solr).
>>>>
>>>>  I've install a test nutch using
>>>>> http://wiki.apache.org/nutch/**NutchTutorial<http://wiki.apache.org/nutch/NutchTutorial>
>>>>>
>>>>>
>>>>> Got an error but I don't really know it nor understand where to try to
>>>>>
>>>> correct what causes that.
>>>>
>>>>  Here's a copy of the error messages - any help welcome.
>>>>> Best
>>>>>
>>>>> ------------------------------**--------------------
>>>>> daniel@daniel-linux:~/Bureau/**apache-nutch-1.4-bin/runtime/**local$
>>>>>
>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>
>>>>  solrUrl is not set, indexing will be skipped...
>>>>> crawl started in: crawl
>>>>> rootUrlDir = urls
>>>>> threads = 10
>>>>> depth = 3
>>>>> solrUrl=null
>>>>> topN = 5
>>>>> Injector: starting at 2012-02-22 16:06:04
>>>>> Injector: crawlDb: crawl/crawldb
>>>>> Injector: urlDir: urls
>>>>> Injector: Converting injected urls to crawl db entries.
>>>>> Injector: Merging injected urls into crawl db.
>>>>> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
>>>>> Generator: starting at 2012-02-22 16:06:06
>>>>> Generator: Selecting best-scoring urls due for fetch.
>>>>> Generator: filtering: true
>>>>> Generator: normalizing: true
>>>>> Generator: topN: 5
>>>>> Generator: jobtracker is 'local', generating exactly one partition.
>>>>> Generator: Partitioning selected urls for politeness.
>>>>> Generator: segment: crawl/segments/20120222160609
>>>>> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
>>>>> Fetcher: Your 'http.agent.name' value should be listed first in
>>>>>
>>>> 'http.robots.agents' property.
>>>>
>>>>  Fetcher: starting at 2012-02-22 16:06:10
>>>>> Fetcher: segment: crawl/segments/20120222160609
>>>>> Using queue mode : byHost
>>>>> Fetcher: threads: 10
>>>>> Fetcher: time-out divisor: 2
>>>>> QueueFeeder finished: total 2 records + hit by time limit :0
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> fetching http://bu.univ-angers.fr/
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> fetching http://www.face-ecran.fr/
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Fetcher: throughput threshold: -1
>>>>> Fetcher: throughput threshold retries: 5
>>>>> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
>>>>> -finishing thread FetcherThread, activeThreads=1
>>>>> -finishing thread FetcherThread, activeThreads=0
>>>>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>>> -activeThreads=0
>>>>> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
>>>>> ParseSegment: starting at 2012-02-22 16:06:13
>>>>> ParseSegment: segment: crawl/segments/20120222160609
>>>>> Parsing: http://bu.univ-angers.fr/
>>>>> Parsing: http://www.face-ecran.fr/
>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>
>>>>>     at org.apache.hadoop.mapred.**JobClient.runJob(JobClient.**
>>>>> java:1252)
>>>>>     at org.apache.nutch.parse.**ParseSegment.parse(**
>>>>> ParseSegment.java:157)
>>>>>     at org.apache.nutch.crawl.Crawl.**run(Crawl.java:138)
>>>>>     at org.apache.hadoop.util.**ToolRunner.run(ToolRunner.**java:65)
>>>>>     at org.apache.nutch.crawl.Crawl.**main(Crawl.java:55)
>>>>>
>>>>> ------------------------------**--------------------
>>>>>
>>>>> --
>>>>> Avec mes salutations les plus cordiales.
>>>>> __
>>>>>
>>>>> Daniel Bourrion, conservateur des bibliothèques
>>>>> Responsable de la bibliothèque numérique
>>>>> Ligne directe : 02.44.68.80.50
>>>>> SCD Université d'Angers - http://bu.univ-angers.fr
>>>>> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>>>>>
>>>>> *************************************
>>>>> " Et par le pouvoir d'un mot
>>>>> Je recommence ma vie "
>>>>>
>>>>>                        Paul Eluard
>>>>>
>>>>> *************************************
>>>>> blog perso : http://archives.face-ecran.fr/
>>>>>
>>>>
> --
> Avec mes salutations les plus cordiales.
> __
>
> Daniel Bourrion, conservateur des bibliothèques
> Responsable de la bibliothèque numérique
> Ligne directe : 02.44.68.80.50
> SCD Université d'Angers - http://bu.univ-angers.fr
> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>
> *************************************
> " Et par le pouvoir d'un mot
> Je recommence ma vie "
>                       Paul Eluard
> *************************************
> blog perso : http://archives.face-ecran.fr/
>
>

Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Daniel Bourrion <da...@univ-angers.fr>.
Hi Markus,
Thanks for the help.

(Hope I'm not boring everybody.)

I've erased everything in crawl/.

Launching my Nutch again, I now get:

-----
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
Exception in thread "main" java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

-----


In the logs, I get:

____


2012-02-23 11:25:48,803 INFO  crawl.CrawlDb - CrawlDb update: 404 
purging: false
2012-02-23 11:25:48,804 INFO  crawl.CrawlDb - CrawlDb update: Merging 
segment data into db.
2012-02-23 11:25:49,353 INFO  regex.RegexURLNormalizer - can't find 
rules for scope 'crawldb', using default
2012-02-23 11:25:49,560 INFO  regex.RegexURLNormalizer - can't find 
rules for scope 'crawldb', using default
2012-02-23 11:25:49,985 WARN  mapred.LocalJobRunner - job_local_0007
java.io.IOException: Cannot run program "chmod": java.io.IOException: 
error=12, Cannot allocate memory
     at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
     at org.apache.hadoop.util.Shell.run(Shell.java:134)
     at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
     at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
     at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
     at 
org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
     at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
     at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
     at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
     at 
org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot 
allocate memory
     at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
     at java.lang.ProcessImpl.start(ProcessImpl.java:81)
     at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
     ... 15 more
_____
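
(For what it's worth, "error=12, Cannot allocate memory" when forking "chmod" usually means the kernel refused to fork a child process, typically because the JVM heap is sized close to the machine's available RAM. Assuming that is the cause here, two commonly suggested workarounds are to shrink the heap Nutch asks for, or to relax the kernel's overcommit policy:

    export NUTCH_HEAPSIZE=500          # heap in MB; bin/nutch should use this for -Xmx
    sysctl -w vm.overcommit_memory=1   # run as root: let fork() overcommit memory

Adding swap is another option if the machine is simply short on memory.)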


On 23/02/2012 10:01, Markus Jelsma wrote:
> Unfetched, unparsed, or just a bad/corrupt segment. Remove that segment and try
> again.
>
>> Many thanks Remi.
>>
>> Finally, after a reboot of the computer (I sent my question just before
>> leaving my desk), Nutch started to crawl (amazing :))) )
>>
>> But now, during the crawl process, I got this:
>>
>> -----
>>
>> LinkDb: adding segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222161934 LinkDb: adding segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120223093525 LinkDb: adding segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222153642 LinkDb: adding segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222154459 Exception in thread "main"
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222160234/parse_data Input path does not exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222160609/parse_data Input path does not exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222153805/parse_data Input path does not exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222155532/parse_data Input path does not exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222160132/parse_data Input path does not exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222153642/parse_data Input path does not exist:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120222154459/parse_data at
>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:19
>> 0) at
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInp
>> utFormat.java:44) at
>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201
>> ) at
>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>       at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>
>> -----
>>
>> and nothing special in the logs:
>>
>> the last lines are:
>>
>>
>> 2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished
>> at 2012-02-23 09:46:42, elapsed: 00:00:01
>> 2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at
>> 2012-02-23 09:46:42
>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
>> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
>> 20120223093220 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222160234 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223093302 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222160609 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222153805 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222155532 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223094427 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223093618 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223094552 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223094500 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222160132 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223093649 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223093210 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222161934 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120223093525 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222153642 2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb: adding
>> segment:
>> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
>> /20120222154459
>>
>> On 22/02/2012 16:36, remi tassing wrote:
>>> Hey Daniel,
>>>
>>> You can find more detailed output in the Hadoop log files under logs/
>>>
>>> Remi
>>>
>>> On Wednesday, February 22, 2012, Daniel Bourrion<
>>>
>>> daniel.bourrion@univ-angers.fr>   wrote:
>>>> Hi.
>>>> I'm a french librarian (that explains the bad english coming now... :) )
>>>>
>>>> Newbie on Nutch, that looks exactly what i'm searching for (an
>>>> opensource
>>> solution that should crawl our specific domaine and have it's crawl
>>> results pushed into Solr).
>>>
>>>> I've install a test nutch using
>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>
>>>>
>>>> Got an error but I don't really know it nor understand where to try to
>>> correct what causes that.
>>>
>>>> Here's a copy of the error messages - any help welcome.
>>>> Best
>>>>
>>>> --------------------------------------------------
>>>> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>
>>>> solrUrl is not set, indexing will be skipped...
>>>> crawl started in: crawl
>>>> rootUrlDir = urls
>>>> threads = 10
>>>> depth = 3
>>>> solrUrl=null
>>>> topN = 5
>>>> Injector: starting at 2012-02-22 16:06:04
>>>> Injector: crawlDb: crawl/crawldb
>>>> Injector: urlDir: urls
>>>> Injector: Converting injected urls to crawl db entries.
>>>> Injector: Merging injected urls into crawl db.
>>>> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
>>>> Generator: starting at 2012-02-22 16:06:06
>>>> Generator: Selecting best-scoring urls due for fetch.
>>>> Generator: filtering: true
>>>> Generator: normalizing: true
>>>> Generator: topN: 5
>>>> Generator: jobtracker is 'local', generating exactly one partition.
>>>> Generator: Partitioning selected urls for politeness.
>>>> Generator: segment: crawl/segments/20120222160609
>>>> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
>>>> Fetcher: Your 'http.agent.name' value should be listed first in
>>> 'http.robots.agents' property.
>>>
>>>> Fetcher: starting at 2012-02-22 16:06:10
>>>> Fetcher: segment: crawl/segments/20120222160609
>>>> Using queue mode : byHost
>>>> Fetcher: threads: 10
>>>> Fetcher: time-out divisor: 2
>>>> QueueFeeder finished: total 2 records + hit by time limit :0
>>>> Using queue mode : byHost
>>>> Using queue mode : byHost
>>>> Using queue mode : byHost
>>>> fetching http://bu.univ-angers.fr/
>>>> Using queue mode : byHost
>>>> Using queue mode : byHost
>>>> fetching http://www.face-ecran.fr/
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> Using queue mode : byHost
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> Using queue mode : byHost
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> Using queue mode : byHost
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> Using queue mode : byHost
>>>> Using queue mode : byHost
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> -finishing thread FetcherThread, activeThreads=2
>>>> Fetcher: throughput threshold: -1
>>>> Fetcher: throughput threshold retries: 5
>>>> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
>>>> -finishing thread FetcherThread, activeThreads=1
>>>> -finishing thread FetcherThread, activeThreads=0
>>>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>> -activeThreads=0
>>>> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
>>>> ParseSegment: starting at 2012-02-22 16:06:13
>>>> ParseSegment: segment: crawl/segments/20120222160609
>>>> Parsing: http://bu.univ-angers.fr/
>>>> Parsing: http://www.face-ecran.fr/
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>
>>>>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>      at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>>>>      at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>      at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>
>>>> --------------------------------------------------
>>>>
>>>> --
>>>> Avec mes salutations les plus cordiales.
>>>> __
>>>>
>>>> Daniel Bourrion, conservateur des bibliothèques
>>>> Responsable de la bibliothèque numérique
>>>> Ligne directe : 02.44.68.80.50
>>>> SCD Université d'Angers - http://bu.univ-angers.fr
>>>> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>>>>
>>>> ***********************************
>>>> " Et par le pouvoir d'un mot
>>>> Je recommence ma vie "
>>>>
>>>>                         Paul Eluard
>>>>
>>>> ***********************************
>>>> blog perso : http://archives.face-ecran.fr/

-- 
With kindest regards.
__

Daniel Bourrion, library curator
Head of the digital library
Direct line: 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex

***********************************
" And by the power of a word
I begin my life again "
                        Paul Eluard
***********************************
personal blog: http://archives.face-ecran.fr/


Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Markus Jelsma <ma...@openindex.io>.
Unfetched, unparsed, or just a bad/corrupt segment. Remove that segment and try
again.
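
For example, assuming the layout from the quoted output below (the timestamp is one of the segments the LinkDb step complained about), something along these lines should do it from runtime/local:

    rm -r crawl/segments/20120222160234

or, to sweep out every segment that never produced parse_data:

    for seg in crawl/segments/*; do
        [ -d "$seg/parse_data" ] || rm -r "$seg"
    done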

> Many thanks Remi.
> 
> Finally, after a reboot of the computer (I sent my question just before
> leaving my desk), Nutch started to crawl (amazing :))) )
> 
> But now, during the crawl process, I got this:
> 
> -----
> 
> LinkDb: adding segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222161934 LinkDb: adding segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120223093525 LinkDb: adding segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222153642 LinkDb: adding segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222154459 Exception in thread "main"
> org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222160234/parse_data Input path does not exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222160609/parse_data Input path does not exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222153805/parse_data Input path does not exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222155532/parse_data Input path does not exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222160132/parse_data Input path does not exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222153642/parse_data Input path does not exist:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120222154459/parse_data at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:19
> 0) at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInp
> utFormat.java:44) at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201
> ) at
> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>      at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>      at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>      at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> 
> -----
> 
> and nothing special in the logs:
> 
> the last lines are:
> 
> 
> 2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished
> at 2012-02-23 09:46:42, elapsed: 00:00:01
> 2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at
> 2012-02-23 09:46:42
> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/
> 20120223093220 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222160234 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223093302 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222160609 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222153805 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222155532 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223094427 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223093618 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223094552 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223094500 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222160132 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223093649 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223093210 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222161934 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120223093525 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222153642 2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb: adding
> segment:
> file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments
> /20120222154459
> 
> On 22/02/2012 16:36, remi tassing wrote:
> > Hey Daniel,
> > 
> > You can find more detailed output in the Hadoop log files under logs/
> > 
> > Remi
> > 
> > On Wednesday, February 22, 2012, Daniel Bourrion<
> > 
> > daniel.bourrion@univ-angers.fr>  wrote:
> >> Hi.
> >> I'm a french librarian (that explains the bad english coming now... :) )
> >> 
> >> Newbie on Nutch, that looks exactly what i'm searching for (an
> >> opensource
> > 
> > solution that should crawl our specific domaine and have it's crawl
> > results pushed into Solr).
> > 
> >> I've install a test nutch using
> >> http://wiki.apache.org/nutch/NutchTutorial
> >> 
> >> 
> >> Got an error but I don't really know it nor understand where to try to
> > 
> > correct what causes that.
> > 
> >> Here's a copy of the error messages - any help welcome.
> >> Best
> >> 
> >> --------------------------------------------------
> >> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$
> > 
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> > 
> >> solrUrl is not set, indexing will be skipped...
> >> crawl started in: crawl
> >> rootUrlDir = urls
> >> threads = 10
> >> depth = 3
> >> solrUrl=null
> >> topN = 5
> >> Injector: starting at 2012-02-22 16:06:04
> >> Injector: crawlDb: crawl/crawldb
> >> Injector: urlDir: urls
> >> Injector: Converting injected urls to crawl db entries.
> >> Injector: Merging injected urls into crawl db.
> >> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
> >> Generator: starting at 2012-02-22 16:06:06
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 5
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment: crawl/segments/20120222160609
> >> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
> >> Fetcher: Your 'http.agent.name' value should be listed first in
> > 
> > 'http.robots.agents' property.
> > 
> >> Fetcher: starting at 2012-02-22 16:06:10
> >> Fetcher: segment: crawl/segments/20120222160609
> >> Using queue mode : byHost
> >> Fetcher: threads: 10
> >> Fetcher: time-out divisor: 2
> >> QueueFeeder finished: total 2 records + hit by time limit :0
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://bu.univ-angers.fr/
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://www.face-ecran.fr/
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Fetcher: throughput threshold: -1
> >> Fetcher: throughput threshold retries: 5
> >> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
> >> ParseSegment: starting at 2012-02-22 16:06:13
> >> ParseSegment: segment: crawl/segments/20120222160609
> >> Parsing: http://bu.univ-angers.fr/
> >> Parsing: http://www.face-ecran.fr/
> >> Exception in thread "main" java.io.IOException: Job failed!
> >> 
> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> >> 
> >> --------------------------------------------------
> >> 
> >> --
> >> Avec mes salutations les plus cordiales.
> >> __
> >> 
> >> Daniel Bourrion, conservateur des bibliothèques
> >> Responsable de la bibliothèque numérique
> >> Ligne directe : 02.44.68.80.50
> >> SCD Université d'Angers - http://bu.univ-angers.fr
> >> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
> >> 
> >> ***********************************
> >> " Et par le pouvoir d'un mot
> >> Je recommence ma vie "
> >> 
> >>                        Paul Eluard
> >> 
> >> ***********************************
> >> blog perso : http://archives.face-ecran.fr/

Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by Daniel Bourrion <da...@univ-angers.fr>.
Many thanks, Remi.

Finally, after a reboot of the computer (I sent my question just before 
leaving my desk), Nutch started to crawl (amazing :))) )

But now, during the crawl process, I got this:

-----

LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
Exception in thread "main" 
org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
Input path does not exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
Input path does not exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
Input path does not exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
Input path does not exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
Input path does not exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
Input path does not exist: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
     at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
     at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
     at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
     at 
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
     at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

-----
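
(Each path flagged above is a segment that was generated but never successfully parsed, which is why its parse_data directory is missing. A quick way to list the segments that do have parse data, assuming the same layout, is something like:

    ls -d crawl/segments/*/parse_data

and anything not listed there is a candidate for removal.)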

and nothing special in the logs:

the last lines are:


2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished 
at 2012-02-23 09:46:42, elapsed: 00:00:01
2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at 
2012-02-23 09:46:42
2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb: adding segment: 
file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459


On 22/02/2012 16:36, remi tassing wrote:
> Hey Daniel,
>
> You can find more detailed output in the Hadoop log files under logs/
>
> Remi
>
> On Wednesday, February 22, 2012, Daniel Bourrion<
> daniel.bourrion@univ-angers.fr>  wrote:
>> Hi.
>> I'm a french librarian (that explains the bad english coming now... :) )
>>
>> Newbie on Nutch, that looks exactly what i'm searching for (an opensource
> solution that should crawl our specific domaine and have it's crawl results
> pushed into Solr).
>> I've install a test nutch using http://wiki.apache.org/nutch/NutchTutorial
>>
>>
>> Got an error but I don't really know it nor understand where to try to
> correct what causes that.
>> Here's a copy of the error messages - any help welcome.
>> Best
>>
>> --------------------------------------------------
>> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$
> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>> solrUrl is not set, indexing will be skipped...
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 10
>> depth = 3
>> solrUrl=null
>> topN = 5
>> Injector: starting at 2012-02-22 16:06:04
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
>> Generator: starting at 2012-02-22 16:06:06
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 5
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: crawl/segments/20120222160609
>> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
>> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
>> Fetcher: starting at 2012-02-22 16:06:10
>> Fetcher: segment: crawl/segments/20120222160609
>> Using queue mode : byHost
>> Fetcher: threads: 10
>> Fetcher: time-out divisor: 2
>> QueueFeeder finished: total 2 records + hit by time limit :0
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> fetching http://bu.univ-angers.fr/
>> Using queue mode : byHost
>> Using queue mode : byHost
>> fetching http://www.face-ecran.fr/
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=2
>> Using queue mode : byHost
>> -finishing thread FetcherThread, activeThreads=2
>> Using queue mode : byHost
>> -finishing thread FetcherThread, activeThreads=2
>> Using queue mode : byHost
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=2
>> Using queue mode : byHost
>> Using queue mode : byHost
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=2
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold retries: 5
>> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
>> ParseSegment: starting at 2012-02-22 16:06:13
>> ParseSegment: segment: crawl/segments/20120222160609
>> Parsing: http://bu.univ-angers.fr/
>> Parsing: http://www.face-ecran.fr/
>> Exception in thread "main" java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>
>>
>> --------------------------------------------------
>>
>> --
>> Avec mes salutations les plus cordiales.
>> __
>>
>> Daniel Bourrion, conservateur des bibliothèques
>> Responsable de la bibliothèque numérique
>> Ligne directe : 02.44.68.80.50
>> SCD Université d'Angers - http://bu.univ-angers.fr
>> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>>
>> ***********************************
>> " Et par le pouvoir d'un mot
>> Je recommence ma vie "
>>                        Paul Eluard
>> ***********************************
>> blog perso : http://archives.face-ecran.fr/
>>
>>

-- 
With kindest regards.
__

Daniel Bourrion, library curator
Head of the digital library
Direct line: 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex

***********************************
" And by the power of a word
I begin my life again "
                        Paul Eluard
***********************************
personal blog: http://archives.face-ecran.fr/


Re: Exception in thread "main" java.io.IOException: Job failed!

Posted by remi tassing <ta...@gmail.com>.
Hey Daniel,

You can find more detailed output in the Hadoop log files under logs/
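
For instance, from the runtime/local directory (assuming the default log4j configuration shipped with Nutch 1.4), running

    tail -f logs/hadoop.log

will show the full stack traces behind the "Job failed!" summary printed on the console.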

Remi

On Wednesday, February 22, 2012, Daniel Bourrion <
daniel.bourrion@univ-angers.fr> wrote:
> Hi.
> I'm a french librarian (that explains the bad english coming now... :) )
>
> Newbie on Nutch, that looks exactly what i'm searching for (an opensource
solution that should crawl our specific domaine and have it's crawl results
pushed into Solr).
>
> I've install a test nutch using http://wiki.apache.org/nutch/NutchTutorial
>
>
> Got an error but I don't really know it nor understand where to try to
correct what causes that.
>
> Here's a copy of the error messages - any help welcome.
> Best
>
> --------------------------------------------------
> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2012-02-22 16:06:04
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
> Generator: starting at 2012-02-22 16:06:06
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20120222160609
> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
> Fetcher: starting at 2012-02-22 16:06:10
> Fetcher: segment: crawl/segments/20120222160609
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 2 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://bu.univ-angers.fr/
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://www.face-ecran.fr/
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=2
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=2
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=2
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=2
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=2
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
> ParseSegment: starting at 2012-02-22 16:06:13
> ParseSegment: segment: crawl/segments/20120222160609
> Parsing: http://bu.univ-angers.fr/
> Parsing: http://www.face-ecran.fr/
> Exception in thread "main" java.io.IOException: Job failed!
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>    at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>
>
> --------------------------------------------------
>
> --
> Avec mes salutations les plus cordiales.
> __
>
> Daniel Bourrion, conservateur des bibliothèques
> Responsable de la bibliothèque numérique
> Ligne directe : 02.44.68.80.50
> SCD Université d'Angers - http://bu.univ-angers.fr
> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>
> ***********************************
> " Et par le pouvoir d'un mot
> Je recommence ma vie "
>                       Paul Eluard
> ***********************************
> blog perso : http://archives.face-ecran.fr/
>
>