You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@nutch.apache.org by Muhamad Muchlis <tr...@gmail.com> on 2014/11/03 09:15:04 UTC

[Error Crawling Job Failed] NUTCH 1.9

Hello.

I get an error message when I run the command:

*crawl seed/seed.txt crawl -depth 3 -topN 5*


Error Message :

SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)


Can anyone explain why this happened ?





Best regard's

M.Muchlis

Re: [Error Crawling Job Failed] NUTCH 1.9

Posted by Muhamad Muchlis <tr...@gmail.com>.

Hi Markus,

When am trying the solr index : *crawl seed.txt crawl
http://localhost:8983/solr/ <http://localhost:8983/solr/> -depth 3 -topN 5*

when iam query the solr : http://localhost:8983/solr/#/collection1/query

0 Records.


here is the Logs :

2014-11-03 18:18:54,307 INFO  crawl.Injector - Injector: starting at
2014-11-03 18:18:54
2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: crawlDb:
crawl/crawldb
2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: urlDir: seed
2014-11-03 18:18:54,309 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2014-11-03 18:18:54,546 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-11-03 18:18:54,601 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2014-11-03 18:18:55,119 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of
urls rejected by filters: 0
2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of
urls after normalization: 1
2014-11-03 18:18:55,822 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.
2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: overwrite: false
2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: update: false
2014-11-03 18:18:56,904 INFO  crawl.Injector - Injector: URLs merged: 1
2014-11-03 18:18:56,913 INFO  crawl.Injector - Injector: Total new urls
injected: 0
2014-11-03 18:18:56,914 INFO  crawl.Injector - Injector: finished at
2014-11-03 18:18:56, elapsed: 00:00:02


Here is my step for my first crawling:

1. crawl seed.txt crawl -depth 3 -topN 5 > log.txt
2.  *crawl seed.txt crawl http://localhost:8983/solr/
<http://localhost:8983/solr/> -depth 3 -topN 5*

*is that correct step ?.*

*reference
: http://wiki.apache.org/nutch/NutchTutorial#a3.5._Using_the_crawl_script
<http://wiki.apache.org/nutch/NutchTutorial#a3.5._Using_the_crawl_script>*




On Mon, Nov 3, 2014 at 6:05 PM, Markus Jelsma <ma...@openindex.io>
wrote:

> Oh - if you need to index multiple segments, don't use segments/* but -dir
> segments/
>
>
> -----Original message-----
> > From:Muhamad Muchlis <tr...@gmail.com>
> > Sent: Monday 3rd November 2014 12:00
> > To: user@nutch.apache.org
> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> >
> > Hi Markus,
> >
> > When i run this command :
> >
> > *nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/**
> >
> >
> >
> > I got an error here is the log :
> >
> > 2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
> > 2014-11-03 17:55:04
> > 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting
> gone
> > documents: false
> > 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
> filtering:
> > false
> > 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
> > normalizing: false
> > 2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
> > SOLRIndexWriter
> > solr.server.url : URL of the SOLR instance (mandatory)
> > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > solr.mapping.file : name of the mapping file for fields (default
> > solrindex-mapping.xml)
> > solr.auth : use authentication (default false)
> > solr.auth.username : use authentication (default false)
> > solr.auth : username for authentication
> > solr.auth.password : password for authentication
> >
> >
> > 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/indexes
> > 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/crawldb
> > 2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/linkdb
> > 2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20141103163424
> > 2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20141103175027
> > 2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20141103175109
> > 2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> where
> > applicable
> > 2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
> > PriviledgedActionException as:me
> > cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> > 2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer:
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
> > at
> >
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> > at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> > at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> > at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> >
> > Advice me please..
> >
> >
> > On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis <tr...@gmail.com>
> wrote:
> >
> > > Like this ?
> > >
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > >
> > > <!-- Put site-specific property overrides in this file. -->
> > >
> > > <configuration>
> > >
> > > <property>
> > >  <name>http.agent.name</name>
> > >  <value>My Nutch Spider</value>
> > > </property>
> > >
> > > *<property>*
> > > * <name>solr.server.url</name>*
> > > * <value>http://localhost:8983/solr/ <http://localhost:8983/solr/
> ></value>*
> > > *</property>*
> > >
> > >
> > > </configuration>
> > >
> > >
> > > On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <
> markus.jelsma@openindex.io>
> > > wrote:
> > >
> > >> You can set solr.server.url in your nutch-site.xml or pass it via
> command
> > >> line as -Dsolr.server.url=<URL>
> > >>
> > >>
> > >>
> > >> -----Original message-----
> > >> > From:Muhamad Muchlis <tr...@gmail.com>
> > >> > Sent: Monday 3rd November 2014 11:37
> > >> > To: user@nutch.apache.org
> > >> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > >> >
> > >> > Hi Markus,
> > >> >
> > >> > Where can I find the settings solr url?  -D
> > >> >
> > >> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <
> > >> markus.jelsma@openindex.io>
> > >> > wrote:
> > >> >
> > >> > > Well, here is is:
> > >> > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> > >> > > -Dsolr.server.url
> > >> > >
> > >> > >
> > >> > >
> > >> > > -----Original message-----
> > >> > > > From:Muhamad Muchlis <tr...@gmail.com>
> > >> > > > Sent: Monday 3rd November 2014 10:58
> > >> > > > To: user@nutch.apache.org
> > >> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > >> > > >
> > >> > > > 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
> > >> starting at
> > >> > > > 2014-11-03 16:56:06
> > >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer:
> > >> deleting
> > >> > > gone
> > >> > > > documents: false
> > >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> > >> > > filtering:
> > >> > > > false
> > >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> > >> > > > normalizing: false
> > >> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing
> SOLR
> > >> URL.
> > >> > > > Should be set via -D solr.server.url
> > >> > > > SOLRIndexWriter
> > >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > solr.commit.size : buffer size when sending to SOLR (default
> 1000)
> > >> > > > solr.mapping.file : name of the mapping file for fields (default
> > >> > > > solrindex-mapping.xml)
> > >> > > > solr.auth : use authentication (default false)
> > >> > > > solr.auth.username : use authentication (default false)
> > >> > > > solr.auth : username for authentication
> > >> > > > solr.auth.password : password for authentication
> > >> > > >
> > >> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
> > >> > > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> -D
> > >> > > > solr.server.url
> > >> > > > SOLRIndexWriter
> > >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > solr.commit.size : buffer size when sending to SOLR (default
> 1000)
> > >> > > > solr.mapping.file : name of the mapping file for fields (default
> > >> > > > solrindex-mapping.xml)
> > >> > > > solr.auth : use authentication (default false)
> > >> > > > solr.auth.username : use authentication (default false)
> > >> > > > solr.auth : username for authentication
> > >> > > > solr.auth.password : password for authentication
> > >> > > >
> > >> > > > at
> > >> > > >
> > >> > >
> > >>
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
> > >> > > > at
> > >> > > >
> > >> > >
> > >>
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
> > >> > > > at
> > >> org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
> > >> > > > at
> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
> > >> > > > at
> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > >> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >> > > > at
> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <
> > >> > > markus.jelsma@openindex.io>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi - see the logs for more details.
> > >> > > > > Markus
> > >> > > > >
> > >> > > > > -----Original message-----
> > >> > > > > > From:Muhamad Muchlis <tr...@gmail.com>
> > >> > > > > > Sent: Monday 3rd November 2014 9:15
> > >> > > > > > To: user@nutch.apache.org
> > >> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
> > >> > > > > >
> > >> > > > > > Hello.
> > >> > > > > >
> > >> > > > > > I get an error message when I run the command:
> > >> > > > > >
> > >> > > > > > *crawl seed/seed.txt crawl -depth 3 -topN 5*
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Error Message :
> > >> > > > > >
> > >> > > > > > SOLRIndexWriter
> > >> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > > > solr.commit.size : buffer size when sending to SOLR (default
> > >> 1000)
> > >> > > > > > solr.mapping.file : name of the mapping file for fields
> (default
> > >> > > > > > solrindex-mapping.xml)
> > >> > > > > > solr.auth : use authentication (default false)
> > >> > > > > > solr.auth.username : use authentication (default false)
> > >> > > > > > solr.auth : username for authentication
> > >> > > > > > solr.auth.password : password for authentication
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Indexer: java.io.IOException: Job failed!
> > >> > > > > > at
> > >> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > >> > > > > > at
> > >> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > >> > > > > > at
> > >> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > >> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >> > > > > > at
> > >> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Can anyone explain why this happened ?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Best regard's
> > >> > > > > >
> > >> > > > > > M.Muchlis
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> > >
> > >
> >
>

Re: [Error Crawling Job Failed] NUTCH 1.9

Posted by Muhamad Muchlis <tr...@gmail.com>.

Hi Markus,

When i run this command :

*nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/**

I got an error here is the log :

2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
2014-11-03 17:55:04
2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting gone
documents: false
2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL filtering:
false
2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
normalizing: false
2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication

2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/indexes
2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/crawldb
2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/linkdb
2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20141103163424
2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20141103175027
2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20141103175109
2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
PriviledgedActionException as:me
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

Advice me please..

On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis <tr...@gmail.com> wrote:

> Like this ?
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>  <name>http.agent.name</name>
>  <value>My Nutch Spider</value>
> </property>
>
> *<property>*
> * <name>solr.server.url</name>*
> * <value>http://localhost:8983/solr/ <http://localhost:8983/solr/></value>*
> *</property>*
>
>
> </configuration>
>
>
> On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <ma...@openindex.io>
> wrote:
>
>> You can set solr.server.url in your nutch-site.xml or pass it via command
>> line as -Dsolr.server.url=<URL>
>>
>>
>>
>> -----Original message-----
>> > From:Muhamad Muchlis <tr...@gmail.com>
>> > Sent: Monday 3rd November 2014 11:37
>> > To: user@nutch.apache.org
>> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
>> >
>> > Hi Markus,
>> >
>> > Where can I find the settings solr url?  -D
>> >
>> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <
>> markus.jelsma@openindex.io>
>> > wrote:
>> >
>> > > Well, here is is:
>> > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
>> > > -Dsolr.server.url
>> > >
>> > >
>> > >
>> > > -----Original message-----
>> > > > From:Muhamad Muchlis <tr...@gmail.com>
>> > > > Sent: Monday 3rd November 2014 10:58
>> > > > To: user@nutch.apache.org
>> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
>> > > >
>> > > > 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
>> starting at
>> > > > 2014-11-03 16:56:06
>> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer:
>> deleting
>> > > gone
>> > > > documents: false
>> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
>> > > filtering:
>> > > > false
>> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
>> > > > normalizing: false
>> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR
>> URL.
>> > > > Should be set via -D solr.server.url
>> > > > SOLRIndexWriter
>> > > > solr.server.url : URL of the SOLR instance (mandatory)
>> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
>> > > > solr.mapping.file : name of the mapping file for fields (default
>> > > > solrindex-mapping.xml)
>> > > > solr.auth : use authentication (default false)
>> > > > solr.auth.username : use authentication (default false)
>> > > > solr.auth : username for authentication
>> > > > solr.auth.password : password for authentication
>> > > >
>> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
>> > > > java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
>> > > > solr.server.url
>> > > > SOLRIndexWriter
>> > > > solr.server.url : URL of the SOLR instance (mandatory)
>> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
>> > > > solr.mapping.file : name of the mapping file for fields (default
>> > > > solrindex-mapping.xml)
>> > > > solr.auth : use authentication (default false)
>> > > > solr.auth.username : use authentication (default false)
>> > > > solr.auth : username for authentication
>> > > > solr.auth.password : password for authentication
>> > > >
>> > > > at
>> > > >
>> > >
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
>> > > > at
>> > > >
>> > >
>> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
>> > > > at
>> org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
>> > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
>> > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>> > > >
>> > > >
>> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <
>> > > markus.jelsma@openindex.io>
>> > > > wrote:
>> > > >
>> > > > > Hi - see the logs for more details.
>> > > > > Markus
>> > > > >
>> > > > > -----Original message-----
>> > > > > > From:Muhamad Muchlis <tr...@gmail.com>
>> > > > > > Sent: Monday 3rd November 2014 9:15
>> > > > > > To: user@nutch.apache.org
>> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
>> > > > > >
>> > > > > > Hello.
>> > > > > >
>> > > > > > I get an error message when I run the command:
>> > > > > >
>> > > > > > *crawl seed/seed.txt crawl -depth 3 -topN 5*
>> > > > > >
>> > > > > >
>> > > > > > Error Message :
>> > > > > >
>> > > > > > SOLRIndexWriter
>> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
>> > > > > > solr.commit.size : buffer size when sending to SOLR (default
>> 1000)
>> > > > > > solr.mapping.file : name of the mapping file for fields (default
>> > > > > > solrindex-mapping.xml)
>> > > > > > solr.auth : use authentication (default false)
>> > > > > > solr.auth.username : use authentication (default false)
>> > > > > > solr.auth : username for authentication
>> > > > > > solr.auth.password : password for authentication
>> > > > > >
>> > > > > >
>> > > > > > Indexer: java.io.IOException: Job failed!
>> > > > > > at
>> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>> > > > > > at
>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>> > > > > > at
>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > > > > > at
>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>> > > > > >
>> > > > > >
>> > > > > > Can anyone explain why this happened ?
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Best regard's
>> > > > > >
>> > > > > > M.Muchlis
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>
>

Re: [Error Crawling Job Failed] NUTCH 1.9

Posted by Muhamad Muchlis <tr...@gmail.com>.

Like this ?

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

*<property>*
* <name>solr.server.url</name>*
* <value>http://localhost:8983/solr/ <http://localhost:8983/solr/></value>*
*</property>*


</configuration>


On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <ma...@openindex.io>
wrote:

> You can set solr.server.url in your nutch-site.xml or pass it via command
> line as -Dsolr.server.url=<URL>
>
>
>
> -----Original message-----
> > From:Muhamad Muchlis <tr...@gmail.com>
> > Sent: Monday 3rd November 2014 11:37
> > To: user@nutch.apache.org
> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> >
> > Hi Markus,
> >
> > Where can I find the settings solr url?  -D
> >
> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <
> markus.jelsma@openindex.io>
> > wrote:
> >
> > > Well, here is is:
> > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> > > -Dsolr.server.url
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Muhamad Muchlis <tr...@gmail.com>
> > > > Sent: Monday 3rd November 2014 10:58
> > > > To: user@nutch.apache.org
> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > > >
> > > > 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
> starting at
> > > > 2014-11-03 16:56:06
> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting
> > > gone
> > > > documents: false
> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> > > filtering:
> > > > false
> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> > > > normalizing: false
> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR
> URL.
> > > > Should be set via -D solr.server.url
> > > > SOLRIndexWriter
> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > solr.mapping.file : name of the mapping file for fields (default
> > > > solrindex-mapping.xml)
> > > > solr.auth : use authentication (default false)
> > > > solr.auth.username : use authentication (default false)
> > > > solr.auth : username for authentication
> > > > solr.auth.password : password for authentication
> > > >
> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
> > > > java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
> > > > solr.server.url
> > > > SOLRIndexWriter
> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > solr.mapping.file : name of the mapping file for fields (default
> > > > solrindex-mapping.xml)
> > > > solr.auth : use authentication (default false)
> > > > solr.auth.username : use authentication (default false)
> > > > solr.auth : username for authentication
> > > > solr.auth.password : password for authentication
> > > >
> > > > at
> > > >
> > >
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
> > > > at
> > > >
> > >
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
> > > > at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
> > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
> > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > > >
> > > >
> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <
> > > markus.jelsma@openindex.io>
> > > > wrote:
> > > >
> > > > > Hi - see the logs for more details.
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From:Muhamad Muchlis <tr...@gmail.com>
> > > > > > Sent: Monday 3rd November 2014 9:15
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
> > > > > >
> > > > > > Hello.
> > > > > >
> > > > > > I get an error message when I run the command:
> > > > > >
> > > > > > *crawl seed/seed.txt crawl -depth 3 -topN 5*
> > > > > >
> > > > > >
> > > > > > Error Message :
> > > > > >
> > > > > > SOLRIndexWriter
> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
> > > > > > solr.commit.size : buffer size when sending to SOLR (default
> 1000)
> > > > > > solr.mapping.file : name of the mapping file for fields (default
> > > > > > solrindex-mapping.xml)
> > > > > > solr.auth : use authentication (default false)
> > > > > > solr.auth.username : use authentication (default false)
> > > > > > solr.auth : username for authentication
> > > > > > solr.auth.password : password for authentication
> > > > > >
> > > > > >
> > > > > > Indexer: java.io.IOException: Job failed!
> > > > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > > > > > at
> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > > > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > > > at
> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > > > > >
> > > > > >
> > > > > > Can anyone explain why this happened ?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best regard's
> > > > > >
> > > > > > M.Muchlis
> > > > > >
> > > > >
> > > >
> > >
>