Posted to user@nutch.apache.org by "Fournier, Danny G" <Da...@dfo-mpo.gc.ca> on 2012/09/06 21:15:12 UTC
Errors when indexing to Solr
I'm getting two different errors while trying to index Nutch crawls to
Solr. I'm running with:
- CentOS 6.3 VM (Virtualbox) (in host Windows XP)
- Solr 3.6.1
- Nutch 1.5.1
It would seem that NUTCH-1251 comes rather close to solving my issue.
Would that mean I have to compile Nutch 1.6 to fix this?
Error #1 - When indexing directly to Solr
------------------------------------------------
Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2 -depth 3 -topN 5
Error: Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
SolrIndexer: starting at 2012-09-06 14:30:11
Indexing 8 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
    ... 16 more
Caused by: org.apache.solr.common.SolrException: Not Found
Not Found
request: http://localhost:8080/solr/core2/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
    ... 18 more
Error #2 - When indexing post-crawl
--------------------------------------------
Command: bin/nutch solrindex http://localhost:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
Error: org.apache.solr.common.SolrException: Not Found
SolrIndexer: starting at 2012-09-06 15:39:24
org.apache.solr.common.SolrException: Not Found
Not Found
request: http://localhost:8080/solr/core2/update
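[Editor's note] Both errors above are HTTP 404s from the servlet container, so a quick way to tell a Nutch problem from a Solr deployment problem is to probe the core URL directly before indexing. A minimal sketch, assuming the host, port, and core name from the commands above (adjust to your setup):

```shell
# Sketch: classify the HTTP status the Solr core URL returns.
diagnose_solr() {
  # curl prints 000 for %{http_code} when no response was received at all
  status=$(curl -s -o /dev/null -w "%{http_code}" "$1")
  case "$status" in
    200) echo "core reachable" ;;
    404) echo "core not found - check the core name in solr.xml" ;;
    000) echo "no server listening - check that Solr is running" ;;
    *)   echo "unexpected status $status" ;;
  esac
}
diagnose_solr "http://localhost:8080/solr/core2/select?q=*:*&rows=0"
```

If this prints "core not found", the fix is on the Solr side rather than in Nutch.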
Regards,
Dan
RE: Errors when indexing to Solr
Posted by "Fournier, Danny G" <Da...@dfo-mpo.gc.ca>.
Markus,
You were right. My core was set up properly; however, it was labeled differently in the conf file. I was able to get rid of that error. Thanks!
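[Editor's note] In multicore Solr 3.x the URL path component comes from the name attribute in solr.xml, not from the directory on disk, so a mismatch like the one described here produces exactly this 404. A hypothetical sketch (file contents and names are illustrative, not from this thread):

```shell
# Hypothetical multicore solr.xml: the core's files live in directory
# "core2" but it is served under the name "main", so requests to
# /solr/core2/... return 404 while /solr/main/... works.
cat > /tmp/solr-example.xml <<'EOF'
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="main" instanceDir="core2"/>
  </cores>
</solr>
EOF
# Extract the core names Solr will actually serve:
grep -o 'name="[^"]*"' /tmp/solr-example.xml | sed 's/name="\(.*\)"/\1/'
# prints: main
```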
I have provided the log you asked for below...
Dan
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: September 7, 2012 9:49 AM
> To: user@nutch.apache.org
> Subject: RE: Errors when indexing to Solr
>
> -----Original message-----
> > From:Fournier, Danny G <Da...@dfo-mpo.gc.ca>
> > Sent: Fri 07-Sep-2012 14:46
> > To: user@nutch.apache.org
> > Subject: RE: Errors when indexing to Solr
> >
> > I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
> > error:
> >
> > [crawl log and stack trace snipped]
>
> Please post the relevant log
>
2012-09-07 09:54:31,418 WARN mapred.LocalJobRunner - job_local_0005
java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder
at org.apache.nutch.parse.ParseUtil.<init>(ParseUtil.java:59)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.<init>(Fetcher.java:602)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1186)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.ThreadFactoryBuilder
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
... 6 more
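[Editor's note] ThreadFactoryBuilder comes from Google Guava, so this ClassNotFoundException usually means no Guava jar is on the runtime classpath, which can happen when a newer job jar is run against an older runtime. A hedged sketch of the check (NUTCH_HOME is a hypothetical install path, not from this thread):

```shell
# com.google.common.util.concurrent.ThreadFactoryBuilder lives in Guava.
# If no guava jar is on the classpath, the fetcher dies exactly as above.
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch}   # hypothetical install path
if ls "$NUTCH_HOME"/lib/guava-*.jar >/dev/null 2>&1; then
  echo "guava jar present"
else
  echo "guava jar not found - rebuild the runtime (e.g. 'ant runtime') or copy guava into lib/"
fi
```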
RE: Errors when indexing to Solr
Posted by Markus Jelsma <ma...@openindex.io>.
-----Original message-----
> From:Fournier, Danny G <Da...@dfo-mpo.gc.ca>
> Sent: Fri 07-Sep-2012 14:46
> To: user@nutch.apache.org
> Subject: RE: Errors when indexing to Solr
>
> I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
> error:
>
> [crawl log and stack trace snipped]
Please post the relevant log
>
> I then tried to crawl with 1.5.1 (which was successful) and INDEX with
> 1.6-SNAPSHOT. I got this error:
>
> [root@w7sp1-x64 nutch]# bin/nutch solrindex http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
> SolrIndexer: starting at 2012-09-07 09:05:21
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> org.apache.solr.common.SolrException: Not Found
>
> Not Found
>
> request: http://127.0.0.1:8080/solr/core2/update
This is not a Nutch error; there is simply no Solr running there (404), or it is badly configured.
>
> > -----Original Message-----
> > From: Fournier, Danny G [mailto:Danny.Fournier@dfo-mpo.gc.ca]
> > Sent: September 6, 2012 4:15 PM
> > To: user@nutch.apache.org
> > Subject: Errors when indexing to Solr
> >
> > [original message snipped]
RE: Errors when indexing to Solr
Posted by "Fournier, Danny G" <Da...@dfo-mpo.gc.ca>.
I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
error:
[root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-07 08:41:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
Generator: starting at 2012-09-07 08:41:21
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120907084129
Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-09-07 08:41:36
Fetcher: segment: crawl/segments/20120907084129
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I then tried to crawl with 1.5.1 (which was successful) and INDEX with
1.6-SNAPSHOT. I got this error:
[root@w7sp1-x64 nutch]# bin/nutch solrindex http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
SolrIndexer: starting at 2012-09-07 09:05:21
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.solr.common.SolrException: Not Found
Not Found
request: http://127.0.0.1:8080/solr/core2/update
> -----Original Message-----
> From: Fournier, Danny G [mailto:Danny.Fournier@dfo-mpo.gc.ca]
> Sent: September 6, 2012 4:15 PM
> To: user@nutch.apache.org
> Subject: Errors when indexing to Solr
>
> [original message snipped]