Posted to user@nutch.apache.org by "Fournier, Danny G" <Da...@dfo-mpo.gc.ca> on 2012/09/06 21:15:12 UTC

Errors when indexing to Solr

I'm getting two different errors while trying to index Nutch crawls to
Solr. I'm running with:

- CentOS 6.3 VM (Virtualbox) (in host Windows XP)
- Solr 3.6.1
- Nutch 1.5.1

It would seem that NUTCH-1251 comes rather close to solving my issue. Would that mean I have to build Nutch 1.6 to fix this?

Error #1 - When indexing directly to Solr
------------------------------------------------
Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2 -depth 3 -topN 5

Error: Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query

SolrIndexer: starting at 2012-09-06 14:30:11
Indexing 8 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
	at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
	... 16 more
Caused by: org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/core2/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	... 18 more

Error #2 - When indexing post-crawl
--------------------------------------------
Command: bin/nutch solrindex http://localhost:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb

Error: org.apache.solr.common.SolrException: Not Found

SolrIndexer: starting at 2012-09-06 15:39:24
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/core2/update
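
[Editor's note] Both failures bottom out in the same HTTP 404, so a useful first step is to take Nutch out of the picture and hit the /update handler directly. A minimal Python sketch, assuming the URL from this thread (`update_status` is a hypothetical helper, not part of Nutch or SolrJ):

```python
# Probe the same /update handler Nutch's solrindex uses and report the raw
# HTTP status: 200 means the handler exists, 404 means the core path is
# wrong or no core is deployed there (the error seen in this thread).
import urllib.error
import urllib.request

def update_status(core_url, timeout=5):
    req = urllib.request.Request(
        core_url.rstrip("/") + "/update",
        data=b"<commit/>",                       # harmless no-op commit
        headers={"Content-Type": "text/xml"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code                            # e.g. 404 Not Found
    except urllib.error.URLError:
        return None                              # nothing listening at all
```

A 404 here, with Nutch nowhere involved, points at the servlet container or core configuration rather than the indexer.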


Regards,

Dan

RE: Errors when indexing to Solr

Posted by "Fournier, Danny G" <Da...@dfo-mpo.gc.ca>.
Markus, 

You were right. My core was set up properly; however, it was labeled something different in the conf file. I was able to get rid of that error. Thanks!
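
[Editor's note] For anyone else hitting this mismatch: in a Solr 3.x multicore setup, the valid URL paths come from the `name` attributes in solr.xml, which need not match the `instanceDir` the files live in. A small sketch illustrating the point (the solr.xml content below is a made-up example, not Danny's actual config):

```python
# List the core names a Solr 3.x multicore instance actually exposes, so
# they can be compared against the core name used in the Nutch -solr URL.
import xml.etree.ElementTree as ET

SOLR_XML = """\
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- URL path is /solr/collection1, NOT /solr/core2, even though the
         files live under the core2 directory -->
    <core name="collection1" instanceDir="core2" />
  </cores>
</solr>"""

def core_names(solr_xml_text):
    root = ET.fromstring(solr_xml_text)
    return [core.get("name") for core in root.iter("core")]

print(core_names(SOLR_XML))   # ['collection1']
```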

I have provided the log you asked for below...

Dan

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: September 7, 2012 9:49 AM
> To: user@nutch.apache.org
> Subject: RE: Errors when indexing to Solr
> 
> Please post the relevant log
> 

2012-09-07 09:54:31,418 WARN  mapred.LocalJobRunner - job_local_0005
java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder
        at org.apache.nutch.parse.ParseUtil.<init>(ParseUtil.java:59)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.<init>(Fetcher.java:602)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1186)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.ThreadFactoryBuilder
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
        ... 6 more
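
[Editor's note] The NoClassDefFoundError above means Guava, which provides ThreadFactoryBuilder, is not on the job's classpath. A quick way to locate which jar, if any, bundles the class is to treat jars as the zip files they are; a sketch (the lib directory path is an assumption, point it at your Nutch runtime):

```python
# Scan a directory of jars for the Guava class whose absence triggers the
# NoClassDefFoundError in the fetcher; zipfile can list jar entries
# without needing a JVM.
import pathlib
import zipfile

NEEDED = "com/google/common/util/concurrent/ThreadFactoryBuilder.class"

def jars_containing(lib_dir, entry=NEEDED):
    hits = []
    for jar in sorted(pathlib.Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits
```

If this finds nothing under the runtime lib directory, copying the Guava jar there (or rebuilding the runtime with ant) would be the likely fix.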

RE: Errors when indexing to Solr

Posted by Markus Jelsma <ma...@openindex.io>.
-----Original message-----
> From:Fournier, Danny G <Da...@dfo-mpo.gc.ca>
> Sent: Fri 07-Sep-2012 14:46
> To: user@nutch.apache.org
> Subject: RE: Errors when indexing to Solr
> 
> I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
> error:
> 
> [root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2012-09-07 08:41:06
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
> Generator: starting at 2012-09-07 08:41:21
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20120907084129
> Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2012-09-07 08:41:36
> Fetcher: segment: crawl/segments/20120907084129
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Exception in thread "main" java.io.IOException: Job failed!
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> 	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> 	at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> 	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please post the relevant log

> 
> I then tried to crawl with 1.5.1 (which was successful) and INDEX with
> 1.6-SNAPSHOT. I got this error:
> 
> [root@w7sp1-x64 nutch]# bin/nutch solrindex http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
> SolrIndexer: starting at 2012-09-07 09:05:21
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> org.apache.solr.common.SolrException: Not Found
> 
> Not Found
> 
> request: http://127.0.0.1:8080/solr/core2/update

This is no Nutch error, there simply is no Solr running there (404), or a badly configured one.
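
[Editor's note] A quick way to confirm this diagnosis is to ask the core's ping handler directly and look at the status code; anything but 200 from /admin/ping means the URL does not point at a live core. A hedged Python sketch (the /admin/ping path assumes a stock Solr 3.x solrconfig):

```python
# Ping a Solr core and return the raw HTTP status: 200 for a live core,
# 404 when nothing is deployed at that path, None when no servlet
# container answers at all (connection refused).
import urllib.error
import urllib.request

def ping_core(core_url, timeout=5):
    try:
        with urllib.request.urlopen(core_url.rstrip("/") + "/admin/ping",
                                    timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return None
```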


RE: Errors when indexing to Solr

Posted by "Fournier, Danny G" <Da...@dfo-mpo.gc.ca>.
I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
error:

[root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-07 08:41:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
Generator: starting at 2012-09-07 08:41:21
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120907084129
Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-09-07 08:41:36
Fetcher: segment: crawl/segments/20120907084129
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I then tried to crawl with 1.5.1 (which was successful) and INDEX with
1.6-SNAPSHOT. I got this error:

[root@w7sp1-x64 nutch]# bin/nutch solrindex http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
SolrIndexer: starting at 2012-09-07 09:05:21
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://127.0.0.1:8080/solr/core2/update
