Posted to user@nutch.apache.org by "John R. Brinkema" <br...@teo.uscourts.gov> on 2011/08/01 20:45:56 UTC

Nutch-1.3 + Solr 3.3.0 = fail

Friends,

I am having the worst time getting nutch and solr to play together nicely.

I downloaded and installed the current binaries for both nutch and 
solr.  I edited the nutch-site.xml file to include:

<property>
<name>http.agent.name</name>
<value>Solr/Nutch Search</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|tika)|
index-basic|query-(basic|stemmer|site|url)|summary-basic|scoring-opic|
urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
</property>
<property>
<name>searcher.dir</name>
<value>/opt/SolrSearch</value>
</property>
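The plugin.includes value above is treated by Nutch as a regular expression matched against plugin ids; only ids the pattern admits are loaded. A quick way to sanity-check which plugins a given pattern matches (a sketch in Python, not Nutch's actual plugin loader; the sample ids are illustrative):

```python
import re

# Same alternation as the plugin.includes value above, joined onto one line.
pattern = re.compile(
    r"protocol-http|urlfilter-regex|parse-(text|html|tika)|"
    r"index-basic|query-(basic|stemmer|site|url)|summary-basic|scoring-opic|"
    r"urlnormalizer-(pass|regex|basic)"
)

# Illustrative plugin ids, not Nutch's real plugin registry.
sample_ids = ["protocol-http", "parse-tika", "protocol-ftp", "index-basic"]
enabled = [p for p in sample_ids if pattern.fullmatch(p)]
print(enabled)  # ['protocol-http', 'parse-tika', 'index-basic']
```

Note that protocol-ftp falls through every alternative and is filtered out, which is the intended effect of the property.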


I installed them and tested them according to each of their respective 
tutorials; in other words I believe each is working, separately.  I 
crawled a url and the 'readdb -stats' report shows that I have 
successfully collected some links.  Most of the links are to '.pdf' files.

I followed the instructions to link nutch and solr; i.e., copying the nutch 
schema to become the solr schema.

When I run the bin/nutch solrindex ... command I get the following error:

java.io.IOException: Job failed!

When I look in the log/hadoop.log file I see:

2011-08-01 13:10:00,086 INFO  solr.SolrMappingReader - source: content dest: content
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: site dest: site
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: title dest: title
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: host dest: host
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: segment dest: segment
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: boost dest: boost
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: digest dest: digest
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: url dest: id
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: url dest: url
2011-08-01 13:10:00,537 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.solr.common.SolrException: Document [null] missing required field: id

Document [null] missing required field: id

request: http://localhost:8983/solr/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-08-01 13:10:01,050 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

The same error appears in the solr log.
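The source/dest pairs in the log above come from Nutch's conf/solrindex-mapping.xml, which tells the indexer how to rename Nutch fields into Solr fields; the "source: url dest: id" line shows that Solr's required id field is supposed to be filled from the crawled URL. A sketch of that file, reconstructed from the log lines above (the actual file shipped with a given Nutch release may differ):

```xml
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="site" source="site"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="id" source="url"/>
    <field dest="url" source="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

The "Document [null] missing required field: id" error therefore means documents reached Solr without the id value this mapping was supposed to supply.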

I have tried the 'sync solrj libraries' fix; that is, I copied 
apache-solr-solrj-3.3.0.jar from the solr lib to the nutch lib with no 
effect.  Since I am running binaries, I, of course, did not run ant 
job.  Is that the magic?

Any suggestions?







Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by Markus Jelsma <ma...@openindex.io>.
Solr's schema has its own version, which is 1.4 in the current 3.x branch.

See inline comments:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/example/solr/conf/schema.xml?view=markup
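Concretely, that version lives on the root element of schema.xml; bumping the copy Nutch provides to match what the Solr release expects looks like this (a sketch; the name attribute is illustrative, and the version to use is whatever the example schema for your Solr release declares):

```xml
<!-- first line of the schema.xml used by Solr -->
<schema name="nutch" version="1.4">
  ...
</schema>
```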


Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by "John R. Brinkema" <br...@teo.uscourts.gov>.
Markus,

What do you mean by "update the schema version"?  Nutch's or Solr's?  
And are we talking about simple copies or line-by-line merges?  And what 
about the schema copy specified in the RunningNutchAndSolr tutorial?

This sounds like the answer, I just don't know enough to do it.  tnx.




Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by Markus Jelsma <ma...@openindex.io>.
3.3 will work perfectly as there are no changes to the javabin format. 
However, one should update the schema version to reflect recent changes in 
branch 3.4-dev. It's likely this branch version will be released earlier than Nutch 
1.4, which should be compatible with the most recent stable Solr release.


Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by Way Cool <wa...@gmail.com>.
Glad it worked for you on Solr 3.2. I did try Nutch 1.3 and Solr 3.3;
however, I have not yet updated my blog for Solr 3.3. ;-)

have fun!


Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by "John R. Brinkema" <br...@teo.uscourts.gov>.
Update from the trenches ....

I followed Way Cool's suggestion (now called  Dr. Cool since he has been 
so helpful) of using Nutch 1.3 and Solr 3.2 ... which worked just fine.

I am off using this pair until I get a breather, and then I will try Nutch 1.3 
and Solr 3.3 again, this time with Dr. Cool's latest suggestion.
Thanks to all.  /jb


Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by Way Cool <wa...@gmail.com>.
Try changing uniqueKey from id to url in schema.xml, as below, and
restart Solr:
<uniqueKey>url</uniqueKey>

If that still does not work, it means you have an empty url. We can
fix that.


On Mon, Aug 1, 2011 at 12:45 PM, John R. Brinkema <brinkema@teo.uscourts.gov
> wrote:

> [quoted original message trimmed]


Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by lewis john mcgibbney <le...@gmail.com>.
To add to the comments already posted, there is no need to include your
searcher.dir property; it has been deprecated since the architecture
changes made in Nutch 1.3 and later.

On Mon, Aug 1, 2011 at 7:45 PM, John R. Brinkema
<br...@teo.uscourts.gov>wrote:

> [quoted original message trimmed]



-- 
*Lewis*

Re: Nutch-1.3 + Solr 3.3.0 = fail

Posted by Way Cool <wa...@gmail.com>.
Did you restart Solr after you copied the schema.xml from nutch to Solr?

If you want, you can look at the tutorial I put together, as I did not
use Hadoop. Here are the URLs:
http://thetechietutorials.blogspot.com/2011/06/solr-and-nutch-integration.html
http://thetechietutorials.blogspot.com/2011/06/setup-apache-nutch-13-to-crawl-web.html

If you want to set up Solr so that you can change how the Solr browse
interface displays Nutch data, you can look at:
http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html
http://thetechietutorials.blogspot.com/2011/07/customized-solr-browser-interface-for.html

Please let me know if it did not work.

Have fun.

On Mon, Aug 1, 2011 at 1:07 PM, Jerry E. Craig, Jr.
<jc...@inforeverse.com>wrote:

> [quoted reply and original message trimmed]


RE: Nutch-1.3 + Solr 3.3.0 = fail

Posted by "Jerry E. Craig, Jr." <jc...@inforeverse.com>.
What are you using for your crawl command line?  I remember trying to get mine to work, and there was a line that wasn't very clear in the tutorial.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

where you have to include the -solr location for it to index the files.  If they are working separately, then I would guess the problem is somewhere in the connection; that was my problem.
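For comparison, the two-step route the original post attempted (crawl first, then index with solrindex) would look roughly like this. This is a sketch: the directory names and the exact Nutch 1.3 argument order are assumptions, so check the usage output of bin/nutch solrindex before relying on it.

```shell
# One-step: crawl and index into Solr in a single command.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

# Two-step alternative: crawl first (no -solr option) ...
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

# ... then push the crawled data into Solr; the Solr URL comes first.
# (Paths assume the crawl output landed in ./crawl.)
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
```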

Jerry E. Craig, Jr.


-----Original Message-----
From: John R. Brinkema [mailto:brinkema@teo.uscourts.gov] 
Sent: Monday, August 01, 2011 11:46 AM
To: user@nutch.apache.org
Subject: Nutch-1.3 + Solr 3.3.0 = fail

[original message trimmed]